Genome-Wide Association Studies – Karen Mohlke (2012)

Dr. Andy Baxevanis: Okay, good morning everyone, and thank you for joining us on this absolutely beautiful day here on the Bethesda campus. Our lecture today is devoted to genome-wide association studies. And as you all know, these kinds of studies really help us separate genetic variations that are biologically insignificant from those that do produce some sort of change that might ultimately be detrimental or advantageous to a particular individual. The study of these variations is also critical to identifying which genes are responsible for a particular genetic or genomic disorder, as you heard about during last week's lecture by Lynn Jorde. There's also a much more practical reason to study these genetic variations, particularly the single nucleotide polymorphisms, or SNPs, that give rise to all of those subtle differences between each and every one of us in this hall, since a very thorough understanding of these variations might provide a way for us to know in advance how well someone will respond to a particular drug or to a particular treatment regimen. And we'll hear much more about the pharmacogenomic implications of having this kind of knowledge in next week's lecture in this hall by Howard McLeod. This week I am very pleased to introduce to you Dr. Karen Mohlke, who will be presenting today's lecture on genome-wide association studies. Dr. 
Mohlke is an NHGRI alumna, having done her postdoctoral work in Francis Collins' lab, where she used genome-wide approaches to localize diabetes susceptibility genes. She is currently an associate professor in the Department of Genetics at the University of North Carolina, a member of the Carolina Center for Genome Sciences, and a member of the Lineberger Comprehensive Cancer Center at UNC. Her lab studies complex traits with complex inheritance patterns using many of the approaches that she will be describing to you today to study conditions such as type 2 diabetes and obesity. As always, it's a pleasure to have you here with us today, Karen. And so please join me in welcoming Dr. Karen Mohlke back to the NIH campus. [applause] Dr. Karen Mohlke: All right, thank you very much. It's always a pleasure to be here. So, as Andy said, I'm going to be talking today about genome-wide association studies, and these are especially relevant for complex traits. And I have no relevant financial relationships to disclose. So, complex traits are traits that have both genetic and environmental contributions. There may be many genetic factors, many environmental factors, and these factors may interact; that is, there's not necessarily a single gene responsible for these traits, and some of the genetic factors have rather subtle effects. Genome-wide association studies are especially good at identifying common genetic factors that may be responsible for common variation in complex traits. And by common factors I mean that when looking at a stretch of DNA sequence, and looking at several copies of this stretch of DNA sequence, of course many of the nucleotides are identical between those sequences, but sometimes there are differences. For example, here is a "T," but in some copies of this sequence there is an "A." That is a relatively common variant: three out of 10 times in that representation it's an "A" allele, so an allele frequency of 30 percent. There are also DNA variants that are less common, or rare. So, for example, later in the sequence there is only one copy of a "G" allele where there may be a hundred, or a thousand, other copies of that "C" allele. When we think about the genetic architecture of genes influencing common complex traits, we can consider the different power of various approaches to identify the underlying genetic variation. We consider the frequency of the variants, with more common variants up here; common is often defined as an allele frequency greater than about 5 percent, moving on down to the very rare alleles, those that might be present in only one person or one family. And we consider the effect of the allele, how strongly that variant acts to cause disease or to increase risk of disease. A very strong effect allele is shown high on the Y axis compared to ones that have a

relatively modest effect, low on this axis. So, genome-wide association studies are especially well suited to identifying common variants implicated in common diseases, in contrast to, say, rare alleles causing Mendelian disease, which were more easily identified using linkage approaches or candidate gene approaches. There have been relatively few examples of high-effect common variants that influence common diseases. And as genomic technologies advance, we're moving from the common variants into the lower frequency variants. Lower frequencies may be from 5 percent down to about half a percent, and as sequencing technologies develop and more individuals are sequenced, we're moving into identifying more of the rare variants that will be found to play a role in both common and Mendelian-type disorders. So today, as we talk about genome-wide association studies, I'm going to talk first about what the goal of these studies is; how these studies are performed; what can be learned from the associated regions that are identified by the studies; and then what the findings tell us about disease. The first genome-wide association studies were done perhaps seven years ago now; many more were done starting three to five years ago, continuing on today. The benefits of doing a genome-wide association study compared to classical approaches such as linkage analysis or candidate gene association studies are that genome-wide association studies are more powerful than linkage to identify common, less penetrant variants, and provide better resolution than linkage, so that the variants identified are closer to the underlying causal genes and/or variants than with linkage analysis approaches. And they can be performed in an unbiased way: there is no need to select candidate genes or know the underlying biology ahead of time, so these studies can discover completely novel pathways involved in a disease or trait that were not previously known. Now why were they only started several years ago? There were requirements to perform a genome-wide association study. We needed to know the catalog of human genetic variants, so the genome had to be sequenced and genetic variants across the genome identified. There was a need for low-cost, accurate methods of genotyping, and technology advances have made this possible, so now hundreds of thousands or millions of variants can be genotyped in a single reaction. We need large studies of people, large numbers of informative samples, and, along the way, efficient statistical design and analysis methods to handle the large number of variants being analyzed. So the goals of a genome-wide association study are to test a large proportion of the common single nucleotide genetic variants for association with a disease, or with variation in a quantitative trait, and to do all this without having to have any prior hypothesis of how the genes may act or what their functions might be. I'll talk through many of the steps in a genome-wide association study, starting with ascertainment and collection of the individuals, the samples; the methods for performing genotyping; steps of quality control using that genotyping data; some of the methods of statistical analysis using these data; and the importance of replication. So as we start thinking about the phenotype that is being studied, this can either be a disease or a quantitative trait. A disease such as type 2 diabetes or prostate cancer, or a quantitative trait: height, cholesterol levels, something that is not discrete but has a continuous distribution of phenotype across individuals. A disease could be rare or common, although the common disorders are perhaps more appropriate for a genome-wide association study. Quantitative traits have the advantage of being easy to measure, things like weight and height.
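The allele-frequency bookkeeping from the earlier example (3 "A" alleles out of 10 copies, and the rough 5 percent cutoff for calling a variant "common") can be sketched in a few lines of Python. This is purely illustrative; the function names and the example copies are my own, not from the lecture.

```python
# Minimal sketch of allele frequency and the conventional ~5% common-variant cutoff.

def allele_frequency(observed_alleles, allele):
    """Fraction of chromosome copies carrying `allele`."""
    return observed_alleles.count(allele) / len(observed_alleles)

def classify(freq):
    # ~5% is the (approximate) conventional common vs. low-frequency boundary
    return "common" if freq > 0.05 else "low-frequency or rare"

# The lecture's example: 3 "A" alleles out of 10 chromosome copies -> 30 percent
copies = ["T", "T", "A", "T", "A", "T", "T", "A", "T", "T"]
freq_a = allele_frequency(copies, "A")
print(freq_a)            # 0.3
print(classify(freq_a))  # common
```

A variant seen once among a thousand copies, like the "G" allele in the rare-variant example, would instead land in the "low-frequency or rare" bin.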

Some of them require careful approaches to measurement to get an accurate measurement. Genome-wide association studies can also be performed using traits such as the gene expression levels of all of the genes across the genome. The accuracy with which a phenotype is assigned is an important step in analysis: the more well defined the phenotype is, the more likely one will be able to identify the genetic variants responsible for it. The more heterogeneous the phenotype, if it's really a mixture of many different causes that create that disease, then those causes will be mixed together and harder to identify. When selecting the individuals on whom to perform the analysis, one strategy is a case-control analysis, meaning ascertaining cases affected with disease and also ascertaining controls who do not have the disease. Another approach would be a population survey: collect many, many individuals across the population and then determine which of those are affected with disease. With a population survey, a smaller proportion of individuals is affected with the disease, but they may be more representative of that disease in the population than if you ascertain cases that are severely affected with disease. Those might be less representative, although they might lead to a greater possibility of identifying the genetic variants responsible. So in a case-control analysis, the approaches used to define the cases are relevant and important to consider when interpreting the results of a case-control association study. Were cases defined by an extreme phenotype? How were they collected? Is there some special subset of phenotype that may be especially enriched in that particular set of cases? Similarly with controls: are the controls selected to be random members of the population that are not yet affected with disease? But then some of them, if it's an adult-onset disorder, perhaps will become affected with the disease next month or next year, making them perhaps less good controls when seeking to have a greater difference between groups; so consideration of these approaches is important for how the results are interpreted. Potential criteria that one could use when selecting cases would be to choose individuals that are more severely affected with the disease; these might be individuals that have a greater genetic load, and so provide a greater opportunity to identify the underlying genetic factors. One could require other family members to have the disease; this is more evidence of a genetic factor being responsible, as opposed to more of an environmental contribution. For an adult-onset disorder, choosing individuals with a younger age of disease onset also could enrich for genetic factors. When considering criteria for selecting controls, one could enrich the genetic effect by choosing individuals with a lower risk of disease rather than population-based samples. It's important to keep the ancestry of the controls and the cases matched as well as possible, and to try to match the controls to cases based on age, sex, and other demographic factors that may influence disease. To show a bit of an example about matched ancestry: suppose the cases are collected from the population but have different underlying ancestries, represented here by the different shadings of the symbols, say solid-filled symbols in these two different categories. If the proportions of those ancestries are differently represented between the cases and the controls, and there are genetic variants that are more common within some of those subsets than others, then those genetic variants may appear to be associated with disease when truly they are associated with being part of that subpopulation. When performing an association study in a set of 
samples that have not previously been analyzed genetically, you may have inadequate ancestry information prior to performing the

genotyping. Ascertaining individuals from a particular area may lead you to assume that the ancestry is similar between individuals. After performing genotyping with hundreds of thousands of markers across the genome, one can look at the frequencies of different alleles and identify subsets of individuals that create subpopulations within the sets of cases and controls. Another word for these subpopulations that I've been talking about is population stratification. The issue is that population stratification can produce false positive association results in case-control studies. In addition, individuals that are cryptically related, that you don't know are related, say cousins whose relationship was not known at the time of collection, can enrich for particular alleles within samples, and that can also create a false positive association. There are ways to account for or avoid stratification and relatedness. One is to perform genomic control. This is a correction that evaluates the average excess association identified and adjusts the results of the association study by this average measure, altering the threshold you use to define what a significant result is. Another approach is to use the allele frequencies of variants across the genome to identify principal components of substructure within the samples, and then include those principal components of substructure as covariates in the analysis to adjust for them. Another approach to avoid population stratification would be to perform a family-based study design, where instead of selecting cases and controls the association analysis is performed within families, considering the relationships between the individuals. Given a set genotyping budget, however, there is reduced power for identifying variants when individuals are related and part of those families. So the genotyping process: genotyping panels are now available with as few as ten thousand SNPs, single nucleotide polymorphisms, and as many as five million SNPs. Two main companies provide a number of fixed-content panels, meaning that the genotyping arrays or chips come with a set of SNPs that are evaluated on them. As for the approaches used to select the SNPs for these panels, some of them are random SNPs; some are selected to be haplotype tag SNPs, and Lynn Jorde talked about this, and I'll show a slide about this as well. Some of the nucleotides chosen to be on these panels are not nucleotides that vary, that have different alleles in the population, but ones for which the intensity of the signal differs because of a copy number variation. And some of the arrays that are now available have fixed content, but the user is allowed to add on an additional 10,000 or 50,000 single nucleotide variants. So if you were to perform a genome-wide association study today you might choose a panel and then say, "Oh, but these particular variants are missing from that panel." Perhaps if you know of some less common or rare variants that are not on the panel, or some particular functional variants, or variants that you think really play a role, those could be added onto the panel. Higher-density SNPs in special regions of interest could also be added onto those arrays. So I talked about selecting haplotype tag SNPs; an example is shown here. In this example there are four copies of a particular chromosome. Again, most of the nucleotides are the same; this is representing three single nucleotide

variants in this region. When combined together with variants that are both upstream and downstream, the variants can be represented as haplotypes. Given the history of human populations and the non-random recombination events that have occurred during human demographic history, there are clusters or sets of SNPs that are inherited together in most members of the population. And so selecting SNPs that are representative of the variation of other SNPs allows a more efficient approach: fewer SNPs genotyped to represent a larger proportion of the variation. So for example, these haplotypes of 20 variants can be represented by choosing just three SNPs within this set, and there are other variants that could be chosen as well. As an example, the "T-C-T-C" variant here could also easily be represented by this variant here, "C-T-C-T," but the set of three variants represents the variation present. This also means that when interpreting the results of an association study, although a single variant might be described or reported in a paper, say as showing strong evidence of association, it's important to remember that there are other variants located nearby that are in linkage disequilibrium with that variant. They're inherited together in the same pattern as that variant, and they would also show similar or identical evidence of association with that trait. So I'll talk through a few of the methods of allelic discrimination that are used in these genome-wide genotyping panels. One of them is the Illumina Infinium assay. In the Illumina assay, DNA is amplified to generate larger amounts of DNA, and then the DNA is captured on oligonucleotides that are bound to bead arrays. An allele-specific extension, or mini-sequencing, assay is then performed. So here is the genomic DNA target hybridizing to sequence that is on an oligo bound to a bead, and a sequencing reaction happens, so that if the allele provided is a perfect match, the polymerase can continue with the sequencing reaction. If there is a mismatch at the end nucleotide, then no continuing sequencing reaction can occur. There are a few different forms of this assay that Illumina provides, the Infinium I assay and the Infinium II assay. In one case there are two different bead types used to represent a single SNP, and one color of detectable label is used. In the other form, a single-base extension reaction happens, so a single bead type is used, and two different colors of detector are involved. So when Illumina describes the number of SNPs that are available on a panel, and the number of, say, custom-designed SNPs that could be added to a panel, they talk about bead types, because some SNPs are assayed well with a single bead type, and some SNPs are assayed better with two bead types. Okay, Affymetrix has a genotyping platform called the GeneChip Array. In this strategy, the genomic complexity of the DNA is reduced by performing restriction enzyme digestion and size selection of the fragments; adaptors are added, then amplification steps, fragmentation, and end labeling, and the allelic discrimination happens based on hybridization of one allele to sets of oligos on the array. In their GeneChip Probe Array there are millions of copies of a specific oligo probe bound, so in a given region here are DNA probes in sort of one part of the

array, and there are multiple copies of this same sequence with the same variant allele present. A given SNP can be represented by many different probes. Say the variant allele is in the center of an oligonucleotide; there could be as many as four different sequences represented on the probes, representing all four possible alleles that could be bound there. And then the variant could be offset by a nucleotide or two, not precisely in the middle but moved over, or the probe could be a little bit longer or a little bit shorter. With time, the choice of which probes are the most efficient at discriminating between the two alleles improves, and that's what allows Affymetrix to add on additional variants, to fit more variants onto an array, and allows the discrimination to be optimized for given variants. Affymetrix also has a newer platform, the Axiom Array. In this case, the DNA is amplified and fragmented enzymatically into, say, 25- to 125-base-pair fragments, and then the fragmented amplicons are loaded onto the array to hybridize to oligos. Afterward, a solution of labeled random 9-mer oligos is hybridized to the array, such that if the alleles match, a ligation reaction can be performed. So the allele discrimination is based on ligation, which requires the alleles of the adjacent nucleotides to be matched and to hybridize well, and that provides a little bit better allelic discrimination than, say, hybridization alone would provide. Then the labels that are present are stained and imaged. So here is a representation of what the coverage of common variants is for a set of arrays that are available; these are some of the older arrays. Coverage is calculated by looking at some defined set of common variants, and when you interpret what the coverage of a particular array is, you want to consider what that set of variants is. Often HapMap variants will be the defined set, or 1000 Genomes variants. The more sequencing that happens, the more variants are identified, so knowing what that reference set is, is valuable; then the linkage disequilibrium between a given variant and the other variants present in that set is used to estimate the coverage for the given chips. And the coverage is going to differ based on the population of the individuals being assayed, because allele frequencies differ and linkage disequilibrium relationships differ between populations. Some of the newer arrays that have more variants present on them do a better job, have higher coverage of common variants, than some of the older arrays. Now, the most recent generation of SNP arrays that are available is improving coverage of the lower frequency variants. The initial arrays were covering variants with frequencies of 5 percent and greater; now the frequencies covered are moving down into the less common ranges. So here is a slide from Illumina. One of the newer arrays that they have available is specifically designed for the Chinese population, so this particular chip was designed to select variants based on individuals of Chinese ancestry. And they show the coverage, on the Y axis here, of variants with an allele frequency greater than 5 percent for this particular array compared to one of

their other genome-wide association arrays. So here is a more general array, and this is the one that is designed to be specific for the Chinese population, and you can see that the coverage of the less frequent variants, those with a minor allele frequency greater than 2.5 percent, also increases with this specific chip. To be fair, here is also a slide showing one example of an array from Affymetrix, and they too, in their latest arrays, show that they have good coverage of the common variants; they're also trying to have improved coverage of the less common variants in that little bit lower frequency range, the 2 to 5 percent allele frequency range. Okay, so genotyping of samples, cases and controls, members of a population, is performed, and the genotyping data come back. There are a number of quality control steps that are important to do in a genome-wide association study prior to performing the association analysis. One is to look for and detect poor quality samples: samples that had a success rate less than some level, maybe 95 percent of SNPs successful. The more SNPs that fail, the more the SNPs that succeed are called into question as perhaps generating inaccurate genotypes. If most of the samples are working very, very well and some of them are not, then it could be that heterozygotes are being miscalled as homozygotes for particular alleles, and so identifying and excluding poor quality samples is valuable. An excess of heterozygous genotypes might suggest that a DNA sample is really a mixture of two DNA samples. One can use the genotype data to evaluate whether any sample switches have happened in the process from when the sample was collected from the individual: say a tube of blood was collected, it was processed into DNA, it probably changed hands many times, it was moved from a tube onto a plate, and the plate was then genotyped, and in that whole process sample switches can happen. One way to identify whether that has happened is to look at the sex of the individual based on markers on the X and Y chromosomes, and evaluate whether it matches the sex expected for that individual. If DNA samples are around a lab for a while, then particular genotypes known from one set of genotyping reactions can be compared to those done with another assay at another time point, to see whether any sample switches have happened in the intervening time. One can also use the genetic data to look for unexpectedly related individuals. Again, when analyzing a cohort or population sample for the first time, one can use pair-wise comparisons of genotype similarity and look for, say, unexpected duplicates, which might turn out to be monozygotic twins, or people who participated in the sample collection more than once with different identifiers. And you can use the allele frequencies of variants across the genome to look for individuals who have ancestry that may be a little bit different from the rest of the sample, and then either exclude them or account for those differences when performing the later analysis. In addition to looking for poor quality samples, one can look for poor quality SNPs. Shown here are a few examples of raw genotyping data from a set of individuals. Over on the left, the X axis is the signal intensity of one allele; we'll call it the A allele. The Y axis is the signal intensity of the other allele; let's call it the C allele. This is a lovely looking marker: the allele intensity is very high on the A axis and relatively low on the C axis for these samples, so these would be the AA homozygotes; these similarly are very high on the C allele axis, so these would be the CC genotypes; and these would be the heterozygotes. It's an ideal genotyping plot. When doing hundreds of thousands and millions of markers, software is used to assign the

genotypes to various clusters. Occasionally the software might not detect that these two clusters are distinct; it might call them together as heterozygotes, erroneously assigning heterozygous genotypes to these individuals. One tries to look for cases when that happens and fix them, or exclude those markers. Some assays for given SNPs don't work all that well, and the discrimination is not clean between the clusters, and so the individuals that fall especially close between two clusters may be more likely to be miscalled with an incorrect genotype. Those genotypes can either be excluded, or it's at least helpful to recognize the marker, and perhaps exclude the entire marker, to avoid having errors in the data that might lead to false positive or false negative associations. That often happens at the genotyping level; the individuals performing the genotyping analysis are the ones looking at that raw data and evaluating some of those characteristics. One can also detect SNPs that are of poor quality by looking for a genotyping success rate less than 95 percent. So now this is a SNP that worked in less than 95 percent of the samples. It's a somewhat arbitrary threshold but a commonly used one; failing it might suggest that there is some problem with the assay, that perhaps it's not discriminating well between the clusters, and perhaps the genotypes that remain are inaccurate, and therefore excluding the marker would be more prudent. Often these analyses are done with a small percentage of samples duplicated, present twice within the set of samples being genotyped. Then the genotypes from those duplicate samples can be compared, and finding mismatches or discrepancies between those identical samples is a bad characteristic for a SNP; one would want to exclude those particular markers. One can also do a test for Hardy-Weinberg equilibrium, looking for cases where the expected genotype frequencies are not consistent with the observed allele frequencies. This also suggests that the marker perhaps has a problem, that perhaps heterozygotes are more often being called homozygotes incorrectly, and so statistical tests can be used to identify that kind of an error. If there are related individuals within the samples, such as a mom, dad, and child, trios, then one can look for Mendelian inheritance of alleles from the parents to the child. Some groups will add additional quality control samples to their sets of samples to allow this kind of SNP error to be detected. It's also important that if, say, a set of cases is going to be compared to a set of controls, the genotyping be done as similarly as possible between those two groups. If the cases are genotyped entirely separately from the controls, then it's possible that there is different allele missingness, or different accuracy in the calls, between the cases and the controls, and this can lead to false positive associations. So it's important to intermingle the cases and controls as much as possible to account for any differences in plates or arrays or any of the technical steps in the genotyping, and to detect any potential errors. Okay, so once the genotype data are cleaned, meaning that poor quality samples and poor quality SNPs have been removed, then one can go test for association. In a case-control study, we're now looking for differences between the cases and controls in terms of their allele frequencies or genotype frequencies. So, for example, one could perform a test for trend, looking at the counts of individuals with different genotypes within the cases and controls. It's valuable, if there are covariates that are also associated with disease, so if the disease prevalence increases with age or if it's more common in males than females, then covariates representing all these factors should be included in the analysis to account

for them to improve the opportunity for the genetic variants’ contribution to disease risk, or the quantitative trait to be identified Often tests are done looking for an additive effect of the alleles on the trait, meaning that having one allele has an effect and having two alleles has more of an affect Other tests can be done looking for evidence of dominant or recessive models or are — however, the additional number of tests performed in doing an analysis like this would need to be considered when deciding what the threshold of significance of the overall results of the end are So, for example, in a case control study when looking, when looking for the effect of an allele on risk of developing disease one could calculate an odds ratio So if these are counts of individuals, cases and controls that have counts of the alleles A and C represented in those individuals, then one can calculate an odds ratio as the odds of having a C allele given case status over the odds of having a C allele given the control status, and this would form an odds ratio And so a value that is greater than one shows increased risk of disease for that particular allele And an odds ratios that is significantly less than one is evidence of decreased risk of disease When performing association analysis on a genome-wide scale, many, many tests are done So if 300,000 to five million SNPs are being analyzed, then one would want to correct for that number of multiple tests when defining what a significant result is, and what a sort of spurious chance result could be One approach for doing this is to take a commonly used threshold of significance, say 5 percent So one in 20 times you might see a result, a difference between cases and controls that is at this level of significance, and divide that by the number of statistical tests being performed So, a commonly-used threshold assumes that the number of common variants being tested across the population, this was designed based on a Caucasian 
population, was approximately a million tests. So taking a P value threshold of .05 and dividing it by a million creates a new threshold of 5 times 10 to the minus 8. This is a commonly used threshold for declaring that a particular result is significant and not likely to have occurred by chance. Achieving a threshold like this requires either a large effect of that particular variant or a large sample size to detect a more modest effect. Question? Male Speaker: Just a quick question. Is there any preference as to which multiple testing procedure is used in GWAS studies, whether it's [inaudible] or Benjamini-Hochberg, or? Dr. Karen Mohlke: So, the question is, are there different strategies one could use, a false discovery rate as opposed to this Bonferroni correction for multiple tests. Different approaches are used. I would say that declaring a threshold of 5 times 10 to the minus 8 is very commonly used within the literature, although people will argue whether that is an appropriate threshold. And often there are signals that do not reach that threshold due to limited power, and when sample size increases in the next round of study those variants become significant, so it is a valuable thing to consider. So I show here an example of what results would look like from an association test. This is from an early test for type 2 diabetes association, comparing not quite 1,200 type 2 diabetes cases to not quite 1,200 normal glucose tolerant controls. This is work of the FUSION study. The results shown here are for the genome with the chromosomes lined up end to end, chromosome 1 on the left, all the way down to chromosome 22 and then the X chromosome.

With each dot representing a single nucleotide variant that was tested for association. This analysis was done using logistic regression with an additive model, adjusting for age, sex, and birth province within Finland to account for potential stratification. On the Y axis is the minus log 10 of the P value, so a P value threshold of .05 would be about there. You can see that when doing this many tests, that is not an appropriate threshold for defining what is significant; there are many, many variants that have a P value smaller than that threshold. The threshold accounting for the number of tests done here would be in the 10 to the minus 7 or 10 to the minus 8 range. You'll notice that the maximum scale here is six, so none of the results from this initial study reached that threshold of genome-wide significance. That makes it difficult to figure out which variants might represent true positives. At the time that this study was done, before genome-wide association studies were widely available, there were three loci that had a well-established role in genetic contribution to type 2 diabetes. And so we looked for the location of those variants within this data. One of them was at the TCF7L2 locus, and it was gratifying to see that variants at that locus were present within the top 10 SNPs of this association analysis. That suggested that it would be possible to identify genetic factors. Another of the variants was at the PPAR gamma locus, maybe in the top 300 variants, and another variant with an established role was around 3,000th on the list of 300,000 variants analyzed. One way to evaluate whether there is an excess of significant results at a given threshold is to plot the P values that result from the test of association against the P values from a uniform distribution. Shown here on the X axis is minus log 10 of a uniform distribution, and on the Y axis minus log 10 of the P value from the test of association. There is a black line showing the expected values right along the edge here, and the blue dots represent the data that I just showed you. You can see that there is a slight movement off of this line, but the data very much fall along the line. This is good from the perspective that there is no excess of associations that might represent population stratification or some sort of excess relatedness within the individuals, but it's bad from the perspective that there are no variants showing strongly significant excess evidence of association in the true analysis compared to the uniform distribution. If one were doing an association analysis in a population that had evidence of substructure or stratification, then a similar plot might show that the variants in these dark blue dots have excess significance all the way through the scale. If the population stratification is adjusted for, then the P values that result from the association test are more in line with that expected distribution. And so correcting for population stratification can reduce the excess associations that are false positives, not due to true genetic signals. So, after performing an association analysis and doing all that work without identifying significant results, a frequent next step is to try to gain statistical power by increasing sample size. Larger sample sizes will have a greater possibility of identifying genetic factors that have a

more modest effect. The common way this is performed is that each group does their own genome-wide association analysis, and then the data from several studies are combined by performing a meta-analysis of the results for each genetic variant. Now, there are potential issues for performing a meta-analysis across studies: one is that different genotyping platforms may be used, and different analysis strategies might have been used in the beginning; also, the definition of cases and controls may differ. So there is some heterogeneity introduced by the fact that different studies are performed in different ways. Generally the strategy that has been applied is that larger sample size is more valuable and more powerful in the face of these differences in sample collection, so results need to be considered with some caution about what heterogeneity might underlie them, but generally larger sample sizes are identifying additional variants. To address the different genotyping platforms that may be used by different groups, several strategies for imputing, or predicting, the missing genetic variants between platforms have been developed. In imputation, one might have in a study sample genetic variants typed at, say, a position here, a position here, and a position here, but the other genetic variants in the intervening regions were not typed; they were not selected for that genotyping platform. The study samples can be compared to a dense set of genotypes. HapMap is a commonly used set of variants; these are samples that were chosen to try to be representative of particular populations and that were analyzed at a much denser set of genetic variants. More recently, the 1000 Genomes Project has generated an even denser set of variants, and so one could take the genotyping data from a particular study, impute the variants from the 1000 Genomes Project, and fill in many more of the genetic variants. So instead of analyzing, say, 500,000 variants that were genotyped on the array, one could analyze 2.5 million variants present on some of these reference panels. The strategy for doing imputation is that a probabilistic search is performed for mosaics of reference chromosomes that match each individual. For example, the top chromosome from this individual is represented by this haplotype within the reference panel. The lower chromosome of this study individual is best represented by a mosaic of one portion of a chromosome and another portion someplace else, suggesting that a recombination event between these two haplotypes occurred sometime in the past. Then the genotypes can be filled in from those phased chromosomes. There are several different approaches to performing imputation, and often the analysis provides some estimate of the likelihood that filling in a genotype was correct, so thresholds for quality can be used. If a variant is part of a haplotype that has been seen many, many times with exactly that same set of variants, one might have a lot of confidence filling in the intervening genotypes, whereas if it's a region with lots of recombination and it's unclear exactly which haplotypes match best, then the filled-in genotypes may have less accuracy and are less likely to be correct.
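The mosaic-matching idea behind imputation can be sketched in a few lines. This is a deliberately minimal illustration, not how production tools such as IMPUTE2 or minimac work (those use probabilistic hidden Markov models over the whole panel); the reference haplotypes and positions below are invented for the example.

```python
# Toy illustration of haplotype-based imputation: match a sparsely typed
# study haplotype against a dense reference panel, then fill in the
# untyped positions from the best-matching reference haplotype.
# (All haplotypes here are made up for illustration.)

REFERENCE = [  # dense reference haplotypes, e.g. from HapMap / 1000 Genomes
    "ACGTACGT",
    "ACGAACCT",
    "TCGTACGA",
]

def impute(typed):
    """typed: dict position -> allele for the sparsely genotyped haplotype."""
    # Score each reference haplotype by how many typed positions it matches.
    def score(hap):
        return sum(hap[pos] == allele for pos, allele in typed.items())
    best = max(REFERENCE, key=score)
    # Fill untyped positions from the best match; keep typed alleles as observed.
    return "".join(typed.get(i, best[i]) for i in range(len(best)))

# Study sample typed only at positions 0, 3 and 7
study = {0: "T", 3: "T", 7: "A"}
print(impute(study))  # best match is the third reference haplotype
```

A real method searches for a mosaic of reference segments per chromosome (allowing switches at recombination points) and reports a posterior probability per filled-in genotype, which is what the quality thresholds mentioned above are applied to.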

And so analysis can be performed choosing a threshold and not including genotypes that are imputed with a low likelihood of accuracy. The advantage of doing imputation is that it allows studies done on many different genotyping platforms to be combined together. Here is an example: one of the arrays genotyped these particular markers, whereas a different array genotyped these particular markers, and when both sets of data were used to impute markers from the HapMap project, the markers shown in blue were able to be analyzed in both studies. So while the overlap in directly-genotyped markers shared between one platform and the other was relatively small, the total number of markers that could be analyzed was a much larger set. Imputation doesn't require that the variants be perfectly in linkage disequilibrium with the variants that are tested; it's a haplotype-based approach, and so it's possible to identify variants that have a different frequency than the variants that were typed. So there are examples, at least in the early stages, where variants were identified to show association only when imputation was done, where none of the markers on the genotyping panel themselves showed association. In this particular plot, this is a zoomed-in region of a portion of chromosome 9, with some genes shown below, and the minus log 10 P value for LDL cholesterol levels shown on the Y axis. The dots shown in red are the markers that were directly genotyped on the particular genotyping array, and the dots shown in blue are the ones that were imputed, using the genotypes from the Affy array to impute the variants present in the HapMap sample. You can see that none of the red dots showed strong evidence of association in this region; however, at least one of the blue dots gets up into a more significant P value, showing evidence of association. This is the low-density lipoprotein receptor locus, associated with LDL cholesterol, a result that was known prior to this kind of analysis, but it goes to show that imputing can identify variants that were not present on the genotyping panel. Here is an example of the structure of a meta-analysis, where seven different groups got together. Each one performed their own genome-wide association analysis using a shared analysis plan for what method, model, and covariates to use. Then a meta-analysis of those seven studies was performed, and the top SNPs, the most strongly associated SNPs from that study or representative ones of those results, were selected to follow up in additional samples. Some cohorts have genome-wide genotypes available; some do not, but are able to genotype, say, 50 SNPs to follow up results. In this particular example, around 40 to 60 SNPs were selected, and different groups in these replication cohorts genotyped those variants separately using a different genotyping platform. Then the data from those replication cohorts were analyzed to determine which of the initial variants showed significant evidence of association. In this particular example the genome-wide association analysis was done in around 20,000 individuals, and then some of the top variants were followed up in around 20,000 individuals. The results of that particular analysis are shown here. There are three genome-wide association plots, because there were three phenotypes analyzed with that set of data: LDL cholesterol, HDL cholesterol, and triglyceride levels. These phenotypes were measured in the same people, so once the genotype data is available, then looking

at the range of all phenotypes present is relatively quick. So shown here are three genome-wide association plots and three of these quantile-quantile plots. Let me zoom in and show a portion of one of these. Here is a portion of the genome-wide association plot. These are often called "Manhattan plots" because the tall buildings show up out of the background of shorter buildings. This was not the first round of genome-wide association studies for these traits, but a later round. The results are shown on this q-q plot here. The grey line represents the expectation if none of the variants show significant association, and this is shown now with a 95 percent confidence interval on that line. Black represents the set of all variants identified for this particular trait, LDL. When removing the variants that were known previously, the blue symbols represent the data being reported in this particular study, and they still showed an excess of significant results; there are still novel signals, evidence of association being identified. If we remove the effects of those variants, you can see that there is still a little bit of excess association present, but none of the variants in particular reached the genome-wide significance level. So, meta-analysis is useful, and follow-up and replication of initial association results, especially ones that don't reach genome-wide significance levels yet, can allow for increased power and increased opportunity to identify novel signals associated with a disease or a trait. When performing meta-analysis, however, one has to be concerned about heterogeneity between the studies. One example to demonstrate this: when The Wellcome Trust Case Control Consortium performed a genome-wide association study of type 2 diabetes, they showed strong evidence of association of variants at the FTO locus with type 2 diabetes. However, a couple of other studies that were doing association analysis of type 2 diabetes at the same time didn't really see evidence of association with FTO at all. It turns out that the Wellcome Trust cases were more obese than the controls in that study, whereas in the other diabetes studies, the case-control selection had been more balanced with respect to body mass index, body size. The identification of this source of heterogeneity between the studies led to the identification of FTO as a gene that plays a strong role in obesity. Some of that data is shown here. This is a plot showing odds ratios and 95 percent confidence intervals of the odds ratio. The X axis is the odds ratio; 1.0 would mean that there's no increased or decreased risk for a given variant. Here is the A allele of this marker representing the FTO locus. The initial set of Wellcome Trust type 2 diabetes cases showed strong odds for obesity, and here are the controls that were used in that analysis. You can see the effect on obesity is larger in these type 2 diabetes cases than in those type 2 diabetes controls; that's why it looked like evidence of association with type 2 diabetes at first. When they went and collected other sets of cases, other sets of controls, and then, valuably, samples that were from population-based collections, so not ascertained on disease status, and evaluated the effect of this particular allele, you can see that it consistently shows an increased risk of obesity. This odds ratio is 1.3, and the confidence interval around it is quite narrow because

it's a very large sample size, showing that this was the definitive evidence that these variants are associated with obesity. Okay, so genome-wide association studies have been performed now for at least 237 traits. These are results cataloged by the NHGRI in a catalog of genome-wide association studies. The slide shows the various chromosomes, with colored dots representing the positions of some of these loci, and at the most recent summary here there are about 1,449 published genome-wide association signals with P values less than 5 times 10 to the minus 8, representing 237 traits. So many genome-wide association studies have been performed, and many, many loci have been identified where genetic factors are associated with a trait or disease. As would be expected, more loci are found with larger sample sizes. In this recent review, a number of different results are summarized, with the number of cases shown here on the X axis at 1,000, 10,000, and 100,000, and the number of genome-wide association hits or signals represented on the Y axis at 1, 10, and 100, with different symbols representing different studies performed with different sample sizes. Here is a subset of case-control studies that were done for Crohn's disease. You can see that generally the larger the sample size and the larger the number of cases, the more genome-wide association hits are identified, showing that many signals exist, that the effects for many of them are relatively modest, and that large sample sizes are needed to identify them. So let's look at some examples of the types of results that are identified in genome-wide association studies. I'm going to look at a few plots of particular loci, zooming in on two particular regions of the genome. Here is a portion of chromosome 19, with about 400 kilobases shown on the X axis, each of these representing genes in this gene-dense region, and the P value from the test of association over here has its strongest signal here, with a P value better than 1 times 10 to the minus 25. This is replicating a known association, one that's been known for a very long time, of a variant at the APOE locus associated with LDL cholesterol levels. Now, this is not the variant itself that has been shown most strongly to play a functional role at the locus, but it's inherited in a similar pattern. This example also lets me highlight that, in this particular case, this variant is close enough to a known gene that this gene might be the one highlighted in a report of a genome-wide association study. However, if this were a novel signal, then the decision about what gene label to use in a report might be a little bit arbitrary, might be driven by what the biology of the underlying genes might be. It's important to know, when reading a paper on a genome-wide association study, that the gene label assigned is often just the nearest gene to the SNP that happens to be the top signal, and might not be a gene that is contributing to variation at that locus. Also, even though a single gene might be provided in that label, there could be genetic variants affecting more than one gene at a given locus; there could be multiple true causal underlying variants, and they could be affecting different genes at that locus. So, interpret with caution.
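The "nearest gene" labeling convention just described amounts to a simple distance calculation. Here is a toy sketch of that convention, with made-up gene names and coordinates; it exists only to make the caveat concrete: the label says nothing about which gene is actually causal.

```python
# Toy illustration of how a GWAS "gene label" is often assigned: simply the
# gene nearest to the top SNP, which is not necessarily the causal gene.
# Gene names and coordinates below are invented for illustration.

GENES = {             # gene -> (start, end) in base pairs on one chromosome
    "GENE_A": (10_000, 25_000),
    "GENE_B": (40_000, 55_000),
    "GENE_C": (90_000, 120_000),
}

def distance(pos, interval):
    start, end = interval
    if start <= pos <= end:
        return 0            # SNP falls inside the gene
    return min(abs(pos - start), abs(pos - end))

def nearest_gene(snp_pos):
    return min(GENES, key=lambda g: distance(snp_pos, GENES[g]))

print(nearest_gene(60_000))  # labeled GENE_B by proximity alone,
                             # even if a regulatory variant acts on GENE_C
```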

Okay, so some signals that are identified can be novel signals. In this particular case, the strongest evidence of association was found within an intron of a gene, meaning that, shown down here, these little tiny boxes represent exons, and all of the variants that show the strongest evidence of association are localized within an intron. So perhaps underlying causal variants are not shown on the plot but are in linkage disequilibrium with variants in the plot and could be playing a role in the protein sequence, or perhaps underlying variants are influencing gene expression of this gene or of some other gene nearby. Some novel signals are found at a distance from known protein-coding genes. These are identifying possible novel biology or possible novel mechanisms. Variants that are found at a distance from protein-coding genes are perhaps affecting other sequences in the genome, RNA sequences, non-protein-coding genes that may be present; not all of these are annotated in the genome yet. Or there could be regulatory effects, say, as enhancers or repressors of transcription of genes that are hundreds of kilobases away. More and more, multiple signals of association are identified in a given region. This makes sense with what's known about genetic variation and allelic heterogeneity for Mendelian disorders: there's more than one way to influence a gene; there's more than one way to alter a gene. So there's often more than one common variant or signal that can play a role in association at a given locus. Shown here are two separate plots; really it is the same data shown twice, but colored based on the relationship of the variants to one another. There are really two signals here: one that's localized quite close to the promoter of this particular gene, and another signal that is independently inherited from that one and is located tens of kilobases upstream of this particular locus. One way to look for independent signals is to include a given single nucleotide polymorphism in a regression analysis, to adjust away the effect of one variant and then see what the results for the other variants in the region are. In this particular case, each dot here is representing the evidence of association with the trait. At this locus the signals are independent: if one were to perform this test and include one of these variants in the test of association, the evidence of association for any of these other variants would essentially go away and show no evidence of association; however, the association of these variants remains unaffected by that other signal. So this is really strong evidence of independent signals influencing association. Now, there may be more variants that are not necessarily independent of each other; there could be two causal functional variants that share some haplotypes but not all haplotypes with each other. And so, when going into the functional biology, trying to figure out what the mechanisms are, what the underlying variants are, it's not just the independent signals but the multiple signals that might be present that might help indicate how these DNA variants are leading to changes in gene expression or function, leading to disease. Here is evidence of association showing that you can obtain different results in different populations, and that populations that are older, that have more evidence of recombination events and narrower regions of linkage disequilibrium, can provide greater resolution: they can show a narrower region of association than other populations.
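The conditional-analysis idea just described, adjusting away one variant and re-testing the others, can be sketched with simulated data. This is a minimal illustration with invented genotypes and a simulated trait, not any study's data; real analyses would use a regression package (and, for case-control traits, logistic regression) rather than this hand-rolled least squares.

```python
# Toy sketch of conditional analysis: re-test one SNP after adjusting for
# another by regressing it out. If the second SNP's effect disappears after
# conditioning, the two "signals" are really one shared signal; if it
# remains, they are independent signals. All data simulated.
import random

random.seed(0)
n = 5000
snp1 = [random.choice([0, 1, 2]) for _ in range(n)]          # allele counts
# snp2 is in strong LD with snp1: it usually copies snp1's allele count
snp2 = [a if random.random() < 0.9 else random.choice([0, 1, 2]) for a in snp1]
trait = [0.5 * a + random.gauss(0, 1) for a in snp1]         # only snp1 causal

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

marginal = slope(snp2, trait)            # snp2 alone picks up snp1's signal
# Condition on snp1: remove snp1's contribution from both snp2 and the trait,
# then test what effect is left (the Frisch-Waugh residualization trick).
b21, byt = slope(snp1, snp2), slope(snp1, trait)
snp2_resid = [b - b21 * a for a, b in zip(snp1, snp2)]
trait_resid = [t - byt * a for a, t in zip(snp1, trait)]
conditional = slope(snp2_resid, trait_resid)  # near zero: same signal
print(round(marginal, 2), round(conditional, 2))
```

In the independent-signals case shown on the slide, the conditional effect would stay essentially unchanged instead of collapsing toward zero.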

So shown here is some evidence of association with height for a set of variants across a region, and shown below are pair-wise linkage disequilibrium plots for sets of variants in this region from the CEU HapMap sample and the YRI HapMap sample. You can see that this evidence of association, which is from European ancestry samples, shows evidence of association across this region, and that there's a relatively wide linkage disequilibrium block in this region. Whereas in the YRI samples, there are narrower sets: these variants are more inherited together, and these are more inherited together, but these and these show less association with each other. The signal from the Caucasian sample was quite broad; the signal in African-American individuals was strong in this region but not in this one, suggesting that the more likely location of a potentially functional underlying variant was restricted to this region and not that one. In this particular case, the variant that showed the stronger association in African-Americans was also one that had been shown previously to have an effect on gene expression of one of the nearby genes, perhaps providing some support for it having a functional role. The more genome-wide association studies that are done across a range of traits, the more the same variants and the same genes are being identified as associated with two or more traits. Sometimes signals are identified as associated with traits where one can recognize what the underlying mechanism might be. Sometimes the relationship among the different diseases or traits that show evidence of association helps provide biological clues as to what the pathways might be that are responsible for a particular trait. So, there are variants being identified, for example, for both diabetes and cancer, and in at least one case, the same DNA variant was associated with increased risk of prostate cancer and decreased risk of type 2 diabetes. Examples like this are suggesting perhaps a role for cell cycle genes, and that variants can end up having different sorts of effects. Looking at the collections of traits and associations might help us understand what the driving biology is underlying a signal, and which association is coming as a result of that initial trait. So, in this analysis of genome-wide association signals, the authors took the set of SNPs that had shown evidence of association with a trait or disease, and then looked at annotation classes of where those variants were found in the genome, classes such as non-synonymous sites, regions around promoters, regions in introns, and regions that are intergenic. They compared randomly selected sets of variants on genome-wide association panels to those that showed evidence of association, and looked to see whether there was an excess of disease-associated variants in particular classes. In this particular analysis, here's the odds ratio of one, so anything crossing an odds ratio of one is not significant at the 5 percent level. But these classes here, non-synonymous variants, and promoter regions at the 1 kb and 5 kb definitions, all showed that the trait-associated SNPs were over-represented compared to just random variants on the genome-wide arrays. And even though many variants identified in introns and intergenic regions show evidence of association, there are

also many more variants on the arrays that have these characteristics. So, taken together, genome-wide associated variants are being identified that explain some of the population variation for various traits. Shown here is a subset of traits, a partial table from a recent review. It shows a set of traits and the heritability expected for each from pedigree studies. Some traits are more highly heritable than others, and the table shows, in comparison, the genome-wide association hits, the ones defined at genome-wide significance, and what proportion of this heritability they explain. In many cases we're looking at, say, about 10 percent of the heritability explained by the genome-wide association hits. Now, analyses are being done to evaluate what the effect of all common SNPs might be, not just the ones that have reached the threshold defining significance, but also the ones that maybe have not reached it yet, that with greater sample size and more power might reach it in the future, to estimate the heritability attributable to all SNPs being analyzed. You can see, for example, that the heritability that may be attributed to such common SNPs could increase a fair bit, though it is still not likely to represent all of the variation that may be present. Genome-wide association studies are largely restricted to common variants, and so this suggests that there are other genetic factors playing a role in heritability. The use of this information to prevent disease is really dependent on the disease and its heritability, and I should also say that in this particular case with type 1 diabetes, they included some variants known prior to the GWAS era that had a very strong effect when looking at that heritability number. One way that people are characterizing individuals is based on the number of risk alleles that they have, and you can see some evidence of differences between groups of individuals. So while the variants might not be very predictive for a given person, one can count them up. In this particular case, there were more than eight SNPs available that had shown evidence of association with height. For each individual, they counted up how many height-increasing alleles that person had, and then grouped them. Here's a block of individuals that had fewer than or equal to eight height-increasing alleles, with their average height plotted, compared to the individuals over here that had at least 16 height-increasing alleles and their average height. Between the individuals with the lowest and the highest number of height-increasing alleles, there is a few centimeters' difference in how tall they are. However, most individuals fall in the middle of this plot; these are common SNPs, and the individual predictability of the variants is relatively low. The value in clinical translation of these genome-wide association studies, then, largely starts with the novel biological insights. The hundreds, more than a thousand, signals identified in the past few years provide hundreds to thousands of novel biological signals to go investigate and evaluate, to determine what role those variants and those genes play in disease, which would then in time lead to clinical advances, particular drugs, or biomarkers that represent the disease better, potentially leading towards prevention.
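The allele-counting approach just described can be sketched with simulated data. Everything here is assumed for illustration: 20 SNPs, a made-up per-allele effect of about 0.3 cm, and arbitrary low/high count thresholds; the real height analysis used its own SNP set and measured heights. The point is only the shape of the result: groups differ by a few centimeters on average, while individuals overlap heavily.

```python
# Toy sketch of a trait-allele count ("genetic risk score"): sum the number
# of height-increasing alleles each person carries across associated SNPs,
# then compare the average trait between low-count and high-count groups.
# Genotypes, effect size, and thresholds are simulated assumptions.
import random

random.seed(1)
N_SNPS, N_PEOPLE = 20, 2000

people = []
for _ in range(N_PEOPLE):
    # genotype at each SNP: 0, 1, or 2 copies of the height-increasing allele
    genotype = [random.choice([0, 1, 2]) for _ in range(N_SNPS)]
    score = sum(genotype)                       # trait-allele count
    # assume each height-increasing allele adds ~0.3 cm on top of noise
    height = 170 + 0.3 * score + random.gauss(0, 6)
    people.append((score, height))

low = [h for s, h in people if s <= 16]          # few height-increasing alleles
high = [h for s, h in people if s >= 24]         # many height-increasing alleles
mean = lambda xs: sum(xs) / len(xs)
# the high-count group averages a few centimeters taller than the low-count
# group, yet within each group individual heights still overlap widely
print(round(mean(high) - mean(low), 1))
```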

There may be some improved measures of individual genetic risk, and I think you'll learn more about those, especially with respect to drug development and drug response, next week. So, in summary, when performing genome-wide association studies, or interpreting them, it's important to pay attention to design and quality control. Large sample sizes are needed to identify signals with modest effects. There are more than 1,400 signals and counting across the genome-wide association studies done to date. And finding a signal doesn't immediately provide information on the underlying biology or clinical utility, but it sets off lots of follow-up analysis that can lead to these discoveries; the time to changes in medical care based on some of these results might be years, but the biology is really advancing quickly. As we progress with genome-wide association studies, more and more loci are being identified; larger meta-analyses are being done, with groups gathering together more and more sets of samples; there is deeper follow-up of genome-wide association signals, with groups creating custom arrays of not just 50 variants but thousands of variants to follow up, to identify additional signals; population-specific panels are being developed to increase the range of genetic variants that can be analyzed in a given study; more diverse populations are being used to identify variants; other types of sequence variants, not just single nucleotide variants, are being incorporated; analyses are being done with multiple traits, looking at the relationships between those traits; these are beginning to allow gene-gene and gene-environment interactions to be evaluated; and finally, the data are generating evidence and spawning much future analysis to figure out the molecular and biological mechanisms underlying the signals. So, thank you very much for your attention. [applause]