Notes on GWAS Quality Control/Quality Assurance

MariaG
GenABEL specialist
Posts: 36
Joined: Thu Feb 03, 2011 12:29 pm

Notes on GWAS Quality Control/Quality Assurance

Postby MariaG » Sun Apr 17, 2011 9:59 pm

While doing association analysis, e.g. with GenABEL, I wondered how people choose the appropriate methods and parameter values for Quality Control, and how they check Quality Assurance afterwards.

Here I provide some links which were useful for me.
I’m pretty sure you might add even better links! Please, do that :P

An extensive overview of GWA in general, and of analysis quality in particular:
“Genome-wide association studies for complex traits: consensus, uncertainty and challenges”, McCarthy, Abecasis, Cardon, Goldstein, Little, Ioannidis and Hirschhorn. Nature Reviews Genetics, 2008
http://www.nature.com/nrg/journal/v9/n5/abs/nrg2344.html

What I personally liked in the paper:

(a) They stress that there is a “propensity for latent population structure (population stratification and cryptic relatedness) to inflate the type 1 error rate and generate spurious claims of association around variants that are informative for that substructure”.
To check whether my data show population stratification, and how to get rid of its effect, I can follow the instructions in the “GenABEL Tutorial”.

(b) The paper suggests checking for sample duplicates.
To do that I can look for duplicated subject names, and also use a high threshold for the ibs (identical-by-state) parameter (e.g. in GenABEL ibs.threshold = 0.95).
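
A minimal sketch of the duplicate check in (b), written in plain Python rather than GenABEL. It assumes genotypes coded as 0/1/2 copies of one allele with None for a missing call, and uses one common definition of IBS (1, 0.5 or 0 per SNP, averaged over SNPs called in both samples). The function names and the toy data are made up for illustration; only the 0.95 threshold comes from the post.

```python
from itertools import combinations

def ibs_fraction(g1, g2):
    """Mean identity-by-state between two samples.

    Genotypes are coded as 0, 1 or 2 copies of one allele;
    None marks a missing call and that SNP is skipped.
    """
    shared = []
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared.append((2 - abs(a - b)) / 2)  # 1.0, 0.5 or 0.0 per SNP
    if not shared:
        raise ValueError("no SNPs called in both samples")
    return sum(shared) / len(shared)

def find_duplicate_pairs(genotypes, threshold=0.95):
    """Return (id1, id2, ibs) for every pair with IBS at or above the threshold."""
    pairs = []
    for (id1, g1), (id2, g2) in combinations(sorted(genotypes.items()), 2):
        ibs = ibs_fraction(g1, g2)
        if ibs >= threshold:
            pairs.append((id1, id2, ibs))
    return pairs

# Toy data: s1 and s3 are the same DNA genotyped twice, with one missing call.
genotypes = {
    "s1": [0, 1, 2, 1, 0, 2, 1, 0],
    "s2": [2, 1, 0, 0, 1, 2, 0, 1],
    "s3": [0, 1, 2, 1, None, 2, 1, 0],
}
duplicates = find_duplicate_pairs(genotypes)
print(duplicates)
```

With real data the pairwise loop grows quadratically with the number of samples, so the comparison is usually run on a random subset of a few thousand autosomal SNPs.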

(c) The paper shows how to judge the genotyping quality of a single SNP from its signal intensity (cluster) plots.
The other link shows that knowing how good the clustering is can also help to choose the right threshold for Hardy-Weinberg Equilibrium (HWE) filtering: the HWE p-value threshold should be set so that most of the SNPs passing it have good-quality cluster plots:
“Quality Control and Quality Assurance in Genotypic Data for Genome-Wide Association Studies”, Laurie et al. Genetic Epidemiology, 2010
http://www.ncbi.nlm.nih.gov/pubmed/20718045

(d) The paper describes the use of the Q-Q plot for diagnostics of GWA results. Based on the features of the Q-Q plot I can then choose the appropriate approach (QC parameters and the mathematical model) for GWA. The important detail is: if the observed values are already inflated in the region close to (0,0), i.e. the points leave the diagonal early and stay above it, this most probably indicates population stratification or cryptic relatedness. If the inflation appears only towards the higher (x,y) values, it suggests an excess of strong associations.

Another helpful method (please see the link below) to check whether the inflation is caused by population stratification or by true association: exclude the top SNPs (those with the lowest p-values) and construct the Q-Q plot again. If the inflation at higher (x,y) disappears, the deviation was most probably caused by the association:
“How to Interpret a Genome-wide Association Study.” Pearson, Manolio. JAMA, 2008
http://jama.ama-assn.org/content/299/11/1335.full?maxtoshow=&hits=10&RESULTFORMAT=&fulltext=Inflammatory+Bowel+Disease&searchid=1&FIRSTINDEX=0&resourcetype=HWFIG
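
A rough numerical version of this Q-Q diagnostic, as a sketch on simulated p-values only. Two things here are my additions and not taken from the papers above: the genomic inflation factor lambda (median observed chi-square over its null median) is used as a stand-in for "inflation along the whole line", and `tail_excess` is a made-up helper that summarises the upper tail of the plot. The 20 hits at p = 1e-12 and the inflation factor 1.3 are arbitrary.

```python
import math
import random
from statistics import NormalDist, median

NORM = NormalDist()

def chi2_from_p(p):
    """1-df chi-square statistic corresponding to a two-sided p-value in (0, 1]."""
    return NORM.inv_cdf(p / 2) ** 2

def p_from_chi2(chi2):
    """Two-sided p-value for a 1-df chi-square statistic."""
    return 2 * NORM.cdf(-math.sqrt(chi2))

def genomic_inflation(pvalues):
    """Median observed chi-square divided by the null median (about 0.455)."""
    null_median = NORM.inv_cdf(0.75) ** 2
    return median(chi2_from_p(p) for p in pvalues) / null_median

def tail_excess(pvalues):
    """Largest observed-minus-expected -log10(p): a crude summary of the Q-Q tail."""
    n = len(pvalues)
    observed = sorted((-math.log10(p) for p in pvalues), reverse=True)
    expected = [-math.log10((i - 0.5) / n) for i in range(1, n + 1)]
    return max(o - e for o, e in zip(observed, expected))

random.seed(1)
null_p = [1.0 - random.random() for _ in range(10000)]            # uniform on (0, 1]
with_hits = null_p + [1e-12] * 20                                 # 20 made-up true associations
stratified = [p_from_chi2(1.3 * chi2_from_p(p)) for p in null_p]  # every statistic inflated

trimmed = sorted(with_hits)[20:]                                  # drop the 20 smallest p-values

for label, pv in [("true hits", with_hits),
                  ("hits removed", trimmed),
                  ("stratified", stratified)]:
    print(f"{label:13s} lambda={genomic_inflation(pv):.2f} "
          f"tail excess={tail_excess(pv):.2f}")
```

In the "true hits" scenario lambda stays close to 1 and the excess sits only in the tail, and it disappears once the top SNPs are dropped. In the "stratified" scenario lambda itself is inflated, and removing a few top SNPs would not change that.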

The other two links below show that the Q-Q plot can also be used to assess the quality of HWE filtering. Deviations from the y = x line correspond to loci that deviate from the null hypothesis. So I can check which HWE p-value threshold gives the lowest deviation, and use it as a QC parameter.
“A tutorial on statistical methods for population association studies”, Balding. Nature Reviews Genetics, 2006
http://www.nature.com/nrg/journal/v7/n10/full/nrg1916.html
“Notes on GWAS QC/QA”, Weale. 2009
http://www.docstoc.com/docs/41239840/MEWGWASQCNotes20090205
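
To make the HWE threshold idea concrete, here is a small sketch that computes an HWE p-value from genotype counts and shows which SNPs would be removed at a few candidate thresholds. It uses the simple 1-df chi-square test; QC tools often use an exact test instead, which behaves better for rare alleles, so treat this as an illustration and not as what GenABEL does internally. The SNP names and counts are invented.

```python
import math

def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """1-df chi-square test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)        # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df

# Hypothetical genotype counts (AA, AB, BB) per SNP.
snps = {
    "rs_ok": (360, 480, 160),    # exactly in HWE proportions
    "rs_mild": (385, 430, 185),  # modest heterozygote deficit
    "rs_bad": (500, 200, 300),   # gross heterozygote deficit
}
pvals = {name: hwe_chisq_pvalue(*counts) for name, counts in snps.items()}

# Which SNPs would each candidate threshold remove?
for threshold in (1e-2, 1e-4, 1e-6):
    failed = sorted(name for name, p in pvals.items() if p < threshold)
    print(threshold, failed)
```

The HWE p-values that survive each threshold can then be put on a Q-Q plot to see which threshold leaves the least deviation from y = x.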


What I didn’t find in the paper: information about “heterozygosity” filtering. According to the link mentioned above (Weale, 2009), unusually high heterozygosity in a sample suggests possible sample contamination. In GenABEL’s default QC filtering the threshold is het.fdr = 0.01. I’m not sure that I’ve really got the idea behind that. Do you know a good link that clarifies the biological and statistical basis for heterozygosity filtering?

yurii
GenABEL developer
Posts: 263
Joined: Fri Jan 21, 2011 5:20 pm

Re: Notes on GWAS Quality Control/Quality Assurance

Postby yurii » Mon Apr 18, 2011 8:08 am

Adding my 2 cents, here is a link on sex verification procedures:

viewtopic.php?f=6&t=446

While it is GenABEL-centric, the post lists a few general principles.
Note that (Gen)ABELs are dynamically developing; while this post is intended to provide full information at the time of posting, please read on further posts, if any, as the topic may be updated with novel solutions at a later stage.

best regards,
Yurii

yurii
GenABEL developer
Posts: 263
Joined: Fri Jan 21, 2011 5:20 pm

Re: Notes on GWAS Quality Control/Quality Assurance

Postby yurii » Mon Apr 18, 2011 8:22 am

MariaG wrote:
What I didn’t find in the paper: information about “heterozygosity” filtering. According to the link mentioned above (Weale, 2009), unusually high heterozygosity in a sample suggests possible sample contamination. In GenABEL’s default QC filtering the threshold is het.fdr = 0.01. I’m not sure that I’ve really got the idea behind that. Do you know a good link that clarifies the biological and statistical basis for heterozygosity filtering?


I think the idea is more or less like this: if you have DNA from a single person, there is a certain chance that you will see a homozygote of a certain type, or a heterozygote. But if you add more and more DNAs, the chance that you will see BOTH alleles increases. E.g. if MAF is 0.1, the chance to see both alleles in one person (assuming HWE) is

P(AB) = 2*0.1*0.9 = 0.18,

but if you look at the DNA of 2 people, you can see A and B if (person 1 is AB and person 2 is not AB) or (person 2 is AB and person 1 is not AB) or (both are AB) or (person 1 is AA and person 2 is BB) or (person 1 is BB and person 2 is AA), so

P(see A and B in 2 people) = 0.18*0.18 + 2*0.18*(1-0.18) + 2*0.1^2*0.9^2 = 0.34 > 0.18
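
The same number can be obtained more directly: both alleles are seen unless all 2n chromosomes in the mixture carry the same allele. A small sketch of that shortcut (the function name is made up, and Hardy-Weinberg proportions and unrelated people are assumed):

```python
def prob_both_alleles(maf, n_people):
    """Probability that both alleles appear among 2 * n_people chromosomes."""
    n_chrom = 2 * n_people
    return 1 - (1 - maf) ** n_chrom - maf ** n_chrom

# One person reproduces P(AB); two people reproduce the sum written out above.
enumerated = 0.18 * 0.18 + 2 * 0.18 * (1 - 0.18) + 2 * 0.1 ** 2 * 0.9 ** 2
for n in (1, 2, 3):
    print(n, round(prob_both_alleles(0.1, n), 4))
print(round(enumerated, 4))
```

So every extra DNA in the mix pushes the apparent heterozygosity further up, which is why contaminated samples stand out.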

As for statistics, any 'outlier detection' procedure can be used: e.g. flag samples more than 5 SD away from the mean or, as in check.marker of the GenABEL package, treat the ones with FDR < 0.01 as outliers.
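
A sketch of the simpler of the two options, the SD rule, on simulated per-sample heterozygosity values. This is not the FDR procedure used by check.marker; the sample IDs, the mean of 0.30 and the contaminated value of 0.40 are invented for illustration.

```python
import random
from statistics import mean, stdev

def sd_outliers(values_by_sample, n_sd=5.0):
    """Return sample IDs whose value lies more than n_sd SDs from the mean."""
    values = list(values_by_sample.values())
    centre, spread = mean(values), stdev(values)
    return sorted(s for s, v in values_by_sample.items()
                  if abs(v - centre) > n_sd * spread)

random.seed(7)
# 200 ordinary samples with heterozygosity around 0.30 ...
het = {f"id{i:03d}": random.gauss(0.30, 0.005) for i in range(200)}
# ... plus one made-up sample with excess heterozygosity.
het["contaminated"] = 0.40

print(sd_outliers(het))
```

Note that the outlier itself inflates the SD, so with small sample sizes a robust spread estimate (e.g. based on the median) may be a safer choice.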

It is also quite common that samples which do not pass the heterozygosity criterion also have a low call rate. It would actually be interesting to look at the intensity plots for samples not passing the heterozygosity threshold and try to guess how many DNAs are mixed together, and in what proportion. I do not quite see the practical side of this, but it may be fun.

Yurii

MariaG
GenABEL specialist
Posts: 36
Joined: Thu Feb 03, 2011 12:29 pm

Re: Notes on GWAS Quality Control/Quality Assurance

Postby MariaG » Sun May 08, 2011 8:12 pm

Well, a further QC question, for example:

How important is it to check the "normality" of the phenotype? I haven't seen any GWAS paper that mentions the quality of the phenotype. Even when people use a "sqrt" or "log" transformation, they usually don't show the p-values for normality. What would you say about that? I would appreciate it if you could give me some useful links :)

Thanks

Nicola Pirastu
GenABEL senior expert
Posts: 151
Joined: Wed Feb 09, 2011 3:24 pm

Re: Notes on GWAS Quality Control/Quality Assurance

Postby Nicola Pirastu » Thu May 26, 2011 3:00 am

Well this is not a reference in the strict sense but I think it clarifies some points.

http://www.duke.edu/~rnau/testing.htm

So, as you can read there, normality of the trait itself is not strictly required. The important thing is that the errors (residuals) are normally distributed, especially for estimating p-values.

As far as I know, no one reports tests of normality for traits, for two main reasons:

1. It is not required, for the reasons mentioned above.

2. If you run a Shapiro test on real data it will always be significant (I'd love to see someone who could actually find a really normally distributed trait).
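
A toy simulation of the first point, as a sketch under made-up assumptions: a trait that depends strongly on a binary covariate (called `sex` here purely for illustration) is clearly non-normal, while the residuals after removing the covariate effect look normal. Excess kurtosis (0 for a normal distribution) is used as a crude normality summary so that only the standard library is needed.

```python
import random
from statistics import mean

def excess_kurtosis(values):
    """Fourth standardised moment minus 3; about 0 for normal data."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0

random.seed(42)
n = 5000
sex = [i % 2 for i in range(n)]                          # hypothetical binary covariate
trait = [4.0 * s + random.gauss(0.0, 1.0) for s in sex]  # normal errors, bimodal trait

# Residuals after removing the covariate effect. Subtracting the group means is
# what a linear regression on a single binary covariate amounts to.
group_mean = {g: mean(t for t, s in zip(trait, sex) if s == g) for g in (0, 1)}
residuals = [t - group_mean[s] for t, s in zip(trait, sex)]

print("trait     excess kurtosis:", round(excess_kurtosis(trait), 2))
print("residuals excess kurtosis:", round(excess_kurtosis(residuals), 2))
```

So a normality test on the raw trait would reject here even though the model assumptions hold perfectly; it is the residuals of the fitted model that are worth checking.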



Nicola

Jodie81
Posts: 1
Joined: Thu Apr 12, 2012 9:23 am

Re: Notes on GWAS Quality Control/Quality Assurance

Postby Jodie81 » Thu Apr 12, 2012 9:26 am

Thanks a lot for sharing, Maria and yurii!
I'm writing a talk on this topic, so your thoughts on it mean a lot.
That's very useful for me!

