[url]http://www.nature.com/search/executeSearch?sp-q=gwa&sp-p=all&include-collections=journals_nature%2Ccrawled_content&exclude-collections=journals_palgrave%2Clab_animal%2Cprecedings&pag-start=1&sp-c=25&sp-m=1&sp-s=date_descending&siteCode=nprot&genre=protocol&sp-advanced=true&sp-q-9[NPROT]=1&sp-q-9[PRNE]=1[/url]
Here I would like to open a discussion on the latest one, “Basic statistical analysis in genetic case-control studies”.
http://www.nature.com/nprot/journal/v6/n2/full/nprot.2010.182.html
The protocol tries to realize a great idea: to help you understand the reasoning and the basic steps in conducting case-control association analysis, both for candidate-gene and genome-wide approaches. In the paper you can find the theory and corresponding examples, with downloadable data and real PLINK commands.
The structure of the paper is just perfect! It's very easy to read and follow.
It has clear subdivisions:
INTRODUCTION:
Conceptual basis for statistical analysis,
Models and measures of associations,
Tests for associations,
Multiple testing,
Interpretation of results,
Replication,
Software.
PROCEDURE:
Identify file formats
Basic descriptive summary
Single SNP tests of associations
Data visualization
Adjustment for multiple testing
Population stratification
ANTICIPATED RESULTS:
CG study
GWA study
It provides a useful glossary and exact statistical formulas for the basic association tests.
I did my first round of reading with great pleasure and expectations of a high benefit. However, when I started to repeat the protocol procedure myself and read the text more carefully, I found a lot of small but very confusing and even disturbing nuances, especially for a person not that familiar with PLINK (even though the paper says in the abstract: “Assuming no previous experience with software such as PLINK”). Here I would like to share my notes with you. Did you get the same or different feelings while reading and following the protocol?!
1) The paper claims to teach us the basic approaches in statistical genetics but creates confusion by narrowing the focus to PLINK alone, while saying absolutely nothing about other possibilities (including GenABEL). Nowadays there is a wide variety of bioinformatics tools for association tests, and it would be really interesting to know the (theoretical/practical) reason for such a pre-selection.
2) It was nice to learn the connections between the contingency tables and association tests (Box 2) and the PLINK models. However, the imperfect correspondence between the terms used in the paper (“allelic test of association”, “simple chi-square tests of association”) and the terminology of the PLINK manual was at first very confusing.
The actual correspondence looks like this:
The protocol's term “allelic test of association” is equal to the PLINK term “Basic case/control association test”, and the corresponding command is --assoc.
The protocol's term “simple chi-square tests of association” is equal to the PLINK term “Alternate / full model association tests”, and the corresponding command is --model.
It would be useful to state right away that the “full model” results given by the command --model under the “allelic model” (see the output file 'data.model' and the rows marked [TEST: ALLELIC]) are equal to the results of the “Basic case/control association test” given by the command --assoc.
The other small detail to stress is that the adjustment for multiple testing is not provided for the “full model” restricted to the “allelic model” (there is NO (!) command --model-allelic --adjust); it can be carried out ONLY via --assoc --adjust.
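To make the equivalence concrete, here is a minimal sketch (in Python, with purely hypothetical genotype counts, not the protocol's data) of the allelic 2x2 chi-square that both --assoc and the ALLELIC row of --model report: collapse the 2x3 genotype table into a 2x2 allele-count table and apply the standard chi-square.

```python
# Hypothetical genotype counts (AA, Aa, aa) in cases and controls:
cases = (30, 50, 20)      # 100 cases
controls = (50, 40, 10)   # 100 controls

# Collapse genotypes into allele counts: each AA carries two A alleles, etc.
a_case = 2 * cases[0] + cases[1]          # A alleles in cases
b_case = 2 * cases[2] + cases[1]          # a alleles in cases
a_ctrl = 2 * controls[0] + controls[1]    # A alleles in controls
b_ctrl = 2 * controls[2] + controls[1]    # a alleles in controls

# Standard 2x2 chi-square (1 d.f.), no continuity correction:
n = a_case + b_case + a_ctrl + b_ctrl
num = n * (a_case * b_ctrl - b_case * a_ctrl) ** 2
den = (a_case + b_case) * (a_ctrl + b_ctrl) * (a_case + a_ctrl) * (b_case + b_ctrl)
chi2 = num / den
print(round(chi2, 3))  # → 9.6 for these made-up counts
```

On the same data, the CHISQ and P printed by --assoc should match the ALLELIC row of the --model output computed this way.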
3) One more small nuance: a command will not work if you use a single long dash (an en dash), e.g. –model, instead of two ordinary hyphens: --model. Unfortunately, the protocol is not careful with that...
I focused on these details because you need to know them if you try to reproduce the results from the protocol on page 131, “ANTICIPATED RESULTS | CG study | Summary of results. Table 4”.
4) So, going further to the discussion of the examples. On page 129 the paper says that 40 markers were used for the candidate-gene analysis; however, the data file contains only 37 markers. It's confusing...
5) In Table 4 on page 131, as far as I understand the PLINK manual, the columns named “Genomic control”, “Bonferroni” and “Holm” contain adjusted p-values, so it's confusing to label them “Unadjusted” as Table 4 does.
6) In the same Table 4, the values in the columns “Holm”, “Sidak single step”, “Sidak step-down”, “FDR BH” and “FDR BY” look absolutely wrong... Do you also, like me, get different values when you try to repeat the protocol???
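One way to cross-check those columns is to recompute the adjustments yourself from the unadjusted p-values. Here is a sketch in Python (the p-value vector is hypothetical, just to illustrate the standard step-down/step-up formulas behind PLINK's --adjust columns):

```python
# Hypothetical raw p-values, sorted ascending (as in PLINK's *.adjusted files)
p = [0.0001, 0.004, 0.019, 0.095, 0.201]
m = len(p)

bonf = [min(1.0, x * m) for x in p]            # Bonferroni single-step
sidak_ss = [1 - (1 - x) ** m for x in p]       # Sidak single-step

# Holm and Sidak step-down: enforce monotonicity with a running maximum
holm, sidak_sd = [], []
run1 = run2 = 0.0
for i, x in enumerate(p):
    run1 = max(run1, min(1.0, (m - i) * x))
    run2 = max(run2, 1 - (1 - x) ** (m - i))
    holm.append(run1)
    sidak_sd.append(run2)

# Benjamini-Hochberg and Benjamini-Yekutieli: step-up, running minimum
c_m = sum(1.0 / k for k in range(1, m + 1))    # BY harmonic constant
bh, by = [0.0] * m, [0.0] * m
min1 = min2 = 1.0
for i in range(m - 1, -1, -1):
    min1 = min(min1, p[i] * m / (i + 1))
    min2 = min(min2, p[i] * m * c_m / (i + 1))
    bh[i] = min(1.0, min1)
    by[i] = min(1.0, min2)

print([round(x, 4) for x in holm])  # → [0.0005, 0.016, 0.057, 0.19, 0.201]
```

Feeding the UNADJ column of your own --assoc --adjust output into something like this makes it easy to see which numbers in Table 4 are off.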
7) Again Table 4, column “Family-wise permutations”. The rows for SNPs rs4135247, rs2120825 and rs3856806 give me different values from those in the table: 0.001998, 0.001998 and 0.1129, respectively. Do you get the same???
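For the permutation column it also helps to know how the empirical p-value is formed: as far as I can tell from the PLINK documentation, it reports (R + 1)/(N + 1), where R counts the permutations whose statistic reaches the observed one, so with the default 1,000 permutations the value 0.001998 is exactly 2/1001 and the smallest attainable value is 1/1001 ≈ 0.000999. Below is a toy single-SNP sketch in Python (hypothetical data, far smaller than the protocol's); the family-wise (max-T) variant compares against the maximum statistic over all SNPs in each permutation, but the (R + 1)/(N + 1) counting is the same.

```python
import random

random.seed(1)

def chi2_2x2(a, b, c, d):
    """Chi-square for a 2x2 allele-count table, no continuity correction."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

# Hypothetical per-individual minor-allele counts (0/1/2) and case labels
genos = [2, 1, 1, 0, 2, 1, 0, 0, 1, 2, 2, 1, 0, 1, 0, 1, 2, 0, 1, 1]
labels = [1] * 10 + [0] * 10   # first 10 individuals are cases

def stat(lab):
    a = sum(g for g, l in zip(genos, lab) if l)        # A alleles in cases
    b = sum(2 - g for g, l in zip(genos, lab) if l)    # a alleles in cases
    c = sum(g for g, l in zip(genos, lab) if not l)
    d = sum(2 - g for g, l in zip(genos, lab) if not l)
    return chi2_2x2(a, b, c, d)

observed = stat(labels)
n_perm = 1000
# Shuffle case/control labels, keep genotypes fixed, recompute the statistic
hits = sum(stat(random.sample(labels, len(labels))) >= observed
           for _ in range(n_perm))
perm_p = (hits + 1) / (n_perm + 1)
print(perm_p)
```

Knowing that 0.001998 means “2 of 1,001” makes the discrete, granular values in the permutation column much less mysterious.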
8) Discussing the results in Table 4 on page 131, the paper says: “Here we have defined a P value to be significant if at least one of the adjusted values is smaller than the threshold required to maintain a FWER of 0.05”. It sounds strange, first, that the protocol assumes the same threshold for the FWER-adjusted p-values and the FDR values. And, second, the protocol doesn't discuss at all which p-values they would suggest publishing, and why. Publish all of those p-values? Choose the single test with the smallest p-value for the top hits? How would you do that, and why??
9) QQ-plot. Starting on page 128, Data visualization | 5 | (A) Quantile-quantile plot | (iii): what was the reason to calculate the expected p-values in such a way? Where is the reference?
Code: Select all
exp <- -log10( c(1:length(obs)) /(length(obs) + 1))
I know a very useful R script, proposed by a brilliant guy, Stephen Turner (http://gettinggeneticsdone.blogspot.com ... plots.html),
which helps to construct a very nice QQ-plot. It uses ppoints(), an R function from the standard package “stats”, to calculate the ordinates for probability plotting. ppoints() has references and calculates the sequence of probability points as
Code: Select all
(1:m - a)/(m +(1-a)-a),
where “m” is length(n) and “a = ifelse(n <= 10, 3/8, 1/2)”.
So, in our case “a” should be equal to 1/2, whereas the formula used in the current protocol, (1:m)/(m + 1), corresponds to a = 0... The difference may not be that big, but it is confusing...
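To see how much the two conventions actually differ, here is a small comparison in Python (m = 100 is an arbitrary choice, not the protocol's data size):

```python
import math

m = 100  # hypothetical number of observed p-values

# Protocol's formula: i/(m + 1), i.e. a = 0 in the general (i - a)/(m + 1 - 2a)
exp_protocol = [-math.log10(i / (m + 1)) for i in range(1, m + 1)]

# ppoints()-style formula for m > 10: (i - 1/2)/m, i.e. a = 1/2
exp_ppoints = [-math.log10((i - 0.5) / m) for i in range(1, m + 1)]

# Absolute difference per rank, on the -log10 scale of the QQ-plot
diff = [abs(p - q) for p, q in zip(exp_protocol, exp_ppoints)]
print(round(max(diff), 4))  # → 0.2967, at the very first rank
```

The gap is largest at the smallest ranks, i.e. exactly at the top of the QQ-plot where you inspect your strongest hits, which is why the choice of convention is worth a reference.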
10) Let's go to the LD plot. Figure 1.
First of all, a small confusing thing: the text and the figure promise that I will see the significant SNPs in white boxes... Do you? I don't... I had to find them myself based on the SNP names in Table 4...
Second, the statement “Block 2, which is a 52-kb block of high LD (r-sqr > 0.34)” sounds kind of strange... How was this value of 0.34 calculated over the LD block? And would you personally say that a correlation of 34% is really high?...
Finally, my LD-plot (my reproduction of the protocol) showed the same LD blocks but absolutely different correlation coefficients in the diamonds (please see the attached file cg.LD.png)!... And what do you see?...
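For reference, the pairwise value in each diamond can be recomputed by hand from haplotype frequencies. Here is a minimal sketch in Python with hypothetical haplotype counts (note also that Haploview can display either D' or r² in the diamonds, which by itself could explain a mismatch between two plots):

```python
# Hypothetical phased haplotype counts for two biallelic SNPs (alleles A/a, B/b)
n_AB, n_Ab, n_aB, n_ab = 60, 10, 15, 115

n = n_AB + n_Ab + n_aB + n_ab
p_A = (n_AB + n_Ab) / n            # frequency of allele A at SNP 1
p_B = (n_AB + n_aB) / n            # frequency of allele B at SNP 2

D = n_AB / n - p_A * p_B           # linkage disequilibrium coefficient
r2 = D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))
print(round(r2, 3))  # → 0.534 for these made-up counts
```

If one plot prints D' and the other r², the block structure can agree while every coefficient in the diamonds differs.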
***
So, with all of that, my impression is that the paper sometimes looks very promising, but a lot of small misleading details give a feeling that you can't really get the idea, and something you would like to understand just gets lost...
Or maybe I just did something wrong?