Nature Protocols. Basic statistical analysis in genetic case

Statistical Genomics for dummies and advanced. Discussions, links, usefull information.
Forum rules
Welcome! Please feel free to raise any issue. There is no issues big or small. Let's work on them together.

Please note that the first few posts of newly registered users will be moderated in order to filter out any spammers.
GenABEL specialist
GenABEL specialist
Posts: 36
Joined: Thu Feb 03, 2011 12:29 pm

Nature Protocols. Basic statistical analysis in genetic case

Postby MariaG » Fri Aug 26, 2011 7:44 am

Nature protocols in 2007 – 2011 presented something like a series of papers referred to the protocols for association analysis:


Here I would like to open a discussion on the latest one “Basic statistical analysis in genetic case-control studies”.

The protocol tries to realize a great idea: to help you in understanding the reasons and basic steps in conducting the case-control associations analysis, both for candidate-gene and genome-wide approachers. In paper you can find the theory and corresponding examples with downloading data and true Plink commands.

The structure of the paper is just perfect! It's very easy to read and follow.

It has clear sub-devisions:

Conceptual basis for statistical analysis,
Models and measures of associations,
Tests for associations,
Multiple testing,
Interpretation of results,

Identify file formats
Basic descriptive summary
Single SNP tests of associations
Data visualization
Adjustment for multiple testing
Population stratification

CG study
GWA study

It provides a useful glossary and exact statistical formulas for the basic associated tests.

I did my first round of reading with a great pleasure and expectations of a high benefit. However, when I started to repeat the protocol procedure myself and read the text more careful, I found a lot of small but very confusing and even disturbing nuances, especially for a person not that much familiar with Plink (However the paper says in the abstract: “Assuming no previous experience with software such a PLINK”). Here I would like to share my notes with you. Did you get the same or other filings while reading and following the protocol?!

1) The paper pretend to teach us about basic approaches in statistical genetics but produces a confusion, narrowing the focus only to Plink while saying absolutely nothing about other possibilities (including GenABEL). Nowadays there is a great variability in bioinformatical tools for association tests, and it would be really interesting to know the (theoretical/practical) reason for such a pre-selection?

2) It was nice to learn the connections between the contingency tables and association tests (Box 2) and the Plink models. However not perfect correspondence between the therms used in the paper (“allelic test of association”, “simple chi-square-test tests of association”) comparing to the Plink manual terminology was for a first moment very confusing.

Actual correspondence looks like:

The term in the protocol “allelic test of association” is equal to the therm in Plink “Basic case/control association test” and the correspondent command is --assoc.

The therm in protocol “simple chi-square-test tests of association” is equal to the therm in Plink “Alternate / full model association tests” and the correspondent command is --model.

It would be useful to specify right away that the results of the “full model” given by command --model based on “allelic model” (see the output file 'data.model' and the values [TEST: ALLELIC]) is equal to the results of “Basic case/control association test” given by command --assoc.

The other small detail to stress is that the adjustment for multiple testing is not provided for the “full model” based on “allelic model” (there is NO (!) command --model-allelic --adjust) and can be carry out ONLY by --assoc --adjust.

3) One more small nuance: the command will not work if you use a single long dash in a command, e.g. –model, instead of two small dashes --model. Unfortunately the protocol is not careful with that...

I focused on that details because you need to know them if you try to reproduce the results from the protocol on page 131, “ANTICIPATED RESULTS | CG study | Summary of results. Table 4”.

4) So, going further to the discussion of the examples. On page 129 the paper says, that 40 markers were use for the candidate gene analysis, however in data file there is only 37 markers. It's confusing...

5) In table 4 on page 131, as far as I understood Plink manual, the columns named “Genomic control”, “bonferroni”, and “Holm” are related to the adjusted p-values, so it's confusing to call them “Unadjusted” like in the table 4.

6) In the same table 4 the values in columns “Holm”, “Sidak single step”, “Sidak step-down”, “FDR BH”, and “FDR BY” look to be absolutely wrong... Do you also, like me, see other values when try to repeat the protocol ???

7) Again table 4, column “Family-wise permutations”. Rows for SNPs rs4135247, rs2120825, rs3856806 give me other values than in the table, and are equal to 0.001998, 0.001998, 0.1129 correspondingly. Do you have the same???

8) Discussing the results in table 4 on page 131, the paper says: “Here we have defined a P value to be significant if at least one of the adjusted values is smaller than the threshold required to maintain a FWER of 0.05”. It sounds strange, first, that the protocol supposes the threshold for the p-values and FDR values to be the same. And, second, that the protocol doesn't discuss at all, which p-values they would suggest to publish and by which reason? To publish all that p-values? To choose the single test with the smallest p-values for the top-hits? How would you do that and why??

9) QQ-plot. Starting on page 128, Data visualization | 5 | (A) Quantile-quantile plot | (iii): What was the reason to calculate the expected p-values in such a way? Where is a reference?

Code: Select all

exp  <- -log10( c(1:length(obs)) /(length(obs) + 1))

I know a very useful R-script, proposed by a brilliant guy Stephen Turner ( ... plots.html),
which help to construct a very nice QQ-plot. It uses ppoints() which is an R-function from a standard package “stats” to calculate the ordinates for Probability Plotting. Ppoints() has the references and calculates the sequence of probability points as

Code: Select all

(1:m - a)/(m +(1-a)-a),

Code: Select all

where “m” is  length(n), “a = ifelse(n <= 10, 3/8, 1/2)”.

So, in our case “a” shoud be equal to 1/2, but not 1 as in the current protocol... The difference is not may be that big, but confusing...

10) Let's go to the LD plot. Figure 1.

First of all a small confusing thing, the text and the picture promise that I will see the significant SNPs in white boxes... Do you? I don't ... I had to find them myself based on SNP-names in table 4 …

Second, the statement “Block 2, which is a 52-kb block of high LD (r-sqr > 0.34)” sounds kind of strange... How this value of 0.34 was calculated over the LD block? Would you personally say that the correlation of 34% is really high?...

Finally, my LD-plot (my reproduction of the protocol) represented the same LD blocks, but absolutely different correlation coefficients in the diamonds (please, see the file cg.LD.png attached)!... And what do you see?...


So, with all of that my impression is that sometimes the paper looks very promising, but a lot of small misleadings give a feeling that you can't really get the idea, and something which you would like to understand just gets lost...

Or may be I just did something wrong ?
cg.LD.png (136.53 KiB) Viewed 6486 times

Nicola Pirastu
GenABEL senior expert
GenABEL senior expert
Posts: 151
Joined: Wed Feb 09, 2011 3:24 pm

Re: Nature Protocols. Basic statistical analysis in genetic

Postby Nicola Pirastu » Mon Sep 19, 2011 4:09 pm

Hi Maria,

just a quick note on the LD, the plot you're looking at is D' that's why you get different numbers, you need to change the settings to see r2,
It will look grey and not red that's how I know :-)


Posts: 1
Joined: Fri Jan 20, 2012 3:31 pm

Re: Nature Protocols. Basic statistical analysis in genetic

Postby lexybam » Fri Jan 20, 2012 4:31 pm

The fact is you will always get the wrong perceptive if you do not do the right thing. The code you are to find is D and if you go astray from that, you will always get the wrong signal.

Posts: 1
Joined: Mon Jul 09, 2012 1:14 pm

Re: Nature Protocols. Basic statistical analysis in genetic

Postby KathieC » Mon Jul 09, 2012 1:18 pm

I would say this is a very interesting piece of information, Maria! Thanks for sharing!

winrar free

Return to “Journal Club on Statistical Genomics”

Who is online

Users browsing this forum: No registered users and 1 guest