Imputing newly identified SNPs

Posts: 1
Joined: Wed Jan 25, 2012 8:02 am

Imputing newly identified SNPs

Postby ecverbeek » Wed Jan 25, 2012 8:10 am

Hello everyone

We used massively parallel sequencing to detect new variants in three genes. We then genotyped these new variants plus several tag SNPs to completely cover these genes at m.a.f. = 0.1 and r2 = 0.9.
After quality control, we had 3216 individuals and 247 SNPs, with a total genotyping rate of 97.2% in the remaining individuals. I wanted to impute with MaCH to fill in the missing genotypes, using the 1000 Genomes CEU data as a reference. However, the variants that we detected are not in the 1000 Genomes data.
To overcome this, I took the 60 CEU samples and merged them in PLINK with 60 of my own samples (that had no missingness) into a single ped file. Then I ran MaCH with the --phase option to construct haplotypes. Because we have variants that 1000 Genomes does not have, and 1000 Genomes has variants that we did not genotype, this leads to the following situation:
1000G-1  AC GG CG 00
1000G-2  AA GG CC 00
Eva-1    00 GG 00 AC
Eva-2    00 GG 00 AA

So my samples have several zeros, but so do the 1000 Genomes CEU samples. Of course, with larger genes there are longer stretches of zeros. For instance, one of the selected genes is almost 1 Mb long and contains over 3000 SNPs in the 1000 Genomes data, whereas we genotyped only approximately 150 SNPs in it. As you can imagine, there are quite long stretches of zeros in this gene.
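To make the missingness pattern concrete, here is a minimal Python sketch that takes the four toy rows from the example above and reports the fraction of missing calls per SNP and the longest run of consecutive zeros per sample. The function names are made up for illustration and the data are only the toy rows, not real genotypes.

```python
# Toy merged genotype table copied from the example above; "00" means missing.
genotypes = {
    "1000G-1": ["AC", "GG", "CG", "00"],
    "1000G-2": ["AA", "GG", "CC", "00"],
    "Eva-1": ["00", "GG", "00", "AC"],
    "Eva-2": ["00", "GG", "00", "AA"],
}


def missing_rate_per_snp(table):
    """Return the fraction of samples with a missing call ("00") at each SNP."""
    rows = list(table.values())
    n_samples = len(rows)
    n_snps = len(rows[0])
    return [sum(row[j] == "00" for row in rows) / n_samples for j in range(n_snps)]


def longest_missing_run(calls):
    """Return the length of the longest run of consecutive missing calls."""
    longest = current = 0
    for call in calls:
        current = current + 1 if call == "00" else 0
        longest = max(longest, current)
    return longest


rates = missing_rate_per_snp(genotypes)
runs = {sample: longest_missing_run(calls) for sample, calls in genotypes.items()}
print(rates)
print(runs)
```

On a real gene the same two summaries would show how much of the merged file is zeros in either group of samples.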

So far I have only imputed my smallest gene, of approximately 50 kb, and this already took 19 hours to run (100 SNPs, of which we had genotyped 9). The other genes were so large that the imputation run was cut short by the server I was working on. To reduce memory use, I ran MaCH with the --compact and --greedy options.
My command line looked like this:

mach -d chr1.dat -p chr1.ped -s chr1.snps -h chr1.hap --errorMap chr1.parameters.erate --crossoverMap chr1.parameters.rec
--compact --greedy --autoFlip --mle --mldetails >& chr1.impute.log

The error maps etc. were generated using 100 iterations of the Markov chain procedure.

To reduce memory use and runtime, I then tested what would happen if I used only the 9 genotyped SNPs, out of the total of 100, to construct haplotypes and then impute. This cut the runtime to 7 hours. I then compared the results with those from imputing with all 100 SNPs. Out of a total of 28944 genotypes (9 x 3216), 1142 did not match between the two imputation strategies. Not an ideal situation!

My question to you is: do you have a strategy for imputing these data that does not take a ridiculously long time to run?
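For scale, the mismatch count above can be turned into a discordance rate. The short Python sketch below redoes that arithmetic with the numbers from this post and adds a small helper (a hypothetical name, not part of MaCH or PLINK) for comparing two lists of genotype calls, treating "AC" and "CA" as the same call.

```python
def discordance(calls_a, calls_b):
    """Count and rate of genotype calls that differ between two call sets.

    Allele order within a genotype is ignored, so "AC" equals "CA".
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("call sets must have the same length")
    mismatches = sum(
        sorted(a) != sorted(b) for a, b in zip(calls_a, calls_b)
    )
    return mismatches, mismatches / len(calls_a)


# Numbers quoted in the post: 9 genotyped SNPs, 3216 individuals, 1142 mismatches.
n_snps, n_individuals, n_mismatches = 9, 3216, 1142
n_genotypes = n_snps * n_individuals
rate = n_mismatches / n_genotypes
print(f"{n_genotypes} genotypes, {rate:.2%} discordant")

# Toy check of the helper on made-up calls.
print(discordance(["AA", "AC", "CC"], ["AA", "CA", "AC"]))
```

That is roughly 4% of the calls differing between the two strategies.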

Nicola Pirastu
GenABEL senior expert
Posts: 151
Joined: Wed Feb 09, 2011 3:24 pm

Re: Imputing newly identified SNPs

Postby Nicola Pirastu » Thu Feb 16, 2012 2:01 pm

If I get this right you are imputing just to fill some missing genotypes, right?

So I don't get why you put in the 1000G samples: you could just infer the missing genotypes in your own samples, using the samples you have as the template. In other words, you can tell MaCH or IMPUTE (although I haven't used it and I'm not sure) to fill in the blanks. Another issue is the way you are using the imputed data: you should never use the best guess, but use dosages instead. You must remember that you are using a piece of information that's not there, so you should not use it as if it were certain.
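As an illustration of the difference between best guess and dosage, here is a minimal Python sketch. It assumes the imputation output gives three posterior probabilities per genotype (AA, AB, BB); the probabilities and function names below are made up for the example and are not taken from MaCH output.

```python
def best_guess(p_aa, p_ab, p_bb):
    """Return the single most probable genotype, discarding the uncertainty."""
    probs = {"AA": p_aa, "AB": p_ab, "BB": p_bb}
    return max(probs, key=probs.get)


def dosage(p_ab, p_bb):
    """Return the expected number of B alleles (a value between 0 and 2)."""
    return p_ab + 2.0 * p_bb


# Hypothetical posterior for one individual at one SNP: AA is only barely
# more likely than AB.
p_aa, p_ab, p_bb = 0.50, 0.45, 0.05

print(best_guess(p_aa, p_ab, p_bb))  # prints "AA"
print(dosage(p_ab, p_bb))            # about 0.55 B alleles, not 0
```

The best guess treats this individual as carrying zero copies of B, while the dosage keeps the information that a heterozygote is almost as likely.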

My best advice would be:
1) Use your own samples to fill in the blanks without using 1000G.

2) If you want the other SNPs present in 1000G, use 1000G as the template, without imputing your SNPs onto 1000G.

Hope I've helped



Posts: 1
Joined: Fri Mar 30, 2012 11:24 am

Re: Imputing newly identified SNPs

Postby psb » Fri Mar 30, 2012 12:44 pm

Hi! I am also imputing from the 1000 Genomes reference dataset using MaCH, for a few candidate regions (approximately 1.5 Mb). My first step completes in 3 hours, but step 2 has been running for more than 48 hours and the programme is still going. Can you tell me how long step 2 of the imputation took for you after using the --compact and --greedy options? I want to get an estimate of the run time.

Posts: 1
Joined: Tue Sep 25, 2012 12:27 pm

Re: Imputing newly identified SNPs

Postby sal » Tue Sep 25, 2012 2:18 pm

I am wondering whether the SNP binary files are the same as the CNV files used in PLINK.
