Issue with the input files for analysis

Post by sanjana_chop » Wed Jan 20, 2016 9:44 pm

Hello!

I'm very new to GWAS and I'm performing imputation for the first time. Please help me out here! Thanks! Forgive me if this has been addressed previously. If so, please provide me with the link.

I've imputed my data using the Michigan Imputation Server which uses the minimac3 software. The output has two files: chr#.info and chr#.dose.vcf. I would like to run the analysis using ProbABEL as it supports minimac and MaCH. I have the following questions:

1) Will I be able to run an analysis with the provided files? I don't seem to have a file with the phenotype of interest and the covariates.

2) The info file has the following columns:

Code:

SNP    REF(0)    ALT(1)    ALT_Frq    MAF    AvgCall    Rsq    Genotyped    LooRsq    EmpR    EmpRsq    Dose0    Dose1
1:10177    A    AC    0.43686    0.43686    0.58041    0.01965    Imputed    -    -    -    -    -
1:10235    T    TA    0.00106    0.00106    0.99894    0.00005    Imputed    -    -    -    -    -
1:10352    T    TA    0.44435    0.44435    0.57366    0.01838    Imputed    -    -    -    -    -

Can I use this file to run an analysis with ProbABEL? If so, how should it be modified?

3) The dosage file is in VCF format and looks nothing like the MLDOSE file that is required.

Code:

##fileformat=VCFv4.1
##filedate=2016.1.19
##source=Minimac3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT

How should I proceed?

4) How else can I analyze the data? Is there a different program that would suit my data?

Thanks!

Re: Issue with the input files for analysis

Post by lckarssen » Sun Feb 28, 2016 10:21 pm

sanjana_chop wrote:1) Will I be able to run an analysis with the provided files? I don't seem to have a file with the phenotype of interest and the covariates.

In principle you can run an association analysis with the data in these VCF files, but as you noted, reformatting is required because ProbABEL doesn't know how to read VCF files (feel free to add a feature request on our GitHub page: https://github.com/GenABEL-Project/ProbABEL/issues). As for the phenotype and covariate data, those files have to be created by you yourself. How could the Michigan Imputation Server know about your phenotype data?
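
Once you have created a phenotype file and a dosage file, it is worth checking that both list the individuals in the same order. The sketch below assumes the layout described in the ProbABEL manual as I understand it: a phenotype file with one header line and the IDs in the first column, and a dosage file whose first column looks like `1->ID`. The function name `check_id_order` and the file names are made up for illustration, so please verify the formats against the manual before relying on it.

```shell
# Hypothetical helper: compare the ID order of a phenotype file and a dosage file.
# Assumes: the phenotype file has a header line and IDs in column 1;
#          the dosage file has "N->ID" in column 1 and no header line.
check_id_order() {
    # IDs from the phenotype file, skipping the header line
    pheno_ids=$(awk 'NR > 1 { print $1 }' "$1")
    # IDs from the dosage file, with everything up to "->" stripped
    dose_ids=$(awk '{ sub(/^.*->/, "", $1); print $1 }' "$2")
    if [ "$pheno_ids" = "$dose_ids" ]; then
        echo "ID order matches"
    else
        echo "ID order differs" >&2
        return 1
    fi
}
```

You would call it as `check_id_order pheno.txt chr1.dose.txt`, with your own file names.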
sanjana_chop wrote:2) The info file has the following columns:

Code:

SNP    REF(0)    ALT(1)    ALT_Frq    MAF    AvgCall    Rsq    Genotyped    LooRsq    EmpR    EmpRsq    Dose0    Dose1
1:10177    A    AC    0.43686    0.43686    0.58041    0.01965    Imputed    -    -    -    -    -
1:10235    T    TA    0.00106    0.00106    0.99894    0.00005    Imputed    -    -    -    -    -
1:10352    T    TA    0.44435    0.44435    0.57366    0.01838    Imputed    -    -    -    -    -

Can I use this file to run an analysis with ProbABEL? If so, how should it be modified?

You need to alter that file slightly to make it compatible with ProbABEL's info file format (see the ProbABEL manual for the exact specification). The reformatting can be done with a tool like gawk. For example, to select only the first five columns and the seventh one, use:

Code:

gawk '{print $1, $2, $3, $4, $5, $7}' chr1.info > chr1_probabel.info


sanjana_chop wrote:3) The dosage file is in VCF format and looks nothing like the MLDOSE file that is required.

How should I proceed?

This is indeed more intricate than the conversion of the info files. In principle you should extract the dosage data (the DS information in the VCF file) for each SNP and individual. The trouble here is that the ProbABEL dosage formats require the individuals as rows and variants as columns, whereas the VCF file is ordered in a 'transposed' way. I can think of several ways to accomplish this, but they all require some scripting (e.g. in Bash, Perl or maybe R), so if you're not somewhat experienced in that I suggest you contact your local bioinformatician.
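
To give an idea of what such a script could look like, here is a minimal awk sketch that pulls the DS value for every variant and individual out of an uncompressed VCF file and prints one row per individual. Several things here are assumptions rather than a tested recipe: the function name `vcf_to_dose` is made up, the row prefix `N->ID MLDOSE` is my reading of the MLDOSE layout (check the ProbABEL manual), and the whole matrix is held in memory, so a full chromosome will probably have to be processed in chunks. Also keep in mind that DS counts the ALT allele, so the allele coding in the info file has to agree with that.

```shell
# Hypothetical helper: transpose the DS field of a VCF into one row per individual.
# Usage: vcf_to_dose chr1.dose.vcf > chr1.dose.txt
vcf_to_dose() {
    awk -F'\t' '
        /^##/ { next }                          # skip meta-information lines
        /^#CHROM/ {                             # header line: sample IDs start at column 10
            nsamp = NF - 9
            for (i = 10; i <= NF; i++) id[i - 9] = $i
            next
        }
        {
            nvar++
            # find the position of DS within the FORMAT column (e.g. GT:DS:GP)
            nfmt = split($9, fmt, ":")
            ds = 0
            for (k = 1; k <= nfmt; k++) if (fmt[k] == "DS") ds = k
            if (ds == 0) {
                print "no DS field at " $1 ":" $2 > "/dev/stderr"
                err = 1
                exit 1
            }
            # store the dosage of every individual for this variant
            for (i = 10; i <= NF; i++) {
                split($i, val, ":")
                dose[i - 9, nvar] = val[ds]
            }
        }
        END {
            if (err) exit 1
            # one output row per individual, one column per variant
            for (s = 1; s <= nsamp; s++) {
                line = s "->" id[s] " MLDOSE"
                for (v = 1; v <= nvar; v++) line = line " " dose[s, v]
                print line
            }
        }
    ' "$1"
}
```

The variants come out in the same order as in the VCF file, so the rows of the converted info file should line up with the dosage columns as long as both come from the same chromosome files.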


sanjana_chop wrote:4) How else can I analyze the data? Is there a different program that would suit my data?

There are several (e.g. SNPTest, EMMAX), but unfortunately I don't have experience with them and I'm not sure whether they support the data from the Michigan server without reformatting.
-------
Lennart Karssen
PolyOmica
The Netherlands
-------