Error in loading large data set

Questions about GenABEL (aka *ABEL) suite of packages
Forum rules
Please remember not to post any sensitive data on this public forum.
The first few posts of newly registered users will be moderated in order to filter out any spammers.

When you get a solution to the problem you posted, please change the topic name (e.g. from "how to ..." to "[SOLVED] how to ..."). This will make it easier for the community to follow the posts that still need attention.
zhangge
Posts: 4
Joined: Wed Nov 13, 2013 5:38 pm

Error in loading large data set

Postby zhangge » Wed Nov 13, 2013 5:48 pm

I encounter an error when loading a large GWAA data set with imputed genotype data for 1,500 samples at ~7 million SNPs.

R code and output

Code:

data <- load.gwaa.data(phe = "pheno.tab", gen = "gdata.raw", force=T)
ids loaded...
marker names loaded...
chromosome data loaded...
map data loaded...
allele coding data loaded...
strand data loaded...
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  too many items


My platform:
* Fedora 18 (3.11.4-101.fc18.x86_64)
* R version 3.0.2 (2013-09-25) -- "Frisbee Sailing", Platform: x86_64-redhat-linux-gnu (64-bit)
* GenABEL v. 1.7-6 (May 16, 2013)

lckarssen
Site Admin
Posts: 322
Joined: Tue Jan 04, 2011 3:04 pm
Location: Utrecht, The Netherlands

Re: Error in loading large data set

Postby lckarssen » Wed Nov 13, 2013 11:34 pm

Maybe this is caused by memory shortage. How much memory (RAM) does your system have?
-------
Lennart Karssen
PolyOmica
The Netherlands
-------

zhangge
Posts: 4
Joined: Wed Nov 13, 2013 5:38 pm

Re: Error in loading large data set

Postby zhangge » Wed Nov 13, 2013 11:59 pm

The machine has "huge" memory -- 128GB.

The load.gwaa.data function works fine if I split the genotype data set into two, each with ~4M SNPs.

I also tested the data set on a different machine:
* Fedora 12 (2.6.31.5-127.fc12.x86_64)
* R version 2.11.1
* GenABEL v. 1.6-5 (February 07, 2011)

The error message is a little bit different:

Code:

data <- load.gwaa.data(phe = "pheno_master.tab", gen = "gdata.raw", force=T)
ids loaded...
marker names loaded...
chromosome data loaded...
map data loaded...
allele coding data loaded...
strand data loaded...
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  negative length vectors are not allowed

zhangge
Posts: 4
Joined: Wed Nov 13, 2013 5:38 pm

Re: Error in loading large data set

Postby zhangge » Thu Nov 14, 2013 4:41 pm

It seems to me this problem is caused by the upper limit on vector length (2^31-1) in R/GenABEL. This means the current GenABEL can only handle (2^31-1)*4 genotypes, corresponding to slightly less than 4.3 million SNPs in 2,000 samples. This is a real limitation, given that many GWA studies will exceed this size, especially when doing imputation. The development team should consider fixing this problem, since most machines run 64-bit systems and R has supported long vectors since version 3.0.
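A quick sanity check of this arithmetic (a sketch only; it assumes the 4-genotypes-per-byte packing and the sample counts mentioned in this thread):

```r
# Before long-vector support, one R vector held at most 2^31 - 1 elements;
# GenABEL packs 4 genotypes into each raw byte of its gtps slot.
max_vec_len   <- 2^31 - 1         # maximum raw bytes in one vector
max_genotypes <- max_vec_len * 4  # ~8.6e9 genotypes in one packed vector

# 7 million SNPs x 1,500 samples overflows the packed vector ...
7e6 * 1500 > max_genotypes        # TRUE
# ... while each ~4M-SNP half fits:
4e6 * 1500 > max_genotypes        # FALSE
# and with 2,000 samples the cap works out to just under 4.3M SNPs:
max_genotypes / 2000              # ~4.29 million SNPs
```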

lckarssen
Site Admin
Posts: 322
Joined: Tue Jan 04, 2011 3:04 pm
Location: Utrecht, The Netherlands

Re: Error in loading large data set

Postby lckarssen » Sun Nov 17, 2013 11:34 pm

128GB RAM should be sufficient indeed.

You are right about the vector size in R. However, since, as you say, R has supported larger vectors since v3.0, GenABEL should also benefit from the larger vectors. I don't think we intentionally limit the number of items in a vector (although I haven't actually checked the GenABEL source code).

The error message you reported is sent by R's scan() function. Maybe something is wrong in that function...
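For what it's worth, both error messages ("too many items" and "negative length vectors are not allowed") are consistent with a length being computed in 32-bit integer arithmetic; a minimal illustration of that boundary in R (not a diagnosis of scan() itself):

```r
# R's integer type is 32-bit even on 64-bit platforms:
.Machine$integer.max                 # 2147483647, i.e. 2^31 - 1

# A length computed past this boundary cannot be represented as an
# integer; at the R level it becomes NA (in C code it can wrap around
# to a negative value, which would explain the "negative length" message):
suppressWarnings(as.integer(2^31))   # NA
```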

Could you please file a bug report at our R-forge page: https://r-forge.r-project.org/tracker/? ... unc=browse That would help us address the problem.
-------
Lennart Karssen
PolyOmica
The Netherlands
-------

zhangge
Posts: 4
Joined: Wed Nov 13, 2013 5:38 pm

Re: Error in loading large data set

Postby zhangge » Tue Nov 19, 2013 9:01 pm

It is not just a problem with R's scan() function. Even after I managed to load the large data set with the following "split and merge" procedure, GenABEL cannot handle a data set with more than (2^31-1)*4 genotypes. A lot of internal functions are restricted by the 2^31-1 limit.

I have filed the bug report.

Code:

# Genotype slots of the two partial data sets
#============
gt1 <- data1@gtdata
gt2 <- data2@gtdata

# Merge genotype data
#============
# Sample-level slots are identical in both halves, so take them from gt1
nbytes  <- gt1@nbytes
nids    <- gt1@nids
male    <- gt1@male
idnames <- gt1@idnames

# Concatenate the SNP-level slots
nsnps      <- gt1@nsnps + gt2@nsnps
snpnames   <- c(gt1@snpnames, gt2@snpnames)
chromosome <- factor(c(gt1@chromosome, gt2@chromosome))
coding     <- new("snp.coding", as.raw(c(gt1@coding, gt2@coding)))
strand     <- new("snp.strand", as.raw(c(gt1@strand, gt2@strand)))
map        <- c(gt1@map, gt2@map)

# Combine the packed genotype bytes and rebuild the objects
gtps <- cbind(gt1@gtps, gt2@gtps)
gt   <- snp.data(nids, gtps, idnames = idnames, snpnames = snpnames,
                 chromosome = chromosome, map = map, coding = coding,
                 strand = strand, male = male)
data <- new("gwaa.data", phdata = data1@phdata, gtdata = gt)

lckarssen
Site Admin
Posts: 322
Joined: Tue Jan 04, 2011 3:04 pm
Location: Utrecht, The Netherlands

Re: Error in loading large data set

Postby lckarssen » Wed Nov 20, 2013 11:00 am

Thanks for filing the bug report in the tracker (#5118). I'm still confused about why this doesn't seem to grow with the new maximum vector size in R >= 3.0.
We're looking into it!

About the example code that you posted: can you post the error that you get as well?
-------
Lennart Karssen
PolyOmica
The Netherlands
-------

