Idea #10122: Lightning tile arrays for Sarah - Lightning - Arvados

Lightning tile arrays for Sarah

Added by Abram Connelly over 9 years ago. Updated almost 9 years ago.

Status: Closed
Priority: Normal
Assigned To:
Target version:
Start date: 09/23/2016
Due date:
Story points:

Description

Sarah would like to recreate Sally's results and wants the newest version of the encoded genomes for PCA in Python.

The format should be:

a text file
two columns for each dataset (genomes), one per allele
rows are tile positions (around 10M)
each entry should be:
- one-hot for high quality, non-anchor spanning tiles as an ascii encoded binary string
- 0 for high quality non-anchor spanning tiles
- NaN for low quality tiles

Sarah wants two matrices. The first will have all information for all 10M positions and all the genomes we have (680+). This should be about 60 GB of uncompressed ASCII. The second is the same matrix but restricted to high quality information.

Files

test-matrix.txt.bz2 (20.5 MB), Abram Connelly, 10/18/2016 09:15 AM
data-sample.tar.gz (253 KB), Abram Connelly, 12/14/2016 11:51 PM

Updated by Abram Connelly over 9 years ago

Status changed from New to In Progress

Updated by Abram Connelly over 9 years ago

After talking to Sarah some more, it looks like it will be easier to produce a matrix that has callsets on one axis and tile position x allele on the other.

For example:

person0: [ [0,0], [1,0], [1,5] ]
person1: [ [0,0], [2,3], [1,5] ]

callset \ tile-position-allele | pos0-allele0 | pos1-allele0 | pos2-allele0 | pos0-allele1 | pos1-allele1 | pos2-allele1 |
                               |-----------------------------------------------------------------------------------------|
                      person0  |      0       |      1       |      1       |      0       |      0       |      5       |
                               |-----------------------------------------------------------------------------------------|
                      person1  |      0       |      2       |      1       |      0       |      3       |      5       |
                               \-----------------------------------------------------------------------------------------/

No-calls should be filled in with an indicator value of -1 and non-anchor spanning tiles should be filled in with an indicator value of -2.
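
The reordering shown in the table above can be sketched in a few lines of Python. This is only a minimal illustration of the layout, not the code used to build the real matrix; the helper name `flatten_allele_major` is hypothetical.

```python
def flatten_allele_major(tile_pairs):
    """Flatten [[allele0, allele1], ...] (one pair per tile position) into
    [pos0-allele0, pos1-allele0, ..., pos0-allele1, pos1-allele1, ...]."""
    return [pair[allele] for allele in (0, 1) for pair in tile_pairs]


# The two example callsets from the comment above.
person0 = [[0, 0], [1, 0], [1, 5]]
person1 = [[0, 0], [2, 3], [1, 5]]

# One row per callset, one column per (allele, tile position).
matrix = [flatten_allele_major(p) for p in (person0, person1)]
for name, row in zip(("person0", "person1"), matrix):
    print(name, row)
```

The rows produced this way match the two rows of the table.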

Updated by Abram Connelly over 9 years ago

File test-matrix.txt.bz2 added

Sent a sample tile array to Sarah for review. The sample tile array has 6 rows, one dataset per row. The first column is the dataset name with the subsequent columns being the tile variant id for the tile positions of the first allele followed by the variant ids for each tile position of the second allele.
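
As a rough sketch of how a row in this layout could be read back, the snippet below splits one line into the dataset name and the two per-allele vectors. It assumes whitespace-separated fields, which is a guess about the sample file rather than something confirmed here; `parse_tile_row` and the example line are hypothetical.

```python
def parse_tile_row(line):
    """Split one row into (dataset_name, allele0_variants, allele1_variants).

    Assumes whitespace-separated fields (an assumption about the file format):
    the name first, then all first-allele variant ids, then all second-allele ids.
    """
    fields = line.split()
    name = fields[0]
    values = [int(v) for v in fields[1:]]
    if len(values) % 2 != 0:
        raise ValueError("expected the same number of tile positions per allele")
    half = len(values) // 2
    return name, values[:half], values[half:]


# Made-up row with three tile positions per allele.
name, allele0, allele1 = parse_tile_row("example-dataset 0 1 1 0 0 5")
print(name, allele0, allele1)
```

For the attached file, the rows could be streamed one at a time with the standard library's `bz2.open("test-matrix.txt.bz2", "rt")` instead of decompressing the whole file first.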

There is one dataset from the 1000 Genomes Project, the hg19 dataset, and 4 from the Harvard PGP.

The sample file has been attached.

Assuming the sample file looks good, I'll create a large file with all datasets we have.

Updated by Abram Connelly over 9 years ago

From a conversation with Sarah, there have been some modifications to the desired format:

Tile positions should be interleaved
The one-hot encoding should be a 'sub-vector' within the matrix
The one-hot encoding has the least significant bit (tile variant 0) on the left

For the one-hot encoding, an example is:

tile pos    0   1                   0                1
allele     a b a b            a           b        a    b
   \
dataset0 | 0 5 2 0 |  -> | 1 0 0 0  0 0 0 0 0 1  0 0 1  1 |
dataset1 | 3 1 0 0 |  -> | 0 0 0 1  0 1 0 0 0 0  1 0 0  1 |

The extra spaces are for clarity. The left-hand side of the arrows represents an input matrix of 2 rows by 4 columns, and the matrix on the right is 2 rows by 14 columns. The number of columns used for each embedded one-hot encoded vector is variable, depending on the maximum tile variant for that position.
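
A minimal numpy sketch of this flattening, reproducing the 2 by 4 to 2 by 14 example, might look like the following. It assumes every value is a non-negative tile variant (no no-call or spanning indicators), and `one_hot_flatten` is a hypothetical name, not the code used to generate the real arrays.

```python
import numpy as np


def one_hot_flatten(tile_matrix):
    """One-hot encode each column of an integer tile matrix and concatenate.

    Column j expands to max(column j) + 1 columns, with variant 0 leftmost.
    Assumes all values are >= 0 (no indicator values for no-calls etc.).
    """
    tile_matrix = np.asarray(tile_matrix)
    widths = tile_matrix.max(axis=0) + 1
    # First output column belonging to each input column.
    offsets = np.concatenate(([0], np.cumsum(widths)[:-1]))
    out = np.zeros((tile_matrix.shape[0], int(widths.sum())), dtype=np.uint8)
    rows = np.arange(tile_matrix.shape[0])[:, None]
    out[rows, offsets + tile_matrix] = 1
    return out


# Interleaved columns: pos0-a, pos0-b, pos1-a, pos1-b
tiles = [[0, 5, 2, 0],
         [3, 1, 0, 0]]
encoded = one_hot_flatten(tiles)
print(encoded)
```

Computing the widths per column is what makes the sub-vectors variable length: here they come out as 4, 6, 3 and 1 columns.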

Since about 30% of the data for each encoded genome is low quality, roughly 30% of each genome will run the full range of tile variant ids.

There are about 700 genomes, each with 10M tile positions and 2 alleles. As a rough count, we can partition the data into canonical tiles and low quality tiles, with a canonical tile taking 1 byte and a low quality tile needing around 700 characters for its one-hot encoding (stored as ASCII). With this in mind, we get an estimate of 700 * 10,000,000 * 2 * (0.7 * 1 + 0.3 * 700) bytes, or about 2.95 * 10^12 bytes (roughly 2.7 TiB) of uncompressed data.
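
The back-of-the-envelope estimate above can be reproduced with a few lines of Python. The variable names are only for illustration; the numbers are the rough figures quoted in this comment.

```python
genomes = 700
tile_positions = 10_000_000
alleles = 2

# Rough split from the comment: 70% canonical tiles at 1 byte each,
# 30% low quality tiles at about 700 ASCII characters each.
canonical_fraction, canonical_bytes = 0.7, 1
low_quality_fraction, low_quality_bytes = 0.3, 700

bytes_per_entry = (canonical_fraction * canonical_bytes
                   + low_quality_fraction * low_quality_bytes)
total_bytes = genomes * tile_positions * alleles * bytes_per_entry

print(f"{total_bytes:.3e} bytes, {total_bytes / 2**40:.2f} TiB")
```

Almost all of the total comes from the low quality term, which is why the sparse and binary representations discussed in this thread matter so much.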

Sarah points out that there's a lot of efficiency to be gained because the matrices are sparse, on top of whatever gains binary storage provides.
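
To make the two savings concrete, here is a small numpy sketch on a toy one-hot matrix. It is only an illustration of the idea, not the storage format that was eventually used; it assumes every row has exactly one set column per tile position and allele.

```python
import numpy as np

# Toy dense one-hot matrix (2 datasets by 14 one-hot columns).
dense = np.array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1]], dtype=np.uint8)

# Binary storage: 8 one-hot columns per byte instead of 1 ASCII character each.
packed = np.packbits(dense, axis=1)

# Sparse storage: keep only the column index of each 1.
# Assumes every row has the same number of ones.
ones_per_row = int(dense[0].sum())
hot_columns = np.nonzero(dense)[1].reshape(dense.shape[0], ones_per_row)

# Both forms round-trip back to the dense matrix.
unpacked = np.unpackbits(packed, axis=1)[:, :dense.shape[1]]
rebuilt = np.zeros_like(dense)
rebuilt[np.arange(dense.shape[0])[:, None], hot_columns] = 1

print(packed.shape, hot_columns.tolist())
```

The sparse form stores one integer per tile position and allele regardless of how many variants that position has, so its size does not grow with the width of the one-hot sub-vectors.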

Updated by Abram Connelly over 9 years ago

File data-sample.tar.gz added

Attached is a file containing the newly updated 'flattened' 1-hot numpy arrays. The attached 'data-sample.tar.gz' decompresses into a 'data' directory with the following files:

032 : The "integer" tile vector array (index as tile position with values of the tile variant or -2 if it's a nocall or -1 if it's a non-anchor spanning tile)
032-1hot : The "flattened" 1-hot representation as per our discussion
032-1hot-info: A float array that holds the tile position for each index in the '032-1hot' array. The fractional part is '0' for the first allele and '0.5' for the second.
35e : The "integer" tile vector array (index as tile position with values of the tile variant or -2 if it's a nocall or -1 if it's a non-anchor spanning tile)
35e-1hot : The "flattened" 1-hot representation as per our discussion
35e-1hot-info: A float array that holds the tile position for each index in the '35e-1hot' array. The fractional part is '0' for the first allele and '0.5' for the second.
names : The names of datasets each row in the above datasets was derived from
l7g-tile.npz : An ".npz" file containing all the above files

The format of each dataset should be clear after poking around with it but as an example:

>>> import numpy as np
>>> z = np.load('data/l7g-tile.npz')
>>> print(z['names'][3], z['35e'][3][17*2:18*2], z['35e-1hot'][3][1178:1182], z['35e-1hot-info'][1178:1182])
b'hu34D5B9-GS01670-DNA_E02' [0 0] [ 1.  0.  1.  0.] [ 17.   17.   17.5  17.5]

This shows that dataset 'hu34D5B9-GS01670-DNA_E02' has a row index of 3. The tile variant at tile position 17 (0-indexed) is the default tile '0' for both alleles. Each allele is encoded as '[1, 0]' in the 1-hot vector, and the '-info' numpy array gives the tile position and allele for each of those columns.
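
Going the other way, the '-info' array can be used to find the 1-hot columns for a given tile position and decode the variant. The sketch below uses small synthetic arrays as stand-ins for `z['35e-1hot-info']` and one row of `z['35e-1hot']`. It assumes the columns for each position and allele are ordered with variant 0 first, and it returns None when no column is set, since how no-calls appear in the 1-hot arrays is not spelled out here. `decode_tile` is a hypothetical helper.

```python
import numpy as np

# Synthetic stand-ins for z['35e-1hot-info'] and one row of z['35e-1hot'].
info = np.array([0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.5, 1.5])
onehot_row = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0])


def decode_tile(onehot_row, info, tile_pos):
    """Return [allele0_variant, allele1_variant] at tile_pos.

    An entry is None if no 1-hot column is set for that allele.
    """
    variants = []
    for allele_offset in (0.0, 0.5):
        # Columns whose info value matches this tile position and allele.
        columns = np.flatnonzero(info == tile_pos + allele_offset)
        hot = np.flatnonzero(onehot_row[columns])
        variants.append(int(hot[0]) if hot.size else None)
    return variants


print(decode_tile(onehot_row, info, 0))
print(decode_tile(onehot_row, info, 1))
```

With the real data, the same function would be called as `decode_tile(z['35e-1hot'][3], z['35e-1hot-info'], 17)`.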

Updated by Abram Connelly almost 9 years ago

Status changed from In Progress to Closed

This issue has been resolved and Sarah should have the 1hot lightning tile arrays that she needs.
