Story #7427
closed
CGF of 1kg and PGP for paths 2c5 and 247 (cont.)
0%
Updated by Sarah Guthrie over 9 years ago
- Project changed from 39 to Lightning
- Target version changed from 111 to 2015-10-23 Lightning sprint
Updated by Abram Connelly over 9 years ago
- Status changed from New to In Progress
Updates to CGF schema:
https://github.com/abeconnelly/scratch/blob/3652000012953abfdefae94afb3e82de66a69004/Documentation/CGF-Schema.md
Some preliminary estimates put the nocall information at about 17Kb for a single path for a single sample. This scales to about 17Mb for a whole genome. Gzipping the resulting binary file pushes it down to 11Kb (which would imply about 11Mb whole genome).
I think a lesson should be taken from the CGLF and I should use this representation and move on. We can optimize the nocall information at a future date. For now the resulting nocall binary structure can be gzipped and unpacked when needed. A Code field has been added to future-proof the LowQualityInfo structure. For now a Code of 0 represents the structure presented. Maybe we can put in a Code of 1 to represent the gzipped portion.
It's still an estimate so the nocall information might need more like ~20Mb all told.
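The Code idea above could be sketched roughly as follows. This is only an illustration, not the CGF layout: the one-byte width of Code, its position at the front of the blob, the value 1 for "gzipped", and the function names are all assumptions made for the example.

```python
import gzip

CODE_RAW = 0   # payload stored as-is (the structure presented in the schema)
CODE_GZIP = 1  # hypothetical value: payload stored gzip-compressed


def pack_low_quality_info(payload: bytes, compress: bool = False) -> bytes:
    """Prefix the payload with a one-byte Code (assumed layout)."""
    if compress:
        return bytes([CODE_GZIP]) + gzip.compress(payload)
    return bytes([CODE_RAW]) + payload


def unpack_low_quality_info(blob: bytes) -> bytes:
    """Return the raw payload, decompressing it if the Code says so."""
    if not blob:
        raise ValueError("empty LowQualityInfo blob")
    code, body = blob[0], blob[1:]
    if code == CODE_RAW:
        return body
    if code == CODE_GZIP:
        return gzip.decompress(body)
    raise ValueError(f"unknown Code {code}")
```

A reader that only knows Code 0 would fail loudly on Code 1 instead of misreading compressed bytes, which is the point of reserving the field now.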
Updated by Abram Connelly over 9 years ago
Packed representation is progressing. Currently there's a final nocall entry that's being missed at the end but otherwise it looks good. The current snapshot is:
https://github.com/abeconnelly/cgf/tree/8a1782ad43ab55ec798911bc01494438d80db218
Current schema is:
I don't think there are any structural changes. The schema (data structure section) has been clarified to indicate exactly what the offset and position arrays hold. The Offset array holds the byte offset of the Stride*k low quality entry, starting from LoqInfo[0]. The StepPosition array holds the Stride*k tile position entry for the LoqInfo record. This means the Vector and potentially Overflow (and FinalOverflowMap) structures above will need to be consulted to reconstruct the tile position that the low quality information is for.
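To make the role of the Offset and StepPosition arrays concrete, here is a minimal sketch of a strided index over variable-length records. It is not the CGF encoding: the length-prefixed record format, the function names, and the assumption that each record's tile position is already available in a sorted list `steps` (in CGF it would have to be reconstructed from Vector and Overflow) are all made up for the example.

```python
from bisect import bisect_right


def build_index(records, steps, stride):
    """Serialize records and index every stride-th one.

    records: list of bytes payloads (each under 256 bytes in this toy format)
    steps:   sorted tile positions, one per record
    """
    blob = bytearray()
    offset, step_position = [], []
    for i, (payload, step) in enumerate(zip(records, steps)):
        if i % stride == 0:
            offset.append(len(blob))       # byte offset of record stride*k
            step_position.append(step)     # tile position of record stride*k
        blob.append(len(payload))          # hypothetical 1-byte length prefix
        blob += payload
    return bytes(blob), offset, step_position


def find_record(blob, offset, step_position, steps, stride, target_step):
    """Return the payload for target_step, or None if it has no record."""
    k = bisect_right(step_position, target_step) - 1
    if k < 0:
        return None
    pos = offset[k]                        # jump straight to record stride*k
    first = k * stride
    for i in range(first, min(first + stride, len(steps))):
        n = blob[pos]
        if steps[i] == target_step:
            return blob[pos + 1 : pos + 1 + n]
        pos += 1 + n                       # skip to the next record
    return None
```

With this arrangement a lookup costs one binary search plus a scan of at most Stride records, instead of a scan from LoqInfo[0].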
The HetHomFlag (to be renamed HomFlag) is set to true if the record is homozygous. The array is a bit vector where a 1 indicates that the corresponding entry in the LoqInfo structure is homozygous. Note that "homozygous" here refers to the low quality entries and does not refer to the tiles or their variants. This means that there could be a heterozygous tile pair/sequence/group that still has "homozygous" low quality information. If the low quality information is homozygous, all this means is that the low quality information for that tile position is the same on both alleles. Bits are read LSB first, so that in byte k, bit 0 represents low quality record 8*k, bit 1 represents low quality record (8*k)+1, etc.
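The LSB-first bit ordering described above can be illustrated with a short sketch. The function names are hypothetical and the flags are held in a plain Python `bytes` object rather than the actual CGF structure.

```python
def pack_hom_flags(flags):
    """Pack a list of booleans into bytes, LSB first within each byte."""
    out = bytearray((len(flags) + 7) // 8)
    for i, is_homozygous in enumerate(flags):
        if is_homozygous:
            out[i // 8] |= 1 << (i % 8)    # record 8*k + j lives in byte k, bit j
    return bytes(out)


def is_hom(flag_bytes, i):
    """Return True if low quality record i is flagged homozygous."""
    return bool((flag_bytes[i // 8] >> (i % 8)) & 1)
```

For example, records 0 and 3 being homozygous sets bits 0 and 3 of the first byte, giving 0x09.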
Updated by Abram Connelly over 9 years ago
Trailing no-calls taken care of.
Initial version of CGF (for two paths on a single sample) has been created.
In the process of creating a CGF reader that will read the binary CGF and confirm the bytes written are what we expect.
https://github.com/abeconnelly/cgf/tree/f0614796e0f2ce94860cc3f459b2f0f42cce3bde
Updated by Sarah Guthrie about 9 years ago
- Target version changed from 2015-10-23 Lightning sprint to 2015-11-13 Lightning sprint
Updated by Abram Connelly about 9 years ago
- Status changed from In Progress to Closed
Around 600 samples for the two paths, 0x247 and 0x2c5, have had their CGF created. The CGF has been verified to produce FastJ that matches the input FastJ.
Current source for the cgf program is at: https://github.com/abeconnelly/cgf/tree/02fa80d2665bec3a1da54a230cc4f9cba5fd444a
Updated by Abram Connelly over 8 years ago
The current lightning prototype has both of these tile paths:
- Underlying data: https://workbench.su92l.arvadosapi.com/projects/su92l-j7d0g-2hk0kr9bayye8n0#Data_collections
- Specification and documentation: https://github.com/abeconnelly/l7g
- Proxy server: https://github.com/abeconnelly/lci
- Tile Server: https://github.com/abeconnelly/glfd
- CGF tools and server: https://github.com/abeconnelly/cgf
- Phenotype server (untap): https://github.com/abeconnelly/l7g-p7e-untap
- Variant server (ClinVar): https://github.com/abeconnelly/l7g-v5t-clinvar
- Prototype Docker image: https://hub.docker.com/r/abeconnelly/lightning/