Idea #10101
closed
Revise CGF to separate low quality tiles from high quality tiles better
Description
The current specification for CGF interleaves the low quality tiles with the high quality tiles in the synopsis bits and hexit region for the vector data. This has the potential to slow down high quality queries as low quality tiles "push out" high quality tiles in the hexit region.
Instead, we should separate the low quality information from the high quality information as much as possible. We should create an auxiliary structure just for low quality tiles. One option is to provide separate Overflow, FinalOverflowMap and FinalOverflowMapOpt structures just for low quality tiles. Lookups on high quality tiles would be done the same way as they are now, using the low quality tile bit vector to skip over low quality tiles.
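To illustrate the "skip over low quality tiles" lookup, here is a minimal sketch. It assumes one bit per tile position (1 = low quality) and a densely packed list holding one entry per high quality tile. The names `loq_bits`, `hiq_variants` and `hiq_index` are made up for this example and are not taken from the CGF specification.

```python
def hiq_index(loq_bits, pos):
    """Return the index of tile `pos` among the high quality tiles,
    or None if the tile at `pos` is flagged low quality."""
    if loq_bits[pos]:
        return None
    # Every low quality tile before `pos` is skipped, so the dense index
    # is the position minus the number of set bits in front of it.
    return pos - sum(loq_bits[:pos])


# Hypothetical data: six tile positions, two of them low quality.
loq_bits = [0, 1, 0, 0, 1, 0]
# One entry per high quality tile, in position order (positions 0, 2, 3, 5).
hiq_variants = [3, 0, 7, 2]


def lookup(pos):
    """Return the high quality tile variant at `pos`, or None if low quality."""
    i = hiq_index(loq_bits, pos)
    return None if i is None else hiq_variants[i]
```

The linear `sum` is only there for clarity. A real implementation would presumably use a rank structure or precomputed counts so that the skip costs constant time.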
Updated by Tom Morris almost 9 years ago
I apparently don't have the privileges to assign this or otherwise modify the metadata, but it sounds similar to stuff in CGF "v3"
Is this in progress? Done? Who's responsible for it?
Updated by Abram Connelly almost 9 years ago
- Status changed from New to Closed
CGFv3 resolves this. The specification can be seen in the CGFv3-Schema.md document.
This new scheme has the following major features:
- Low quality tiles are completely separated from the high quality ones. This required an extra bit vector field to indicate whether any given tile is high quality or low quality, which adds complexity when determining the tile variant for a given position but reduces the number of tile variant cache overflows.
- Overflow structures are now fixed width; previously they were variable width.
- sdsl-lite structures are used to save the low quality information. The choice between sdsl::enc_vector (Elias Delta Coding) and sdsl::vlc_vector (Variable Length Code) was made empirically to reduce size.
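As a rough illustration of why fixed width overflow records help lookup speed: record `i` always starts at byte `i * width`, so a sorted table can be binary searched without decoding anything in front of it. The sketch below assumes a hypothetical record layout (a 32-bit tile position followed by two 16-bit variant ids, one per allele). This layout is an assumption for the example, not the layout in CGFv3-Schema.md.

```python
import struct

# Hypothetical fixed width record: tile position, allele A variant, allele B variant.
RECORD = struct.Struct("<IHH")  # 4 + 2 + 2 = 8 bytes per record


def pack_overflow(entries):
    """Pack (position, allele_a, allele_b) tuples into one buffer, sorted by position."""
    return b"".join(RECORD.pack(*e) for e in sorted(entries))


def lookup_overflow(buf, pos):
    """Binary search the fixed width records for `pos`.
    Returns (allele_a, allele_b), or None if `pos` has no overflow entry."""
    count = len(buf) // RECORD.size
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        # Record `mid` starts at a computable offset because the width is fixed.
        if RECORD.unpack_from(buf, mid * RECORD.size)[0] < pos:
            lo = mid + 1
        else:
            hi = mid
    if lo < count:
        p, a, b = RECORD.unpack_from(buf, lo * RECORD.size)
        if p == pos:
            return (a, b)
    return None


buf = pack_overflow([(40, 5, 5), (7, 300, 2), (1000, 9, 12)])
```

With variable width records the same search would need either a linear scan or a separate offset index, which is the trade-off the fixed width choice avoids at the cost of some space.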
I would guess that the "high quality" tile data should be stored in around 8-10Mb, depending on how many cache overflows there are. This is ~3-4x larger than our guessed theoretical minimum of ~2.5Mb. The extra size comes from the extra Loq structure (adding ~1.2Mb), the Span structure storing whether a tile is a spanning tile (another ~1.2Mb) and the Overflow structures, which are fixed width.
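The arithmetic behind these estimates can be checked with a few lines. The sketch assumes that the high quality total is roughly the theoretical minimum plus the Loq, Span and Overflow structures, which is a simplification of the breakdown above. All figures are the rough guesses from this comment, not measurements.

```python
# Rough figures from the comment above, in the same units (Mb) it uses.
theoretical_min = 2.5
loq = 1.2
span = 1.2
estimate_low, estimate_high = 8.0, 10.0


def overflow_budget(total):
    """Space left for the Overflow structures under the assumed breakdown."""
    return total - (theoretical_min + loq + span)


# Ratio of the estimate to the theoretical minimum (the "~3-4x" figure).
ratio_low = estimate_low / theoretical_min
ratio_high = estimate_high / theoretical_min

# Implied size of the Overflow structures at each end of the estimate.
budget_low = round(overflow_budget(estimate_low), 1)
budget_high = round(overflow_budget(estimate_high), 1)
```

Under that assumption the ratios come out at 3.2 and 4.0, and the Overflow structures would account for roughly 3.1 to 5.1 of the total.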
The total size of the CGF file, including both the low quality and the high quality data, should be in the range of ~30Mb-40Mb. Some tests suggest that the CGFv3 files are slightly larger than the old CGF files. In some small tests, lookup was ~100x faster, and the CGF data layout and decoding are simpler.
In the future, the Loq and Span structures could be compressed to save space. The Overflow structure could also be compressed in some way, either by partitioning heterozygous tiles from homozygous tiles so that the homozygous tiles use only half the space, or by some better compression scheme. Care has to be taken, as compression adds overhead and can quickly destroy the benefits of quick lookup.
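A minimal sketch of the heterozygous/homozygous partitioning idea, assuming each overflow entry carries one variant id per allele and that a variant id takes 2 bytes. Both the entry shape and the width are assumptions for illustration. A homozygous entry has the same variant on both alleles, so only one id needs to be stored.

```python
VARIANT_BYTES = 2  # assumed width of one tile variant id


def partition(entries):
    """Split (position, allele_a, allele_b) entries into homozygous and
    heterozygous tables. Homozygous entries keep a single variant id."""
    hom = [(pos, a) for pos, a, b in entries if a == b]
    het = [(pos, a, b) for pos, a, b in entries if a != b]
    return hom, het


def payload_bytes(entries):
    """Return (unpartitioned, partitioned) variant payload sizes in bytes.
    Position keys are ignored; they cost the same in both layouts."""
    hom, het = partition(entries)
    unpartitioned = 2 * VARIANT_BYTES * len(entries)
    partitioned = VARIANT_BYTES * len(hom) + 2 * VARIANT_BYTES * len(het)
    return unpartitioned, partitioned


# Hypothetical overflow entries: three homozygous, one heterozygous.
entries = [(7, 300, 2), (40, 5, 5), (1000, 9, 9), (1200, 4, 4)]
hom, het = partition(entries)
sizes = payload_bytes(entries)
```

The saving depends entirely on the share of homozygous entries, and a lookup now has to consult two tables instead of one, which is the kind of overhead the caution above is about.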
Further reference: