Story #11672
Closed
Add "band" functionality to fjt tool
Description
Extend the fjt tool to output "band format".
We've settled on an "intermediate" format for describing tiled genomes, which is this "band format". It consists of 2 or 4 vectors of numbers. The first pair of vectors represents the tile variant IDs, with negative numbers indicating a non-trivial tile. The second pair of vectors represents the "low quality" (no-call) information: each tile position holds an array with an even number of entries, where even-indexed entries give the start of a no-call, measured from the start of the tile, and odd-indexed entries give the length of that no-call.
The variant tile values can be summarized as:
>=0
: Tile variant with the appropriate value (0 being the "default" or canonical tile)
-1
: Indicates a non-anchor spanning tile
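As a rough illustration of the summary above, the following Python sketch labels a single entry of a tile variant vector. The function name `describe_variant` is hypothetical and not part of fjt or cgft. Only -1 is defined among the negative values in this description, so the sketch treats any other negative value as an error; that choice is an assumption.

```python
def describe_variant(value):
    """Return a short label for one entry of a tile variant vector."""
    if value == 0:
        return "canonical tile"
    if value > 0:
        return "tile variant {}".format(value)
    if value == -1:
        return "non-anchor spanning tile"
    # Assumption: the description only defines -1 among negative values.
    raise ValueError("undefined band value: {}".format(value))
```

For example, `[describe_variant(v) for v in [79, 0, -1]]` gives one label per tile position.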
For example, here is a band representation of path 0x35e:
[ 79 8 0 0 0 0 0 -1 0 0 0 389 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 -1 34 -1 185 1] [ 79 2 0 0 0 0 0 -1 0 0 0 390 0 0 0 0 0 1 0 0 0 0 0 0 26 0 0 1 0 0 -1 34 -1 185 1] [[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]] [[ ][ ][ ][ ][ ][ ][ 903 1 ][ ][ 16 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 96 1 ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ 291 2 ]]
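To make the layout concrete, here is a minimal Python sketch that reads band text like the example above into nested lists and turns one no-call array into (start, length) pairs. The names `parse_band`, `nocall_pairs` and `SAMPLE` are hypothetical helpers for illustration, not part of fjt or cgft, and `SAMPLE` is a shortened, made-up band rather than real output. The parser only relies on brackets and integers, so it does not care how the vectors are split across lines.

```python
import re


def parse_band(text):
    """Parse band text into a list of (possibly nested) integer lists."""
    tokens = re.findall(r"\[|\]|-?\d+", text)
    stack = [[]]  # stack[0] collects the top-level vectors
    for tok in tokens:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            if len(stack) == 1:
                raise ValueError("unexpected closing bracket")
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(int(tok))
    if len(stack) != 1:
        raise ValueError("unbalanced brackets")
    return stack[0]


def nocall_pairs(entries):
    """Group a flat no-call array into (start, length) pairs."""
    if len(entries) % 2 != 0:
        raise ValueError("no-call array must have an even number of entries")
    return list(zip(entries[0::2], entries[1::2]))


# Shortened, made-up sample in the same layout as the example above.
SAMPLE = """
[ 79 8 0 -1 ]
[ 79 2 0 -1 ]
[[ ][ ][ 903 1 ][ ]]
[[ ][ ][ 903 1 ][ ]]
"""

variants_a, variants_b, nocalls_a, nocalls_b = parse_band(SAMPLE)
```

With this sample, `nocall_pairs(nocalls_a[2])` yields a single pair: a no-call starting at offset 903 from the start of the tile, with length 1.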
With this format, we can easily create a CGFv3 file.
This suggests a basic workflow:
- Convert from source format to FastJ
- Collect, deduplicate and "impute" all tiles to create the tile library
- Use the source FastJ and tile library to create a band format
- Use the band format to create the CGFv3
The cgft tool can create CGFv3 from band data. The fjt tool should take care of the third step, converting from FastJ and tile library (SGLF) to band format.
As a general rule, splitting by tile path seems to be a good compromise between functionality, memory footprint and speed. With this, an example usage of fjt could be:
fjt -L 035e.sglf -i 035e.fj -B -p 862
This should output the band format shown above.
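Note that the value given to -p in this example (862) is the decimal form of tile path 0x35e, while the input file names use the zero-padded hexadecimal form (035e). A small Python sketch of that conversion follows. The helper names are hypothetical, and the four-digit padding and the decimal -p argument are assumptions inferred from this one example.

```python
def path_to_decimal(hex_path):
    """Convert a hex tile path such as '035e' or '0x35e' to an integer."""
    return int(hex_path, 16)


def path_to_hex(path, width=4):
    """Format an integer tile path as zero-padded lowercase hex."""
    return format(path, "0{}x".format(width))
```

Here `path_to_decimal("035e")` gives 862 and `path_to_hex(862)` gives `"035e"`.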
Updated by Abram Connelly over 7 years ago
- Status changed from New to In Progress
- Assigned To set to Abram Connelly
To make sure that the conversions to the new CGFv3 format don't introduce errors, it would be nice to double-check the FastJ against the CGF created. To facilitate this, it would be nice to be able to convert directly from FastJ to band format.
Since other CGF conversion tickets depend on this, I'm prioritizing FastJ to band format conversion by updating the fjt tool.
Updated by Abram Connelly over 7 years ago
- Status changed from In Progress to Closed
In addition, some simple tests along with a test script have been added to fjt. Tests are minimal for now: they only cover a small tile path (0x035e) and only one allele. Future tests should cover a bigger tile path along with both alleles.