Story #11671
closed
Convert 650+ CGF files to new CGFv3
0%
Description
CGF files in the l7g Data project need to be converted over to the new CGFv3 format.
There are two directories of CGF files:
The workflow should be to convert each CGF file to 'band' information, then use the new cgft to convert to the CGFv3 format.
This should be fairly straightforward except for some CGF files that have errors in their mitochondrial DNA sequence conversion (tile path 0x35e).
A new tile library for tile path 0x35e needs to be generated before this can be properly completed.
The resulting CGF files should also be stored in the l7g Data project, with the old ones renamed with a timestamp in that same project to differentiate them.
Updated by Abram Connelly over 7 years ago
Updated by Abram Connelly over 7 years ago
- Target version set to Lightning Sprint (2017-05-15 to 2017-05-29)
Updated by Abram Connelly over 7 years ago
After conversion, a double check needs to occur to make sure the conversion went correctly. I think the best way is to do the following:
- For all tile paths except 0x035e, check to make sure the original CGF matches the band format produced by CGFv3. cgb can be used to get the band format for CGFv2 and cgft can be used to produce the band format for CGFv3.
- For the mitochondrial DNA tile path, 0x35e, checking that the hashes of the sequence produced by concatenating the FastJ are the same as what's produced from the CGFv3 should be sufficient.
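For the first check, a loop over tile paths might look like the following. This is only a sketch: `check_bands` is a hypothetical helper, and the two band-producing functions it takes are placeholders the caller would implement with cgb (CGFv2) and cgft (CGFv3). The number of tile paths is passed in rather than assumed.

```shell
#!/bin/bash
MT_TILEPATH=862   # decimal for 0x35e, the mitochondrial tile path (checked separately)

# check_bands OLD_FN NEW_FN N_PATHS
# OLD_FN and NEW_FN are caller-supplied (hypothetical) functions that take a
# decimal tile path and print the band output for it.
# Prints one line per mismatching tile path; returns 1 if any mismatch was found.
check_bands() {
  local old_fn="$1" new_fn="$2" n_paths="$3"
  local p=0 bad=0
  while [ "$p" -lt "$n_paths" ]; do
    if [ "$p" -ne "$MT_TILEPATH" ]; then
      # Compare the two band outputs as text
      if [ "$("$old_fn" "$p")" != "$("$new_fn" "$p")" ]; then
        printf 'MISMATCH: tile path 0x%04x\n' "$p"
        bad=1
      fi
    fi
    p=$((p + 1))
  done
  return "$bad"
}
```

Passing the band producers in as functions keeps the loop independent of the exact cgb and cgft invocations, which are not spelled out here.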
The conversion from CGFv3 to sequence can be done via:
- cgft to band format
- extend the fjt tool (or extend/make a tool) to take in band format (and an SGLF file) and output FastJ (or CSV)
- concatenate FastJ (or CSV) to sequence
This process is slow but since tile path 0x35e is so small, this should be quick enough to do.
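The final step, comparing hashes of the concatenated sequence, can be sketched independently of how the sequence is produced. This is a minimal sketch, assuming the sequence arrives as plain text that may be wrapped across lines; `seq_md5` and `same_sequence` are hypothetical helper names, not part of cgft or fjt.

```shell
#!/bin/bash
# seq_md5: read sequence text on stdin, drop all newlines,
# and print the md5 hex digest of the concatenated sequence.
seq_md5() {
  tr -d '\n' | md5sum | cut -f1 -d' '
}

# same_sequence FILE_A FILE_B
# Succeeds if both files hold the same sequence, regardless of line wrapping.
same_sequence() {
  [ "$(seq_md5 < "$1")" = "$(seq_md5 < "$2")" ]
}
```

Stripping newlines before hashing means two files that wrap the same sequence at different widths still compare equal.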
Updated by Abram Connelly over 7 years ago
- Status changed from New to In Progress
Updated by Abram Connelly over 7 years ago
- Status changed from In Progress to Closed
720 CGFv3 files have been converted/created. They've been uploaded to the cgfv3 collection under the l7g Data project.
I've checked the mitochondrial sequences to make sure they match. The script was run on lightning-dev1, so the context makes it hard to re-run elsewhere, but it's provided here to give an idea of what's involved:
#!/bin/bash
# For each sample, compare md5 hashes of the mitochondrial (tile path 0x35e)
# sequence recovered from the CGFv3 file against the source FastJ.
sglfgz="/data-sdd/data/sglf/035e.sglf.gz"
cgfdir="stage.cgfv3"
for fjgz in $(find ./stage ./stage.okg -name 035e.fj.gz) ; do
  name=$(basename "$(dirname "$fjgz")")
  echo "$name"
  cgfv3="$cgfdir/$name.cgfv3"
  # a0/a1 come from the CGFv3 file, b0/b1 from the FastJ (862 is 0x35e in decimal)
  a0=$(cgft -b 862 "$cgfv3" | fjt -b -L <( zcat "$sglfgz" ) | fjt -c 0 | tr -d '\n' | md5sum | cut -f1 -d' ')
  b0=$(fjt -c 0 <( zcat "$fjgz" ) | tr -d '\n' | md5sum | cut -f1 -d' ')
  a1=$(cgft -b 862 "$cgfv3" | fjt -b -L <( zcat "$sglfgz" ) | fjt -c 1 | tr -d '\n' | md5sum | cut -f1 -d' ')
  b1=$(fjt -c 1 <( zcat "$fjgz" ) | tr -d '\n' | md5sum | cut -f1 -d' ')
  if [[ "$a0" != "$b0" ]] || [[ "$a1" != "$b1" ]] ; then
    echo "ERROR: $cgfv3 mismatch between mt sequences"
  else
    echo " ok"
  fi
done
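The 862 passed to cgft -b above appears to be tile path 0x35e written in decimal, while the file names use zero-padded hex (035e). A small sketch for converting between the two forms; `tilepath_dec` and `tilepath_hex` are hypothetical helper names.

```shell
#!/bin/bash
# tilepath_dec HEX: print the decimal value of a hex tile path.
# Accepts the value with or without a leading "0x" (e.g. 0x35e or 035e).
tilepath_dec() {
  printf '%d\n' "0x${1#0x}"
}

# tilepath_hex DEC: print a decimal tile path as 4-digit zero-padded hex,
# matching the naming used by files such as 035e.fj.gz.
tilepath_hex() {
  printf '%04x\n' "$1"
}

tilepath_dec 0x35e   # prints 862
tilepath_hex 862     # prints 035e
```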
A new sglf collection was also created with the new 0x35e sglf tile path library. This was needed for the FastJ conversion.
I'm considering this issue closed. If further checks are needed, we can open another ticket to take care of them.