Bug #13469
closedvcfbed2homref misreported homref region
Description
When testing with the GIAB dataset, vcfbed2homref (v0.1.0) run on HG004 reported an end bound in a homozygous reference region that was before the start of following line.
The offending VCF line was:
7 85067121 rs149973664 TATTGCACAACAAAAGAAAAACAAAAGATATTTAGTAC T 50 PASS platforms=3;platformnames=CG,Illumina,10X;datasets=5;datasetnames=CGnormal,HiSeqPE300x,HiSeq250x250,10XChromium,HiSeqMatePair;callsets=8;callsetnames=CGnormal,HiSeqPE300xfreebayes,HiSeq250x250freebayes,HiSeqPE300xGATK,HiSeq250x250GATK,10XGATKhaplo,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=IonExome;callable=CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250freebayes_callable;filt=CS_CGnormal_filt,CS_HiSeqMatePairGATK_filt GT:PS:DP:ADALL:AD:GQ 0/1:.:836:220,154:0,0:1460
With the corresponding BED file region being:
7 85058314 85067126 7 85067151 85068558 7 85068659 85071179 7 85071280 85073955
I believe the issue is that the bed file position wasn't updated when the homozygous start window was updated (around here).
Updated by Abram Connelly almost 8 years ago
This is specifically the error that was causing tabix to not index properly:
7 85065014 . G <NON-REF> . PASS END=85067120 GT 0/0 7 85067121 rs149973664 TATTGCACAACAAAAGAAAAACAAAAGATATTTAGTAC T 50 PASS platforms=3;platformnames=CG,Illumina,10X;datasets=5;datasetnames=CGnormal,HiSeqPE300x,HiSeq250x250,10XChromium,HiSeqMatePair;callsets=8;callsetnames=CGnormal,HiSeqPE300xfreebayes,HiSeq250x250freebayes,HiSeqPE300xGATK,HiSeq250x250GATK,10XGATKhaplo,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=.;callable=CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250freebayes_callable;filt=CS_CGnormal_filt,CS_HiSeqMatePairGATK_filt GT:PS:DP:ADALL:AD:GQ 0/1:.:836:220,154:0,0:1460 7 85067159 . T <NON-REF> . PASS END=85067126 GT 0/0 7 85067152 . T <NON-REF> . PASS END=85067723 GT 0/0 7 85067724 rs79027572 T C 50 PASS platforms=2;platformnames=Illumina,CG;datasets=4;datasetnames=HiSeqPE300x,CGnormal,HiSeq250x250,HiSeqMatePair;callsets=7;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250GATK,HiSeq250x250freebayes,HiSeqMatePairfreebayes,HiSeqMatePairGATK;datasetsmissingcall=10,.;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250GATK_callable,CS_HiSeq250x250freebayes_callable;filt=CS_HiSeqPE300xGATK_filt GT:PS:DP:ADALL:AD:GQ 0/1:.:1049:180,211:79,84:1299
Notice 85067152 comes after in the file but falls before the start of the previous lines reported 85067159.
tabix reports the following error (with out.vcf.gz being the output VCF file):
[E::hts_idx_push] Unsorted positions on sequence #1: 85067159 followed by 85067152 tbx_index_build failed: out.vcf.gz
Updated by Abram Connelly almost 8 years ago
- Status changed from New to Resolved
I subtracted the beginning BED start from the source BED file as well as the VCF file so that we could create a small test snippet with a reference that was checked in.
Bug has been fixed and tests have been added to make sure the problem region passes.