Story #13376
closedTest the effect of phasing imputation workflow
Added by Jiayong Li over 6 years ago. Updated over 5 years ago.
0%
Description
Examine the output of the following runs:
GS12877:
https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-7n7aty5cnojbaap
NA12878:
https://workbench.su92l.arvadosapi.com/container_requests/su92l-xvhdp-gexbre3xknqqxyj
Files
generate_test_genome.py (1.56 KB) generate_test_genome.py | Keldin Sergheyev, 04/26/2018 05:10 PM | ||
get_switch_errors.py (2.77 KB) get_switch_errors.py | Jiayong Li, 06/20/2018 08:35 PM |
Updated by Jiayong Li over 6 years ago
- Description updated (diff)
- Status changed from In Progress to New
- Assigned To set to Jiayong Li
Updated by Jiayong Li over 6 years ago
- Related to Story #13216: Write phasing imputation workflow added
Updated by Jiayong Li over 6 years ago
Some sample lines of GS12877.
unphased:
1 52238 . T G . PASS . GT 1/1 1 52727 . C G . PASS . GT 0/1 1 53206 . G C . PASS . GT 1/0 1 55164 . C A . PASS . GT 1/1
phased:
1 52238 . T G . PASS . GT 1|1 1 55164 . C A . PASS . GT 1|1
imputed:
1 52238 rs2691277 T G . PASS AR2=0.00;DR2=0.00;AF=1 GT:DS 1|1:2 1 52253 rs530867301 C G . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 53195 rs530111162 T C . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 53234 rs199502715 CAT C . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 54353 rs140052487 C A . PASS AR2=0.00;DR2=0.00;AF=0.0029;IMP GT:DS 0|0:0.01 1 54490 rs141149254 G A . PASS AR2=0.00;DR2=0.00;AF=0.061;IMP GT:DS 0|0:0.12 1 54564 rs558796213 G T . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 54591 rs561234294 A G . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 54712 rs552304420 T C . PASS AR2=0.00;DR2=0.00;AF=0.0029;IMP GT:DS 0|0:0.01 1 54712 rs568927205 T TTTTC . PASS AR2=0.00;DR2=0.00;AF=0.59;IMP GT:DS 1|1:1.18 1 54716 rs569128616 C T . PASS AR2=0.00;DR2=0.00;AF=0.20;IMP GT:DS 0|0:0.39 1 54763 rs548455890 T G . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 54815 rs568686875 T C . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 54945 rs569799965 C A . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 55136 rs574692163 T C . PASS AR2=0.00;DR2=0.00;AF=0;IMP GT:DS 0|0:0 1 55164 rs3091274 C A . PASS AR2=0.00;DR2=0.00;AF=1 GT:DS 1|1:2
- Since the phasing process will drop variants (when it can't phase), and then imputation will add variants, there is potential conflict in the workflow.
- Also, in the imputed vcf, the position 54712 appeared twice.
Updated by Jiayong Li over 6 years ago
Also beagle includes the following quality information in the imputed vcf:
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
P(RA): P(REF,ALT)
P(AA): P(ALT,ALT)
More info here https://genome.sph.umich.edu/wiki/Minimac3_Info_File#Dosage
Minimac3 estimates imputed dosage at an haplotype level by finding the posterior probability of the alternate allele at that site. The genotype dosage is next evaluated as the sum of the haplotype dosages of each haplotype. For e.g. if the estimated posterior probability of the alternate allele is 0.98 and 0.96 in each haplotype, the genotype dosage is output as 0.98 + 0.97 = 1.95.
Also the imputed marker
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
Note that the reported variants in phased doesn't have the "IMP" marker.
Another observation is that AF (allele frequency) appears to be correlated with DS. DS is roughly 2*AF.
Updated by Jiayong Li over 6 years ago
Note that in the imputed vcf, the original annotations are dropped. For example, in NA12878
unphased:
1 837214 rs72631888 G C 50 PASS platforms=3;platformnames=Illumina,CG,Solid;datasets=3;datasetnames=HiSeqPE300x,CGnormal,SolidSE75bp;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,SolidSE75GATKHC;datasetsmissingcall=10XChromium,IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:558:122,138:135,166:581:0/1:.:PATMAT
phased:
1 837214 rs72631888 G C 50 PASS platforms=3;platformnames=Illumina,CG,Solid;datasets=3;datasetnames=HiSeqPE300x,CGnormal,SolidSE75bp;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,SolidSE75GATKHC;datasetsmissingcall=10XChromium,IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:558:122,138:135,166:581:0/1:.:PATMAT
imputed
1 837214 rs72631888 G C . PASS AR2=0.00;DR2=0.00;AF=0.5 GT:DS 0|1:1
Updated by Keldin Sergheyev over 6 years ago
- File generate_test_genome.py generate_test_genome.py added
This is the python script that I used to remove positions from GIAB. This version has no errors, but could be more selective about which positions it converts to ./.
Updated by Jiayong Li over 6 years ago
Keldin Sergheyev wrote:
This is the python script that I used to remove positions from GIAB. This version has no errors, but could be more selective about which positions it converts to ./.
Thanks Keldin.
Updated by Jiayong Li over 6 years ago
NA12878 (original):
https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0btgexp21dhe6zu
NA12878 (phased):
https://workbench.su92l.arvadosapi.com/collections/su92l-4zz18-0c5vff0u6x0486v
I've discovered there are two main potential problems with the algorithm.
1. Flipping the phase¶
NA12878 has 3,775,119 variants, 2,249,190 of them are phased, but eagle flipped 1,076,255 of them. Statistics are obtained by running the following command.
vcf-compare -g HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz NA12878_phased.vcf.gz
For example, in the original NA12878
1 1171417 rs6603782 C T 50 PASS platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_10XGATKhaplo_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|0:639:147,157:163,176:762:0/1:.:PATMAT 1 1172907 rs715643 C T 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_SolidSE75GATKHC_filt GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:553:114,125:147,172:738:0/1:.:PATMAT
In the phased NA12878
1 1171417 rs6603782 C T 50 PASS platforms=4;platformnames=Illumina,CG,10X,Solid;datasets=4;datasetnames=HiSeqPE300x,CGnormal,10XChromium,SolidSE75bp;callsets=5;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo,SolidSE75GATKHC;datasetsmissingcall=IonExome,SolidPE50x50bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_10XGATKhaplo_callable GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0|1:639:147,157:163,176:762:0/1:.:PATMAT 1 1172907 rs715643 C T 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_SolidSE75GATKHC_filt GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|0:553:114,125:147,172:738:0/1:.:PATMAT
2. Missing variants¶
The phased NA12878 missed some unphased variants of the original (meaning that it cannot phase them), but it also missed some phased ones too (this seems counter-intuitive).
For example, missed unphased variants
1 239339 rs184451216 A G 50 PASS platforms=1;platformnames=10X;datasets=1;datasetnames=10XChromium;callsets=1;callsetnames=10XGATKhaplo;datasetsmissingcall=HiSeqPE300x,CGnormal,IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_10XGATKhaplo_callable;difficultregion=hg19_self_chain_split_withalts_gt10k GT:DP:ADALL:AD:GQ:IGT:IPS:PS 0/1:0:0,0:0,0:99:0/1
Missed phased variants
1 567239 rs78150957 CG C 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable;filt=CS_HiSeqPE300xfreebayes_filt,CS_SolidPE50x50GATKHC_filt,CS_SolidSE75GATKHC_filt;difficultregion=AllRepeats_lt51bp_gt95identity_merged_slop5 GT:DP:ADALL:AD:GQ:IGT:IPS:PS 1|1:534:0,214:38,252:241:1/1:.:PATMAT
Note that the missed variants are extracted by the following command, and then looking at the fn.vcf.gz (false negatives)
rtg vcfeval -b HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf.gz -c NA12878_phased.vcf.gz -t ~/keep/by_id/cd0ace4987f72d1e4d8f3ef5281cef17+63099/genomes/GRCh37/rtg/GRCh37.sdf -o rtg-NA12878-NA12878_phased
There are 313,952 variants missing, 300,952 of them are phased originally, and 118,426 of them are hom alt (namely, GT being 1|1).
Updated by Jiayong Li over 6 years ago
For imputation of GIAB NA12878, beagle adds total of 720,714 variants, and 125,936 of them lie in the confidence region. This makes sense, since 83% of the imputed variants are outside of the confidence region.
Updated by Jiayong Li over 6 years ago
Regarding eagle flipping the phases, I've found this PLOS Genetics paper http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007308 that evaluates phasing strategies on GIAB NA12878, and eagle and beagle are among them.
In that paper, they used the metric Switch Error, which is defined as follows
To identify switch errors, the phase of each site is compared with upstream neighboring phased sites.
This means the switch error is only 1 for a whole block of variants whose phases are all switched. In this metric, eagle did better than beagle. See http://journals.plos.org/plosgenetics/article/figure?id=10.1371/journal.pgen.1007308.t001
Updated by Jiayong Li over 6 years ago
- File get_switch_errors.py get_switch_errors.py added
See the attached script for calculating switch errors.
Summary:
GIAB NA12878:¶
3,775,119 variants (2,249,190 of them phased)
After applying eagle for phasing:¶
3,461,167 phased variants
313,952 missing variants (300,952 of them phased originally, 118,426 of them hom alt, namely, GT being 1|1)
1,076,255 phase switches
186,362 switch errors (phase switch compared to last phased het upstream)
Updated by Jiayong Li over 5 years ago
- Status changed from In Progress to Resolved