Idea #8954: Convert 178 Harvard PGP public whole genomes to FastJ - Lightning - Arvados

Convert 178 Harvard PGP public whole genomes to FastJ

Added by Abram Connelly almost 10 years ago. Updated over 9 years ago.

Status:

Closed

Priority:

Normal

Assigned To:

Target version:

Start date:

04/12/2016

Due date:

Story points:

Description

Convert the 178 publicly available whole genomes from the public project su92l-j7d0g-nf54gdds5jj03tc.

The new conversion should happen through the pasta utility, either from CGI-Var to FastJ or from GFF to FastJ. If there's time, it would be good to confirm that the FastJ conversions from CGI-Var and from GFF are identical (or very close).
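One way to run the identical-or-close check is sketched below. It is a minimal sketch, not the project's actual tooling: it assumes both conversions have been written to local directories of per-path '.fj.gz' files, and the function names (digest_fastj, compare_fastj_dirs) are hypothetical. It relies only on bgzip output being readable as ordinary gzip, and it only reports which paths are byte-identical after decompression; judging "very close" would need a finer comparison on the paths it flags as different.

```python
import gzip
import hashlib
from pathlib import Path


def digest_fastj(path):
    """Return the SHA-256 hex digest of the decompressed contents of one file."""
    h = hashlib.sha256()
    # bgzip output is a series of gzip members, which gzip.open reads transparently.
    with gzip.open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_fastj_dirs(dir_a, dir_b):
    """Compare two directories of per-path FastJ files by decompressed content.

    dir_a and dir_b are assumed (hypothetically) to hold the CGI-Var-derived
    and GFF-derived conversions of the same genome.
    """
    a = {p.name: p for p in Path(dir_a).glob("*.fj.gz")}
    b = {p.name: p for p in Path(dir_b).glob("*.fj.gz")}
    report = {
        "identical": [],
        "different": [],
        "only_a": sorted(set(a) - set(b)),
        "only_b": sorted(set(b) - set(a)),
    }
    for name in sorted(set(a) & set(b)):
        same = digest_fastj(a[name]) == digest_fastj(b[name])
        report["identical" if same else "different"].append(name)
    return report
```

Comparing decompressed content rather than the compressed bytes avoids false mismatches caused by differing compression levels or block boundaries.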

The result should be a project with each converted FastJ genome as a subcollection. Each FastJ collection should be named "huID.source_file" (e.g. "hu82651.var-GS000037338-ASM.tsv.bz2") and should contain the ~1k FastJ files, one per path. Each FastJ path file should be bgzip-compressed and named with its four-digit hex path and an '.fj.gz' extension (e.g. "001f.fj.gz"). Each file should have a FASTA index file, with the headers in the index replaced by the appropriate path in 4-2-4-4 hex digit notation (e.g. "001f.00.01e1.0001"). Though this is an abuse of the FASTA index file, the index file should still work with the compressed file. Each sequence in the FastJ files should have a fixed width.
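The naming rules above can be sketched as a few small helpers. This is an illustration only: the function names are hypothetical, and the labels used for the last three fields of the 4-2-4-4 notation (version, step, variant) are an assumption, since the description only fixes the digit widths and that the first field is the path. The fixed-width check likewise assumes that "fixed width" means every line of a sequence except the last has the same length, which is what a FASTA index requires.

```python
def collection_name(hu_id, source_file):
    """Build a collection name of the form "huID.source_file"."""
    return f"{hu_id}.{source_file}"


def path_file_name(path):
    """Name a per-path FastJ file by its four-digit hex path."""
    if not 0 <= path <= 0xFFFF:
        raise ValueError("path must fit in four hex digits")
    return f"{path:04x}.fj.gz"


def tile_id(path, version, step, variant):
    """Format an identifier in 4-2-4-4 hex digit notation.

    The names version, step and variant are assumed labels for the
    second, third and fourth fields.
    """
    return f"{path:04x}.{version:02x}.{step:04x}.{variant:04x}"


def is_fixed_width(sequence_lines):
    """Check that one sequence's lines are FASTA-index friendly.

    Every line but the last must share one width; the last line must be
    non-empty and no longer than that width.
    """
    if len(sequence_lines) <= 1:
        return True
    width = len(sequence_lines[0])
    body_ok = all(len(line) == width for line in sequence_lines[:-1])
    return body_ok and 0 < len(sequence_lines[-1]) <= width


print(collection_name("hu82651", "var-GS000037338-ASM.tsv.bz2"))
print(path_file_name(0x1F))           # 001f.fj.gz
print(tile_id(0x1F, 0, 0x1E1, 1))     # 001f.00.01e1.0001
```

Keeping the formatting in one place makes it easier to keep the file names and the rewritten index headers consistent across all converted genomes.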

This project can be added to when more genome files are converted to FastJ.