Project

General

Profile

Actions

Idea #8956

closed

Create a Go an output sensitive dynamic programming function

Added by Abram Connelly almost 10 years ago. Updated over 9 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Target version:
-
Start date:
04/12/2016
Due date:
Story points:
-

Description

When converting to some variant difference format (GFF, gVCF, etc.) from FastJ, there is an alignment step that takes place to align the FastJ tile to the reference tile to produce the variant difference format. Often times these sequences are nearly identical, with only a few bases difference either from a SNP or small INDEL, for example. Naive dynamic programming can be done in O(N^2) time and space. Using Hirschberg's algorithm, space can be reduced to O(N) but still takes O(N^2) time.

It would be better to have an output sensitive algorithm that only uses time/space proportional to the number of differences in the two sequences. Such an algorithm is well known. I believe Ukkonen's algorithm is one such method. There is also work on using a checkpointing scheme.

Implementing one or both of these would significantly speed up conversion from FastJ to (e.g.) VCF. Most conversions are of ~250bp but there are a few which are in the ~10k region which slow down conversion and do not make arbitrary conversion from FastJ to VCF (etc.) fast.

The goal would be to extend something like memz with an output sensitive dynamic programming function or to create another stand alone library/function. Input should be a simple byte array. User specified fixed gap, match and mismatch costs should be allowed. Pairwise alignment would be desirable but not necessary.

Other References:

Actions #1

Updated by Abram Connelly over 9 years ago

  • Status changed from New to Closed

This has been completed with the caveat that a heuristic needs to be employed for some of the sequences that are too disparate. The asmukk library has been created that uses Ukkonen's algorithm for approximate string matching. The has O(d*n) space and time complexity. The better solution is to use the checkpointing algorithm listed above but that can be done later if need be.

Even with the O(d*n) space/time complexity, there are some sequences that blow up when trying to align. For these, a heuristic is used to speed up this alignment. See the ClumsyAlign function for details.

Actions

Also available in: Atom PDF