How I Wrote an Ultra-Fast DNA Sequence Alignment Algorithm in JavaScript

medium.com

106 points by jebus989 11 years ago · 52 comments

tstactplsignore 11 years ago

This is a neat JS library with some possible applications, but if you're only matching substrings without gaps, then you're basically string searching, not performing alignment. Alignment is an entirely different and more complicated computational problem, with decades of algorithm development and implementation behind it (see Smith-Waterman, Needleman-Wunsch, the Burrows-Wheeler transform), and it is far more biologically useful and practical than substring searching. In practice, heuristic implementations like BLAST, FASTA, and LASTZ do most large-scale alignment searching, and that's not even considering multiple sequence alignment, which is another problem unto itself. In other words, it's a useful JS library and an interesting implementation, but it isn't alignment.

  • keithwhor 11 years ago

    The algorithm maps every position of seqSearch to seqGenome, aggregates the results, and returns them. It is not merely a string search. (If it were, the Boyer-Moore [1] algorithm would outperform this significantly, binary or not, and we'd call it a day.) It is aligning seqSearch to seqGenome and returning the best results, in order.

    Alignment is not restricted to "gapped" alignments only (or there wouldn't be a need for such a distinction!).

    You can modify the algorithm to include gapped alignments (by introducing gaps in your search sequence), as it supports a "gap" character ("-"). The gap character will just always count as "not a match."
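
    For the record, the shape of the loop is roughly this (an illustrative character-level sketch; the actual implementation works on packed binary data, not character arrays):

      // Illustrative sketch: score the query at every offset, then
      // aggregate and return the best hits first.
      function ungappedAlign(seqGenome, seqSearch) {
        var results = [];
        for (var offset = 0; offset <= seqGenome.length - seqSearch.length; offset++) {
          var score = 0;
          for (var i = 0; i < seqSearch.length; i++) {
            // A gap character ("-") in the query never counts as a match.
            if (seqSearch[i] !== '-' && seqSearch[i] === seqGenome[offset + i]) score++;
          }
          results.push({offset: offset, score: score});
        }
        return results.sort(function(a, b) { return b.score - a.score; });
      }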

    [1] http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_sear...

    • tstactplsignore 11 years ago

      So, ungapped alignment is just substring searching with degeneracy, which is fine. While that is its own computational problem and may even have the occasional biological application, sequence alignment is defined as a gapped alignment problem. (See the Wikipedia page, for example [1], which defines gap insertion as a critical step; all the alignment variants on that page are gapped alignments.)

      I am not sure you understand what gapped alignment is: it is not the alignment of a sequence with known gaps, but an algorithm that determines the best placement of gaps in a query sequence to obtain the highest matching score. This is a very different problem from the one you just described, and it is essentially "the hard part" of sequence alignment.

      [1] http://en.m.wikipedia.org/wiki/Sequence_alignment
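
      To make "the hard part" concrete, here is the textbook Smith-Waterman recurrence as a minimal sketch (scoring constants are arbitrary, illustrative choices). The point is that the algorithm itself decides where gaps go, by maximizing over substitution and gaps at every cell:

        // Minimal Smith-Waterman local alignment score (linear gap penalty).
        function smithWatermanScore(a, b) {
          var MATCH = 2, MISMATCH = -1, GAP = -2;
          var prev = new Array(b.length + 1).fill(0);
          var best = 0;
          for (var i = 1; i <= a.length; i++) {
            var curr = new Array(b.length + 1).fill(0);
            for (var j = 1; j <= b.length; j++) {
              var diag = prev[j - 1] + (a[i - 1] === b[j - 1] ? MATCH : MISMATCH);
              // Gap placement happens here: the max chooses between a
              // substitution (diag) and a gap in either sequence.
              curr[j] = Math.max(0, diag, prev[j] + GAP, curr[j - 1] + GAP);
              if (curr[j] > best) best = curr[j];
            }
            prev = curr;
          }
          return best;
        }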

      • mbreese 11 years ago

        See here: https://news.ycombinator.com/item?id=9096064

        Turns out they aren't actually after an alignment per se, but rather trying to match a DNA-binding motif.

        (You're right, though, that this isn't an alignment, just a substring search with degeneracy.)

      • keithwhor 11 years ago

        Hmm, while I understand the problem of gapping is traditionally the hard part, I'm under the impression that the argument you're putting forward is primarily one of semantics.

        "In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences." (From the article you linked.)

        Gapped sequence alignment is certainly far more robust than ungapped (and more biologically relevant, since insertions are a common error when comparing sequences across organisms), and a much harder problem, but as for the definition of "alignment" itself, I don't believe I've misnamed anything here.

        If we're going to be overly pedantic about the use of the word "alignment," that's fine, but I'm not sure it's a worthwhile debate to have. A quick search for "ungapped sequence alignment" returns a great deal of results on Google [1]. So if I am mistaken, I'm certainly not the first (nor do I believe I'll be the last).

        Furthermore, there's nothing preventing anybody from using the methods described here to implement an ungapped sequence alignment tool that outperforms tools that only use string comparisons. :)

        [1] https://www.google.com/search?q=ungapped%20sequence%20alignm....

        • lightlike 11 years ago

          You shouldn't dismiss this as an argument over semantics if your understanding of the term differs from researchers' use of it. If you introduce "a fast tool for XYZ" and researchers understand XYZ to mean A, while you understand it to mean B, then the tool is not useful for researchers to perform what they know as XYZ.

          Tools like BLAST are extremely sophisticated and have been under development for decades; I'm fairly confident they've moved past naive string comparisons by now.

          • keithwhor 11 years ago

            Fair. Though I'm not convinced "ungapped sequence alignment" is particularly confusing to a researcher, considering there are tools and papers that have existed for decades using this description [1][2][3]. Though the algorithm described in my article is extremely focused on raw performance (and relatively naive with scoring), I would still choose to categorize it as primarily a tool that deals with ungapped sequence alignment, specifically supporting IUPAC degenerate nucleotide sequences. Thus, I believe the initial argument is, indeed, overly pedantic.

            And to be clear, nowhere am I comparing what I've developed to BLAST. (They have very different applications.)

            [1] http://schneider.ncifcrf.gov/paper/malign/

            [2] http://www.ncbi.nlm.nih.gov/pubmed/9697204

            [3] http://www.ncbi.nlm.nih.gov/pubmed/15130540

xaa 11 years ago

My colleague (an editor at Bioinformatics journal) and I were joking the other day about how every other paper title, especially about aligners, seems to include the word "fast" in it. This takes it to a new level.

As a learning exercise, this is interesting and fine. I am trying very hard to suppress my inner "reviewer" right now. Walking away... not comparing this to existing algorithms implemented in highly optimized C/C++, or CUDA, or even hardware. Not asking why other authors would go to such extraordinary lengths if high-level languages were suitable. Not asking how conclusions can be drawn about the suitability of JavaScript for very computationally intensive tasks without solving the actual alignment problem rather than a subset of it, let alone without comparing to existing tools. Not asking about the application of the tool. It's not a paper. I'm breathing. OK.

  • keithwhor 11 years ago

    This was certainly not intended as an academic submission, merely a description of a method used. You can breathe. :)

    The initial point of the tool / algorithm was to find all potential binding sites for a DNA-binding domain of a protein in a 100 kbp to 1 Mbp genome (even those with sequence identity of ~50% or less), provided you have a consensus sequence that contains ambiguous nucleotides (for example, one roughly discerned from a sequence logo). It quickly turned into the development of a general bioinformatics library in JavaScript, and a chance to see how far and how fast I could push V8 at these sequence comparisons.

    I would love (at some point) to go into significantly more detail and compare what I've written here with existing tools. If you're willing to offer mentorship or guidance (or know somebody who would be), it would be fantastic to present the information in a more thoroughly peer-reviewed context. Otherwise, the post (and associated library) are meant primarily as learning tools for both biologists and developers.

    • xaa 11 years ago

      Alas, alignment is not in my area of expertise, nor of the people I know directly. I'm a heavy user, but not a developer of aligners (except a toy one). You could consider contacting Cole Trapnell or someone well known for making a widely used aligner and propose a collaboration, but that is a hard cold call. On the plus side, it means I probably wouldn't be your reviewer :) (but editors do sometimes get desperate...)

      Your initial problem, if you frame it as the desire to simply enumerate all the degenerate sequences and loci, could be solved any number of ways as other commenters have mentioned. Probably I would reach first for a regex. But sure, no crime in learning to implement a new algorithm while also testing the limits of V8. Probably half of my grad school time was spent that way ;)

      I think your best bet if you wanted to publish would be to find a use case for in-browser alignment. It would be hard to answer the obvious question, "why not server-side?" though. But who knows, people are somewhat often taking old bioinformatics algorithms and saying "but now you can do it on your phone!!". And they do publish.

      But as for matching both the speed and accuracy of state-of-the-art aligners with JavaScript, it is my considered, scholarly opinion that you have no chance in hell. So you shouldn't present it that way. There would be no need to compare (at least on speed) if you weren't making claims about speed.

  • dalke 11 years ago

    Well, yes. "fast" can mean "not as slow as doing it by hand in the most naive way possible".

    "Faster", now that's a more interesting title if only because it was compared to something else. (Then again, I reviewed one paper which said the algorithm was faster and more memory efficient than the previous version of itself. But with no numbers.)

    "Fastest" is of course much harder to pull off. :)

    • xaa 11 years ago

      > I reviewed one paper which said the algorithm was faster and more memory efficient than the previous version of itself. But with no numbers.

      The aforementioned colleague says he usually editorially rejects those.

      Another amusing thing about this title is that this is not the first time "ultra-fast" has been used to describe an aligner in a title. The STAR aligner did it too.

      No sane reviewer or editor would allow "fastest" unless in the context of "provably fastest" which is probably not within the skillset of most bioinformaticians. If it is just "fastest among the currently available tools" then that claim will be out of date within a year.

      • dalke 11 years ago

        The one "fastest" I know of is "Fastest Fourier Transform of the West". The abstract for the 1997 paper starts:

        > This paper describes FFTW, a portable C package for computing the one- and multidimensional complex discrete Fourier transform (DFT). FFTW is typically faster than all other publicly available DFT software, including the well-known FFTPACK and the code from Numerical Recipes. More interestingly, FFTW is competitive with or better than proprietary, highly-tuned codes such as Sun’s Performance Library and IBM’s ESSL library.

        If a "fastest" paper comes in with a good, robust benchmark and tests against the top performing and the most accepted implementations, then it would be worthwhile. And rare as hen's teeth.

  • collyw 11 years ago

    Bioinformaticians love to reinvent the wheel.

    • xaa 11 years ago

      Yes, yes they do. I think it is because of the following sequence of events: a) bioinformaticians are usually shitty programmers, therefore b) a lot of the tools are bad, at least from a programming perspective, therefore (?) c) I'll write a new one that will be "better".

searine 11 years ago

Except this is totally reinventing the wheel.

If I want to do fast genome alignment, I'll use LASTZ, which is already blazing fast and written in C. If I want to just look for homology, I'll use BLAST.

There isn't much need for improved alignment algorithms. If it's a big job, most bioinformaticians have access to clusters. If it's a small job, who cares about speed?

  • iten 11 years ago

    To be fair, to my knowledge, LASTZ, BLAST, and BLAT don't treat IUPAC ambiguous bases (K, Y, R, etc.) the way the OP was looking for. That's not to blame the tools, since they have a very good reason not to (they build an index on the target first, and treating ambiguous bases properly would increase the size of the index).

    That said, I wonder if grep wouldn't be much faster, since this program is only looking for exact matches, which are easily transformed into a regex by replacing each ambiguous nucleotide with something like (A|T|C).
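
    Something like this sketch, for instance (exact degenerate matches only, no scoring):

      // Expand IUPAC codes into regex character classes.
      var IUPAC = {
        A: 'A', C: 'C', G: 'G', T: 'T',
        R: '[AG]', Y: '[CT]', S: '[CG]', W: '[AT]', K: '[GT]', M: '[AC]',
        B: '[CGT]', D: '[AGT]', H: '[ACT]', V: '[ACG]', N: '[ACGT]'
      };

      function iupacToRegex(query) {
        return new RegExp(query.split('').map(function(c) {
          return IUPAC[c];
        }).join(''), 'g');
      }

      iupacToRegex('TRAW'); // /T[AG]A[AT]/g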

discardorama 11 years ago

I liked the exposition, but it would have been nice if the author had compared his implementation with something meatier, like Sustik-Moore [1], for instance.

[1] http://www.cs.utexas.edu/users/moore/publications/sustik-moo...

  • mbreese 11 years ago

    It would have been good to compare the JS implementation to a C implementation. Given the algorithm, porting it shouldn't be too hard. I'd suspect that the C version would perform much better, but I could be surprised.

    • keithwhor 11 years ago

      Funny anecdote: my first implementation of the algorithm was actually in Go (for concurrency's sake). It was an earlier version and wasn't as optimized as the one available here, but I found it actually performed 10% slower at matching nucleotides than the same algorithm running in V8. At the time, I attributed it to V8 "writing" better-optimized instructions than I possibly could.

    • lmm 11 years ago

      C is one of the few languages that might be more error-prone than JS. I'd be more interested to see a comparison to a safer high-performance language, e.g. Haskell.

  • keithwhor 11 years ago

    Author here. :)

    The article wasn't meant to be highly comprehensive, but if my method is indeed new, I'd be more than happy to spend more time writing an article that's a bit more technical. (To note, storing and comparing nucleotide sequences as binary strings isn't novel in and of itself. I haven't found prior evidence of the comparison method I've used, however.)
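
    For readers who haven't seen the article, the core trick is roughly this (a simplified sketch, not the library's exact layout): one bit per base, so ambiguity codes are just unions, and two nucleotides match whenever their codes share a bit.

      // One bit per base; ambiguity codes are ORs of base bits.
      var NT4 = {
        'A': 0x1, 'C': 0x2, 'G': 0x4, 'T': 0x8,
        'W': 0x9,  // A or T
        'S': 0x6,  // C or G
        'N': 0xF,  // any base
        '-': 0x0   // gap: never matches anything
      };

      function ntMatch(a, b) {
        return (NT4[a] & NT4[b]) !== 0;
      }

      ntMatch('W', 'T'); // true: W covers A and T
      ntMatch('W', 'G'); // false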

    • DannyBee 11 years ago

      The method you are using for comparison is commonly SIMDized and used for string matching.

      Note also that Intel/AMD SSE4+ has a 32-bit/64-bit popcnt instruction with 3-cycle latency / 1-cycle throughput (for both the 32-bit and 64-bit versions), and so is faster for counting bits/matches than any of the methods you are using :)

      • keithwhor 11 years ago

        Nice! Would love to see it implemented, though that popcnt instruction isn't accessible via JavaScript as far as I'm aware. ;)
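
        For the curious, the usual pure-JS stand-in is the branch-free "SWAR" popcount, something like this sketch:

          function popcount32(x) {
            x = x - ((x >> 1) & 0x55555555);          // per-pair counts
            x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // per-nibble
            x = (x + (x >> 4)) & 0x0f0f0f0f;          // per-byte
            return (x + (x >> 8) + (x >> 16) + (x >> 24)) & 0x3f;
          }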

        • DannyBee 11 years ago

          Also note: your while loop isn't 0 cycles; in fact, it's probably worse than the bit-twiddling, depending on sparseness.

          This is because the test and branch always happens, and is essentially guaranteed to be mispredicted a lot.

          These branch mispredictions are likely to cost a lot more than the n cycles it theoretically saves.
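
          For concreteness, the loop in question is presumably Kernighan-style bit clearing, along these lines:

            function popcountKernighan(x) {
              var count = 0;
              while (x !== 0) {  // one hard-to-predict branch per set bit
                x &= x - 1;      // clears the lowest set bit
                count++;
              }
              return count;
            }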

    • dekhn 11 years ago

      Doesn't BLAT do that? (2bit format)

      • mbreese 11 years ago

        BLAT doesn't use a 4-bit format for ambiguous bases, but it's a common enough optimization. I've done it before in private aligners... and I'm sure I'm not the first.

        The problem with the 4-bit format is that ambiguous bases are so rare in a full reference that it's more efficient to handle them in a special code path than to waste 2 extra bits per base.

        The 2-bit format stores only ACGT, with regions of 'N's stored separately (since N's are mainly contiguous in the reference genomes).

        http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_...
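
        As a sketch of that layout (the encoding values here are my own assumptions, not BLAT's actual code):

          // 2-bit packing: ACGT only, with runs of N's stored out of band.
          var CODE = {A: 0, C: 1, G: 2, T: 3};

          function pack2bit(seq) {
            var packed = new Uint8Array(Math.ceil(seq.length / 4));
            var nRuns = [];    // [start, length] intervals of N's
            var runStart = -1;
            for (var i = 0; i < seq.length; i++) {
              if (seq[i] === 'N') {
                if (runStart < 0) runStart = i;
                continue;      // slot stays 0; the interval marks it invalid
              }
              if (runStart >= 0) { nRuns.push([runStart, i - runStart]); runStart = -1; }
              packed[i >> 2] |= CODE[seq[i]] << ((i & 3) * 2);
            }
            if (runStart >= 0) nRuns.push([runStart, seq.length - runStart]);
            return {packed: packed, nRuns: nRuns, length: seq.length};
          }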

        • keithwhor 11 years ago

          You're right about the rarity of ambiguous bases in reference sequences, but potential search sequences are important as well.

          The reason I started development on this to begin with was because I found myself banging my head against a wall when trying to use BLAST to identify sequences that matched a rather loosely-defined DNA-binding consensus sequence. The potential combinations of nucleotides multiply exponentially when you have degenerate nucleotides in your search query.

          It wasn't the reference I was concerned with, it was the sequence I was searching for. :)

          • mbreese 11 years ago

            If you're looking for a DNA binding sequence, you might be using the wrong algorithm here... Have you thought of using a position weight matrix (PWM)[1]? That's the way that I've always searched for binding sites, since that's how motifs are usually described. You can use these to still match ambiguous sequences, but favor some bases more than others.

            [1] http://en.wikipedia.org/wiki/Position_weight_matrix
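
            As a toy sketch of the idea (real PWMs usually score log-odds against a background model; raw frequencies are used here for brevity, and the weights are made up):

              // Each row: per-base weights for one motif position.
              var pwm = [
                {A: 0.6,  C: 0.0,  G: 0.0,  T: 0.4},   // mostly A, sometimes T
                {A: 0.0,  C: 0.1,  G: 0.9,  T: 0.0},   // almost always G
                {A: 0.25, C: 0.25, G: 0.25, T: 0.25}   // uninformative
              ];

              function pwmScan(genome, pwm, threshold) {
                var hits = [];
                for (var i = 0; i + pwm.length <= genome.length; i++) {
                  var score = 0;
                  for (var j = 0; j < pwm.length; j++) {
                    score += pwm[j][genome[i + j]] || 0;
                  }
                  if (score >= threshold) hits.push({offset: i, score: score});
                }
                return hits;
              }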

            • keithwhor 11 years ago

              A position weight matrix would certainly work, but it can't be implemented to run as quickly. If a nucleotide could be represented as 0.6A and 0.4T, for example, representing it as "W" is nearly as accurate and can be compared using bit operations far more quickly.

              Think about a length-20 sequence: if only 5 in 100 identified high-affinity 20-mers contained a "C" at position 0, and the others all contained a "G", do I really care about the weight matching, or can I approximate that position as a "G" and still get roughly the same results?

              (Though it would be interesting to apply a PWM to the top [x] results of this algorithm, once completed, to specify exact rank.)

              • mbreese 11 years ago

                If you're going to go that route, you might as well just do a hashed aligner with bitmasked searching. Then, after the fact, you can do the PWM calculation. This is how the BFAST aligner works...

                It really depends on how degenerate your motif really is...

                However, once you start adding in the probabilities, I think that it might be better to do the proper calculation across the board.

                Without knowing what you're actually looking for, it's hard to pinpoint the optimal algorithm. The 4-bit optimization is a common choice for ambiguous sequences, so that might be a good place to start (and as a bonus, the revcomp can be a simple bitstring reversal if done correctly). But I have my doubts. Hell, given what you've said you're trying to do, a well-formed regex might even work just as well. :)

                • keithwhor 11 years ago

                  While a regex could replace the conditional check in the "initial" (string-only) implementation to allow degenerate nucleotide matching, it would certainly not speed it up there.

                  Additionally, I'm not sure how you'd intend to do the exhaustive alignment checks / scoring with a regex. (I want to find all sequences that match my 20-mer with an identity score of at least 14/20: what's my well-formed regex for that? There are 38,760 different ways to choose 14 of 20 nucleotides.)

                  Plus, there's the added overhead of having to come up with a well-formed regex to begin with. :)

          • iten 11 years ago

            What proportion of bases in your query were ambiguous? If it was a fairly low percentage (~10-20%?), I think you could probably get away with using LASTZ, treating all ambiguous bases as completely ambiguous (--ambiguous=iupac), then write a short script to go through the alignments afterward and filter out ones that don't have a reference base that matches the IUPAC character in the query.

            But that's the lazy approach :) Good article.

            • keithwhor 11 years ago

              Thanks. :)

              Sometimes reinventing a wheel (or at the very least a spoke or two) can be fun and provide some neat results.

      • keithwhor 11 years ago

        Degenerate nucleotide matching (using this method) is only possible with 4 bits of data.

        Whether BLAT uses this method for counting matches, I don't know. The algorithm relies on storing nucleotides in 4 bits, so I doubt it uses this exact method.

        • dekhn 11 years ago

          Actually, given Huffman coding, you can do it with fewer than 4 bits, because the degenerate codes are so rare.

          • keithwhor 11 years ago

            Nonstandard nucleotide widths (in bits) kill the performance of the algorithm as described here; rather, they make it completely unsuitable. It depends on fixed-width binary data.

AdamTReineke 11 years ago

I'm not familiar enough with the asm.js spec [0] to know how hard it would be, but I'm curious whether there would be a perf improvement if you could get it to run in that mode. Nice write-up!

[0] http://asmjs.org/spec/latest/
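
For reference, an asm.js-flavored kernel looks roughly like this; the |0 coercions double as type annotations. (A sketch: I haven't checked it against the validator, though it runs as ordinary JS either way.)

  function Kernel(stdlib, foreign, heap) {
    "use asm";
    var HEAP8 = new stdlib.Uint8Array(heap);
    // Count positions where two 4-bit-encoded sequences share a base bit.
    function countMatches(a, b, len) {
      a = a | 0; b = b | 0; len = len | 0;
      var i = 0, n = 0;
      for (i = 0; (i | 0) < (len | 0); i = (i + 1) | 0) {
        if (HEAP8[(a + i) >> 0] & HEAP8[(b + i) >> 0]) n = (n + 1) | 0;
      }
      return n | 0;
    }
    return {countMatches: countMatches};
  }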

Ultimatt 11 years ago

Not sure I get why the author of the article didn't just use the well-known Bitap algorithm: http://en.wikipedia.org/wiki/Bitap_algorithm
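
For anyone unfamiliar, the exact-match (shift-AND) form of Bitap is only a few lines; this sketch limits patterns to 31 characters so the state fits in one 32-bit integer:

  function bitapSearch(text, pattern) {
    var m = pattern.length;
    if (m < 1 || m > 31) throw new Error('pattern length must be 1..31');
    var masks = {};  // per-character bitmask of pattern positions
    for (var i = 0; i < m; i++) {
      masks[pattern[i]] = (masks[pattern[i]] || 0) | (1 << i);
    }
    var state = 0, hits = [];
    for (var j = 0; j < text.length; j++) {
      state = ((state << 1) | 1) & (masks[text[j]] || 0);
      if (state & (1 << (m - 1))) hits.push(j - m + 1);  // match ends at j
    }
    return hits;
  }

For IUPAC queries, you would OR each degenerate position into the mask of every base it covers.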

hashtree 11 years ago

With io.js coming onto the scene, I've recently revisited my thoughts on using JavaScript for scientific/statistical computing. With ES6 and V8's TurboFan coming down the pipe, and enough support for serious functional programming (e.g. Fantasy Land, transducers, Ramda), I've decided to shift from Scala/Haskell/R to JavaScript. I think there is a bright future in these areas and a lot of room for devs to make a name for themselves doing this kind of development. Judging from the libs I have rewritten in JS to date, R is particularly easy to beat on the performance front.

evanpw 11 years ago

I thought Atwood's Law [1] was a joke, but this is seriously interesting. I remember when Java was just a fun little toy language for making web pages, and I'm still surprised every time someone mentions it as a serious number-crunching platform. (I've heard that the entire Nasdaq matching engine is a single-threaded Java application.) Is JavaScript going to end up the same way?

[1] http://blog.codinghorror.com/the-principle-of-least-power/

  • pimlottc 11 years ago

    For what it's worth, the Chicago Mercantile Exchange's matching engine is written in Java as well.

jerven 11 years ago

Super neat JS work! I have similar code for doing GC/AT counts in Java. Using longs and popcount, one can do 16 nucleotides per cycle. It's also nice that one can easily use a memory-mapped byte buffer as a long buffer to run the code over. To go faster than that, AVX needs to be used / JITted in.

What it really shows is that the FASTA format is just terrible for computational efficiency :(

https://gist.github.com/JervenBolleman/d1430d0549028861504c
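
The same trick ports to JS on 32-bit words. A sketch, assuming the 4-bit encoding A=0001, C=0010, G=0100, T=1000 (eight nucleotides per Uint32 element; the gist above uses Java longs instead):

  function countGC(words) {       // words: Uint32Array of packed nucleotides
    var GC_MASK = 0x66666666;     // the C and G bits of all eight 4-bit slots
    var total = 0;
    for (var i = 0; i < words.length; i++) {
      var x = words[i] & GC_MASK; // exactly one set bit per G or C
      x = x - ((x >> 1) & 0x55555555);
      x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
      x = (x + (x >> 4)) & 0x0f0f0f0f;
      total += (x + (x >> 8) + (x >> 16) + (x >> 24)) & 0x3f;
    }
    return total;
  }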

Rhapso 11 years ago

For comparison, this is the commonly used BLAST alignment tool: http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearc...

daemonk 11 years ago

Cool. I saw this the other day when you posted it at BioStar (I am Damian Kao at BioStar). The next step is to establish some alignment metrics (some kind of p-value).

  • keithwhor 11 years ago

    I've actually started work on this. It's available in the README of the NtSeq repository [1], near the bottom. The probability model becomes a mess (or at least, I can't figure out the correct one) when you start including degenerate nucleotides, and I didn't want to release a half-done statistical analysis tool, so I axed it from the current release.

    [1] https://github.com/keithwhor/NtSeq

tresil 11 years ago

Very exciting work and a fantastic write up! Looking forward to giving this a shot.

yatoomy 11 years ago

Loving the rise of JS algs. Keep it up!
