How Perl Saved the Human Genome Project

drdobbs.com

40 points by p3ll0n 15 years ago · 27 comments

draegtun 15 years ago

Slightly related video presentations: "Curing Cancer with Perl" by David Dooling of the Washington University Genome Center:

* part 1 - http://blip.tv/file/1997719/

* part 2 - http://blip.tv/file/1998152

* part 3 - http://blip.tv/file/2000983/

  • ben1040 15 years ago

    Wow, I filmed that video!

    I cannot speak officially for the Genome Center, but I'll throw out there that the ORM that powers much of the GC's analysis platform is out on Github and CPAN.

    It's actually more than an ORM in that it also supports features like automated creation/smart rewriting of class files based on database tables, quick and easy command modules that get turned into hierarchical command-line tools for free, and an automated test harness that can even parallelize onto an LSF cluster if you've got one.

    Github http://github.com/sakoht/UR

    CPAN w/ documentation http://search.cpan.org/dist/UR/lib/UR.pm

westbywest 15 years ago

Many moons ago, I worked on an FPGA-based platform that was among several research projects targeted at the Genome Project. The general idea was to offload BLAST-style sequence alignment to purpose-compiled FPGAs, such that sequencing across the entire dataset could be performed in an order of magnitude less time. It really wasn't all that complex (I just implemented Smith-Waterman directly, as a demonstration), only intended to perform fuzzy matches at Gbps speeds to winnow the working dataset down to a size more palatable to a desktop workstation.

My understanding is that all these projects (mine included) were cast adrift when the funding for them evaporated in the post-9/11 climate. In the intervening years, I was aware that Perl was being picked up rapidly at the Genomics labs in the nearby university hospital (i.e. since we never delivered them the FPGA platform), and I'm happy to read Perl has risen to fill this niche.
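
For readers unfamiliar with the algorithm mentioned above, here is a minimal software sketch of Smith-Waterman local alignment scoring. It is not the FPGA implementation described in the comment; the function name and the scoring values (match 2, mismatch -1, linear gap -1) are assumptions chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not taken from the thread.
    """
    # prev is the previous row of the scoring matrix; row 0 is all zeros.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Open or extend a gap in either sequence.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Clamping at zero is what makes the alignment local.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTGATTACATT", "CCGATTACACC"))
```

Only two rows are kept in memory because this sketch reports the score alone; recovering the aligned region would require keeping the full matrix or traceback pointers.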

dstorrs 15 years ago

The part that made me smile was when he said "In all, between one and TERAbytes of data would generated!!!!" [exaggerated emphasis mine]

I've got 3-4 terabytes of storage within a dozen feet of me as I type this; it really drives home the pace of change in computing.

pasbesoin 15 years ago

In the same vein, though sometimes with less detail:

http://oreilly.com/pub/a/oreilly/perl/news/success_stories.h...

O'Reilly also published some of these in at least two folded/stapled pamphlets that were handed out for free e.g. at conferences. I recall a finance-centered application where the Perl prototype far outperformed the subsequent implementation and ended up taking over the production role.

It looks like maintenance at that URL stopped in about 2004, but in googling "perl success stories" I saw a few more recent articles that might qualify.

blahedo 15 years ago

(1997)

hackermom 15 years ago

Alternate title: How it happened to be Perl instead of any other just as capable language that saved the Human Genome Project (in the land of Dangling Participles and Allusion Errors).

  • jbert 15 years ago

    Python, perl and ruby are roughly the same language. The differences between them are primarily cultural, rather than technical.

    I suspect the reason perl flourished here was a combination of luck and the cultural fit. Culture here includes the newbie-friendly online help (e.g. perlmonks), the ease of "publish and re-use components" (CPAN).

    • adorton 15 years ago

      Also, remember that when the project started, Python and Ruby didn't exist yet. Perl still wasn't the only dynamic scripting language on the block, but it was probably the most mature and best-suited to this problem domain.

      I wonder if perl would still be used if the project was started today.

      • c1sc0 15 years ago

        Perl was the only language on the block with strong built-in text-processing capabilities. For many a biologist the Camel book was the only programming book they read before moving on to solve real biological problems instead of fiddling with programs.

        • elblanco 15 years ago

          It's also performant enough that it wasn't worth the time to learn a faster performing language.

        • sprout 15 years ago

          From the bit of Bioinformatics work I did in college, that actually seemed a problem rather than a boon. Genes are not ascii sequences, and Perl is not really made to manipulate them.

          Perl is however an excellent scripting language, and though it's ugly, it's just as easy as Python to pick up and use. Thus its danger.

          • elblanco 15 years ago

            Can you elaborate a bit on "genes are not ascii sequences"? My understanding is that genes are regularly computer stored as ascii sequences e.g. GATTACA.

            • sprout 15 years ago

              Yes. I think I came off as more negative towards Perl than I meant to. The point is that they are far simpler than ascii. They are quaternary data, and as such Regexes aren't really a good tool for modifying them any more than Regexes are a good tool for modifying raw binary, even if your binary is stored as a string of ascii characters.

              There are other facilities of Perl though, its quick scripting in general, that make it as good a tool as any other, as long as you don't get sucked into the idea that regexes are a good tool here.

              • elblanco 15 years ago

                Thanks for the reply. That's definitely true. I'd imagine a gene sequencing system using two bit encoding would allow for the gene data to be more compact and more performant in many cases.
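
A rough sketch of the two-bit encoding idea from the exchange above, packing four bases into each byte. The bit assignments and the names `pack` and `unpack` are hypothetical choices for illustration; the sketch handles only A, C, G and T, so ambiguity codes such as N would raise a `KeyError`.

```python
# Hypothetical bit assignment: any fixed mapping of the four bases works.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Each base occupies two bits; the first base sits in the low bits.
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Recover the original string; length is needed because of padding."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(length)
    )


if __name__ == "__main__":
    packed = pack("GATTACA")
    print(len(packed), unpack(packed, 7))
```

The sequence length has to be stored alongside the packed bytes, since the final byte may be padded with zero bits that would otherwise decode as extra A bases.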

      • draegtun 15 years ago

        Python & Ruby did exist; however, their VMs were nowhere near as reliable and fast as Perl at that time. Another viable option would have been Tcl but, like you say, Perl was probably better suited.

        "I wonder if perl would still be used if the project was started today."

        I think so. But would that be perl5 or perl6? :)

  • draegtun 15 years ago

    I think a better alternate title would be: How it happened a dynamic language was used to save the Human Genome Project

    Because at the time Perl was probably the only capable dynamic/scripting language.

p3ll0n (OP) 15 years ago

In addition to Lincoln's thoughts I think one of the main reasons bioinformaticians are attracted to Perl is because it is forgiving. Biological data is often incomplete, fields can be missing, or a field that is expected to be present once occurs several times (because, for example, an experiment was run in duplicate), or the data was entered by hand and doesn't quite fit the expected format. Perl doesn't particularly mind if a value is empty or contains odd characters. Regular expressions can be written to pick up and correct a variety of common errors in data entry. Of course this flexibility can also be a curse.
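
To make the "forgiving" style of parsing concrete, here is a small sketch in Python rather than Perl. The `key: value; key: value` record layout, the field names, and both function names are invented for illustration; they do not come from the article or from any real genome data format.

```python
import re
from collections import defaultdict

# Hypothetical record layout: "key: value; key = value; ..." on one line.
FIELD = re.compile(r"\s*(\w+)\s*[:=]\s*([^;]*)")


def parse_record(line):
    """Parse one record into a dict mapping each key to a list of values."""
    fields = defaultdict(list)
    for key, value in FIELD.findall(line):
        value = value.strip()
        if value:  # tolerate empty fields instead of failing
            # A repeated key (e.g. a replicate run) simply adds a value.
            fields[key.lower()].append(value)
    return dict(fields)


def clean_sequence(raw):
    """Uppercase a hand-typed sequence and drop anything but A, C, G, T, N."""
    return re.sub(r"[^ACGTN]", "", raw.upper())


if __name__ == "__main__":
    print(parse_record("ID: s1; expr=1.2; expr = 1.4; note:"))
    print(clean_sequence("gat tac-a"))
```

Silently dropping unexpected characters is exactly the kind of flexibility that can turn into a curse: a real pipeline would probably want to log what was discarded.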

  • chronomex 15 years ago

    A paragraph very similar to this one occurs in the article.

    • pjscott 15 years ago

      From the article:

      "Perl is forgiving. Biological data is often incomplete, fields can be missing, a field that is expected to be present once occurs several times (because, for example, an experiment was run in triplicate) or the data gets entered by hand and doesn't quite fit the expected format. Perl doesn't particularly mind if a value is empty or contains odd characters. Regular expressions can be written to detect and correct a variety of common errors in data entry. Of course, this flexibility can also be a curse, as I'll discuss in more detail later."

      A few words are different. The article says triplicate, and p3ll0n says duplicate, for example. But they are similar enough to use as testing input to a diff algorithm.
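
As a quick illustration of the diff idea, the sketch below runs a word-level comparison over two sentence pairs quoted in this thread, using Python's standard `difflib`. The helper name `word_changes` is made up for this example.

```python
import difflib


def word_changes(a, b):
    """Return (old, new) word spans that differ between two texts."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    return [
        (" ".join(a_words[i1:i2]), " ".join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    comment = ("Regular expressions can be written to pick up and correct "
               "a variety of common errors in data entry.")
    article = ("Regular expressions can be written to detect and correct "
               "a variety of common errors in data entry.")
    print(word_changes(comment, article))
    print(word_changes("an experiment was run in duplicate",
                       "an experiment was run in triplicate"))
```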

      EDIT: Also from this guy's comment history:

      http://news.ycombinator.com/item?id=1456105

      Some of the phrasing looks to have been copied and pasted from this article by Jonathan Ellis:

      http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosyste...

      I bet if you could make a bot to do this -- go out and find relevant information, and summarize it -- you could actually provide a serious public service. As long as you cited your sources, so it's not a plagiarism-bot.

      • chromatic 15 years ago

        That bot is easy to write in Perl! I have a document summarizer written already.
