A collaborative residency program in mathematical biology and deep learning

topos.house

52 points by szany 10 years ago · 12 comments

noname123 10 years ago

Hi, I unfortunately cannot apply to a FT residency program as I have a FT job (as I suspect most people here do).

However, I do have an interest in Bioinformatics/Computational Biology. Unfortunately, I find that most textbooks that cover these topics are very theoretical, e.g., going over how Blast algorithm works, explanations of NGS assembly and annotation.

I'm curious whether any peeps in the field have simple learning projects/practical papers for a programmer interested in Bioinformatics?

Things I'm thinking of are, (would like to hear from the experts any hints, tips or links):

1) Downloading a GWAS (Genome Wide Association) data-set for a psychiatric or cancer disease and replicating the statistical techniques that scientists used to determine the clusters of responsible genes (e.g., http://www.nature.com/nature/journal/v511/n7510/full/nature1...)

2) Downloading different strains of Cannabis Indica and Sativa genome draft assemblies; building an initial phylogeny tree and running genome annotation pipelines to identify known genes; and using RNASeq data to isolate the potencies of different varieties of Cannabis and metabolic pathways to produce hemp (e.g., http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3359589/)

3) Downloading data-set of RNASeq data from synthetic biology experiments involving yeast and replicating the statistical technique that authors used to determine the gene regulators/promoters used to program the cells to produce plastic, fuel and drugs for them! (http://www.nature.com/nature/journal/v440/n7086/full/nature0...)

  • joe_the_user 10 years ago

    I think Bioinformatics is a challenging field because, as the residency alludes to, it is aiming to marry different fields that don't necessarily talk to each other. Lior Pachter has blogged about this [1]. Pachter's book, Algebraic Statistics for Computational Biology, is fairly accessible.

    [1] https://liorpachter.wordpress.com/2014/12/30/the-two-culture...

  • szanyOP 10 years ago

    This isn't full-time — one of the goals is explicitly "to complement rather than conflict with the tenure of a full-time position elsewhere".

  • searine 10 years ago

    >Things I'm thinking of are, (would like to hear from the experts any hints, tips or links):

    I mean, if you just want to replicate what they did, that's fine, but it kind of misses the point of it all. Science is about finding something new.

    Why spin your wheels redoing someone else's work, when you can push into new territory? Science is all about having a question, and then just "learn by doing" until you find an answer.

    >Things I'm thinking of are

    GWAS data is messy, and usually has already been picked clean by the time it has been published. They do get pretty sophisticated with the stats analysis though.

    The Cannabis reference sequence is awful (I should know, I was the first person to make an assembly). Plants have really really complex genomes, and there isn't the research interest to improve the references. This, combined with the fact there is almost zero available RNAseq data would make this project difficult. Yes, there is some RNAseq data, but to separate signal from noise you'd need several technical replicates, which we don't have. As for phylogeny, the macro phylogeny is pretty well known from 16s sequence, and we don't have enough individuals to tease out interesting population structure. Genome annotation, in the absence of molecular work, is always boring because it can only find what is already known from other organisms.

    Almost all of that experiment you linked was molecular work. They PCR'd up some genes and transformed them into yeast. They didn't use stats to identify genes because the genes were already well known and sequenced. After that it was just a matter of putting the genes in front of a high-output promoter, and that's that. If you had RNAseq data for that experiment, which we don't, you wouldn't be able to see the difference in regulation because the transgene isn't part of the yeast reference genome. You could do a de novo transcriptome assembly, and then it would probably turn up, but it would be hidden among 5,000 other novel/incomplete transcripts. This is all beside the point, because if you know the gene you transformed, then why do RNAseq at all? A Q-PCR would give you a more accurate result at 1/100th the cost.

    Sorry to shoot down all your ideas, I think your initiative is great. There is A LOT of sequencing data out there, I am sure you can find a set with some fruit still on it.

    Maybe start thinking about questions that are important to you, personally. What organisms are important to you? What resources are available on the Sequence Read Archive? What question has yet to be answered about the organism?

    If you want someone to bounce ideas off of, I'm happy to help.

    • noname123 10 years ago

      Thank you searine for your thoughtful and helpful reply to my post.

      I will take your suggestion to "practice science" and follow up on the NCBI SRA website and look up some raw data upload; and see what piques my interest.

      A readily available data source I see on NCBI is a genetic surveillance project of Mycobacterium tuberculosis around the world.

      There are three separate metagenomics/sequencing projects of M. tuberculosis samples taking place in Japan, Nepal and Malawi. http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&...

      A very (vague) question I'd like to answer is: Are the SNPs identified in these separate geographical studies distinct or similar? Do different regions give way to distinct strains of TB? Do the metabolic pathways identified from these different strains of TB explain different viral adaptations to the host and its environment? I'm sure my posed questions are either already answered in literature, not posed properly and not feasible. But that'll be my starting point.

      Much appreciated again!

      • searine 10 years ago

        A couple of general tips: 1. For second-wave bioinformatic analysis, stick to organisms that have a (semi) decent reference sequence. It removes untold amounts of noise from your eventual analysis. M. tuberculosis, for example, is a good candidate as it has an excellent reference.

        2. Human diseases always have tons of data!

        >Are the SNPs identified in these separate geographical studies distinct or similar?

        Human pathogens almost always show strong population structure that correlates with human migration.

        That means distinct Old World/New World/Oceania varieties characterized by differing patterns of variation. They also show recent population bottlenecks that correspond to the out-of-Africa migration and the Columbian expansion. (http://www.nature.com/ng/journal/v45/n10/full/ng.2744.html)

        These kinds of studies usually sequence a bunch of individuals, and then run RAxML to make a maximum likelihood tree, and run a principal component analysis to further characterize the population structure. This is usually followed by some first-pass selection scans to identify genes of interest.

        The problem is that this is all "low hanging fruit", and it is always picked by the people who did (and paid for) the sequencing.

        > Do the metabolic pathways identified from these different strains of TB explain different viral adaptations to the host and its environment?

        Metabolism is an essential feature to fitness, it rarely changes at the population level. If metabolic genes do change, they are usually secondary metabolism. Central metabolism is extremely conserved.

        In human pathogens the things evolving are (in order of magnitude) :

        1. Resistance to antibiotics/antivirals/antiparasitic. This is huge. It creates MASSIVE selective sweeps.

        2. Host/pathogen interface. Classic red-queen hypothesis. Antigen genes are rapidly evolving to evade host defence and maintain ability for invasion.

        3. Transmission. Virulence factors which enhance the fitness of the pathogen.

        > I'm sure my posed questions are either already answered in literature, not posed properly and not feasible. But that'll be my starting point.

        I think, with a second-wave analysis, you need to gear your questions to the datasets you can get your hands on.

        Start by picking an organism. Then research how many individual genomes from that species you can get. Then design an experiment around that.

        A classic "second wave" bioinformatic analysis uses two reference genomes, and a sequenced population of individuals from one of the two species. The experiment performs a simple statistical test to measure the accrual of non-synonymous (NS) substitutions within a species, and compares it to the accrual of NS substitutions between species. Using these rates you can infer evolutionary adaptation/neutrality/conservation. It's called the McDonald-Kreitman test, here is a web implementation for a single-gene example http://mkt.uab.es/mkt/help_mkt.asp <-- try doing this by hand by pulling protein sequences from online databases.
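
        A minimal sketch of the counting step behind that test, assuming aligned, in-frame coding DNA rather than protein (synonymous changes are invisible at the protein level). The function name `count_differences` and the one-change-per-codon simplification are illustrative assumptions, not taken from the linked web implementation:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_differences(seq_a, seq_b):
    """Return (nonsynonymous, synonymous) codon differences between two
    aligned, in-frame coding sequences of equal length.

    Simplification (assumption): a codon that differs counts as one change,
    even if more than one of its three sites differs.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal length, and in frame")
    nonsyn = syn = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip gapped or ambiguous codons
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return nonsyn, syn


# Made-up toy sequences: GCT->GCC keeps Ala, AAA->AGA changes Lys to Arg.
result = count_differences("ATGGCTAAA", "ATGGCCAGA")
```

        Comparing the two references this way gives the between-species counts; comparing individuals within the population gives the within-species counts.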

        It's a simple algorithm, that can easily be scaled to a genome wide scan for adaptation (followed by an FDR correction for multiple testing!!!!!!).
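
        A rough sketch of the single-gene test, assuming the four counts (fixed and polymorphic, NS and synonymous) are already in hand. It uses a two-sided Fisher's exact test on the 2x2 table, which is one common choice, and the usual alpha = 1 - (Ds * Pn) / (Dn * Ps). The name `mk_test` and the example counts are hypothetical, for illustration only:

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is as likely or less likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def mk_test(dn, ds, pn, ps):
    """McDonald-Kreitman test for one gene.

    dn, ds: fixed NS / synonymous differences between species.
    pn, ps: NS / synonymous polymorphisms within the species.
    Returns (p_value, alpha); alpha is None when it is undefined.
    """
    p_value = fisher_exact_two_sided(dn, ds, pn, ps)
    alpha = 1 - (ds * pn) / (dn * ps) if dn > 0 and ps > 0 else None
    return p_value, alpha


# Illustrative counts only, not taken from any dataset in this thread.
p_value, alpha = mk_test(dn=7, ds=17, pn=2, ps=42)
```

        For a genome-wide scan you would call this once per one-to-one ortholog and keep the p-values for multiple-testing correction.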

        The experiment for M. tuberculosis would go like this :

        1. Either find an existing set of ortholog predictions between M. tuberculosis and some other closely related species with a reference genome (preferably same-genus), or calculate them yourself with orthoMCL. Filter this to exclude everything BUT one-to-one orthologs. You can only run the MK test on one-to-one orthologs.

        2. Extract out the protein sequence (gene->mRNA->translated protein) for each annotated gene in each of the two reference files. A GFF file (gene feature file) should be available for each reference telling you the coordinates of each gene so you can get the exons and translate them to make a protein. If there isn't a gff, you can predict genes using genome annotation software. Match protein from genome A, to its ortholog in genome B, using the data from step 1, and put the two sequences into a single fasta file. Do this for all single copy genes.

        3. Pull a population of 10 or so individual M. tuberculosis genomes from SRA. Align the reads to the M. tuberculosis reference. Pull out proteins like before. Match them to the ortholog sequences from steps 1 and 2, so everything is all in one file.

        4. Take your fasta file which contains the M. tuberculosis reference, the second reference, and the 10 or so individuals, and align each protein file using Muscle (an alignment program, look it up).

        5. Implement an MK test that can run on your aligned fasta file. Spot check it against the online single-gene implementation I linked.

        6. Take your resulting p-values and perform an FDR correction to account for testing thousands of genes within the genome.

        7. Calculate alpha (it's on the MKT wiki article), and determine which of your significant genes have a positive value of alpha (because positive alpha means an excess of between-species NS substitutions relative to within-species polymorphism == positive selection == interesting).

        8. Congrats, you have done the science and learned about what's changing within the M. tuberculosis species.
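
        The FDR correction step above can be sketched as follows, assuming Benjamini-Hochberg is the procedure meant (the comment does not name one). The function name `benjamini_hochberg` and the example p-values are hypothetical:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up per-gene p-values, one per one-to-one ortholog tested.
gene_p = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.03, "geneD": 0.005}
names = list(gene_p)
q_values = dict(zip(names, benjamini_hochberg([gene_p[n] for n in names])))
significant = [n for n in names if q_values[n] <= 0.05]
```

        Genes that pass the chosen threshold are the ones worth checking for positive alpha.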

    • heurist 10 years ago

      > Why spin your wheels redoing someone else's work, when you can push into new territory?

      Verification of results?

iskander 10 years ago

It seems a little unusual to see a "residency" which makes you pay for housing ($2400 / month).

Is it unfair to assume they're having trouble paying rent at the Topos house?

  • szanyOP 10 years ago

    Thanks for raising that question. Actually the house was rented for the purpose of hosting the program, not the other way around.

    Of course we're trying not to operate at a loss, but we're willing to accept that if necessary to accomplish the stated goals.

    • iskander 10 years ago

      >Actually the house was rented for the purpose of hosting the program, not the other way around.

      From the CFP I got the impression that the residency is hosted in the same place you all are currently living:

      >live for four months in a 10,000 ft² mansion in San Francisco (of which the organizers are long-term residents)

      Is that not the case?
