Saturday, March 21, 2009

Do you choose the research, or does the research choose you?

Note: I give advance warning to my non-biologist readers that the next few paragraphs contain a good dose of biology. While I have attempted to keep it conversational, if you feel your eyes glazing over, skip down a few paragraphs for the real meat.

[Image: to the air]

Yesterday (Friday) I attended a talk by Susan Gottesman about small non-coding RNAs (sRNAs) and how they are involved in regulating gene expression. At this point in her esteemed career, Gottesman's best-known work revolves around a particular Escherichia coli sigma factor—a protein that directs transcription—called RpoS. RpoS switches on genes that the cell needs under stressful conditions, such as growth at low temperatures. Now, the RpoS protein only appears in E. coli cells under low temperature conditions, but mysteriously (or so it was), the gene that encodes RpoS gets transcribed even when the cells are growing at a comfortable temperature.

As Gottesman's lab discovered, the mRNA for RpoS can actually bend back around and stick to itself such that ribosomes aren't able to bind to the mRNA and translate it into the RpoS protein. An sRNA called DsrA, however, which is expressed in low temperature conditions, binds to part of the RpoS mRNA, preventing the mRNA from folding back on itself, and giving ribosomes access to the transcript to translate it into RpoS protein. Why is this important?

Well, previously, sRNAs had only been thought to inhibit translation and prevent proteins from appearing. That is, sRNAs usually inhibit the expression of a protein, so if you found an sRNA, you would bet that its target protein would vanish whenever the sRNA appeared. Add the sRNA and the protein won't be found in the cell; take the sRNA away, and the protein re-appears. The Gottesman lab, however, demonstrated a case where the sRNA is actually responsible for making the protein appear. That is, when the sRNA DsrA appears, its target, RpoS, appears too; and if you take away DsrA, the protein goes away, too! Craziness! In Biology, we call this a paradigm shift. Paradigm shifts are "big deals", because Biology is all about figuring out the rules, and then identifying the exceptions so we have to re-write the rules. Biology is the science of exceptions.

The story continues, but I'll leave it to the reader to check out Gottesman's publications for more, because as much as I liked the story of her research, what I found most interesting about the talk was this side comment that she made towards the end, which I paraphrase here:

We published this work with RpoS, but then we wanted to work in other directions. We'd try something, then discover we couldn't go in that direction because we needed to know something else about RpoS. Then we'd attempt something else, but again, it would always come back to RpoS. Finally we just said, "Forget it! Fine! We'll just study RpoS. Clearly there's enough here to work on for a while."
I don't know if my fellow grad students in the audience caught the subtle significance of this statement, or if perhaps I was the only person who found it significant. What Gottesman said, in more words, is that she didn't really choose her research; her research chose her. Yet, in spite of spending her career in an area she never intended to stay in, once she realized she was mired in it, she made the best of it, leading to great scientific contributions and earning her accolades and prestige that even the most jaded of us junior researchers catch ourselves fantasizing about from time to time.

I find this significant because, also from time to time, I wonder how the researchers, and even the peers, whom I have come to admire wound up doing the research that they're doing. In my earlier days, I often thought they must possess great foresight and wisdom. While I don't doubt they're clever people, the longer my tenure in research and the more people I harass to tell me about their own careers, the more I've begun to think that a lot of it just comes by chance rather than deliberate choice. We find ourselves in particular, unique positions, somewhat stuck, and somewhat stumped, and we throw up our hands and say, "Aw, Hell! I guess I might as well dig around while I'm here." We do have to make choices about where we dig, but we seem to get to choose our own particular holes about as well as seeds scattered by the winds. (Though, from time to time, we can try to ride the winds to another hole.)

I suppose that I just find it amusing that life is stochastic from the molecular level all the way up to our own grand plans. Like each of our cells, we may as well just deal with the cards we're dealt as best we can. For everything else... well, "Cast Your Fate to the Wind".

Monday, March 16, 2009

If a tree falls in a random forest: a summary of Chen and Jeong, 2009

[Image: Trees in fog w shadows]

I had to write a summary for a paper, "Sequence-based prediction of protein interaction sites with an integrative method" by Xue-wen Chen and Jong Cheol Jeong[1], for my Problem Solving in Bioinformatics course. I thought I'd share the review here on my blog, in case anybody finds it remotely useful. I doubt anyone will, but it's my blog, so there. Be forewarned, this is my, "Hey, buddy, I'm just a biologist" interpretation of their paper. If you spot any specious, misleading, or just plain incorrect statements, please, by all means, offer corrections.


Chen and Jeong have essentially found a way to apply a machine learning technique called random forests to predicting specific binding sites on proteins, given only the amino acid sequence, with greater accuracy than previously existing methods. Identification of binding sites in proteins remains an important task for both basic and applied life sciences research, for these sites make possible the protein-protein and protein-ligand interactions from which phenotypes, and indeed, the properties of life emerge. These sites also serve as important drug targets for pharmaceutical research.

Traditionally, researchers have identified binding sites through in vivo or in vitro studies involving point mutations that affect phenotypes, as well as through analysis of protein structures solved by protein crystallography. With the advent and continuous improvement of DNA sequencing technology, however, researchers contribute ever more knowledge in the form of amino acid sequences, rather than structures. Sequencing has rapidly outpaced crystallography, necessitating prediction of proteins' functional characteristics based solely on their amino acid sequences, which Chen and Jeong cite as the motivation behind the research presented in this paper.

Previous efforts to infer binding sites purely from amino acid sequence used a different machine learning method called the Support Vector Machine (SVM). I'm not entirely certain how SVMs operate, but like random forests, they require a training set of known binding sites and sites not involved in binding. One of the confounding factors about amino acid sequences fed to machine learning methods like SVMs is that the residues are unevenly distributed between the two categories; in other words, few amino acids in a sequence (1 in 9 in the dataset used by Chen and Jeong) sit at the interface of the protein and its ligand. Chen and Jeong chose random forests because they are robust against this bias in the data, which has to do with the way random forests are constructed.

To construct random forests, one must have a set of data. In Chen and Jeong's study, the set comprises amino acids belonging to 99 polypeptide chains—or chunks of proteins—culled from a protein-protein interaction set used in previous studies. One must also have a set of features, or measures, about each item in the data set. In this study, there were 1050 features (stored as vectors) for each amino acid, each falling into one of three categories: those measuring physical or chemical characteristics (e.g., hydrophobicity, isoelectric point, and propensity—which is a fancy word for saying whether an amino acid is likely to be on the surface of a protein or buried deep within it); those measuring the amino acid's minimum distance to any other given amino acid along the sequence; and the position-specific score matrix (PSSM), which reflects how likely particular amino acid substitutions are at that position.
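
To make the feature vector idea concrete, here's a schematic in Python. The sub-vector sizes and values are placeholders I invented for illustration; the paper's actual breakdown of the 1050 features is more involved:

    import numpy as np

    # Physical/chemical measures (e.g., hydrophobicity, isoelectric
    # point, surface propensity). Values here are invented.
    physicochemical = np.array([0.62, 5.48, 0.15])

    # Minimum distance along the sequence from this residue to the
    # nearest occurrence of each of the 20 amino acid types (my reading
    # of the distance features).
    min_distances = np.array([3.0, 1.0, 7.0, 2.0, 11.0, 5.0, 4.0, 9.0, 1.0, 6.0,
                              8.0, 2.0, 3.0, 10.0, 4.0, 1.0, 5.0, 12.0, 6.0, 2.0])

    # This residue's row of the position-specific score matrix (PSSM).
    pssm_row = np.array([-1.0, 2.0, 0.0, -3.0, 1.0, 0.0, -2.0, 4.0, -1.0, 0.0,
                         1.0, -2.0, 3.0, 0.0, -1.0, 2.0, 0.0, -3.0, 1.0, 0.0])

    # One residue's feature vector is the concatenation of all three.
    feature_vector = np.concatenate([physicochemical, min_distances, pssm_row])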

With this data set and features in hand, one feeds them to the random forest generator. To construct one random decision tree, follow a process like this (a toy Python sketch follows the list):

  1. Count the total number of known interface sites (we'll call these positives), and call this number N.
  2. Count the number of features available, and call this number M.
  3. Randomly select a subset of N sites out of the entire set with—and this is important—replacement. This solves the problem of the unbalanced data set. If I recall my statistics correctly (I don't), this has to do with each site now having an equal chance of influencing the training.
  4. Now we build the tree. We randomly select m features from the total M features, where m is a lot smaller than M. Then, of those m features, we choose the one which best splits the subset of sites. We continue to do this recursively until all sites have been "classified".
  5. We repeat steps 3 and 4 to construct the number of desired trees (100 in this study), which gives us our "forest" of randomly generated trees.
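
Here's that construction process as a toy Python sketch. To be clear about my assumptions: Chen and Jeong don't publish their code in the paper, so I'm leaning on scikit-learn's DecisionTreeClassifier as a stand-in for the per-tree learner, and the square-root rule for picking m is just a common default, not necessarily theirs.

    import math
    import random

    from sklearn.tree import DecisionTreeClassifier

    def grow_forest(X, y, n_trees=100, seed=42):
        """Grow a forest per the steps above. X is an (n_sites x M) NumPy
        feature matrix; y holds 1 for interface sites and 0 otherwise."""
        rng = random.Random(seed)
        n_positives = int(sum(y))        # step 1: N, the known interface sites
        n_features = X.shape[1]          # step 2: M, the available features
        m = int(math.sqrt(n_features))   # a common rule of thumb for m << M
        forest = []
        for _ in range(n_trees):         # step 5: one pass per tree
            # Step 3: draw N sites from the entire set *with* replacement.
            sample = [rng.randrange(len(y)) for _ in range(n_positives)]
            # Step 4: each split considers only m randomly chosen features.
            tree = DecisionTreeClassifier(max_features=m)
            tree.fit(X[sample], y[sample])
            forest.append(tree)
        return forest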

With the random forest constructed, you essentially feed an amino acid site into the forest: the site trickles down each tree, and each tree then "votes" on whether or not it classifies the site as an interaction site. A simple majority can be used to categorize the site, or more stringent criteria can be imposed by requiring more votes. Raising that threshold improves the confidence with which one claims a site is an interaction site (specificity), but decreases the probability of detecting interaction sites (sensitivity).
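
Continuing the toy sketch, the voting might look like this; min_votes is my knob for the stringency threshold, not a parameter name from the paper:

    def classify_site(forest, x, min_votes):
        """Tally per-tree votes for a single site's feature vector x and
        call it an interface site only if the votes reach min_votes.
        Raising min_votes trades sensitivity for specificity."""
        votes = sum(int(tree.predict(x.reshape(1, -1))[0]) for tree in forest)
        return votes >= min_votes

    # A simple majority over 100 trees:
    # is_interface = classify_site(forest, some_site_vector, min_votes=51)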

Using these measures of sensitivity and specificity in conjunction with leave-one-out studies (one polypeptide sequence is used as the test case, and the other 98 are used as training data), Chen and Jeong demonstrated that their random forests approach performed significantly better than the SVM approach used by the earlier studies. They attribute this improved performance to two things: random forests are more robust to unbalanced data sets, and their approach considered many more features than the previous studies'. When they used only the features used in the previous studies, they found decreased performance, albeit still significantly better than the previous methods'. Chen and Jeong note that a major feature of random forests is that their accuracy increases, rather than decreases, when the number of features increases, due to the random sampling.
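
For the curious, the leave-one-out procedure looks roughly like this in Python. The data structures and helper names are mine, and I'm using the textbook tallies for sensitivity and specificity, which is my reading of the measures above:

    def leave_one_out(chains, train, classify):
        """chains: a list of 99 chains, each a list of (feature_vector,
        is_interface) pairs. train() builds a forest from 98 chains;
        classify() judges one site against that forest."""
        tp = fp = tn = fn = 0
        for i, held_out in enumerate(chains):
            forest = train(chains[:i] + chains[i + 1:])  # train on the other 98
            for x, is_interface in held_out:             # test the held-out chain
                predicted = classify(forest, x)
                if predicted and is_interface:
                    tp += 1
                elif predicted:
                    fp += 1
                elif is_interface:
                    fn += 1
                else:
                    tn += 1
        sensitivity = tp / float(tp + fn)  # fraction of true interface sites found
        specificity = tn / float(tn + fp)  # fraction of non-interface sites rejected
        return sensitivity, specificity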

Chen and Jeong finished their study by predicting binding sites on the DnaK (Hsp70 in eukaryotes) chaperone system. Their predictions were corroborated by several in vivo studies of mutants, in which mutations near the predicted sites yielded changes in phenotypes for both the prokaryotic and eukaryotic forms. Their visualization of predicted interaction sites using 3D molecular modeling software provided additional support.

  1. Xue-wen Chen and Jong Cheol Jeong, "Sequence-based prediction of protein interaction sites with an integrative method," Bioinformatics 25, no. 5 (March 1, 2009): 585-591, doi:10.1093/bioinformatics/btp039.

Sunday, March 15, 2009

Why Biopython needs to move to GitHub or Launchpad

[Image: Air hosting?]

Paulo Nuin wrote a spot-on post about the ridiculousness of Biopython still using CVS as its revision control system (a.k.a. source code management, or SCM) when we code in an era of arguably superior tools in the form of distributed SCMs (DSCMs). Please read his post if you haven't yet. Do not pass go. Do not collect $200. This post will be here for you when you get back.

I'll continue along the thread that Paulo started: one of the hangups the Biopython community must overcome is the question, "Supposing we do switch to a DSCM, where do we host the code?" Until the Biopython project decides on an answer, it won't move to anything.

Peter Cock seems sincerely determined that the code be hosted on the Open Bioinformatics Foundation (OBF) servers at Biopython.org. If I understand Peter's rationale correctly, the notion stems from the desire to maintain control of the code hosting. The alternative to self-hosting the code is to use one of the big players, by which I mean GitHub and Launchpad. GitHub and Launchpad host repositories for the DSCMs Git and Bazaar, respectively, and provide a set of tools around these repositories to facilitate collaboration and interaction between developers and their communities. Launchpad has the backing of Canonical, best known for managing the Ubuntu GNU/Linux distribution, and GitHub has the backing of the only group more rabid than the Python community—the Ruby community; hence, I refer to them as "the big players".

I respect Peter's legitimate concerns. I also really respect Peter, who is much more of a Biopythonista than I'll ever be, and I recognize it will take his blessing for the transition to Git or Bazaar to succeed. I dedicate this blog post to changing Peter's opinion and convincing him that hosting on GitHub or Launchpad is the best option available to us at this time.* Hopefully I'll convince a few other Biopython (or Bio-anything) Devs along the way, too. :-)

The following are my top five reasons for hosting Biopython on GitHub/Launchpad:

  1. It's free. Yeah, okay, only "as in beer"**, but the Biopython source will, itself, remain open. The hosting is generously on someone else's dime, and that's all we need.
  2. It already exists. I have neither the technical experience nor the interest to run my own web-based interface to either Bazaar or Git. From the recent discussions on the Biopython mailing list, I'd guess that nobody on the Biopython Dev team does, or has the time to learn, either. Since the OBF staff are volunteers, helping us set these up won't be high on their priority list; Bazaar and Git don't even exist on the servers yet. Launchpad and GitHub already have the tools in place. The amount of time the Biopython community has to spend setting up projects there is minimal and painless. In fact, it's already been done. Launchpad and GitHub are clearly very good at what they do. They have the experts, the redundancy, and the robustness to manage hosting code in a public space, and all the headaches that come with it, so that we don't have to.
  3. They have established social networks. I'm already on GitHub and Launchpad. A lot of us are already on these sites, working on our own and other open source projects. These places let other people discover our work, and allow serendipitous connections to occur. "Hmm, this gal works on Biopython. What's that?" This doesn't occur at Biopython.org—people only go there when they know what they're looking for (and not many people are looking for "bioinformatics python"). Additionally, potential employers, co-workers, and employees are on these sites; not all of us will be (un)fortunate or content enough to stay in bioinformatics and computational biology forever.
  4. Everybody else is doing it. Sure, right now, GitHub only hosts very minor, niche projects like Ruby on Rails, Cappuccino, and BioRuby (like that will ever go anywhere), and Launchpad has some lesser known ones like MySQL, Zope, and something called Ubuntu, but I hear that some major players will join these sites really soon! They do seem to be gaining in popularity very rapidly. ;-)
  5. Vendor lock-in is just not an issue. There's some concern that using a third-party site such as GitHub or Launchpad will make the Biopython project vulnerable to the possibly unreasonable whims of the sites' owners. Terms and conditions could change unfavorably (e.g., "You have to pay to continue using our service."), or the service could go under. However, the OBF provides no more protection than Launchpad or GitHub, particularly against the latter scenario. When I think about who's least likely to run out of operating funding—the OBF, Launchpad, or GitHub—I'm not betting on the OBF. But let's suppose that the unthinkable happens, and the site closes its doors to Biopython. So what? It's a distributed SCM; we have all of the code! This isn't CVS or Subversion, where a downed server takes all the revision history with it to the grave. We'll just set up shop somewhere else, point it towards our repositories, and sally on. We can burn that bridge when we get there; in the meantime, don't fret about it.

At this point, I'm sure there's more discussion to have. I just hope it's not too much, given that the transition to Subversion stalled tragically, which I take responsibility for. It would be nice to have this settled by May. I'd rather be fielding "How do I do this in Git/Bazaar?" than discussing "Why should I do this in Git/Bazaar?" My fingers are crossed, my hopes are high, and my stubbornness is fiercer than it was two years ago.

* I'm excluding Mercurial and Bitbucket here because they haven't received consideration on the mailing list. They could be a great solution, but we're least familiar with them, and we have to narrow down the choices somehow.

** Okay, so Launchpad is going to be open sourced, but I don't want to be in charge of running an instance of it if nobody's going to pay me; see point 2.

Wednesday, March 11, 2009

This is a stick up! Give me all your genomes!

This blog post is based on a previous entry of the same title I posted to FriendFeed. This post provides an extended explanation of what we're trying to accomplish.

[Image: Thieves]

I'm working with Prof. John Jelesko, who is investigating metabolic pathways in plants, on a project for one of my courses. At the heart of it, we need to set up a local database for running FASTA homology searches. The Jelesko lab wants this database to contain every amino acid sequence predicted in every whole genome (assembled and annotated) currently available at NCBI, prokaryotic and eukaryotic. [Edit: We don't actually need every sequenced genome; we only need a representative genome per organism. I hadn't previously considered that there may be more than one genome per organism. Thanks to Brad Chapman for pointing out the need for clarification.]

We have sequences from locations other than NCBI that we need to include in the FASTA search space; hence, we can't just run FASTA searches over NCBI data, which EBI's FASTA service might otherwise handle for us. This necessitates a local database. The Jelesko lab also needs the nucleotide sequence corresponding to each amino acid sequence, as well as the intron/exon locations for the longest available splice variant. The questions are: is it feasible to store this amount of data in a database (we'll be using MySQL), and if so, how do we go about getting the data?

We're naïvely assuming it is feasible, so I'm attempting to figure out how to get at this data. The one file format that seems to store all the information we need in one place is the GenBank (GBK) format (a short Biopython sketch for pulling these fields out follows the list):

  • a gene ID
  • taxonomic classification of the organism from which the gene came
  • start and stop positions for each exon
  • the translated amino acid sequence
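
Since we're Python folks, Biopython's SeqIO looks like the natural tool for yanking those fields out of a GBK file. A minimal sketch, assuming a local file named example.gbk; the filename is hypothetical, and taking exon coordinates from the parts of a CDS feature's (possibly joined) location is my interpretation:

    from Bio import SeqIO

    for record in SeqIO.parse("example.gbk", "genbank"):
        # Taxonomic classification of the source organism.
        taxonomy = record.annotations.get("taxonomy", [])
        for feature in record.features:
            if feature.type != "CDS":
                continue
            # Gene and protein identifiers, where present.
            gene_id = feature.qualifiers.get("gene", ["?"])[0]
            protein_id = feature.qualifiers.get("protein_id", ["?"])[0]
            # Start/stop positions of each exon: a joined CDS location
            # has one part per exon.
            exons = [(int(part.start), int(part.end))
                     for part in feature.location.parts]
            # The translated amino acid sequence.
            translation = feature.qualifiers.get("translation", [""])[0]
            print(gene_id, protein_id, taxonomy, exons, len(translation))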

It seems that in one shape or another, these GenBank format files are available from NCBI's FTP site. The GBK files for the prokaryotic genomes are relatively easy to get in one fell swoop at ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.gbk.tar.gz. For good ol' eukaryotic genomes, however, the data is all over the place. Sometimes it's stored as gzipped files in CHR folders; other times, the files aren't compressed; and still other times, the directory is really just a container for directories that hold the genome data. In short, it's a mess, especially considering we want to automate the retrieval of this data, not to mention update it periodically should NCBI deposit new data.

There's also the dilemma of not actually needing most of the data (the genome sequence) contained in the GBK files—we just need the sequence covering start to stop for translation, including intronic sequence for the mRNA. I can write a hack of a Python script to trudge through the FTP directories and yank any GBK (compressed or otherwise) to local disk (something like the sketch below), but it seems like a big waste of bandwidth and local disk space. It seems like there must be better ways [Doesn't it always?], but I don't have the knowledge of NCBI's services to identify what these might be. If you have any ideas, please share! Meanwhile, I think I'll try contacting NCBI and see if they might point me in the right direction. I'll report back on what we decide to use, which could be my FTP hack, given our limited time for this project.
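
For the record, here's roughly what that hack would look like with Python's standard ftplib. The starting path, the directory-probing trick, and the filename patterns are all assumptions from poking around the FTP site; listing behavior varies between servers, so treat this as a starting point rather than a finished tool:

    import ftplib
    import os

    HOST = "ftp.ncbi.nih.gov"
    START = "/genomes"       # hypothetical starting point; narrow as needed
    DEST = "downloads"

    def walk_and_fetch(ftp, path):
        """Recurse through path, downloading anything that looks like a
        GenBank file (compressed or otherwise)."""
        for name in ftp.nlst(path):
            try:
                ftp.cwd(name)    # only succeeds if name is a directory...
                ftp.cwd("/")     # ...so hop back out and recurse into it
                walk_and_fetch(ftp, name)
            except ftplib.error_perm:
                # Not a directory; grab it if it looks like a GBK file.
                if name.endswith((".gbk", ".gbk.gz", ".gbk.tar.gz")):
                    local = os.path.join(DEST, os.path.basename(name))
                    with open(local, "wb") as out:
                        ftp.retrbinary("RETR " + name, out.write)

    if not os.path.isdir(DEST):
        os.mkdir(DEST)
    ftp = ftplib.FTP(HOST)
    ftp.login()              # anonymous login
    walk_and_fetch(ftp, START)
    ftp.quit()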

Update: I've received suggestions on the FriendFeed entry for this blog post worth checking out.