Monday, March 16, 2009

If a tree falls in a random forest: a summary of Chen and Jeong, 2009

[Image: trees in fog with shadows]

I had to write a summary for a paper, "Sequence-based prediction of protein interaction sites with an integrative method" by Xue-wen Chen and Jong Cheol Jeong[1], for my Problem Solving in Bioinformatics course. I thought I'd share the review here on my blog, in case anybody finds it remotely useful. I doubt anyone will, but it's my blog, so there. Be forewarned, this is my "Hey, buddy, I'm just a biologist" interpretation of their paper. If you spot any specious, misleading, or just plain incorrect statements, please, by all means, offer corrections.


Chen and Jeong have essentially applied a machine learning technique called random forests to predict specific binding sites on proteins, given only the amino acid sequence, with greater accuracy than previously existing methods. Identification of binding sites in proteins remains an important task for both basic and applied life sciences research, for these sites make possible the protein-protein and protein-ligand interactions from which phenotypes, and indeed, the properties of life emerge. These sites also serve as important drug targets for pharmaceutical research.

Traditionally, researchers have identified binding sites through in vivo or in vitro studies involving point mutations that affect phenotypes, as well as through analysis of protein structures solved by protein crystallography. With the advent and continuous improvement of DNA sequencing technology, however, researchers contribute ever more knowledge in the form of amino acid sequences rather than structures. Sequencing has rapidly outpaced crystallography, necessitating prediction of proteins' functional characteristics based solely on their amino acid sequences, which Chen and Jeong cite as the motivation behind the research presented in this paper.

Previous efforts to infer binding sites purely from amino acid sequence used a different machine learning method called Support Vector Machines (SVMs). I'm not entirely certain how SVMs operate, but like random forests, they require a training set of known binding sites and sites not involved in binding. One confounding factor when applying machine learning methods like SVMs to amino acid sequences is that the residues are unevenly distributed between the two categories; in other words, few amino acids in a sequence (1 in 9 in the dataset used by Chen and Jeong) will sit at the interface of the protein and its ligand. Chen and Jeong chose to use random forests because they are robust against this bias in the data, which has to do with the way that random forests are constructed.

For constructing random forests, one must have a set of data. In Chen and Jeong's study, the set comprises amino acids belonging to 99 polypeptide chains—or chunks of proteins—culled from a protein-protein interaction set used in previous studies. One must also have a set of features, or measures, about each item in the data set. In this study, there were 1050 features (stored as vectors) for each amino acid, which fall into one of three categories: those measuring physical or chemical characteristics (e.g., hydrophobicity, isoelectric point, and propensity—a fancy word for whether an amino acid is likely to be on the surface of a protein or buried deep within it), those measuring the amino acid's minimum distance along the sequence to any other given amino acid, and the position-specific scoring matrix (PSSM), which captures how likely particular amino acid substitutions are at that position.
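To make the shape of those feature vectors a little more concrete, here is a purely illustrative Python sketch. The helper inputs (physchem, min_distances, pssm_window) and the way they are combined are my own made-up stand-ins, not the authors' actual feature definitions, which are spelled out in the paper.

    import numpy as np

    def residue_features(physchem, min_distances, pssm_window):
        """Assemble one residue's feature vector from the three categories.

        physchem      -- physical/chemical measures (hydrophobicity,
                         isoelectric point, surface propensity, ...)
        min_distances -- minimum sequence distance from this residue to the
                         nearest occurrence of each of the 20 amino acid types
        pssm_window   -- PSSM rows for a window of positions around this residue
        """
        return np.concatenate([np.ravel(physchem),
                               np.ravel(min_distances),
                               np.ravel(pssm_window)])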

With this data set and these features in hand, one feeds them to the random forest generator. To construct one random decision tree, follow a process like this:

  1. Count the total number of known interface sites (we'll call these positives), and call this number N.
  2. Count the number of features available, and call this number M.
  3. Randomly select a subset of N sites out of the entire set, with—and this is important—replacement. This helps with the problem of the unbalanced data set. If I recall my statistics correctly (I don't), this has to do with each site now having an equal chance of influencing the training.
  4. Now we build the tree. We randomly select m features from the total M features, where m is a lot smaller than M. Then, of those m features, we choose the one which best splits the subset of sites. We continue to do this recursively until all sites have been "classified".
  5. We repeat steps 3 and 4 to construct the desired number of trees (100 in this study), which gives us our "forest" of randomly generated trees (see the rough code sketch just after this list).
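To make those steps concrete, here is a rough Python sketch of the loop, leaning on scikit-learn's DecisionTreeClassifier for the per-tree work. It follows the recipe above rather than the authors' actual implementation, and details like the exact sampling scheme and the choice of m are my own assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def build_forest(X, y, n_trees=100, seed=0):
        """X: (n_sites, M) feature matrix; y: 1 for interface sites, 0 otherwise."""
        rng = np.random.default_rng(seed)
        n_positives = int(y.sum())               # step 1: N
        n_features = X.shape[1]                  # step 2: M
        m = max(1, int(np.sqrt(n_features)))     # step 4: m << M (sqrt(M) is a common default)
        forest = []
        for _ in range(n_trees):                 # step 5: grow the desired number of trees
            # step 3: draw a bootstrap sample of N sites, with replacement
            idx = rng.choice(len(y), size=n_positives, replace=True)
            # step 4: each split considers only a random subset of m features
            tree = DecisionTreeClassifier(max_features=m,
                                          random_state=int(rng.integers(1 << 31)))
            tree.fit(X[idx], y[idx])
            forest.append(tree)
        return forest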

With the random forest constructed, you essentially feed an amino acid site into the forest; the site trickles down each tree, and each tree then "votes" on whether it classifies the site as an interaction site. A simple majority can be used to categorize the site, or a different vote threshold can be set, such as "at least 5 votes are necessary to categorize the site as an interface site". Increasing the number of votes required improves the confidence with which one claims a site is an interaction site (specificity), but decreases the probability of detecting interaction sites (sensitivity).
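As a hypothetical companion to the build_forest sketch above, the voting step might look something like this; the threshold handling is my own illustration, not the paper's exact criterion.

    def predict_site(forest, x, min_votes=None):
        """Classify one site's feature vector x by counting positive tree votes."""
        votes = sum(int(tree.predict(x.reshape(1, -1))[0]) for tree in forest)
        if min_votes is None:
            min_votes = len(forest) // 2 + 1   # simple majority
        # Raising min_votes trades sensitivity for specificity, as described above.
        return votes >= min_votes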

Using these measures of sensitivity and specificity in conjunction with leave-one-out studies (one polypeptide sequence is used as the test case, and the other 98 are used as training data), Chen and Jeong demonstrated that their random forests approach performed significantly better than the SVM approach used by the earlier studies. They attribute this improved performance to two things: random forests are more robust to unbalanced data sets, and their approach considered many more features than the previous studies'. When they used only the features used in the previous studies, they found decreased performance, albeit still significantly better than the previous methods'. Chen and Jeong note that a major feature of random forests is that their accuracy increases, rather than decreases, when the number of features increases, due to the random sampling.
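For what it's worth, the leave-one-out protocol can be sketched roughly as below, reusing the hypothetical build_forest and predict_site helpers from earlier; chains is assumed to be a list of (features, labels) pairs, one per polypeptide chain.

    def leave_one_chain_out(chains, n_trees=100):
        tp = fp = tn = fn = 0
        for i, (X_test, y_test) in enumerate(chains):
            # train on all the other chains (98 of the 99 in the paper)
            X_train = np.vstack([X for j, (X, _) in enumerate(chains) if j != i])
            y_train = np.concatenate([y for j, (_, y) in enumerate(chains) if j != i])
            forest = build_forest(X_train, y_train, n_trees=n_trees)
            for x, label in zip(X_test, y_test):
                pred = predict_site(forest, x)
                if pred and label:
                    tp += 1
                elif pred:
                    fp += 1
                elif label:
                    fn += 1
                else:
                    tn += 1
        sensitivity = tp / (tp + fn)   # fraction of true interface sites recovered
        specificity = tn / (tn + fp)   # fraction of non-interface sites correctly rejected
        return sensitivity, specificity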

Chen and Jeong finished their study with a prediction of binding sites on the DnaK (or Hsp70 in eukaryotes) chaperone system. Their predictions were corroborated by several in vivo studies of mutants, in which mutations near the predicted sites yielded changes in phenotype for both prokaryotic and eukaryotic forms. Their visualization of the predicted interaction sites using 3D molecular modeling software provided additional support.

  1. Xue-wen Chen and Jong Cheol Jeong, "Sequence-based prediction of protein interaction sites with an integrative method," Bioinformatics 25, no. 5 (March 1, 2009): 585-591, doi:10.1093/bioinformatics/btp039.
