Wednesday, July 1, 2009

"Who arrre you?" Getting the hostname back in the Jaunty GDM greeter

Remendos - PatchworkThe latest Ubuntu release, 9.04, codename "Jaunty Jackelope", has turned out to be one of the best, maybe even on par with the "Gutsy Gibbon" release. The aesthetics definitely got some love; for example, if you're not running the "Dust" theme, you're missing out. [Hint: go to Preferences -> Appearance -> Theme and select "Dust"] The GDM greeter login screen looks the best of any Ubuntu release.

Unfortunately, a little bit of usability got lost along the way; most notably, the hostname no longer appears anywhere on the graphical login. This probably bothers only a minority of people, but our lab, for example, just updated all its machines to Jaunty, and we couldn't tell from the greeters which machine belonged to which hostname without logging in. I sat down this afternoon for a few minutes to figure out how the GDM themes work. It turns out they're just coded as fairly simple XML, and looking at other themes, I eventually figured out what to tweak. This patch will bring back the beloved hostname to the GDM login.

To use this patch, just do
sudo patch -p0 < /path/to/hostname_patch_for_Human.xml.patch


Now you'll no longer have to look at login screens and wonder, "Who arrre you?"

Update (16:17): Apparently Blogger's software won't allow XML in their pre tags, so I just hosted the patch on my server instead. All the more reason why I need to host my own blog with Wordpress or something soon...

Sunday, June 28, 2009

Is it in one's Nature?

Bonsai MoonToday is Sunday, a day of rest to some, but to heathens with Monday meetings like myself, a day of catching up and doing all the things we thought we'd get done earlier. Unfortunately for me, our LDAP server that gives us access to the network is down... again... for the third weekend in a row, preventing access to our workstations, data, and worst for me, my research notebook, which I keep on our group's wiki. I admittedly felt a strong temptation to get out, enjoy the sunshine, and play a little guitar, but here I sit, in the cold, gray, fluorescent-lit cube. I'm here because I'm trying to be less incompetent as a scientific researcher.

One of the things that particularly makes me feel incompetent is my lack of knowledge of scientific literature, and (to a greater extent?) my lack of enthusiasm for reading it. I don't know why, and I give myself grief for this, but I often find reading scientific papers just plain boring. The funny thing is, I really appreciate science, by which I mean the technique of elucidating one's knowledge of the world through rigorous, reproducible means, and keeping a skeptical mindset, especially when it comes to one's own work. Likewise, I will never cease to find biology or computational technology among the most satisfactory pursuits for the very limited time and energy I have here on this good Earth. Yes, science, itself, is awesome, but the excitement of it gets stripped away in a lot of formal education environments, and for me, in the way scientists present it in their formal literature.

I have to qualify that last statement as pertaining to myself because I have colleagues who clearly find the literature still stimulating; a good example is Arjun Krishnan. At any given point, Arjun can tell you a few relevant papers he's read on seemingly any subject, he can give you solid summaries, and he turns it into good research questions, some of which he's following up on. He's a paragon of the Good Graduate Student; I have no doubts Arjun is going to be a superstar scientist in whatever field he ends up in, if not in general. I am certainly no Arjun, however, so I have to focus on humbler goals.

One of our tasks as students in the Murali group is to canvas over a dozen of the journals in bioinformatics and computational biology and scout for pertinent articles. I decided to use my "downtime" to have at the growing stack of journal headlines in my RSS feeds, and since I needed a place to start, I thought I'd tackle my Nature stack, which I'd neglected since the end of May. This meant a back log of over two hundred articles. I scanned through each headline, pausing at ones that had life sciences subjects, opening up a few that had keywords that caught my attention, taking a genuine look at a few of those, and skipping over the rest. At the end of the process, I felt really disappointed.

Of the several hundred articles, I only wound up reading three research highlights, the abstract of one letter, the abstract and some of the figures in another, and the abstract and some of the methods in another, and none of these proved at all pertinent to research I am supposed to be doing now.

Worth pointing out more, at no time did I read the title of a full-fledged research article and think, "Wow, I should read that," or even, "Gee, that sounds interesting." The vast majority of the titles just struck me as extremely esoteric, and this confuses me the most. Aren't Nature, Science, and PNAS supposed to have articles that are of interest not just to a specific field, but to the entire scientific community? But you know, I'm not interested that "GOLPH3 modulates mTOR signalling and rapamycin sensitivity in cancer", or that "Histone H4 lysine 16 acetylation regulates cellular lifespan", or in "A newly discovered protein export machine in malaria parasites". I fail to feel these discoveries shaking my perception of the world around me, of giving me a new topic to explore, or helping me make my own discoveries.

Nature is a journal that can make tenure, a journal where scientists experience great thrills for getting in and great envy when their colleagues do, a journal that says, "I publish like a boss!" It's a journal where my boss says, "You should be reading it anyway." So obviously, like so many things in scientific research, I just don't get it. And now, after my attempt to gain a little face today, I'm right back to where I started. Go ahead, just say it—I'm the worst scientist in the world. I'm a cotton-headed ninny muggins.

Thursday, April 30, 2009

How symbolic: on removing symlinks in Bazaar VCS

the weakest linkI have a strong affinity for distributed revision control systems, and my favorite has been Bazaar VCS (a.k.a., bzr). Like any piece of software, bzr has its quirks and shortcomings. Tonight, I encountered its rather tricky behavior when it comes to symbolic links (symlinks).

I keep my configurations under revision control, which gives me the benefits of rolling back changes when I inevitably break things, and of setting up home on a new system, even a remote one, very quickly and easily. All was well, but I discovered that when I naively placed my .vim/ directory under the repository, I added a ton of symlinks to files in /usr/share/vim/addons. These symlinks were present because I used Ubuntu's vim-scripts and vim-addon-manager packages to install these addons to my Vim profile, which essentially just sets up symlinks to the addons, stored in /usr/share. It's a pretty reasonable system, actually, but it doesn't make sense to have these symbolic links stored in my branch. I can't guarantee that each system I work on will have the files the symlinks point to, therefore, I thought it best to remove them. Therein I encountered a sticky issue with bzr: you really can't remove symlinks from its revision tracking easily.

I thought I could be clever and write a simple one-liner in Bash to remove all the symlinks presently tracked by bzr from further tracking, but still leave them on the file system (I still need the symlinks there, after all, or my Vim goodness will break).

for file in `bzr ls -V`; do # use bzr ls -VR in later versions
if [ -h $file ]; then # see if the file is a symlink
echo "Removing $file";
bzr rm --keep $file; # remove from tracking, not the FS
fi;
done

Okay, so I reformatted it for annotation, but trust me, it fits on one line. Anyway, I immediately encountered problems, getting this as output:

.vim/compiler/tex.vim
bzr: ERROR: Not a branch: "/usr/share/vim/addons/compiler/tex.vim/".
.vim/doc/NERD_commenter.txt
bzr: ERROR: Not a branch: "/usr/share/vim-scripts/doc/NERD_commenter.txt/".
.vim/doc/bufexplorer.txt
bzr: ERROR: Not a branch: "/usr/share/vim-scripts/doc/bufexplorer.txt/".
.vim/doc/imaps.txt.gz
bzr: ERROR: Not a branch: "/usr/share/vim/addons/doc/imaps.txt.gz/".
.vim/doc/latex-suite-quickstart.txt.gz
...

WTF? "Not a branch!?"

Okay, so, what happens here is that Bazaar de-references the symlink before attempting to remove it, which is not at all what I had in mind. Poking around Launchpad, you can find several bug reports regarding the way Bazaar deals with symlinks. The workaround solutions proposed in those—remove the symlink using rm—wouldn't work for me, because I needed to retain the actual symlinks on the filesystem.

At this point I had solicited the attention of Robert Collins, a.k.a. lifeless in #bzr on Freenode. When I told him the workaround wouldn't work for me, and that I'd need to write a script, he suggested I use WorkingTree.unversion() from bzrlib. Despite being a Python fanatic [understatement] and bzr's codebase being in Python, when I said "script", I meant "Bash script". It never occurred to me to actually write a Python script until he mentioned that. By the completion of the thought, though, I was digging into the codebase of bzrlib to figure out what to do.

My initial aproach plan included using os.walk() to move through the filesystem, os.path.islink() to identify the symbolic links, and then WorkingTree.unversion() to mark the files for removal from tracking. I ran into a problem, however, in that unversion() only accepts a list of file IDs, as specified by the bzr metadata. Robert pointed me towards a method called path2ids(), but I had trouble figuring out how I was going to give it the proper paths. os.walk will let me construct absolute paths to files, but I really needed relative paths to the files, truncated at a certain point past the root (e.g., .vim/compiler/tex.vim instead of /home/chris/shell-configs/.vim/compiler/tex.vim). I could see it was getting a little hairy, so I decided to dig a little further into WorkingTree code and see if there was anything else I could use.

What I discovered was the jackpot in the form of WorknigTree.walktree()—a method written precisely for what I needed: traversing the filesystem, identifying the filetypes (especially symlinks), and providing file IDs. Within a few minutes, I banged out a script that did exactly what I needed it to do, presented below.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

# Copyright (c) 2009 Chris Lasher
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

"""
A simple script to go through a Bazaar repository and ruthlessly
remove all symbolic links (symlinks) from further tracking.

It's important to note that this will not actually remove the symlinks
from the physical filesystem. This is left to the user, if so desired.

"""

__author__ = 'Chris Lasher'
__email__ = 'chris DOT lasher AT gmail DOT com'

import bzrlib.workingtree
import os

tree = bzrlib.workingtree.WorkingTree.open(os.getcwd())
try:
# use protection -- one-on-one action only
tree.lock_write()
symlink_ids = []
for dir, file_list in tree.walkdirs():
# dir[1] (the file_id) will be None if it's not under revision
# control, so this will skip it if it's not
if dir[1]:
for file_data in file_list:
# file_data[2] is the file type, and file_data[4] is the
# file_id, the necessary specifier for removing the file
# from revision tracking
if file_data[2] == 'symlink' and file_data[4]:
print "Removing %s" % file_data[0]
symlink_ids.append(file_data[4])

tree.unversion(symlink_ids)

finally:
# okay, all yours
tree.unlock()
Hopefully someone else will find this little script useful. It's under the Apache version 2 license; make whatever use of it you can for your particular predicament.

So what were the lessons learned here:
  1. Exercise a little restraint and consideration about what you put under revision control in the first place.
  2. It's awesome to be able to have direct contact with developers of your tools.
  3. It's even more awesome to be able to dig right into their code and help yourself.
  4. Just like in Murali's brutal Theory of Algorithms course, in real life, when facing difficulty solving a problem one way, don't be afraid to step back and try an approach from another (the opposite) direction. Trust your gut—if it feels like the hard way of doing something, it probably is; find the lazy (smart) way.
A special thanks to Robert for his guidance and help.

Saturday, March 21, 2009

Do you choose the research, or does the research choose you?

Note: I give advanced warning to my non-biologist readers that the next few paragraphs below contain a good dose of biology. While I have attempted to keep it conversational, if you feel your eyes glazing over, skip down a few paragraphs for the real meat.

to the airYesterday (Friday) I attended a talk by Susan Gottesman about small non-coding RNAs (sRNAs) and how they are involved in protein degradation. At this point in her esteemed career, Gottesman's best known work revolves around a particular Escherichia coli sigma factor—a protein responsible for transcription—called RpoS. RpoS facilitates translation of messenger RNAs (mRNAs) into proteins at low temperature levels. Now, RpoS only appears in E. coli cells during low temperature conditions, but mysteriously (or so it was), the gene that encodes RpoS gets expressed even when the cells are growing at a comfortable temperature.

As Gottesman's lab discovered, the mRNA for RpoS can actually bend back around and stick to itself such that ribosomes aren't able to bind to the mRNA and translate it into the RpoS protein. An sRNA called DsrA, however, which is expressed in low temperature conditions, binds to part of the RpoS mRNA, preventing the mRNA from folding back on itself, and giving ribosomes access to the transcript to translate it into RpoS protein. Why is this important?

Well, previously, sRNAs had only been thought to inhibit translation and prevent proteins from appearing. That is, we say that sRNAs usually inhibit the expression of a protein, so if you found an sRNA, you would bet that its target wouldn't appear when it appeared. Add sRNA and the protein won't be found in the cell; take the sRNA away, and the protein re-appears. The Gottesman lab, however, demonstrated a case where the sRNA actually is responsible for making the proteins appear. That is, when the sRNA DsrA appears, its target, RpoS, appears too; and if you take away DsrA, the protein goes away, too! Craziness! In Biology, we call this a paradigm shift. Paradigm shifts are "big deals", because Biology is all about figuring out the rules, and then identifying the exceptions so we have to re-write the rules. Biology is the science of exceptions.

The story continues, but I'll leave it to the reader to check out Gottesman's publications for more, because as much as I liked the story of her research, what I found most interesting about the talk was this side comment that she made towards the end, which I paraphrase here:
We published this work with RpoS, but then we wanted to work in other directions. We'd try something, then discover we couldn't go in that direction because we needed to know something else about RpoS. Then we'd attempt something else, but again, it would always come back to RpoS. Finally we just said, "Forget it! Fine! We'll just study RpoS. Clearly there's enough here to work on for a while."
I don't know if the fellow grad students in the audience caught the subtle significance of this statement, or if perhaps I was the only person who found this significant. What Gottesman said, in more words, is that she didn't really choose her research; her research chose her. Yet, in spite of spending her career in an area she never intended to stay in, once she identified that she was mired in it, she made the best of it, leading to great scientific contributions and earning her accolades and prestige that even the most jaded of us junior researchers catch ourselves fantasizing about from time to time.

I find this significant because, also from time to time, I wonder how the researchers, and even my peers, that I have come to admire wound up doing the research that they're doing. In my earlier days, I often thought they must possess great foresight and wisdom. While I don't doubt they're clever people, the longer my tenure in research and the more people I harass to tell me about their own careers, the more I've begun to think that a lot of it just comes by chance rather than deliberate choice. We find ourselves in a particular unique positions, somewhat stuck, and somewhat stumped, and we throw up our hands and say, "Aw, Hell! I guess I might as well dig around while I'm here." We do have to make choices about where we dig, but we seem to get to choose our own particular holes about as well as seeds scattered by the winds. (Though, from time to time, we can try to ride the winds to another hole.)

I suppose that I just find it amusing that life is stochastic from the molecular level all the way up to our own grand plans. Like each of our cells, we may as well just deal with the cards we're dealt as best we can. For everything else... well, "Cast Your Fate to the Wind".

Monday, March 16, 2009

If a tree falls in a random forest: a summary of Chen and Jeong, 2009

Trees in fog w shadowsI had to write a summary for a paper, "Sequence-based prediction of protein interaction sites with an integrative method" by Xue-wen Chen and Jong Cheol Jeong[1], for my Problem Solving in Bioinformatics course. I thought I'd share the review here on my blog, in case anybody finds it remotely useful. I doubt anyone will, but it's my blog, so there. Be forewarned, this is my, "Hey, buddy, I'm just a biologist" interpretation of their paper. If you spot any specious, misleading, or just plain incorrect statements, please, by all means, offer corrections.




Chen and Jeong have essentially found a method to apply a machine learning technique called random forests to predict specific binding sites on proteins given only the amino acid sequence with greater accuracy than previously existing methods. Identification of binding sites in proteins remains an important task for both basic and applied life sciences research, for these sites make possible the protein-protein and protein-ligand interactions from which phenotypes, and indeed, the properties of life emerge. These sites also serve as important drug targets for pharmaceutical research.

Traditionally, researchers have identified binding sites from in vivo or in vitro studies involving point mutations that affect phenotypes, as well as through analysis of protein structures as identified through protein crystallography. With the advent and continuous improvement of DNA sequencing technology, however, researchers contribute ever more knowledge in the form of amino acid sequence, rather than structures. Sequencing has rapidly outpaced crystallography, necessitating prediction of proteins' functional characteristics based solely on their amino acid sequence, which Chen and Jeong cite as the motivation behind research presented in this paper.

Previous efforts to infer binding sites purely from amino acid sequence used a different machine learning method called Support Vector Machine (SVM). I'm not entirely certain how SVMs operate, but like random forests, they require a training set of known binding sites and sites not involved in binding. One of the confounding factors about amino acid sequences when applied to machine learning methods like SVMs is that the residues are unevenly distributed between the two categories; in other words, few amino acids in a sequence (1 in 9 in the dataset used by Chen and Jeong) will sit at the interface of the protein and its ligand. Chen and Jeong chose to use random forests because they are robust against this bias in the data. This has to do with the way that random forests are constructed.

For constructing random forests, one must have a set of data. In Chen and Jeong's study, the set is comprised of amino acids belonging to 99 polypeptide chains—or chunks of proteins—culled from a protein-protein interaction set used in previous studies. One must also have a set of features, or measures, about each item in and the data set. In this study, there were 1050 features (as stored in vectors) for each amino acid, which fall into one of three categories: those measuring physical or chemical characteristics (e.g., hydrophobicity, isoelectric point, propensity—which is a fancy word for saying whether an amino acid is likely to be on the surface of a protein or buried deep within it), those measuring the amino acid's minimum distance to any other given amino acid along the sequence, and the position specific score matrix (PSSM), which has to do with how likely certain amino acid substitutions are likely to be at that point.

With this data set and features in hand, one feeds it to the random forest generator. To construct one random decision tree, follow a process like this:

  1. Count the total number of known interface sites (we'll call these positives), and call this number N.
  2. Count the number of features available, and call this number M.
  3. Randomly select a subset of N sites out of the entire set with—and this is important—replacement. This solves the problem of the unbalanced data set. If I recall my statistics correctly (I don't) this has to do with each site now having equal chance at influencing the training.
  4. Now we build the tree. We randomly select m features from the total M features, where m is a lot smaller than M. Then, of those m features, we choose the one which best splits the subset of sites. We continue to do this recursively until all sites have been "classified".
  5. We repeat steps 1-4 to construct the number of desired trees (100 in this study), which gives us our "forest" of randomly generated trees.
With the random forest constructed, essentially you feed in an amino acid site into the random forest, then the site trickles down each tree, and each tree then "votes" as to whether or not it classified the site as an interaction site or not. A simple majority can be used to categorize the site, or more stringent criteria, such as "at least 5 votes are necessary to categorize the site as an interface site". Increasing the votes required improves the confidence at which one claims a site is an interaction site (specificity), but decreases the probability of detecting interaction sites (sensitivity).

Using these measures of sensitivity and specificity in conjunction with leave-one-out studies (one polypeptide sequence is used as the test case, and the other 98 are used as training data), Chen and Jeong demonstrated that their random forests approach performed significantly better than the SVM approach used by the earlier studies. They attribute this improved performance to two things: random forests are more robust to unbalanced data sets, and their approach considered many more features than the previous studies'. When they used only the features used in the previous studies, they found decreased performance, albeit still significantly better than the previous methods'. Chen and Jeong note that a major feature of random forests is that their accuracy increases, rather than decreases, when the number of features increases, due to the random sampling.

Chen and Jeong finished their study with a prediction of binding sites on the DnaK (or Hsp70 in eukaryotes) chaperone system. Their results corroborated with several in vivo studies of mutants where mutations near the sites they predicted yielded changes in phenotypes for both prokaryotic and eukaryotic forms. Their visualization of predicted interaction sites using 3d molecular modeling software provided additional support.
  1. Xue-wen Chen and Jong Cheol Jeong, "Sequence-based prediction of protein interaction sites with an integrative method," Bioinformatics 25, no. 5 (March 1, 2009): 585-591, doi:10.1093/bioinformatics/btp039.