It's Not Easy Being Genes: 2009

Friday, November 13, 2009

Time out: deterring brute force SSH attacks with iptables

These are some simple iptables rules I keep around on my firewall to deter brute force SSH attacks. The original idea came from Dominik Borkowski, a sysadmin at VBI.

If the attacker attempts more than 4 connections within a minute, these rules temporarily blacklist them for the next minute—or as I like to say, "put them in time-out". Their packets will be dropped; to them, it will seem that the machine simply disappeared from the intarwebs. The rules will also log such violators to your syslog. I've found them very effective. Most scripts that these crackers run will drop off after one iteration and look for lower hanging fruit.

Of course, if you forget your password, or have a habit of making a couple of simultaneous connections to your computer, the door will shut on you, too, but the good news is that you'll only be blocked for a minute. More draconian methods that append to actual blacklists have a habit of locking their owners out. (Not that I'm speaking from personal experience at all.) The rules escape this pitfall but will prove just as effective.

## Below includes very successful deterrents for SSH brute force
## that allows a maximum of 4 connection attempts within a minute.
iptables -A INPUT -p tcp -m state --state NEW --dport 22 -m recent --name sshattack --set
iptables -A INPUT -m recent --name sshattack --rcheck --seconds 60 --hitcount 4 -m limit --limit 4/minute -j LOG --log-prefix 'SSH attack: '
iptables -A INPUT -m recent --name sshattack --rcheck --seconds 60 --hitcount 4 -j DROP
iptables -A INPUT -p tcp -m tcp --dport 22 -j ACCEPT

I keep this in a firewall (shell) script that controls iptables rules and executes on bootup. If there's sufficient demand, I can make the entire script available.

Wednesday, July 1, 2009

"Who arrre you?" Getting the hostname back in the Jaunty GDM greeter

The latest Ubuntu release, 9.04, codename "Jaunty Jackelope", has turned out to be one of the best, maybe even on par with the "Gutsy Gibbon" release. The aesthetics definitely got some love; for example, if you're not running the "Dust" theme, you're missing out. [Hint: go to Preferences -> Appearance -> Theme and select "Dust"] The GDM greeter login screen looks the best of any Ubuntu release.

Unfortunately, a little bit of usability got lost along the way; most notably, the hostname no longer appears anywhere on the graphical login. This probably bothers only a minority of people, but our lab, for example, just updated all its machines to Jaunty, and we couldn't tell from the greeters which machine belonged to which hostname without logging in. I sat down this afternoon for a few minutes to figure out how the GDM themes work. It turns out they're just coded as fairly simple XML, and looking at other themes, I eventually figured out what to tweak. This patch will bring back the beloved hostname to the GDM login.

To use this patch, just do

sudo patch -p0 < /path/to/hostname_patch_for_Human.xml.patch

Now you'll no longer have to look at login screens and wonder, "Who arrre you?"

Update (16:17): Apparently Blogger's software won't allow XML in their pre tags, so I just hosted the patch on my server instead. All the more reason why I need to host my own blog with Wordpress or something soon...

Sunday, June 28, 2009

Is it in one's Nature?

Today is Sunday, a day of rest to some, but to heathens with Monday meetings like myself, a day of catching up and doing all the things we thought we'd get done earlier. Unfortunately for me, our LDAP server that gives us access to the network is down... again... for the third weekend in a row, preventing access to our workstations, data, and worst for me, my research notebook, which I keep on our group's wiki. I admittedly felt a strong temptation to get out, enjoy the sunshine, and play a little guitar, but here I sit, in the cold, gray, fluorescent-lit cube. I'm here because I'm trying to be less incompetent as a scientific researcher.

One of the things that particularly makes me feel incompetent is my lack of knowledge of scientific literature, and (to a greater extent?) my lack of enthusiasm for reading it. I don't know why, and I give myself grief for this, but I often find reading scientific papers just plain boring. The funny thing is, I really appreciate science, by which I mean the technique of elucidating one's knowledge of the world through rigorous, reproducible means, and keeping a skeptical mindset, especially when it comes to one's own work. Likewise, I will never cease to find biology or computational technology among the most satisfactory pursuits for the very limited time and energy I have here on this good Earth. Yes, science, itself, is awesome, but the excitement of it gets stripped away in a lot of formal education environments, and for me, in the way scientists present it in their formal literature.

I have to qualify that last statement as pertaining to myself because I have colleagues who clearly find the literature still stimulating; a good example is Arjun Krishnan. At any given point, Arjun can tell you a few relevant papers he's read on seemingly any subject, he can give you solid summaries, and he turns it into good research questions, some of which he's following up on. He's a paragon of the Good Graduate Student; I have no doubts Arjun is going to be a superstar scientist in whatever field he ends up in, if not in general. I am certainly no Arjun, however, so I have to focus on humbler goals.

One of our tasks as students in the Murali group is to canvas over a dozen of the journals in bioinformatics and computational biology and scout for pertinent articles. I decided to use my "downtime" to have at the growing stack of journal headlines in my RSS feeds, and since I needed a place to start, I thought I'd tackle my Nature stack, which I'd neglected since the end of May. This meant a back log of over two hundred articles. I scanned through each headline, pausing at ones that had life sciences subjects, opening up a few that had keywords that caught my attention, taking a genuine look at a few of those, and skipping over the rest. At the end of the process, I felt really disappointed.

Of the several hundred articles, I only wound up reading three research highlights, the abstract of one letter, the abstract and some of the figures in another, and the abstract and some of the methods in another, and none of these proved at all pertinent to research I am supposed to be doing now.

Worth pointing out more, at no time did I read the title of a full-fledged research article and think, "Wow, I should read that," or even, "Gee, that sounds interesting." The vast majority of the titles just struck me as extremely esoteric, and this confuses me the most. Aren't Nature, Science, and PNAS supposed to have articles that are of interest not just to a specific field, but to the entire scientific community? But you know, I'm not interested that "GOLPH3 modulates mTOR signalling and rapamycin sensitivity in cancer", or that "Histone H4 lysine 16 acetylation regulates cellular lifespan", or in "A newly discovered protein export machine in malaria parasites". I fail to feel these discoveries shaking my perception of the world around me, of giving me a new topic to explore, or helping me make my own discoveries.

Nature is a journal that can make tenure, a journal where scientists experience great thrills for getting in and great envy when their colleagues do, a journal that says, "I publish like a boss!" It's a journal where my boss says, "You should be reading it anyway." So obviously, like so many things in scientific research, I just don't get it. And now, after my attempt to gain a little face today, I'm right back to where I started. Go ahead, just say it—I'm the worst scientist in the world. I'm a cotton-headed ninny muggins.

Thursday, April 30, 2009

How symbolic: on removing symlinks in Bazaar VCS

I have a strong affinity for distributed revision control systems, and my favorite has been Bazaar VCS (a.k.a., bzr). Like any piece of software, bzr has its quirks and shortcomings. Tonight, I encountered its rather tricky behavior when it comes to symbolic links (symlinks).

I keep my configurations under revision control, which gives me the benefits of rolling back changes when I inevitably break things, and of setting up home on a new system, even a remote one, very quickly and easily. All was well, but I discovered that when I naively placed my .vim/ directory under the repository, I added a ton of symlinks to files in /usr/share/vim/addons. These symlinks were present because I used Ubuntu's vim-scripts and vim-addon-manager packages to install these addons to my Vim profile, which essentially just sets up symlinks to the addons, stored in /usr/share. It's a pretty reasonable system, actually, but it doesn't make sense to have these symbolic links stored in my branch. I can't guarantee that each system I work on will have the files the symlinks point to, therefore, I thought it best to remove them. Therein I encountered a sticky issue with bzr: you really can't remove symlinks from its revision tracking easily.

I thought I could be clever and write a simple one-liner in Bash to remove all the symlinks presently tracked by bzr from further tracking, but still leave them on the file system (I still need the symlinks there, after all, or my Vim goodness will break).

for file in `bzr ls -V`; do          # use bzr ls -VR in later versions
    if [ -h $file ]; then            # see if the file is a symlink
        echo "Removing $file";
        bzr rm --keep $file;         # remove from tracking, not the FS
    fi;
done

Okay, so I reformatted it for annotation, but trust me, it fits on one line. Anyway, I immediately encountered problems, getting this as output:

.vim/compiler/tex.vim
bzr: ERROR: Not a branch: "/usr/share/vim/addons/compiler/tex.vim/".
.vim/doc/NERD_commenter.txt
bzr: ERROR: Not a branch: "/usr/share/vim-scripts/doc/NERD_commenter.txt/".
.vim/doc/bufexplorer.txt
bzr: ERROR: Not a branch: "/usr/share/vim-scripts/doc/bufexplorer.txt/".
.vim/doc/imaps.txt.gz
bzr: ERROR: Not a branch: "/usr/share/vim/addons/doc/imaps.txt.gz/".
.vim/doc/latex-suite-quickstart.txt.gz
...

WTF? "Not a branch!?"

Okay, so, what happens here is that Bazaar de-references the symlink before attempting to remove it, which is not at all what I had in mind. Poking around Launchpad, you can find several bug reports regarding the way Bazaar deals with symlinks. The workaround solutions proposed in those—remove the symlink using rm—wouldn't work for me, because I needed to retain the actual symlinks on the filesystem.

At this point I had solicited the attention of Robert Collins, a.k.a. lifeless in #bzr on Freenode. When I told him the workaround wouldn't work for me, and that I'd need to write a script, he suggested I use WorkingTree.unversion() from bzrlib. Despite being a Python fanatic [understatement] and bzr's codebase being in Python, when I said "script", I meant "Bash script". It never occurred to me to actually write a Python script until he mentioned that. By the completion of the thought, though, I was digging into the codebase of bzrlib to figure out what to do.

My initial approach plan included using os.walk() to move through the filesystem, os.path.islink() to identify the symbolic links, and then WorkingTree.unversion() to mark the files for removal from tracking. I ran into a problem, however, in that unversion() only accepts a list of file IDs, as specified by the bzr metadata. Robert pointed me towards a method called path2ids(), but I had trouble figuring out how I was going to give it the proper paths. os.walk will let me construct absolute paths to files, but I really needed relative paths to the files, truncated at a certain point past the root (e.g., .vim/compiler/tex.vim instead of /home/chris/shell-configs/.vim/compiler/tex.vim). I could see it was getting a little hairy, so I decided to dig a little further into WorkingTree code and see if there was anything else I could use.

What I discovered was the jackpot in the form of WorknigTree.walktree()—a method written precisely for what I needed: traversing the filesystem, identifying the filetypes (especially symlinks), and providing file IDs. Within a few minutes, I banged out a script that did exactly what I needed it to do, presented below.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

# Copyright (c) 2009 Chris Lasher
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

"""
A simple script to go through a Bazaar repository and ruthlessly
remove all symbolic links (symlinks) from further tracking.

It's important to note that this will not actually remove the symlinks
from the physical filesystem. This is left to the user, if so desired.

"""

__author__ = 'Chris Lasher'
__email__ = 'chris DOT lasher AT gmail DOT com'

import bzrlib.workingtree
import os

tree = bzrlib.workingtree.WorkingTree.open(os.getcwd())
try:
    # use protection -- one-on-one action only
    tree.lock_write()
    symlink_ids = []
    for dir, file_list in tree.walkdirs():
        # dir[1] (the file_id) will be None if it's not under revision
        # control, so this will skip it if it's not
        if dir[1]:
            for file_data in file_list:
                # file_data[2] is the file type, and file_data[4] is the
                # file_id, the necessary specifier for removing the file
                # from revision tracking
                if file_data[2] == 'symlink' and file_data[4]:
                    print "Removing %s" % file_data[0]
                    symlink_ids.append(file_data[4])

    tree.unversion(symlink_ids)

finally:
    # okay, all yours
    tree.unlock()

Hopefully someone else will find this little script useful. It's under the Apache version 2 license; make whatever use of it you can for your particular predicament.

So what were the lessons learned here:

Exercise a little restraint and consideration about what you put under revision control in the first place.
It's awesome to be able to have direct contact with developers of your tools.
It's even more awesome to be able to dig right into their code and help yourself.
Just like in Murali's brutal Theory of Algorithms course, in real life, when facing difficulty solving a problem one way, don't be afraid to step back and try an approach from another (the opposite) direction. Trust your gut—if it feels like the hard way of doing something, it probably is; find the lazy (smart) way.

A special thanks to Robert for his guidance and help.

Saturday, March 21, 2009

Do you choose the research, or does the research choose you?

Note: I give advanced warning to my non-biologist readers that the next few paragraphs below contain a good dose of biology. While I have attempted to keep it conversational, if you feel your eyes glazing over, skip down a few paragraphs for the real meat.

Yesterday (Friday) I attended a talk by Susan Gottesman about small non-coding RNAs (sRNAs) and how they are involved in protein degradation. At this point in her esteemed career, Gottesman's best known work revolves around a particular Escherichia coli sigma factor—a protein responsible for transcription—called RpoS. RpoS facilitates translation of messenger RNAs (mRNAs) into proteins at low temperature levels. Now, RpoS only appears in E. coli cells during low temperature conditions, but mysteriously (or so it was), the gene that encodes RpoS gets expressed even when the cells are growing at a comfortable temperature.

As Gottesman's lab discovered, the mRNA for RpoS can actually bend back around and stick to itself such that ribosomes aren't able to bind to the mRNA and translate it into the RpoS protein. An sRNA called DsrA, however, which is expressed in low temperature conditions, binds to part of the RpoS mRNA, preventing the mRNA from folding back on itself, and giving ribosomes access to the transcript to translate it into RpoS protein. Why is this important?

Well, previously, sRNAs had only been thought to inhibit translation and prevent proteins from appearing. That is, we say that sRNAs usually inhibit the expression of a protein, so if you found an sRNA, you would bet that its target wouldn't appear when it appeared. Add sRNA and the protein won't be found in the cell; take the sRNA away, and the protein re-appears. The Gottesman lab, however, demonstrated a case where the sRNA actually is responsible for making the proteins appear. That is, when the sRNA DsrA appears, its target, RpoS, appears too; and if you take away DsrA, the protein goes away, too! Craziness! In Biology, we call this a paradigm shift. Paradigm shifts are "big deals", because Biology is all about figuring out the rules, and then identifying the exceptions so we have to re-write the rules. Biology is the science of exceptions.

The story continues, but I'll leave it to the reader to check out Gottesman's publications for more, because as much as I liked the story of her research, what I found most interesting about the talk was this side comment that she made towards the end, which I paraphrase here:

We published this work with RpoS, but then we wanted to work in other directions. We'd try something, then discover we couldn't go in that direction because we needed to know something else about RpoS. Then we'd attempt something else, but again, it would always come back to RpoS. Finally we just said, "Forget it! Fine! We'll just study RpoS. Clearly there's enough here to work on for a while."

I don't know if the fellow grad students in the audience caught the subtle significance of this statement, or if perhaps I was the only person who found this significant. What Gottesman said, in more words, is that she didn't really choose her research; her research chose her. Yet, in spite of spending her career in an area she never intended to stay in, once she identified that she was mired in it, she made the best of it, leading to great scientific contributions and earning her accolades and prestige that even the most jaded of us junior researchers catch ourselves fantasizing about from time to time.

I find this significant because, also from time to time, I wonder how the researchers, and even my peers, that I have come to admire wound up doing the research that they're doing. In my earlier days, I often thought they must possess great foresight and wisdom. While I don't doubt they're clever people, the longer my tenure in research and the more people I harass to tell me about their own careers, the more I've begun to think that a lot of it just comes by chance rather than deliberate choice. We find ourselves in a particular unique positions, somewhat stuck, and somewhat stumped, and we throw up our hands and say, "Aw, Hell! I guess I might as well dig around while I'm here." We do have to make choices about where we dig, but we seem to get to choose our own particular holes about as well as seeds scattered by the winds. (Though, from time to time, we can try to ride the winds to another hole.)

I suppose that I just find it amusing that life is stochastic from the molecular level all the way up to our own grand plans. Like each of our cells, we may as well just deal with the cards we're dealt as best we can. For everything else... well, "Cast Your Fate to the Wind".

Monday, March 16, 2009

If a tree falls in a random forest: a summary of Chen and Jeong, 2009

I had to write a summary for a paper, "Sequence-based prediction of protein interaction sites with an integrative method" by Xue-wen Chen and Jong Cheol Jeong[1], for my Problem Solving in Bioinformatics course. I thought I'd share the review here on my blog, in case anybody finds it remotely useful. I doubt anyone will, but it's my blog, so there. Be forewarned, this is my, "Hey, buddy, I'm just a biologist" interpretation of their paper. If you spot any specious, misleading, or just plain incorrect statements, please, by all means, offer corrections.

Chen and Jeong have essentially found a method to apply a machine learning technique called random forests to predict specific binding sites on proteins given only the amino acid sequence with greater accuracy than previously existing methods. Identification of binding sites in proteins remains an important task for both basic and applied life sciences research, for these sites make possible the protein-protein and protein-ligand interactions from which phenotypes, and indeed, the properties of life emerge. These sites also serve as important drug targets for pharmaceutical research.

Traditionally, researchers have identified binding sites from in vivo or in vitro studies involving point mutations that affect phenotypes, as well as through analysis of protein structures as identified through protein crystallography. With the advent and continuous improvement of DNA sequencing technology, however, researchers contribute ever more knowledge in the form of amino acid sequence, rather than structures. Sequencing has rapidly outpaced crystallography, necessitating prediction of proteins' functional characteristics based solely on their amino acid sequence, which Chen and Jeong cite as the motivation behind research presented in this paper.

Previous efforts to infer binding sites purely from amino acid sequence used a different machine learning method called Support Vector Machine (SVM). I'm not entirely certain how SVMs operate, but like random forests, they require a training set of known binding sites and sites not involved in binding. One of the confounding factors about amino acid sequences when applied to machine learning methods like SVMs is that the residues are unevenly distributed between the two categories; in other words, few amino acids in a sequence (1 in 9 in the dataset used by Chen and Jeong) will sit at the interface of the protein and its ligand. Chen and Jeong chose to use random forests because they are robust against this bias in the data. This has to do with the way that random forests are constructed.

For constructing random forests, one must have a set of data. In Chen and Jeong's study, the set is comprised of amino acids belonging to 99 polypeptide chains—or chunks of proteins—culled from a protein-protein interaction set used in previous studies. One must also have a set of features, or measures, about each item in and the data set. In this study, there were 1050 features (as stored in vectors) for each amino acid, which fall into one of three categories: those measuring physical or chemical characteristics (e.g., hydrophobicity, isoelectric point, propensity—which is a fancy word for saying whether an amino acid is likely to be on the surface of a protein or buried deep within it), those measuring the amino acid's minimum distance to any other given amino acid along the sequence, and the position specific score matrix (PSSM), which has to do with how likely certain amino acid substitutions are likely to be at that point.

With this data set and features in hand, one feeds it to the random forest generator. To construct one random decision tree, follow a process like this:

Count the total number of known interface sites (we'll call these positives), and call this number N.
Count the number of features available, and call this number M.
Randomly select a subset of N sites out of the entire set with—and this is important—replacement. This solves the problem of the unbalanced data set. If I recall my statistics correctly (I don't) this has to do with each site now having equal chance at influencing the training.
Now we build the tree. We randomly select m features from the total M features, where m is a lot smaller than M. Then, of those m features, we choose the one which best splits the subset of sites. We continue to do this recursively until all sites have been "classified".
We repeat steps 1-4 to construct the number of desired trees (100 in this study), which gives us our "forest" of randomly generated trees.

With the random forest constructed, essentially you feed in an amino acid site into the random forest, then the site trickles down each tree, and each tree then "votes" as to whether or not it classified the site as an interaction site or not. A simple majority can be used to categorize the site, or more stringent criteria, such as "at least 5 votes are necessary to categorize the site as an interface site". Increasing the votes required improves the confidence at which one claims a site is an interaction site (specificity), but decreases the probability of detecting interaction sites (sensitivity).

Using these measures of sensitivity and specificity in conjunction with leave-one-out studies (one polypeptide sequence is used as the test case, and the other 98 are used as training data), Chen and Jeong demonstrated that their random forests approach performed significantly better than the SVM approach used by the earlier studies. They attribute this improved performance to two things: random forests are more robust to unbalanced data sets, and their approach considered many more features than the previous studies'. When they used only the features used in the previous studies, they found decreased performance, albeit still significantly better than the previous methods'. Chen and Jeong note that a major feature of random forests is that their accuracy increases, rather than decreases, when the number of features increases, due to the random sampling.

Chen and Jeong finished their study with a prediction of binding sites on the DnaK (or Hsp70 in eukaryotes) chaperone system. Their results corroborated with several in vivo studies of mutants where mutations near the sites they predicted yielded changes in phenotypes for both prokaryotic and eukaryotic forms. Their visualization of predicted interaction sites using 3d molecular modeling software provided additional support.

Xue-wen Chen and Jong Cheol Jeong, "Sequence-based prediction of protein interaction sites with an integrative method," Bioinformatics 25, no. 5 (March 1, 2009): 585-591, doi:10.1093/bioinformatics/btp039.

Sunday, March 15, 2009

Why Biopython needs to move to GitHub or Launchpad

Paulo Nuin wrote a spot on post about the ridiculousness that is Biopython still using CVS as its revision control system (a.k.a. source code management, or SCM), when we code in an era of arguably superior tools in the form of distributed SCMs (DSCMs). Please read his post if you haven't yet. Do not pass go. Do not collect $200. This post will be here for you when you get back.

I'll continue along the thread that Paulo started, in which one of the hangups that the Biopython community must overcome is: "Supposing we do switch to a DSCM, where do we host the code?" Until the Biopython project can decide on an answer to this question, the project won't move to anything.

Peter Cock seems sincerely determined that the code be hosted on the Open Bioinformatics Foundation (OBF) servers at Biopython.org. If I understand Peter's rationale correctly, the notion stems from the desire to maintain control of the code hosting. The alternative to self-hosting the code is to use one of the big players. I'm particularly referring to GitHub and Launchpad. GitHub and Launchpad host repositories of for the DSCMs Git and Bazaar, respectively, and provide a set of tools around these repositories to facilitate collaboration and interactions between the developers and their communities. Launchpad has the backing of Canonical, best known for managing the Ubuntu GNU/Linux distribution, and GitHub has the backing of the only group more rabid than the Python community—the Ruby community; hence, I refer to them as "the big players".

I respect Peter's legitimate concerns. I also really respect Peter, who is much more of a Biopythonista than I'll ever be, and I recognize it will take his blessing for the transition to Git or Bazaar to succeed. I dedicate this blog post to changing Peter's opinion and convincing him that hosting on GitHub or Launchpad is the best option available to us at the time.* Hopefully I'll convince a few other Biopython (or Bio-anything) Devs along the way, too. :-)

The following are my top five reasons for hosting Biopython on GitHub/Launchpad:

It's free. Yeah, okay, only "as in beer"**, but the Biopython source will, itself, remain open. The hosting is generously on someone else's dime, and that's all we need.
It already exists. I do not have technical experience nor interest in running my own webserver-based interface to either Bazaar or Git. From the recent discussions on the Biopython mailing list, I will guess nobody on the Biopython Dev team does or has the time to learn how to, either. Since the OBF staff are volunteers, helping us set these up won't be high on their priority list. Bazaar and Git don't even exist on the servers, yet. Launchpad and GitHub already have the tools in place. The amount of time the Biopython community has to spend setting up the projects here is pretty minimal and painless. In fact, it's already been done. Launchpad and GitHub are clearly very good at what they do. They have the experts, the redundancy, and the robustness to manage hosting code in a public space, and all the headaches that come with it, so that we don't have to.
They have established social networks. I'm already on GitHub and Launchpad. A lot of us are already on these sites, working on our own and other open source projects. These places let other people discover our work, and allow serendipitous connections to occur. "Hmm, this gal works on Biopython. What's that?" This doesn't occur at Biopython.org—people only go there when they know what they're looking for (and not many people are looking for "bioinformatics python"). Additionally, potential employers, co-workers, and employees are on these sites; not all of us will be (un)fortunate or content enough to stay in bioinformatics and computational biology forever.
Everybody else is doing it. Sure, right now, GitHub only hosts very minor, niche projects like Ruby on Rails, Cappuccino, and BioRuby (like that will ever go anywhere), and Launchpad has some lesser known ones like MySQL, Zope, and something called Ubuntu, but I hear that some major players will join these sites really soon! They do seem to be gaining in popularity very rapidly. ;-)
Vendor lock-in is just not an issue. There's some concern that using a third-party site such as GitHub or Launchpad will make the Biopython project vulnerable to possibly unreasonable whims of the owners of the sites. Terms and conditions could change unfavorably (e.g., "You have to pay to continue using our service."), or the service will go under. However, the OBF provides no more protection than Launchpad or GitHub, particularly for the latter scenario. When I think about who's least likely to run out of operating funding—the OBF, Launchpad, or GitHub—I'm not betting on OBF. But let's suppose that the uthinkable happens, and the site closes its doors to Biopython. So what? It's a distributed SCM; we have all of the code! This isn't CVS or Subversion, where a downed server takes all the revision history with it to the grave. We'll just set up shop somewhere else, point it towards our repositories, and sally on. We can burn that bridge when we get there; in the meantime, don't fret about it.

At this point, I'm sure there's more discussion to have. I just hope it's not too much, given that the transition to Subversion stalled tragically, which I take responsibility for. It would be nice to have this settled by May. I'd rather be fielding "How do I do this in Git/Bazaar?" than discussing "Why should I do this in Git/Bazaar?" My fingers are crossed, my hopes are high, and my stubornness is fiercer than two years ago.

* I'm excluding Mercurial and Bitbucket here because they haven't received consideration on the mailing list. They could be a great solution, but we're least familiar with them, and we have to narrow down the choices somehow. ** Okay, so Launchpad is going to be open sourced, but I don't want to be in charge of running an instance of it if nobody's going to pay me; see 2.

Wednesday, March 11, 2009

This is a stick up! Give me all your genomes!

This blog post is based on a previous entry of the same title I posted to FriendFeed. This post provides an extended explanation of what we're trying to accomplish.

I'm working with Prof. John Jelesko on a project for one of my courses in which he's investigating metabolic pathways in plants. At the heart of it, we need to set up a local database for running FASTA homology searches. The Jelesko lab wants this database to contain every amino acid sequence predicted in every currently available whole genome (assembled and annotated) available at NCBI, prokaryotic and eukaryotic. [Edit: We don't need every sequenced genome, actually, we only need a representative genome per organism. I hadn't previously considered that there may be more than one genome per organism. Thanks to Brad Chapman for pointing out the need for clarification.]

We have sequences from locations other than NCBI which we need to include in the FASTA search space; hence, we can't just run FASTA searches over NCBI data, which EBI's FASTA search might be able to otherwise do. This necessitates a local database. The Jelesko lab also needs the nucleotide sequence corresponding to the amino acid sequence, as well as the intron/exon locations for the longest available splicing. The questions are: is it feasible to store this amount of data in a database (we'll be using MySQL), and if so, how do we go about getting this data?

We're naïvely assuming it is feasible, so I'm attempting to figure out how to get at this data. The one file format that seems to store all information that we need in one place is the GenBank (GBK) format:

a gene ID
taxonomic classification of the organism from which the gene came
start and stop positions for each exon
the translated amino acid sequence

It seems that in one shape or another, these GenBank format files are available from NCBI's FTP site. While the GBK files for the prokaryotic genomes are relatively easy to get in one fell swoop at ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.gbk.tar.gz. For good ol' eukaryotic genomes, however, the data is all over the place. Sometimes it's stored as gzipped files in CHR folders, while other times, the files aren't compressed, and still other times, the directory is really just a container for directories that have the genome data. In short, it's a mess, especially when we consider we want to automate the retrieval of this data, not to mention want to update it periodically, should NCBI deposit new data.

There's also the dilemma of not actually needing most of the data (the genome sequence) contained in the GBK files—we just need the sequence covering start to stop for translation, including intronic sequence for the mRNA. I can write a hack of a Python script to trudge through the FTP directories and yank any GBK (compressed or otherwise) to local disk, but it seems like a big waste of bandwidth and local disk space. It seems like there must be better ways [Doesn't it always?], but I don't have the knowledge of NCBI's services to identify what these might be. If you have any ideas, please share! Meanwhile, I think I'll try contacting NCBI and see if they might point me in the right direction. I'll report back on what we decide to use, which could be my FTP hack given our limited time for this project.

Update: I've received suggestions on the FriendFeed entry for this blog post worth checking out.

Sunday, January 25, 2009

FriendFeed PyAPI, or "What I did over winter break"

In mid-December 2008 I made the decision to go dark and execute on a solid coding project I could sink my teeth into. On January 13, 2009, I emerged with the fruits of a lot of labor of the fingertips: a fully fledged Python interface library to the FriendFeed API, suitably named FriendFeed PyAPI.

This library began from the original Python code available from the FriendFeed Google Code repository. This library provided a great basis, as it showed me how to implement method calls to the FriendFeed API, as well as contained code necessary for authentication which I wouldn't possibly have known how to write. The original library returned native Python structures of dictionaries, lists, strings, integers, and the like, parsed out using one of several available JSON parsing libraries which may be available on the systems. While this worked well enough, I saw a chance to really improve the library by having it work with and return full fledged data structures to represent the various FriendFeed entities, including users, rooms, entries, comments, and media. Calvin Spealman, a.k.a. ironfroggy, asked me two shrewd questions: 1) Wasn't I just creating an ORM [object-relational mapping]? 2) Why would I do that? The answer to 1) was "Yes". My answer to 2) was, essentially, "Because I want to." Calvin understood what I now know about undertaking the process: it takes a lot of time doing grunt work coding to create an ORM. I had experience using the object-oriented Python interface to the Twitter API for another project for Bertalan Meskó, and I really enjoyed the "feel" of that library, and so I made it my goal to bring the same kind of feel to the FriendFeed library. The result was an expansion and refactoring of the original library of 812 lines of code to nearly 4,000 lines, 45 unit tests, 8 entity classes, about a dozen exceptions, and support for nearly all the API calls available.

I think the real joy for me came from creating methods to parse the JSON structures recursively and instantiate appropriate objects at each depth. These objects are then appropriately set as attributes of their parent objects (that is, the objects they "belong to"). All of this is done quite simply with a mapping scheme of entity names to methods (e.g. mapping the key 'users' to the method _parse_users), and it feels quite elegant having it all work together, calling the appropriate parsing method for each structure, and returning beautiful little self-documented class instances. Witnessing it work in concert for the first time was definitely a "blinking LED moment," as my friend Ian Firkin would say.

Perhaps the most important lesson came not from the specific technical hurdles I made my way through, but from the personal insight that I absolutely love programming. I love writing code; I love to talk about writing code; and I really love interacting with other developers. Over the course of the couple of weeks, I consulted Stack Overflow, hit up #python on IRC, and had direct email exchanges with Ben Golub at FriendFeed (who, by the way, is an absolutely stand-up developer and a fantastic representative for the young service). I have a genuine sense of satisfaction from the code and documentation I produced for the project, and that feeling makes for a happier life more than any other currency (except, possibly, beer).

So what now? Well, I released FriendFeed PyAPI under the same Apache License (Version 2) that FriendFeed released the original library under. This means you may fork it, play with it, and modify it to your heart's content, and if you care to, let me know what improvements you've made so I can merge them back into the trunk branch. (Of course, you may also keep any and all modifications to yourself, in your quest for world domination, though you'll still have to attribute FriendFeed and me as taking a part in your doomsday device.) [Edit: On second thought, please don't attribute me in those events.] I also have a list of future directions, and a few ideas of my own, including the one that actually spurned this spurt of code-writing, that I look forward to releasing upon FriendFeeders. So go out and use it! Ask questions about it! Most importantly, please report bugs!

Class attributes and scoping in Python, Part 1

This latest post comes courtesy of Hari Jayaram, one of those people who's on my "I'd like to meet" list. Hari asks, [paraphrased] "Does Python treat class variables as having an instance scope, while at the same time treat class lists and class dictionaries as having class scope?"

Let's use a simple example to illustrate some gotchas with regards to class and instance attributes. We'll begin by coding up a simple class with a couple of reporter functions to help us out later on; right now I'd just like to draw your attention to class Foo's two attributes of interest: class_attr which shall represent our class attribute, and instance_attr which—you guessed it—represents our instance attribute.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import pprint

class Foo(object):
    class_attr = 0

    def __init__(self, item):
        self.instance_attr = item


    def report(self):
        print "My 'class_attr' is: %s" % self.class_attr
        print "My '__class__.class_attr' is: %s" % \
            self.__class__.class_attr
        print "My 'instance_attr' is: %s" % self.instance_attr


    def print_self_dict(self):
        pprint.pprint(self.__dict__)


    def change_class_attr(self, item):
        self.__class__.class_attr = item

Alright, let's throw this puppy into the interactive interpreter and play with it a bit. We'll start off by creating two instances and checking them out.

>>> from foo import Foo
>>> a = foo.Foo('a')
>>> b = foo.Foo('b')
>>> a.report()
My 'class_attr' is: 0
My '__class__.class_attr' is: 0
My 'instance_attr' is: a
>>> b.report()
My 'class_attr' is: 0
My '__class__.class_attr' is: 0
My 'instance_attr' is: b

So right now we see that a and b share the same class attribute value of 0 but different instance attribute values of 'a' and 'b', respectively. Now, let's attempt change the class variable:

>>> a.class_attr = 1
>>> print a.class_attr
1
>>> print b.class_attr
0

Wait, a has our expected value for the class attribute, but instance b doesn't. That doesn't make sense; it's a class attribute after all! Let's take a closer look at the internals, though:

>>> a.report()
My 'class_attr' is: 1
My '__class__.class_attr' is: 0
My 'instance_attr' is: a
>>> b.report()
My 'class_attr' is: 0
My '__class__.class_attr' is: 0
My 'instance_attr' is: b

Notice the discrepancy between the reported values for self.class_attr and self.__class__.class_attr for a? Huh. It looks as if Python actually made the assignment of the new value to an instance variable of the name class_attr rather than assign the value to the Foo class's class_attr. We can take a look at the instance and class dictionaries to help extricate this. First, let's compare the internal dictionaries of a and b.

>>> a.print_self_dict()
{'class_attr': 1, 'instance_attr': 'a'}
>>> b.print_self_dict()
{'instance_attr': 'b'}

Ha! Python, we've found you out! We now can see, indeed, Python made the new value assignment to a brand new instance variable in a called (deceptively, in our deceptive case) class_attr.

Now let's explore how to actually convince Python to do what we meant to do: reassign the class variable. Let's get a clean slate.

>>> a = foo.Foo('a')
>>> b = foo.Foo('b')
>>> a.report()
My 'class_attr' is: 0
My '__class__.class_attr' is: 0
My 'instance_attr' is: a
>>> b.report()
My 'class_attr' is: 0
My '__class__.class_attr' is: 0
My 'instance_attr' is: b

One generic means by which we can reassign the class variable is to directly assign it via the class, rather via an instance of the class.

>>> Foo.class_attr = 1
>>> a.report()
My 'class_attr' is: 1
My '__class__.class_attr' is: 1
My 'instance_attr' is: a
>>> b.report()
My 'class_attr' is: 1
My '__class__.class_attr' is: 1
My 'instance_attr' is: b
>>> a.print_self_dict()
{'instance_attr': 'a'}
>>> b.print_self_dict()
{'instance_attr': 'b'}

That worked a treat. But often in production code, we don't want to tie the fate of a class variable assignment to a hard-coded class name somewhere in some file, soon to break when we refactor our code and give the class a new name. This is where using the special variable __class__ comes in handy. Take another look at the method change_class_attr().

    def change_class_attr(self, item):
        self.__class__.class_attr = item

This uses the instance's inherent knowledge of what class it belongs to (accessed via __class__) to make the necessary assignment to the class variable. So, we see, this also works:

>>> a.change_class_attr(2)
>>> a.report()
My 'class_attr' is: 2
My '__class__.class_attr' is: 2
My 'instance_attr' is: a
>>> b.report()
My 'class_attr' is: 2
My '__class__.class_attr' is: 2
My 'instance_attr' is: b

There's an important caveat here: this method, too, is fragile for sub-classes. For example, let's create a sub-class of Foo called Bar, and an instance c.

>>> class Bar(Foo):
...     pass
...
>>> c = Bar('c')
>>> c.report()
My 'class_attr' is: 2
My '__class__.class_attr' is: 2
My 'instance_attr' is: c

Now let's observe what happens when we assign a new value to the class variable via c's change_class_attr().

>>> c.change_class_attr(3)
>>> c.report()
My 'class_attr' is: 3
My '__class__.class_attr' is: 3
My 'instance_attr' is: c

All's well, but notice this only affected the Bar class's class_attr, not the Foo class's:

>>> a.report()
My 'class_attr' is: 2
My '__class__.class_attr' is: 2
My 'instance_attr' is: a
>>> print Foo.class_attr
2

Failing to make note of this can come back to bite Python programmers in the tail. For example, you may use a class attribute to keep track of the number of instances of that class. If you would like to keep track of compatible sub-class instances, too, however, the __class__ trick will prove insufficient; a hard-coded class name would prove more suitable. Use this knowledge to make the right decision for your particular scenario.

In the next part, I'll be covering an even more interesting scoping question dealing with lists and other mutables as class variables.

Friday, January 23, 2009

Tab-completion and history in the Python interpreter

I usually use IPython as my interactive Python interpreter, but it has problems with Unicode decoding which can have detrimental effects for times when I need to deal with Unicode (such as when I'm working with FriendFeed PyAPI). When complaining about this on #python, one of the user told me I should use the standard Python interpreter anyway. When I told him I did not use the standard interpreter because I loved the convenience of tab-completion in the IPython shell, he informed me that, indeed, the standard interactive interpreter can do auto-complete.

After some Googling, I came upon this blog post. I wound up using a modified solution posted in the comments. Here's my .pythonrc file:

import atexit
import os.path

try:
   import readline
except ImportError:
   pass
else:
   import rlcompleter

   class IrlCompleter(rlcompleter.Completer):
       """
       This class enables a "tab" insertion if there's no text for
       completion.

       The default "tab" is four spaces. You can initialize with '\t' as
       the tab if you wish to use a genuine tab.

       """

       def __init__(self, tab='    '):
           self.tab = tab
           rlcompleter.Completer.__init__(self)


       def complete(self, text, state):
           if text == '':
               readline.insert_text(self.tab)
               return None
           else:
               return rlcompleter.Completer.complete(self,text,state)


   #you could change this line to bind another key instead tab.
   readline.parse_and_bind('tab: complete')
   readline.set_completer(IrlCompleter().complete)


# Restore our command-line history, and save it when Python exits.
history_path = os.path.expanduser('~/.pyhistory')
if os.path.isfile(history_path):
   readline.read_history_file(history_path)
atexit.register(lambda x=history_path: readline.write_history_file(x))

I then added the following line to my .bashrc:

export PYTHONSTARTUP="$HOME/.pythonrc"

Now I can remain a happy camper using the native interactive interpreter.

Update (2008-1-25): Thanks to Bob Erb's comments, I corrected some poor indentation (whoops!) and also added the final lines to remove the atexit and os.path modules from the main namespace.

Update (2009-4-18): I removed the deletion of atexit and os.path from the main namespace. That seemed to wreck any script that needed either of those; quite a few scripts rely on os.path, in particular.

It's Not Easy Being Genes