It's Not Easy Being Genes: August 2008

Tuesday, August 19, 2008

Update of delicious.com bulk for-tagger

While waiting for BLAST results this afternoon, I pushed an update to a script that lets you bulk for-tag Delicious bookmarks to friends in your network. You can obtain it from the link above, or with Subversion.

svn export https://gotgenes.com/svn/toolbox/trunk/delicious-bulk-for.py

This version brings in a little more privacy. Your account information should now be stored in a profile file (~/.dbf_profile by default) rather than be explicitly entered on the command line, which exposed your password to anyone looking over your shoulder. It was also a chance for me to get a little more acquainted with the Python standard library's ConfigParser. Additionally, it has an honest-to-goodness CLI parser now, powered by one of my favorite Python stdlib modules, optparse.
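For the curious, the profile-plus-options pattern described above can be sketched roughly as follows. This is not the script's actual code: the `[account]` section, the `username` and `password` keys, and the `--profile` flag are all assumptions made up for illustration, and the sketch uses the Python 3 module name `configparser` (the module was called `ConfigParser` in Python 2).

```python
import configparser
import optparse
import os

# Default profile location mentioned in the post.
DEFAULT_PROFILE = os.path.expanduser("~/.dbf_profile")


def read_credentials(profile_path):
    """Return (username, password) from an INI-style profile file.

    The [account] section and its key names are hypothetical.
    """
    # Disable interpolation so a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    # read() skips files it cannot open and returns the list it parsed.
    if not parser.read(profile_path):
        raise IOError("could not read profile file: %s" % profile_path)
    return (parser.get("account", "username"),
            parser.get("account", "password"))


def make_option_parser():
    """Build a command-line parser; the --profile flag is hypothetical."""
    parser = optparse.OptionParser(
        usage="%prog [options] TAG FRIEND [FRIEND ...]")
    parser.add_option(
        "-p", "--profile", default=DEFAULT_PROFILE,
        help="path to the profile file [default: %default]")
    return parser
```

A caller would run `options, args = make_option_parser().parse_args()` and then `read_credentials(options.profile)`, so the password never appears on the command line or in the shell history.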

If you happen to use it, let me know if you have any problems. I have written zero unit tests for the script, so I consider the code fragile, but it improves as more and more people ask me for it. So far, though, the new version works for me.

(As a side note, I really miss the old del.icio.us name. Foo on delicious.com.)

Friday, August 8, 2008

Not the Biopythonista I thought I'd be

I had thought at this point in my graduate career that I would not only have a firm grasp of the Biopython library, but also have become a primary contributor to the project. Like many aspirations I've had throughout my life, these proved naïve, but I felt they warranted reflection.

Most of you who know me online or professionally also know me as a Python advocate. Only some of you know of my darker days programming in Perl. Through patient proselytizing by Rob Waldo (now Ph.D.), a friend of a lab mate at the University of Georgia Dept. of Microbiology, I rose up from my Perl-ish ways. Most of the reasons I love Python stem from the language's foundational philosophy, which I find wholly agreeable; indeed, my fanaticism for Python goes so far as for me to make a claim my friends often rib me about: Python is "arguably superior".

As a general computing language, I stand firmly by that claim. As a language for bioinformatics, however, I cannot. By far, Perl has a nearly immovable foothold as the de facto language in the field I hope to have a career in. In the beginning of a field dominated by text manipulation, Perl seemed the perfect choice to trailblazers. As more people moved into the field, more used Perl, because that's what the gal next door said she used. This positive feedback loop created a critical mass community that gave rise to the Bioperl library. Other language communities spawned their own Bio* projects, though none of them approach the size, completeness, and popularity of Bioperl, in large part due to the Catch-22 of not having the size, completeness, and popularity of Bioperl.

Making matters more challenging in the Python sector, a separate effort, Biology in Python (BIP), arose last August during the SciPy 2007 Conference, led by Titus Brown. The first BIP Birds of a Feather meeting at SciPy 2007 held that very few bioinformatician types use Biopython, instead preferring to roll their own parsers and libraries, out of ignorance or preference. (Read Titus's account of the BoF for more detail.) The BoF founded BIP with the intention of providing a unified community of bioinformatics and computational biology efforts in Python larger than a single project like Biopython. The meeting had talk over incorporating Biopython into its effort, but to me it still felt fractious: Biopython was a "them" and not an "us". Today, the two communities remain largely segregated and little communication exists between them.

While the Python efforts remain fragmented, those in the Perl community continue to snowball: more and more useful modules are added to Bioperl, which brings more users into the community, who write more and more useful modules, and so on. Jonathan Rockway recently wrote a pertinent blog post where the punchline emphasizes the importance of language libraries which prevent wheel reinvention and allow developers to really get things done.

So while Perl's popularity in bioinformatics irks me, I recognize it helps researchers get things done. I also recognize proselytizing won't bring a greater part of that mind share to Python. A Biopython as good as or better than Bioperl might. "You must be the change you wish to see in the world," said Mahatma Gandhi. In this vein, I had hoped to become an expert on Biopython, and to actually get down to writing the "missing code". Well, to date, I have yet to be the change I wish to see.

I have made one contribution to Biopython's code base over the past three years, and that was a one-line change in a testing module. I have filed and helped with only a handful of bug reports. Originally, I led an effort to migrate Biopython from CVS to Subversion, though this has remained stalled for a year and a half. As far as I can tell, only two steadfast programmers, Peter Cock and another developer who wishes to remain unmentioned, have kept this project going. To confirm my suspicions, I ran a CVS log on Biopython and perfunctorily scanned commits made in 2008. These two programmers made nearly all the commits, with Frank Kauff and Tiago Antão providing a handful of the remaining commits.

I subscribe to both the Biopython users mailing list and the developers list, but these days I check neither list's emails. Instead, my Gmail filters label and archive them, and there they sit until I get so intimidated by the number of unread threads that I simply "mark all as read". At this point, my involvement with the Biopython project remains minimal. I do not recall the last time I looked at the code base. I last pulled updates from the CVS repository five months ago, prior to pulling them during the course of writing this post. My lab mate, Andrew Warren, might even know more about Biopython than I do, at this point, and I turned him on to the package.

Ultimately, I have to ask, why have I not sat down and gone through the Biopython library as I intended to? Why have I not written the modules that seem missing from the library? The simplest answer is that I did not want to. Really, deep down, if I wanted to do it, I would have. After all, if there's anything I learned from Prof. Randy Pausch, it's that if you really want to make something happen, you can make it happen. This doesn't really address the question, though, since I've established that at some level, I certainly intended to get my hands dirty with Biopython. I think I have not worked with Biopython because I am not encouraged to do so, and am actually discouraged, because of research and the current culture of academia.

My research project only uses the SeqIO module to parse through GenBank and FASTA boilerplate and get to the sequence information I need to use. I make far more use of the NetworkX graph library for the bulk of what my code does, and even then, it's in a limited manner. This makes efforts towards Biopython orthogonal, even antagonistic, to what will earn me praise from my adviser and committee (namely publications). Therein lies a terrible problem, because ultimately, a more user-friendly or powerful (or both) Biopython could benefit more people than my own research, but academia seems to offer little to no reward for such tool building.

So, I imagine I'll not make much headway into Biopython for the remainder (probably [hopefully?] three more years) of my graduate career, and that will be that. I can find no immediate resolution to this conundrum of feeling compelled to help a cause that most PIs would find a fool's pursuit. I'd love to hear from contributors and users of any of the Bio* projects on how they work in time to attend to and learn these libraries. Perhaps some research projects out there really do leverage these libraries, or perhaps the field of bioinformatics is just too vast and too devoid of overlap, rendering code reuse a pointless pursuit. Maybe it's really just a roll-your-own kind of world.

Wednesday, August 6, 2008

Tips for interviewing prospective bioinformatics graduate students

Interviewing a prospective bioinformatics graduate student? Here are a few tips.

1. Keep your lab's research web page up to date. The key to getting a good student comes before the interview begins.
2. Know a bit about the prospective student. You can expect the prospective student to know a bit about your work (so long as you fulfill tip 1). You should take an interest and know something about the prospective student you're interviewing. This sets a very positive tone for the interview and makes the student feel welcome and comfortable. I had excellent interviews with professors who took the time to read the dossier the application committee made of me. (If your application committee does not prepare such dossiers, make certain it does next year.) As an addendum, if you know a colleague at a prospective student's previous place of education, feel free to ask if the student knows the colleague, but do not expect them to say yes.
3. Talk about your research first. Sure, the student who is really motivated will have already read your research web page. (See 1.) Regardless of preparedness, however, most prospective students will still feel nervous going into your office and sitting with you for the first few minutes. Ease the tension and break the ice by talking about your own work before having them talk about theirs. If you allow that buffer time of several minutes for the prospective student to calm down, the interview will go much better for both of you. From personal experience, my best interviews started this way. I felt most welcome with professors who could also articulate exactly what areas they hoped a grad student could step in and help with.
4. Don't ask the prospective student to code a particular function in <insert favorite programming language>. A prospective student can take offense at such a request. It shows a distrust in the skills they claim to have and immediately puts the prospective student on the defensive. Also consider the pointlessness of the exercise, which places an overemphasis on coding skills. In my experience thus far, programming is but a small part of a bioinformatics Ph.D. It is a skill that can be improved with remedial courses and Q&A among fellow grad students. Instead, focus on asking questions that probe what really matters: the student's interest in learning and self-motivation. If the student has a particular interest in programming, then sure, let the conversation wander in that direction. (For instance, Prof. Paul Magwene and I had a kindred-spirit talk about our love of Python.) In general, though, take a student's word that they program in the languages they claim, and if they don't program at all, let them know that they'll need to take remedial courses to learn, and check that they react positively to this.
5. Ask the student about his or her research. The only way you'll really get a sense of the student's propensity for research is by asking about his or her previous experiences in it. The more specific your questions about the student's research, the better. (See 2.) An example of a terrible question would be, "What was the greatest challenge in your research project?" An example of a good question would be, "It sounds like you worked with several different phylogenetic models. Were the resulting trees similar, though?" The former question is too open-ended (much like my most hated question of "Where do you see yourself in five years?" [Note: Best answer, "In the mirror, of course."]). The latter is specific but still leaves room for the student to explain his or her findings.

I'd like to thank the Biogang on FriendFeed for inspiring this post.
