Friday, August 8, 2008

Not the Biopythonista I thought I'd be

I had thought at this point in my graduate career that I would not only have a firm grasp of the Biopython library, but also have become a primary contributor to the project. Like many aspirations I've had throughout my life, these proved naïve, but I felt they warranted reflection.

Most of you who know me online or professionally also know me as a Python advocate. Only some of you know of my darker days programming in Perl. Through patient proselytizing by Rob Waldo (now Ph.D.), a friend of a lab mate at the University of Georgia Dept. of Microbiology, I rose up from my Perl-ish ways. Most of the reasons I love Python stem from the language's foundational philosophy, which I find wholly agreeable; indeed, my fanaticism for Python goes so far that I make a claim my friends often rib me about: Python is "arguably superior".

As a general computing language, I stand firmly by that claim. As a language for bioinformatics, however, I cannot. By far, Perl has a nearly immovable foothold as the de facto language in the field I hope to have a career in. In the early days of a field dominated by text manipulation, Perl seemed the perfect choice to trailblazers. As more people moved into the field, more used Perl, because that's what the gal next door said she used. This positive feedback loop created a critical-mass community that gave rise to the Bioperl library. Other language communities spawned their own Bio* projects, though none of them approach the size, completeness, and popularity of Bioperl, in large part due to the Catch-22 of not having the size, completeness, and popularity of Bioperl.

Making matters more challenging in the Python sector, a separate effort, Biology in Python (BIP), arose last August during the SciPy 2007 Conference, led by Titus Brown. The first BIP Birds of a Feather meeting at SciPy 2007 concluded that very few bioinformatician types use Biopython, instead preferring to roll their own parsers and libraries, out of ignorance or preference. (Read Titus's account of the BoF for more detail.) The BoF founded BIP with the intention of providing a unified community of bioinformatics and computational biology efforts in Python larger than a single project like Biopython. The meeting included talk of incorporating Biopython into its effort, but to me it still felt fractious: Biopython was a "them" and not an "us". Today, the two communities remain largely segregated and little communication exists between them.

While the Python efforts remain fragmented, those in the Perl community continue to snowball: more and more useful modules are added to Bioperl, which brings more users into the community, who write more and more useful modules, and so on. Jonathan Rockway recently wrote a pertinent blog post whose punchline emphasizes the importance of language libraries that prevent wheel reinvention and allow developers to really get things done.

So while Perl's popularity in bioinformatics irks me, I recognize it helps researchers get things done. I also recognize proselytizing won't bring a greater part of that mind share to Python. A Biopython as good as or better than Bioperl might. "You must be the change you wish to see in the world," said Mahatma Gandhi. In this vein, I had hoped to become an expert on Biopython, and to actually get down to writing the "missing code". Well, to date, I have yet to be the change I wish to see.

I have made one contribution to Biopython's code base over the past three years, and that was a one-line change in a testing module. I have filed and helped with only a handful of bug reports. Originally, I led an effort to migrate Biopython from CVS to Subversion, though this has remained stalled for a year and a half. As far as I can tell, only two steadfast programmers, Peter Cock and another developer who wishes to remain unmentioned, have kept this project going. To confirm my suspicions, I ran a CVS log on Biopython and perfunctorily scanned commits made in 2008. These two programmers made nearly all the commits, with Frank Kauff and Tiago Antão providing a handful of the remaining commits.
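
For anyone who would rather tally than skim, here is a minimal sketch of that kind of check. It assumes the text output of `cvs log` has been saved to a string, and that each revision carries the usual `date: ...;  author: ...;` metadata line. The function name, the sample log, and the author names `alice` and `bob` are all made up for illustration; this is not the procedure used for the scan described above. Note that CVS records revisions per file, so one commit touching several files is counted several times.

```python
import re
from collections import Counter

# Matches the per-revision metadata line printed by `cvs log`, e.g.
# "date: 2008/03/12 10:11:12;  author: someone;  state: Exp;  lines: +3 -1"
# Both "/" and "-" date separators are accepted (an assumption to cover
# differing CVS versions).
REVISION_LINE = re.compile(
    r"^date: (\d{4})[/-]\d{2}[/-]\d{2}[^;]*;\s+author: ([^;]+);"
)


def count_revisions_by_author(log_text, year):
    """Count file revisions per author for the given year (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REVISION_LINE.match(line)
        if match and int(match.group(1)) == year:
            counts[match.group(2).strip()] += 1
    return counts


# Invented sample standing in for real `cvs log` output.
SAMPLE_LOG = """\
revision 1.3
date: 2008/03/12 10:11:12;  author: alice;  state: Exp;  lines: +3 -1
Fix parser edge case.
----------------------------
revision 1.2
date: 2008/01/05 09:00:00;  author: bob;  state: Exp;  lines: +10 -2
Add test.
----------------------------
revision 1.1
date: 2007/11/20 14:30:00;  author: alice;  state: Exp;
Initial revision
"""

if __name__ == "__main__":
    # Print authors from most to least active in the chosen year.
    for author, n in count_revisions_by_author(SAMPLE_LOG, 2008).most_common():
        print(author, n)
```

Counting revisions rather than true commits overstates the share of anyone who makes sweeping multi-file changes, so the result is best read as a rough picture of who is active rather than an exact measure.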

I subscribe to both the Biopython users mailing list and the developers list, but these days I check neither list's emails. Instead, my Gmail filters label and archive them, and there they sit until I get so intimidated by the number of unread threads that I simply "mark all as read". At this point, my involvement with the Biopython project remains minimal. I do not recall the last time I looked at the code base. I last pulled updates from the CVS repository five months ago, prior to pulling them during the course of writing this post. My lab mate, Andrew Warren, might even know more about Biopython than I do, at this point, and I turned him on to the package.

Ultimately, I have to ask, why have I not sat down and gone through the Biopython library as I intended to? Why have I not written the modules that seem missing from the library? The simplest answer is that I did not want to. Really, deep down, if I wanted to do it, I would have. After all, if there's anything I learned from Prof. Randy Pausch, it's that if you really want to make something happen, you can make it happen. This doesn't really address the question, though, since I've established that at some level, I certainly intended to get my hands dirty with Biopython. I think I have not worked with Biopython because I am not encouraged to do so, and am actually discouraged, by the demands of research and the current culture of academia.

My research project only uses the SeqIO module to parse through GenBank and FASTA boilerplate and get to the sequence information I need to use. I make far more use of the NetworkX graph library for the bulk of what my code does, and even then, it's in a limited manner. This makes effort spent on Biopython orthogonal (antagonistic, even) to what will earn me praise from my adviser and committee (namely publications). Therein lies a terrible problem, because ultimately, a more user-friendly or powerful (or both) Biopython could benefit more people than my own research, but academia seems to offer little to no reward for such tool building.

So, I imagine I'll not make much headway into Biopython for the remainder (probably [hopefully?] three more years) of my graduate career, and that will be that. I can find no immediate resolution to this conundrum of feeling compelled to help a cause that most PIs would find a fool's pursuit. I'd love to hear from contributors and users of any of the Bio* projects on how they work in time to attend to and learn these libraries. Perhaps some research projects out there really do leverage these libraries, or perhaps the field of bioinformatics is just too vast and too devoid of overlap, rendering code reuse a pointless pursuit. Maybe it's really just a roll-your-own kind of world.

6 comments:

  1. (some comments on FriendFeed, some here since I could have spammed up FriendFeed with too many comment boxes).

    I also contributed a small module once (an AAIndex parser, actually). It wasn't amazingly elegant or fully complete, but it did the job in a 'production setting' so it should have been okay. The mailing list email was overlooked by the core Biopython devs, and it was ignored. A few months later, someone came asking for an AAIndex parser. I emailed again, but by then Iddo had already written his own implementation, which he felt fitted the "biopython-style" better, and included that instead. I'm not miffed that my code wasn't included ... this is a strength of open source ... we had *two* implementations to choose from, and the better implementation was chosen (although I can't find an AAIndex parser in Biopython; I think it was dropped at some later stage). I guess I just feel like the project could take a few tips from Producing Open Source Software by Karl Fogel.

    It has crossed my mind in recent weeks that maybe a Biopython rejuvenation project would be a good idea, but I haven't been following the mailing list and, like you, I don't feel like I have time to do any of it properly alone, especially knowing that Biopython lacks the critical mass BioPerl has.

    One thing I'd like to do is survey which parts of Bioperl are actually most used (maybe just via Google Code search), and see if there are any key holes in Biopython that could be filled to help bring some parity. Are you up for a collaborative weekend of "busty work" sometime?

  2. Andrew, what's this about "busty" work :). Sorry, couldn't resist

  3. @Andrew Thanks so much for your thoughtful comments. That's interesting about your AAIndex parser not fitting the "Biopython style" because as far as I know, there are no style guidelines really set. And really, why should there be? Standardization is something that I think will come about after a library is established, but a working library is not going to be established by setting standards.

    A rejuvenation project is a good idea. Again, how can we do this? It takes time and it really would help knowing what is already in the library, and what isn't and is critically missing. It takes a lot of users probably to get an idea of both of those, particularly the latter. I had a thought of documenting a Biopython module a month. I wonder if this would be a feasible project.

    I think I will solicit the Biopython and Biology in Python lists for more input on this. While Titus Brown wrote a great blog post recently on the importance of execution, I think if I could get more of us Pythonistas at least talking about this and being in contact, that's at least a start. BIP may have a meeting at the upcoming SciPy conference, too, and maybe I can convince them to make discussing this stuff a priority there.

    Also, again, and as a side note, I'd like to bring up this conflict I feel in that by helping Biopython (or Biology in Python--whatever gets some cohesion in the Python bioinformatics community), I am doing more to progress science than by doing research in my own little thesis project, but I am only rewarded (read, kept on a stipend) by doing the latter. What would be fantastic is if grants existed for the purpose of building and maintaining these libraries and providing educational opportunities to life sciences. Maybe that's a pipe dream, but maybe someone with some influence on funding will read this and think this is a Good Idea™. :-)

    Also, thanks for mentioning the FriendFeed thread. I love the feedback coming through there, as well.

    @Deepak, I so was thinking the same thing. Haha!

  4. I had/have the same aspirations of being neck deep in Biopython by the closing days of my time as an undergrad in bioinformatics. I start my third year as an undergrad on Monday and still feel like I can't code my way out of a paper bag, at least not without a bunch of helper code. Looking back toward your days as an undergrad, what would you have done differently?

  5. I agree with the post above. I have had ideas about getting involved with the Biopython community, and my stumbling block is being unable to find anything meaningful to pursue. I am a newbie interested in developing my Biopython skills by learning about the project and contributing in any way I can. But I just cannot find anything I can do at a newbie level. It would be really nice if things could be assigned to people according to their skill sets. For one, I think Biopython has awful documentation. Someone really needs to work on that. But on exactly what? I wish someone would lay down guidelines so others can pick things up. I find that kind of community sense sadly lacking in Biopython. Just my 2 cents.

  6. I recognize myself in everything you said and in the comments above.

    I have tried to propose a template for tests and file parsers that could be used by newbies, but they don't seem to care.
    Too bad..

    - http://bugzilla.open-bio.org/show_bug.cgi?id=2749
