Sunday, March 15, 2009

Why Biopython needs to move to GitHub or Launchpad

Air hosting?

Paulo Nuin wrote a spot-on post about the ridiculousness that is Biopython still using CVS as its revision control system (a.k.a. source code management, or SCM), when we code in an era of arguably superior tools in the form of distributed SCMs (DSCMs). Please read his post if you haven't yet. Do not pass go. Do not collect $200. This post will be here for you when you get back.

I'll continue along the thread that Paulo started, in which one of the hangups that the Biopython community must overcome is: "Supposing we do switch to a DSCM, where do we host the code?" Until the Biopython project can decide on an answer to this question, the project won't move to anything.

Peter Cock seems sincerely determined that the code be hosted on the Open Bioinformatics Foundation (OBF) servers at Biopython.org. If I understand Peter's rationale correctly, the notion stems from the desire to maintain control of the code hosting. The alternative to self-hosting the code is to use one of the big players. I'm particularly referring to GitHub and Launchpad. GitHub and Launchpad host repositories for the DSCMs Git and Bazaar, respectively, and provide a set of tools around these repositories to facilitate collaboration and interactions between the developers and their communities. Launchpad has the backing of Canonical, best known for managing the Ubuntu GNU/Linux distribution, and GitHub has the backing of the only group more rabid than the Python community: the Ruby community. Hence, I refer to them as "the big players".

I respect Peter's legitimate concerns. I also really respect Peter, who is much more of a Biopythonista than I'll ever be, and I recognize it will take his blessing for the transition to Git or Bazaar to succeed. I dedicate this blog post to changing Peter's opinion and convincing him that hosting on GitHub or Launchpad is the best option available to us at this time.* Hopefully I'll convince a few other Biopython (or Bio-anything) Devs along the way, too. :-)

The following are my top five reasons for hosting Biopython on GitHub/Launchpad:

  1. It's free. Yeah, okay, only "as in beer"**, but the Biopython source will, itself, remain open. The hosting is generously on someone else's dime, and that's all we need.
  2. It already exists. I have neither the technical experience nor the interest to run my own webserver-based interface to either Bazaar or Git. From the recent discussions on the Biopython mailing list, I will guess nobody on the Biopython Dev team does, or has the time to learn how to, either. Since the OBF staff are volunteers, helping us set these up won't be high on their priority list. Bazaar and Git aren't even installed on the servers yet. Launchpad and GitHub already have the tools in place. The amount of time the Biopython community has to spend setting up the projects there is minimal, and the process is painless. In fact, it's already been done. Launchpad and GitHub are clearly very good at what they do. They have the experts, the redundancy, and the robustness to manage hosting code in a public space, and all the headaches that come with it, so that we don't have to.
  3. They have established social networks. I'm already on GitHub and Launchpad. A lot of us are already on these sites, working on our own and other open source projects. These places let other people discover our work, and allow serendipitous connections to occur. "Hmm, this gal works on Biopython. What's that?" This doesn't occur at Biopython.org—people only go there when they know what they're looking for (and not many people are looking for "bioinformatics python"). Additionally, potential employers, co-workers, and employees are on these sites; not all of us will be (un)fortunate or content enough to stay in bioinformatics and computational biology forever.
  4. Everybody else is doing it. Sure, right now, GitHub only hosts very minor, niche projects like Ruby on Rails, Cappuccino, and BioRuby (like that will ever go anywhere), and Launchpad has some lesser known ones like MySQL, Zope, and something called Ubuntu, but I hear that some major players will join these sites really soon! They do seem to be gaining in popularity very rapidly. ;-)
  5. Vendor lock-in is just not an issue. There's some concern that using a third-party site such as GitHub or Launchpad will make the Biopython project vulnerable to possibly unreasonable whims of the owners of the sites. Terms and conditions could change unfavorably (e.g., "You have to pay to continue using our service."), or the service could go under. However, the OBF provides no more protection than Launchpad or GitHub, particularly for the latter scenario. When I think about who's least likely to run out of operating funding (the OBF, Launchpad, or GitHub), I'm not betting on the OBF. But let's suppose that the unthinkable happens, and the site closes its doors to Biopython. So what? It's a distributed SCM; we have all of the code! This isn't CVS or Subversion, where a downed server takes all the revision history with it to the grave. We'll just set up shop somewhere else, point it towards our repositories, and sally on. We can burn that bridge when we get there; in the meantime, don't fret about it.
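
To make reason 5 concrete, here is a minimal sketch of what "setting up shop somewhere else" might look like if we went with Git. It assumes every developer's clone already holds the full history, that the remote is named `origin`, and that an empty repository is waiting at the new host. The function names and the example URL are hypothetical, made up for illustration; the Git subcommands themselves (`git remote set-url`, `git push --all`, `git push --tags`) are standard. By default it only prints the commands rather than running them.

```python
import subprocess


def relocation_commands(new_url, remote="origin"):
    """Return the git commands that re-home an existing clone on a new host.

    Hypothetical helper: `new_url` and `remote` are illustrative parameters.
    """
    return [
        # Point the existing remote name at the new hosting location.
        ["git", "remote", "set-url", remote, new_url],
        # Publish every local branch, with its full history, to the new host.
        ["git", "push", "--all", remote],
        # Publish the release tags as well.
        ["git", "push", "--tags", remote],
    ]


def relocate(repo_dir, new_url, remote="origin", dry_run=True):
    """Print the relocation commands, or run them inside repo_dir."""
    commands = relocation_commands(new_url, remote)
    for cmd in commands:
        if dry_run:
            # Dry run: show what would be executed without touching anything.
            print(" ".join(cmd))
        else:
            # Stop at the first failing command instead of pushing on.
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return commands
```

Calling `relocate(".", "git@example.org:biopython/biopython.git")` from inside a clone prints the three commands; passing `dry_run=False` would actually run them. Only branches that exist locally get pushed, so in practice whoever has the most complete clone would do this. Bazaar has its own equivalent workflow, which I won't attempt to sketch here.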

At this point, I'm sure there's more discussion to have. I just hope it's not too much, given that the transition to Subversion stalled tragically, which I take responsibility for. It would be nice to have this settled by May. I'd rather be fielding "How do I do this in Git/Bazaar?" than discussing "Why should I do this in Git/Bazaar?" My fingers are crossed, my hopes are high, and my stubbornness is fiercer than two years ago.

* I'm excluding Mercurial and Bitbucket here because they haven't received consideration on the mailing list. They could be a great solution, but we're least familiar with them, and we have to narrow down the choices somehow.

** Okay, so Launchpad is going to be open sourced, but I don't want to be in charge of running an instance of it if nobody's going to pay me; see 2.

3 comments:

  1. Hi :)

    I also wrote a similar blog post, there:
    - http://bioinfoblog.it/2009/02/biopython-looking-for-a-new-vcs/

    I have been trying the unofficial biopython migration on github, and I feel like there are some advantages.

    For example, if you need to create a customized copy of biopython (some people already did, for different reasons), then it will be easier to keep these copies updated with the current code.

    Moreover, if you want to propose something, like a new unittest, new module, or whatever, you can create an experimental branch, and show the code to other people.

    Have a look at this graph:
    - http://github.com/biopython/biopython/network
    You can see that I have created a branch called 'qualityscores-experimental' to test the changes proposed in a discussion on the dev mailing list. I think it is easier to discuss code if you can show working code to other people, and there are more advantages.

  2. The BioPython project has already been created in Launchpad, and it appears that we tried to import the cvs repo into a bzr branch (so you don't have to). If the cvs repository had true anonymous access, you could easily test drive Launchpad while still working with CVS by having Launchpad sync a Bazaar branch with the CVS repository, and then you can just merge from that, and continue working in bzr seamlessly.

    See also https://help.launchpad.net/Code/Imports
