Thursday, December 18, 2008

Robust imports in Python, guaranteed fresh: how to import code for testing

UPDATE 2010-01-19: As captnswing pointed out, an alternative, and I should say more commonly used, method is to simply put the following before the import statements for your packages or modules, assuming you keep your tests in a subdirectory of your code:

import os.path
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), os.path.pardir))

Anyone who knows me knows I like unit tests. I mean, I really like unit tests. Like, if Mr. Software Engineering were to offer to betroth me to one of his daughters, I would ask him to betroth me to Miss Unit Test.

One thing that comes up when preparing tests in Python is, "Where the hell do I put them?" To this, my first answer is, "If you're willing and diligent enough to write them, you can put them damn well anywhere you please!" If that answer doesn't satisfy you, though, that's good, you're not alone. Python programmers have raised this topic on several forums, including recently on the Testing in Python mailing list and on Stack Overflow.

I'm a fan of the following method, which seems to have become dominant in the Python community. It's based on the following directory structure:

rootdir/
rootdir/mymodule.py
rootdir/tests/
rootdir/tests/mymodule_tests.py

We have a directory containing our module of interest, mymodule.py, and a module, mymodule_tests.py, containing our unit tests for it. We create a subdirectory, tests/, under the root directory of the project, rootdir/, and we place mymodule_tests.py under that directory so that its path is rootdir/tests/mymodule_tests.py.

We've got to import the module we want to test into the module containing the tests for it. The import statement works for all packages/modules currently on the import path, found in the list sys.path. Since the directory containing the script being run is placed at the front of sys.path by default, we can easily import any packages/modules that sit at the same level as our importing module. That takes the form of a simple import statement:

import mymodule

For the typical testing layout, though, this won't suffice. We'll get a big fat ImportError. This is because mymodule.py lives in rootdir/, above our testing module's rootdir/tests/ path. The next logical step, then, is to place rootdir/ in sys.path so that mymodule_tests.py can access mymodule.py. The initial thought for doing this is to add the directory above to sys.path using a relative path.

#!/usr/bin/env python
import os
import sys

sys.path.insert(0, os.pardir)

Unfortunately, this is fragile. If we run mymodule_tests.py from outside its own directory, this will break the path. Take the following script as an example:

#!/usr/bin/env python
# parpath.py: print the parent path
import os

print "parent directory:", os.path.abspath(os.pardir)

I place this script in /home/chris/development/playground/ and then run it from that directory:

[chris]─[@feathers]─[2495]─[15:35]──[~/development/playground]
$ python parpath.py
parent directory: /home/chris/development

When I run the script from the parent directory, however, my results differ.

[chris]─[@feathers]─[2496]─[15:36]──[~/development/playground]
$ cd ..

[chris]─[@feathers]─[2497]─[15:36]──[~/development]
$ python playground/parpath.py
parent directory: /home/chris

In the words of Austin Powers, "That's not right." Now instead of getting the directory I wanted (/home/chris/development), I get the one above it (/home/chris). This is because relative paths, whether in sys.path or handed to os.path.abspath, are resolved relative to where you executed the script, not relative to where the script itself lives. Phooey!

I used to just ignore this fragility and be very careful to run tests only from within the same directory as the test modules. Last night, however, I came across a robust solution by way of some Google Code Search Fu—specifically, while browsing test code for MoinMoin. It turns out the solution is a method along these lines:

path_of_exec = os.path.dirname(sys.argv[0])
parpath = os.path.join(path_of_exec, os.pardir)
sys.path.insert(0, os.path.abspath(parpath))

If we take a look at the first line, we see that it captures the script's path as it was given on the command line, and uses that to construct a robust path that knows where the module actually lives. The very first element of sys.argv is always whatever immediately follows python on the command line (or the script name itself, if executing directly via ./). In our examples, these would be parpath.py and playground/parpath.py, respectively. Running os.path.dirname on these gives '' and 'playground', respectively. By joining these to the parent directory, we get the desired effect.

#!/usr/bin/env python
# parpath.py

import os
import sys

print "parent path:", os.path.abspath(os.pardir)

path_of_exec = os.path.dirname(sys.argv[0])
print "execution path:", path_of_exec
parpath = os.path.abspath(os.path.join(path_of_exec, os.pardir))
print "true parent path:", parpath

This gives us the following results:

[chris]─[@feathers]─[2467]─[16:32]──[~/development/playground]
$ python parpath.py
parent path: /home/chris/development
execution path:
true parent path: /home/chris/development

[chris]─[@feathers]─[2467]─[16:32]──[~/development/playground]
$ cd ..

[chris]─[@feathers]─[2467]─[16:33]──[~/development]
$ python playground/parpath.py
parent path: /home/chris
execution path: playground
true parent path: /home/chris/development

Now we're cooking with the good sauce! Ultimately, you can create a shortened version which looks similar to the one from MoinMoin:

#!/usr/bin/env python
import os
import sys

parpath = os.path.join(os.path.dirname(sys.argv[0]), os.pardir)
sys.path.insert(0, os.path.abspath(parpath))
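
To make this concrete, here's a rough sketch of what rootdir/tests/mymodule_tests.py could look like with this trick in place. The names mymodule and some_function are just placeholders for whatever you're actually testing, and the expected value is made up.

#!/usr/bin/env python
# mymodule_tests.py: unit tests for mymodule
import os
import sys
import unittest

# put rootdir/ (the parent of tests/) at the front of the import path
parpath = os.path.join(os.path.dirname(sys.argv[0]), os.pardir)
sys.path.insert(0, os.path.abspath(parpath))

import mymodule

class MyModuleTests(unittest.TestCase):
    def test_some_function(self):
        # made-up expectation; substitute assertions for your own module
        self.assertEqual(mymodule.some_function(), 42)

if __name__ == '__main__':
    unittest.main()

Run it from rootdir/, from tests/, or from anywhere else, and the import resolves the same way.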

So now you, too, can enjoy a fine import from the comfort of your own ~, or anywhere else.

Thursday, December 11, 2008

Today, I quit

Today, I quit.

I had discussions with my advisor, the head of our Ph.D. program, a distinguished, experienced, disinterested professor, and my closest friends and colleagues. I re-read The Dip. I mulled over my thoughts. I made my decision. I parted from my research group.

I quit because I understood I have to change groups to get my Ph.D. The current situation did not work. It was a Cliff. I could not change the situation; I had to change situations. I saw the choice I had to make: I could squander time and energy—mine, my advisor's, my colleagues', the taxpayers', the world's—until I fell off a Cliff, or I could quit, and find a Dip where I will excel and flourish.

Today I quit.

I dedicate this post to the patience, understanding, advice, and aid of those who helped me make this decision. You have my deepest gratitude.

Wednesday, December 10, 2008

Stack Overflow: What's in it for the programmer?

Roderic Page made a recent post to FriendFeed about the website Stack Overflow that set off the little hamster-powered mechanical wheels within my brain on a question I have had since I first encountered the site: Why would a programmer expend the time and energy to answer questions there? What's in it for the programmer?

Stack Overflow, to me, is the latest in my encounters with developer help forums, others including the Cavern of COBOL on the Something Awful forums, the Python Tutor mailing list, and particularly the comp.lang newsgroups on Usenet (where I've received some of the greatest answers to my questions). That's not to mention all the channels I haunt on Freenode. Up until Roderic's post, however, I hadn't really considered how little thought I'd given to the economics behind answering someone else's programming question.

Stack Overflow offers one key feature the other help forums lack: reputation points—Slashdot karma for code monkeys. Before the days of Stack Overflow, you had to lurk for a while to figure out who the Steve Holdens and Alan Gaulds were, or to know you should be excited that the effbot answered one of your posts. You built credibility slowly by answering posts astutely and asking really interesting questions; yet your credibility really only extended to others who made the time to be "in the know".

The Stack Overflow model brings instant recognition of credibility by someone new to the place. This provides tangible incentive to stick with the community. You still have to pay your dues to get your credit: ask smart questions, write good answers. Now, though, you get to carry those contributions with you as a scout carries her sash of merit badges. Now the newbie can see your sash of merit badges and compare them to everyone else's sashes, and make valuable decisions based on social status that would have previously only been possible after months of lurking, which can save a lot of time for those who only scan answers.

Ultimately, though, I ask the question, "Where's the beef?" If you're a programmer, shouldn't you be... well... programming? If you have your own work to get done, why help do someone else's, for no pay? Is the other person's problem more intellectually stimulating than your own? If so, shouldn't you quit your job and spend the time finding yourself a more challenging one?

If I were hiring a programmer, found a potential hire's profile on Stack Overflow, and discovered they had accrued a lot of points, I'd be of two minds about it. On the upside, this programmer knows what she's talking about well enough to convince other programmers she knows what she's talking about; on the downside, this programmer spent a tremendous amount of time doing work that's not her own. Now, I don't hire programmers, and it's not clear I ever will, but as someone who would like to be hired for programming, I have these concerns on my mind.

It's worthwhile to compare Stack Overflow points to Launchpad or GitHub points. On Launchpad or GitHub, a programmer gains points by submitting patches, fixing bugs, and making commits to projects. On the surface, these feel like two different point systems, where the Launchpad/GitHub points actually mean more and would be seen as more productive. On re-examination, though, I don't feel confident I can defend contributions on these social development sites from the same critical questions I posed above about Stack Overflow and other programmer forums.

Supposing your job is to work on a piece of software tracked by Launchpad or GitHub, then your points really do indicate your productivity to a manager or potential employer. In cases where your work is hobbyist in nature, though, I think one could make the same argument for concern that I made for Stack Overflow.

I'll put out a few caveats to you lovely readers here: I consider myself nowhere near the paragon of the focused worker; in fact, staying on task is one of my greatest shortcomings, and the one I spend the most time working on. (Exhibit A: this blog post.) Also, I acknowledge that there is a certain indescribable joy in the act of community service: providing aid at a cost to yourself for the benefit of the receiver. And sometimes, you just have an itch to scratch. I like people who do community service, and I was raised to think it's a Good Thing.

I will continue to puzzle over programmer forums much the way economists puzzle over free open source software and contributions to Wikipedia. In the meantime, well, I'll probably pose this as a question on Stack Overflow.

Update: Just as an aside, the user with the highest reputation on Stack Overflow is also a bioinformatics grad student.

How tags compress semantics

I just had a simple, yet (personally) powerful revelation—a moment of grokness, if you will. While searching through my Delicious account for a bookmark to a TED talk to link to in another blog post, I came face to face with a predicament that made me really stop and think.

I began my search by using the tag "ted", with which I've tagged all TED talks I've bookmarked. I have 78 TED talks bookmarked. The bookmark entries for these posts have 280 distinct tags, 1024 words total in their bookmark title fields, and 1668 words total in the comment fields. The predicament is, do I look through 78 posts to find the one of interest, or do I instead look through the 280 tags?

Or is it? If we rephrase the question, we see I'm really asking, "Can I find what I'm looking for faster using 280 words, or 2692 words?" See, those 280 tag words actually represent a compression of the semantics (the meanings) of the 2692 descriptive words. I can quickly scan 280 tags to identify the closest to my concept, giving me a significantly more manageable subset of posts to scan in more detail.

Tags seemed straightforward and powerful before (from reading Clay Shirky's article on the power of tagging, for example), but it took this moment to really understand the power behind them, much like the "a ha!" moment of seeing binary search when you've always thought of search as linear.

Two side notes:

  • I'd like to thank the developers of pydelicious for providing me the software to extract those statistics about my Delicious tags.
  • It turns out the video I was looking for had the clip of interest removed due to copyright permissions, and so the real answer to the question was to Google it. Still, it was worth it for the thought.

Wednesday, November 19, 2008

Wanted: separation of personal and professional me


I subscribe to a number of online social services. I began using these services, particularly Twitter, for personal communication with friends I knew from meatspace (i.e., face-to-face interaction). Something interesting happened, however, when I stumbled upon Deepak Singh's blog and discovered he, too, was on Twitter. From there I traced through his network and became a subscriber to the tweets of a dozen other researchers, all posting notes about research in bioinformatics and the life sciences.

After the Dark Age of Twitter in the summer of 2008, when the site consistently suffered downtime and slow performance, I also joined FriendFeed after chatter about it from the bio Twitter gang. Once there, I discovered a nice stream of everyone's activities, most of which I find professionally interesting and relevant.

In fact, to a large degree, the people that I follow on FriendFeed keep their own streams extremely professional, albeit sometimes opinionated. I want these people to follow me, too. I want to use these services to build professional contacts. I want these people to see me as a potential employee/collaborator/expert. To convince them of this, I have to stay concerned with keeping my signal to noise ratio very high, and every personal matter I share moves that ratio in the wrong direction for these people.

On the other hand, my friends with whom I socialize typically don't have an interest in my professional pursuits. We share interests in hobbies, film, humor, and common emotional trials and triumphs. To my friends, silencing personal interaction removes their desire to keep in touch with me via these media, which, after all, I began using because they proved effective at communicating with them.

This presents me with a quandary. In meatspace socializing, I can be who I need to be for each person—the student, the friend, the musician—and I do it all under one identity. Each of these is a facet of the whole that is me. In online social networks, however, I cannot do these under one identity. I cannot distinguish between "music" me, "friend" me, and "Pythonista" me.

So here's my call for reform. Dear Twitter, FriendFeed, and any and all social sites: give me the ability to state the context of interest of each post. Give my subscribers the ability to filter which content they receive from me and how much or how little of it they wish to see. Let them mix, match, and mash up those subscriptions as they need to. Let me be me, and let everyone else see only the facets that interest them most.

Tuesday, October 28, 2008

Saying "When" to Literature

As an unpublished researcher, I tend to wonder: what proportion of the citations appearing in publications in life sciences journals (including bioinformatics and computational biology journals) have the authors typically read thoroughly? All of them? A quarter?

On a related note, how does one know how much of a paper to read? More importantly, how does one know how much effort to expend comprehending a publication? The law of diminishing returns holds in the pursuit of reading scientific literature as much as in any other, but when has one really exercised due diligence in reading a publication?

My continued difficulty reading scientific literature motivated these questions. I suspect I may have a reading disability, as even when reading for pleasure, I consume material at around half to a third of the speed of my bookish acquaintances. The possibility of this aside, however, I typically find the writing in scientific publications dense and abstruse. Some of this could come from attempting to read literature at the crossroads of biology, computer science, and statistics, subjects of which I possess some smattering but no authoritative comprehension. When I get stuck on a particular section, that usually means I've reached my limit of understanding that section, as the field of bioinformatics is so vast that even within my own research group, we tend to not read the same literature, preventing discussion and collaborative learning.

Often when reading a publication, I lament the inability to cull something immediately and practically useful from the time spent reading it, the way I can when reading a book on a programming language, a blog post, or a technical document. I wonder if perhaps I have wound up in the wrong field of work (an engineer's mind in a scientist's world) or have not chosen an appropriate topic of study.

I'd love to hear some anecdotal evidence and suggestions from researchers and other students out there in this crazy pursuit we call science.

Tuesday, August 19, 2008

Update of delicious.com bulk for-tagger

While waiting for BLAST results this afternoon, I pushed an update to a script that lets you bulk for-tag Delicious bookmarks to friends in your network. You can obtain it from the link above, or with Subversion.

svn export https://gotgenes.com/svn/toolbox/trunk/delicious-bulk-for.py

This version brings a little more privacy. Your account information is now stored in a profile file (~/.dbf_profile by default) rather than entered explicitly on the command line, which exposed your password to anyone looking over your shoulder. It was also a chance for me to get a little more acquainted with the Python standard library's ConfigParser. Additionally, the script has an honest-to-goodness CLI parser now, powered by one of my favorite Python stdlib modules, optparse.
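
For the curious, reading such a profile file with ConfigParser looks roughly like this. The file path matches the default above, but the section and option names here are purely illustrative, not necessarily what the script actually uses.

import os
from ConfigParser import SafeConfigParser

# hypothetical ~/.dbf_profile contents:
#
# [delicious]
# username = yourname
# password = yourpassword

PROFILE_PATH = os.path.expanduser('~/.dbf_profile')

config = SafeConfigParser()
config.read(PROFILE_PATH)
username = config.get('delicious', 'username')
password = config.get('delicious', 'password')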

If you happen to use it, let me know if you have any problems. I have written zero unit tests for the script, so I consider the code fragile, but it improves as more and more people ask me for it. So far, though, the new version works for me.

(As a side note, I really miss the old del.icio.us name. Foo on delicious.com.)

Friday, August 8, 2008

Not the Biopythonista I thought I'd be

I had thought that at this point in my graduate career I would not only have a firm grasp of the Biopython library, but would also have become a primary contributor to the project. Like many aspirations I've had throughout my life, these proved naïve, but I felt they warranted reflection.

Most of you who know me online or professionally also know me as a Python advocate. Only some of you know of my darker days programming in Perl. Through patient proselytizing by Rob Waldo (now Ph.D.), a friend of a lab mate at the University of Georgia Dept. of Microbiology, I rose up from my Perl-ish ways. Most of the reasons I love Python stem from the language's foundational philosophy, which I find wholly agreeable; indeed, my fanaticism for Python goes so far that I make a claim my friends often rib me about: Python is "arguably superior".

As a general computing language, I stand firmly by that claim. As a language for bioinformatics, however, I cannot. By far, Perl has a nearly immovable foothold as the de facto language of the field I hope to have a career in. In the beginning of a field dominated by text manipulation, Perl seemed the perfect choice to trailblazers. As more people moved into the field, more used Perl, because that's what the gal next door said she used. This positive feedback loop created a critical-mass community that gave rise to the Bioperl library. Other language communities spawned their own Bio* projects, though none of them approach the size, completeness, and popularity of Bioperl, in large part due to the Catch-22 of not having the size, completeness, and popularity of Bioperl.

Making matters more challenging in the Python sector, a separate effort, Biology in Python (BIP), arose last August during the SciPy 2007 Conference, led by Titus Brown. The first BIP Birds of a Feather meeting at SciPy 2007 found that very few bioinformatician types use Biopython, most instead preferring to roll their own parsers and libraries, whether out of ignorance or preference. (Read Titus's account of the BoF for more detail.) The BoF founded BIP with the intention of providing a unified community for bioinformatics and computational biology efforts in Python, larger than any single project like Biopython. The meeting included talk of incorporating Biopython into the effort, but to me it still felt fractious: Biopython was a "them" and not an "us". Today, the two communities remain largely segregated, and little communication exists between them.

While the Python efforts remain fragmented, those in the Perl community continue to snowball: more and more useful modules are added to Bioperl, which brings more users into the community, who write more and more useful modules, and so on. Jonathan Rockway recently wrote a pertinent blog post whose punchline emphasizes the importance of language libraries, which prevent wheel reinvention and allow developers to really get things done.

So while Perl's popularity in bioinformatics irks me, I recognize it helps researchers get things done. I also recognize proselytizing won't bring the greater part of that mind share to Python. A Biopython as good as or better than Bioperl might. "You must be the change you wish to see in the world," said Mahatma Gandhi. In this vein, I had hoped to become an expert on Biopython, and to actually get down to writing the "missing code". Well, to date, I have yet to be the change I wish to see.

I have made one contribution to Biopython's code base over the past three years, and that was a one-line change in a testing module. I have filed and helped with only a handful of bug reports. Originally, I led an effort to migrate Biopython from CVS to Subversion, though this has remained stalled for a year and a half. As far as I can tell, only two steadfast programmers, Peter Cock and another developer who wishes to remain unmentioned, have kept the project going. To confirm my suspicions, I ran a CVS log on Biopython and perfunctorily scanned the commits made in 2008. These two programmers made nearly all the commits, with Frank Kauff and Tiago Antão providing a handful of the remaining commits.

I subscribe to both the Biopython users mailing list and the developers list, but these days I check neither list's emails. Instead, my Gmail filters label and archive them, and there they sit until I get so intimidated by the number of unread threads that I simply "mark all as read". At this point, my involvement with the Biopython project remains minimal. I do not recall the last time I looked at the code base. I last pulled updates from the CVS repository five months ago, prior to pulling them during the course of writing this post. My lab mate, Andrew Warren, might even know more about Biopython than I do, at this point, and I turned him on to the package.

Ultimately, I have to ask: why have I not sat down and gone through the Biopython library as I intended to? Why have I not written the modules that seem missing from the library? The simplest answer is that I did not want to. Really, deep down, if I had wanted to do it, I would have. After all, if there's anything I learned from Prof. Randy Pausch, it's that if you really want to make something happen, you can make it happen. This doesn't really address the question, though, since I've established that at some level I certainly intended to get my hands dirty with Biopython. I think I have not worked with Biopython because I am not encouraged to do so, and am actually discouraged, by the demands of research and the current culture of academia.

My research project only uses SeqIO to parse through GenBank and FASTA boilerplate and get to the sequence information I need to use. I make far more use of the NetworkX graph library for the bulk of what my code does, and even then, in a limited manner. This puts efforts towards Biopython as orthogonal—antagonistic, even—to what will earn me praise from my adviser and committee (namely, publications). Therein lies a terrible problem, because ultimately, a more user-friendly or more powerful (or both) Biopython could benefit far more people than my own research could, but academia seems to offer little to no reward for such tool building.

So, I imagine I'll not make much headway into Biopython for the remainder (probably [hopefully?] three more years) of my graduate career, and that will be that. I can find no immediate resolution to this conundrum of feeling compelled to help a cause that most PI's would find a fool's pursuit. I'd love to hear from contributors and users of any of the Bio* projects on how they work in time to attend to and learn these libraries. Perhaps some research projects out there really do leverage these libraries, or perhaps the field of bioinformatics is just too vast and too devoid of overlap, rendering code reuse a pointless pursuit. Maybe it's really just a roll-your-own kind of world.

Wednesday, August 6, 2008

Tips for interviewing prospective bioinformatics graduate students

Interviewing a prospective bioinformatics graduate student? Here are a few tips.

  1. Keep your lab's research web page up to date. The key to getting a good student comes before the interview begins.
  2. Know a bit about the prospective student. You can expect the prospective student to know a bit about your work (so long as you fulfill tip 1). You should take an interest and know something about the prospective student you're interviewing. This sets a very positive tone to the interview and makes the student feel welcome and comfortable. I had excellent interviews with professors who took the time to read the dossier the application committee made of me. (If your application committee does not prepare such dossiers, make certain it does next year.) As an addendum, if you know a colleague at a prospective student's previous place of education, feel free to ask if the student knows the colleague, but do not expect them to say yes.
  3. Talk about your research first. Sure, the student who is really motivated will have already read your research web page. (See 1.) Regardless of preparedness, however, most prospective students will still feel nervous going into your office and sitting with you for the first few minutes. Ease the tension and break the ice by talking about your own work before having them talk about theirs. If you allow that buffer time of several minutes for the prospective student to calm down, the interview will go much better for both of you. From personal experience, my best interviews started this way. I felt most welcome by professors who could also articulate exactly what areas they hoped a grad student could step in and help with.
  4. Don't ask the prospective student to code a particular function in <insert favorite programming language>. A prospective student can take offense at such a request. It shows a distrust in the skills they claim to have and immediately puts the prospective student on the defensive. Also consider the pointlessness in the exercise, which places an overemphasis on coding skills. In my experience thus far, programming is but a small part of a bioinformatics Ph.D. It is a skill that can be improved with remedial courses and Q&A among fellow grad students. Instead, focus on asking questions that probe what really matters: the student's interest in learning and self-motivation. If the student has a particular interest in programming, then sure, let the conversation wander in that direction. (For instance, Prof. Paul Magwene and I had a kindred-spirit talk about our love of Python.) In general, though, take a student's word that they program in the languages they claim, and if they don't program at all let them know that they'll need to take remedial courses to learn and check that their reaction is positive to this.
  5. Ask the student about his or her research. The only way you'll really get a sense of the student's propensity for research is to ask about his or her previous experiences in it. The more specific your questions about the student's research, the better. (See 2.) An example of a terrible question would be, "What was the greatest challenge in your research project?" An example of a good question would be, "It sounds like you worked with several different phylogenetic models. Were the resulting trees similar, though?" The former question is too open-ended (much like my most hated question, "Where do you see yourself in five years?" [Note: best answer, "In the mirror, of course."]). The latter is specific but still leaves room for the student to explain his or her findings.

I'd like to thank the Biogang on FriendFeed for inspiring this post.

Monday, April 28, 2008

Broken wing: on upgrading to Ubuntu 8.04 Hardy Heron

This past weekend I upgraded from Ubuntu 7.10 (Gutsy Gibbon) to 8.04 (Hardy Heron). Right off the bat I encountered an unexpected hitch. Previously I had upgraded by using the Alternate ISO. This serves two purposes: 1) Normally on the days immediately following an Ubuntu release, the package servers are choked by users doing automatic upgrades, but the ISOs are available from bittorrents, which thrive in this type of situation. 2) I have multiple machines to upgrade, so it makes more sense to download once and upgrade many.

Unfortunately, my attempt to upgrade from the CD, and only the CD, failed on a package related to nvidia-glx, according to the logs, though according to the output presented directly to the user, the failure occurred because of "obsolete or locally installed packages, unofficial repositories, or [something else]". I had already moved all my third-party repositories out of sources.list.d. I then decided to allow the upgrade process to access the latest packages on the net, only to discover to my horror that I would still wind up downloading 500 MB of packages from the Ubuntu repositories, essentially negating having downloaded the 600 MB ISO. The download actually went fairly quickly, even over my DSL connection, but the installation and configuration took well over an hour.

After rebooting, I was greeted with the login screen, on which my laptop's trackpad caused the mouse pointer to go crazy any time I attempted to move it. (Note: this is probably due to my Xorg configuration, which is set up for both one and two monitors, but it pisses me off because it didn't do this before.) After logging in, compiz started fine and the desktop seemed to be peachy. The actual process of logging in seems to take just as long as it did in 7.10, and that blasted trackerd still crushes system performance immediately upon login, riding the hard drive harder than a bucking bronco as it indexes everything.

I did what I normally do upon finally getting control of my system: fire up Firefox. I was presented with a Firefox 3 prompt informing me that I had about a half dozen plugins incompatible with Firefox 3 beta 5, and that if I wanted to proceed, I would need to click the button that would disable them all. "No problem," I thought, "I'll just go back to Firefox 2." I did play around in Firefox 3 for a bit to see if it was as fast and slick as I had heard. Generally, I had to agree it was a good browsing experience, but it was utterly worthless with no support for critical addons like del.icio.us bookmarks, CookieSafe, and Zotero. When I had had my fun, I used aptitude to install Firefox 2 and fired it up. I quickly discovered that about the only thing left of Firefox 2 was its bookmarks. All of my cookie settings were gone, and no plugins were running on it at all. When I went to the Addons menu, I discovered I was unable to re-enable any of the plugins. After digging around the net through many searches, I came up empty-handed for a working solution and wound up having to delete my Firefox profile and start afresh with Firefox 2.

To skip ahead: I later learned that once Firefox 3 is launched, it will render a profile backwards-incompatible with Firefox 2. At least, this is the case for Ubuntu 8.04; according to Paulo Nuin, it is not the case for Windows. (Which probably grates on me more.) So for Ubuntu users running into this problem, back up what Firefox data you can (bookmarks, etc.), rm -rf your profile, and start again in Firefox 2. Then make sure to set Firefox 2 as your web browser in the Preferred Applications menu, and only launch from launchers connected to Firefox 2. In fact, you may want to apt-get remove Firefox 3 altogether.

After that, there was the problem that my fonts looked like trash in the gnome-terminal, and also for certain pages browsing with Firefox 2 (presumably those calling on system fonts because their font was not explicitly set). I double-checked that I had "Subpixel Smoothing" checked in the GNOME fonts dialogue in the Appearance application in the Preferences menu. Then I had to do some digging around the Ubuntu forums to find other people who had this issue. I came across some threads that suggested setting the DPI in the about:config of Firefox to 96, but this wasn't much of a help. I finally came upon this post that gave directions for reconfiguring the fonts using dpkg-reconfigure. Restarting Xorg and firing up the gnome-terminal, I was relieved to see that the fonts were finally rendering with antialiasing. I felt the terminal font still looked ugly--much taller than in 7.10--so I wound up setting it to Lucida Sans Typewriter.

Firefox still rendered terribly for some pages (really poor kerning), particularly on Twitter. The only other obvious thing to try would be to set the fonts. I almost without fail use defaults so I hated this option, but I set all the fonts to Bitstream fonts, restarted Firefox, and finally the troublesome pages were readable again. I breathed a sigh of relief.

A few peeves remained. For one, only one program could make use of the sound card at a time--no good! I went into the sound preferences and disabled ESD, but that still didn't do it, so I had to set all the sound devices from Autodetect to ALSA, which finally allowed for multiple sounds at once. Of course, not being content, I decided to fiddle around more, and instead re-activated ESD and set the devices to use ESD. This led to a horrible out-of-control spiral as gnome-sound-preferences proceeded to consume every bit of the 2 GB of RAM in my laptop and crawl out into the swap space, sending my system thrashing horribly until I could kill the process. I have no idea what that was about, but it was terrible. I immediately reverted everything to ALSA. Lastly, my compiz settings weren't entirely preserved; in particular, the hot corners I had set for the Scale plugin weren't available anymore, and it took me a while to figure out how to set them again, completely missing the fact that the little screen icons were indicating how I was supposed to set hot corners. In 7.10, there was just a simple dialogue box that said, in words, "Screen" or something like that. I'm sure this was changed to a picture of a screen for usability, but it had the opposite effect on me.

So this is where I stand with Hardy Heron today. I still consider Gutsy Gibbon the greatest release of Ubuntu, and Hardy Heron as a significant backslide. In fact, I can't believe Hardy is going to be a long-term support release, particularly in light of the Firefox 3 fiascos. How do they plan to support a beta version of a web browser for three years? Plugins such as Zotero already require newer versions of Firefox 3 because fixes have been implemented. What other plugins are waiting for Firefox 3 to solidify before converting? I dunno, it just seems a poor choice to have that as their default browser, given that the web browser is now the most critical piece of user software.

I'm skipping this release for all my other machines and holding out for what I hope will be a much better release in 8.10. Best of luck to you other brave souls moving to this lame bird that is 8.04.

Wednesday, April 23, 2008

On the appropriate blogging software

As a meta-post following my reply to Thomas Upton, I'm thinking about how inappropriate a medium this blog is for publishing code. Hence, I posted the code to a pastebin, though I'm not sure how long it will live there. I think it goes to show that at some point I'm moving this blog off of Blogger and onto some software on my own site. Things that I must have in the blogging software are syntax highlighting, particularly for Python and C, and the ability to do LaTeX markup, since a lot of the research I do requires mathematical notation, and I'd like to be able to get on the open research train and whore out for ideas on how to improve my research.

I know Wordpress must have syntax-highlighting plugins, and I know it has a LaTeX plugin. But it's Wordpress, which seems to be an exploitfest unless you update it every five hours. I've dabbled a bit in Django and have thought about coding up a blog in it, but then I run into the wheel-reinvention issue, and I have to balance the sheer enjoyment of coding something against the pragmatic point of view of doing something productive. Of course, there must be ready-to-launch Django blog applications, and writing my own module to take a LaTeX-formatted string and convert it to a PNG via dvipng can't be hard. That looks like the most favorable option. I dunno. Thoughts?
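
To convince myself that last part really isn't hard, here's a rough sketch of the idea. It assumes latex and dvipng are installed and on the PATH, and it just shells out to them; the file names and wrapper function are made up for illustration.

#!/usr/bin/env python
# latex2png.py: render a LaTeX math string to a PNG via latex + dvipng
import os
import subprocess
import tempfile

TEMPLATE = r"""\documentclass{article}
\pagestyle{empty}
\begin{document}
$%s$
\end{document}
"""

def latex_to_png(formula, output_path, dpi=120):
    """Render a math-mode LaTeX formula to a PNG file."""
    workdir = tempfile.mkdtemp()
    tex_path = os.path.join(workdir, 'formula.tex')
    tex_file = open(tex_path, 'w')
    tex_file.write(TEMPLATE % formula)
    tex_file.close()
    # compile the .tex to a .dvi, then rasterize the .dvi to a tightly cropped PNG
    subprocess.check_call(['latex', '-interaction=nonstopmode',
                           '-output-directory', workdir, tex_path])
    subprocess.check_call(['dvipng', '-T', 'tight', '-D', str(dpi),
                           '-o', output_path,
                           os.path.join(workdir, 'formula.dvi')])

if __name__ == '__main__':
    latex_to_png(r'e^{i \pi} + 1 = 0', 'euler.png')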

The good news is that I have my blog RSS hooked up via FeedBurner, so (if and) when I do move the blog, nobody will have to change the address in their RSS reader--their syndication shall continue unbroken (much to their chagrin, I'm sure). I'm glad I found out about that service, and I recommend it to all bloggers as a Good Idea™.

Weather powered by Python

No, I haven't gone delusional in my ever-continuing worship of Python and exaggerated its capabilities. My friend and fellow hacker Thomas Upton posted a clever script to obtain weather forecasts from Yahoo! RSS feeds, specifically for use in GeekTool.

I wanted to point out to Thomas a couple of areas of improvement on his blog, but unfortunately, he doesn't seem to have commenting enabled on his Wordpress blog because he's paranoid that his site will get pwned by pr0n ads. (Rightly so, it's PHP software after all.) Hence, I'll post my response here.

I made a few revisions to Thomas's code. Namely, I ported it from the getopt Python standard library module, which is somewhat deprecated, to another handy stdlib module, optparse. I also removed the option for the two-day forecast and replaced it with an argument for the number of days to forecast. I may have changed an option name or two, as well.

You can see the expressiveness of optparse compared to getopt. The boilerplate is standard for optparse and saves you from having to roll your own for a lot of things, including help documentation, usage statements, and, most helpfully, parsing the CLI for errors.

The final thing of note is the move to string formatting as opposed to lots of string concatenation. String formatting is considered more Pythonic. Of course, string formatting is going to be a lot easier and prettier in Python 3.0, so this script will definitely have to get overhauled for that (though the 2to3 tool should take care of it). You'd have to do it anyway, though, because print will no longer be a statement but a function in Python 3.0.
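
As a quick illustration (with a made-up weather dictionary, not the script's actual data), compare concatenation with %-style formatting; both lines print the same thing:

weather = {'city': 'Blacksburg', 'current_temp': '54', 'units': 'F'}

# concatenation: every piece must already be a string, and the quoting gets noisy
print "It is " + weather['current_temp'] + weather['units'] + " in " + weather['city']

# %-style formatting: one template string, with values pulled from the dictionary by name
print "It is %(current_temp)s%(units)s in %(city)s" % weather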

Without further ado, you can have a look at my modifications at http://rafb.net/p/YWiyo548.html.

You can also apply a patch if you would like, which I supply here:

=== modified file 'weather.py'
--- weather.py  2008-04-23 20:11:38 +0000
+++ weather.py  2008-04-23 21:48:58 +0000
@@ -1,45 +1,67 @@
-#! /usr/bin/python
+#!/usr/bin/env python
+
+"""Fetches weather reports from Yahoo!"""

import sys
import urllib
-import getopt
+from optparse import OptionParser
from xml.dom.minidom import parse

+# Yahoo!'s limit on the number of days they will forecast for
+DAYS_LIMIT = 2
WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

-def get_weather(zip_code):
+def get_weather(zip_code, days):
+    """
+    Fetches weather report from Yahoo!
+
+    :Parameters:
+    -`zip_code`: A five digit US zip code.
+    -`days`: number of days to obtain forecasts for
+
+    :Returns:
+    -`weather_data`: a dictionary of weather data
+
+    """
+
# Get the correct weather url.
url = WEATHER_URL % zip_code

# Parse the XML feed.
dom = parse(urllib.urlopen(url))
-
+
# Get the units of the current feed.
yunits = dom.getElementsByTagNameNS(WEATHER_NS, 'units')[0]
-
+
# Get the location of the specified zip code.
ylocation = dom.getElementsByTagNameNS(WEATHER_NS, 'location')[0]
-
+
# Get the currrent conditions.
ycondition = dom.getElementsByTagNameNS(WEATHER_NS, 'condition')[0]

# Hold the forecast in a hash.
forecasts = []
-
+
# Walk the DOM in order to find the forecast nodes.
-    for node in dom.getElementsByTagNameNS(WEATHER_NS,'forecast'):
-  
-        # Insert the forecast into the forcast hash.
-        forecasts.append ({
-            'date': node.getAttribute('date'),
-            'low': node.getAttribute('low'),
-            'high': node.getAttribute('high'),
-            'condition': node.getAttribute('text')
-        })
-
-    # Return a hash of the weather that we just parsed.
-    return {
+    for i, node in enumerate(
+            dom.getElementsByTagNameNS(WEATHER_NS,'forecast')):
+        # stop if obtained forecasts for number of requested days
+        if i + 1 > days:
+            break
+        else:
+            # Insert the forecast into the forcast dictionary.
+            forecasts.append (
+                {
+                    'date': node.getAttribute('date'),
+                    'low': node.getAttribute('low'),
+                    'high': node.getAttribute('high'),
+                    'condition': node.getAttribute('text')
+                }
+            )
+
+    # Return a dictionary of the weather that we just parsed.
+    weather_data = {
   'current_condition': ycondition.getAttribute('text'),
   'current_temp': ycondition.getAttribute('temp'),
   'forecasts': forecasts,
@@ -47,87 +69,117 @@
   'city': ylocation.getAttribute('city'),
   'region': ylocation.getAttribute('region'),
}
-
-def usage():
-    print "Usage: weather.py [-cflv] zip-code\n"
-    print "-c\tSuppress the current weather.\n"
-    print "-f\tPrint the next two days' forecast.\n"
-    print "-l\tPrint the location of the weather.\n"
-    print "-v\tPrint headers for each weather section.\n"
-
-def main():
-    try:
-        # Attempt to get the command line arguments and options.
-        opts, args = getopt.getopt(sys.argv[1:], "cflv")
-
-    except GetoptError, err:
-
-        # Print the getopt() error.
-        print str(err)
-  
-        # Show the user the proper usage.
-        usage()
-  
-        # Exit with error code 2.
-        sys.exit(2)
-
+    return weather_data
+
+
+def create_report(weather_data, options):
+    """
+    Constructs a weather report as a string.
+
+    :Parameters:
+    -`weather_data`: a dictionary of weather data
+    -`options`: options to determine output selections
+
+    :Returns:
+    -`report_str`: a formatted string reporting weather
+
+    """
+
+    report = []
+    if options.location:
+        if options.verbose:
+            # Add location header
+            report.append("Location:")
+
+        # Add the location
+        locstr = "%(city)s, %(region)s\n" % weather_data
+        report.append(locstr)
+
+    if (not options.nocurr):
+        if options.verbose:
+            # Add current conditions header
+            report.append("Current conditions:")
+
+        # Add the current weather.
+        currstr = "%(current_temp)s%(units)s | %(current_condition)s \n" % weather_data
+        report.append(currstr)
+
+    if options.verbose:
+        # Add the forecast header.
+        report.append("Forecast:")
+
+    # Add the forecasts.
+    for forecast in weather_data['forecasts']:
+        forecast['units'] = weather_data['units']
+        forecast_str = """\
+%(date)s
+  Low: %(low)s%(units)s
+  High: %(high)s%(units)s
+  Condition: %(condition)s
+""" % forecast
+        report.append(forecast_str)
+
+    report_str = "\n".join(report)
+    return report_str
+
+
+def create_cli_parser():
+    """Creates command line interface parser."""
+
+    usage = (
+        "python %prog [OPTIONS] ZIPCODE DAYS",
+        __doc__,
+        """\
+Arguments:
+    ZIPCODE: The ZIP code for the region of interest.
+    DAYS: Days to forecast for (0 if only current conditions desired).
+"""
+    )
+    usage = "\n\n".join(usage)
+    cli_parser = OptionParser(usage)
+    # add CLI options
+    cli_parser.add_option('-c', '--nocurr', action='store_true',
+        help="Suppress reporting the current weather conditions"
+    )
+    cli_parser.add_option('-l', '--location', action='store_true',
+        help="Give the location of the weather"
+    )
+    cli_parser.add_option('-v', '--verbose', action='store_true',
+        help="Print the weather section headers"
+    )
+
+    return cli_parser
+
+
+def main(argv):
+
+    cli_parser = create_cli_parser()
+    opts, args = cli_parser.parse_args(argv)
+
# Check that an argument was passed.
-    if not args:
-  
-        # Show the user the proper usage.
-        usage()
-  
-        # Exit with error code 1.
-        sys.exit(1)
-
-    # Check that the first argument exists.
-    if args[0]:
-
-        # Get the weather.
-        weather = get_weather(args[0])
-
-        # Set the option flags
-        c = True
-        f = False
-        l = False
-        v = False
-
-        # Check the options
-        for o, a in opts:
-            if o == "-c":
-                c = False
-            elif o == "-f":
-                f = True
-            elif o == "-l":
-                l = True
-            elif o == "-v":
-                v = True
-      
-        if l and v:
-            # Print the location header.
-            print "Location:"
-
-        if l:
-            # Print the location
-            print weather['city'] + ", " + weather['region'] + "\n"
-  
-        if c and v:
-            # Print the current conditions header.
-            print "Current conditions:"
-
-        if c:
-            # Print the current weather.
-            print weather['current_temp'] + weather['units'] + " | " + weather['current_condition'] + "\n"
-
-        if f and v:
-            # Print the forecast header.
-            print "Forecast:"
-
-        if f:
-            # Print the forecast.
-            for forecast in weather['forecasts']:
-                print forecast['date'] + "\n  Low: " + forecast['low'] + weather['units'] + "\n  High: " + forecast['high'] + weather['units'] + "\n  Condition: " + forecast['condition'] + "\n"
-          
+    if len(args) != 2:
+        cli_parser.error("Not enough arguments supplied.")
+
+    zip_code = args[0]
+    if len(zip_code) != 5 or not zip_code.isdigit():
+        cli_parser.error("ZIP code must be 5 digits")
+
+    days = args[1]
+    if not days.isdigit():
+        cli_parser.error("Days to forecast must be an integer.")
+    days = int(days)
+    if days > DAYS_LIMIT:
+        cli_parser.error("Days to forecast must be within 0 to %d" %
+                DAYS_LIMIT)
+
+    # Get the weather forecast.
+    weather = get_weather(zip_code, days)
+
+    # Create the report.
+    report = create_report(weather, opts)
+
+    print report
+
+
if __name__ == "__main__":
-    main()
-
\ No newline at end of file
+    main(sys.argv[1:])

I'm going to hope Thomas doesn't sue me for breaking copyright or something, and suck money out of my massive salary that I receive as a grad student.

EDIT: One final comment. Thomas, your code looked very clean and you did a fantastic job adhering to PEP 8!

Friday, March 21, 2008

GenBank balks at third-party curation

Today's issue of Science has an interesting news item regarding a group of mycologists' call to open GenBank up to third-party curation, citing inaccuracy in up to 20% of the records. GenBank stands staunchly against allowing anyone but the primary authors to edit records.

"That we would wholesale start changing people's records goes against our idea of an archive," says David Lipman, director of the National Center for Biotechnology Information (NCBI), GenBank's home in Bethesda, Maryland. "It would be chaos."
Lipman's foreseeing "chaos" as a result of allowing third-party curation seems a greatly overstated opinion—a kneejerk reaction, if you will. I fail to see GenBank's point of view, which seems to value authority above all (the primary author knows her data best), whereas the point of view of Bidartondo et al. values truth above all (the wisdom of the community). Life science research stands the greatest chance with truth.

As the Science news article points out, the demand for third-party curation will only continue to grow. Given GenBank remains obstinate, the solution then will come from the community. This makes me wonder, how hard would it be to duplicate GenBank? Given that one can't apply for a grant to do so, where will funding come from? Will the duplication come in the form of many "yet another" databases (bottom-up) or a collective effort from a consortium of movers and shakers (top-down)?

On an unrelated note, I'd like to give a shout out to Deepak Singh, whose great blog posts have inspired me to try my hand at scientific blogging. Thanks, Deepak!