Thursday, December 18, 2008

Robust imports in Python, guaranteed fresh: how to import code for testing

UPDATE 2010-01-19: As captnswing pointed out, an alternative (and, I should say, more commonly used) method is to simply put the following before the import statements for your packages or modules, assuming you keep your tests in a subdirectory of your code.

import os.path
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), os.path.pardir))
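For extra safety, the same idea can be hardened a bit. This variant is my own tweak, not captnswing's:

```python
import os.path
import sys

# abspath() makes the inserted entry absolute, so it keeps working even
# if a test later calls os.chdir(); it also copes with
# os.path.dirname(__file__) being '' when the script is run from its
# own directory.
script_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.join(script_dir, os.path.pardir))
```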

Anyone who knows me knows I like unit tests. I mean, I really like unit tests. Like, if Mr. Software Engineering were to offer to betroth me to one of his daughters, I would ask him to betroth me to Miss Unit Test.

One thing that comes up when preparing tests in Python is, "Where the hell do I put them?" To this, my first answer is, "If you're willing and diligent enough to write them, you can put them damn well anywhere you please!" If that answer doesn't satisfy you, though, that's good, you're not alone. Python programmers have raised this topic on several forums, including recently on the Testing in Python mailing list and on Stack Overflow.

I'm a fan of the following method, which seems to have become dominant in the Python community. It's based on the following directory structure:

rootdir/
rootdir/mymodule.py
rootdir/tests/
rootdir/tests/mymodule_tests.py

We have a directory containing our module of interest, mymodule.py, and a module, mymodule_tests.py, containing its unit tests. We create a subdirectory, tests/, under the root directory, rootdir/, of the project, and we place mymodule_tests.py there, so that its path is rootdir/tests/mymodule_tests.py.

We've got to import the module we want to test into the module containing its tests. The import statement works for any package or module on the import path, found in the list sys.path. Since Python puts the directory of the executed script on sys.path by default, we can easily import any packages/modules that sit alongside the importing module, with a simple import statement:

import mymodule

For the typical testing layout, though, this won't suffice. We'll get a big fat ImportError, because mymodule.py lives in rootdir/, above our testing module's rootdir/tests/ directory. The next logical step, then, is to put rootdir/ on sys.path so that mymodule_tests.py can find mymodule.py. The initial thought for doing this is to add the directory above using a relative path.

#!/usr/bin/env python
import os
import sys

sys.path.insert(0, os.pardir)

Unfortunately, this is fragile. If we run mymodule_tests.py from outside its own directory, this will break the path. Take the following script as an example:

#!/usr/bin/env python
# parpath.py: print the parent path
import os

print "parent directory:", os.path.abspath(os.pardir)

I place this script in the directory /home/chris/development/playground/ and then run it from there:

[chris]─[@feathers]─[2495]─[15:35]──[~/development/playground]
$ python parpath.py
parent directory: /home/chris/development

When I run the script from the parent directory, however, my results differ.

[chris]─[@feathers]─[2496]─[15:36]──[~/development/playground]
$ cd ..

[chris]─[@feathers]─[2497]─[15:36]──[~/development]
$ python playground/parpath.py
parent directory: /home/chris

In the words of Austin Powers, "That's not right." Instead of resolving os.pardir against the script's own directory (/home/chris/development/playground), Python resolved it against the directory I ran the script from, so I got /home/chris instead of the /home/chris/development I wanted. This is because relative paths, whether in sys.path or handed to os.path.abspath, are resolved against the current working directory, not against the directory where the script lives. Phooey!

I used to just ignore this fragility and be very careful to run tests only from within the same directory as the test modules. However, last night I came across a robust solution by way of some Google Code Search Fu—specifically, while browsing test code for MoinMoin. It turns out the solution is the following:

path_of_exec = os.path.dirname(sys.argv[0])
parpath = os.path.join(path_of_exec, os.pardir)
sys.path.insert(0, os.path.abspath(parpath))

If we take a look at the first line, we see that it captures sys.argv[0], the path used to invoke the script, and uses it to construct a path rooted at where the script actually lives. sys.argv[0] is always what immediately follows python on the command line (or the script path itself, if executing directly with ./). In our examples, these would be parpath.py and playground/parpath.py, respectively. Running os.path.dirname on these, we get '' and 'playground', respectively. By joining these to the parent directory, we get the desired effect.

#!/usr/bin/env python
# parpath.py

import os
import sys

print "parent path:", os.path.abspath(os.pardir)

path_of_exec = os.path.dirname(sys.argv[0])
print "execution path:", path_of_exec
parpath = os.path.abspath(os.path.join(path_of_exec, os.pardir))
print "true parent path:", parpath

This gives us the following results:

[chris]─[@feathers]─[2467]─[16:32]──[~/development/playground]
$ python parpath.py
parent path: /home/chris/development
execution path:
true parent path: /home/chris/development

[chris]─[@feathers]─[2467]─[16:32]──[~/development/playground]
$ cd ..

[chris]─[@feathers]─[2467]─[16:33]──[~/development]
$ python playground/parpath.py
parent path: /home/chris
execution path: playground
true parent path: /home/chris/development

Now we're cooking with the good sauce! Ultimately, you can create a shortened version which looks similar to the one from MoinMoin:

#!/usr/bin/env python
import os
import sys

parpath = os.path.join(os.path.dirname(sys.argv[0]), os.pardir)
sys.path.insert(0, os.path.abspath(parpath))
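To convince yourself the trick works, here's a self-contained sketch that fakes the whole layout in a temporary directory; rootdir, mymodule, and mymodule_tests.py are all throwaway stand-ins, not real files from this post:

```python
#!/usr/bin/env python
import os
import sys
import tempfile

# Build a throwaway rootdir/ with a module in it and an empty tests/
# subdirectory underneath.
rootdir = tempfile.mkdtemp()
os.mkdir(os.path.join(rootdir, 'tests'))
module_file = open(os.path.join(rootdir, 'mymodule.py'), 'w')
module_file.write('ANSWER = 42\n')
module_file.close()

# Pretend sys.argv[0] is the test script, as it would be when running
# "python rootdir/tests/mymodule_tests.py" from any directory at all.
fake_argv0 = os.path.join(rootdir, 'tests', 'mymodule_tests.py')
parpath = os.path.join(os.path.dirname(fake_argv0), os.pardir)
sys.path.insert(0, os.path.abspath(parpath))

import mymodule  # found via the freshly inserted rootdir/ entry
print(mymodule.ANSWER)
```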

So now you, too, can enjoy a fine import from the comfort of your own ~, or anywhere else.

Thursday, December 11, 2008

Today, I quit

Today, I quit.

I had discussions with my advisor, the head of our Ph.D. program, a distinguished, experienced, disinterested professor, and my closest friends and colleagues. I re-read The Dip. I mulled over my thoughts. I made my decision. I parted from my research group.

I quit because I understood I have to change groups to get my Ph.D. The current situation did not work. It was a Cliff. I could not change the situation; I had to change situations. I saw the choice I had to make: I could squander time and energy—mine, my advisor's, my colleagues', the taxpayers', the world's—until I fell off a Cliff, or I could quit, and find a Dip where I will excel and flourish.

Today I quit.

I dedicate this post to the patience, understanding, advice, and aid of those who helped me make this decision. You have my deepest gratitude.

Wednesday, December 10, 2008

Stack Overflow: What's in it for the programmer?

Roderic Page made a recent post to FriendFeed about the website Stack Overflow that set off the little hamster-powered mechanical wheels within my brain on a question I have had since I first encountered the site: Why would a programmer expend the time and energy to answer questions there? What's in it for the programmer?

Stack Overflow, to me, is the latest in my encounters with developer help forums, others including the Cavern of COBOL on the Something Awful forums, the Python Tutor mailing list, and particularly the comp.lang newsgroups on Usenet (where I've received some of the greatest answers to my questions). That's not to mention all the channels I haunt on Freenode. Up until Roderic's post, however, I hadn't really thought about the economics behind answering someone else's programming question.

Stack Overflow offers one key feature the other help forums lack: reputation points—Slashdot karma for code monkeys. Before the days of Stack Overflow, you had to lurk for a while to figure out who the Steve Holdens and Alan Gaulds were, or to know you should be excited that the effbot answered one of your posts. You built credibility slowly by answering posts astutely and asking really interesting questions, yet, your credibility really only extended to others who made the time to be "in the know".

The Stack Overflow model brings instant recognition of credibility to someone new to the place. This provides tangible incentive to stick with the community. You still have to pay your dues to get your credit: ask smart questions, write good answers. Now, though, you get to carry those contributions with you as a scout carries her sash of merit badges. A newbie can see your sash, compare it to everyone else's, and make decisions based on social status that previously would have required months of lurking, which saves a lot of time for those who only scan answers.

Ultimately, though, I ask the question, "Where's the beef?" If you're a programmer, shouldn't you be... well... programming? If you have your own work to get done, why help do someone else's, for no pay? Is the other person's problem more intellectually stimulating than your own? If so, shouldn't you quit your job and spend the time finding yourself a more challenging one?

If I were hiring a programmer, found a potential hire's profile on Stack Overflow, and discovered they had accrued a lot of points, I'd be of two minds about it. On the upside, this programmer knows what she's talking about well enough to convince other programmers of it; on the downside, this programmer spent a tremendous amount of time doing work that's not her own. Now, I don't hire programmers, and it's not clear I ever will, but as someone who would like to be hired for programming, I have these concerns on my mind.

It's worthwhile to compare Stack Overflow points to Launchpad or GitHub points. On Launchpad or GitHub, a programmer gains points by submitting patches, doing bugfixes, and making commits to projects. On the surface, I feel like these are two different point systems, where the Launchpad/GitHub points actually mean more and would be seen as more productive. Under re-examination, though, I don't feel confident I can defend contributions on these social development sites from the same critical questions I posed above about Stack Overflow and other programmer forums.

Supposing your job is to work on a piece of software tracked by Launchpad or GitHub, your points really do indicate your productivity to a manager or potential employer. Where your work is hobbyist in nature, though, I think one could make the same argument for concern that I made for Stack Overflow.

I'll put out a few caveats to you lovely readers here: I consider myself nowhere near the paragon of the focused worker, and in fact, staying on task is one of my greatest shortcomings, and the one I spend the most time working on. (Exhibit A: this blog post.) Also, I acknowledge that there is a certain indescribable joy in the act of community service: providing aid at a cost to yourself for the benefit of the receiver. And sometimes, you just have an itch to scratch. I like people who do community service, and I was raised to think it's a Good Thing.

I will continue to mull over programmer forums the way economists wonder about free open source software and contributions to Wikipedia. In the meantime, well, I'll probably pose this as a question on Stack Overflow.

Update: Just as an aside, the user with the highest reputation on Stack Overflow is also a bioinformatics grad student.

How tags compress semantics

I just had a simple, yet (personally) powerful revelation—a moment of grokness, if you will. While searching through my Delicious account for a bookmark to a TED talk to link to in another blog post, I came face to face with a predicament that made me really stop and think.

I began my search by using the tag "ted", with which I've tagged all TED talks I've bookmarked. I have 78 TED talks bookmarked. The bookmark entries for these posts have 280 distinct tags, 1024 words total in their bookmark title fields, and 1668 words total in the comment fields. The predicament is, do I look through 78 posts to find the one of interest, or do I instead look through the 280 tags?

Or is it? If we rephrase the question, we see I'm really asking, "Can I find what I'm looking for faster using 280 words, or 2692 words?" See, those 280 tag words actually represent a compression of the semantics (the meanings) of the 2692 descriptive words. I can quickly scan 280 tags to identify the closest to my concept, giving me a significantly more manageable subset of posts to scan in more detail.
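The arithmetic behind that rephrasing, spelled out with the numbers above (my counts, from my own bookmarks):

```python
# Word counts from my 78 bookmarked TED talks on Delicious:
tags = 280
title_words = 1024
comment_words = 1668

# Scanning the long way means reading every title and comment word.
descriptive_words = title_words + comment_words
print(descriptive_words)                # 2692 words the long way
print(float(descriptive_words) / tags)  # each tag compresses roughly 9.6 of them
```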

Tags seemed very straightforward and powerful before this, thanks in part to reading Clay Shirky's article on the power of tagging, but it took this moment to really understand the power behind them, much like the "A ha!" moment of seeing binary search when you've always thought of search as linear.

Two side notes:

  • I'd like to thank the developers of pydelicious for providing me the software to extract those statistics about my Delicious tags.
  • It turns out the video I was looking for had the clip of interest removed due to copyright permissions, and so the real answer to the question was to Google it. Still, it was worth it for the thought.