This blog post is based on a previous entry of the same title I posted to FriendFeed. This post provides an extended explanation of what we're trying to accomplish.
I'm working with Prof. John Jelesko on a project for one of my courses in which he's investigating metabolic pathways in plants. At the heart of it, we need to set up a local database for running FASTA homology searches. The Jelesko lab wants this database to contain every amino acid sequence predicted in every currently available whole genome (assembled and annotated) available at NCBI, prokaryotic and eukaryotic. [Edit: We don't need every sequenced genome, actually, we only need a representative genome per organism. I hadn't previously considered that there may be more than one genome per organism. Thanks to Brad Chapman for pointing out the need for clarification.]
We have sequences from locations other than NCBI which we need to include in the FASTA search space; hence, we can't just run FASTA searches over NCBI data, which EBI's FASTA search might be able to otherwise do. This necessitates a local database. The Jelesko lab also needs the nucleotide sequence corresponding to the amino acid sequence, as well as the intron/exon locations for the longest available splicing. The questions are: is it feasible to store this amount of data in a database (we'll be using MySQL), and if so, how do we go about getting this data?
We're naïvely assuming it is feasible, so I'm attempting to figure out how to get at this data. The one file format that seems to store all information that we need in one place is the GenBank (GBK) format:
- a gene ID
- taxonomic classification of the organism from which the gene came
- start and stop positions for each exon
- the translated amino acid sequence
It seems that in one shape or another, these GenBank format files are available from NCBI's FTP site. While the GBK files for the prokaryotic genomes are relatively easy to get in one fell swoop at ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.gbk.tar.gz. For good ol' eukaryotic genomes, however, the data is all over the place. Sometimes it's stored as gzipped files in CHR folders, while other times, the files aren't compressed, and still other times, the directory is really just a container for directories that have the genome data. In short, it's a mess, especially when we consider we want to automate the retrieval of this data, not to mention want to update it periodically, should NCBI deposit new data.
There's also the dilemma of not actually needing most of the data (the genome sequence) contained in the GBK files—we just need the sequence covering start to stop for translation, including intronic sequence for the mRNA. I can write a hack of a Python script to trudge through the FTP directories and yank any GBK (compressed or otherwise) to local disk, but it seems like a big waste of bandwidth and local disk space. It seems like there must be better ways [Doesn't it always?], but I don't have the knowledge of NCBI's services to identify what these might be. If you have any ideas, please share! Meanwhile, I think I'll try contacting NCBI and see if they might point me in the right direction. I'll report back on what we decide to use, which could be my FTP hack given our limited time for this project.
Update: I've received suggestions on the FriendFeed entry for this blog post worth checking out.