CreateBlastDB.cpp
Aggregating Multiple Databases
Introduction
Have you ever gotten tired of looking up NCBI's sparse yet convoluted tutorials on utilities such as makeblastdb and blastdb_aliastool? Or worn yourself out merging multiple BLAST databases together? Or gathering all genes from a certain phylogenetic clade? Fret no more: I have run into all of these problems (and a few more) and decided to do something about it. This is my solution.
Features
- Accepts multiple reference FASTA files, multiple files containing GI numbers, or both at the same time.
- Given a list of NCBI taxonomy IDs, locates the last common ancestor of these taxa and grabs all sequences assigned to it (and, with --children, all sequences assigned to its child taxa).
- Aggregates multiple pre-existing BLAST databases into a single alias database.
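To make the "last common ancestor" feature concrete, here is a minimal sketch of one way to compute it. It is not necessarily how CreateBlastDB does it. It assumes the taxonomy has already been loaded into a child-to-parent map (the hypothetical `ParentMap` below) in which the root lists itself as its own parent, as taxon 1 does in NCBI's nodes.dmp.

```cpp
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical representation: child taxon ID -> parent taxon ID.
// Assumption: the root is its own parent.
using ParentMap = std::unordered_map<int, int>;

// Returns the lineage of `taxon`, starting at the taxon itself and ending at the root.
std::vector<int> pathToRoot(const ParentMap& parents, int taxon) {
    std::vector<int> path;
    while (true) {
        path.push_back(taxon);
        auto it = parents.find(taxon);
        if (it == parents.end()) throw std::invalid_argument("unknown taxon ID");
        if (it->second == taxon) break;  // reached the root
        if (path.size() > parents.size()) throw std::runtime_error("cycle in taxonomy");
        taxon = it->second;
    }
    return path;
}

// Returns the deepest taxon that is an ancestor of (or equal to) every taxon in `taxa`.
int lastCommonAncestor(const ParentMap& parents, const std::vector<int>& taxa) {
    if (taxa.empty()) throw std::invalid_argument("no taxa given");
    // Candidates are ordered from the first taxon up to the root.
    std::vector<int> candidates = pathToRoot(parents, taxa.front());
    for (std::size_t i = 1; i < taxa.size(); ++i) {
        std::vector<int> path = pathToRoot(parents, taxa[i]);
        std::unordered_set<int> onPath(path.begin(), path.end());
        // Keep only the candidates that also lie on this taxon's lineage.
        std::vector<int> kept;
        for (int c : candidates) {
            if (onPath.count(c)) kept.push_back(c);
        }
        candidates = std::move(kept);
    }
    // The root is on every lineage, so at least one candidate survives;
    // the first survivor is the deepest shared ancestor.
    return candidates.front();
}
```

Intersecting whole lineages is simple rather than fast, which is usually fine for the handful of taxonomy IDs one would put in a taxa list.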
Dependencies
- Boost: for parsing command-line parameters
Demonstration
Sometimes you have numerous genes you wish to use as a reference to BLAST against, which are divided amongst multiple FASTA files. Here's how to deal with that:
    ./CreateBlastDB \
        -o myDB \
        -r reference1.fasta \
        -r reference2.fasta \
        -r [...]
But say you also have some stored as GI numbers you ripped straight off NCBI:
    ./CreateBlastDB \
        -o myDB \
        -r reference1.fasta \
        -r reference2.fasta \
        -r [...] \
        -g gilist1.txt \
        -g gilist2.txt \
        -g [...]
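When several GI lists are passed, the same GI can easily show up in more than one file. The sketch below shows one way to pool them into a single de-duplicated set. It assumes the usual plain-text layout of one GI number per line; `readGiList` is a hypothetical helper, not a function taken from CreateBlastDB.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Reads GI numbers from `in` (assumed format: one integer per line)
// and adds them to `gis`. Blank or non-numeric lines are skipped.
void readGiList(std::istream& in, std::set<long long>& gis) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        long long gi;
        if (fields >> gi) {
            gis.insert(gi);  // std::set silently ignores duplicates
        }
    }
}
```

Calling `readGiList` once per file with a `std::ifstream` (for example on gilist1.txt and then gilist2.txt) and the same `std::set` leaves the union of all lists in that set.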
But oh crap, you also want to include all sequences from another clade:
    ./CreateBlastDB \
        -o myDB \
        -r reference1.fasta \
        -r reference2.fasta \
        -r [...] \
        -g gilist1.txt \
        -g gilist2.txt \
        -g [...] \
        -t taxalist.txt \
        --children
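For the curious, what --children implies can be sketched as a subtree walk: starting from the common ancestor, visit every taxon below it. This is an illustration under assumptions rather than the tool's actual code. It again assumes a child-to-parent map (the hypothetical `ParentMap`) with a self-parented root, and `collectSubtree` is a made-up name.

```cpp
#include <set>
#include <unordered_map>
#include <vector>

// Hypothetical representation: child taxon ID -> parent taxon ID.
// Assumption: the root is its own parent.
using ParentMap = std::unordered_map<int, int>;

// Returns `ancestor` together with every taxon that descends from it.
std::set<int> collectSubtree(const ParentMap& parents, int ancestor) {
    // Invert child -> parent into parent -> list of children.
    std::unordered_map<int, std::vector<int>> children;
    for (const auto& entry : parents) {
        if (entry.first != entry.second) {  // skip the root's self-link
            children[entry.second].push_back(entry.first);
        }
    }

    // Iterative depth-first walk from the ancestor downwards.
    std::set<int> subtree;
    std::vector<int> stack{ancestor};
    while (!stack.empty()) {
        int taxon = stack.back();
        stack.pop_back();
        if (!subtree.insert(taxon).second) continue;  // already visited
        auto it = children.find(taxon);
        if (it == children.end()) continue;           // leaf taxon
        for (int child : it->second) stack.push_back(child);
    }
    return subtree;
}
```

The resulting set of taxonomy IDs is what you would then match sequences against.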
And finally, you have a pre-existing BLAST database containing some mitochondrial genes:
    ./CreateBlastDB \
        -o myDB \
        -r reference1.fasta \
        -r reference2.fasta \
        -r [...] \
        -g gilist1.txt \
        -g gilist2.txt \
        -g [...] \
        -t taxalist.txt \
        --children \
        -d mitoGenesDB
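Aggregating databases does not copy any sequences: a BLAST alias database is a small text file that points at the member databases. As far as I understand the format, it boils down to a TITLE line and a DBLIST line. The sketch below builds that text; `makeAliasFile` and the member name `refs` are hypothetical, and CreateBlastDB may well call blastdb_aliastool rather than write the file itself.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Builds the text of a minimal BLAST alias file (.nal for nucleotide,
// .pal for protein databases). Assumption: database names contain no spaces.
std::string makeAliasFile(const std::string& title,
                          const std::vector<std::string>& databases) {
    std::ostringstream out;
    out << "TITLE " << title << "\n";
    out << "DBLIST";
    for (const std::string& db : databases) {
        out << " " << db;  // member databases are space-separated
    }
    out << "\n";
    return out.str();
}
```

For example, `makeAliasFile("myDB", {"refs", "mitoGenesDB"})` yields the two lines `TITLE myDB` and `DBLIST refs mitoGenesDB`, which would be saved next to the member databases as myDB.nal.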
Pretty brain-dead, I must admit. I designed it so that even if I forget how I wrote it, I can always depend on it to handle every BLAST database task under the sun (because it's an absolute waste, and a tragedy, to repeatedly look up how to create a BLAST database only to see that you've clicked the same links before). Hopefully, this will save you valuable time that you could instead spend getting non-significant p-values or writing that manuscript your supervisor keeps delaying by making you perform more experiments.