CreateBlastDB.cpp

Aggregating Multiple Databases

Introduction

Have you ever gotten tired of digging through NCBI's sparse yet convoluted tutorials on utilities such as makeblastdb and blastdb_aliastool? Worn out from merging multiple BLAST databases together, or from gathering all genes from a certain phylogenetic clade? Fret no more: I have run into all of these problems (and a few more) and decided to do something about it. This is my solution.

Features

Accepts multiple reference FASTA files, multiple files containing GI numbers, or both at the same time!
Given a list of NCBI taxonomy IDs, locates the last common ancestor of these taxa and grabs all sequences assigned to it (and, with --children, all sequences belonging to its child taxa).
Aggregates multiple pre-existing BLAST databases into a single alias database.
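
As a rough illustration of the taxonomy feature above, here is a minimal C++ sketch of finding a last common ancestor by walking parent links. It is not taken from CreateBlastDB's source: the `ParentMap` type, the function names, and the toy IDs in the note below are all hypothetical. The only assumption is a child-to-parent table with a single root that lists itself as its own parent (the convention NCBI's taxonomy dump uses for ID 1).

```cpp
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical representation: child taxonomy ID -> parent taxonomy ID.
// The root is assumed to be recorded as its own parent.
using ParentMap = std::unordered_map<int, int>;

// Path from `taxon` up to the root, starting with `taxon` itself.
std::vector<int> lineageToRoot(int taxon, const ParentMap& parent) {
    std::vector<int> lineage;
    int current = taxon;
    while (true) {
        lineage.push_back(current);
        auto it = parent.find(current);
        if (it == parent.end())
            throw std::invalid_argument("unknown taxonomy ID");
        if (lineage.size() > parent.size())
            throw std::runtime_error("cycle in taxonomy table");
        if (it->second == current) break;  // reached the root
        current = it->second;
    }
    return lineage;
}

// Deepest taxon that appears in the lineage of every input taxon.
int lastCommonAncestor(const std::vector<int>& taxa, const ParentMap& parent) {
    if (taxa.empty()) throw std::invalid_argument("no taxa given");
    // Candidates are ordered from deepest to shallowest.
    std::vector<int> candidates = lineageToRoot(taxa.front(), parent);
    for (std::size_t i = 1; i < taxa.size(); ++i) {
        std::vector<int> lineage = lineageToRoot(taxa[i], parent);
        std::unordered_set<int> seen(lineage.begin(), lineage.end());
        std::vector<int> kept;
        for (int c : candidates)
            if (seen.count(c)) kept.push_back(c);  // order is preserved
        candidates = std::move(kept);
    }
    if (candidates.empty()) throw std::runtime_error("taxa share no ancestor");
    return candidates.front();
}

// `taxon` plus every taxon below it (sibling order is unspecified).
std::vector<int> withDescendants(int taxon, const ParentMap& parent) {
    std::unordered_map<int, std::vector<int>> children;
    for (const auto& entry : parent)
        if (entry.first != entry.second)
            children[entry.second].push_back(entry.first);
    std::vector<int> result{taxon};
    for (std::size_t i = 0; i < result.size(); ++i) {  // breadth-first walk
        auto it = children.find(result[i]);
        if (it != children.end())
            result.insert(result.end(), it->second.begin(), it->second.end());
    }
    return result;
}
```

With a toy table where 3 and 4 are children of 2, and 2 and 5 are children of the root 1, `lastCommonAncestor({3, 4}, parent)` gives 2 and `lastCommonAncestor({3, 5}, parent)` gives 1. `withDescendants(2, parent)` contains 2, 3, and 4, which is roughly the set a flag like --children would need.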

Download

CreateBlastDB.zip

CreateBlastDB.tar.gz

View on GitHub

Dependencies

Boost - For command line parameters

Demonstration

Sometimes you have numerous genes you wish to use as a reference to BLAST against, which are divided amongst multiple FASTA files. Here's how to deal with that:

./CreateBlastDB \
-o myDB \
-r reference1.fasta \
-r reference2.fasta \
-r [...]

But say you also have some stored as GI numbers you ripped straight off NCBI:

./CreateBlastDB \
-o myDB \
-r reference1.fasta \
-r reference2.fasta \
-r [...] \
-g gilist1.txt \
-g gilist2.txt \
-g [...]

But oh crap, you also want to include all sequences in another clade:

./CreateBlastDB \
-o myDB \
-r reference1.fasta \
-r reference2.fasta \
-r [...] \
-g gilist1.txt \
-g gilist2.txt \
-g [...] \
-t taxalist.txt \
--children

And finally, you have a pre-existing BLAST database containing some mitochondrial genes:

./CreateBlastDB \
-o myDB \
-r reference1.fasta \
-r reference2.fasta \
-r [...] \
-g gilist1.txt \
-g gilist2.txt \
-g [...] \
-t taxalist.txt \
--children \
-d mitoGenesDB
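
For the curious, aggregating databases boils down to a small text file. The sketch below builds the contents of a BLAST alias file (`.nal` for nucleotide, `.pal` for protein) with its `TITLE` and `DBLIST` lines, which is the kind of file blastdb_aliastool writes. This is an illustration under assumptions, not CreateBlastDB's actual code: `makeAliasFileText` and the database name `refsDB` are hypothetical, and database names containing spaces are not handled.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Build the text of a BLAST alias file that points at existing databases.
// Hypothetical helper; names with spaces would need quoting, which is
// deliberately left out of this sketch.
std::string makeAliasFileText(const std::string& title,
                              const std::vector<std::string>& databases) {
    if (databases.empty())
        throw std::invalid_argument("an alias needs at least one database");
    std::ostringstream out;
    out << "TITLE " << title << "\n";
    out << "DBLIST";
    for (const auto& db : databases)
        out << ' ' << db;  // space-separated database base names
    out << "\n";
    return out.str();
}
```

Calling `makeAliasFileText("myDB", {"refsDB", "mitoGenesDB"})` and writing the returned text to `myDB.nal` alongside those databases should let BLAST search them together under the name myDB.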

Pretty brain-dead, I must admit. I designed it so that even if I forget how I wrote it, I can always depend on it to do every BLAST database task under the sun (because it's an absolute waste, if not a tragedy, to repeatedly look up how to create a BLAST database only to see that you've clicked the same links before). Hopefully this will save you valuable time that you could instead spend getting non-significant p-values or writing that manuscript your supervisor keeps delaying by making you perform more experiments.