CreateBlastDB.cpp

Aggregating Multiple Databases

CreateBlastDB Workflow

Introduction

   Have you ever gotten tired of looking up NCBI's sparse yet convoluted tutorials on utilities such as makeblastdb and blastdb_aliastool? How about the fatigue of merging multiple BLAST databases together, or of gathering all genes from a certain phylogenetic clade? Fret no more: I have run into all of these problems (and a few more) and decided to do something about it. This is my solution.

Features

  • Accepts multiple reference FASTA files, multiple files containing GI numbers, or both at the same time!
  • Given a list of NCBI taxonomy IDs, locates the last common ancestor of these taxa and grabs all sequences related to it (and, optionally, all sequences pertaining to its child taxa).
  • Aggregates multiple pre-existing BLAST databases into a single alias database.
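
   To make the taxonomy feature concrete, here is a minimal sketch of how a last common ancestor and its child taxa could be found. It is not taken from CreateBlastDB.cpp: the ParentMap type and the functions lineage, lastCommonAncestor, and descendants are hypothetical names, and the sketch assumes the taxonomy has already been loaded into memory as a child-to-parent map with a root that is its own parent and no cycles.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical in-memory taxonomy: taxon ID -> parent taxon ID.
// Assumption: the root node lists itself as its parent.
using ParentMap = std::unordered_map<int, int>;

// Path from `taxon` up to the root, inclusive (leaf first, root last).
std::vector<int> lineage(const ParentMap& parents, int taxon) {
    std::vector<int> path;
    while (true) {
        path.push_back(taxon);
        auto it = parents.find(taxon);
        if (it == parents.end() || it->second == taxon) break;  // unknown ID or root
        taxon = it->second;
    }
    return path;
}

// Deepest taxon shared by the lineages of all `taxa`; -1 if there is none.
int lastCommonAncestor(const ParentMap& parents, const std::vector<int>& taxa) {
    if (taxa.empty()) return -1;
    std::vector<int> shared = lineage(parents, taxa[0]);
    for (std::size_t i = 1; i < taxa.size(); ++i) {
        std::vector<int> other = lineage(parents, taxa[i]);
        std::unordered_set<int> otherSet(other.begin(), other.end());
        std::vector<int> kept;
        for (int t : shared) {
            if (otherSet.count(t)) kept.push_back(t);  // keeps leaf-to-root order
        }
        shared = kept;
    }
    return shared.empty() ? -1 : shared.front();
}

// Every taxon below `ancestor` (what a --children style option would add).
// The order of the result is unspecified.
std::vector<int> descendants(const ParentMap& parents, int ancestor) {
    std::unordered_map<int, std::vector<int>> children;
    for (const auto& entry : parents) {
        if (entry.first != entry.second) children[entry.second].push_back(entry.first);
    }
    std::vector<int> result;
    std::vector<int> stack{ancestor};
    while (!stack.empty()) {
        int node = stack.back();
        stack.pop_back();
        auto it = children.find(node);
        if (it == children.end()) continue;  // leaf taxon
        for (int child : it->second) {
            result.push_back(child);
            stack.push_back(child);
        }
    }
    return result;
}
```

   With a toy tree in which taxa 3 and 4 share parent 2, and 2 and 5 share the root 1, lastCommonAncestor returns 2 for {3, 4} and 1 for {3, 5}. The sequences would then be gathered for that ancestor and, if child taxa are requested, for everything descendants returns.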

Dependencies

  • Boost - For command line parameters

Demonstration

   Sometimes you have numerous genes you wish to use as a reference to BLAST against, which are divided amongst multiple FASTA files. Here's how to deal with that:

  ./CreateBlastDB \
    -o myDB \
    -r reference1.fasta \
    -r reference2.fasta \
    -r [...]

   But say you also have some stored as GI numbers you ripped straight off NCBI:

  ./CreateBlastDB \
    -o myDB \
    -r reference1.fasta \
    -r reference2.fasta \
    -r [...] \
    -g gilist1.txt \
    -g gilist2.txt \
    -g [...]

   But oh crap, you also want to include all sequences in another clade:

  ./CreateBlastDB \
    -o myDB \
    -r reference1.fasta \
    -r reference2.fasta \
    -r [...] \
    -g gilist1.txt \
    -g gilist2.txt \
    -g [...] \
    -t taxalist.txt \
    --children

   And finally, you have a pre-existing BLAST database containing some mitochondrial genes:

  ./CreateBlastDB \
    -o myDB \
    -r reference1.fasta \
    -r reference2.fasta \
    -r [...] \
    -g gilist1.txt \
    -g gilist2.txt \
    -g [...] \
    -t taxalist.txt \
    --children \
    -d mitoGenesDB

   Pretty brain-dead, I must admit. I designed it so that even if I forget how I wrote it, I can always depend on it to handle every BLAST database task under the sun (because it's an absolute waste, if not a tragedy, to repeatedly look up how to create a BLAST database only to see that you've clicked the same links before). Hopefully, this will save you valuable time that you could instead spend getting non-significant p-values or writing that manuscript your supervisor keeps delaying by making you perform more experiments.
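
   For the curious, the alias database produced by the -d aggregation step is, in BLAST+, a small text file that names its member databases with TITLE and DBLIST lines. The sketch below shows one way such text could be generated. It is an illustration under assumptions, not the code in CreateBlastDB.cpp: makeAliasFileText is a hypothetical helper, the database name refDB is made up, and the real tool may well delegate this step to blastdb_aliastool instead.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: builds the text of a BLAST alias file that points at
// several existing databases. Assumption: database names contain no spaces,
// since entries on the DBLIST line are space-separated.
std::string makeAliasFileText(const std::string& title,
                              const std::vector<std::string>& dbNames) {
    std::ostringstream out;
    out << "TITLE " << title << "\n";
    out << "DBLIST";
    for (const std::string& name : dbNames) {
        out << " " << name;
    }
    out << "\n";
    return out.str();
}
```

   Calling makeAliasFileText("myDB", {"refDB", "mitoGenesDB"}) yields a TITLE line followed by "DBLIST refDB mitoGenesDB". Saved with the alias extension that matches the database type (.nal for nucleotide, .pal for protein), a file like this lets BLAST search the member databases as if they were one.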