AnnotateContigs.py

Annotating de novo Generated Contigs

Introduction

A bioinformatics focused script which does what it says on the box: annotates contigs. BLAST your contigs against a subset of NCBI's database and it'll stitch contigs together, effectively reconstructing the original DNA/RNA sequence.

Features

4 non-mutually exclusive output modes:
- Alignments: Outputs every contig-reference pairwise alignment in one convenient file for each reference gene.
- Consensus: Attempts to merge the aligned contigs into one meaningful consensus sequence.
- Spliced Consensus: In addition to creating a single consensus sequence, it splices said gene given exon-intron barriers (Note: requires internet access).
- (Statistical) Data Provides desirable statistics about the annotation and assembly process, such as % non-coding/coding, breadth of coverage, and % sequence identity.
Multithreading capability
Even without an explicit set of reference gene, can still attempt annotation of sequences by directly querying NCBI for "sequence hit" metadata (Note: requires BLAST XML file to be generated by aligning to a subset or entire NCBI nt database)

Download

AnnotateContigs.zip

AnnotateContigs.tar.gz

View on GitHub

Dependencies

BioPython - For bioinformatical work
xlsxwriter - For creating Excel files

Demonstration

As my position as bioinformatician for University of Toronto's Chang Lab, I was tasked with juxtaposing two widely used techniques in genome/transcriptome sequencing: guided and de novo assembly. This script was designed for the latter. Let's cut to the gist; use this if utilizing a set of reference sequences (Note: only FASTA format currently supported):

python AnnotateContigs.py \
-n 10 \
-r <reference.fasta> \
-i 0.7 \
-s 0.7 \
<contigs.fasta> \
acds

This rejects sequences with sequence identity/similarity less than 70% when compared to those found in reference.fasta (for DNA/RNA, identity and similarity are the same) while outputting sequence alignments, consenses, spliced consenses, and statistical data to a default folder dubbed AnnotateContigs/, all while utilizing a total of 10 threads. If instead of providing a reference database, one can use NCBI's nucleotide database as follows:

python AnnotateContigs.py \
-n 10 \
-i 0.7 \
-s 0.7 \
<contigs.fasta> \
acds

This alternative usage may take a while longer as the program queries NCBI via the internet for each unique sequence encountered in our set of alignments (thus bandwidth can be a concern). Additionally NCBI limits their server access to 3 times a second, which may sound fast to us in human time, but is arduously slow for computers.

In summary, the contigs will be merged based on the their alignments. This effectively "annotates" the contigs, but allows for the possibiity of polymorphism if there are two contigs that differ in a few sites. Also, if probe tiling was not completely successful due to more variable regions, this program will fill the unaligned regions with gap characters - essentially briding two contigs that would otherwise be separated due to no overlap.

I must however note for my own conscious's sake: the algorithm here is not foolproof. During its creation, I was naive and eliminated any indels that arose via BLAST alignments. The statistical calculations will be correct, but the consenses *might* be wrong. I duly apologize for this and was working on a more robust algorithm with a proper statistical basis which plagues some of the problems of using this program. Unfortunately, I ran out of time - it's in an intermediate state that requires polish. This new attempt can be seen here: Reciprocal Best Matching Algorithm. However for similar species, this program as is should be sufficient for biological inference (you'd need an insertion/deletion of a triplet to really trip this up).