Wednesday, April 1, 2015

Annotating a de novo transcriptome assembly

Here's how I'm annotating a de novo transcriptome assembly:

This post is kind of halfway between a how-to and a lab notebook entry.

I'm on a Linux box with 12 cores (24 if you count hyperthreading), 48 GB of RAM, running Xubuntu Desktop 13.10.

I'm basically just using the Trinity pipeline. I think it's turning into a really nice set of tools, both for assembly and for annotation (because Trinotate is so new and apparently under heavy active development, the information in this guide may not stay current for very long). I'm also using Blast2GO, which has been around longer and is very widely used.

I'm trying to annotate 5 transcriptomes simultaneously, so as I'm going, I'm writing little shell scripts to apply each step to all of the transcriptomes at once (sadly, I had a hard-drive crash and lost those scripts... oh well). At some point I might consolidate them into a more fully automated pipeline, but I'm not going to worry about that for now.

Needed software:
Trinotate (and all of its dependencies which are listed on its site)
GNU-parallel, to easily run things in parallel from the command line

Install local Blast2GO database:
  • follow the instructions here
    • One issue I ran into was that it gave me the error "ERROR 1148 (42000) at line 1: The used command is not allowed with this MySQL version"
      which I solved by adding the option "--local-infile=1" when calling mysql
    • Loading the data into the database takes forever. Don't try to cut corners by loading all the data at once; it appears you need to load it in order to avoid errors :-(.
    • Use the official Oracle Java 7, not the one in the standard Ubuntu repository. I followed the guide here (the section Installing Oracle JDK).

Extract protein sequences:
  • To extract protein sequences you have a few options:
    • Pick the longest open reading frame, for example using Biopython following the example in the Biopython cookbook (although you'll probably also want to ignore everything before the first M). In most cases the longest open reading frame in a transcript is the correct one, but not always. Also, sometimes mis-assembly will introduce premature stop codons or frameshifts, and an incomplete assembly may cause the start codon to not be present. To deal with the possibility of a missing start codon you could just use everything before the first stop codon as the coding sequence (which will give you long, incorrect protein sequences in cases where the assembled sequence includes the 5' UTR), or you could use the first M in all cases (which will give you greatly truncated protein sequences in cases where the start codon did not get assembled).
    • ESTScan: An old, but still very popular program. It picks the reading frames that are most enriched for common codons (you need to supply it with a matrix file for your species of interest or a close relative). It can even compensate for assembly mistakes that result in internal frame shifts. And it doesn't rely on the presence of a start or stop codon. I've used this and I find it to work quite well. It would be interesting to see a comparison between ESTScan and TransDecoder.
    • TransDecoder is another utility for extracting protein sequences from mRNA sequences. As far as I know there is not yet an associated publication describing how exactly it works (of course, it is open source so there's nothing to stop me or anyone else from just examining the source code, but I have not bothered to do that). This is the tool I use now, primarily because it plays nicely with the Trinotate workflow. And because, unlike ESTScan, it is being actively maintained.
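For illustration, here's a minimal sketch of the longest-ORF approach in pure Python (this is my own toy version, not what TransDecoder or ESTScan do: no start-codon rescue, no frameshift handling, and it returns the ORF as nucleotides — translating it to protein would additionally need a codon table or Biopython's Seq.translate):

```python
import re

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def longest_orf(seq):
    """Return the longest ATG..stop open reading frame (as nucleotides,
    stop codon excluded) found in any of the six frames of `seq`."""
    best = ""
    for strand in (seq, revcomp(seq)):
        for frame in range(3):
            # split the frame into codons and scan for ATG..stop runs
            codons = re.findall("...", strand[frame:])
            start = None
            for i, codon in enumerate(codons):
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in ("TAA", "TAG", "TGA"):
                    orf = "".join(codons[start:i])
                    if len(orf) > len(best):
                        best = orf
                    start = None  # look for the next ORF in this frame
    return best
```

Note that, exactly as described above, this silently returns nothing for a transcript whose start codon didn't get assembled — which is one reason I prefer the dedicated tools.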
Blast against SwissProt:
  • A lot of people (apparently) do Blastx of the assembled contigs against the NCBI NR protein database. But that takes forever (NR is huge, and Blastx runs 6 searches for every contig, one for each of the three reading frames on each strand). Also, most of the sequences in NR are uncharacterized, so their annotations come from homology to better-annotated proteins. So when you annotate based on homology to NR, you're getting second- or third-hand annotations (Blast2GO compensates for this somewhat by selecting annotations not only based on how good the hit is, but also based on the evidence code for the annotation on the hit). SwissProt is a manually curated database that is enriched for proteins with direct annotations.
    By extracting protein sequences first, and then doing Blastp against Swissprot, you save a lot of time and computing resources without (I think, but have no proof for) decreasing the quality of your annotation by much.
  • To run a BLAST search, you need to install local BLAST+ as well as a database (in my case Swissprot). 
    • Ubuntu has BLAST available from the package manager, so all you need to do is "sudo apt-get install ncbi-blast+". 
    • Complete Windows install instructions here
    • Regardless of operating system, you can also download it directly from NCBI
    • It's convenient to also put it in your shell's executable search PATH.
    • It's also nice to set an environment variable to give BLAST a default database location, so you don't have to give full paths to commonly used databases.
      • On Ubuntu I put "export BLASTDB=/media/langelab/data/databases/blast/" at the bottom of .bashrc in my home folder.
  • Trinotate output lists Blastx hits as well as Blastp hits against SwissProt, so I'll do them both; it will be interesting to see if the top hits are the same. Contrary to the Trinotate requirement, I'm using XML output for my BLAST searches because Blast2GO likes to have input in XML format. I'm also taking the top 20 hits instead of the top 1 (once again, for Blast2GO; also, this is a habit I picked up from when I used NR as my target database, where oftentimes the number 1 hit is something uninformative like "Putative uncharacterized protein").
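Concretely, the two searches look something like the following (sketched here as Python argument lists you could hand to subprocess; the input file names, E-value cutoff, and thread count are my assumptions for illustration, not requirements of Trinotate or Blast2GO):

```python
# Sketch of the two SwissProt searches: XML output, top 20 hits.
def blast_cmd(program, query, db="swissprot", hits=20):
    """Build a BLAST+ argument list for subprocess.run()."""
    return [
        program, "-query", query, "-db", db,
        "-outfmt", "5",                 # 5 = BLAST XML (what Blast2GO wants)
        "-max_target_seqs", str(hits),  # top 20 instead of top 1
        "-evalue", "1e-5",              # an arbitrary but common cutoff
        "-num_threads", "12",
        "-out", query + "." + program + ".xml",
    ]

blastx = blast_cmd("blastx", "Trinity.fasta")                   # contigs
blastp = blast_cmd("blastp", "Trinity.fasta.transdecoder.pep")  # proteins
# e.g. subprocess.run(blastp, check=True)
```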
  • Trinotate has the rather esoteric requirement that Blast hits be in the "outfmt6" format, which has the following columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    • I usually output as xml because it's more useful, so I wrote a biopython script (available in srj_chembiolib) to grab the top hit for each sequence from the xml file and save it in outfmt6 format.
      • percent identity is not reported in the xml file, and must be calculated. The formula is: 100 * identities / alignment length
      • mismatches is not reported. The formula is: align length - (identities + gaps)
      • gap openings is not reported: use a regular expression to count the number of times the pattern /[^-]-/ occurs in each sequence of the alignment (in English: a dash preceded by a non-dash; gaps won't ever occur at the beginning of an alignment, but if they could, you'd have to use the pattern /(^|[^-])-/ or something like that)
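Putting those three derivations together, here's a sketch assuming you have an HSP's aligned query and subject strings out of the XML (with dashes marking gaps; identities and gap counts are derived from the strings here for self-containedness, though the XML also reports them):

```python
import re

def outfmt6_derived_fields(qseq, sseq):
    """Derive the outfmt6 columns that BLAST XML omits, from the aligned
    query and subject strings of one HSP. Returns (pident, mismatch, gapopen)."""
    length = len(qseq)  # alignment length, including gap columns
    identities = sum(q == s and q != "-" for q, s in zip(qseq, sseq))
    gaps = qseq.count("-") + sseq.count("-")
    pident = 100.0 * identities / length
    mismatch = length - (identities + gaps)
    # gap openings: a dash not preceded by another dash, in either sequence
    pattern = r"(?:^|[^-])-"
    gapopen = len(re.findall(pattern, qseq)) + len(re.findall(pattern, sseq))
    return pident, mismatch, gapopen
```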
HMMER and Pfam
  • Download HMMER right from Janelia's website; it's easy enough to install from source, just follow the instructions in the README
    • On Windows, I've used the pre-built binaries, which worked well (although they are somewhat out of date)
  • Ran hmmscan of the peptide sequences against Pfam-A as described in the Trinotate guide, without any problem.
SignalP, RNAmmer, and TMHMM
  • I'm just following the Trinotate guide here, and noting any potentially confusing aspects I notice.
  • For SignalP, Trinotate says to change $MAX_ALLOWED_ENTRIES; you should also change $ENV{SIGNALP}.
  • For RNAmmer, you need to change the $INSTALL_PATH variable in the file rnammer. Also remember to change the $HMMSEARCH_BINARY path to wherever you have HMMER 2 installed. I had to build HMMER 2.3.2 with the option "--enable-threads" when I ran ./configure. I also had to install the XML::Simple library from CPAN: "sudo cpan install XML::Simple".
  • For TMHMM I just changed the shebang line, and that was it.
Generating the Trinotate SQLite database and getting the output
  • I had to install DBD::SQLite: "sudo cpan DBD::SQLite"
  • I also had to install sqlite3 "sudo apt-get install sqlite3"
  • I had a problem with the outfmt6 loader because it splits input files on whitespace, whereas I think it would be more robust to split input on tabs (why disallow sequence identifiers with internal spaces?). It turns out my original script for converting XML to outfmt6 was giving the full name of the sequence, whereas outfmt6 only gives what's before the first whitespace.
  • Another problem I ran into was the name formatting in the SwissProt database I'm using (downloaded from the NCBI preformatted BLAST databases). I submitted a patch to Trinotate to solve this, hopefully they accept it (edit: nope, they didn't want it. Based on their response, I'm not really sure they even read my whole e-mail. Oh well, at least I tried to be helpful).
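The fix on my end for the whitespace issue was simply truncating each identifier at the first whitespace before writing the outfmt6 line, something like:

```python
def outfmt6_id(full_name):
    """outfmt6 identifiers are everything before the first whitespace.
    (The FASTA header shown in the test is a made-up example.)"""
    return full_name.split(None, 1)[0]
```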
Running Blast2GO
  • There used to be a command line version of Blast2GO, called b2g4pipe, but it appears they've now gotten rid of that and replaced it with a piece of commercial software. (Hopefully they haven't gotten rid of it, but just made it difficult to find, if anyone knows where it is and if it's being updated, let me know!)
    • This is most disturbing.
  • Once the local database is installed, it's pretty much just a matter of clicking the buttons on Blast2GO in the right order.
    • Load fasta file
    • Load BLASTx hits against SwissProt.
    • Make sure it's set to use your local database
      • Tools -> General Settings -> DataAccess Settings -> Own Database -> fill in the proper info
    • Mapping -> Run GO-Mapping Step -> (default settings?)
    • Annotation -> Run Annotation Step -> (default settings?)
Making a cheap ripoff of Blast2GO that runs quicker and can be accessed from the command line
  • I'm not thrilled about having to use a GUI to run my analyses.
    • I want the functionality of Blast2GO in a command line application, but I don't want to pay for it.
  • Luckily they published their rationale and methods.
  • This should be relatively straight-forward to do. But I'm out of time for now, maybe next time!
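As a proof of concept, the core of such a tool is just transferring GO terms from BLAST hits through an accession-to-GO mapping. Here's a deliberately naive sketch (the mapping dict, the flat E-value cutoff, and the simple union of terms are my placeholders; Blast2GO's published method additionally weights hits by similarity and by annotation evidence codes):

```python
def transfer_go(hits, acc2go, max_evalue=1e-6):
    """Naive GO transfer: annotate each query with the union of GO terms
    attached to its sufficiently good SwissProt hits.

    hits:   {query_id: [(subject_accession, evalue), ...]}
    acc2go: {subject_accession: [GO term, ...]}
    """
    annotations = {}
    for query, hit_list in hits.items():
        terms = set()
        for accession, evalue in hit_list:
            if evalue <= max_evalue:
                terms.update(acc2go.get(accession, ()))
        annotations[query] = sorted(terms)
    return annotations
```

A real replacement would stream the BLAST XML, pull the accession-to-GO table from the same database Blast2GO installs locally, and score terms rather than taking a plain union — but that's the skeleton.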

