Breath Mints for Penguins: Mira4 assembly of 454 reads from SRA

I want to make an assembly of the Annona squamosa fruit transcriptome data from this paper (http://dx.doi.org/10.1186/s12864-015-1248-3). They give in the paper a link to a web resource (http://www.annonatranscriptome.nabi.res.in/), but the resource appears to now be defunct, so to get contigs reads, I will have to assemble the reads myself. The reads are from two different cultivars of Annona squamosa, so I'm going to assemble each cultivar separately first, and then if that works, I'll try a combined assembly.

MIRA is a nice, free, software package that can assemble 454 data. I've had success with it before, so that's what I'll use for this project too.

I'm on a 64bit Ubuntu system.

Software that needs to be downloaded:
Mira version 4, and the third-party tools (mira_3rdparty): https://sourceforge.net/projects/mira-assembler/files/
NCBI SRA toolkit (https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/)
EMBOSS (for seqret) (http://emboss.sourceforge.net/)
NCBI local Blast

Data files that need to be downloaded:
The 454 data should be downloaded from NCBI in the sff format. The sff format is much larger than the fastq format, but it contains the clipping information which is vital to a good assembly. Also the sequence names are in a format that MIRA can accept. MIRA will throw an error if an assembly is tried with fastq files obtained directly from SRA.

The reads can all be found under SRP042646 (https://www.ncbi.nlm.nih.gov/sra/?term=SRP042646), notice that there are 8 runs total, 4 for each cultivar. The different runs for each cultivar represent developmental stages, but I'll assemble all developmental stages together all at once.

fastq format and clipping information can be extracted from the sff file using sff_extract from the mira third party tools.

Here is a bash script to download the data and extract the clipping information and fastq files for input into MIRA:

for i in SRR1335735 SRR1335740 SRR1335741 SRR1335738 SRR1335736 SRR1321143 SRR1335737 SRR1335739
do
/home/sean/Desktop/sratoolkit.2.8.2-1-ubuntu64/bin/sff-dump $i
/home/sean/Desktop/mira/mira_third_party/sff_extract ${i}.sff
done

The assembly:

Once you have the data files prepared, create the manifest file to set MIRA parameters:

Here is the manifest file I used for the NMK1 assembly:

# A manifest file can contain comment lines, these start with the #-character

# First part of a manifest: defining some basic things

project = NMK1
job = est,denovo,accurate
parameters = -GE:not=4

# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups", for more information
# please consult the MIRA manual, chapter "Reference"

readgroup = fruit
technology = 454
data = reads/SRR1335737.fastq reads/SRR1335736.fastq reads/SRR1335735.fastq reads/SRR1321143.fastq
data = reads/SRR1335737.xml reads/SRR1335736.xml reads/SRR1335735.xml reads/SRR1321143.xml

Notice that you have to provide the fastq and the xml files separately as datafiles. MIRA won't know to look for the xml files unless you tell it about them. The parameter -GE:not=4 tells MIRA to use 4 threads.

Follow up:
Assembly of NMK1 with just these parameters looked pretty good, there were 62,136 contigs, which is reasonable. I'd like to get a coverage table for the contigs so I can have some idea of relative expression. A nice summary table can be found in the [name]_d_info/[name]_info_contigstats.txt file.

It seemed like maybe there were some unclipped adapter sequences, because there were a few very long contigs (2 of 10 kb), and there MIRA reported 7 "strong unresolved repeats", although no megahubs. Apart from adapter sequences, the presence of rRNA may be contaminating the library

To look for adapter sequences, I merged all of the reads into one file, and created a blast database out of it to search for potential adapters:

cat *.fastq > merged.fastq
seqret -sequence merged.fastq -outseq merged.fasta
makeblastdb -input_type fasta -dbtype nucl -in merged.fasta -out reads

First I checked for the adapters mentioned in the MIRA manual (http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_faq_adaptors_and_pairedend_linker_sequences)

I made a fasta file with the following contents:
>AdaptorA
CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG
>AdaptorB
CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG

>5prime454adaptor
GCCTCCCTCGCGCCATCAGATCGTAGGCACCTGAAA
>3prime454adaptor
GCCTTGCCAGCCCGCTCAGATTGATGGTGCCTACAG

>flxlinker
GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC

>titlinker1
TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
>titlinker2
CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA

And blasted it against the reads database:

blastn -query potential_adapters.txt -db reads -max_target_seqs 100000 -outfmt 6 -out adapter_search.txt -word_size 6 -num_threads 4

This gives less than 10 hits for each of the adapter sequences except for Adaptor B, which has about 12,000 hits (about 0.65 % of all reads). But only a segment of it. This suggests that perhaps this is the adapter sequence that was used. But maybe there is more to it. We should extract the reads that contain this sequence and compare them to each other to see if there is a longer consensus sequence that should be removed.

The fastq files extracted from the sff file contain upper and lowercase letters. The lowercase letters are adapter sequence and low quality sequence that should not be included in the assembly. When fastq files are downloaded directly from genbank, these sequences have already been clipped off. The BLAST above was based on the SRA fastq files, and demonstrates that there are still some un-clipped adapter sequences.

Breath Mints for Penguins

Saturday, April 29, 2017

Mira4 assembly of 454 reads from SRA

No comments:

Post a Comment

Labels

About Me