Sunday, June 12, 2016

Assembling the Mentha Longifolia genome

A group of researchers, including myself, recently made a draft genome assembly for Mentha longifolia. The publication for which is currently under review. This post is my notes for my part of that effort. Basically that means that this is my lab notebook for part of the summer of 2014. It's pretty disorganized and may or may not be useful to anyone ever....

Long story short:
I was trying to figure out how to integrate PacBio reads with Illumina reads.
In the end, I used ECtools to correct the PacBio reads, and then used PBJelly to merge the PacBio reads with an assembly generated from the Illumina reads. It is my understanding that there may be better options for PacBio error correction today, but at the time ECtools seemed to be the best.

Read on for the juicy details....

M. longifolia has been proposed as a model species for mint genetics due to its relatively small, diploid genome. Here I'm documenting my attempts to generate a reasonable assembly of the genome from two datasets, a set of paired end illumina reads, and a set of PacBio reads.

Before I get to far, I want to know a bit about the data. What is the quality of the reads, and what are the read length distributions. Based on the answers to these questions, I will have to cull and trim the data to get rid of low quality reads and parts of reads.

I have the following files:
filtered_subreads.fasta   The PacBio reads in fasta format. They've already been adapter trimmed and run through a 0.75 quality threshold.
filtered_subreads.fastq   The same reads, but with phred-33 quality scores
mentha_plastome_2.27.12.fas An assembly of the chloroplast genome.
Mint_allcontigs_052311.fa An assembly of the genome based on illumina data (a velvet assembly)
Mlong_all_contigs.fa A variant of the genome assembly based on illumina data (a velvet assembly)
s_5_seqsMerged.fastq The illumina reads (about 135 million 101 bp reads, paired end) Appears to be Illumina 1.5+ Phred-64 scoring (there are "B" but not "@" characters)
s_5_seqsMerged.fastq.gz A zip of the illumina reads

Make a histogram of sequence lengths (based on the biopython tutorial, I'm doing this in ipython):
from Bio import SeqIO
import json

sizes = [len(rec) for rec in SeqIO.parse("filtered_subreads.fasta", "fasta")] #get the length of each sequence
counts = dict() 
for x in sizes: #make a dict where keys are size, and values are the number of sequences of that length
    counts[x] = counts.get(x,0) + 1
with open('filtered_subreads_lengths.json', 'w') as outfile: #save the dict as a json file, so we can use it later if we want
    json.dump(counts, outfile)
len(sizes), min(sizes), max(sizes) # outputs the number of sequences and range of the sequence lengths (4117507, 1, 16013), if not using an interactive interpreter, you could print these 
import pylab
pylab.hist(sizes, bins=30)
pylab.title("%i PacBio sequences\nLengths %i to %i" \
% (len(sizes),min(sizes),max(sizes)))
pylab.xlabel("Sequence length (bp)")
Not real pretty. Lots of really short sequences, but also some longer ones. I've got tons of illumina data in the form of 101nt paired end reads. So where the PacBio data will be most useful will be for reads longer than 202 nt (or perhaps 101 nt in some regions). So lets see what the data looks like if we get rid of all the reads shorter than 202 nt.

subset = [x for x in sizes if x > 202]
len(subset),min(subset),max(subset) #(2590650, 203, 16013)
Looks like we've still got about 2.5 million reads, which is 63% of our original sequences. The histogram once again looks like exponential decay. If I repeat for sequences longer than 1 kb, I find that there are 1.27 million of them, 31% of the original reads. Which gives us a ballpark estimate of (haploid) genome coverage for the ~500 Mb genome of about 3x-5x. If we use a 500 bp cutoff we get an estimated 6x coverage. From the illumina data we have 13.5 gigabases which is about 27x coverage.

The target fragment size for the Illumina library must have been around the typical 200 bp. Based on an estimate from a CLC-Bio assembly output, the average fragment size was actually a bit less than that, maybe 150 bp or so.
The short insert size indicates that merging overlapping pairs may improve assemblies. I used FLASH to merge reads with default settings except for switching from phred 33 to phred 64 with the option "-p 64".

There are many potential strategies we can use to try to assemble a draft genome from this data:
1. Use the Celera Assembler and PacBioToCA. I haven't seen many (any) reported success stories for using this with largish genomes like this one, but they give instructions for large genomes in the FAQ. They say "Approximately 100X (or more) of C2 sequencing is ideal." which is way more depth than we've got, but it couldn't hurt to try. This process is also explained in this PacBio tutorial.
2. Assemble the short-reads by themselves (by whatever means... There are lots of shortread assemblers out there), and then try to fill gaps and glue contigs together with the PacBio reads. Gap filling could be done with PBjelly. Or this blog seems to suggest doing it manually or with custom scripts.
3. Correct the reads with LSC, or PacBioToCA, and do a hybrid assembly with an assembler that can use multiple read sources.
4. Use Cerulean assembler. They don't recommend using it for large genomes, but apparently they're working on that, so it may be worth trying eventually.
5. Use the process described by Allen Van Deynze in his spinach genome talk  where he discusses binning the PacBio reads into a bin of shorter and a bin of longer reads, then using the shorter reads to correct the longer reads using FALCON an HBAR-DTK. He had way longer reads and deeper coverage than we've got, but it may be worth it to try anyways.
6. Use the process proposed by Mike Schatz (as mentioned in this blog) and briefly explained in the second half of this presentation. And further elaborated towards the end of this presentation. Basically, it involves first using Celera Assembler to generate unitigs from illumina reads, then use those to correct the PacBio reads, then assemble the PacBio reads (also with Celera). A partial implementation of this process can be found as ectools.
7. In the process of trying out all of the strategies listed above, I've devised by own strategy. Correct PacBio reads (initially using Celera unitigs from assembling the Illumina data as single end reads, for some reason treating them as paired end reads gave worse unitigs, could also try other methods of generating unitigs). Assemble the corrected PacBio reads with Celera Assembler. Generate a short read assembly using only the Illumina reads (try FLASH merged, and not FLASH merged, also try a range of K-values and assemblers). Merge assemblies with GAM-NGS, then run PBjelly.
8. Like 7, except assemble all the reads (corrected PacBio, FLASHed Illumina) together in one assembly (try Abyss, MaSuRCA, SOAPdeNovo2, Celera, and Minia, CLC, whatever), then PBjelly with uncorrected PacBio reads.

1. Following the tutorial on the PBcR page. First I had to install AMOS, configure complains about a bunch of dependencies. This was pretty easy, as Ubuntu has packages for most of them, and CPAN has some.
sudo apt-get install mummer
sudo apt-get install qt4-default
sudo apt-get install libboost-graph-dev
cpan DBI
cpan Statistics::Descriptive
(for cpan Statistics::Descriptive to install correctly, I had to make cpan install packages system-wide, as explained here). After all this, BLAT was the only dependency left, and I think I should be ok without it. make gave an error: "error: ‘getopt’ was not declared in this scope" but google turned up a solution here (in src/Align/ add #include <getopt.h>). After compiling add it to $PATH, and off we go. Prepare the illumina frg file with fastqToCA. Compile BLASR and put it in $PATH (this is straight forward). Download the sample data from the PBcR tutorial. The computer I'm using is similar enough to the one from the tutorial that I'm just following their configuration without modification. Change pacbio.spec to include the lines:
ovlThreads=16 #replace previous setting
ovlConcurrency=1 #replace previous setting
ovlHashBlockLength=1000000000 #replace previous setting
ovlRefBlockLength=1000000000 #replace ovlRefBlockSize
blasr=-noRefineAlign -advanceHalf -noSplitSubreads -minMatch 10 -minPctIdentity 70 -bestn 24 -nCandidates 24
And run with:
pacBioToCA -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq filtered_subreads_thresh_500.fastq s_5_seqsMerged.frg > pacBioToCA.out 2>&1
pacBioToCA got stuck at the stage where it ought to call BLASR
The last line of the input was:
/media/langelab/data/longifolia_genome//temppacbio/1-overlapper/ 1
When it got here, it stuck for two days (over a weekend) with CPU at 100% (so, total utilization of a single core). Eventually I killed the process and ran
source /media/langelab/data/longifolia_genome/temppacbio/1-overlapper/ 1 >> pacBioToCA.out 2>&1
myself in the console, and it finally launched BLASR. I believe the reason it did not work before is because the data directory is on an NTFS drive that is not currently mounted to allow permissions changes, although that seems to be possible. Also, it's hard to tell whether I've set the threading settings well, I'm using 12 physical cores with hyperthreading allowing up to 24 threads. I kept the threading settings they use in the tutorial, optimized for a 16 core machine, figuring maybe that would cause my machine to max out 8 cores, but top shows it using 24 threads at an average of 50% load on each, so I'm not sure exactly what's going on, it seems to work well enough though. To solve the NTFS permissions problem, added the following to /etc/fstab:
UUID=386816BB681677B8   /media/langelab/data    ntfs    defaults,auto,umask=000 0       0
Permissions still not perfect, but good enough I think.
The BLASR step ran ok, but when it got to the "" it started using too much memory and accessing swap space constantly and going really slowly. I sent an email to Sergey Koren, the first author on the paper, and he said that the step could take more than 48 gb of RAM on large genomes, however, since it's already done the alignments, I don't have to repeat that step. All I have to do is kill the process and move the temporary files (and the input files) to a higher ram computer and rerun the script. He says that it should pick up where it left off, so that's what I'll do. Sergey also said that decreasing the maxGap option would decrease memory usage. Instead of recompiling AMOS (with all of its myriad dependencies) on the hpc, I'm just copying over the contents of AMOS/bin that I compiled locally, which will hopefully be good enough.
I don't know if it was necessary, but after I moved the temporary files to the hpc, I also changed all the .frg files to point to the right files in the hpc directory structure, and I changed paths used in any .sh files (namely and 1-overlapper/ to be correct relative to the hpc (edit: it appears that this was actually unnecessary I think it rewrites the sh files anyways, though probably not the frg files). The first time I ran pacBioToCA on the hpc it gave me the error:
"Error: found output from runCorrection. Stopping to avoid infinite loops. To try again, remove asm.layout.*"
It looks like the only file I had in temppacbio with that name was asm.layout.err, so I deleted that file and restarted the process. That looks like it worked. However, after about 18 hours of running, the process had generated nearly 3 TB of data, and was using about 300 GB of RAM. At this point, the process terminated with an error (which I assume was caused by the overconsumption of system resources, but I can't find the log, so I don't know for sure). So, I deleted a bunch of the temporary files, changed the maxGap setting to 0 and put the process on the queue again If that works, I'll put it back up to 750 or 1000 and see if that works.Puting the maxGap setting to 0 led to significant reduction in resource utilization, the program got a step further than it had before, but the flaw in my strategy to just copy over the binaries from my desktop to the hpc became apparent, as both BLASR and AMOS died with errors. BLASR compiled easily enough on the hpc, but first had to compile its dependency hdf5 remembering to use the --enable-cxx option on ./configure. AMOS also compiled easily, but when I tried to run one of the executables, I got the same error that I saw with the build from the other machine: "/lib64/ version `GLIBC_2.14' not found", some generic workarounds described here and here. This error doesn't make any sense because AMOS documentation claims that it can be compiled and run on RedHat, and (as of this writing) no version of RedHat has glibc version  >=2.14. So, I spent quite a while trying to make a module that would allow glibc 2.14 to run on RedHat, only to find that once I got it, AMOS still didn't work. Apparently the problem was that I was skipping the `make test` step in the installation, which for reasons I don't understand, is apparently essential on RedHat, but not essential on my Xubuntu desktop. Anyways, all of that hassle, I ran pacBioToCA once again, and it got further than ever before (taking about 15 days to get there), and then crashed. In the meantime, I read about another strategy (#6 in this list), that promises to do the same job, but much more efficiently. So I'm giving up on pacBioToCA for now (and hopefully forever). I think my 500 GB genome is just too much for it, and it keeps crashing, and the debug cycle is extremely long (sometimes it runs for 2 weeks, and then crashes... The obvious solution, I suppose would be to test with a much smaller test dataset, and once it works for that, switch to the larger dataset. Maybe when I'm desperate, I'll try that).

  • Using minia as a shortread assembler: kmergenie (default settings), recommends a kmer-size of 43, but the graph was jagged rather than concave, so it may be a bad estimation (it's a diploid genome, but when I ran kmergenie with the diploid option, it failed). Ran minia with settings: s_5_seqsMerged.fastq 43 3 500000000 k_43_. Not a great assembly. contigs: 1,784,832; N50: 201; mean length: 183; min length: 87; max length: 4782. Tried again with K = 27 (took about 32 hours to run). K = 43 seems to be the better option: contigs: 2,008,767; min = 55; max = 4789; average = 140.24.
    • Tried running minia again, this time with the FLASH merged contigs, and a range of kmer sizes from 17 to 51 step 2.
  • Using SOAPdenovo2: compiled nicely right out of the box . Rumored to be memory intensive, so I'll put it on a node with 512GB. Easy to compile. I need to know more about the Illumina reads before I can configure this properly
  • ABySS: Needs some external libraries to compile properly (see their quickstart for details) : boost, sparsehash, and OpenMPI. I already had boost and OpenMPI, but I had to download and compile sparsehash (I don't have root access on the hpc, so I had to change the Makefile so it installed into my home directory, by changing the $prefix variable). To compile ABySS, I used "./configure --with-boost=/home/software/boost-1.48_intel/include --with-mpi=/home/software/mpi/gcc/openmpi-1.4.5 CPPFLAGS=-I/home/langelab/seanrjohnson/local/include", then make. Then "make install DESTDIR=/home/langelab/seanrjohnson/local" because we can't install this globally either (that DESTDIR is not set quite right, I need to investigate how to do that better).
    Run the following Torque script to try a bunch of different kmer-size assemblies (20 to 64 by 2s)
    #PBS -l nodes=1:ppn=32
    #PBS -o /scratch/langelab/abyss.out
    #PBS -e /scratch/langelab/abyss.err
    #PBS -M
    #PBS -m ae
    #PBS -l walltime=120:00:00
    #PBS -q gp 
    cd /home/langelab/seanrjohnson/longifolia
    export k
    export j
    for j in {10..32}; do
        k=$(($j * 2))
        mkdir -p k$k
        abyss-pe np=32 -C k$k name=longifolia in=../s_5_seqsMerged.fastq.gz
    abyss-fac k*/longifolia-contigs.fa 
    each run seems to take much less memory, but more time than I expected maybe 40 GB of RAM, and 1.5 days. Probably better to run these on a bunch of normal nodes instead.

    n n:500 n:N50 min N80 N50 N20 E-size max sum name
    16.52e6 5261 2272 500 530 595 722 639 1558 3240189 k20/longifolia-contigs.fa
    12.37e6 40910 14617 500 584 777 1168 908 3459 32.07e6 k22/longifolia-contigs.fa
    9983364 62153 20367 500 611 870 1444 1070 5120 53.55e6 k24/longifolia-contigs.fa
    8500598 75342 24011 500 629 922 1558 1146 5996 67.86e6 k26/longifolia-contigs.fa
    7419527 84646 26652 500 639 952 1617 1182 6098 77.99e6 k28/longifolia-contigs.fa
    6607797 91254 28621 500 646 968 1640 1199 7171 85.11e6 k30/longifolia-contigs.fa
    5948290 96755 30313 500 650 977 1644 1204 7173 90.69e6 k32/longifolia-contigs.fa
    5372380 100823 31799 500 650 973 1623 1193 7175 94.21e6 k34/longifolia-contigs.fa
    4920622 104277 33305 500 647 959 1580 1167 6486 96.31e6 k36/longifolia-contigs.fa
    4517600 106868 34639 500 640 938 1524 1133 6113 97.04e6 k38/longifolia-contigs.fa

    The maximum assembly size (in contigs >= 500 bp)  is small, only 97.04 MB out of a predicted 500 MB genome. When I run the stats with a minimum contig cutoff of 0, I see that, in fact, most of the genome seems to be getting covered, albeit mostly with very short sequences (the Sum column is between 550 MB and 431 MB with higher Kmer values corresponding to lower sum). Doing more assemblies, this time with kmers between 40 and 64:
    n n:500 n:N50 min N80 N50 N20 E-size max sum name
    4166577 108544 35754 500 632 915 1460 1093 6094 96.49e6 k40/longifolia-contigs.fa
    3918297 109660 36817 500 624 886 1394 1053 6088 95.14e6 k42/longifolia-contigs.fa
    3642679 109057 37398 500 615 856 1326 1010 6090 92.2e6 k44/longifolia-contigs.fa
    3406290 107276 37542 500 606 826 1255 968 5523 88.2e6 k46/longifolia-contigs.fa
    3155689 104013 37245 500 596 796 1185 925 6958 82.97e6 k48/longifolia-contigs.fa
    2925572 99202 36307 500 587 768 1120 885 6960 76.84e6 k50/longifolia-contigs.fa
    2683522 92908 34748 500 579 742 1057 848 6823 69.93e6 k52/longifolia-contigs.fa
    2573216 84814 32355 500 570 717 1003 817 6823 62.08e6 k54/longifolia-contigs.fa
    2389951 75921 29487 500 563 695 953 788 8339 54.08e6 k56/longifolia-contigs.fa
    2210301 66125 26110 500 556 675 905 762 8342 45.93e6 k58/longifolia-contigs.fa
    2040138 55795 22333 500 550 656 867 738 6811 37.84e6 k60/longifolia-contigs.fa
    1884092 45542 18548 500 545 641 833 720 6815 30.27e6 k62/longifolia-contigs.fa
    1712801 35788 14706 500 541 627 808 707 6817 23.39e6 k64/longifolia-contigs.fa

    We're getting the best assembly at a kmer size of about 38 it looks like. I tested to see if using the FLASH joined reads would improve the assembly quality.

3. LSC always seemed to use way more memory than the 10GB they claim on the website. In the Bowtie step, it launches 10 instances of bowtie2, each of which use about 5 GB of RAM, which is enough to put my computer over the edge, and slow everything way down. I decided to try again on a high-RAM hpc node. First I need to get python, numpy, and scipy installed there, which is kind of a pain, because I don't have root access, and the hpc is not directly connected to the internet, so I have do download packages locally, and then scp them over and compile them. Compiling scipy and numpy looks like a pain (and I don't know if I have the right Fortran compiler), so I'll take the alternative path of compiling SAGE, which looks easier, and includes scipy and numpy. I had all of the prereqs except 'texinfo', which was easy enough to compile (difference: ./configure --prefix=/home/langelab/seanrjohnson/local). (SAGE is a complicated package that I think actually builds most of its environment from scratch). Compiling SAGE worked, so now I just have to point LSC to the SAGE bin directory (/home/langelab/seanrjohnson/sage-6.1.1/local/bin) for its python executable, and I should be good to go. Not that simple it seems, after a bit of finagling with environment variables, I finally did get it working:
add the following to .bash_profile
(update: turns out messing with LD_LIBRARY_PATH may not be such a great idea)

Running LSC gave me the error "bowtie2-build: /lib64/ version `GLIBC_2.14' not found (required by bowtie2-build)" which I fixed by recompiling bowtie2 on the hpc (I think at first I had just copied the binaries that I had compiled locally which apparently was a bad idea). Now bowtie2-build worked, but bowtie2 itself gave the error "GetOptionsFromArray" is not exported by the Getopt::Long module" Which I found out is because the version of Perl on the hpc is old (5.8.8, whereas it must be >= 5.10 for that option to work). So... now I have to compile the newest version of Perl from source on the hpc: honestly not so bad, just followed the instructions here (pointing it to a directory in my home folder of course), it took forever to compile and test, but it worked. After all that, the LSC test data ran, so I submitted my job to the scheduler: kept the default settings in run.cfg except for pointing it to the right files, and setting it to use 32 cores. On the high RAM node, each of the 31 bowtie2-align-s jobs it ran took about 10 gb of RAM. It ended up taking longer than the 24 hours I scheduled, so I tried again with 6 days of computer time, and that failed too. Maybe I should try again with 20 days?

4. Never tried this.

5. Never tried this either.

6. First I made sure Celera Assembler (8.1) was on the system and working. I then converted my Illumina reads into frg format.

/home/langelab/Desktop/assembly_software/wgs-8.1/Linux-amd64/bin/fastqToCA -libraryname s_5_seqsMerged -technology illumina -type illumina -innie -reads ../s_5_seqsMerged.fastq > s_5_seqsMerged.frg
#Note: don't do this, see below

#note: they are actually paired end reads, but I didn't know the insert length, so I'm just treating them as single end reads for now.

/home/langelab/Desktop/assembly_software/wgs-8.1/Linux-amd64/bin/runCA -d s_5_seqsMerged_assembled -p s_5_seqsMerged -s asm.spec s_5_seqsMerged.frg

#specfile adapted from one of the examples on the CA wiki:

# -------------------------------------
# -------------------------------------
#  Expected rate of sequencing error. Allow pairwise alignments up to this rate.
#  Sanger reads can use values less than one. Titanium reads require 3% in unitig.
utgErrorRate=0.03 # not used because I selected the bogart unitigger
utgErrorLimit=2.5 # not used because I selected the bogart unitigger
ovlErrorRate=0.06 # Larger than utg to allow for correction.
cnsErrorRate=0.10 # Larger than utg to avoid occasional consensus failures
cgwErrorRate=0.10 # Larger than utg to allow contig merges across high-error ends
merSize = 22 # default=22; use lower to combine across heterozygosity, higher to separate near-identical repeat copies 
overlapper=ovl # the mer overlapper for 454-like data is insensitive to homopolymer problems but requires more RAM and disk

# utgGenomeSize = # not set!
# -------------------------------------
# -------------------------------------
mbtConcurrency = 2
mbtThreads = 8

#  MERYL calculates K-mer seeds
merylMemory   = 40000
merylThreads    = 24
#  OVERLAPPER calculates overlaps
#ovlMemory         = 8GB  
ovlThreads         = 10
ovlConcurrency     = 2
#ovlHashBlockSize = 2000000
#ovlRefBlockSize  = 32000000
#  OVERLAP STORE build the database
#ovlStoreMemory   = 8GB  # Oops! That doesn't work. See correction below. 
ovlStoreMemory = 8192 # Mbp
# ERROR CORRECTION applied to overlaps
frgCorrThreads    = 10
frgCorrConcurrency = 2
ovlCorrBatchSize  = 1000000
ovlCorrConcurrency = 24
# UNITIGGER configuration
# CONSENSUS configuration
cnsConcurrency   = 16
unitigger = bogart
utgBubblePopping = 0

This run didn't do quite what I expected. For one thing, it didn't seem to recognize that the reads are paired-end, as it reported: "ReadsWithNoMate                 51342111(100.00%)"
But, that doesn't matter because I'm just after unitigs, which it seems to have done just fine.

I found out why it didn't realize they were paired end, I should have run fastqtoca with the "mates" parameter instead of the "reads" parameter. I don't actually know for sure what the insert size is, but based on FLASH and the CLC-Bio output, I can guess it is something like 150 +/- 150 or so (although it has a long right tail, and its left tail cuts off abruptly at 101):

/home/langelab/Desktop/assembly_software/wgs-8.1/Linux-amd64/bin/fastqToCA -libraryname s_5_seqsMerged -technology illumina -type illumina -innie -insertsize 150 150 -mates ../s_5_seqsMerged.fastq >

I decided to try generating unitigs with the FLASH merged reads:
Since we've removed all of the overlaping reads, make an frg of the remaining with 250 +/- 50 as the length.

/home/langelab/Desktop/assembly_software/wgs-8.1/Linux-amd64/bin/fastqToCA -libraryname s_5_seqsMerged_notCombined -technology illumina -type illumina -innie -insertsize 250 50 -mates ../flash_merged/s_5_seqsMerged.notCombined_1.fastq,../flash_merged/s_5_seqsMerged.notCombined_2.fastq > flash.s_5_seqsMerged.notCombined.frg

/media/langelab/data/longifolia_genome/celera_assembly$ /home/langelab/Desktop/assembly_software/wgs-8.1/Linux-amd64/bin/fastqToCA -libraryname s_5_seqsMerged -technology illumina -type illumina -reads ../flash_merged/s_5_seqsMerged.extendedFrags.fastq > flash.s_5_seqsMerged.extendedFrags.frg

/home/langelab/Desktop/assembly_software/wgs-8.1/Linux-amd64/bin/runCA -d flash_s_5_seqsMerged_assembled -p flash_s_5_seqsMerged -s asm_flash.spec flash.s_5_seqsMerged.notCombined.frg flash.s_5_seqsMerged.extendedFrags.frg 

This time it quit early with the error:

Starting file '/media/langelab/data/longifolia_genome/celera_assembly/flash.s_5_seqsMerged.extendedFrags.frg'.

Processing SINGLE-ENDED ILLUMINA 1.3+ QV encoding reads from:

GKP finished with 11071283 alerts or errors:
11071282 # ILL Error: seq longer than longer than gkpShortReadLength bases, truncating.
1 # LIB Alert: already exists; can't add it again.

ERROR: library IID 1 's_5_seqsMerged' has 11.15% errors or warnings.

Failure message:

gatekeeper failed

Turns out, for illumina reads longer than 160 bp, you need to use the illumina-long technology specifier. So I reran fastq-CA for the FLASH merged reads, using "-technology illumina-long", and "-libraryname s_5_seqsMergedFlashed" to get rid of that second error which came about because I used the same library name in both frg files.

Fixing that and rerunning. asm file is the same as above except that I added the line:
so that it won't take so much time.
After a complete run, you'll get a utg.fasta file with the unitigs. When you stop it early, it doesn't get that far, so you've got to extract the unitigs yourself. I believe the command to do this is:
tigStore -g flash_s_5_seqsMerged.gkpStore -t flash_s_5_seqsMerged.tigStore 8.1 -U -d consensus > sequences.utg.fasta
Nope... that's definitely not it. There must be some redundant unitigs in there, because the that file is way too big, and has a total length of 1.9 gigabases.

Running to completion (luckily runCA is good about starting back up where it left off).

Interestingly, the entire assembly using the FLASH merged reads, and using raw reads as pairs came out much worse than the assembly with paired end reads assembled as though they were single end reads (at least by the general statistics, perhaps there is less misassmbly). Also it is notable that I changed two variables here, I used FLASH merged reads, and I also had it treat paired end reads as paired end reads rather than single reads. Maybe it was the second change, not the first that resulted in the worse assembly.

Step 8 of the ectools tutorial is written for some kind of grid computing system that I don't have, so I rewrote steps 8 and 9 to use GNU parallel instead.

#! /bin/bash
export TMPDIR=/media/langelab/data/longifolia_genome/celera_assembly/longifolia_correct/temp_dir #be sure to make this directory
mkdir -p $TMPDIR 
NUM_PARTITIONS=0213 # should be a 4 character wide integer left-padded with zeros

run_file() {
        export SGE_TASK_ID=$1
export -f run_file

for i in `eval echo {0001..$NUM_PARTITIONS}` #braces evaluated before variable
 echo $i
        cd $i
        parallel -j $THREADS run_file ::: `eval echo {1..$NUM_FILES_PER_PARTITION}`
        cd ..

cat ????/*.cor.fa > ${ORGANISM_NAME}.cor.fa

I tried running that on my local computer, but it would have taken way too long, so I modified it once again to run on the campus HPC that uses the Torque scheduler (hint: if running multiple directories at the same time be sure that their temp files are mapped to different folders, otherwise there will be collisions as the process correcting one folder overwrites the temp files for the process correcting a different folder). I used the unitigs from my first Celera assembly of the illumina data as the basis for the correction (because it seemed to be the highest quality set of unitigs I have. I wonder if unitigs from an assembler other than Celera would work very well. Perhaps an entire Minia assembly would work nicely as unitigs.) Happily, it finished, with a total run time of about 6 days (Celera assembly of short-reads plus correction of long reads)  which is way shorter than the run time of PacBioToCA was (before I killed it out of hopelessness). Unfortunately, we're down to about 3.5x PacBio coverage (from  the 6x of the input).

Following the ECtools readme, I converted the fasta file of corrected reads into a celera frg file:
$celera/ -l longifolia_pbcor -s longifolia.cor.fa -q <( python ${ECTOOLS_HOME}/ longifolia.cor.fa) > longifolia.cor.frg

The readme says that Celera with default settings should work pretty well, so I'll give that a shot (the only thing I'm worried about is that it will take forever because it won't be set to use lots of cores).

$celera/runCA -d corrected_pacbio -p corrected_pacbio utgGraphErrorRate=0.01 longifolia.cor.frg

Assembly finished in just a couple days. According to the statistics, there are 30198 scaffolds, representing 92.4 megabases of sequence. N50=3168. Not great, but pretty good compared to what I've been seeing with the other assemblies.

To see if I could recover some of the data lost in the correction of the PacBio reads, I ran PBjelly to try to join the assembled scaffolds into bigger scaffolds.

PBjelly helped a bit:
n    n:0    n:N50    min    N80    N50    N20    E-size    max    sum
28684    28684    9244    1529    2840    4204    6492    4827    29564    111.8e6   
I think that is, by far, my best assembly yet. But I think its small size indicates that we don't have high enough coverage.


8. MaSurCA using corrected PacBio reads, and flashed illumina reads:
Use default parameters, try running it on my desktop, hope it doesn't use more than 48 gb of RAM. MaSurCA got to the stage of assembly where it ran the Celera Assembler, it crashed, I believe because it can't actually handle PacBio reads, according to the paper:
"We note that the modified version of CABOG 6.1 used in MaSuRCA is not capable of supporting the long high-error-rate reads generated by the PacBio technology."
I was hoping it could handle corrected PacBio reads, but apparently it can't do that either. It kept complaining about sequence length being longer than 2047, and eventually crashed, presumably for that reason, but perhaps for some other reason.

To try to get around that, I chunked my PacBio corrected reads into overlapping chunks of length 2047. This got much further, but eventually it also crashed at runCA step 3, with the error: " at /home/langelab/Desktop/assembly_software/MaSuRCA-2.2.1/bin/../CA/Linux-amd64/bin/runCA line 1121
    main::caFailure('109 overlap correction jobs failed; remove /media/langelab/da...', undef) called at /home/langelab/Desktop/assembly_software/MaSuRCA-2.2.1/bin/../CA/Linux-amd64/bin/runCA line 3019
    main::overlapCorrection() called at /home/langelab/Desktop/assembly_software/MaSuRCA-2.2.1/bin/../CA/Linux-amd64/bin/runCA line 5343"
in the files "CA/3-overlapcorrection/*.err" I get the following:
"/home/langelab/Desktop/assembly_software/MaSuRCA-2.2.1/CA/Linux-amd64/bin/correct-olaps: AS_configure()-- AS_CGW_ERROR_RATE set to 0.25
Quality Threshold = 1.50%
Starting Read_Frags ()
Starting Correct_Frags ()
Starting Read_Olaps ()
Starting qsort ()
Starting Redo_Olaps ()
Aborted (core dumped)"

 Tried again with just a subset of the paired end reads (the ones that FLASH could not merge) and no single end reads (that is: no FLASH merged reads, and no PacBio reads). This didn't work either. I give up on MaSuRCA...

Velvet is also capable of hybrid assemblies. It uses the de-brujin graph strategy rather than the overlap strategy, so it probably doesn't handle long reads as well as MaSuRCA, but it's worth a shot:

#using three datasets, Hi-seq, GA2x, and PacBio.


$velvet_dir/velveth h_out 21 -shortPaired -fastq -separate /media/langelab/data/longifolia_genome/flash_merged/s_5_seqsMerged.notCombined_1.fastq /media/langelab/data/longifolia_genome/flash_merged/s_5_seqsMerged.notCombined_2.fastq -shortPaired2 -fasta -separate /media/langelab/data/longifolia_genome/GA_2_reads/trimmed_s_1_1.fastq /media/langelab/data/longifolia_genome/GA_2_reads/trimmed_s_1_2.fastq -short /media/langelab/data/longifolia_genome/flash_merged/s_5_seqsMerged.extendedFrags.fastq -long /media/langelab/data/longifolia_genome/celera_assembly/longifolia.cor.fa

$velvet_dir/velvetg h_out -ins_length 150 -ins_length2 200

Using hash length 21, it used up more RAM than my computer has (46 gb). Tried again with hash-length 31. Same problem.

references and further reading:


A bunch of relevant posts from the blog:

No comments:

Post a Comment