Monday, July 11, 2016

Figuring out which adapters to trim in Illumina data

Often times it's difficult to know what adapter sequences should be trimmed from Illumina data. This can occur if you download public data, for example from SRA or if you send samples for sequencing to a company that doesn't communicate with you very well (not that I have any experience with that...).

Previously it's been a bit of a a struggle for me to figure out which adapters to trim when processing Illumina high-throughput sequencing data. With a little bit of time, and some thought about how Illumina sequencing works, adapters can be identified and removed even if we don't know beforehand what the sequences are.

Read on to see how I figured out the adapter sequences in my most recent RNAseq analysis.

My sample data:
paired end
strand specific
101 bp reads
data from an Illumina HiSeq 3000 instrument

You can find the exact sequences I'm using on SRA SRR3623199

In Illumina sequencing, the target sequences are fragmented, then ligated to adapters. Sequencing starts from a primer that binds to the last sequence in the adapter (the part of the adapter closest to the sample sequence). That means that the source of adapter sequence in Illumina reads will most of the time be from read-through. Read-through is where the fragment between the two adapters is shorter than the read length. In that case, the entire fragment will be sequenced, plus some or all of the adapter that comes after it. What we're looking for mostly is read-through. That means that we're most likely to see primers at the 3' end of our reads, and not at the 5' ends. It also means that we need to be searching for the reverse complements of primer sequences.

Use manual inspection, Fastqc, and/or grep to find reads with likely primer sequence.

Use consensus maker or Pictogram (if it's a relatively short sequence) to generate a consensus sequence of the reads.

The consensus sequence is a likely adapter that should be trimmed.


Navigate to the folder where the fastq data is.

Have a look at the data just to see if there are any obvious patterns.
less SRR3623199_R1.fastq

There don't seem to be any obvious pattern. So let's try opening it with fastqc

Fastqc shows a promising sequence:

A Blastn search shows that the top sequence is not an adapter, but, in fact, the sequence of a virus! That probably means something important about the biological sample, but doesn't help us with primer trimming.

Our next strategy is to search various known primer sequences to see if any occur frequently in our sample. Remember, since we are looking for primers on 3' end of the read, we need to reverse complement the sequences before searching for them (or use a search tool that automatically searches both strands). I'm going to use grep. My source of potential sequences is from here.

After a bit of grepping, I found something promising:
The sequence AGATCGGAAG matches a lot of reads (2,287,216 out of 37,279,073).

We could collect a bunch of these sequences and run them through a multiple sequence alignment program such as clustal, but we can kind of cheat instead.

Since we have so many matches to our query, we can tell grep to only return matches in a specific position.
This command will look for our query sequence followed by 90 of any nucleotide and then the end of the line:
grep 'AGATCGGAAG.\{90,\}$' SRR3623199_R1.fastq

Even with this strict criterion, we still get 99 matches in the dataset, and we can easily manually inspect the matches to look for a consensus sequence which might correspond to the adapter.


We could just use this as our adapter sequence, and it would probably work out pretty well, even if it contains a misread or two.

Just to be thorough, however, we'll take it one step further and look for a consensus sequence.

Our new putative adapter sequence is 64 bp long, and is followed by a variable-length string of A's and then some sequence that seems inconsistent.

which gives us sequences where there are at least 40 more characters after the match. We still get 232,826 matches, which is a lot! So let's only keep the first 1,000. We want them to be in fasta format, so let's add line headers also. The final command is:
grep -o 'AGATCGGAAG.\{54\}$' SRR3623199_R1.fastq | head -n 1000 | awk '{print ">" NR "\n" $0}' > adapter_candidates.fasta

We enter those sequences into the consensus maker program and it gives us the consensus sequence:
AGATCGGAAG agcacacgtc tgaactccag tcaccgatgt atctcgtatg ccgtcttctg cttg

Which a needleman wunsch alignment will tell us is exactly the same as the one we picked out before. But at least we're more confident now.

EMBOSS_001         1 AGATCGGAAGagcacacgtctgaactccagtcaccgatgtatctcgtatg     50

EMBOSS_001        51 ccgtcttctgcttg     64
EMBOSS_001        51 CCGTCTTCTGCTTG     64

What's next?
Next, you should look for adapters on the other set of reads from the paired end reads. They should be similar to the adapter for this set. Finally use your favorite adapter trimmer software to remove the adapter sequence.

Recommended reading and tools:

Awesome video by Illumina explaining paired-end indexed sequencing

Online tool for generating reverse complement of nucleic acid sequences

I borrowed a lot of ideas from these guys

fastqc adapter information

Illumina adapter information

No comments:

Post a Comment