Tuesday, July 29, 2014

Using SwissProt and ExPASy ENZYME to generate putative EC annotations for sequence data

There was a question on Biostars about how to make EC assignments based on sequence. I gave one of the answers, suggesting a few possible solutions. One of my solutions was to Blast against ExPASy ENZYME and base the annotations on the best hit from there. In this post I explain how to do that, and supply the necessary Python code.
[Note: PRIAM is another tool that assigns EC numbers to sequences based on the data in the ENZYME database. Instead of individual blast hits, it uses profiles built from multiple sequence alignments of peptides known to catalyze a given reaction. It's also a lot quicker to use than the method outlined here, so it's probably worth checking out first]

Goal: Starting from a set of unannotated sequences, we want a table with at least two columns: sequence name, and putative EC number.

Strategy overview:
First we'll generate a subset of SwissProt including just those enzymes referenced by Enzyme.  Next we'll create a Blast database from that subset of SwissProt, and run a search against it using our unknown sequences as the query. Based on the XML output from the search, we'll filter out all but the best hits, and then use the remaining hits to retrieve EC numbers from the Enzyme database.
FASTA file of sequences you want to annotate
Python 2.7 (other versions of Python may work too, but this is what I tested on)

Detailed Protocol:

First download the latest enzyme.dat file from the ExPASy ftp site (for detailed information about the format of that file, look at enzuser.txt). We'll need to grab the EC numbers and uniprot IDs, I'm also including the description and whether or not the EC number is "transferred". A transferred number is one that is obsolete because the reactions it described have been moved to other EC numbers.  Here's a script to extract the data. It uses a parser from BioPython, and a tsv writer from Pandas. Ideally, we'd use command line arguments instead of defining parameters at the top of a file, but I'm doing it the easy way for most of the scripts in this post.

    Reads a Expasy Enzyme .dat file and writes a tab separated file where the 
    first column is EC number, the second column is the reaction description, 
    the third column is the associated uniprot ids separated by spaces, 
    and the fourth column indicates whether the reactions described 
    by this EC have been transferred to other EC numbers.
from Bio.ExPASy import Enzyme

import pandas as pd
### Parameters ###
input_name = "enzyme.dat"
output_name = 'ec_uniprot.tsv'

### program ###
handle = open(input_name)
records = Enzyme.parse(handle)

out = dict() #dict of dicts, first key: EC number, second key: field
transferred = dict() #dict of lists
for record in records:
    if 'Transferred entry:' in record['DE']:
        record['DE'] = record['DE'].rstrip('.') #remove period
        record['DE'] = record['DE'].replace('Transferred entry:',' ') #remove title
        record['DE'] = record['DE'].replace(',',' ') #remove commas
        record['DE'] = record['DE'].replace('and',' ') #remove and
        point_to = record['DE'].split()
        transferred[record['ID']] = point_to
        out[record['ID']] = dict()
        out[record['ID']]['uniprot'] = ' '.join([x[0] for x in record['DR']])
        out[record['ID']]['description'] = record['DE']
        out[record['ID']]['transferred'] = False

for id in transferred:
    out[id] = dict()
    out[id]['uniprot'] = ' '.join([out[x]['uniprot'] for x in transferred[id]])
    out[id]['description'] = 'Transferred entry: ' + ' '.join(transferred[id])
    out[id]['transferred'] = True
df = pd.DataFrame.from_dict(out, orient='index')
df.index.name = 'EC'
df.to_csv(output_name, sep='\t')

Now download the Swissprot fasta file and extract it (7zip on Windows, gunzip on Linux).

This next step is optional, and takes a long time. It may be good to generate a subset of SwissProt containing only the sequences associated with EC numbers, eventually, we'll be annotating based on best Blast hit, so there may be situations where two close homologs are in SwissProt, but only one of them is in Enzyme, we want to ensure that the one that is in Enzyme is our top Blast hit. I tried running this workflow both ways, with using a subset of SwissProt, and with using the whole thing. The results were that of 34728 peptide sequences from a genome, 2429 received EC numbers when I used all of SwissProt, and 2479 when I used the subset. So the difference is pretty small. To generate a subset, we first use get_uniprot_ids.py to make uniprot_ids.txt, a list of all unique uniprot IDs from the Enzyme file. Then we use subset_fasta.py (from srj_chembiolib) to generate a fasta file including only the IDs listed in uniprot_ids.txt. (As of this writing, there are 546 000 proteins in SwissProt, 216 854 of these are referenced by Enzyme).


    read the tsv output of parse_enzyme.py
    write a file with one uniprot ID per line, containing all of the
    uniprot IDs mentioned in uniprot column of the input file
    Ignore EC numbers that have been transferred
import pandas as pd

input = "ec_uniprot.tsv"
output = "uniprot_ids.txt"

df = pd.read_table(input)
df.dropna(subset=['uniprot'], inplace=True) #ignore EC numbers with no uniprot ids associated
df = df[df.transferred == False] #ignore EC numbers that are obsolete due to transfer

unique_uniprot = set(" ".join(df.uniprot.values).split(" "))

with open(output, "w") as outfile:
    for id in unique_uniprot:
        outfile.write(id + "\n")

If srj_chembiolib is in your path and you're using Linux, just use the following command at the command line:
(another way to get lots of uniprot sequences quickly is to use the bulk download feature from uniprot)

$ subset_fasta.py -i uniprot_sprot.fasta -o sprot_subset.fasta --names uniprot_ids.txt

otherwise, just copy it to your working directory and call with python

$ python subset_fasta.py -i uniprot_sprot.fasta -o sprot_subset.fasta --names uniprot_ids.txt

(because SwissProt and our list of desired accessions are both so big, this process takes a while (like overnight on my computer), if you know a way to speed it up, please submit a patch to srj_chembiolib !)

Now we've got to make a blast database from swissprot, or our subset of swissprot. Make sure you've got NCBI local BLAST+ installed, and on the command line, run:

$ makeblastdb -in sprot_subset.fasta -dbtype prot -out sprot_subset -hash_index

Next, we Blast our sequences against SwissProt (or our subset of it). The database contains protein sequences, so if the sequences you're trying to annotate are nucleotide sequences, you can either run a blastx search, or translate your sequences to peptides using a tool like TransDecoder or ESTscan (I've used both, and both seem to work well as far as I can tell). My example data is already as peptide sequences, so I'll just go right to a Blastp. To start, I'm going to limit the search to the best hit and set the output to xml. We'll filter the results to be more rigorous in the next step. To make the search go significantly faster, you should also use the -num_threads x option, where x is the number of cores in your computer.

$ blastp -max_target_seqs 1 -outfmt 5 -query new_sequences.fasta -db sprot_subset -out new_sequences_sprot_enzyme.xml -num_threads 4

Almost done! Now that we've got the SwissProt hits what's left is to filter out low scoring matches, assign and retrieve the EC numbers associated with the remaining top hits. Here we're faced  with the tricky question of "how similar do two proteins have to be on the sequence level to catalyze the same reaction?" Unfortunately, the answer to that question is: "It depends" Very different enzymes can catalyze the same reaction (see also here), and very similar enzymes sometimes catalyze different reactions. Still, we need to choose some heuristic, Tian and Skolnick 2003 suggest 60% identity for 90% accuracy, which seems reasonable enough, so let's use that. To filter the Blast hits, and at the same time convert the data to a convenient tabular format we'll use another program from srj_chembiolib, blast_xml_to_outfmt6.py:

$ python blast_xml_to_outfmt6.py -i peptides_sprot_enzyme.xml -o new_sequences_sprot_enzyme.outfmt6 --target_identity 60 --use_fillers --num_hsps 1 --write_header --target_def

Now to transfer EC numbers to our sequences. I thought it would be possible to do this with Excel, but I couldn't figure it out. Excel apparently has no reliable way to perform a vlookup like operation to find internal substring matches in long strings. So I'll use Python Pandas.


import pandas as pd

#### parameters ####
hits_table_file_name = 'new_sequences_sprot_enzyme.outfmt6'
enzyme_table_file_name = 'ec_uniprot.tsv'
output_file_name = "sequences_with_EC_annotations.tsv"

#### program ####

#open data files
hits = pd.read_table(hits_table_file_name, index_col=0)
enzyme = pd.read_table(enzyme_table_file_name)

hits.fillna('', inplace=True)  #replace empty values with blank spaces
enzyme.fillna('', inplace=True)

enzyme = enzyme[enzyme.transferred == False] #drop transferred EC numbers

hits.target = hits.target.str[3:9] #take just the uniprot ID from the name

def get_ecs(uniprot):
    if uniprot == '': #ignore invalid uniprot ids
        return ''
        return ' '.join(enzyme.EC[enzyme.uniprot.str.contains(uniprot)].values)

hits['EC'] = hits.target.apply(get_ecs)

hits.to_csv(output_file_name, sep="\t")

That's it! I hope it's useful to you.

No comments:

Post a Comment