[Note: PRIAM is another tool that assigns EC numbers to sequences based on the data in the ENZYME database. Instead of individual blast hits, it uses profiles built from multiple sequence alignments of peptides known to catalyze a given reaction. It's also a lot quicker to use than the method outlined here, so it's probably worth checking out first]
Goal: Starting from a set of unannotated sequences, we want a table with at least two columns: sequence name, and putative EC number.
First we'll generate a subset of SwissProt including just those enzymes referenced by Enzyme. Next we'll create a Blast database from that subset of SwissProt, and run a search against it using our unknown sequences as the query. Based on the XML output from the search, we'll filter out all but the best hits, and then use the remaining hits to retrieve EC numbers from the Enzyme database.
FASTA file of sequences you want to annotate
Python 2.7 (other versions of Python may work too, but this is what I tested on)
NCBI local BLAST+
First download the latest enzyme.dat file from the ExPASy ftp site (for detailed information about the format of that file, look at enzuser.txt). We'll need to grab the EC numbers and uniprot IDs, I'm also including the description and whether or not the EC number is "transferred". A transferred number is one that is obsolete because the reactions it described have been moved to other EC numbers. Here's a script to extract the data. It uses a parser from BioPython, and a tsv writer from Pandas. Ideally, we'd use command line arguments instead of defining parameters at the top of a file, but I'm doing it the easy way for most of the scripts in this post.
''' Reads a Expasy Enzyme .dat file and writes a tab separated file where the first column is EC number, the second column is the reaction description, the third column is the associated uniprot ids separated by spaces, and the fourth column indicates whether the reactions described by this EC have been transferred to other EC numbers. ''' from Bio.ExPASy import Enzyme import pandas as pd ### Parameters ### input_name = "enzyme.dat" output_name = 'ec_uniprot.tsv' ### program ### handle = open(input_name) records = Enzyme.parse(handle) out = dict() #dict of dicts, first key: EC number, second key: field transferred = dict() #dict of lists for record in records: if 'Transferred entry:' in record['DE']: record['DE'] = record['DE'].rstrip('.') #remove period record['DE'] = record['DE'].replace('Transferred entry:',' ') #remove title record['DE'] = record['DE'].replace(',',' ') #remove commas record['DE'] = record['DE'].replace('and',' ') #remove and point_to = record['DE'].split() transferred[record['ID']] = point_to else: out[record['ID']] = dict() out[record['ID']]['uniprot'] = ' '.join([x for x in record['DR']]) out[record['ID']]['description'] = record['DE'] out[record['ID']]['transferred'] = False for id in transferred: out[id] = dict() out[id]['uniprot'] = ' '.join([out[x]['uniprot'] for x in transferred[id]]) out[id]['description'] = 'Transferred entry: ' + ' '.join(transferred[id]) out[id]['transferred'] = True df = pd.DataFrame.from_dict(out, orient='index') df.index.name = 'EC' df.to_csv(output_name, sep='\t')
Now download the Swissprot fasta file and extract it (7zip on Windows, gunzip on Linux).
This next step is optional, and takes a long time. It may be good to generate a subset of SwissProt containing only the sequences associated with EC numbers, eventually, we'll be annotating based on best Blast hit, so there may be situations where two close homologs are in SwissProt, but only one of them is in Enzyme, we want to ensure that the one that is in Enzyme is our top Blast hit. I tried running this workflow both ways, with using a subset of SwissProt, and with using the whole thing. The results were that of 34728 peptide sequences from a genome, 2429 received EC numbers when I used all of SwissProt, and 2479 when I used the subset. So the difference is pretty small. To generate a subset, we first use get_uniprot_ids.py to make uniprot_ids.txt, a list of all unique uniprot IDs from the Enzyme file. Then we use subset_fasta.py (from srj_chembiolib) to generate a fasta file including only the IDs listed in uniprot_ids.txt. (As of this writing, there are 546 000 proteins in SwissProt, 216 854 of these are referenced by Enzyme).
''' read the tsv output of parse_enzyme.py write a file with one uniprot ID per line, containing all of the uniprot IDs mentioned in uniprot column of the input file Ignore EC numbers that have been transferred ''' import pandas as pd input = "ec_uniprot.tsv" output = "uniprot_ids.txt" df = pd.read_table(input) df.dropna(subset=['uniprot'], inplace=True) #ignore EC numbers with no uniprot ids associated df = df[df.transferred == False] #ignore EC numbers that are obsolete due to transfer unique_uniprot = set(" ".join(df.uniprot.values).split(" ")) with open(output, "w") as outfile: for id in unique_uniprot: outfile.write(id + "\n")
If srj_chembiolib is in your path and you're using Linux, just use the following command at the command line:
(another way to get lots of uniprot sequences quickly is to use the bulk download feature from uniprot)
$ subset_fasta.py -i uniprot_sprot.fasta -o sprot_subset.fasta --names uniprot_ids.txt
otherwise, just copy it to your working directory and call with python
$ python subset_fasta.py -i uniprot_sprot.fasta -o sprot_subset.fasta --names uniprot_ids.txt
(because SwissProt and our list of desired accessions are both so big, this process takes a while (like overnight on my computer), if you know a way to speed it up, please submit a patch to srj_chembiolib !)
Now we've got to make a blast database from swissprot, or our subset of swissprot. Make sure you've got NCBI local BLAST+ installed, and on the command line, run:
$ makeblastdb -in sprot_subset.fasta -dbtype prot -out sprot_subset -hash_index
Next, we Blast our sequences against SwissProt (or our subset of it). The database contains protein sequences, so if the sequences you're trying to annotate are nucleotide sequences, you can either run a blastx search, or translate your sequences to peptides using a tool like TransDecoder or ESTscan (I've used both, and both seem to work well as far as I can tell). My example data is already as peptide sequences, so I'll just go right to a Blastp. To start, I'm going to limit the search to the best hit and set the output to xml. We'll filter the results to be more rigorous in the next step. To make the search go significantly faster, you should also use the -num_threads x option, where x is the number of cores in your computer.
$ blastp -max_target_seqs 1 -outfmt 5 -query new_sequences.fasta -db sprot_subset -out new_sequences_sprot_enzyme.xml -num_threads 4
Almost done! Now that we've got the SwissProt hits what's left is to filter out low scoring matches, assign and retrieve the EC numbers associated with the remaining top hits. Here we're faced with the tricky question of "how similar do two proteins have to be on the sequence level to catalyze the same reaction?" Unfortunately, the answer to that question is: "It depends" Very different enzymes can catalyze the same reaction (see also here), and very similar enzymes sometimes catalyze different reactions. Still, we need to choose some heuristic, Tian and Skolnick 2003 suggest 60% identity for 90% accuracy, which seems reasonable enough, so let's use that. To filter the Blast hits, and at the same time convert the data to a convenient tabular format we'll use another program from srj_chembiolib, blast_xml_to_outfmt6.py:
$ python blast_xml_to_outfmt6.py -i peptides_sprot_enzyme.xml -o new_sequences_sprot_enzyme.outfmt6 --target_identity 60 --use_fillers --num_hsps 1 --write_header --target_def
Now to transfer EC numbers to our sequences. I thought it would be possible to do this with Excel, but I couldn't figure it out. Excel apparently has no reliable way to perform a vlookup like operation to find internal substring matches in long strings. So I'll use Python Pandas.
import pandas as pd #### parameters #### hits_table_file_name = 'new_sequences_sprot_enzyme.outfmt6' enzyme_table_file_name = 'ec_uniprot.tsv' output_file_name = "sequences_with_EC_annotations.tsv" #### program #### #open data files hits = pd.read_table(hits_table_file_name, index_col=0) enzyme = pd.read_table(enzyme_table_file_name) hits.fillna('', inplace=True) #replace empty values with blank spaces enzyme.fillna('', inplace=True) enzyme = enzyme[enzyme.transferred == False] #drop transferred EC numbers hits.target = hits.target.str[3:9] #take just the uniprot ID from the name def get_ecs(uniprot): if uniprot == '': #ignore invalid uniprot ids return '' else: return ' '.join(enzyme.EC[enzyme.uniprot.str.contains(uniprot)].values) hits['EC'] = hits.target.apply(get_ecs) hits.to_csv(output_file_name, sep="\t")
That's it! I hope it's useful to you.