Thursday, December 10, 2015

Extracting gene ontology categories using python pandas and rdflib

I've got a table I exported from Blast2GO, which associates transcript IDs with GO term IDs (an "annot" file in Blast2GO terminology). I want to separate the GO IDs by category: biological_process, cellular_component, or molecular_function. (Note: in this example I use rdflib, a faster way to do the same thing would be to use the more specialized goatools, which is probably the best GO library for python. I used rdflib because I wanted practice SPARQL)


input:
annot file has the form:
Contig50026:7163-9089    GO:0009595    drl28_arathprobable disease resistance protein at4g27220 os=arabidopsis thaliana gn=at4g27220 pe=2 sv=1
Contig50026:7163-9089    GO:0016020
Contig50026:7163-9089    GO:0098542
...

go.owl, from the gene ontology consortium download page.

output:
a table like this:
contig_name   biological_process   cellular_component   molecular_function
[contig name]   [comma separated list of GO terms]   [comma separated list of GO terms]   [comma separated list of GO terms]

Strategy:
I'm using the rdflib and pandas libraries
The owl file contains the mapping of GO IDs to GO categories, as well as a mapping of GO IDs to alternative GO IDs. The annot file associates contigs with GO IDs, as well as Blast hits, and EC numbers.

We first read the owl file, and use SPARQL queries to make a dict of GO ID to GO category mappings. Then we apply this mapping to the GO ID column of the data table. Then we rearrange the table until the columns are GO categories, and the data are comma separated lists of GO IDs

sample output:
sequence    biological_process    cellular_component    molecular_function
Contig0:11772-14489            GO:0003824
Contig0:15848-17595            GO:0003824


Code:

import pandas as pd
import rdflib

table_name = "annot_20151208_1355.annot"
go_rdf_path = "/media/sean/linux_data/gene_ontology/go.owl"
output_name = "blast2GO_go_type_table.tsv"

t = pd.read_table(table_name, index_col=False, header=None, names=['sequence','GO','protein'])

g = rdflib.Graph()

g.parse(go_rdf_path)

go_to_category = dict()
qres = g.query(
    """SELECT DISTINCT ?a ?b
       WHERE {
          ?a oboInOwl:hasOBONamespace ?b
       }""")
for row in qres:
    go_to_category[row[0][-10:].replace("_",":")] = row[1].value
#GO:0045449


qres = g.query(
    """SELECT DISTINCT ?a ?b
       WHERE {
          ?a oboInOwl:hasAlternativeId ?b
       }""")

for row in qres:
    go_to_category[row[1].value] = go_to_category[row[0][-10:].replace("_",":")]

def map_go_to_category(go):
 out = ""
 if go.strip != "" and go[0:2] != "EC":
  out = go_to_category[go]
 return out

t['go_category'] = t['GO'].map(map_go_to_category) #find the GO category

t = t[t['go_category'] != ""] #drop rows without a GO category
grouped = t.groupby(['sequence','go_category'])['GO'] #create a seriesgroup where the indexes are sequence name and go_category and the data is GO ID
t = grouped.agg(lambda x: ",".join(x)) #create a table where the indexes are sequence and go_category and the data are comma separated lists of GO IDs
t = t.unstack() #convert the GO category index entries to column names
t.to_csv(output_name, sep="\t")



references:
https://rdflib.readthedocs.org/en/stable/intro_to_sparql.html

No comments:

Post a Comment