input:
annot file has the form:
Contig50026:7163-9089 GO:0009595 drl28_arathprobable disease resistance protein at4g27220 os=arabidopsis thaliana gn=at4g27220 pe=2 sv=1
Contig50026:7163-9089 GO:0016020
Contig50026:7163-9089 GO:0098542
...
go.owl, from the gene ontology consortium download page.
output:
a table like this:
contig_name biological_process cellular_component molecular_function
[contig name] [comma separated list of GO terms] [comma separated list of GO terms] [comma separated list of GO terms]
Strategy:
I'm using the rdflib and pandas libraries
The owl file contains the mapping of GO IDs to GO categories, as well as a mapping of GO IDs to alternative GO IDs. The annot file associates contigs with GO IDs, as well as Blast hits, and EC numbers.
We first read the owl file, and use SPARQL queries to make a dict of GO ID to GO category mappings. Then we apply this mapping to the GO ID column of the data table. Then we rearrange the table until the columns are GO categories, and the data are comma separated lists of GO IDs
sample output:
sequence biological_process cellular_component molecular_function
Contig0:11772-14489 GO:0003824
Contig0:15848-17595 GO:0003824
Code:
import pandas as pd import rdflib table_name = "annot_20151208_1355.annot" go_rdf_path = "/media/sean/linux_data/gene_ontology/go.owl" output_name = "blast2GO_go_type_table.tsv" t = pd.read_table(table_name, index_col=False, header=None, names=['sequence','GO','protein']) g = rdflib.Graph() g.parse(go_rdf_path) go_to_category = dict() qres = g.query( """SELECT DISTINCT ?a ?b WHERE { ?a oboInOwl:hasOBONamespace ?b }""") for row in qres: go_to_category[row[0][-10:].replace("_",":")] = row[1].value #GO:0045449 qres = g.query( """SELECT DISTINCT ?a ?b WHERE { ?a oboInOwl:hasAlternativeId ?b }""") for row in qres: go_to_category[row[1].value] = go_to_category[row[0][-10:].replace("_",":")] def map_go_to_category(go): out = "" if go.strip != "" and go[0:2] != "EC": out = go_to_category[go] return out t['go_category'] = t['GO'].map(map_go_to_category) #find the GO category t = t[t['go_category'] != ""] #drop rows without a GO category grouped = t.groupby(['sequence','go_category'])['GO'] #create a seriesgroup where the indexes are sequence name and go_category and the data is GO ID t = grouped.agg(lambda x: ",".join(x)) #create a table where the indexes are sequence and go_category and the data are comma separated lists of GO IDs t = t.unstack() #convert the GO category index entries to column names t.to_csv(output_name, sep="\t")
references:
https://rdflib.readthedocs.org/en/stable/intro_to_sparql.html
No comments:
Post a Comment