I've got a table I exported from Blast2GO, which associates transcript IDs with GO term IDs (an "annot" file in Blast2GO terminology). I want to separate the GO IDs by category: biological_process, cellular_component, or molecular_function. (Note: in this example I use rdflib. A faster way to do the same thing would be to use the more specialized goatools, which is probably the best GO library for Python. I used rdflib because I wanted to practice SPARQL.)
The annot file has the form:
Contig50026:7163-9089 GO:0009595 drl28_arathprobable disease resistance protein at4g27220 os=arabidopsis thaliana gn=at4g27220 pe=2 sv=1
The second input is go.owl, from the Gene Ontology Consortium download page.
The output I want is a table like this:
contig_name biological_process cellular_component molecular_function
[contig name] [comma separated list of GO terms] [comma separated list of GO terms] [comma separated list of GO terms]
I'm using the rdflib and pandas libraries.
The owl file contains the mapping of GO IDs to GO categories, as well as a mapping of GO IDs to alternative GO IDs. The annot file associates contigs with GO IDs, as well as Blast hits, and EC numbers.
We first read the owl file and use SPARQL queries to build a dict mapping each GO ID to its GO category. Then we apply this mapping to the GO ID column of the annot table. Finally, we reshape the table so that the columns are the GO categories and each cell holds a comma-separated list of GO IDs.
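The mapping and reshaping steps can be sketched with pandas roughly as follows. This is a minimal illustration, not the exact code: it assumes the annot file is tab-separated with no header row, and that EC numbers appear as rows with `EC:` in the second column. The column names, the function name `pivot_by_category`, the `ContigX` rows and the hard-coded `go_to_category` dict (a stand-in for the dict built from go.owl) are all made up for the example.

```python
import io

import pandas as pd

CATEGORIES = ["biological_process", "cellular_component", "molecular_function"]

# Stand-in for the annot file. Only the first contig/GO pair comes from the
# real data; the other rows are invented to exercise the code.
SAMPLE_ANNOT = (
    "Contig50026:7163-9089\tGO:0009595\t\n"
    "Contig50026:7163-9089\tGO:0006952\t\n"
    "Contig50026:7163-9089\tGO:0005524\t\n"
    "ContigX\tGO:0005634\t\n"
    "ContigX\tEC:3.6.1.3\t\n"
)

# Stand-in for the GO ID -> category dict built from go.owl.
go_to_category = {
    "GO:0009595": "biological_process",
    "GO:0006952": "biological_process",
    "GO:0005524": "molecular_function",
    "GO:0005634": "cellular_component",
}


def pivot_by_category(annot, go_to_category):
    """One row per sequence, one column per GO category."""
    # Keep only GO rows (drops EC numbers), then look up each ID's category.
    go_rows = annot[annot["go_id"].str.startswith("GO:")].copy()
    go_rows["category"] = go_rows["go_id"].map(go_to_category)
    go_rows = go_rows.dropna(subset=["category"])  # IDs missing from the ontology
    return (
        go_rows.groupby(["sequence", "category"])["go_id"]
        .agg(",".join)                # comma-separated GO IDs per cell
        .unstack()                    # categories become columns
        .reindex(columns=CATEGORIES)  # always emit all three columns
        .fillna("")
    )


annot = pd.read_csv(
    io.StringIO(SAMPLE_ANNOT),
    sep="\t",
    header=None,
    names=["sequence", "go_id", "description"],
)
table = pivot_by_category(annot, go_to_category)
```

With the real data, `io.StringIO(SAMPLE_ANNOT)` would be replaced by the path to the annot file, and `table.to_csv(..., sep="\t")` would write the result out.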
sequence biological_process cellular_component molecular_function