Monday, July 4, 2016

Making a stoichiometric model of peppermint trichomes

Someday this work will be published with a more coherent and organized version of these methods. But maybe in the meantime, this will be useful to someone. Here's (more or less) my lab notebook for how I generated a stoichiometric model of metabolism in peppermint glandular trichomes.

A separate, complementary but more useful, guide for the same thing, including all of the code, can be found on in a bitbucket repository here

I want a compartmentalized model of metabolism in peppermint glandular trichomes.

I have a Trinity assembly of a Illumina RNAseq reads of RNA isolated from secretory phase peppermint peltate glandular trichomes.

Peptide sequences extracted from the mint transcriptome via Transdecoder.

Arabidopsis core metabolism model created by Arnold and Nikoloski, 2014

AraCyc version 13

MetaCyc version 19.1

PRIAM from March 2015

SUBA database of Arabidopsis protein subcellular localization.

Use the manually curated/compartmentalized model from Arnold 2014 as the basis, and add additional reactions from AraCyc and MetaCyc.

I want to integrate the model with reactions from MetaCyc and PlantCyc, so my first step is to generate a mapping of Arnold-2014 metabolite names to MetaCyc names. The process of name conversion is described in a separate post.

Arnold 2014 includes sub-cellular localizations for all of its reactions. Double check them against SUBAcon (SUBA3)

For AraCyc subcellular localizations use SUBAcon, and SUBA3.
For MetaCyc reactions, use the annotations from UniProt, and TargetP predictions of the UniProt accessions.

collect the uniprot IDs

makeblastdb -in .\uniprot_sequences.fasta -dbtype prot -out uniprot -hash_index

blast against the uniprot IDs

do a global alignment against the top blast hit with align0

calculate the "align identity" Tian and Skolnick (2003)

There are some uniprot IDs from Metacyc and Aracyc that were merged. So, replace those with the ones that were downloaded from UniProt.

Extract TAIR genes for all of the Arnold reactions

import yasmenv.model as model
gra_parse = model.GraParser()
m = model.from_tsv('Arnold_metacyc.tsv')
genes = set()
for r in m:
  genes |= gra_parse.get_genes_from_bool(m[r].annotations['Gene-reaction association'])

upload that list to UniProt batch downloader to get the sequences. Map from TAIR to UniProtKB. For the 623 genes, at least one protein sequence was found for all but 1 (total 800 protein sequences, some genes are mapped to multiple protein sequences because of things like post-translational truncation). The  unmapped gene was "AT2G07727". The reason it was missed was because the associated uniprot gene is also associated with ATMG00220. It only appears once in the gene-reaction associations, and it is in the form:  (AT2G07727 OR ATMG00220)"
So we can safely leave it out of our list.
Download the fasta file and the table for the batch search and the other gene, and make two files:

rename the first column in to "TAIR"

Instantiate MetaCyc and AraCyc using YASMEnv:
Extract UniProt accessions for all of the MetaCyc reactions.

Extract TAIR genes for all of the AraCyc reactions

Use the uniprot Retrieve/ID-mapping  service to get sequences for the MetaCyc Uniprot accessions. (some of the uniprot IDs used in MetaCyc are merged or obsolete, so in the reactions file, replace outdated IDs with new ones where possible)
Use the TAIR bulk sequence downloader to retrieve peptide sequences for AraCyc and Arnold.

run three Blast searches:
blastp of peppermint peptides against peptide sequences from each of the reaction sources.
makeblastdb -in .\TAIR_peptides.fasta -dbtype prot -out arnold_proteins -hash_index -parse_seqids

makeblastdb -in .\aracyc_tair_peptides.fasta -dbtype prot -out aracyc_proteins -hash_index -parse_seqids
makeblastdb -in .\unprot_download.fasta -dbtype prot -out aracyc_proteins -hash_index

blastp -query ../peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep -outfmt 5 -out peppermint_peptides_to_arnold.xml -db arnold_proteins -max_target_seqs 20
blastp -query ../peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep -outfmt 5 -out peppermint_peptides_to_aracyc.xml -db aracyc_proteins -max_target_seqs 20
blastp -query ../peppermint_transcripts/peppermint_contigs.fasta.transdecoder.pep -outfmt 5 -out peppermint_peptides_to_metacyc.xml -db uniprot_subset -max_target_seqs 20

Extract the top hits from the xml files and save them to a tab separated format:
python E:\scripts\srj_chembiolib\ -i .\peppermint_peptides_to_arnold.xml -o peppermint_peptides_to_arnold.outfmt6 --num_hits 1 --num_hsps 1 --write_header
python E:\scripts\srj_chembiolib\ -i .\peppermint_peptides_to_aracyc.xml -o peppermint_peptides_to_aracyc.outfmt6 --num_hits 1 --num_hsps 1 --write_header
python E:\scripts\srj_chembiolib\ -i .\peppermint_peptides_to_metacyc.xml -o peppermint_peptides_to_metacyc.outfmt6 --num_hits 1 --num_hsps 1 --write_header

use from srj_chembiolib to convert the output to a fasta.
use from srj_chembiolib to calculate the alignment identity

Because some of the peppermint sequences are incomplete transcripts, and also because some proteins in the database are the mature peptide sequences (without signal peptides) calculate identities from the alignment length, (not including gaps).

Tian and Skolnick (2003) suggest that a 75% align identity should be sufficient for 90% accurate functional annotation.
[TODO: generate percent_identity graphs from transcrips_with_protein_annotations.csv. metacyc_global_ident, aracyc_global_ident, max_global_ident]
Graph of max global identity:

Test whether the localizations given by SUBAcon agree with the localizations in Arnold 2014:

use the following compartment definitions:
subacon_compartments_dict = {'cytosol':"cytosol", 'endoplasmic reticulum':"cytosol", 'extracellular':'extracellular', 'golgi':"cytosol", 'mitochondrion':"mitochondrion", 'nucleus': 'nucleus', 'peroxisome':'peroxisome', 'plasma membrane':'cytosol',  'plastid':'plastid', 'vacuole':'vacuole'}
arnold_compartments_dict = {"c": "cytosol", 'h': 'plastid', 'i': 'mitochondrion', 'l': 'plastid', 'm': 'mitochondrion', 'p': 'peroxisome'}

For each gene in Arnold 2014 find the compartments of all of the metabolites in each of the reactions associated with that protein.

Compare the compartments assigned by Arnold to those assigned by SUBAcon. Both Arnold and SUBAcon assign at least one compartment to each of 622 genes. Of those 622, they agree on all compartments 487 times, they agree on no compartments 75 times, and they agree on at least one but not all 60 times. So they are in complete agreement for 88% of proteins, complete disagreement for 12% of proteins, and partial agreement for 10% of proteins.

Annotate the transcriptome with PRIAM:
java -jar PRIAM_search.jar -cc T -cg T -i peppermint_contigs.fasta.transdecoder.pep -p PRIAM_MAR15 -od mint_priam_9_17_2015
cc: checks for known catalytic residues
cg: tells PRIAM that we're inputing a complete proteome

For many proteins, uniprot provides a subcellular compartment annotation. To convert the text annotation into an explicit compartment, I run text searches of the annotation string:
{"Peroxisome":"peroxisome", "Mitochondrion":"mitochondrion", "Plastid":"plastid", "Vacuole": "vacuole", "Cytoplasm": "cytosol"}
where the key is the substring searched for, and, the value is the compartment. Where no annotation is available, I use the targetp prediction.

Generate model by assigning peppermint transcripts to reactions from Arnold, AraCyc, or MetaCyc.
Frustratingly, no matter which dataset we use, we still can't account for most of the gene expression.

To assign transcripts to reactions:
take the subset of reactions where max_global_ident > 70 (70 is an arbitrary choice here, but it should give us about 80% accuracy in activity calling, Tian and Skolnick 2003). If the highest identity annotation comes from Arnold, assign the transcript to those reactions. If the highest annotation comes from AraCyc, but the Arnold identity is within 5%, use the Arnold reactions, otherwise use the AraCyc reactions. If the highest identity comes from MetaCyc, then but the Arnold identity is within 5%, use the Arnold reactions, otherwise if the AraCyc identity is within 5%, use the AraCyc reactions, otherwise use the MetaCyc reactions.

There are 318 transcripts where the PRIAM positive hit probability was > 80 but the best sequence identity to any of the models was < 70
Many of these (231) are enzymes with protein substrates or DNA/RNA substrates, such as protein kinases and phosphateses, helicases, polymerases, etc. One is a transporter, 47 are reactions involving only non-generic small molecule substrates, and 31 are reactions involving only small molecules at least one of which is

For the mint model I take all of the reactions from Arnold 2014 (whether or not they have any associated peppermint transcripts) except the biomass reactions, as well as any reaction from AraCyc and/or MetaCyc.
I add a biomass reaction (not worrying about stoichiometry for the moment) of
Limonene Menthone Menthol Isomenthone Menthofuran Pulegone Germacrene D B-Caryophyllene  a-Humulene hesperedin Gardenin

With that collection of reactions, I test for the ability to make biomass, and add the necessary reactions

[TODO: generate graphs of metabolite connectivity to demonstrate why removing orphan reactions is necessary]

To generate the "second model":
set light import to zero, allow free import and export of water, protons
add raffinose import
make FNR go in the direction of reduced ferredoxins: merge|0_h and Fd-NADPR_h, and make reversible
add transport of limonene to cytosol
add transport of trans-isopiperitenol and trans-isopiperitenone into and out of the mitochondrion
add the reaction ISOPIPERITENOL-DEHYDROGENASE-RXN|0_m (TODO: add transcript annotation)
add the reaction RXN-5162|0_c (TODO: add transcript annotation)
finish off the pathway to gardenin B: add RXN-1466,
add IPP transport from plastid to cytosol
add RXN-15528|0_c, RXN-15530|0_c, RXN-14657|0_c, RXN-14639|0_c
add export for menthofuran, menthol, isomenthol, neomenthol, neoisomenthol, ethanol
add import for O2
add RXN0-882 to manual reactions (HMPP synthase), so that it doesn't get eliminated when the non-balanced are eliminated
add RXN-8415|0_c (humulene synthase), RXN-8414|0_c ((-)-b_caryophyllene synthase), RXN-8562|0_c ((-)-germacrene-D synthase)
add:UDP-GLUCOSE-46-DEHYDRATASE-RXN|0,  RXN-5481|0, RXN-7754|0, RXN-7755|0,RXN-7759|0, Hesperedin biosynthesis

at this point the model can produce all of the biomass compounds. However, it seems to not be energy balanced or mass balanced.

run the clean method to merge duplicate reactions (ones with the same reactants and products)

if a reaction is only from "bad" metacyc or aracyc, then get rid of it. This will eliminate a lot of spurious isomerizations.

then remove all reactions with orphan metabolites (metabolites that only occur in one reaction).

Now the model is both mass and energy balanced.

There are still some thermodynamically unrealistic loops. The only way I know of to get rid of these is to run repeated FBA minimization simulations and examine the results for things that look wrong.

A lot of problems arise from instantiation of biocyc reactions, or unrealistic reversibility of biocyc reactions.

One problem with the Arnold model is that it includes a possible shuttle of NADH from cytosol to mitochondrion, based on interconversion of Proline and  L-DELTA1-PYRROLINE_5-CARBOXYLATE

Another problem is that cytosolic NADPH can be regenerated by a cycle involving the reactions:

The next step is to add transcript annotations to as many of the manually added reactions as possible, and then assign expression levels to each of the reactions. Use MetaCyc, ExPasy Enzyme, and UniProt to find sequences for the added reactions and BLAST search against the peppermint transcriptome.

Since I don't have boolean gene-reaction statements for all of the reactions, I'm going to calculate expression simply by summing the expression of the genes associated with each reaction.

Combining the manual reactions with the second model and adding reaction expression gives the third model.

Another round of manual curation gives the fourth model.

The fifth model is generated by removing all reactions originating from the Arnold model that have arabidopsis genes associated with them, but no hits to the peppermint trichome transcriptome.

The fifth model had some big problems with generating cytosolic NADPH, the most efficient way was to use NADP dependent isocitrate dehydrogenase, the product of which could only be eliminated by wasteful looping through oxidative phosphorylation, leading to huge fluxes in the ATP-burn reaction.
adding the complete cytosolic OPPP solved this problem. Nevertheless, the fact that there the OPPP is incomplete in the cytosol could be important, and should not be ignored.

According to Kruger and von Schaewen (2003), all of the enzymes (oxidative and non-oxidative) of the PPP occur in the plastid, and only the oxidative reactions (G6PDH, 6PGL (gluconolactonase), and 6PGD) are duplicated in the cytosol, with transporters Glc-6-P/phosphate translocator (GPT) and XPT (xylulose 5-phosphate (Xlu-5-P), triose phosphate (Triose-P) and Pi) exchanging the intermediates.

The original Arnold model does not have cytosolic G6PDH or 6PGL or 6PGD, nor does it have the transporters for the appropriate intermediates to split the OPPP between cytosol and plastid.


Arnold 2014 compartment abbreviations:
c: Cytosol
h: Chloroplast
i: IntermembraneSpace
l: Lumen
m: Mitochondrion
p: Peroxisome

No comments:

Post a Comment