Monday, April 6, 2015

evaluating boolean gene-reaction associations and assigning expression levels to reactions in stoichiometric models

Problem: You've got a stoichometric model, and an RNAseq dataset, and you want to use the transcriptome data to determine which reactions in the model are likely to be active, and at what level.

Solution: Use pandas to read the model and the RNAseq data. Use a pyparsing grammar to parse the boolean gene-reaction association expressions.Write a function calculate a reaction expression level based on the parsed gene-reaction association expression, and the RNAseq data. Write a new model file with a column for gene expression.

Read on for details....

A stochiometric model consists, minimally, of a list of reaction stoichiometries and flux bounds. Large scale stoichiometric models have been generated for many different model organisms, including human, and arabidopsis. I like plants more than I like people, so I'll use an Arabidopsis model as my example. The model by Arnold and Nikoloski (2014) is an excellent model of central metabolism in photosynthetic Arabidopsis cells (even though it's rather small compared to similar models). A particularly important feature of the Arnold model is that it includes boolean gene-reaction associations (GRAs).
For example: the boolean gene-reaction association 4*(AT3G22960) AND 4*(AT5G52920 OR AT1G32440) is associated with the reaction called "Pyr kinase" which has the formula "H[h] + ADP[h] + PEP[h]  -> ATP[h] + Pyr[h]"
A really cool thing about this particular model is that it distinguishes between the genes that encode the cytosolic reaction versus the ones that encode the plastidial version. The model itself can be found in the supplementary information of the paper as an sbml file.

For all its advantages, I still don't think sbml is particularly well suited for stoichiometric modeling, and I much prefer spreadsheets and tab-separated files (although I appear to be in the minority for that preference). Unfortunately there aren't any very good free tools for converting smbl files to tab separated files and back again. The best we can do is use the cobra toolbox to open the sbml file, and then save it as an excel file, then open the excel file and paste it into a text file (or just  save it as a tsv).
Install the COBRA toolbox and open matlab then use the commands:
m = readCbModel()
writeCbModel(m, 'xls', 'arabidopsis_core.xls')

If you don't have a matlab license, I think you can also use cobrapy. If you want to go that route, and you can't figure it out, leave a comment and I'll look into it.
The first few lines of the new file should look like:

Rxn name Rxn description Formula Gene-reaction association Genes Proteins Subsystem Reversible LB UB Objective Confidence Score EC Number Notes References
PSII_h photosystem II 4 hnu[h] + 2 PQ[h] + 2 H2O[h] + 4 H[h]  -> 2 PQH2[h] + O2[h] + 4 H[l]  2*(ATCG00020 AND ATCG00680 AND ATCG00280 AND ATCG00270 AND ATCG00580 AND ATCG00570 AND ATCG00710 AND ATCG00080 AND ATCG00550 AND ATCG00070 AND ATCG00560 AND ATCG00220 AND ATCG00700 AND (AT5G66570 OR AT3G50820) AND AT1G06680 AND (AT4G21280 OR AT4G05180) AND AT1G79040 AND AT1G44575 AND ATCG00690 AND AT3G21055 AND AT2G30570 AND AT2G06520 AND AT1G67740 AND ATCG00300) 2* AT1G06680 AT1G44575 AT1G67740 AT1G79040 AT2G06520 AT2G30570 AT3G21055 AT3G50820 AT4G05180 AT4G21280 AT5G66570 ATCG00020 ATCG00070 ATCG00080 ATCG00220 ATCG00270 ATCG00280 ATCG00300 ATCG00550 ATCG00560 ATCG00570 ATCG00580 ATCG00680 ATCG00690 ATCG00700 ATCG00710  light reactions 0 0 1000 0 2 Reference: 
Cytb6f_h cytochrom b6f complex 2 H[h] + PQH2[h] + 2 PCox[h]  -> PQ[h] + 4 H[l] + 2 PCrd[h]  2*(ATCG00540 AND ATCG00720 AND AT4G03280 AND ATCG00730 AND ATCG00600 AND ATCG00590 AND AT2G26500 AND ATCG00210) 2* AT2G26500 AT4G03280 ATCG00210 ATCG00540 ATCG00590 ATCG00600 ATCG00720 ATCG00730  light reactions 0 0 1000 0 2 Reference: 

Next we need a dataset (or multiple datasets) to map to the model. There are tons and tons of Arabidopsis rna-seq datasets available at the Sequence Read Archive. Which could be converted into expression levels (transcripts per million, TPM values) by RSEM, or Cufflinks, or one of the other program. Our purpose right now isn't to learn how to calculate expression levels, it's to learn how to map expression levels to genome scale models. So we'll use a published dataset. The one from a recent paper by Attaran et al. (2014) in the supplementary information, they have an excel file with TPM values for Arabidopsis genes. Copy that into a text file as tab-separated values. The first few lines will look something like:

gene Ctrl, 0 h Mock, 0.25 h Mock, 0.5 h Mock, 1 h Mock, 1.5 h Mock, 2 h Mock, 2.5 h Mock, 3 h Mock, 4 h Mock, 5 h Mock, 6 ha Mock, 7 h Mock, 8 h Mock, 10 h Mock, 12 h Mock, 14 h Mock, 16 h Mock, 18 h Mock, 20 h Mock, 22 h Mock, 24 h Ctrl, 0 h COR, 0.25 h COR, 0.5 h COR, 1 h COR, 1.5 h COR, 2 h COR, 2.5 h COR, 3 h COR, 4 h COR, 5 h COR, 6 h COR, 7 h COR, 8 h COR, 10 h COR, 12 h COR, 14 h COR, 16 h COR, 18 h COR, 20 h COR, 22 h COR, 24 h
AT1G01010 8.63728173 9.053207743 10.08774829 6.550342584 10.75323386 9.811568477 7.537419004 4.351495797 4.52252331 3.21526652 10.57260204 6.659812746 5.544525626 6.162942118 9.748388586 6.635700199 13.3722347 14.00655824 11.10521312 7.778325422 13.11747752 8.63728173 12.75435123 9.801197785 6.778090097 12.28560124 6.40099162 6.899373934 3.386067446 5.657077904 4.765699405 3.541357727 4.241715881 5.651107608 5.941629547 8.11141089 5.790519473 7.654546712 13.86748802 13.99059111 10.40650469 8.431671093
AT1G01020 15.10339262 15.54764027 15.14399874 13.48005326 19.81186935 18.56371956 14.23806823 16.49350838 10.87040297 14.75793851 14.85751752 13.80838277 15.5354998 13.9001446 12.79017634 11.853186 17.69722726 16.21711202 16.60266843 16.75846021 12.76119239 15.10339262 15.54159685 12.37660128 15.54355707 21.01767813 16.87105124 15.66782328 15.90896922 15.06829 12.36472353 11.14138218 12.11738226 11.69501932 11.31450373 12.61000464 13.81749406 19.47607279 16.24354299 15.1074264 14.03895497 13.58507294
AT1G01030 2.236498377 3.083182063 2.138709346 2.881978583 2.699198427 2.695483213 2.151442663 1.939891162 1.683095087 2.115925242 1.268719686 1.438486987 1.824104671 1.258010328 1.518304771 1.123592657 3.383302012 2.91841416 3.048060754 5.069831473 1.880831445 2.236498377 2.216945007 2.482571095 4.357119396 3.89392288 3.819214468 3.366445483 3.540724539 3.493942373 3.290705762 3.332345686 3.528887136 3.120562877 2.616097916 3.759250943 2.283862863 8.080312074 6.02151064 4.918523228 6.944469488 4.813299132
Now that we've got our two datasets, we need to map the expression levels onto the reactions in the model.
Let's first take a look at the boolean expressions themselves and try to figure out what their components are and how we want to parse them. Here's the one for plastidial pyruvate kinase.

4*(AT3G22960) AND 4*(AT5G52920 OR AT1G32440)

The components of the string are:

gene names
boolean operators
the multiply operator

What this GRA is telling us is that the reaction is catalyzed by a hetero-octamer where four of the individual subunits are the polypeptide encoded by AT3G22960, and the other four subunits are the product of either AT5G52920 or AT1G32440. In other words, according to the AraCyc database:

"The Arabidopsis genome has 14 predicted pyruvate kinase genes. Three were experimentally shown located in the chloroplast and have pyruvate enzyme activities. One encodes the alpha subunit (PKp1, or PKp-alpha), two for the beta subunit (PKp2 and PKp3, or PKp-beta1 and PKp-beta2, respectively). The native enzyme is a hetero-octamer, in a composition of 4alpha4beta. Both alpha and beta subunits are required for enzyme activity."

So expression of AT3G22960 and one of AT5G52920 or AT1G32440 are necessary for the enzyme to have activity (there are other factors such as post-transcriptional and post-translational modifications that might still cause the enzyme to be absent or inactive, but we can't address either of those issues with the current dataset).

So, if we have TPM values for all three of those genes, what is a reasonable way to assign an expression level to the reaction? I think (and I currently have no empirical data to back up this assertion, the best I can say is that it logically makes some sense) a reasonable way is to treat "AND" as a "minimum" function, and "OR" as addition.

Multipliers can be either ignored (which is the approach I favor), or the expression levels can be divided by them. If an enzyme is active as a tetramer, that means that four times as many individual polypeptides must be present for its activity than for an enzyme that acts as a monomer. So it might make sense to divide the total expression by the multiplier to get a better estimation of "mature complex activity". It seems to me, however, that there are so many other factors that will affect the activity that factoring in subunit stochiometry is unnecessary and could even be detrimental (again, I don't have empirical tests to back me up here, it's just my gut feeling). Nevertheless, the code we develop will be able to handle these in a minimal way if necessary. In weird cases like if the factor is written after the gene name, (AT3G22960)*4, our code will produce unexpected results.

I treat the "AND" operator a "minimum" function because if either side is not expressed, then the reaction will not be active. For example, CBP Synthase has the GRA: AT3G27740 AND AT1G29900. If the expression levels are:
AT3G27740: 1
AT1G29900: 100
I think it is reasonable to estimate the expression of the reaction as minimum(AT3G27740, AT1G29900) which is minimum(1, 100), which is 1.

I treat the "OR" operator as addition, because it means that the reaction can be catalyzed by either of the operands independently of each other. For example, Asp Aminotranferase, has the GRA: AT2G22250 OR AT4G31990. If the expression levels are:
AT2G22250: 25
AT4G31990: 75
I think it is reasonable to estimate the expression of the reaction as AT2G22250 + AT4G31990, which is 25 + 75, which is 100.

Another thing we have to be concerned about is order of operations. The GRAs in the Arnold model, use explicit grouping with parentheses whenever there is a chance for ambiguity, so they don't have this problem, but it is possible that we may encounter GRAs of the form:

Gene1 OR Gene2 AND Gene3 OR Gene4

Depending on whether we evaluate AND, or OR first, we will calculate different expression levels for the reaction. Fortunately, there is already a well established order of operations convention for evaluating boolean expressions that tells us to evaluate AND before OR. So this example should be evaluated as:
Gene1 OR (Gene2 AND Gene3) OR Gene4
if the expression levels are:
Gene1: 10
Gene2: 15
Gene3: 30
Gene4: 60
We calculate:
10 OR (15 AND 30) OR 60
10 OR (15) OR 60
25 OR 60

Implementing a parser in Python for these kinds of calculations is made surprisingly easy by the pyparsing library. Specifically the infixNotation function.

Our code for the parser is simple:

def GeneReactionAssociationExpressionSyntax():
    gene = pyparsing.Word(pyparsing.alphas, pyparsing.alphanums)
    mult = pyparsing.Literal( "*" ) 

    or_str = pyparsing.Literal('OR')
    and_str = pyparsing.Literal('AND')

    number = pyparsing.Word(pyparsing.nums) | pyparsing.Combine(pyparsing.Word(pyparsing.nums), pyparsing.Literal("."), pyparsing.Word(pyparsing.nums)) | pyparsing.Combine(pyparsing.Literal("."), pyparsing.Word(pyparsing.nums)) # 123 or 134.451 or .4134 etc.
    atom = gene | number 
    expr = pyparsing.infixNotation(atom, [(mult, 2, pyparsing.opAssoc.RIGHT), (and_str, 2, pyparsing.opAssoc.LEFT), (or_str, 2, pyparsing.opAssoc.LEFT)])
    return expr 
What we're saying here is that a gene name is anything alphanumeric that starts with a letter. We define the operators as literals, "*", "OR", and "AND". A number (ie a multiplier) can be of several forms, an integer with no decimal part, an integer with a decimal part, or a period followed by a decimal part. An "atom" is a number or a gene name (in other words, things that can act as operands). We combine all of these things into an infix notation parser using the pyparsing.infixNotation function, (which implicitly handles parenthesis the way we'd expect, even though we never explicitly tell it how). By specifying the operators in the order "*", "AND", "OR", we are following the conventional order of operations, and hopefully complying with the principle of least astonishment. We also define the associativity for the operators. For "AND" and "OR" the associativity we assign doesn't actually matter because the calculations will be the same either way. For "*" it does matter. If we give it Right associativity, it will perform as we would expect when the multiplier is listed before the gene. We're going to have problems regardless if the multiplier is listed after the gene, so we'll use Right associativity to at least preserve expected behavior as much as possible (this is a kind of complicated issue, I may treat it in a separate blog post if anyone cares).

Now we can parse our GRAs into a nested list of strings:
x = '4*(AT3G22960) AND 4*(AT5G52920 OR AT1G32440)'
p = GeneReactionAssociationExpressionSyntax()
e = p.parseString(x).asList()

output: [[['4', '*', 'AT3G22960'],  'AND',  ['4', '*', ['AT5G52920', 'OR', 'AT1G32440']]]]

The next step is to write a function that will take that structure and a dict of gene expression and calculate the reaction expression. We then use a relatively simple recursive algorithm to evaluate the expression level

def divide(denom, num):
      input: two numbers
      output: the result of dividing the second argument by the first argument
    return float(num)/float(denom)

def expr_val(gra_stack, expression_dict, ignore_multipliers = True):
        gra_stack: a list of lists, such as the kind produced by the parser generated by GeneReactionAssociationExpressionSyntax
          for example: [[['4', '*', 'AT3G22960'],  'AND',  ['4', '*', ['AT5G52920', 'OR', 'AT1G32440']]]]
        expression_dict: keys are gene names, values are numbers representing expression level
          for example: {'AT3G22960': 10, 'AT5G52920': 0.2, 'AT1G32440': 1}
        ignore_multipliers: if True, bare numbers and '*' signs will be ignored and not used in the reaction expression level calculation.
          if False, expression levels will be divided by the multipliers, representing the fact that more units are needed for activity
        NOTE: if multipliers follow the genes they are modifying, and ignore_multipliers is False, expression will be calculated incorrectly
          because the multiplier will be divided by gene expression instead of vice versa.
        a number representing the expression of the reaction
    op = None #operator, a string: 'OR', 'AND', or '*'
    lhs = None #a number, the left-hand-side of the operator
    ops = {'*': divide, 'OR': operator.add, 'AND': min} #dict where keys are the valid operators and values are the functions they represent
    if isinstance(gra_stack, types.StringTypes): #gra_stack is a string and not a list. This is our base case. gra_stack should never be just an operator
        if re.match(r'^[0-9]*(\.[0-9]*)?$', gra_stack): #it's a number string
            lhs = float(gra_stack) #convert the string to a number
        elif gra_stack in expression_dict: #it's a gene name that we have an expression value for
            lhs = expression_dict[gra_stack] #get the expression level from expression_dict
        else: #it's something we don't recognize
            lhs = 0.0  #TODO: it may be a good idea to print a warning or throw an error in this case.
    else: #gra_stack is a list
        for x in xrange(len(gra_stack)): #iterate through the stack
            if isinstance(gra_stack[x], types.StringTypes) and gra_stack[x] in ops: #the next item in the stack is an operator
                if op is None and lhs is not None: #we've seen an expression already for lhs, and we haven't seen an operator yet
                    op = gra_stack[x] #remember the operator
                else: #the input was malformed either two operators in a row, or an operator without anything to the left of it
                    print(op + " " + lhs)
                    print("error parsing boolean expression at line %s" % repr(gra_stack[x]), file=sys.stderr)
            else: #the next item in the stack is a number, a gene, or a list
                if lhs is None:
                    lhs = expr_val(gra_stack[x], expression_dict, ignore_multipliers) #recursively evaluate the expression and remember the result
                elif op is None: #if we've seen an lhs, but have not seen an operator and now we're seeing another expression, it's an error.
                    print("error parsing boolean expression %s" % repr(gra_stack[x]), file=sys.stderr)
                    if op == '*' and ignore_multipliers: #ignore_multipliers is active, and we encounter a multiplication
                        lhs = expr_val(gra_stack[x], expression_dict, ignore_multipliers) # ignore the the left hand side (presumably the bare number) and evaluate the right hand side
                    else: #we've seen a valid operator and have a number in the lhs
                        lhs = ops[op](lhs, expr_val(gra_stack[x], expression_dict, ignore_multipliers)) # apply the operator, using lhs
                    op = None
    return lhs

With those functions and a little bit of wrapper code to read the text files, we're good to go. For the full source code, see in my bitbucket repository.


No comments:

Post a Comment