Tuesday, October 28, 2014

International Chemical Identifiers (InChIs): You should use them!

One easy and increasingly common way to make metabolite data more useful is to associate compounds with their corresponding International Chemical Identifier (InChI) (Heller et al., 2013). An InChI is a unique, standardized text representation of the structure of an organic molecule. Including InChIs in database records facilitates cross-referencing among databases.

The InChI system has a number of advantages over other kinds of identifiers. Some chemical identifiers, such as PubChem IDs, Chemical Abstracts Service (CAS) numbers, and ChemSpider IDs, are database-specific accession numbers with no direct relation to the structure of the molecule they describe. This means that a molecule must have been indexed by one of these services to have an identifier. An InChI, by contrast, is a database-independent structure description, so it can be generated for a molecule regardless of whether the molecule has been indexed by a major database. An InChI can be generated for a novel natural product structure, whereas the other IDs cannot. InChI also has advantages over other linear text representations of molecular structure, such as SMILES. Unlike SMILES, InChI has a single open source implementation of its generation algorithm, so while a single structure may have multiple valid SMILES representations, it will have only one Standard InChI representation (Heller et al., 2013).

A fixed-length compressed version of an InChI, an InChIKey, can be generated from any InChI. InChIKeys are more compatible than InChIs with web search engines such as Google (google.com) (Southan, 2013). However, multiple distinct structures may have, and have been observed to have, the same InChIKey (http://www.chemconnector.com/2011/09/01/an-inchikey-collision-is-discovered-and-not-based-on-stereochemistry/), so InChIKeys should not be used as a basis for cross-referencing. When unambiguous identification of a molecule is the priority, InChI should be preferred. When ease of indexing and searchability is the priority, InChIKey should be preferred. When possible, both identifiers should be listed. By listing InChIs and InChIKeys in websites, databases, and publications (Coles et al., 2005), chemists can enhance the ability of their data to be indexed, searched, and cross-referenced. Free, easy-to-use software for generating InChIs and InChIKeys includes the InChI software available from the InChI Trust (http://www.inchi-trust.org) and MolConverter, available from ChemAxon (http://www.chemaxon.com).
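To make the cross-referencing point concrete, here is a minimal Python sketch of how records from two databases might be linked by InChI instead of by name or accession number. The two record lists, their accession IDs ("A-001", "B-17", and so on), and the function name cross_reference are all made up for illustration. The InChI strings are, to the best of my knowledge, the Standard InChIs for ethanol, water, and methane.

```python
# Hypothetical records from two databases. The accession IDs are invented
# for this example; only the InChI strings describe real structures
# (ethanol, water, methane).
db_a = [
    {"id": "A-001", "name": "ethanol", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"id": "A-002", "name": "water", "inchi": "InChI=1S/H2O/h1H2"},
]
db_b = [
    {"id": "B-17", "name": "ethyl alcohol", "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    {"id": "B-42", "name": "methane", "inchi": "InChI=1S/CH4/h1H4"},
]


def cross_reference(records_a, records_b):
    """Map each ID in records_a to the IDs in records_b with the same InChI."""
    # Index the second database by InChI so each lookup is a dictionary hit.
    index_b = {}
    for rec in records_b:
        index_b.setdefault(rec["inchi"], []).append(rec["id"])
    # A record with no structural match gets an empty list.
    return {rec["id"]: index_b.get(rec["inchi"], []) for rec in records_a}


links = cross_reference(db_a, db_b)
print(links)  # {'A-001': ['B-17'], 'A-002': []}
```

Note that the two ethanol records are linked even though the databases use different names and unrelated accession numbers, because the match is made on the structure description itself.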

Of course, even with the use of InChIs, inconsistencies can still arise in cross-referencing. Galgonek and Vondrášek provide an excellent (and Open Access) analysis of the kinds of inconsistencies that can arise, and their sources.

I originally wrote this as part of a draft of the manuscript that eventually became this review article. It's a bit out of the scope of that article, so we dropped it. But I posted it here because I still think it's a good analysis.

References:
Heller et al. 2013 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3599061/
Coles et al. 2005 http://www.ncbi.nlm.nih.gov/pubmed/15889163
Southan 2013 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3598674/ 

Monday, October 20, 2014

What kinds of problems can (mathematical) models (of biochemical systems) solve?

I think the useful applications of computer modeling to biochemistry are in the following kind of situation: You have identified a phenotype of interest, you want to know what the molecular basis is for that phenotype, you know the form of the hypothesis (“a change in the concentration of substance X causes decreased activity of enzyme Y”), but there are too many possible instantiations of the hypothesis to test all of them experimentally (there may be 100 possibilities for “substance X” and 1000 possibilities for “enzyme Y”, giving 100,000 possible hypotheses). If you have some idea of how the system works, enough to make some kind of mathematical model of the system, and some experimental data (for example omics data comparing the condition displaying the phenotype to the condition not displaying the phenotype), you can use the computer to answer questions like: What instantiations of my hypothesis are most likely to be true, based on this dataset (or based on all available datasets)?
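As a rough illustration of what "use the computer" could mean here, the following Python sketch enumerates every (substance, enzyme) pair and ranks the pairs by a score. Everything in it is a placeholder: the names X0..X99 and Y0..Y999, the two small dictionaries of measured changes, and the toy_score function are invented for the example and stand in for a real model and a real omics dataset.

```python
from itertools import product


def enumerate_hypotheses(substances, enzymes):
    """Return every (substance, enzyme) instantiation of the hypothesis."""
    return list(product(substances, enzymes))


def rank_hypotheses(hypotheses, score, top=3):
    """Return the `top` hypotheses with the highest score."""
    return sorted(hypotheses, key=score, reverse=True)[:top]


# Placeholder names: 100 candidate substances and 1000 candidate enzymes.
substances = [f"X{i}" for i in range(100)]
enzymes = [f"Y{j}" for j in range(1000)]
hypotheses = enumerate_hypotheses(substances, enzymes)
print(len(hypotheses))  # 100000

# Made-up "data": observed changes between the two conditions.
substance_change = {"X1": 2.0, "X2": 0.1}
enzyme_change = {"Y1": -1.5, "Y2": -0.2}


def toy_score(hypothesis):
    """Toy stand-in for a model: large changes in both members score high."""
    substance, enzyme = hypothesis
    return abs(substance_change.get(substance, 0.0)) * abs(enzyme_change.get(enzyme, 0.0))


best = rank_hypotheses(hypotheses, toy_score, top=1)
print(best)  # [('X1', 'Y1')]
```

In a real analysis the scoring function is where the mathematical model of the system would go. The point of the sketch is only that scoring and sorting 100,000 candidates is cheap on a computer, while testing them all at the bench is not.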

What is exciting?

I apologize in advance if this reads like an exercise in self-indulgence. I wrote this after a conversation with a friend which basically went like this:

Her: "I'm going to go whitewater rafting next month, doesn't that sound exciting!"
Me: "Kind of. I think it would be more exciting to sit under a tree and read a good book, though."
Her: "What!? That doesn't make any sense."

It made perfect sense to me, so I wrote this essay and sent it to her. And now I'm posting it here for my throng of fans to read:

What is exciting? Whitewater rafting? Paragliding? Skydiving? Bungee jumping? Roller coasters? Motorcycles? These things are exciting. They are fun (at least the ones I have experienced). But it is a short-lived excitement. It’s there, and then it’s gone.

To me, a deeper kind of excitement, one that lasts, one that I can return to again and again, comes from learning.  Any idea that changes my perspective on the world or deepens my understanding is an exciting idea.  It’s an idea that makes my heart beat faster.