Monday, February 16, 2015

Comparing data from open natural products chemical spectral databases

I recently wrote a review article called "Open-access metabolomics databases for natural product research: present capabilities and future potential", for the journal Frontiers in Bioengineering and Biotechnology, section Bioinformatics and Computational Biology (link).

Part of the review involved evaluating the contents of In this post, this post consists of the notes I took when generating those comparisons. The code I used is available from my Bitbucket account (I apologize for the poor organization and near-lack of documentation), some of the scripts depend on srj_chembiolib, and/or Pandas. InChI normalization relies on ChemAxon molconvert.


Factors to compare: how many unique compounds are present in the database, how many structures have the various kinds of associated analytical spectra, what compound classes are represented.





Approach:
I compared the contents of databases that allow bulk downloads. Those databases are: HMDB, GMD, BMRB, MassBank, ReSpect, Spektraris, NMRShiftDB, and GNPS. Because some of these databases are not limited to natural products, we also included UNPD, an extensive natural products structure database which does not, however, contain any spectra. Other important publicly searchable databases, such as METLIN, SDBS, and NAPROC-13 were left out of the comparison because they do not allow bulk downloads, so it is difficult or impossible to know precisely which compounds they contain.
To compare the databases, I first downloaded the data from the project websites. Data formats are not standardized, so I used Python scripts custom for each database to extract structure information in the form of SMILES or InChI strings, and spectrum type as one of: MS1, MSn, 1H NMR, 13C NMR, 1D NMR other, and 2D NMR. Structure information was normalized by converting SMILES and InChI strings into Standard InChI strings using ChemAxon molconvert. In the course of extracting and converting the data, a number of errors were noticed in some of the records. I fixed the most obvious and easily correctible errors, but made no attempt at systematic validation or curation.

In some databases, not all of the records are annotated with structural information that could be converted into a Standard InChI. For example, some compounds were identified only by name or CAS number. It also occurred in some cases that the records available for bulk download were out of date compared to those searchable through the web interface, or that the web interface allows searches against proprietary data in addition to the data made available for bulk download. Also, because stereochemical information was not always available, I based my definition of a unique structure only on the "Main" layer of the InChI, which includes formula and atom connectivity (before comparing the Main layer, I removed "?" characters, which are present in the Main layer, and are used to specifically identify unknown stereochemistry).


HMDB:
The metabolites xml file has enough data to know what the inchi is, and whether a spectrum is available.
Sadly, it does not indicate what kind of an NMR spectrum is available, so that has to be gotten by other means. (or we can appeal to the least common denominator and just distinguish between 1D and 2D nmr for all datasets).
To get spectra type, we have to either download the spectra (the download links are broken, so that won't work), or get them from the individual spectra pages, which is possible:
the url to the spectra pages is nice and formulaic:
"http://www.hmdb.ca/spectra/spectra/nmr_one_d/X"
where x is the name of the accession
just change "one" to "two" for 2d spectra.
I use urllib from python to get all the spectra pages.

From the downloaded spectra, the spectra type can be extracted from the page title. The extracted types can then be used to generate a lookup table correlating spectrum number to spectrum type.

NMRShiftDB:
reconstituting the whole database is a total pain. Much easier just to pull out what we want.
grep 'INSERT INTO `MOLECULE` VALUES'  nmr_shift_db_dump.sql  > molecules.txt
grep 'INSERT INTO `SPECTRUM` VALUES'  nmr_shift_db_dump.sql  > spectra.txt

grep 'INSERT INTO `SPECTRUM_TYPE` VALUES'  nmr_shift_db_dump.sql  > spectra_types.txt

First extract the SMILES and convert them to InChI and save in a text file.


GNPS:
download the mgf files
split on "END IONS"
for each record grap the INCHI, SMILES, SPECTRUMID, MSLEVEL
if theres a SMILES but no InChI, use molconvert to turn it into a Standard InChI


In general:
"InChI=" sometimes used after "INCHI=", sometimes not. It should always be there I think.

GNPS-LIBRARY.mgf:
CCMSLIB00000075084 -> SMILES format written in InChI field. Swap
INCHI sometimes written with quotes, sometimes without.

GNPS-NIH-CLINICALCOLLECTION2.mgf:
many (all?) records have SMILES in the InChI field
In that file, I did a global replace of INCHI= to SMILES=, and then of SMILES=N/A to INCHI=N/A

RESPECT.mgf:
lots (84) of records where the SMILES is just "c" and that's all.
CCMSLIB00000213248 SMILES has a space in it (also present in the ReSpect record, so not introduced during GNPS import).

Spektraris NMR:
download "flatfiles"
extract "record.tsv"
used LibreOffice Calc to move columns around so it's in the proper format.
Pasted into notepadd++ and saved

Spektraris AMT:
Download data in MassBank format
use srj_chembiolib to extract relevant data

ReSpect:
Download MassBank format files. They are apparently an older (or nonstandard) style, because they use the "INCHI" tag for InChIs, rather than the "IUPAC" tag, and there are some other differences as well.

PS014501 PS014502 PS014503 PS014504 PS014505 PS014506 PS014507 PS014508 PS014511
truncated InChI

1S/C9H15N3O10P2/c10-7-1-2-12(9(14)11-7)8-3-5(13)6(21-8)4-20-24(18,

GMD:
Extract from *MSP.txt

BMRB:
Dowload from CSV archives:

chem_comp.csv
Experiment.csv
fix the InChI code for bmse010516 and bmse010122 in chem_comp.csv
Luckily they make the mol files available, so it's easy to generate inchis.

some InChI codes are non-standard, so we need to remember to deal with that also.
Normalize some of the spectrum types, I've (arbitrarily) chosen to group spectra as: 1H NMR, 13C NMR, 2D NMR, 1D NMR other

spectrum_map = {'1D 1H':'1H NMR', '1D 1H, 0.5 mM, pH 8.202':'1H NMR', '1D 1H, 100 mM, pH 8.202':'1H NMR', '1D 1H, 100 mM, pH 7.074':'1H NMR', '1D 1H, 100 mM, pH 6.836':'1H NMR', '1D 1H, 0.5 mM, pH 2':'1H NMR', '1D 1H, 0.5 mM, pH 7':'1H NMR', '1D 1H, 0.5 mM, pH 5':'1H NMR', '1D 1H, 100 mM, pH 2':'1H NMR', '1D 1H, 100 mM, pH 9':'1H NMR', '1D 1H, 100 mM, pH 5':'1H NMR', '1D 1H, 100 mM, pH 7':'1H NMR', '1D 1H, 0.5 mM, pH 9':'1H NMR', '1D 1H, 0.5 mM, pH 7.074':'1H NMR', '1D 1H, 2.0 mM':'1H NMR', '1D 1H, 0.5 mM':'1H NMR', '1D 1H, 0.5 mM, pH 6.836':'1H NMR', '1D DEPT135':'13C NMR', '1D 13C':'13C NMR', '1D DEPT90':'13C NMR', '2D J-resolved 1H':'2D NMR', '2D [1H,1H]-TOCSY':'2D NMR', '2D [1H,13C]-HSQC SW small':'2D NMR', '2D [1H,1H]-COSY':'2D NMR', '2D [1H,13C]-HMQC':'2D NMR', '2D [1H,13C]-HSQC':'2D NMR', '2D [1H,13C]-HMBC':'2D NMR', '2D [1H,13C]-HSQC SW expanded':'2D NMR', '1D Boric acid': '1D NMR other', '1D 1H, NOESY':'1H NMR'}

pretty easy to use Pandas to merge the tables


MassBank:
Download using wget
consolidate with
cp */* ../otherdir
This would be much faster if they'd just put it all in a single zip file....
Extract data from MassBank files using srj_chembiolib

UNPD:
get the CSV file "UNPD_unique_information_229358.csv"
I suspect that some of the structures in the database are not natural products.
Beware of multiline entries in the csv, such as "UNPD70379" (which I corrected to one line)

BML-NMR:
A list of compound names and InChI codes is found in the supplemental material of the paper (doi: 10.1007/s11306-011-0347-7). I copied the table into a text editor and then formatted it like the other ones. There are a lot of different kinds of spectra, but for the purposes of the generalizations I'm making with this data analysis, I consider them all to be either 1H NMR, or 2D NMR. So I copy all the compound names/Inchis, and after one set I put 1H NMR, and after the other set, 2D NMR.
Need to add InChI= to the beginning of all of the InChI strings. One of them (N-acetyllysine) is not a standard InChI, so convert it into a standard inchi with: molconvert inchi:"AuxNone SAbs" .\N-Acetyllysine.txt
Remove extra spaces in a few of the inchi codes.
Phosphocholine inchi:
InChI=1S/C5H14NO4P/c1-6(2,3)4-5(7)11(8,9)10/h5,7H,4H2,1-3H3,(H-,8,9,10)/p+1
appears to be incorrect, but I did not correct it.


Comparing compound classes

Assessing what kinds of compounds are in each database is not an easy task. Only a few databases even have compound class annotations. At first I thought the best way to assign classifications would be compare databases to ChEBI, and then transfer the ChEBI ontology annotations to the other databases. What I found was that the ChEBI ontology annotations were not really useful for knowing the biosynthetic origins of a compound. So for the paper, I did not use ChEBI-based annotations, but relied on the annotations provided by the individual databases where available, and giving up on classifying compounds from databases that don't provide annotations.

NAPROC-13:
To get these, I just used the search function with only the compound type field filled in.
Aliphatics (acids, alcohols, cyclohexanes, cyclohexanones, diynes, lipids, thiosulfinates): 29
Alkaloids (many kinds): 156
Aminoacids-Peptides: 5
Aromatics (benzenes, naphthalenes, phenols, etc): 1236
Benzofuranoids: 5
Butanolides: 6
Carbohydrates: 87
Chromans: 304
Flavonoids: 1769
Isochromans: 8
Lignans: 294
Lipids: 3
Miscellanea: 35
Nitrogenous-bases: 7
Nucleosides: 7
Peptides: 4
Polyketides: 58
Pyranes: 23
Steroids: 729
Tannins: 4
Terpenoids: 15527
Diterpenoids (terpenoid subset): 6630
Sesquiterpenoids (terpenoid subset): 3735
Triterpenoids (terpenoid subset): 4239

Terpenoids are, the best represented compound class, with an emphasis on sesqui-, di-, and tri-terpenoids. Other well represented classes are flavonoids, "aromatics", chromans, and lignans.

BML-NMR:
Maps well to ChEBI. Maps perfectly to HMDB, can use name to get cross-reference to compound classes.

BMRB:
maps pretty well to ChEBI (731/1167), check out what the remaining ones are though (probably mostly lignin precursors).

GMD:
maps well to ChEBI (702/914)

GNPS:
ChEBI: 2463/5581 (of compounds that can be given an InChI code, which is not that many). So, no estimation can be made.

HMDB:
Excellent overlap with ChEBI 930/1063. Also has its own internal chemical classification annotations.

MassBank:
Not good overlap with ChEBI (3135/11326), however, some records are annotated with compound classes. The vast majority of annotations are ambiguous, such as "N/A" (20474), or "N/A; Environmental Standard" (8646), or "Natural Product" (2397). Including annotations that are further specified, the total number annotated as "Natural Product" is 7995.

ReSpect:
decent overlap with ChEBI (645/722), however, records are heavily annotated with compound class.
According to internal annoations: Flavonoids are the best represented class (2,999 spectra). Other well represented classes are amino acids (888 spectra), terpenoids (743 spectra), phenylpropanoids (727 spectra), nucleotides (475 spectra), and alkaloids (420 spectra). The only way to convert these "spectra" numbers to "compound" numbers is to use the compound name as a unique identifier. Doing that, the numbers become: Flavonoids are the best represented class (1,360). Other well represented classes are terpenoids (519), phenylpropanoids (341), alkaloids (256), amino acids (236), and glucosinolates (93).

METLIN:
closed, and impossible to estimate.

MMCD:
Experimental NMR spectra shared with BMRB. I don't know how to deal with the rest of their data

NMRShiftDB:
Structures available. Low overlap with ChEBI. Classification difficult.

SDBS:
closed, and impossible to estimate.

Spektraris:
about 242 taxane diterpenes, the rest are hard to classify, poor overlap with ChEBI.

SpinAssign:
records not downloadable.



No comments:

Post a Comment