The problem: I want to take molfiles from multiple sources and normalize them so that they can be rendered with the same style settings by ChemDoodle Web Components. While I'm at it, I'd like to generate SMILES strings, InChI strings, and InChI key strings for the molecules in these molfiles (I'd also like to generate IUPAC names, but I gave up on that part).
The solution: I found that converting a molfile into an InChI string and then back into a molfile seems to retain the important structural information while normalizing the coordinates and dropping extraneous information, so this seems like a reasonable way to go. I first tried this manually in ChemDraw (which fortunately my university has a site license for, because I surely can't afford it on my own) and it worked well.
I then looked for ways to automate it and came upon OpenBabel. OpenBabel looks like an extremely useful (open source!) tool for interconverting chemical formats, but it struggled with this task. It generates InChI strings that are identical to those generated by ChemDraw, but when I convert them back into molfiles, the output molecule does not look the same as the one originally input (I was getting carbons with 5 bonds because it was protonating quaternary carbons for some reason). It also crashes when I ask it to generate 2D coordinates (the --gen2d option). I didn't test it extensively enough to see whether this was because I'm using particularly complicated molecules (or maybe strange molfiles?) or whether it is a general problem with InChI conversion. ChemSpider also has a nice API that might help with this sort of thing, but it apparently uses OpenBabel as the backend, and when I put in my InChI string, the resulting molfile is unparsable by both ChemDraw and ChemDoodle. In any case, OpenBabel didn't seem like it was going to work for me. There may be other open source software that would, but I've got a deadline, so I've got to use what I know will work, and right now that's ChemDraw.
So how can I script ChemDraw to convert several hundred molfiles to InChI strings, InChI keys, SMILES strings, and "normalized" molfiles? The only documentation I can find for the ChemDraw Component Object Model is about 9 years old and claims to be for ChemDraw version "5.0 or later". There is also some more detailed documentation of the API, but it doesn't seem to correspond 1 to 1 with the interface I get when connecting with pywin32. (PerkinElmer markets a separate product, ChemScript, which I'm sure makes all of this quite easy and probably also has up-to-date documentation, but I haven't got a license for it, so I'll do what I can with ChemDraw.) I've got ChemDraw Pro version 13.0.2, so I'll be using an interactive IPython shell extensively to explore the API. In addition to looking at the online documentation, I also used COM makepy (through PythonWin), as suggested by the pywin32 documentation, to generate a .py file describing the interface that I can look through to hopefully find out how to do what I want.
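One way to look through the makepy output without scrolling is to list the classes and methods it defines and filter them by keyword. This is a minimal sketch using only the standard library. It assumes the generated module is ordinary Python source with `class` and `def` lines; the name `find_methods` is hypothetical, and properties that makepy stores in dictionaries rather than as `def`s will not show up in the results.

```python
import re

CLASS_RE = re.compile(r'^class\s+(\w+)')
DEF_RE = re.compile(r'^\s+def\s+(\w+)\s*\(')

def find_methods(source_text, keyword):
    """Return (class_name, method_name) pairs whose method name contains keyword."""
    keyword = keyword.lower()
    current_class = None
    hits = []
    for line in source_text.splitlines():
        class_match = CLASS_RE.match(line)
        if class_match:
            # remember which class the following indented defs belong to
            current_class = class_match.group(1)
            continue
        def_match = DEF_RE.match(line)
        if def_match and current_class and keyword in def_match.group(1).lower():
            hits.append((current_class, def_match.group(1)))
    return hits

# Usage (the path is whatever file makepy generated on your machine):
#     with open(generated_py_path) as handle:
#         print(find_methods(handle.read(), 'save'))
```

The search is case-insensitive and purely textual, so it should work regardless of which Python version generated the file.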
This all looks very complicated, but I'll try to stay focused and figure out how to do the simple, specific task I'm after. First I'll use the ChemDraw GUI and note down the specific steps I want to do for each molfile, then I'll try to figure out how to do all of those steps with the COM interface:
The steps:
Start from a ChemDraw 13 Pro window with no structures open, then:
1. File -> Open -> select a molfile
2. In the drawing window that opens with the structure displayed: select the molecule (there should be only one molecule there, so Select All should work)
3. Edit -> Copy As -> InChI
4. Paste the InChI into a text file
5. Delete the structure in the ChemDraw window
6. Edit -> Paste Special -> InChI
7. Select the molecule
8. Edit -> Copy As -> SMILES
9. Paste the SMILES into a text file
10. Edit -> Copy As -> InChI key
11. Paste the InChI key into a text file
12. File -> Save As -> save in a different directory
The ChemDraw COM API seems to lack access to certain functions that can be reached through the GUI (or at least I wasn't able to figure out how to get to them). It also seems to lack a way to send keyboard commands directly to the program, so I ended up using a shell to send generic keyboard commands to Windows, which are then captured by ChemDraw because it is the active application while the script is running.
This actually wound up working pretty well and did exactly what I was hoping for. Unfortunately, because of a technical issue with the way ChemDraw generates InChI strings, I eventually ended up taking a different approach (which I will detail in the next post).
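The key strings in the script below use Windows Script Host SendKeys notation: `^` stands for Ctrl, `%` for Alt, and special keys go in braces, such as `{Del}`. Here is a small sketch of helpers that build such strings and replay a menu path through an injectable `send` callable, so the sequence can be inspected without Windows or ChemDraw. The helper names are hypothetical, and the menu letters (`o`, `n`) are simply the ones the script uses for ChemDraw 13; they may differ in other versions.

```python
def ctrl(key):
    """SendKeys string for Ctrl+key."""
    return '^' + key

def alt(key):
    """SendKeys string for Alt+key."""
    return '%' + key

def menu_path(menu_key, *item_keys):
    """Key strings that open a menu with Alt+menu_key, then pick items by accelerator letter."""
    return [alt(menu_key)] + list(item_keys)

def replay(keys, send):
    """Send each key string, in order, through the given callable."""
    for key in keys:
        send(key)

# Select all, then Edit -> Copy As -> InChI, using the letters from the script in this post.
copy_as_inchi = [ctrl('a')] + menu_path('e', 'o', 'n')

sent = []
replay(copy_as_inchi, sent.append)  # on Windows, pass a function that wraps shell.SendKeys instead
print(sent)
```

Keeping the key sequences as data makes it easier to adjust a single menu letter if the GUI layout changes.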
Here's the code:
# Note: for this program to work, you must be on Windows and have ChemDraw installed.
# Also, this program can't run in the background. While it runs, you shouldn't be doing other things
# that will shift the window focus from ChemDraw, or use the clipboard.
import os
import re
import time
import win32com.client as win32
import win32clipboard as clipboard
import win32con

### configuration ###
root_path = 'E:\\databases\\taxane_nmr'
mol_dir = 'mols'
new_mol_dir = 'new_mols'
inchi_dir = 'inchi'
inchi_key_dir = 'inchi_key'
smiles_dir = 'smiles'
sleep_time = 0.5  # how long to wait when sleep() is called; we wait periodically so the script doesn't get ahead of program loading
short_sleep_time = 0.25  # how long to wait between key presses

### program ###
def sleep():
    time.sleep(sleep_time)

def hit_keys(keys):
    """Wait briefly, then send a keypress to the active window."""
    time.sleep(short_sleep_time)
    shell.SendKeys(keys, 1)

def get_clipboard_text():
    """Return the clipboard text as a string, always closing the clipboard afterwards."""
    clipboard.OpenClipboard()
    try:
        data = clipboard.GetClipboardData(win32con.CF_TEXT)
    finally:
        clipboard.CloseClipboard()
    # CF_TEXT comes back as bytes on Python 3; InChI, InChI key and SMILES strings are plain ASCII
    return data.decode('ascii') if isinstance(data, bytes) else data

# get all the input file names (bare names directly inside mol_dir, no subdirectories)
# maybe just as easy or easier with glob module
all_files = [name for name in os.listdir(os.path.join(root_path, mol_dir)) if name.endswith('.mol')]
mol_files = list()  # file names with the extension
mol_root_names = list()  # file names without the extension
for f in all_files:
    m = re.search(r'^(.+)\.mol$', f)
    if m:
        mol_files.append(f)
        mol_root_names.append(m.group(1))

# initialize windows shell (so we can send keyboard commands)
shell = win32.Dispatch("WScript.Shell")
# initialize chemdraw
chemdraw = win32.gencache.EnsureDispatch('ChemDraw.Application')  # connect to chemdraw
chemdraw.Visible = True  # it's kind of fun to watch...
time.sleep(sleep_time)

inchi_list = list()
smiles_list = list()
inchi_key_list = list()
for f in mol_files:
    print(f)
    doc = chemdraw.Documents.Open(os.path.join(root_path, mol_dir, f))  # step 1 done
    doc.Activate()
    sleep()
    # chemdraw must be the active application, so it can receive keyboard commands
    shell.AppActivate("ChemDraw Pro")  # this should be the name that appears at the top of the ChemDraw window bar
    hit_keys("^a")  # step 2 done: ctrl+a selects all
    hit_keys("%e")  # alt+e opens the edit menu
    hit_keys("o")  # copy as
    hit_keys("n")  # inchi: step 3 done
    hit_keys("{Del}")  # step 5 done
    sleep()
    inchi_list.append(get_clipboard_text())  # saving the string so we can do step 4 later
    sleep()
    hit_keys("%e")  # edit
    hit_keys("s")  # paste special
    hit_keys("i")  # inchi: step 6 done
    hit_keys("^a")  # select all: step 7 done
    hit_keys("%e")  # edit
    hit_keys("o")  # copy as
    hit_keys("s")  # smiles: step 8 done
    sleep()
    smiles_list.append(get_clipboard_text())  # saving the string so we can do step 9 later
    sleep()
    hit_keys("%e")  # edit
    hit_keys("o")  # copy as
    hit_keys("k")  # inchi key: step 10 done
    sleep()
    inchi_key_list.append(get_clipboard_text())  # saving the string so we can do step 11 later
    sleep()
    doc.SaveAs(os.path.join(root_path, new_mol_dir, f))  # step 12 done. ChemDraw detects from the file name that we want to save it in MDL mol format
    doc.Close()  # close the window so we don't have too many open at once and end up using a lot of RAM and slowing down

# complete steps 4, 9, and 11
for (index, name) in enumerate(mol_root_names):
    with open(os.path.join(root_path, inchi_dir, name) + ".inchi", "w") as outfile:
        outfile.write(inchi_list[index])
    with open(os.path.join(root_path, inchi_key_dir, name) + ".inchikey", "w") as outfile:
        outfile.write(inchi_key_list[index])
    with open(os.path.join(root_path, smiles_dir, name) + ".smiles", "w") as outfile:
        outfile.write(smiles_list[index])
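Because the script above relies on timing and on the clipboard, a stale or empty clipboard can silently end up in the output files. A sanity check like the one sketched here could be run on each captured string before it is written. It assumes standard InChI strings (prefix `InChI=1S/`; non-standard ones start with `InChI=1/`) and the usual 27-character InChIKey layout of 14, 10 and 1 uppercase letters separated by hyphens, optionally preceded by `InChIKey=`. The function names are hypothetical, and the checks only test the shape of the strings, not the chemistry.

```python
import re

INCHI_RE = re.compile(r'^InChI=1S?/\S+$')
INCHI_KEY_RE = re.compile(r'^(InChIKey=)?[A-Z]{14}-[A-Z]{10}-[A-Z]$')

def to_text(data):
    """Decode clipboard bytes if needed and trim surrounding whitespace and NUL characters."""
    if isinstance(data, bytes):
        data = data.decode('ascii', errors='replace')
    return data.strip().strip('\x00').strip()

def looks_like_inchi(data):
    """True if the captured text has the shape of an InChI string."""
    return bool(INCHI_RE.match(to_text(data)))

def looks_like_inchi_key(data):
    """True if the captured text has the shape of an InChIKey."""
    return bool(INCHI_KEY_RE.match(to_text(data)))

# Example with ethanol
print(looks_like_inchi('InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3'))
print(looks_like_inchi_key('LFQSCWFLJHTTHZ-UHFFFAOYSA-N'))
```

If a check fails for a file, the simplest recovery is probably to rerun that file with a longer `sleep_time`.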
References:
http://stackoverflow.com/questions/19351758/how-to-open-a-program-in-python-and-send-keystrokes
http://www.techrepublic.com/article/automate-tasks-with-windows-script-hosts-sendkeys-method/
http://timgolden.me.uk/pywin32-docs/contents.html (pywin32 documentation)
http://www.cambridgesoft.com/services/documentation/sdk/chemdraw/ActiveX11/ChemDrawControl10_P.html
http://www.cambridgesoft.com/services/documentation/sdk/chemdraw/automation/
http://blog.rguha.net/?p=549