Monday, March 31, 2014

Python Numpy Munkres Hungarian matching algorithm

I didn't write this, but there is a NumPy implementation of the Hungarian (Munkres) algorithm available through scikit-learn in sklearn/utils/linear_assignment_.py. I'm glad I found it, as in my use it seems to be about 4 times faster than the munkres package from PyPI (on which it is based, and which is the Python implementation that comes up most often on Google).
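For readers unfamiliar with the assignment problem these packages solve, here is a minimal sketch of what the answer looks like. It is not the Hungarian algorithm: it simply tries every permutation using only the standard library, so it is only usable on tiny matrices, but it is handy as a reference to check a faster solver against. The function name and the cost matrix are made up for illustration.

```python
from itertools import permutations


def brute_force_assignment(cost):
    """Return (total_cost, [(row, col), ...]) with the smallest summed cost.

    Tries every permutation of columns, so it is O(n!) and only meant
    for small square matrices. The Hungarian algorithm finds the same
    optimum in polynomial time.
    """
    n = len(cost)
    best_total, best_cols = None, None
    for cols in permutations(range(n)):
        total = sum(cost[row][col] for row, col in enumerate(cols))
        if best_total is None or total < best_total:
            best_total, best_cols = total, cols
    return best_total, list(enumerate(best_cols))


# Hypothetical cost matrix: cost[i][j] is the cost of giving job j to worker i.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, pairs = brute_force_assignment(cost)
print(total, pairs)  # 5 [(0, 1), (1, 0), (2, 2)]
```

A real solver should return the same total cost for the same matrix, which makes this a convenient sanity check on small inputs.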

Reference:
http://stackoverflow.com/questions/1398822/the-assignment-problem-a-numpy-function

Tuesday, March 25, 2014

building a django app that uses ZeroMQ: an annotated webliography


Introduction:
I wanted to build a website that allows people to search their data against a database. (In the not too distant future, when the website is live, I'll link it and the source code and give more of an explanation. Edit: here it is: the code, where the ZeroMQ stuff is mostly at "/nmr/management/commands", and the website.) Each search takes a few seconds, so in order to serve multiple clients at a time and allow scaling, I wanted a system where the main WSGI process does not block, but instead passes the search request off to another process that puts it in a queue and executes the queued requests one by one.

I ended up following a simple approach using ZeroMQ. There is a scheduler that runs as a thread in the main WSGI process. When the search input view receives a search request, it writes the search parameters into the database, opens a connection to the scheduler thread, and passes it the unique ID of the database record storing the search parameters.

There are one or more worker processes, each running as a subprocess. The workers are permanently attached to the scheduler via a socket. When a worker completes a job, the scheduler sends it the ID of the next job in the queue. The worker executes the job, writes the results to a database table, and tells the scheduler it is ready for another job. Many workers can be attached to the scheduler, so multiple searches can run concurrently.
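The pattern described above (the view hands over only a record ID, and workers pull IDs from a queue and write results back) can be sketched without ZeroMQ at all. The following is a simplified stand-in, not the site's actual code: it uses the standard library's queue and threading modules in place of ZeroMQ sockets and subprocesses, and plain dictionaries in place of the database tables. All names here are hypothetical.

```python
import queue
import threading

# Stand-ins for the database tables holding search parameters and results.
search_params = {1: "alpha", 2: "beta", 3: "gamma"}
search_results = {}
results_lock = threading.Lock()

# Plays the role of the scheduler's queue of record IDs.
job_queue = queue.Queue()


def run_search(params):
    # Placeholder for the real search, which takes a few seconds.
    return params.upper()


def worker():
    while True:
        job_id = job_queue.get()   # blocks until a job ID is available
        if job_id is None:         # sentinel value: shut this worker down
            job_queue.task_done()
            break
        result = run_search(search_params[job_id])
        with results_lock:
            search_results[job_id] = result
        job_queue.task_done()      # signals "ready for another job"


def submit_search(job_id):
    # What the view would do after saving the parameters: pass on the ID only.
    job_queue.put(job_id)


workers = [threading.Thread(target=worker) for _ in range(2)]
for t in workers:
    t.start()
for job_id in search_params:
    submit_search(job_id)

job_queue.join()                   # wait until every submitted job is done
for _ in workers:
    job_queue.put(None)            # one sentinel per worker
for t in workers:
    t.join()
print(search_results)
```

Passing only the record ID keeps the messages tiny and lets the database remain the single source of truth for both parameters and results, which is the same design choice the ZeroMQ version makes.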

Here then is a list of (some of) the websites I used for reference while writing this program.

Tuesday, March 18, 2014

ChemDraw ChemAxon synergy

In my previous post, I was complaining that there wasn't any free software with a nice command line interface to reliably convert molfiles to InChI strings and back again. I also mentioned that there was an issue with the way ChemDraw converts structures to InChI strings that made it unacceptable for my purposes. The technical issue with ChemDraw is that it doesn't preserve which tautomer was drawn. Some of the molecules I'm interested in contain amides that are typically found in the amide form, rather than the imidic acid form. When I copy these molecules as InChI from ChemDraw and then paste them back in, the amide is changed to the imidic acid form, which I don't want. It turns out that this is due to a feature of the InChI format (a format that is still mostly opaque to me) called "Mobile H Perception", where it simplifies a molecule's encoding by not specifying which tautomer it is (thereby saving 1 bit of information, I guess). Many programs have the option to export InChI with Mobile H Perception off, which is what I want, but I can't find that option in ChemDraw.
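One way to see Mobile H Perception at work is to look at the hydrogen layer (the part starting with "/h") of an InChI string. As far as I understand the format, hydrogens that are shared between several atoms are written there as a parenthesized group such as "(H2,2,3)", meaning two hydrogens spread over atoms 2 and 3. The sketch below just pulls those groups out with a regular expression. The helper name is made up, and the two example strings are the standard InChIs for formamide and ethanol as commonly listed in public databases, so treat them as illustrative inputs.

```python
import re


def mobile_h_groups(inchi):
    """Return the mobile-H groups found in the first /h layer of an InChI.

    A group looks like '(H2,2,3)': hydrogens not pinned to a single atom.
    An empty list means every hydrogen has a fixed position.
    """
    for layer in inchi.split("/")[1:]:
        if layer.startswith("h"):
            return re.findall(r"\(H[^)]*\)", layer)
    return []


formamide = "InChI=1S/CH3NO/c2-1-3/h1H,(H2,2,3)"  # an amide: N and O share H
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"     # no tautomerism to hide

print(mobile_h_groups(formamide))  # ['(H2,2,3)']
print(mobile_h_groups(ethanol))    # []
```

If the list is non-empty, the string alone cannot tell you whether the amide or the imidic acid form was drawn, which is exactly the round-trip problem described above.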

Automating Chemdraw: win32 com scripting with python pywin32

Once again I find myself adventuring in the land of Windows COM scripts. Last time I tried it, I had mixed success. This time, I'm using it to control ChemDraw to help me convert molecule formats. Any time I use COM scripts it feels pretty clumsy, as if it would be a lot slicker just to have a command line utility, but in this case I couldn't find anything that seemed like it would work well enough (update: I later found out about ChemAxon Molconverter, which is quite nice), so COM scripts it is. Python is currently the language I do general purpose programming in, and pywin32 works quite well for COM scripting.

The problem: I want to take molfiles from multiple sources and normalize them so that they can be rendered with the same style settings by ChemDoodle Web Components. While I'm at it, I'd like to generate SMILES strings, InChI strings, and InChIKey strings for the molecules in these molfiles (I'd also like to generate IUPAC names, but I gave up on that part).