Monday, August 18, 2014

Eating really cheap: a 50 cent meal, and why I think Soylent will never be a food for the poor (although it may help them indirectly)

Two of my major interests in life are eating delicious food and spending as little money as possible. So when I read about the Soylent project, which aims to produce an ultra-cheap food powder that meets all of a person's nutritional needs, I was very much interested (in the ultra-cheap part and the nutritional-needs-meeting part, not so much in the tasteless powder part). My philosophy about food is much different from that of Soylent creator Rob Rhinehart, who states on his blog, "In my own life I resented the time, money, and effort the purchase, preparation, consumption, and clean-up of food was consuming." Personally, I see the time I spend cooking as an adventure: an opportunity to learn and experiment. I rarely cook from recipes (although I do keep a notebook), and pretty much never eat the same thing twice. I see cooking as a kind of huge multivariable optimization problem with a complicated (and changing) objective function and no global optimum, but lots of local optima all over the place. I want to make food that is delicious, nutritious, cheap, and not monotonous, and I have fun trying to do that.

Wednesday, August 13, 2014

Chemspider python scripting

In the past, I've shown how to script Chemdraw and ChemAxon for the purposes of converting chemical information among different file formats. In this post I'll show how to achieve (vaguely) similar results using the ChemSpider web API. ChemSpider is nice because (unlike SciFinder) it is free, they don't mind if you reuse their data, and they don't mind if you access their website with scripts. ChemSpider also has the advantage of being a huge database that provides cross-references to other databases, along with images, mol files, and lots of other data about each compound.


Tuesday, July 29, 2014

Using SwissProt and ExPASy ENZYME to generate putative EC annotations for sequence data

There was a question on Biostars about how to make EC assignments based on sequence. I gave one of the answers, suggesting a few possible solutions. One of my solutions was to Blast against ExPASy ENZYME and base the annotations on the best hit from there. In this post I explain how to do that, and supply the necessary Python code.
[Note: PRIAM is another tool that assigns EC numbers to sequences based on the data in the ENZYME database. Instead of individual Blast hits, it uses profiles built from multiple sequence alignments of peptides known to catalyze a given reaction. It's also a lot quicker to use than the method outlined here, so it's probably worth checking out first.]
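To give a rough idea of what "base the annotations on the best hit" means in practice, here is a minimal sketch (not the code from the full post). It assumes the Blast results are in the standard 12-column tabular form ('-outfmt 6'), where column 1 is the query name, column 2 is the subject name, and column 12 is the bit score. The function name best_hit_ec, the accession-to-EC dictionary, and the accessions in the example are all made up for illustration; in real use the dictionary would have to be built from the ENZYME data.

```python
import csv
import io


def best_hit_ec(blast_tsv, accession_to_ec):
    """Map each query to the EC numbers of its highest-scoring annotated hit.

    blast_tsv: file-like object with 12-column '-outfmt 6' lines.
    accession_to_ec: hypothetical dict of subject accession -> list of EC numbers.
    """
    best = {}  # query name -> (bit score, subject accession)
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if subject not in accession_to_ec:
            continue  # this hit has no EC number, so it can't annotate anything
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {query: accession_to_ec[subject] for query, (_, subject) in best.items()}


# Toy input with made-up accessions and scores.
toy_hits = io.StringIO(
    "q1\tP00001\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.0\n"
    "q1\tP00002\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t200.0\n"
    "q2\tP00003\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-5\t50.0\n"
)
toy_ec = {"P00001": ["1.1.1.1"], "P00002": ["2.7.1.1"]}
annotations = best_hit_ec(toy_hits, toy_ec)
```

Queries whose hits all lack an EC number (q2 in the toy data) are simply left out of the result. Whether to also apply an e-value or identity cutoff before trusting the best hit is a judgment call that this sketch doesn't try to make.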


Sunday, July 27, 2014

Wikipedia (cat)

Here's a silly poem I wrote a few years ago. It comes with its own silly introduction. I think this is pretty much the apex of my poetical achievement, so if you're not a fan of it... well, it just goes downhill from here folks... Also, I'd love to see this turned into a music video, so if anyone with musical or artistic talent would like to collaborate on that, let me know and maybe we can make something really cool.

srj_chembiolib: a set of scripts for doing Bioinformaticky type stuff

This post is to announce srj_chembiolib, which is an uncreatively named set of scripts I've written to perform various bioinformatics (and hopefully eventually some cheminformatics) tasks that would otherwise be a pain in the rear to accomplish. Most of the scripts work both as libraries that can be imported into other scripts, and as standalone command line scripts. Some depend on external libraries such as BioPython. Documentation is mostly found in block comments at the top of the script files. Where possible, for example with the scripts that manipulate fasta files, I've tried to make it so that multiple programs can be chained together with pipes on the command line to accomplish more complex tasks. I hope they're useful for someone.

Here's an overview of some (but not all!) of the scripts in the package:

blast_xml_to_outfmt6.py: Converts Blast+ xml output to '-outfmt 6' style output (a tab separated form). Allows for some additional features not available in the standard outfmt 6, such as printing a line for query sequences that had no hits.

subset_fasta.py: Give it a fasta file and a list of strings, and it will give you a fasta file containing only those sequences whose names contain something from the list of strings as a substring.

extract_massbank.py: A class to read and store data from MassBank format text files, such as those used by MassBank, ReSpect, and Spektraris (my favorite database...).

extract_top_blast_hits.py: Reads a Blast xml file and outputs the names of the top hits for each query sequence. There's an option for making a file listing the query sequences whose top hit matched in the reverse direction, which is useful, for example with Blastx, to determine whether a nucleotide sequence represents the coding strand or the non-coding strand.
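To make the idea behind subset_fasta.py a little more concrete, here is a minimal sketch of substring-based fasta filtering. This is not the code from the package: the function names read_fasta and subset_fasta and the toy sequences are my own illustration, and it uses only the standard library.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a fasta file-like object."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def subset_fasta(handle, patterns):
    """Yield only the records whose name contains at least one of the patterns."""
    for name, seq in read_fasta(handle):
        if any(pattern in name for pattern in patterns):
            yield name, seq


# Toy example with made-up records.
example = io.StringIO(
    ">seq1 kinase\nATGC\nGGTA\n"
    ">seq2 lyase\nTTAA\n"
    ">seq3 kinase-like\nCCGG\n"
)
kept = list(subset_fasta(example, ["kinase"]))
```

Because the functions are generators that read from any file-like object, the same logic could read from standard input and write to standard output, which is what makes chaining with pipes possible.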

Monday, July 14, 2014

Head hair as a sensory organ


It's no mystery that hair acts to amplify the sense of touch. Everyone is familiar with the feeling of a bug crawling on their arm, and everyone who has ever gone swimming with a beard knows how much more pleasurable (like a million tiny hands gently pulling at your chin) it is than swimming without a beard. There are also more mystical ideas about long hair granting a kind of sixth sense.

I recently gave myself a buzz cut after letting my hair grow for about 2 years. My hair is moderately curly, so when it's long it poofs out an inch or two from my scalp. From my experience, I don't think long hair grants any kind of mysterious powers (nor do I think it has any effect on personality or intelligence, although people with certain personality traits may be more likely to choose to grow their hair long), but I was surprised to find that scalp hair does seem to contribute to spatial awareness. My evidence for this is that in the 24 hours after I cut my hair, I bonked my head into the wall twice. (This is admittedly circumstantial and has a low sample size, but it would be a very hard thing to test in a controlled environment; you can't very well have a "double blind" haircut.) Once I was leaning over to put something in the trash can, and hit my head on the corner of the doorway that the can is next to. The other time I was in the shower and leaning around the too-hot stream of water to adjust the knobs. So I think scalp hair can function for people kind of like how whiskers function for cats.

I think that when I had poofy hair, I subconsciously used it to provide spatial information about the environment immediately around my head. Lacking that information, but not yet having had a chance to compensate by other means, my lack of hair temporarily increased my propensity to bonk into things.

Tuesday, June 24, 2014

maximum blastx tblastn translation length?


Blastx the sequence below against SwissProt. It has an orf that's 493 amino acids long, but the best hsp it turns up (from a protein that is completely identical to the orf) is 453 a.a. long. The same is true when you do tblastn the other way. However, when you first translate the orf (for instance here) and Blastp it against SwissProt, you get an hsp covering the whole orf. What's going on here? Does Blast have a maximum length it stores translations at? Hopefully someday I'll have time to investigate this further, or maybe some knowledgeable stranger will pass by this blog and let me know. (edit: This seems to be a freak occurrence with this particular sequence. I have since seen other Blastx searches find hsps longer than this one. There is a truncated version of this protein in the NCBI Protein database, so maybe that has something to do with it.)
GCTGAAAAAAAGTGTAACGTCTCTAGCGGGACGGAGGTAGTATTTATTAAGTTCTGGCGT
GAGAAATAGGAATAAATCTGGACCACCGATCTTAGTTAGATCTATGGTTGATATTTATTT
AAGCATAGTTACAAGATACAATATGCATGATTTTATTTAAGCATTGATTTTACATTAAAA
AGCACAATCGTATAATAAGCATAATATAATGCGTATTTAAGCATGACTTTGTATAGTTTT
AAACATTATTCAAATTTTCTCACGTGAAGATTTATTTACACACGAACCCTTCCCTTCCGT
GTGTATATGTATATAAATATTTAATCAAGATTGACGTGGAGTAGCAAGCACAAGCAAAGG
AGACTTCTTATGGACTACAAATCCAGGAGCTTCAGTCATGTCCAAATCCTCCGCTCTATC
TCCATTACCCAATCTGAAATCGAACTCGTTTACCAGCTTGGATAGTGCAAGCTCGTATAA
AGCCATCGCAAACGTGGATCCGGGGCACCCTCTTCGACCCGACCCGAACGGAAGCATCTC
AAAATGCAACCCTTTATAGTCTATGCTCGTCTCGAGGAACCTTTCTGGACGAAATTCTTC
GGGATTTTCCCACAACGAGGGGTCTCTCGATATGGCCCAGTTGTTGACCAACACGACCGT
GCCACGTGGGATGTCGTAGCCGAGCATATTGGCGTCTTGAGTCAATTCTCGAGGGAGCAG
AATTGCGAAAGGTGGATGTAGGCGTAGAATCTCCTTGGATACTGCTTTCAGATATGGCAT
CTTGTCCACGTCATCCTCGGTAATCCCACCTTTGTTTCTAGAAACTTCTCGCACCTCGTT
CTGCAAAGTTTTTAGGGTACGCGGGTTTTTTATGAGCTCCGCCATCGTCCACTCTAGAGC
CGCGAAAGTCGTATCGGTTCCGGCAGAAACCATGTCGAAGATTAGAGCTTTGATTACGTC
ATCCTCGACGGGGTCAGTATCTTTACTCTCTCTCTGAAACTGAAGCAATGTGTCTACGAA
ATTCGTTTCATCATCACCCACCTTCTTCCTTCTATATTTTCGAAGAATACCCTCCATTGA
TCCATCCAACTTTGTACCGACTTTTTCCACTTCTGCATCGACGCCATTTATCCGGTTGAT
CCAAGACAGCCATGGAACGTAATCCCCCACGTTGAAACTTCCCAAGAGCTTGATAACCTT
GATCAGAATCCGATTAAAATCATCTCCGCCGTCGCCCTTCCTCCCTAACACCGCCCTGTG
AATTACGCCGTTCGTCAGCGCCATGAACATCTCGCTCAAGTTCACGACCGTCGTCGGCTT
CGATCGCCTGATCTTCTCAATCATAGCCGACGTTTCCTCTTCTCGAATCCCGCCGAACGA
CTGGACCCTCTTAGCGCTGAGCAGCTGCAGCATGCACATGCTCCGCGCGTTGCGCCAGTG
CTCGCCGTAGGGGGCGAAGGCCACGCCCTTGCCGCTGTACATCAGCCTGTCGAAGATGCT
CAGCCTCGGCCTGCTCGCGAAGATCACGTCTTGGTTCTTCATGATCTCACGCGCCGCCGC
CGCTGAGGAGGCCACTAGGACAGGAGCGCTGCCGAAATGGAGTAGCATCACCTCGCCGTA
GCGCTTGGATAAGGAGGTGAAGGAGCGGTGGGAGAGGGCTCCGATCAGGTGGAAATGGCC
GATCACCGGAAGCCTTAATGGAGACGGCGGCGGCCTCTTTCTTGAGGAAAGACTGGACTT
TCGTTTATGGAAAAGGACCGCCAGTAAGATTAAAGAGACAGAGAAAAATACTAGAAGAGC
GGCCATTTTCTCTTCAGTTGAATATAATGATGGACAAGTTCTTGGAGAAGG
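If you'd rather find and translate the orf yourself instead of using a web tool, here is a minimal sketch of a six-frame search using the standard genetic code. It makes a few assumptions that may not match other orf finders: an orf is taken to start at the first M after a stop codon (or after the start of the frame) and to run until the next stop or the end of the sequence, and any codon containing a non-ACGT character is translated as X. The function names are my own. To try it on the sequence above, join the lines into a single string and pass it to longest_orf.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    return "".join(CODONS.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf(dna):
    """Return (protein, strand, frame) for the longest M-initiated orf."""
    dna = dna.upper()
    best = ("", None, None)
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            # Splitting on '*' gives the stretches between stop codons.
            for segment in translate(seq[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[0]):
                    best = (segment[start:], strand, frame)
    return best


# Toy example: a short made-up sequence with one small orf.
toy = "CCATGAAATTTGGGTAACC"
forward = longest_orf(toy)
reverse = longest_orf(reverse_complement(toy))
```

The reported frame is the 0-based offset into whichever strand the orf was found on, so for a minus-strand orf it counts from the start of the reverse complement.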