Friday, January 25, 2019

to postdoc or not to postdoc: from PhD to industry

After finishing my PhD, I did an academic postdoc for about two years, then moved to a small company in private industry. A friend recently asked whether I thought the academic postdoc was worthwhile and whether I had any advice for getting into industry. Here's what I said.

Saturday, January 12, 2019

How systems programming is like biochemistry


Recently I've been taking a lot of classes in computer architecture and systems programming. Someone asked me why, with a biology background, I would be interested in such things. Here's what I said.

Thursday, January 3, 2019

compiling Codon Optimizer on Ubuntu 18.04


There is a cool old piece of software for codon optimization, inventively named "Codon Optimizer":

http://www.cs.ubc.ca/labs/beta/Projects/codon-optimizer/

It's not very useful from a practical standpoint, because its codon optimization strategy is to replace every codon with the highest-CAI codon for the same amino acid, which turns out not to be a very good strategy in practice. The software can also only optimize for E. coli, and has the translation tables and codon frequency tables hard-coded in.
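As a rough illustration of the greedy strategy described above (this is not Codon Optimizer's actual code), here is a minimal Python sketch. The function name `greedy_optimize` and the `weights` argument are my own inventions for illustration, and the toy weights in the usage note are made-up numbers, not real E. coli values.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(bases) for bases in product("TCAG", repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))

# Group synonymous codons by the amino acid (or stop, "*") they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def greedy_optimize(dna, weights):
    """Replace each codon with its highest-weighted synonymous codon.

    `weights` maps codon -> score (hypothetical; e.g. CAI-style relative
    adaptiveness). Codons missing from `weights` score 0.0, and ties keep
    the original codon. Trailing bases that do not form a full codon are
    dropped.
    """
    dna = dna.upper()
    out = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        aa = CODON_TO_AA[codon]
        best = max(SYNONYMS[aa], key=lambda c: (weights.get(c, 0.0), c == codon))
        out.append(best)
    return "".join(out)
```

For example, with the made-up weights `{"CTG": 1.0, "CTT": 0.2, "AAA": 1.0, "AAG": 0.3}`, the input `ATGCTTAAGTAA` has its leucine and lysine codons swapped for `CTG` and `AAA`, while the start and stop codons are left alone. The protein sequence never changes; only the codon choice does, which is exactly why this approach ignores everything except per-codon score.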

Nevertheless, it's one of the few open-source programs for codon optimization, so it might be a useful starting point for a more fully featured program, with more tables or alternative algorithms for picking codons added.

Unfortunately, it won't compile on Ubuntu 18.04.

To get it to compile, I had to do the following:

Minimalist shoes: a journey


The idea of minimalist footwear appeals to me from a number of perspectives. Aesthetically, I like the idea of having as little as possible on my feet. Minimalist shoes also tend to be cheaper than other shoes, particularly other running shoes. I also like the fact that I can (try to) make my own minimalist shoes, whereas trying to make a pair of padded, heel-raised shoes seems like it would be a nightmare.

Thursday, December 13, 2018

Codon translation tables in JSON format

I reformatted NCBI's genetic code tables (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?chapter=cgencodes) into JSON format.

Check out this link (http://www.petercollingridge.co.uk/tutorials/bioinformatics/codon-table/) for a quick algorithm for parsing these strings.

 codon_tables = {  
 "1": ["FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","---M------**--*----M---------------M----------------------------","Standard"],  
 "2": ["FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG","----------**--------------------MMMM----------**---M------------","Vertebrate Mitochondrial"],  
 "3": ["FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG","----------**----------------------MM----------------------------","Yeast Mitochondrial"],  
 "4": ["FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","--MM------**-------M------------MMMM---------------M------------","Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma"],  
 "5": ["FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG","---M------**--------------------MMMM---------------M------------","Invertebrate Mitochondrial"],  
 "6": ["FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","--------------*--------------------M----------------------------","Ciliate, Dasycladacean and Hexamita Nuclear"],  
 "9": ["FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG","----------**-----------------------M---------------M------------","Echinoderm and Flatworm Mitochondrial"],  
 "10":["FFLLSSSSYY**CCCWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","----------**-----------------------M----------------------------","Euplotid Nuclear"],  
 "11":["FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","---M------**--*----M------------MMMM---------------M------------","Bacterial, Archaeal and Plant Plastid"],  
 "12":["FFLLSSSSYY**CC*WLLLSPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","----------**--*----M---------------M----------------------------","Alternative Yeast Nuclear"],  
 "13":["FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSGGVVVVAAAADDEEGGGG","---M------**----------------------MM---------------M------------","Ascidian Mitochondrial"],  
 "14":["FFLLSSSSYYY*CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG","-----------*-----------------------M----------------------------","Alternative Flatworm Mitochondrial"],  
 "16":["FFLLSSSSYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","----------*---*--------------------M----------------------------","Chlorophycean Mitochondrial"],  
 "21":["FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNNKSSSSVVVVAAAADDEEGGGG","----------**-----------------------M---------------M------------","Trematode Mitochondrial"],  
 "22":["FFLLSS*SYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","------*---*---*--------------------M----------------------------","Scenedesmus obliquus Mitochondrial"],  
 "23":["FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","--*-------**--*-----------------M--M---------------M------------","Thraustochytrium Mitochondrial"],  
 "24":["FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSSKVVVVAAAADDEEGGGG","---M------**-------M---------------M---------------M------------","Pterobranchia Mitochondrial"],  
 "25":["FFLLSSSSYY**CCGWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","---M------**-----------------------M---------------M------------","Candidate Division SR1 and Gracilibacteria"],  
 "26":["FFLLSSSSYY**CC*WLLLAPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","----------**--*----M---------------M----------------------------","Pachysolen tannophilus Nuclear"],  
 "27":["FFLLSSSSYYQQCCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","--------------*--------------------M----------------------------","Karyorelict Nuclear"],  
 "28":["FFLLSSSSYYQQCCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","----------**--*--------------------M----------------------------","Condylostoma Nuclear"],  
 "29":["FFLLSSSSYYYYCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","--------------*--------------------M----------------------------","Mesodinium Nuclear"],  
 "30":["FFLLSSSSYYEECC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","--------------*--------------------M----------------------------","Peritrich Nuclear"],  
 "31":["FFLLSSSSYYEECCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG","----------**-----------------------M----------------------------","Blastocrithidia Nuclear"]  
 }  
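For reference, here is one way these strings could be parsed in Python. This is a minimal sketch, not the algorithm from the linked tutorial. It relies on NCBI's codon ordering, in which each of the three base positions cycles through T, C, A, G. The function names are my own, and only table 1 is repeated so the block stands alone.

```python
from itertools import product

# Table 1 from the listing above: [amino acids, start/stop markers, name].
standard = [
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    "---M------**--*----M---------------M----------------------------",
    "Standard",
]

# NCBI lists codons as TTT, TTC, TTA, TTG, TCT, ... GGG.
CODONS = ["".join(bases) for bases in product("TCAG", repeat=3)]


def parse_table(entry):
    """Return (codon -> amino acid dict, set of start codons) for one entry."""
    amino_acids, starts, _name = entry
    codon_to_aa = dict(zip(CODONS, amino_acids))
    start_codons = {c for c, flag in zip(CODONS, starts) if flag == "M"}
    return codon_to_aa, start_codons


def translate(dna, codon_to_aa):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(codon_to_aa[dna[i:i + 3]] for i in range(0, usable, 3))
```

With the standard table, `translate("ATGGCCTAA", parse_table(standard)[0])` gives `MA*`, and the start codons come out as ATG, CTG and TTG.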

Thursday, August 17, 2017

TargetP wrapper for large queries

As far as I know, TargetP is still (17 years after its original publication!) the best software for predicting subcellular localization for plant proteins, and also the location of truncation sites.

Without any modifications, TargetP works well with small (by modern standards) queries of fewer than 2,000 sequences at a time, but it becomes glitchy with larger queries, such as the 30k-100k genes that are typical of a plant transcriptome assembly.

To adapt TargetP for larger queries, I wrote a Python script that acts as a wrapper around TargetP, called targetp_all.py. The script works by separating the input into smaller subsets of sequences and running those, and combining the output.
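The split-run-combine idea can be sketched in a few lines of Python. This is only an illustration of the approach, not the code of targetp_all.py: the names `read_fasta`, `batches` and `run_in_batches` are hypothetical, the FASTA reader is a plain standard-library stand-in, and `run_batch` is a placeholder for whatever function actually invokes TargetP on one subset and returns its parsed results. The default batch size of 1,000 is an arbitrary choice below the 2,000-sequence threshold mentioned above.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_in_batches(records, run_batch, size=1000):
    """Call run_batch on each subset and concatenate the per-batch results."""
    results = []
    for batch in batches(records, size):
        results.extend(run_batch(batch))
    return results
```

Because `run_batch` is passed in as a callable, the batching logic can be checked with a dummy function before wiring it up to the real program.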

The interface is the same as the original program's, with a few additional options. The output is simplified into a tab-separated format.

It would also be nice to parallelize the execution of TargetP to run on multiple cores at once, but I haven't attempted this yet. I believe there will be complications involving conflicting temporary files that may require careful modification of the original source code.

Source code follows. BioPython is a dependency.

Saturday, April 29, 2017

Mira4 assembly of 454 reads from SRA

I want to make an assembly of the Annona squamosa fruit transcriptome data from this paper (http://dx.doi.org/10.1186/s12864-015-1248-3). The paper gives a link to a web resource (http://www.annonatranscriptome.nabi.res.in/), but the resource now appears to be defunct, so to get contigs I will have to assemble the reads myself. The reads are from two different cultivars of Annona squamosa, so I'm going to assemble each cultivar separately first, and then, if that works, I'll try a combined assembly.

MIRA is a nice, free software package that can assemble 454 data. I've had success with it before, so that's what I'll use for this project too.