Thursday, August 17, 2017

TargetP wrapper for large queries

As far as I know, TargetP is still (17 years after its original publication!) the best software for predicting the subcellular localization of plant proteins, as well as the location of their presequence cleavage (truncation) sites.

Without any modifications, TargetP works well with queries that are small by modern standards, meaning fewer than 2,000 sequences at a time. It becomes glitchy with larger queries, such as the 30k-100k genes that are typical of a plant transcriptome assembly.

To adapt TargetP for larger queries, I wrote a Python wrapper script called targetp_all.py. The script splits the input into smaller subsets of sequences, runs TargetP on each subset in turn, and combines the results into a single output.
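
To illustrate the split, run, and combine idea, here is a minimal sketch. It is not the actual code of targetp_all.py. The names batches, run_in_batches and run_batch, and the default batch size of 1000, are hypothetical. run_batch stands in for whatever step writes one subset to a temporary FASTA file, calls TargetP on it, and parses the result into rows.

```python
from itertools import islice


def batches(records, batch_size=1000):
    """Yield successive lists of at most batch_size records.

    batch_size=1000 is an assumed default, chosen only to stay
    below the ~2,000 sequence range where TargetP behaves well.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_in_batches(records, run_batch, batch_size=1000):
    """Call run_batch on each subset and concatenate the rows it returns.

    run_batch is a hypothetical callable: it takes a list of records
    and returns a list of result rows for that subset.
    """
    combined = []
    for batch in batches(records, batch_size):
        combined.extend(run_batch(batch))
    return combined
```

With BioPython, the records could come from SeqIO.parse(path, "fasta"), which returns an iterator, so the whole input file never has to be held in memory at once.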

The interface is the same as the original program's, with a few additional options. The output is simplified somewhat into a tab-separated format.

It would also be nice to parallelize the execution of TargetP so that it runs on multiple cores at once, but I haven't attempted this yet. I suspect there will be complications from conflicting temporary files, which may require careful modification of the original source code.

Source code follows. BioPython is a dependency.