Sunday, July 27, 2014

srj_chembiolib a set of scripts for doing Bioinformaticky type stuff

This post is to announce srj_chembiolib which in an uncreatively named set of scripts I've written to perform various bioinformatics (and hopefully eventually some cheminformatic) tasks that would otherwise be a pain in the rear to accomplish. Most of the scripts work both as libraries that can be imported into other scripts, and as stand alone command line scripts. Some depend on external libraries such as BioPython. Documentation is mostly found in block comments at the top of the script files. Where possible, for example with the scripts that manipulate fasta files, I've tried to make it so that multiple programs can be chained together with pipes on the command line to accomplish more complex tasks. I hope they're useful for someone.

Here's an overview of some (but not all!) of the scripts in the package:

blast_xml_to_outfmt6.py: Converts Blast+ xml output to '-outfmt 6' style output (a tab separated form). Allows for some additional features not available in the standard outfmt 6, such as printing a line for query sequences that had no hits.

subset_fasta.py: Give it a fasta file and a list of strings, and it will give you a fasta file containing only those sequences whose names contain something from the list of strings as a substring.

 extract_massbank.py: A class to read and store data from MassBank format text files, such as those used by MassBank, ReSpect, and Spektraris (my favorite database...).

extract_top_blast_hits.py: Reads a blast xml file and outputs the names of the top hits for each query sequence. There's an option for making a file listing the sequences the queries where the top hit matched in the reverse direction, which is useful, for example with Blastx, to determine whether a nucleotide sequence represents the coding strand, or the non-coding strand.

No comments:

Post a Comment