Wednesday, March 30, 2016

Linux shell notes

This post collects assorted Bash snippets, with short explanations of how they work. Some don't rely on Bash-specific features and will probably work in many other shells as well. I intend to keep adding to this list over time.

Count instances of a specific character in a file:
(In the example here, I count the number of sequences in a FASTA file. Sequence names all begin with '>', and that is usually the only place that character appears in a FASTA file.)
tr -cd '>' < sequences.fasta | wc -c

grep -c '^>' sequences.fasta
The first one feeds the FASTA file to the translate (tr) program on standard input. The -c option tells tr to operate on the complement of the given set (every character except '>'), and the -d option tells it to delete those characters, so the output of tr is a stream of '>' characters. wc -c counts the bytes in that stream, which is the number of '>' characters in the FASTA file, and therefore (usually) the number of sequences in the file.

In the second example, the -c option tells grep to output the number of lines that match the query regex. The regex '^>' matches a '>' character only at the beginning of a line. Notably, the second one will not count a line twice if another '>' is found later in the line. To count every '>' regardless of position, you'd have to use:
grep -o '>' sequences.fasta | wc -l

Interestingly, both of these commands seem to run at about the same speed. I expected grep to be slower since it uses regular expressions. That they take about the same time leads me to believe that they are disk I/O bound. Because they are similar in speed, but grep is more powerful, I would recommend the grep construction.
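To see how the three counts differ, here is a small sketch on a made-up FASTA file (the file contents and variable names are hypothetical, chosen only for illustration). One header deliberately contains a second '>':

```shell
# Build a tiny, made-up FASTA file; the second header contains an extra '>'.
fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2 a>b\nGGCC\n>seq3\nTTAA\n' > "$fasta"

count_chars=$(tr -cd '>' < "$fasta" | wc -c)     # every '>' character in the file
count_lines=$(grep -c '^>' "$fasta")             # lines that start with '>'
count_matches=$(grep -o '>' "$fasta" | wc -l)    # every '>' match, one per output line

echo "tr/wc: $count_chars  grep -c: $count_lines  grep -o: $count_matches"
```

On this file the tr and grep -o counts include the stray '>' in the second header, while grep -c '^>' counts only the three header lines.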

Convenient output from ls:
ls -alth
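Of course this will change depending on the situation, but I find this to be a good general-purpose set of options for ls: -a includes hidden files, -l uses the long listing format, -t sorts by modification time (newest first), and -h prints sizes in human-readable units.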

Batch rename:
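Use Bash parameter expansion. ${i:0:3} expands to the first three characters of $i, so each .mol file is renamed to the first three characters of its name plus .mol:
for i in *.mol ; do mv "$i" "${i:0:3}.mol" ; done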
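Since a loop like this can clobber files when two names share the same first three characters, it may help to try it in a scratch directory first. The sketch below assumes made-up file names and adds mv -n (do not overwrite) as a safety net; neither is part of the original one-liner:

```shell
# Work in a throwaway directory with made-up file names.
workdir=$(mktemp -d)
touch "$workdir/abc_long_name.mol" "$workdir/xyz_other.mol"

(
    cd "$workdir" || exit 1
    for i in *.mol; do
        new="${i:0:3}.mol"        # first three characters of the old name
        echo "mv $i -> $new"      # show each rename as it happens
        mv -n -- "$i" "$new"      # -n: never overwrite an existing file
    done
)

renamed=$(ls "$workdir")
echo "$renamed"
```

The -- stops mv from treating a file name that begins with a dash as an option.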

List processes along with the command used to execute the process:
ps -af
-a shows the processes attached to a terminal, for all users. -f shows the full-format listing, which includes the command-line arguments in addition to the process name.
To see all processes currently running, use ps -A

Simple multi-threading with GNU parallel:
GNU parallel makes it easy to run the pieces of a Bash script in parallel. The -j option specifies how many jobs to have active at one time. The first positional argument is the command to call. The items listed after ::: are the arguments; the command is called once for each item. For example, the script:


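THREADS=2
function thread() {
    echo "processing $1"    # placeholder body; put the real work here
}
export -f thread            # make the function visible to the shells parallel starts

parallel -j "$THREADS" thread ::: instance_1_argument instance_2_argument instance_3_argument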

Will call the function "thread" a total of three times, such that no more than 2 calls are running concurrently (with THREADS set to 2). It starts thread instance_1_argument and thread instance_2_argument right away. When one of those finishes, it starts thread instance_3_argument.
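If GNU parallel is not installed, xargs -P can give a similar bounded-concurrency effect. This is a different tool swapped in for illustration, not the GNU parallel approach itself, and the body of thread here is a hypothetical stand-in for the real work:

```shell
THREADS=2

# Hypothetical stand-in for the real work.
thread() {
    echo "finished $1"
}
export -f thread    # make the function visible to the child bash processes

# xargs -P runs up to $THREADS commands at once; -n 1 passes one argument per call.
# The trailing _ fills $0 of the child shell so the argument lands in $1.
results=$(printf '%s\n' instance_1_argument instance_2_argument instance_3_argument |
    xargs -P "$THREADS" -n 1 bash -c 'thread "$1"' _)
echo "$results"
```

The order of the output lines is not guaranteed, since the calls finish whenever they finish.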

Sort file, but skip the header: 
#The default action in awk, if none is given, is to print the whole line; that's what it does for line 1 (where NR == 1).

#If NR > 1, awk does the action stated there, which is to output the whole line (print $0), but instead of printing it
#to stdout, it pipes it to sort. sort then sorts the lines by the second tab-separated field. Prefixing sort with LANG=en_EN (as in the next section) keeps sort and join consistent.
awk 'NR == 1; NR > 1 {print $0 | "sort -t \"\t\" -b -k 2"}'
This command prints the header row unchanged, then sorts the rest of the input by the second tab-separated field. It reads from standard input, so redirect a file into it. Strictly speaking, -k 2 uses everything from field 2 to the end of the line as the sort key; -k 2,2 restricts the key to field 2 alone.
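A quick way to check the behavior is to run it on a small made-up table (the file contents and variable names below are hypothetical). This sketch uses -k 2,2 rather than -k 2 so that only the second field is used as the key:

```shell
# Made-up tab-separated table: a header line, then rows to be sorted on column 2.
table=$(mktemp)
printf 'id\tname\n1\tpear\n2\tapple\n3\tmango\n' > "$table"

# Line 1 is printed as-is; every later line is piped to sort, keyed on field 2.
sorted=$(awk 'NR == 1; NR > 1 {print $0 | "sort -t \"\t\" -k 2,2"}' < "$table")
printf '%s\n' "$sorted"
```

The header stays on top and the remaining rows come out ordered by name (apple, mango, pear).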

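When combining sort with join, use LANG=en_EN:
I had a heck of a time with this one. join expects its inputs to be sorted in the same collation order that it uses itself, so sort and join need to run under the same locale.
This script:
awk 'NR == 1; NR > 1 {print $0 | "LANG=en_EN sort -t \"\t\" -k2"}' < "$OUTPUT" > "${OUTPUT}.sorted"
awk 'NR == 1; NR > 1 {print $0 | "LANG=en_EN sort -t \"\t\" -k1"}' < "$EXPRESSION" > "${EXPRESSION}.sorted"

LANG=en_EN join -t$'\t' --header -1 2 -2 1 "${OUTPUT}.sorted" "${EXPRESSION}.sorted" > "${OUTPUT}.expression"

absolutely refused to work until I added the LANG=en_EN prefix to both the sort and the join calls. (en_EN is not a standard locale name, so this probably works because both tools fall back to the C locale; LC_ALL=C would be the more explicit way to ask for that.)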
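Here is a minimal sketch of the same sort-then-join pattern on two made-up files, using LC_ALL=C instead of LANG=en_EN (my substitution, on the assumption that the C locale is what ends up in effect anyway). The file contents and variable names are hypothetical, and --header requires GNU join:

```shell
# Two made-up tab-separated files that share a gene column.
left=$(mktemp); right=$(mktemp)
printf 'sample\tgene\ns1\tgeneB\ns2\tgeneA\n' > "$left"
printf 'gene\tlevel\ngeneB\t7\ngeneA\t3\n' > "$right"

# Sort each body on its join field, under the same locale that join will use.
awk 'NR == 1; NR > 1 {print $0 | "LC_ALL=C sort -t \"\t\" -k 2,2"}' < "$left" > "$left.sorted"
awk 'NR == 1; NR > 1 {print $0 | "LC_ALL=C sort -t \"\t\" -k 1,1"}' < "$right" > "$right.sorted"

# Join field 2 of the first file against field 1 of the second, keeping the headers.
joined=$(LC_ALL=C join -t$'\t' --header -1 2 -2 1 "$left.sorted" "$right.sorted")
printf '%s\n' "$joined"
```

join prints the join field first, followed by the remaining fields of the first file and then those of the second.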
