Count instances of a specific character in a file:
(In the example here, I count the number of sequences in a fasta file. Sequence names all begin with '>', and that is usually the only place you see that character in a fasta file.)
tr -cd '>' < sequences.fasta | wc -c
grep -c '^>' sequences.fasta
The first one reads the fasta file into the translate (tr) program. The -c option tells tr to operate on the complement of the given set (every character except '>'), and the -d option tells it to delete those characters, so the output of tr is a stream of '>' characters. wc -c counts the bytes in that stream, which is the number of '>' characters in the fasta file, and therefore the number of sequences in the file.
In the second example, the -c option tells grep to output the number of lines that match the query regex. The regex '^>' matches only a '>' character at the beginning of a line. Notably, the second one will not count an extra '>' found later in the line. To count every occurrence, you'd have to use:
grep -o '>' sequences.fasta | wc -l
Interestingly, the tr and grep commands seem to run at about the same speed. I expected grep to be slower since it uses regular expressions. That they run at the same speed leads me to believe that they are disk I/O bound. Because they are similar in speed, but grep is more powerful, I would recommend using that construction.
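To see how the three counts differ, here is a minimal sketch on a throwaway file. The file name and contents are made up for illustration, and the first header deliberately contains a second '>' so that the per-line and per-character counts disagree.

```shell
# Build a tiny FASTA file (hypothetical contents) in a temporary directory.
tmpdir=$(mktemp -d)
fasta="$tmpdir/demo.fasta"
printf '>seq1 sample>A\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$fasta"

# Count every '>' byte in the file (tr -d ' ' strips padding some wc versions add).
tr_count=$(tr -cd '>' < "$fasta" | wc -c | tr -d ' ')

# Count lines that start with '>'.
grep_count=$(grep -c '^>' "$fasta")

# Count every '>' occurrence, one per output line of grep -o.
grep_o_count=$(grep -o '>' "$fasta" | wc -l | tr -d ' ')

echo "tr: $tr_count  grep -c: $grep_count  grep -o: $grep_o_count"
rm -r "$tmpdir"
```

Only grep -c '^>' reports the number of sequences here, which is another reason to prefer it when a header might contain a stray '>'.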
Convenient output from ls:
ls -alth
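-a includes hidden files, -l gives the long listing, -t sorts by modification time (newest first), and -h prints sizes in human-readable units. Of course this will change depending on the situation, but I find this to be a good general-purpose set of options for ls.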
Batch rename:
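Use bash parameter expansion. The loop below renames each .mol file to the first three characters of its name plus the extension (files that share a three-character prefix would overwrite each other, so check first):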
for i in *.mol ; do mv "$i" "${i: 0:3}.mol" ; done
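The substring form keeps a fixed number of characters. When the part you want to keep varies in length, a pattern-removal expansion may fit better. This is only a sketch: the file names are hypothetical and I am assuming the prefix you want ends at the first underscore.

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
touch "$tmpdir/ab1_run3_final.mol" "$tmpdir/xyz22_draft.mol"

(
    cd "$tmpdir" || exit 1
    for i in *.mol; do
        # ${i%%_*} removes the longest suffix matching "_*",
        # i.e. everything from the first underscore onward.
        new="${i%%_*}.mol"
        echo "renaming: $i -> $new"
        mv -- "$i" "$new"
    done
)

# Collect the resulting names on one line, then clean up.
result=$(ls "$tmpdir" | tr '\n' ' ')
echo "$result"
rm -r "$tmpdir"
```

Replacing the mv line with another echo gives a dry run, which is worth doing before any batch rename. A name without an underscore would pass through unchanged and end up as name.mol.mol, so this pattern only suits files that follow the assumed naming scheme.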
List processes along with the command used to execute the process:
ps -af
-a shows all processes associated with a terminal. -f shows the full listing for each process (the command-line options in addition to the process name).
To see all processes currently running, use ps -A.
Simple multi-threading with GNU parallel:
GNU parallel makes it easy to parallelize any bash script. The -j option specifies how many calls to have active at one time. The positional parameter tells it which process to call. The items listed after ::: are the parameters to pass to the process when it is called. For example, the script:
THREADS=2
function thread() {
    arg=$1
}
export -f thread
parallel -j $THREADS thread ::: instance_1_argument instance_2_argument instance_3_argument
Will call the function "thread" a total of three times, such that no more than 2 calls are running concurrently. First it will call thread(instance_1_argument), then thread(instance_2_argument). When one of those finishes, it will call thread(instance_3_argument).
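For something you can run and watch, here is a smaller sketch that calls echo instead of an exported function. The -k option (keep output in input order) is my addition and not part of the script above, and the check at the top is an assumption that you may have either no parallel at all or the unrelated moreutils tool of the same name.

```shell
# Only run the demo when GNU parallel is the "parallel" on the PATH.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # -j 2: at most two jobs at once; -k: print results in input order.
    # Each item after ::: is appended to the command, so this runs
    # "echo processing instance_1_argument", and so on for the other two.
    out=$(parallel -j 2 -k echo processing ::: instance_1_argument instance_2_argument instance_3_argument)
else
    out="GNU parallel not found"
fi
echo "$out"
```

Without -k the lines come out in whatever order the jobs finish, which is fine for independent work but confusing when you are checking results by eye.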
Sort file, but skip the header:
#The default for awk, if no action is given, is to print the whole line; that's what it does for line 1 (where NR == 1).
#If NR > 1, awk does the action stated there, which is to output the whole line (print $0), but instead of printing it
#to stdout, it pipes it to sort. Sort then sorts it by the second tab-separated field. In the join script further down, LANG=en_EN is there so that sort and join are consistent.
awk 'NR == 1; NR > 1 {print $0 | "sort -t \"\t\" -b -k 2"}'
http://www.unix.com/302178872-post5.html
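Here is a small sketch of the same awk construction on a made-up three-row table, so you can see the header stay put while the rows move. The file name and contents are hypothetical.

```shell
# Build a small tab-separated file with a header row (hypothetical data).
tmpdir=$(mktemp -d)
tsv="$tmpdir/demo.tsv"
printf 'id\tgene\n1\tzeta\n2\talpha\n3\tmu\n' > "$tsv"

# Line 1 is printed directly; every later line is piped to sort,
# which orders the rows by the second tab-separated field.
sorted=$(awk 'NR == 1; NR > 1 {print $0 | "sort -t \"\t\" -k 2"}' "$tsv")

printf '%s\n' "$sorted"
rm -r "$tmpdir"
```

The rows should come back in the order alpha, mu, zeta with the id/gene header still on the first line.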
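This command prints the first row (the header) unchanged, then sorts the rest of the file on a key that starts at the second tab-separated field (-k 2 runs from field 2 to the end of the line; use -k2,2 to restrict the key to that field alone).
When tethering sort to join, use LANG=en_EN: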
http://unix.stackexchange.com/questions/12942/join-file-2-not-in-sorted-order
I had a heck of a time with this one.
This script:
awk 'NR == 1; NR > 1 {print $0 | "LANG=en_EN sort -t \"\t\" -k2"}' < $OUTPUT > ${OUTPUT}.sorted
awk 'NR == 1; NR > 1 {print $0 | "LANG=en_EN sort -t \"\t\" -k1"}' < $EXPRESSION > ${EXPRESSION}.sorted
LANG=en_EN join -t$'\t' --header -1 2 -2 1 ${OUTPUT}.sorted ${EXPRESSION}.sorted > ${OUTPUT}.expression
absolutely refused to work until I added that weird hack.
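A stripped-down sketch of the same sort-then-join pipeline on two made-up tables is below. The file names and data are hypothetical. I use LC_ALL=C here rather than LANG=en_EN, on the assumption that what matters is that sort and join share one collation order, and I restrict each sort key to the join field with -k2,2 and -k1,1. The --header option needs GNU join.

```shell
tmpdir=$(mktemp -d)
tab=$(printf '\t')

# Left table: gene name in column 2. Right table: gene name in column 1.
printf 'sample\tgene\ns1\tgeneB\ns2\tgeneA\n' > "$tmpdir/output.tsv"
printf 'gene\tlevel\ngeneB\t7\ngeneA\t3\n'    > "$tmpdir/expression.tsv"

# Sort each file on its join field, leaving the header row in place.
awk 'NR == 1; NR > 1 {print $0 | "LC_ALL=C sort -t \"\t\" -k2,2"}' \
    "$tmpdir/output.tsv" > "$tmpdir/output.sorted"
awk 'NR == 1; NR > 1 {print $0 | "LC_ALL=C sort -t \"\t\" -k1,1"}' \
    "$tmpdir/expression.tsv" > "$tmpdir/expression.sorted"

# Join column 2 of the left file with column 1 of the right file,
# using the same locale that was used for sorting.
joined=$(LC_ALL=C join -t "$tab" --header -1 2 -2 1 \
    "$tmpdir/output.sorted" "$tmpdir/expression.sorted")

printf '%s\n' "$joined"
rm -r "$tmpdir"
```

join prints the join field first, then the remaining columns of the left file, then those of the right file, so the header comes out as gene, sample, level.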
additional references:
http://www.grymoire.com/