Life of a dataminer: 2009

Saturday, September 5, 2009

Making the most of a shell session inside emacs

This post is intended for people who do a lot of their work inside a shell session, who are also familiar with the (highly recommended) emacs text editor. If you've never seen a shell session inside emacs - here is how it's done: Simply open an emacs window and type "M-x shell" (M stands for the Alt key). I have a very useful short-key for this (M-s).

Below I will explain how to save your whole shell session as a text file, and make smart use of it as a log file for your ongoing work. This trick is priceless for me - since I need reminders for what I did yesterday, or even 30 minutes ago - but it can also be valuable for people with better memory...

Imagine you're working in a shell session in your emacs at the office/lab, and then you have to run home to feed your cat. After you got home and your cat is now fast asleep, you realize that in your last command one of the parameters was wrong, and you need to correct and rerun it. However, when you ssh and open a new shell you have to rewrite your command line from memory.

There is a better way of living, and it's really simple!
Before you close your emacs, save your "*shell*" buffer as a file (normal C-x C-w ; it's best to name it by the default name "*shell*" as it is). Then, when you open a new emacs: first open this file as a normal file (C-x C-f), and then type "M-x shell" to start a shell session as you usual. Now this buffer of a normal text file will become a live shell session once again, and you have all your previous command lines to refer back to.
(Note that this is a new shell session, so you can't use the shell history to rerun your previous commands, but you can search back in this buffer and copy those commands)

I hope this would make you more satisfied with your daily emacs experience! :-)

A couple of small but useful additions:

1) Add the date to your prompt, so that it looks like so:
12.10:55 ~/homingEndonuc/hedb/genome.A_thaliana>

This will help you searching back through your old command lines. (Maybe you're looking for a specific qsub command that you ran on Thursday, and you get confused with all the other qsub commands you ran in the last few days)

The definition of your prompt template is usually done in your ~/.cshrc file (for C shell users). Mine looks like so:
set prompt = set prompt = "%B%D.%T%b %~> "
(%D is the day in the month, %T is the time)

Or even a full date if you like:

20120531.10:55 ~/homingEndonuc/hedb/genome.A_thaliana>

like so:
set prompt = set prompt = "%B%Y%W%D.%T%b %~> "
(%Y is the year, %W is the month)

For bash:

PS1 = "\D{%Y%m%d}.\A \h:\w\$ "

2) Compound (regexp) search for old commandlines in the shell session

After a while you are going have some shell session(s) wth several months of work, which might contain thousands of commandlines and hundreds of thousands of other (output) lines. In order to make the best use of this "logfile" you need a powerful and convenient way to find specific commandlines of interest in this mess.

What you would commonly do in order to find a previous command line is simply to search back (C-r) for a name of a file or a script, such as "grepFastaFromFile.pl". But it is quite likely that in the course of your work you applied this script many times, and you now want to find the command that you ran to create a specific file, e.g. "thyA.archaea.3.fasta". Let's say that the full line was:

$ grepFastaFromFile.pl all.fasta regex.list > thyA.archaea.3.fasta

So what you want to do is to search with a regular expression such as:

grepFastaFromFile\.pl.*thyA.archaea.3.fasta

Sometimes this would be enough:

grepFastaF.*3\.fasta

Well, this is (naturally) a built-in operation in emacs, that naturally has a shortcut:

M-C-s and M-C-r (alt-conrol-r)

(short for M-x isearch-forward-regexp)

Try it!

3) Simply replace

Want to run a modified version of a previous long command line? E.g. Run with a different input file name and also with this name as the prefix for the output file(s). Get a copy of the old commandline from the shell history (M-p or C-up-arrow) and then use the emacs find-and-replace (M-%) to replace a the old file name with the new one, wherever it appears in the commandline.

4) Auto-complete

Emacs allows all sorts of word auto-completion. The most basic default is invoked by M-/ which simply searches through the current buffer for completions, and then through other open buffers. A nice usage example in a shell buffer: You want to cd to a directory where you did some stuff yesterday. You don't need to search back and copy the path. Simply start typing:
$ cd /scrat
Then hit M-/ and emacs will miraculously complete:

$ cd /scratch/local/yearly/eprivman/ants/genomeAln

Which was the last path with the same prefix in my shell buffer.

Tuesday, August 4, 2009

Writing a makefile instead of a "pipeline" script

Hello and welcome to all data lovers,

This first post is about writing a makefile instead of a "pipeline" script. (I'm talking about the type of datamining work usually done on UNIX-like systems using shell scripts (or other kinds) that run various command lines. e.g. bioinformatics data mining, which is my field of work)

I knew this to be the right way of getting things done more than 3 years ago (when I witnessed the guru Ninio-san doing it) but was too lazy to try it. Now that I finally did it for the first time, I feel silly that I ever wrote a pipeline script such as this one:

#!/bin/csh

prank -F $1.fas > $1.aln

convertMsaFormat.pl $1.aln $1.msf

phyloAnalysis $1.msf > $1.phyoResults

grep 'score =' $1.phyoResults > $1.scores

plotNiceFig.R $1.scores

($1 is the first command-line parameter in csh scripts)

Such a script is the most straightforward way to run a pipeline of analyses, but it has a significant weakness: If one of the stages fails, (e.g. you forgot to give a certain parameter to phyloAnalysis) then you either have to remove the first lines of the script that ran successfully, or rerun the whole pipeline from the beginning and waste running time.

This problem is precisely what the "make" language was invented for. I assume you have some experience with makefiles in compiling C/C++ etc. If you don't - see this nice tutorial. Make can run any list of shell commands that make files with dependencies among them. The linear pipeline above is a simple scenario of dependencies, where the output file of each command depends on the output of the previous command. If you use a makefile, make will automatically check if some of the first stages of the pipeline were already run successfully and continue from the last successful output file.

Here is how you should write the above script as a makefile:

#!/usr/bin/make -f

all : $(seq).jpeg

$(seq).aln : $(seq).fas

prank -F $^ > $@

$(seq).msf : $(seq).aln

convertMsaFormat.pl $^ $@

$(seq).phyoResults : $(seq).msf

phyloAnalysis $^ > $@

$(seq).scores : $(seq).phyoResults

grep 'score ='$^ > $@

$(seq).jpeg : $(seq).scores

plotNiceFig.R $(seq).scores

$@ is the target file and $^ is the file(s) that the target depends upon.
"all" is a list of the final products - in this case just the last one.
To run this make file, simply chmod +x the file and then you can run it as an executable, exactly like you would with a shell/perl script. The $(seq) variable can be passed in the command line like so:
> runPhyloAnalysis.makefile seq=ThyA.Feb2012resubmission
(if you named the makefile "runPhyloAnalysis.makefile")

Of course, you can run this same makefile on any number of datasets, and even submit these "make" commands to a queuing system (e.g. qsub or condor_submit it) on a cluster.

Try it! I guarantee it will "make" your datamining experience a little more comfortable. But before you fully commit to makefile pipelines I should give a word of caution: When life becomes complicated the make syntax can be very difficult to grasp and may become difficult to debug. Some people claim that the make language is no good (e.g. 1, 2). There are indeed potential difficulties, which I am not going to dwelve on here - I'm not a makefile expert and others have done a better job of this. I will just say that if you keep your makefiles simple and don't become too ambitious, you will enjoy the benefits without running into too many of the pitfalls. If you find yourself trying to do something too sophisticated, which is not easily provided by the make syntax (or that you don't know enough about makefiles in order to implement it) then move this part of the logic outside of the makefile. Afterall, makefiles can run scripts and be run by scripts, so if you can't do it in a makefile then you're back to the well known task of script writing, with all it's pros and cons. You have lost nothing.

References to a few other pages about makefile pipelines:
http://www.nodalpoint.org/2007/03/18/a_pipeline_is_a_makefile
http://biowiki.org/MakefileManifesto - More advanced stuff: Tips, issues and workarounds
http://biowiki.org/MakeComparison - A comparison of many "make" tools

Life of a dataminer

Saturday, September 5, 2009

Making the most of a shell session inside emacs

Tuesday, August 4, 2009

Writing a makefile instead of a "pipeline" script

Blog Archive

About Me

Followers