Tuesday, August 4, 2009

Writing a makefile instead of a "pipeline" script

Hello and welcome to all data lovers,

This first post is about writing a makefile instead of a "pipeline" script. (I'm talking about the type of datamining work usually done on UNIX-like systems using shell scripts (or other kinds) that run various command lines. e.g. bioinformatics data mining, which is my field of work)

I knew this to be the right way of getting things done more than 3 years ago (when I witnessed the guru Ninio-san doing it) but was too lazy to try it. Now that I finally did it for the first time, I feel silly that I ever wrote a pipeline script such as this one:

prank -F $1.fas > $1.aln
convertMsaFormat.pl $1.aln $1.msf
phyloAnalysis $1.msf > $1.phyoResults
grep 'score =' $1.phyoResults > $1.scores
plotNiceFig.R $1.scores

($1 is the first command-line parameter in csh scripts)

Such a script is the most straightforward way to run a pipeline of analyses, but it has a significant weakness: If one of the stages fails, (e.g. you forgot to give a certain parameter to phyloAnalysis) then you either have to remove the first lines of the script that ran successfully, or rerun the whole pipeline from the beginning and waste running time.

This problem is precisely what the "make" language was invented for. I assume you have some experience with makefiles in compiling C/C++ etc. If you don't - see this nice tutorial. Make can run any list of shell commands that make files with dependencies among them. The linear pipeline above is a simple scenario of dependencies, where the output file of each command depends on the output of the previous command. If you use a makefile, make will automatically check if some of the first stages of the pipeline were already run successfully and continue from the last successful output file.

Here is how you should write the above script as a makefile:
#!/usr/bin/make -f
all : $(seq).jpeg
$(seq).aln : $(seq).fas
prank -F $^ > $@
$(seq).msf : $(seq).aln
convertMsaFormat.pl $^ $@
$(seq).phyoResults : $(seq).msf
phyloAnalysis $^ > $@
$(seq).scores : $(seq).phyoResults
grep 'score ='$^ > $@
$(seq).jpeg : $(seq).scores
plotNiceFig.R $(seq).scores

$@ is the target file and $^ is the file(s) that the target depends upon.
"all" is a list of the final products - in this case just the last one.
To run this make file, simply chmod +x the file and then you can run it as an executable, exactly like you would with a shell/perl script. The $(seq) variable can be passed in the command line like so:
> runPhyloAnalysis.makefile seq=ThyA.Feb2012resubmission
(if you named the makefile "runPhyloAnalysis.makefile")

Of course, you can run this same makefile on any number of datasets, and even submit these "make" commands to a queuing system (e.g. qsub or condor_submit it) on a cluster.

Try it! I guarantee it will "make" your datamining experience a little more comfortable. But before you fully commit to makefile pipelines I should give a word of caution: When life becomes complicated the make syntax can be very difficult to grasp and may become difficult to debug. Some people claim that the make language is no good (e.g. 1, 2). There are indeed potential difficulties, which I am not going to dwelve on here - I'm not a makefile expert and others have done a better job of this. I will just say that if you keep your makefiles simple and don't become too ambitious, you will enjoy the benefits without running into too many of the pitfalls. If you find yourself trying to do something too sophisticated, which is not easily provided by the make syntax (or that you don't know enough about makefiles in order to implement it) then move this part of the logic outside of the makefile. Afterall, makefiles can run scripts and be run by scripts, so if you can't do it in a makefile then you're back to the well known task of script writing, with all it's pros and cons. You have lost nothing.

References to a few other pages about makefile pipelines:
http://biowiki.org/MakefileManifesto - More advanced stuff: Tips, issues and workarounds
http://biowiki.org/MakeComparison - A comparison of many "make" tools