Friday, March 7, 2014

Perl oneliners: not just a bad habit

I know, I know... You really shouldn't use oneliners. That's just being lazy! You should put your code inside a proper file. But... should you really? This post lays out the advantages of Perl oneliners relative to scripts saved in files.

Here's the setting I'm talking about: You're working in your shell running various programs and piped commands, manipulating files, creating intermediate files and result files in a long multi-step data analysis process. The point is that you're not running a polished pipeline implemented as a well structured, QA'ed program. Nor are you interested in implementing such a pipeline that can be reused many times. You're assembling this process as you go, you're changing the way things are done, experimenting, testing, modifying, and fine-tuning. Maybe eventually you'll turn this into a properly implemented pipeline. Maybe you won't.

There are different ways to do this. One way is to write scripts. You could write one script that runs all the steps from beginning to end, but the downside is that you can't easily modify step 7 and rerun from that point onwards. So you could split the script into many small scripts. That's fine, but when you start modifying and rerunning the scripts for various steps, you have to carefully keep track of the various versions of the various scripts and which version you used on which data files. That creates a potentially messy logging business. A very different and very good alternative (described in detail in a previous blog post) is to turn the big script into one big makefile. While this solution has significant advantages, it can also get quite tricky to figure out how to formulate a complex pipeline as a makefile, not to mention makefile debugging...

So here comes the simple, "lazy" solution of Perl oneliners. Instead of writing many short scripts you write them as oneliners. This way you see exactly what code was used on each data file in each step. Of course, you must keep a log of all shell commands. This can be done effortlessly using another simple, "lazy" solution described in another previous blog post: running your shell in emacs and saving the whole shell session as a text file. As described in that post, this setup also gives you powerful search capabilities to search back to the command that was used to process a specific file name. You can then copy a oneliner, modify it as needed, and run it again on a different file. You can even document your oneliner with a comment right there inside your shell. In this way the exact version of the code in that tiny Perl script is clearly apparent and logged in the shell session itself.

An example will illustrate this best:

$ head -3 in
1 1 one1 aaa GO:12^foo
2 0 two2 bbb GO:34^bar
3 0 three3 abc GO:12^foo`GO:34^bar
$ cat in | awk -F'\t' -v OFS='\t' '{print $3,$5}' | sort > in.2word
$ head -3 in.2word
one1 GO:12^foo
three3 GO:12^foo`GO:34^bar
two2 GO:34^bar
$ cat in.2word | perl -wple 'if (/(\d+)/) {$_.="\t".$1*2}' > in.2word.mult
$ head -3 in.2word.mult
one1 GO:12^foo 2
three3 GO:12^foo`GO:34^bar 6
two2 GO:34^bar 4
$ cat in.2word.mult | perl -we 'while (<>) {chomp; @a=split "\t", $_; @go=split("`",$a[1]); for (@go) {print "$a[0]\t$_\t$a[2]\n"}}' > in.2word.mult.1goPerLine   # Split the GO terms and print multiple lines, one per term
$ head -4 in.2word.mult.1goPerLine
one1 GO:12^foo 2
three3 GO:12^foo 6
three3 GO:34^bar 6
two2 GO:34^bar 4
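
A side note: the same two steps can also be written a bit more tersely with Perl's autosplit switches (-a, -F) and automatic newline handling (-l). This is just a sketch of an equivalent formulation, not what I ran above, and it assumes (as in the toy data) that every name contains a digit:

$ perl -F'\t' -lane '($n) = /(\d+)/; print "$_\t" . $n*2' in.2word > in.2word.mult
$ perl -F'\t' -lane 'print "$F[0]\t$_\t$F[2]" for split /`/, $F[1]' in.2word.mult > in.2word.mult.1goPerLine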

Lovely. You got the job (whatever that was?!) done relatively easily. And zero time wasted on creating script files and saving your shell commands to a log file. But wait, now you realize that you need exponentiation rather than multiplication by 2 in the second step. No prob! Copy that oneliner, modify, and rerun it and the following step(s):

$ cat in.2word | perl -wple 'if (/(\d+)/) {$_.="\t".$1**2}' > in.2word.exp
$ head -3 in.2word.exp
one1 GO:12^foo 1
three3 GO:12^foo`GO:34^bar 9
two2 GO:34^bar 4
$ cat in.2word.exp | perl -we 'while (<>) {chomp; @a=split "\t", $_; @go=split("`",$a[1]); for (@go) {print "$a[0]\t$_\t$a[2]\n"}}' > in.2word.exp.1goPerLine   # Split the GO terms and print multiple lines, one per term
$ head -4 in.2word.exp.1goPerLine
one1 GO:12^foo 1
three3 GO:12^foo 9
three3 GO:34^bar 9
two2 GO:34^bar 4

Note that I decided to change the file names accordingly - maybe because I'm not sure whether I'll end up using the results of the first version after all, so I'd rather not overwrite them. Seems like a lot of manual text work? Not really. With emacs' powerful search, copy, and search & replace, this takes under two minutes even if you're a slow old-timer like me.

Not convinced? Well, maybe it's not for you. We each find the work practices that suit us best. Still, it's worthwhile experimenting with other people's practices; you may find some nifty techniques that nicely complement your way of doing things. So I'd suggest trying the shell-in-emacs tricks if you haven't already, and then using oneliners for a while. Give it a fair evaluation and you'll see whether there's something in it for you.

Wednesday, October 5, 2011

Perl vs. Python dog fighting

So Python is the hot new thing and I, as an old Perl dog, should make a statement on the subject. You're guessing I'm going to tell you why Perl is better? You guessed right! Specifically, I want to address the arguments that Python encourages better programming practices and that it is better for teaching. Both are true. However, as educators (in the field of bioinformatics but also elsewhere) we need to consider what our ultimate goal is - an easy training process, or the skilled people that the training produces. The analogy that immediately comes to mind is, naturally, fighter training in the Israel Air Force. In the first decades of Israeli history, when the threat of annihilation was imminent and the culture was very much influenced by the Soviet model, Israeli fighter pilots took a lot of "stupid" risks during training for combat. As a result, some classes lost as many as 50% of their fighter pilots over their 20 years of service, and training accidents were more common than KIAs. The other result was that Israeli pilots could win fights against less well trained pilots even when heavily outnumbered. This practice has been reversed in more recent decades (after the stabilization of the state and the influence of American culture...), and now training accidents are much less common. We cannot judge whether Israeli pilots are worse dogfighters nowadays, because (a) the last heavy air-to-air fighting was in 1982 and (b) dogfighting is less important with modern fighter jets.

Nevertheless, the point remains and the analogy is useful. Perl requires harder training but is ultimately more powerful. Python is nicer for teaching but more limited. Python is a more constrained language, with a stricter syntax and object-oriented design, which are attractive qualities for teaching programming. Perl is the "Pathologically Eclectic Rubbish Lister" (Larry Wall). It doesn't stop the programmer from using bad programming practices, and it is true that you can find a lot of badly written Perl code out there. But this is not because Perl is a bad language; it is because some Perl programmers lack a good education in programming. Guns don't kill people. People kill people. A well written program can be written in any language if the programmer knows how to structure the code intelligently and to make syntax choices that produce clear, understandable code (rather than the notorious code obfuscation that Perl makes possible). So it is up to us as educators to produce students well educated in good coding practices.

The flip side of this coin is that Perl is much more versatile, flexible, and powerful than Python (and other alternatives). It allows more efficient development of new code, i.e., it shortens the time and effort from the moment you sit down to write a bit of code to the moment you can run it and get your results. This is highly desirable for practical programming such as short scripting for bioinformatic analysis. So as educators of practical programmers we should make the effort to teach good programming practices while using the most powerful language available to our students. At the present time, especially for bioinformatics, that language is Perl. Teaching constrained languages to "force" our students into good practices is taking the easy way out as educators and handicapping the future potential of the people we train.

Monday, September 12, 2011

GUIDANCE and MAFFT unite!

This is a very exciting day - today the MAFFT sequence alignment web server added a button to its results page that redirects users to the GUIDANCE web server. This will encourage the many MAFFT users to test the robustness of their alignments using GUIDANCE. I'm hoping we'll get substantial exposure through MAFFT and raise awareness of alignment confidence measures and of our GUIDANCE server.

This was made possible by a collaboration with Prof. Kazutaka Katoh from Kyushu University, who developed MAFFT. He generously devoted his time to coordinating the integration of the two servers. On the Tel Aviv side the credit goes to Haim Ashkenazy, who developed the GUIDANCE web server.

Saturday, September 5, 2009

Making the most of a shell session inside emacs

This post is intended for people who do a lot of their work inside a shell session and who are also familiar with the (highly recommended) emacs text editor. If you've never seen a shell session inside emacs, here is how it's done: simply open an emacs window and type "M-x shell" (M is the Meta key, usually Alt). I have a very useful key binding for this (M-s).

Below I will explain how to save your whole shell session as a text file, and make smart use of it as a log file for your ongoing work. This trick is priceless for me - since I need reminders for what I did yesterday, or even 30 minutes ago - but it can also be valuable for people with better memory...

Imagine you're working in a shell session in your emacs at the office/lab, and then you have to run home to feed your cat. Once you're home and your cat is fast asleep, you realize that one of the parameters in your last command was wrong, and you need to correct it and rerun it. However, when you ssh in and open a new shell, you have to retype your command line from memory.

There is a better way of living, and it's really simple!
Before you close your emacs, save your "*shell*" buffer as a file (a normal C-x C-w; it's best to keep the default name "*shell*" as it is). Then, when you open a new emacs: first open this file as a normal file (C-x C-f), and then type "M-x shell" to start a shell session as usual. This buffer of a normal text file will now become a live shell session once again, and you have all your previous command lines to refer back to.
(Note that this is a new shell session, so you can't use the shell history to rerun your previous commands, but you can search back in this buffer and copy those commands)

I hope this makes you more satisfied with your daily emacs experience! :-)

A couple of small but useful additions:

1) Add the date to your prompt, so that it looks like so:
12.10:55 ~/homingEndonuc/hedb/genome.A_thaliana>

This will help you when searching back through your old command lines. (Maybe you're looking for a specific qsub command that you ran on Thursday, and you're getting confused by all the other qsub commands you ran in the last few days.)

The definition of your prompt template is usually done in your ~/.cshrc file (for C shell users). Mine looks like so:
set prompt = "%B%D.%T%b %~> "
(%D is the day in the month, %T is the time)

Or even a full date if you like:

20120531.10:55 ~/homingEndonuc/hedb/genome.A_thaliana>
like so:
set prompt = "%B%Y%W%D.%T%b %~> "
(%Y is the year, %W is the month)

For bash, the equivalent goes in your ~/.bashrc:

PS1="\D{%Y%m%d}.\A \h:\w\$ "
(\D{%Y%m%d} is the date in YYYYMMDD format, \A is the time)


2) Compound (regexp) search for old commandlines in the shell session

After a while you are going to have some shell session(s) with several months of work, which might contain thousands of command lines and hundreds of thousands of other (output) lines. In order to make the best use of this "logfile" you need a powerful and convenient way to find specific command lines of interest in this mess.

What you would commonly do in order to find a previous command line is simply to search back (C-r) for the name of a file or a script, such as "grepFastaFromFile.pl". But it is quite likely that in the course of your work you applied this script many times, and you now want to find the command that you ran to create one specific file, e.g. "thyA.archaea.3.fasta". Let's say that the full line was:

$ grepFastaFromFile.pl all.fasta regex.list > thyA.archaea.3.fasta

So what you want to do is to search with a regular expression such as:
grepFastaFromFile\.pl.*thyA.archaea.3.fasta
Sometimes this would be enough:
grepFastaF.*3\.fasta

Well, this is (naturally) a built-in operation in emacs, and it naturally has a shortcut:
M-C-s and M-C-r (Alt-Control-s / Alt-Control-r)
(short for M-x isearch-forward-regexp and M-x isearch-backward-regexp)

Try it!

3) Simply replace

Want to run a modified version of a previous long command line? E.g. run with a different input file name, and also use that name as the prefix for the output file(s). Get a copy of the old command line from the shell history (M-p or C-up-arrow) and then use emacs find-and-replace (M-%) to replace the old file name with the new one, wherever it appears in the command line.


4) Auto-complete

Emacs allows all sorts of word auto-completion. The most basic default is invoked by M-/ which simply searches through the current buffer for completions, and then through other open buffers. A nice usage example in a shell buffer: You want to cd to a directory where you did some stuff yesterday. You don't need to search back and copy the path. Simply start typing:
$ cd /scrat
Then hit M-/ and emacs will miraculously complete:

$ cd /scratch/local/yearly/eprivman/ants/genomeAln
This was the last path with that prefix in my shell buffer.

Tuesday, August 4, 2009

Writing a makefile instead of a "pipeline" script

Hello and welcome to all data lovers,

This first post is about writing a makefile instead of a "pipeline" script. (I'm talking about the type of datamining work usually done on UNIX-like systems using shell scripts, or other kinds of scripts, that run various command lines - e.g. bioinformatics data mining, which is my field of work.)

I knew this to be the right way of getting things done more than 3 years ago (when I witnessed the guru Ninio-san doing it), but I was too lazy to try it. Now that I've finally done it for the first time, I feel silly that I ever wrote a pipeline script such as this one:

#!/bin/csh
prank -F $1.fas > $1.aln
convertMsaFormat.pl $1.aln $1.msf
phyloAnalysis $1.msf > $1.phyloResults
grep 'score =' $1.phyloResults > $1.scores
plotNiceFig.R $1.scores

($1 is the first command-line parameter in csh scripts)

Such a script is the most straightforward way to run a pipeline of analyses, but it has a significant weakness: if one of the stages fails (e.g. you forgot to give a certain parameter to phyloAnalysis), then you either have to remove the first lines of the script that ran successfully, or rerun the whole pipeline from the beginning and waste running time.

This problem is precisely what the "make" language was invented for. I assume you have some experience with makefiles for compiling C/C++ etc. If you don't - see this nice tutorial. Make can run any set of shell commands that create files with dependencies among them. The linear pipeline above is a simple scenario of dependencies, where the output file of each command depends on the output of the previous command. If you use a makefile, make will automatically check which of the early stages of the pipeline already ran successfully and will continue from the last successful output file.

Here is how you should write the above script as a makefile:
#!/usr/bin/make -f
all : $(seq).jpeg
$(seq).aln : $(seq).fas
	prank -F $^ > $@
$(seq).msf : $(seq).aln
	convertMsaFormat.pl $^ $@
$(seq).phyloResults : $(seq).msf
	phyloAnalysis $^ > $@
$(seq).scores : $(seq).phyloResults
	grep 'score =' $^ > $@
$(seq).jpeg : $(seq).scores
	plotNiceFig.R $(seq).scores

$@ is the target file and $^ is the file(s) that the target depends upon. Note that each command line in a rule must start with a tab character.
"all" is a list of the final products - in this case just the last one.
To run this makefile, simply chmod +x the file and then you can run it as an executable, exactly like you would a shell/perl script. The $(seq) variable can be passed on the command line like so:
> runPhyloAnalysis.makefile seq=ThyA.Feb2012resubmission
(if you named the makefile "runPhyloAnalysis.makefile")

Of course, you can run this same makefile on any number of datasets, and even submit these "make" commands to a queuing system (e.g. qsub or condor_submit it) on a cluster.
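
For example, here is a minimal sketch of such a loop, assuming a bash-like shell, an SGE-style qsub, and some made-up dataset names (each with a corresponding .fas file):

for s in thyA.archaea thyA.bacteria thyA.fungi; do
    # -cwd: run the job in the current directory; -b y: submit the command directly
    qsub -cwd -b y ./runPhyloAnalysis.makefile seq=$s
done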

Try it! I guarantee it will "make" your datamining experience a little more comfortable. But before you fully commit to makefile pipelines I should give a word of caution: when life becomes complicated, the make syntax can be very difficult to grasp and may become difficult to debug. Some people claim that the make language is no good (e.g. 1, 2). There are indeed potential difficulties, which I am not going to dwell on here - I'm not a makefile expert and others have done a better job of this. I will just say that if you keep your makefiles simple and don't become too ambitious, you will enjoy the benefits without running into too many of the pitfalls. If you find yourself trying to do something too sophisticated, which is not easily provided by the make syntax (or that you don't know enough about makefiles to implement), then move this part of the logic outside of the makefile. After all, makefiles can run scripts and be run by scripts, so if you can't do it in a makefile then you're back to the well known task of script writing, with all its pros and cons. You have lost nothing.

References to a few other pages about makefile pipelines:
http://www.nodalpoint.org/2007/03/18/a_pipeline_is_a_makefile
http://biowiki.org/MakefileManifesto - More advanced stuff: Tips, issues and workarounds
http://biowiki.org/MakeComparison - A comparison of many "make" tools