Life of a dataminer: Perl oneliners: not just a bad habit

I know I know... You really shouldn't use oneliners. That's just being lazy! You should put your code inside a proper file. But... should you really? This post will illuminate advantages of Perl oneliners relative to scripts in files.

Here's the setting I'm talking about: You're working in your shell running various programs and piped commands, manipulating files, creating intermediate files and result files in a long multi-step data analysis process. The point is that you're not running a polished pipeline implemented as a well structured, QA'ed program. Nor are you interested in implementing such a pipeline that can be reused many times. You're assembling this process as you go, you're changing the way things are done, experimenting, testing, modifying, and fine-tuning. Maybe eventually you'll turn this into a properly implemented pipeline. Maybe you won't.

There are different ways to do this. One way is to write scripts. You could write one script that runs all the steps from beginning to end. But the downside to this is that you can't easily modify step 7 and rerun from that point onwards. So you could split the script to many small scripts. That's fine, but when you start modifying and rerunning the scripts for various steps you'd have to carefully keep track of the various versions of the various scripts and which version you used on which data files. That creates a potentially messy logging business. A very different and very good alternative (described in detail in a previous blog post) is to turn the big script into one big makefile. While this solution has significant advantages it can also get quite tricky to figure out how to formulate a complex pipeline as a makefile, not to mention makefile debugging...

So here comes the simple, "lazy" solution of Perl oneliners. Instead of writing many short scripts you write them as oneliners. This way you see exactly what code was used on each data file in each step. Of course, you must keep a log of all shell commands. This can be done effortlessly using another simple, "lazy" solution described in another previous blog post: running your shell in emacs and saving the whole shell session as a text file. As described in that post, this setup also gives you powerful search capabilities to search back to the command that was used to process a specific file name. You can then copy a oneliner, modify it as needed, and run it again on a different file. You can even document your oneliner with a comment right there inside your shell. In this way the exact version of the code in that tiny Perl script is clearly apparent and logged in the shell session itself.

An example would illustrate best:

$ head -3 in
1 1 one1 aaa GO:12^foo
2 0 two2 bbb GO:34^bar
3 0 three3 abc GO:12^foo`GO:34^bar
$ cat in | awk -F\t '{print $3,$5}' | sort > in.2word
$ head -3 in.2word
one1 GO:12^foo
three3 GO:12^foo`GO:34^bar
two2 GO:34^bar
$ cat in.2word | perl -wpe 'if (/(\d+)/) {$_.="\t".$1*2}' > in.2word.mult
$ head -3 in1.2word.mult
one1 GO:12^foo 2
three3 GO:12^foo`GO:34^bar 6
two2 GO:34^bar 4
$ cat in.2word.uniq.mult | perl -we 'while (<>) {chomp; @a=split "\t", $_; @go=split("`",$a[1]); for (@go) {print "$a[0]\t$_\t$a[2]\n"}}' > in.2word.uniq.mult.1goPerLine # Split the GO terms and print instead multiple lines, one for each term
$ head -4 in.2word.uniq.mult.1goPerLine

one1 GO:12^foo 2
three3 GO:12^foo 6

three3 GO:34^bar 6

two2 GO:34^bar 4

Lovely. You got the job (whatever that was?!) done relatively easily. And zero time wasted on creating script files and saving your shell commands to a log file. But wait, now you realize that you need exponentiation rather than multiplication by 2 in the second step. No prob! Copy that oneliner, modify, and rerun it and the following step(s):

$ cat in.2word| perl -wpe 'if (/(\d+)/) {$_.="\t".$1**2}' > in.2word.exp
$ head -3 in1.2word.exp
one1 GO:12^foo 1
three3 GO:12^foo`GO:34^bar 4
two2 GO:34^bar 9
$ cat in.2word.uniq.exp | perl -we 'while (<>) {chomp; @a=split "\t", $_; @go=split("`",$a[1]); for (@go) {print "$a[0]\t$_\t$a[2]\n"}}' > in.2word.uniq.exp.1goPerLine # Split the GO terms and print instead multiple lines, one for each term
$ head -4 in.2word.uniq.exp.1goPerLine

one1 GO:12^foo 1
three3 GO:12^foo 9

three3 GO:34^bar 9

two2 GO:34^bar 4

Note that I decided to change the file names accordingly. Maybe because I'm not sure if I'll still end up using the results of the pipeline from the first version, so I'd rather not overwrite them. Seems like a lot of manual text work? Not really. With emacs' powerful search, copy, and search&replace this takes under two minutes if you're a slow old-timer like me.

Not convinced? Well, maybe it's not for you. We each find our work practices that suit us best. Still, it's worthwhile experimenting with other people's practices. You can find some nifty techniques that would nicely complement your way of doing things. So I'd suggest trying the shell-in-emacs tricks if you haven't already. And then, try using oneliners for a while. Evaluate it and you'll see if there's something in it for you.

3 comments:

Eyal PrivmanMar 7, 2014, 6:10:00 PM
Comments welcome...
Near PrivmanMar 9, 2014, 11:35:00 AM
Hey, cool post. Event though I don't Perl, almost everything here is relevant for other shell-executed scripts (which I often use). You've inspired me to give emacs another try, after about a decade :)

If you do want to run a script from the middle, try adding the following to your .bash_profile:

function execFromMarker() { sed -n -e '/^>/,$p' $1 | sed -e 's/^>//' | bash /dev/stdin ${*:2}; }

Then just place a '>' at the beginning of the line you want execution to start from, and run it as:
execFromMarker myScript.perl param1 param2

Disclaimer: this assumes you don't already have any lines in your script that start with a '>' char. If that's not a safe assumption, use a different marker

Life of a dataminer

Friday, March 7, 2014

Perl oneliners: not just a bad habit

3 comments:

Blog Archive

About Me

Followers