Building Unix Tools with Ruby
by Jacek Artymiak, 09/18/2003
This article demonstrates how to write Ruby scripts that work like typical, well-behaved Unix commands. To make it more fun and useful, we'll write a command-line tool for processing data stored in the comma separated values (CSV) file format. CSV (not CVS) is used to exchange data between databases, spreadsheets, and securities analysis software, as well as between some scientific applications. That format is also used by payment processing sites that provide downloadable sales data to vendors who use their services.
CSV files are plain-text ASCII files in which each line of text represents one row of data and columns are separated with commas. A sample CSV file is shown below.
ticker,per,date,open,high,low,close,vol
XXXX,D,3-May-02,83.01,83.58,71.13,78.04,9645300
XXXX,D,2-May-02,82.47,85.76,82.05,83.84,7210000
XXXX,D,1-May-02,86.80,90.83,81.74,85.50,14253300
What Is the Script Supposed to Do?
The script, csvt, will extract selected columns of data
from a CSV file. The output will also be a CSV file, and the user will be
able to specify the order in which the columns are printed. A
simple data integrity test will make csvt fail when the
number of columns in one line differs from the number of columns in the
previous line. The source of data will be either a file or standard input
(STDIN), as is customary for many Unix command-line
tools.
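The integrity test described above could be sketched as follows. This is only an illustration, not the csvt source: the method name read_rows is hypothetical, and the naive split on commas assumes that fields never contain quoted commas.

```ruby
require "stringio"

# Hypothetical helper (not from the csvt source): reads CSV lines from any
# IO-like object and raises if a line has a different number of columns
# than the line before it.
def read_rows(io)
  rows = []
  expected = nil
  io.each_line.with_index(1) do |line, number|
    # Naive split: assumes no quoted fields; -1 keeps trailing empty fields.
    fields = line.chomp.split(",", -1)
    if expected && fields.length != expected
      raise ArgumentError,
            "line #{number}: expected #{expected} columns, got #{fields.length}"
    end
    expected = fields.length
    rows << fields
  end
  rows
end

# In-memory sample standing in for a file or STDIN.
sample = StringIO.new("ticker,per,date\nXXXX,D,3-May-02\n")
p read_rows(sample)
```

In a real script, passing ARGF instead of the StringIO would read from the files named on the command line, or from STDIN when none are given, which matches the file-or-standard-input behavior described above.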
The utility will support the following options:
- --extract col[,col][...], to print selected columns from input. Numbers are separated with commas, and numbering starts with 0. For example,
    $ csvt --extract 1,5,2 file
  prints columns 1, 5 and 2 (in that order) from file:
    per,low,date
    D,71.13,3-May-02
    D,82.05,2-May-02
    D,81.74,1-May-02
  It will be possible to list the same column more than once:
    $ csvt --extract 0,1,5,2,0 file
  which produces the following output:
    ticker,per,low,date,ticker
    XXXX,D,71.13,3-May-02,XXXX
    XXXX,D,82.05,2-May-02,XXXX
    XXXX,D,81.74,1-May-02,XXXX
- --remove col[,col][...], to print everything but the selected columns. Numbers are separated with commas, and numbering starts with 0. For example,
    $ csvt --remove 1,5,2 file
  will print all columns except 1, 5 and 2 (listed in any order) from file:
    ticker,open,high,close,vol
    XXXX,83.01,83.58,78.04,9645300
    XXXX,82.47,85.76,83.84,7210000
    XXXX,86.80,90.83,85.50,14253300
  Listing the same column number more than once will have no effect.
- --help, -h, to display a short help page.
- --usage, -u, which has the same effect as --help.
- --version, to display csvt version information.
When csvt finds an unsupported option, or when it is run
without any options, it will default to the behavior determined by
--help.
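To make the semantics of --extract and --remove concrete, here is a minimal sketch of the two operations on a single row. The method names parse_columns, extract_columns, and remove_columns are hypothetical and are not taken from the csvt source; the sketch also assumes simple comma-separated fields without quoting.

```ruby
# Hypothetical helpers illustrating the option semantics; not the csvt source.

# Turns an option argument such as "1,5,2" into [1, 5, 2].
def parse_columns(arg)
  arg.split(",").map { |s| Integer(s) }
end

# --extract: keep the listed columns in the order given; repeats are allowed.
# Note that an out-of-range column number yields nil in this sketch.
def extract_columns(fields, cols)
  fields.values_at(*cols)
end

# --remove: drop the listed columns; their order and any repeats do not matter.
def remove_columns(fields, cols)
  fields.reject.with_index { |_field, index| cols.include?(index) }
end

row = "XXXX,D,3-May-02,83.01,83.58,71.13,78.04,9645300".split(",")
puts extract_columns(row, parse_columns("1,5,2")).join(",")
# prints D,71.13,3-May-02
puts remove_columns(row, parse_columns("1,5,2")).join(",")
# prints XXXX,83.01,83.58,78.04,9645300
```

Array#values_at preserves the requested order and allows repeated indexes, which is why it suits --extract, while the index-based reject ignores order and repetition, as --remove requires.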
Before You Begin
To complete this tutorial you will need an OS capable of running the Ruby interpreter, the Ruby interpreter itself, and a text editor. The operating system can be any POSIX-compatible system, either commercial (AIX, Solaris, QNX, Microsoft NT/2000, Mac OS X, and others) or free (Linux, FreeBSD, NetBSD, OpenBSD, or Darwin). The Ruby interpreter should be the latest release of Ruby. You can check if Ruby has been installed on your system with the following command:
$ ruby --version
If the system reports that there is no such file or directory, you can download the latest Ruby binaries either from the Ruby site or from one of the repositories of ports and packages for your operating system (check the list of resources at the end of this article).
If ready-made binaries are not available, you can always build Ruby from the original sources found at the Ruby site. Detailed instructions for building Ruby are in the README file in the interpreter's source archive. If you get stuck, support is available on comp.lang.ruby as well as on the Ruby-talk mailing list. (Subscription details are on the Ruby site.)
The choice of text editor is largely a matter of personal preference.
The author is a devoted vi user, but any text
editor will do.
Start with the Help Screen
Every tool, no matter how small, should come with a manual or, at the very least, it should print a short help screen that explains its usage. It is a good habit to write documentation before writing the first line of code.
Since csvt is a simple tool with only five options, you
can be forgiven for not writing the
manual, but you should embed basic documentation in the script itself.
This should be mandatory for even a short script that you are writing for
your own use, because chances are good that you will forget what it does
in two weeks.
The help screen will be printed by csvt after
the user makes a mistake or runs csvt without specifying any
options. Since it should fit on one standard text terminal screen (80
by 25 characters), it must be terse, but informative. Ideally, it should
present the following information:
- the name and the purpose of your utility;
- basic usage information;
- POSIX and GNU options recognized by csvt;
- some examples;
- where to send bug reports.
Your help screen could look like this (and it's okay just to type this stuff in a text editor and wrap it in code later):
csvt -- extract columns of data from a CSV (Comma-Separated Values) file
Usage: csvt [POSIX or GNU style options] file ...

POSIX options                   GNU long options
    -e col[,col][,col]...       --extract col[,col][,col]...
    -r col[,col][,col]...       --remove col[,col][,col]...
    -h                          --help
    -u                          --usage
    -v                          --version

Examples:
csvt -e 1,5,6 file              print column 1, 5 and 6 from file
csvt --extract 4,1 file         print column 4 and 1 from file
csvt -r 2,7,1 file              print all columns except 2, 7 and 1 from file
csvt --remove 6,0 file          print all columns except 6 and 0 from file
cat file | csvt --remove 6,0    print all columns except 6 and 0 from file

Send bug reports to bugs@foo.bar
For licensing terms, see source code
Because there are several cases where it might be necessary to display
the help screen, you will need to put the code that displays it in a
separate method. We'll call it printusage(). (It helps to
have the source code of csvt handy.)
# Print the help screen, then stop the script with the given exit code.
def printusage(error_code)
  print "csvt -- extract columns of data from a CSV (Comma-Separated Values) file\n"
  print "Usage: csvt [POSIX or GNU style options] file ...\n\n"
  print "POSIX options                   GNU long options\n"
  print "    -e col[,col][,col]...       --extract col[,col][,col]...\n"
  print "    -r col[,col][,col]...       --remove col[,col][,col]...\n"
  print "    -h                          --help\n"
  print "    -u                          --usage\n"
  print "    -v                          --version\n\n"
  print "Examples: \n"
  print "csvt -e 1,5,6 file              print column 1, 5 and 6 from file\n"
  print "csvt --extract 4,1 file         print column 4 and 1 from file\n"
  print "csvt -r 2,7,1 file              print all columns except 2, 7 and 1 from file\n"
  print "csvt --remove 6,0 file          print all columns except 6 and 0 from file\n"
  print "cat file | csvt --remove 6,0    print all columns except 6 and 0 from file\n\n"
  print "Send bug reports to bugs@foo.bar\n"
  print "For licensing terms, see source code\n"
  # 0 signals success; any other value signals an error to the caller.
  exit(error_code)
end
printusage() takes one argument, error_code,
which is later passed to exit(), a built-in Ruby method
used to stop the script and return an error code. In your script
printusage() will be called in two cases:
- when the user runs csvt with the --help or --usage options, so the script should return 0 (no errors), or
- when the user runs csvt with an unsupported option or without options, and the script should return 1 (to indicate an error).
You should always remember to write code that returns appropriate error codes. When your script returns meaningful error codes, it is much easier to write scripts that can handle critical situations.
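One way to check that a method such as printusage() requests the intended exit code, without actually ending the calling script, relies on the fact that Ruby's exit() raises a SystemExit exception that carries the status. The sketch below is an illustration only: exit_status_of and usage_stub are hypothetical names, and usage_stub is a shortened stand-in for the real printusage().

```ruby
# Hypothetical test helper: runs the block and returns the exit status it
# requested, or nil if the block finished without calling exit.
def exit_status_of
  yield
  nil
rescue SystemExit => e
  e.status
end

# Shortened stand-in for printusage(); prints one line, then exits.
def usage_stub(error_code)
  print "Usage: csvt [POSIX or GNU style options] file ...\n"
  exit(error_code)
end

# --help or --usage: the script should report success.
puts exit_status_of { usage_stub(0) }
# Unsupported option or no options: the script should report an error.
puts exit_status_of { usage_stub(1) }
```

From a shell, the same codes are visible in $? after the command finishes, which is what lets other scripts react to a failed csvt run.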