This article demonstrates how to write Ruby scripts that work like typical, well-behaved Unix commands. To make it more fun and useful, we'll write a command-line tool for processing data stored in the comma separated values (CSV) file format. CSV (not CVS) is used to exchange data between databases, spreadsheets, and securities analysis software, as well as between some scientific applications. That format is also used by payment processing sites that provide downloadable sales data to vendors who use their services.
CSV files are plain text ASCII files in which one line of text represents one row of data and columns are separated with commas. A sample CSV file is shown below.
ticker,per,date,open,high,low,close,vol
XXXX,D,3-May-02,83.01,83.58,71.13,78.04,9645300
XXXX,D,2-May-02,82.47,85.76,82.05,83.84,7210000
XXXX,D,1-May-02,86.80,90.83,81.74,85.50,14253300
The script, csvt, will extract selected columns of data
from a CSV file. The output will also be a CSV file, and the user will be
able to specify the order in which the columns are printed. A
simple data integrity test will make csvt fail when the
number of columns in one line differs from the number of columns in the
previous line. The source of data will be either a file or standard input
(STDIN), as is customary for many Unix command-line
tools.
The utility will support the following options:
--extract col[,col][...], to print selected columns
from input. Numbers are separated with commas, and numbering starts
with 0. For example,
$ csvt --extract 1,5,2 file
prints columns 1, 5 and 2 (in that order) from file:
per,low,date
D,71.13,3-May-02
D,82.05,2-May-02
D,81.74,1-May-02
It will be possible to list the same column more than once:
$ csvt --extract 0,1,5,2,0 file
which produces the following output:
ticker,per,low,date,ticker
XXXX,D,71.13,3-May-02,XXXX
XXXX,D,82.05,2-May-02,XXXX
XXXX,D,81.74,1-May-02,XXXX
--remove col[,col][...], to print everything but the
selected columns. Numbers are separated with commas, and numbering starts
with 0. For example,
$ csvt --remove 1,5,2 file
will print all columns except 1, 5 and 2 (in any order) from
file:
ticker,open,high,close,vol
XXXX,83.01,83.58,78.04,9645300
XXXX,82.47,85.76,83.84,7210000
XXXX,86.80,90.83,85.50,14253300
Listing the same column number more than once will have no effect.
--help, -h, to display a short help page.
--usage, -u, which has the same effect as --help.
--version, to display csvt version information.
When csvt finds an unsupported option, or when it is run
without any options, it will default to the behavior determined by
--help.
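To make the specification concrete, here is a minimal sketch of what the two main options do to a single row, assuming the row has already been split into an array of fields. The helper names extract_columns and remove_columns are made up for this illustration and are not part of csvt as it is built later in this article.

```ruby
# Illustrative helpers (hypothetical names, not taken from csvt itself).

# Return the fields at the given indexes, in the order requested.
# Repeating an index repeats the column.
def extract_columns(fields, indexes)
  fields.values_at(*indexes)
end

# Return every field whose index is NOT listed. The order of the
# index list and any repeated indexes make no difference.
def remove_columns(fields, indexes)
  fields.each_index.reject { |i| indexes.include?(i) }.map { |i| fields[i] }
end

row = "XXXX,D,3-May-02,83.01,83.58,71.13,78.04,9645300".split(",")

puts extract_columns(row, [1, 5, 2]).join(",")        # D,71.13,3-May-02
puts extract_columns(row, [0, 1, 5, 2, 0]).join(",")  # XXXX,D,71.13,3-May-02,XXXX
puts remove_columns(row, [1, 5, 2]).join(",")         # XXXX,83.01,83.58,78.04,9645300
```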
To complete this tutorial you will need an OS capable of running the Ruby interpreter, the Ruby interpreter itself, and a text editor. The operating system can be any POSIX-compatible system, either commercial (AIX, Solaris, QNX, Microsoft NT/2000, Mac OS X, and others) or free (Linux, FreeBSD, NetBSD, OpenBSD, or Darwin). The Ruby interpreter should be the latest release of Ruby. You can check if Ruby has been installed on your system with the following command:
$ ruby --version
When the system reports that there is no such file or directory, you can download the latest Ruby binaries either from the Ruby site or from one of the repositories of ports and packages for your operating system (check the list of resources at the end of this article).
If ready-made binaries are not available, you can always build Ruby from the original sources found at the Ruby site. Detailed instructions for building Ruby can be found in the README file in the interpreter's source archive. If you get stuck, support is available on comp.lang.ruby as well as on the Ruby-talk mailing list. (Subscription details are on the Ruby site.)
The choice of text editor is largely a matter of personal preference.
The author is a devoted vi user, but any text
editor will do.
Every tool, no matter how small, should come with a manual or, at the very least, it should print a short help screen that explains its usage. It is a good habit to write documentation before writing the first line of code.
Since csvt is a simple tool with only five options, you
can be forgiven for not writing the
manual, but you should embed basic documentation in the script itself.
This should be mandatory for even a short script that you are writing for
your own use, because chances are good that you will forget what it does
in two weeks.
The help screen will be printed by csvt after
the user makes a mistake or runs csvt without specifying any
options. Since it should fit on one standard text terminal screen (80
by 25 characters), it must be terse but informative. Ideally, it should
state what the tool does, how to invoke it, which options it supports,
a few usage examples, where to send bug reports, and where to find the
licensing terms for csvt.
Your help screen could look like this (and it's okay just to type this stuff in a text editor and wrap it in code later):
csvt -- extract columns of data from a CSV (Comma-Separated Values) file
Usage: csvt [POSIX or GNU style options] file ...
POSIX options GNU long options
-e col[,col][,col]... --extract col[,col][,col]...
-r col[,col][,col]... --remove col[,col][,col]...
-h --help
-u --usage
-v --version
Examples:
csvt -e 1,5,6 file print column 1, 5 and 6 from file
csvt --extract 4,1 file print column 4 and 1 from file
csvt -r 2,7,1 file print all columns except 2, 7 and 1 from file
csvt --remove 6,0 file print all columns except 6 and 0 from file
cat file | csvt --remove 6,0 print all columns except 6 and 0 from file
Send bug reports to bugs@foo.bar
For licensing terms, see source code
Because there are several cases where it might be necessary to display
the help screen, you will need to put the code that displays it in a
separate method. We'll call it printusage(). (It helps to
have the source code of csvt handy.)
def printusage(error_code)
print "csvt -- extract columns of data from a CSV (Comma-Separated Values) file\n"
print "Usage: csvt [POSIX or GNU style options] file ...\n\n"
print "POSIX options GNU long options\n"
print " -e col[,col][,col]... --extract col[,col][,col]...\n"
print " -r col[,col][,col]... --remove col[,col][,col]...\n"
print " -h --help\n"
print " -u --usage\n"
print " -v --version\n\n"
print "Examples: \n"
print "csvt -e 1,5,6 file print column 1, 5 and 6 from file\n"
print "csvt --extract 4,1 file print column 4 and 1 from file\n"
print "csvt -r 2,7,1 file print all columns except 2, 7 and 1 from file\n"
print "csvt --remove 6,0 file print all columns except 6 and 0 from file\n"
print "cat file | csvt --remove 6,0 print all columns except 6 and 0 from file\n\n"
print "Send bug reports to bugs@foo.bar\n"
print "For licensing terms, see source code\n"
exit(error_code)
end
printusage() takes one argument, error_code,
which is later passed to exit(), a built-in Ruby method
used to stop the script and return an error code. In your script
printusage() will be called in two cases:
when the user runs csvt with the --help or
--usage option, so the script should return 0 (no errors), or
when the user runs csvt with an unsupported option or
without options, so the script should return 1 (to indicate an
error).
You should always remember to write code that returns appropriate error codes. When your script returns meaningful error codes, it is much easier to write scripts that can handle critical situations.
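As a rough illustration of why exit codes matter, the sketch below runs a child process and branches on its exit status. It uses a throwaway Ruby one-liner in place of csvt, so the command being run is an assumption made for this example, and exit_status_of is a hypothetical helper name.

```ruby
require 'rbconfig'

# Run a child Ruby process that exits with the given code and
# return the status the parent process observes.
def exit_status_of(code)
  system(RbConfig.ruby, "-e", "exit(#{Integer(code)})")
  $?.exitstatus
end

[0, 1].each do |code|
  if exit_status_of(code) == 0
    puts "child succeeded"
  else
    puts "child failed, taking corrective action"
  end
end
```

A calling script can make the same kind of decision about csvt, but only if csvt returns 0 on success and a non-zero code on failure.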
The specification presented in an earlier section lists several
options that csvt should understand. Your script can
access the list of options and arguments in two ways: by reading them
directly from the ARGV array (passed to your script
automatically) or by using the
GetoptLong class to parse ARGV for you. The
latter method is preferred: it's easier and saves time.
GetoptLong lives in a separate library file, so it must be explicitly
loaded before you can use it:
require 'getoptlong'
After your script imports getoptlong, you will also need
to create a new instance of GetoptLong:
opts = GetoptLong.new(
[ "--extract", "-e", GetoptLong::REQUIRED_ARGUMENT ],
[ "--remove", "-r", GetoptLong::REQUIRED_ARGUMENT ],
[ "--help", "-h", GetoptLong::NO_ARGUMENT ],
[ "--usage", "-u", GetoptLong::NO_ARGUMENT ],
[ "--version", "-v", GetoptLong::NO_ARGUMENT ]
)
The arguments passed to GetoptLong.new are arrays, one per option,
each holding the names of the long and the short option and an
argument flag that fine-tunes the
behavior of the option parser implemented in GetoptLong. The
example above shows how the csvt option specification is
turned into code. It is a good habit to define both long and short
options, but if for some reason it isn't possible or desired, you can
put "" in place of either the long or the
short option that you wish to leave undefined. The argument flag can be
set to REQUIRED_ARGUMENT, NO_ARGUMENT, or
OPTIONAL_ARGUMENT. The GetoptLong option and
argument parser uses these settings to decide how it should interpret the
contents of ARGV.
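To see what the parser hands back, here is a small sketch that feeds GetoptLong a made-up command line. GetoptLong always reads from ARGV, so the example overwrites ARGV first. The method name parse_demo and the sample arguments are assumptions made for the illustration.

```ruby
require 'getoptlong'

# Parse a hypothetical command line and return
# [options seen, arguments left over in ARGV].
def parse_demo(args)
  ARGV.replace(args)   # GetoptLong reads ARGV, so set it up for the demo
  parser = GetoptLong.new(
    [ "--extract", "-e", GetoptLong::REQUIRED_ARGUMENT ],
    [ "--help",    "-h", GetoptLong::NO_ARGUMENT ]
  )
  seen = {}
  parser.each { |opt, arg| seen[opt] = arg }
  [seen, ARGV.dup]
end

options, rest = parse_demo(["-e", "1,5,2", "file"])
puts options["--extract"]   # 1,5,2
puts rest.inspect           # ["file"]
```

Note that the short option -e is reported under its long name, and that the parser removes the options it recognizes from ARGV, leaving only the file names behind.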
Once you have a properly initialized instance of the option parser, you
can add code that checks which options have been selected and what mistakes
have been made. GetoptLong provides a lot of help here; your
job is limited to defining a few top-level variables and handling any errors
that may occur at this stage.
First, let's define a few top-level variables:
version = "0.0.1" # used by the --version or -v option handler
extract_f = false # set to true when --extract or -e are used
extract_args = [] # stores the list of arguments of --extract or -e
remove_f = false # set to true when --remove or -r are used
remove_args = [] # stores the list of arguments of --remove or -r
ex_options_n = 0 # used to store the number of mutually exclusive
# options, when > 1, the script will terminate
have_options_f = false # set to true when at least one option is used
Next, you need to check which options have been used. The
block of code responsible for testing this, and for setting the
parameters that change the behavior of
csvt, follows the pattern shown below:
begin
opts.each do |opt, arg|
case opt
when option
... option handler ...
when option
... option handler ...
end
end
rescue
... handle exceptions ...
end
The begin-rescue-end construct that wraps the
opts.each do loop is required to add the exception handler,
rescue-end, that provides a way to gracefully handle
unexpected situations. We need that handler, because we do not want the
user to see the trace messages printed by the Ruby interpreter when
GetoptLong raises an exception. A short error message and a
help screen are much more user friendly.
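The sketch below shows the idea in isolation, under a few assumptions: it rescues only GetoptLong::InvalidOption rather than every exception, it returns a message instead of printing a help screen, and the method name try_parse is made up for the example.

```ruby
require 'getoptlong'

# Return "ok" when all options are known, or a short error message
# when GetoptLong rejects one. ARGV is overwritten for the demo
# because GetoptLong reads its input from there.
def try_parse(args)
  ARGV.replace(args)
  parser = GetoptLong.new([ "--help", "-h", GetoptLong::NO_ARGUMENT ])
  parser.quiet = true   # suppress GetoptLong's own message on stderr
  parser.each { |_opt, _arg| }
  "ok"
rescue GetoptLong::InvalidOption => e
  "error: #{e.message}"
end

puts try_parse(["-h"])
puts try_parse(["-w"])
```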
Let's get down to the details. The opts.each do |opt,
arg| loop reads options and their arguments, if any are
expected:
begin
opts.each do |opt, arg|
Should the value of opt be some undefined option (e.g.,
-w), GetoptLong will display an error message
about the unsupported option, raise an exception, and stop the execution of
the script. This sounds a bit drastic, but as you will see in a moment,
you can handle that situation easily.
If the value of opt is one of the known options (e.g.,
--extract), it will be examined by the following
case control structure, which sets the extract_f
flag and checks which columns from the source file the user wants to
print.
Notice that it does not matter if the user uses the long or the short
version of the --extract option. GetoptLong
treats them both as the same option, which means that you only need to
write one handler.
case opt
when "--extract"
extract_f = true
extract_args = arg.split(",")
tmp = 0
extract_args.each do |column|
begin
extract_args[tmp] = Integer(column)
tmp += 1
rescue
$stderr.print "csvt: non-integer column index\n"
printusage(1)
end
end
ex_options_n += 1
have_options_f = true
The --extract option handler sets the
extract_f flag, splits the arguments that follow it
(remember, these are numbers separated with commas), and checks if all
arguments of --extract are numerical, integer indexes. When
all goes well, the ex_options_n exclusive options counter is
incremented and the have_options_f flag is set to indicate
that at least one option was selected by the user. This is used to avoid
ambiguity when the user selects mutually exclusive options.
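The conversion step on its own can be sketched more compactly with map. This is an alternative spelling, not the code csvt uses: the helper name parse_column_list is hypothetical, it returns nil instead of printing the help screen, and it passes a base of 10 to Integer() so that an argument such as 010 is read as ten rather than as octal eight.

```ruby
# Convert "1,5,2" into [1, 5, 2]; return nil if any item is not
# a decimal integer. (Hypothetical helper for illustration only.)
def parse_column_list(text)
  text.split(",").map { |item| Integer(item, 10) }
rescue ArgumentError
  nil
end

p parse_column_list("1,5,2")   # [1, 5, 2]
p parse_column_list("1,x,2")   # nil
```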
Because the --extract and --remove options
are quite similar in the way they work, their handlers are also almost
identical (see below).
when "--remove"
remove_f = true
remove_args = arg.split(",")
tmp = 0
remove_args.each do |column|
begin
remove_args[tmp] = Integer(column)
tmp += 1
rescue
$stderr.print "csvt: non-integer column index\n"
printusage(1)
end
end
ex_options_n += 1
have_options_f = true
Requests for csvt version information are handled by the
code shown below. Notice that it doesn't matter if other options were
used. Once --version or -v is found,
csvt prints version information and exits with 0 (no
errors).
when "--version"
print $0, ", version ", version, "\n"
exit(0)
Should the user need some help on csvt usage, our script
displays the help screen and exits with 0.
when "--help"
printusage(0)
when "--usage"
printusage(0)
end
end
Once the loop ends, it's time to check for possible errors like mutually exclusive and missing options. Both are considered errors and result in displaying an error message followed by the help screen.
#################################################################
# test for mutually exclusive options: --extract and --remove
if ex_options_n > 1
$stderr.print $0, ": cannot use --extract (-e) and --remove (-r) together\n"
printusage(1)
end
#################################################################
# test for missing options
if have_options_f == false
printusage(1)
end
The last piece of the option-processing block of code is the exception
handler, which prints the help screen, exits csvt, and
returns error code 1.
rescue
# all other errors
printusage(1)
end
Your code should look like this now:
require 'getoptlong'
version = "0.0.1" # used by the --version or -v option handler
extract_f = false # set to true when --extract or -e are used
extract_args = [] # stores the list of arguments of --extract or -e
remove_f = false # set to true when --remove or -r are used
remove_args = [] # stores the list of arguments of --remove or -r
ex_options_n = 0 # used to store the number of mutually exclusive
# options, when > 1, the script will terminate
have_options_f = false # set to true when at least one option is used
def printusage(error_code)
print "csvt -- extract columns of data from a CSV (Comma-Separated Values) file\n"
print "Usage: csvt [POSIX or GNU style options] file ...\n\n"
print "POSIX options GNU long options\n"
print " -e col[,col][,col]... --extract col[,col][,col]...\n"
print " -r col[,col][,col]... --remove col[,col][,col]...\n"
print " -h --help\n"
print " -u --usage\n"
print " -v --version\n\n"
print "Examples: \n"
print "csvt -e 1,5,6 file print column 1, 5 and 6 from file\n"
print "csvt --extract 4,1 file print column 4 and 1 from file\n"
print "csvt -r 2,7,1 file print all columns except 2, 7 and 1 from file\n"
print "csvt --remove 6,0 file print all columns except 6 and 0 from file\n"
print "cat file | csvt --remove 6,0 print all columns except 6 and 0 from file\n\n"
print "Send bug reports to bugs@foo.bar\n"
print "For licensing terms, see source code\n"
exit(error_code)
end
opts = GetoptLong.new(
[ "--extract", "-e", GetoptLong::REQUIRED_ARGUMENT ],
[ "--remove", "-r", GetoptLong::REQUIRED_ARGUMENT ],
[ "--help", "-h", GetoptLong::NO_ARGUMENT ],
[ "--usage", "-u", GetoptLong::NO_ARGUMENT ],
[ "--version", "-v", GetoptLong::NO_ARGUMENT ]
)
begin
opts.each do |opt, arg|
case opt
when "--extract"
extract_f = true
extract_args = arg.split(",")
tmp = 0
extract_args.each do |column|
begin
extract_args[tmp] = Integer(column)
tmp += 1
rescue
$stderr.print "csvt: non-integer column index\n"
printusage(1)
end
end
ex_options_n += 1
have_options_f = true
when "--remove"
remove_f = true
remove_args = arg.split(",")
tmp = 0
remove_args.each do |column|
begin
remove_args[tmp] = Integer(column)
tmp += 1
rescue
$stderr.print "csvt: non-integer column index\n"
printusage(1)
end
end
ex_options_n += 1
have_options_f = true
when "--help"
printusage(0)
when "--usage"
printusage(0)
when "--version"
print "csvt, version ", version, "\n"
exit(0)
end
end
#################################################################
# test for mutually exclusive options: --extract and --remove
if ex_options_n > 1
$stderr.print "csvt: cannot use --extract (-e) and --remove (-r) together\n"
printusage(1)
end
#################################################################
# test for missing options
if have_options_f == false
printusage(1)
end
rescue
printusage(1)
end
With option parsing code in place, you are now ready to add code for processing CSV files and for making your script behave like a proper command line tool.
It is an old Unix tradition that commands can be piped together to create more complex tools. Your script should obey that convention; doing so will make it more flexible and allow other users to do things the authors of the software never dreamed of.
Writing a Ruby script that fits into that scheme is actually very
simple. The simplest piece of code that copies everything from
STDIN to STDOUT is just three lines long:
while gets
print
end
Add it at the end of your script and see how it works. You do not need to worry about the way data is sent to your script. Both examples shown below give the same results, all without writing additional code.
$ cat file1 file2 | csvt -e 2,0
$ csvt -e 2,0 file1 file2
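The bare gets and print calls work because gets reads from ARGF, Ruby's stream that concatenates the files named in ARGV and falls back to STDIN when no files are given, and print with no arguments writes $_. A more explicit spelling of the same loop is sketched below against any IO-like object, so it can be tried without a terminal. The method name copy_lines is made up for this example.

```ruby
require 'stringio'

# Copy every line from input to output unchanged.
def copy_lines(input, output)
  input.each_line { |line| output.print(line) }
end

source = StringIO.new("a,b,c\n1,2,3\n")
sink = StringIO.new
copy_lines(source, sink)
print sink.string

# In a real script the equivalent call would be:
#   copy_lines(ARGF, $stdout)
```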
The simple loop shown in the previous section is not very useful, because it
does not do any processing of input. It does illustrate the general
concept, though. The csvt script will use two such loops, one for
--extract and one for --remove. Both start with
a test of the appropriate flag, extract_f for
--extract and remove_f for
--remove.
if extract_f == true
first_f = true
The first_f flag is used to avoid the "off by
one" error inside the while loop:
while gets
data = $_.chomp
data = data.split(",")
data_n = data.length
Every loop cycle starts with a call to gets, which reads a new line
from STDIN and stores it in $_. Next the script
removes the end of line character and splits the line into an array of
separate columns.
if first_f
old_data_n = data_n
first_f = false
end
The size of the array is stored in data_n. If the line just read
is the first line, the script sets old_data_n, the number of columns on
the non-existent previous line, to the number of columns on the first line,
so that the first line passes the data integrity check (which compares the
number of columns in the previous and the current line).
if data_n != old_data_n
$stderr.print "csvt: the number of fields on the " +
"following line does not match the number " +
"of fields on the previous line\n"
$stderr.print $_
exit(1)
end
Should the data integrity test fail, the error message followed by the
offending line will be printed to standard error (STDERR) and the execution of
csvt will stop. It is tempting to relax the rules a little
and introduce an option for skipping such errors, but that's a job for a
separate tool; namely, a specialized data integrity checker, which is
usually written with a particular data set in mind and is therefore outside
the scope of csvt's specification.
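The integrity rule can also be tried out in isolation. The sketch below is not the code csvt uses: the method name first_inconsistent_line is hypothetical, it works on an array of lines rather than on STDIN, and it calls split with a limit of -1 so that trailing empty fields are counted (a plain split(",") silently drops them).

```ruby
# Return the 1-based number of the first line whose field count differs
# from the line before it, or nil when all lines agree.
def first_inconsistent_line(lines)
  previous_count = nil
  lines.each_with_index do |line, index|
    count = line.chomp.split(",", -1).length
    return index + 1 if previous_count && count != previous_count
    previous_count = count
  end
  nil
end

p first_inconsistent_line(["a,b,c\n", "1,2,3\n", "4,5\n"])   # 3
p first_inconsistent_line(["a,b,c\n", "1,2,3\n"])            # nil
```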
When everything goes well, we can begin constructing a line of output. This starts with initializing the line variable:
line = ""
Next we walk through the array of arguments for the --extract
option. As you will notice, there is a check that the column index is
less than the number of fields in the line we just read. If it is not,
csvt will complain, suggest the allowed range of indexes, and
exit with code 1.
extract_args.each do |column|
if !(column < data_n)
$stderr.print "csvt: column index out of range, " +
"use numbers between 0 and ",
data_n - 1, "\n"
exit(1)
end
If all goes well, we use the value of column as the index into the data array and add the result to the string stored in line, followed by a comma.
line += data[column] + ","
end
Once all columns listed as arguments of --extract have
been processed, we can print the contents of the line variable, less the
last character, which we replace with the end of line character.
print line[0, line.length-1], "\n"
The last thing is setting the old_data_n variable to the
number of columns in the currently processed line, so the data integrity
check can spot any errors.
old_data_n = data_n
end
end
So it goes until the end of the file or data stream. When all data is
processed, our script ends with a call to exit(0).
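The range check and the assembly of the output line can be sketched as one small method. This is a variation on the code above, not a copy of it: the name build_line is made up, it raises IndexError instead of exiting, it uses Array#join so there is no trailing comma to trim, and it also rejects negative indexes, which the test in the loop above lets through.

```ruby
# Build one output line from the requested columns.
# Raises IndexError when an index falls outside the row.
def build_line(fields, indexes)
  selected = indexes.map do |i|
    unless (0...fields.length).cover?(i)
      raise IndexError,
            "column index out of range, use numbers between 0 and #{fields.length - 1}"
    end
    fields[i]
  end
  selected.join(",")
end

puts build_line(%w[XXXX D 3-May-02], [2, 0])   # 3-May-02,XXXX
```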
The code used to process STDIN when the user chooses the
--remove option is similar to the --extract
handler, with a small twist after the line variable initialization.
if remove_f == true
first_f = true
while gets
data = $_.chomp
data = data.split(",")
data_n = data.length
if first_f
old_data_n = data_n
first_f = false
end
if data_n != old_data_n
$stderr.print "csvt: the number of fields on the following " +
"line does not match the number of fields on " +
"the previous line\n"
$stderr.print $_
exit(1)
end
line = ""
There is an additional loop that sets the columns whose indexes are
listed as arguments of --remove to "".
remove_args.each do |column|
if !(column < data_n)
$stderr.print "csvt: field index out of range, " +
"use numbers between 0 and ",
data_n - 1, "\n"
exit(1)
end
data[column] = ""
end
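The rest of the code is similar to the code in the
--extract handler, except that it walks through every field and
skips those that were set to "". Note that a field that was already
empty in the input is skipped too.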
data.each do |column|
if column == ""
next
else
line += column + ","
end
end
print line[0, line.length-1], "\n"
old_data_n = data_n
end
end
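One side effect of blanking the removed columns and then skipping empty strings is that fields that were already empty in the input are dropped as well. A possible alternative, sketched here under the assumption that this is not the desired behavior, is to drop fields by position instead. The method name drop_columns is hypothetical.

```ruby
# Drop fields by position, leaving genuinely empty fields intact.
def drop_columns(fields, indexes)
  fields.reject.with_index { |_field, i| indexes.include?(i) }
end

# The second field is empty in this sample input; the -1 limit makes
# split keep empty fields at the end of a line as well.
row = "XXXX,,3-May-02,83.01".split(",", -1)
puts drop_columns(row, [0]).join(",")   # ,3-May-02,83.01
```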
We now have a complete script to help us filter CSV files. It may grow in the future, but for now it is complete enough. The script plays well with other command-line Unix tools and is a well-behaved Unix citizen. The complete script is here.
Your script is working now and you could call it quits, but for greater
convenience in the future, make an extra effort and make
csvt executable, so you can type just this:
$ csvt
instead of this:
$ ruby csvt.rb
If you are using Unix, simply add this code on the first line of your script:
#!/usr/local/bin/ruby
The actual path to the ruby interpreter binary might be
different on your system. The easiest way to find out is to use the
locate or which command:
$ locate ruby
$ which ruby
If neither finds it, use find:
$ find / -name "ruby"
This might take a while because find searches the
whole directory tree. Once you know the path to the
ruby binary, paste it after #! and save the
script to disk. Remember that you need to place this line on the
very first line of your script or the shell will not be able to recognize
it as a request to use the Ruby interpreter. If you need to pass options
to the interpreter, you can list them after the path, but remember that there is no need
to list the name of the script itself.
Now save csvt to disk, and make it executable:
$ chmod u+x csvt
The u+x argument tells chmod to mark
csvt as executable only by the owner of the script (that
would be you ...). Other possibilities include g+x, which
marks the script as executable by all members of the group that the script
is assigned to (ls -l reveals the script's group);
o+x, which would make the script executable by all other
users (not a good idea); finally, a+x would make it
executable by all users (this should be avoided as well).
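For completeness, the same permission change can be made from Ruby itself. The sketch below assumes a Unix-style file system, uses a temporary file in place of csvt, and the method name make_owner_executable is made up for the example.

```ruby
require 'tempfile'

# Add the owner-execute bit to a file, leaving the other permission
# bits alone (the same effect as `chmod u+x`).
def make_owner_executable(path)
  mode = File.stat(path).mode & 0o7777   # keep only the permission bits
  File.chmod(mode | 0o100, path)
end

Tempfile.create("csvt-demo") do |file|
  make_owner_executable(file.path)
  # Print the resulting permission bits in octal.
  printf("%o\n", File.stat(file.path).mode & 0o777)
end
```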
Note that neither the #! notation nor the chmod
command can be used in the Microsoft Windows environment unless you
install the Cygwin package, which turns Windows into a pretty good Unix
look-alike. When installing Cygwin is not an option,
you can still use csvt, but it must be preceded with the
ruby command, as in ruby csvt -e 1,5 file instead of
csvt -e 1,5 file.
The following places should be on the list of favorite destinations for everyone learning and using Ruby:
If you want to enhance your knowledge of Ruby, you should take a look at Ruby in a Nutshell from O'Reilly or Programming Ruby from Addison-Wesley. Safari has at least half a dozen Ruby titles, from O'Reilly as well as other publishers.
Jacek Artymiak started his adventure with computers in 1986 with Sinclair ZX Spectrum. He's been using various commercial and Open Source Unix systems since 1991. Today, Jacek runs devGuide.net, writes and teaches about Open Source software and security, and tries to make things happen.
Copyright © 2009 O'Reilly Media, Inc.