Building Unix Tools with Ruby
Get the Plumbing Right
With option parsing code in place, you are now ready to add code for processing CSV files and for making your script behave like a proper command line tool.
It is an old Unix tradition that commands can be piped together to create more complex tools. Your script should obey that convention; doing so will make it more flexible and allow other users to do things the authors of the software never dreamed of.
Writing a Ruby script that fits into that scheme is actually very
simple. The simplest piece of code that copies everything from
STDIN to STDOUT is just three lines long:
while gets
print
end
Add it at the end of your script and see how it works. You do not need to worry about the way data is sent to your script. Both examples shown below give the same results, all without writing additional code.
$ cat file1 file2 | csvt -e 2,0
$ csvt -e 2,0 file1 file2
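Both invocations behave the same because Kernel#gets reads from ARGF, Ruby's virtual stream that concatenates the files named in ARGV and falls back to STDIN when ARGV is empty. This assumes the option-parsing code has already removed the options from ARGV. The sketch below is the same copy loop written against explicit stream objects, so it can be tried without a pipe; copy_lines is a hypothetical helper name, not part of csvt.

```ruby
require "stringio"

# Hypothetical helper (not part of csvt): copy every line from `input`
# to `output`, just like the three-line `while gets` loop.
def copy_lines(input, output)
  while (line = input.gets)
    output.print line
  end
end

# In-memory streams stand in for a pipe or a list of files.
source = StringIO.new("a,b,c\n1,2,3\n")
sink = StringIO.new
copy_lines(source, sink)
puts sink.string

# In a real script the call would be: copy_lines(ARGF, $stdout)
```

Passing the streams in as arguments is only a convenience for experimenting; the bare gets and print calls in the article do the same work with less typing.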
Processing Input
The simple loop shown in Section 6 is not very useful, because it
does not do any processing of input, but it does illustrate the general
concept. The csvt script will use two such loops, one for
--extract and one for --remove. Both start with
a test of the appropriate flag, extract_f for
--extract and remove_f for
--remove.
if extract_f == true
first_f = true
The first_f flag is used to avoid the "off by
one" error inside the while loop:
while gets
data = $_.chop
data = data.split(",")
data_n = data.length
Every loop cycle starts with a call to gets, which reads a new line
from STDIN and stores it in $_. Next the script
removes the end of line character and splits the line into an array of
separate columns.
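Two details of chop and split are worth keeping in mind when reading this step. The snippet below is a small illustration using only standard String methods; it is not part of csvt, and the sample strings are made up.

```ruby
# A line as gets would return it, with two empty trailing fields.
raw = "a,b,,\n"

chopped = raw.chop    # drops the last character, whatever it is
chomped = raw.chomp   # drops a trailing line ending only if present

# Without a trailing newline (for example, the last line of some files),
# chop eats real data, while chomp leaves the string alone.
lossy = "x,y".chop
safe  = "x,y".chomp

# split(",") discards trailing empty fields; a negative limit keeps them.
short = chomped.split(",")
full  = chomped.split(",", -1)

p chopped, lossy, safe, short, full
```

Because trailing empty fields are discarded by default, a line that ends in empty columns can report fewer fields than its neighbors, which matters for the field-count check that follows.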
if first_f
old_data_n = data_n
first_f = false
end
The size of the array is stored in data_n. Then the script tests
whether the line just read is the first line. If it is, the script sets
the column count of the non-existent previous line to the column count of
the first line, so that the first line passes the data integrity check
(which compares the number of columns in the previous and the current line).
if data_n != old_data_n
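# A trailing + continues the expression on the next line;
# a leading + would start a new statement and truncate the message.
$stderr.print "csvt: the number of fields on the " +
"following line does not match the number " +
"of fields on the previous line\n"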
$stderr.print $_
exit(1)
end
Should the data integrity test fail, the error message followed by the
offending line will be printed to standard error (STDERR) and the execution
of csvt will stop with exit code 1. It is tempting to relax the rules a little
and introduce an option for skipping such errors, but that's a job for a
separate tool; namely, a specialized data integrity checker, which is
usually written with a particular data set in mind and is therefore outside
the scope of csvt's specification.
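The integrity check can also be tried in isolation. The following is a minimal sketch, not the csvt code itself: first_mismatch is a hypothetical helper, and it remembers the previous count in a variable that starts out as nil instead of using a first_f flag.

```ruby
require "stringio"

# Hypothetical helper: return [line_number, line] for the first line whose
# field count differs from the previous line's, or nil if all lines agree.
def first_mismatch(input)
  previous_count = nil
  input.each_line.with_index(1) do |line, number|
    count = line.chomp.split(",", -1).length
    return [number, line] if previous_count && count != previous_count
    previous_count = count
  end
  nil
end

good = StringIO.new("a,b,c\n1,2,3\n")
bad  = StringIO.new("a,b,c\n1,2\n4,5,6\n")

p first_mismatch(good)
p first_mismatch(bad)
```

Returning the offending line instead of exiting leaves the decision about what to print, and whether to stop, to the caller.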
When everything goes well, we can begin constructing a line of output. This starts with initializing the line variable:
line = ""
Next we traverse the array of arguments for the --extract
option. As you will notice, there is a check that each column index is
less than the number of fields in the line we just read. If it is not,
csvt will complain, suggest the allowed range of indexes and
exit with code 1.
extract_args.each do |column|
if !(column < data_n)
# The trailing + joins the two halves of the message into one argument.
$stderr.print "csvt: column index out of range, " +
"use numbers between 0 and ",
data_n - 1, "\n"
exit(1)
end
If all goes well, we use the value of column as the index into the data array and add the result to the string stored in line, followed by a comma.
line += data[column] + ","
end
Once all columns listed as arguments of --extract have
been processed, we can print the contents of the line variable, less the
last character, which we replace with the end of line character.
print line[0, line.length-1], "\n"
The last thing is setting the old_data_n variable to the
number of columns in the currently processed line, so the data integrity
check can spot any errors.
old_data_n = data_n
end
end
So it goes until the end of the file or data stream. When all data is
processed, our script ends with a call to exit(0).
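For comparison, the per-line work of the --extract handler can be condensed with Array#values_at and Array#join, which avoids building the string by hand and trimming the trailing comma. This is a sketch under assumptions, not the csvt implementation: extract_fields is a hypothetical helper, it expects integer indexes, it raises an exception instead of calling exit, and unlike the check above it also rejects negative indexes.

```ruby
# Hypothetical helper: build one output line from the fields selected by
# `indexes`, in the order given. Raises IndexError for an out-of-range index.
def extract_fields(fields, indexes)
  indexes.each do |i|
    unless (0...fields.length).cover?(i)
      raise IndexError, "column index out of range, " \
                        "use numbers between 0 and #{fields.length - 1}"
    end
  end
  fields.values_at(*indexes).join(",")
end

fields = "alice,30,paris".split(",", -1)
puts extract_fields(fields, [2, 0])   # prints "paris,alice"
```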
The code used to process STDIN when the user chooses the
--remove option is similar to the --extract
handler, with a small twist after the line variable initialization.
if remove_f == true
first_f = true
while gets
data = $_.chop
data = data.split(",")
data_n = data.length
if first_f
old_data_n = data_n
first_f = false
end
if data_n != old_data_n
# A trailing + continues the expression on the next line.
$stderr.print "csvt: the number of fields on the following " +
"line does not match the number of fields on " +
"the previous line\n"
$stderr.print $_
exit(1)
end
line = ""
There is an additional loop that sets the columns whose indexes are
listed as arguments of --remove to "".
remove_args.each do |column|
if !(column < data_n)
# data_n (not data_nf) holds the number of fields on the current line.
$stderr.print "csvt: field index out of range, " +
"use numbers between 0 and ",
data_n - 1, "\n"
exit(1)
end
data[column] = ""
end
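The rest of the code closely follows the --extract handler, except
that it walks the whole data array and skips every column that is an
empty string. Note that this also drops fields that were already empty
in the input.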
data.each do |column|
if column == ""
next
else
line += column + ","
end
end
print line[0, line.length-1], "\n"
old_data_n = data_n
end
end
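Because the handler above blanks the removed columns and then skips every empty string, a field that was empty in the input disappears as well. If that is not what you want, one alternative is to remove fields by position. The following is only a sketch of that idea; remove_fields is a hypothetical helper and assumes integer indexes.

```ruby
# Hypothetical helper: drop the fields whose positions are listed in
# `indexes` and join the rest. Fields are removed by position, so fields
# that were already empty in the input are kept.
def remove_fields(fields, indexes)
  kept = fields.reject.with_index { |_field, i| indexes.include?(i) }
  kept.join(",")
end

fields = "alice,,paris,fr".split(",", -1)
puts remove_fields(fields, [0])   # prints ",paris,fr"
```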
We now have a complete script to help us filter CSV files. It may grow in the future, but for now it does its job: it plays well with other command-line Unix tools and is a well-behaved Unix citizen. The complete script is here.
Make csvt Executable
Your script is working now and you could call it quits, but for greater
convenience in the future, try to make an extra effort and make
csvt executable, so you can type just this:
$ csvt
instead of this:
$ ruby csvt.rb
If you are using Unix, simply add this code on the first line of your script:
#!/usr/local/bin/ruby
The actual path to the ruby interpreter binary might be
different on your system. The easiest way to find out is to use the
locate or which command:
$ locate ruby
$ which ruby
If either fails, use find:
$ find / -name "ruby"
This might take a while because find is searching the
whole directory tree. Once you know the access path to the
ruby binary, paste it after #! and save the
script to disk. Remember that you need to place these instructions on the
very first line of your script or the shell will not be able to recognize
it as a request to use the Ruby interpreter. If you need to list options
for the interpreter, you can list them, but remember that there is no need
to list the name of the script itself.
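If a Ruby interpreter can already be started from your shell, it can also report its own location, which is another way to find the path for the #! line. A small sketch using the standard rbconfig library:

```ruby
require "rbconfig"

# RbConfig.ruby is the full path of the interpreter running this code.
interpreter = RbConfig.ruby
puts "#!#{interpreter}"
```

The printed line is a candidate first line for csvt on the machine where you ran the snippet; on another system the path may differ.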
Now save csvt to disk and make it executable:
$ chmod u+x csvt
The u+x argument tells chmod to mark
csvt as executable only by the owner of the script (that
would be you ...). Other possibilities include g+x, which
marks the script as executable by all members of the group that the script
is assigned to (ls -l reveals the script's group);
o+x, which would make the script executable by all other
users (not a good idea); finally, a+x would make it
executable by all users (this should be avoided as well).
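To see what these symbolic modes do to a file's permission bits, you can experiment on a scratch file from Ruby itself. The sketch below assumes a Unix-like system and a Ruby whose FileUtils.chmod accepts symbolic modes; demo_chmod is a hypothetical helper, and the starting mode of 644 (rw-r--r--) is chosen only for the demonstration.

```ruby
require "fileutils"
require "tempfile"

# Hypothetical helper: apply a symbolic mode to a scratch file that starts
# out as 644 and return the resulting permission bits.
def demo_chmod(symbolic_mode)
  Tempfile.create("csvt-demo") do |file|
    File.chmod(0644, file.path)                # start from rw-r--r--
    FileUtils.chmod(symbolic_mode, file.path)  # like: chmod MODE FILE
    File.stat(file.path).mode & 0777
  end
end

%w[u+x g+x o+x a+x].each do |mode|
  printf("%s: %o\n", mode, demo_chmod(mode))
end
```

The octal numbers printed are the same ones ls -l shows in symbolic form, so you can compare the effect of each option before applying it to csvt.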
Note that neither the #! notation nor the chmod
command can be used in the Microsoft Windows environment unless you
install the Cygwin package, which turns Windows into a pretty good Unix
environment look-and-feel-alike. When installing Cygwin is not an option,
you can still use csvt, but it must be preceded with the
ruby command, as in ruby csvt -e file instead of
csvt -e file.
Resources
The following places should be on the list of favorite destinations for everyone learning and using Ruby:
- Ruby binaries and sources
- Ruby mailing lists
- the Ruby newsgroup
- the Cygwin Unix environment for Microsoft Windows
- the Fink Unix environment for Mac OS X (the latest Ruby builds for Mac OS X)
Books
If you want to enhance your knowledge of Ruby, you should take a look at Ruby in a Nutshell from O'Reilly or Programming Ruby from Addison-Wesley. Safari has at least half a dozen Ruby titles, from O'Reilly as well as other publishers.
Jacek Artymiak started his adventure with computers in 1986 with Sinclair ZX Spectrum. He's been using various commercial and Open Source Unix systems since 1991. Today, Jacek runs devGuide.net, writes and teaches about Open Source software and security, and tries to make things happen.