The Making of Effective awk Programming
by Arnold Robbins, author of Effective awk Programming, 3rd Edition
11/01/2002
Editor's note: Arnold Robbins has been an O'Reilly author for more than eight years, authoring or coauthoring some of its best-selling, most enduring titles, including the sixth edition of Learning the vi Editor, the third edition of Unix in a Nutshell: System V Edition, the second edition of sed & awk, and the second edition of Learning the Korn Shell.
In all that time, he's learned a thing or two about the O'Reilly book production process. But when he was ready to update Effective awk Programming he wanted to use the Texinfo markup language. O'Reilly production prefers authors use the DocBook markup language. Arnold compromised by agreeing to manage the conversion process for the final production of his book. In this article, he chronicles the challenges he faced translating his book from Texinfo to DocBook. The breadth of technical detail and the extensive code examples he provides here offer unique insight into one author's experience working with O'Reilly's book production department to create a book.
Introduction
O'Reilly & Associates published the third edition of
Effective awk Programming in May 2001. The book provides thorough coverage of the awk programming language,
as standardized by the IEEE POSIX standard for portable operating
system applications. This standard is based on Unix and its utilities.
Effective awk Programming also doubles as the user's guide for GNU awk
(known as gawk), explaining the extensions and features that
are unique to gawk. It includes a wealth of sample programs and
library functions that demonstrate good awk programming style.
gawk is the standard version of awk on GNU/Linux
and most BSD-based systems. It is also popular on commercial Unix and
Windows systems because it has a number of useful extensions,
and because it can handle large data sets (records with hundreds or
thousands of fields, arrays with thousands of elements) that often cause
other implementations to give up. The third edition of Effective awk Programming describes the current version of gawk, 3.1.
The GNU project uses the Texinfo markup language for all of its documentation. Texinfo is a pleasant markup language in which to work. It is semantically driven: you mark up what something is, not how to print it; it allows easy nesting of different constructs; it is not as painful to type as HTML or DocBook XML; and it provides for translation into multiple output formats.
Printed documents may be generated directly from Texinfo input
files by using TeX. The Texinfo distribution includes the file
texinfo.tex, which is a set of TeX macros that directly
implement the Texinfo language, and scripts for running TeX.
Other output formats are generated by the makeinfo program,
which is a rather large and complicated C program that knows how to
produce GNU Info, HTML, and these days, DocBook XML.
The use of Texinfo for Effective awk Programming presented a problem for O'Reilly.
Their production process prefers the use of DocBook markup (particularly
the XML variant) since it may be used to produce both printed and
browsable versions of the same book. (Browsable versions are necessary
for the CD-ROM editions of their books, as well as for
the Safari Bookshelf.)
Furthermore, O'Reilly has a series design used for all their books:
the TeX output from texinfo.tex, while reasonable enough,
doesn't look anything like an O'Reilly book.
By the time of the initial discussions with O'Reilly, I had produced
four O'Reilly books in DocBook SGML, so I was quite comfortable with it.
And as the author of gawk.texi, I was also very comfortable
with Texinfo. Therefore, because both O'Reilly and I were committed to
getting Effective awk Programming published, I promised to manage the conversion from
Texinfo into DocBook for the final book production.
I reasoned that since makeinfo could already produce HTML,
and since HTML and DocBook are conceptually similar, it shouldn't
be that hard to modify the code to generate DocBook. I had
worked with the makeinfo source code in the past, so I wasn't
scared, even if I was a bit naive.
Delaying the conversion to DocBook until the end had two other related,
significant advantages. First, I was able to use the Texinfo version
for the technical review, incorporating all the changes from the review
into the documentation that would eventually ship with gawk.
And second, O'Reilly agreed to do their copy editing on a paper copy of
the Texinfo version of the manuscript. I then entered the copy edits
into the Texinfo source file, again allowing the distributed version to
benefit from O'Reilly's considerable editorial expertise.
(At this point I'd like to pause and acknowledge the significant contributions made by Chuck Toporek, my editor. His comments helped to enormously improve the organization and presentation of the material in the book. Mary Sheehan's copy edits were also very valuable. I learned a lot about good writing during the work on this book.)
Furthermore, Chuck and the rest of the people at O'Reilly bent
over backwards to make sure that they complied with the
GNU Free Documentation License
(FDL), under which the book is published. The final DocBook XML source for the
book is available from the
O'Reilly Web site.
The Texinfo version, of course, is part of the gawk
distribution.
Converting to DocBook
Fortunately, I didn't have to write the DocBook changes for
makeinfo from scratch. Philippe Martin had done the
bulk of this already, and I was able to obtain his patches to the
makeinfo source code. His code did the vast majority of what
I needed.
Philippe's version generated DocBook SGML. At the time, O'Reilly
was moving away from SGML, towards the XML version of DocBook.
The differences boiled down mostly to using lowercase for tags,
always providing a full closing tag (<emph>whatever</emph> versus
<emph>whatever</>), using the trailing-slash version of tags
that don't enclose objects (such as <xref linkend="..."/>),
and fully quoting all the parameters inside of tags (<colspec
colnum="1"/> vs. <colspec colnum=1>).
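To make these four differences concrete, here is a minimal sketch of the same transformations done as a filter in awk. It is purely illustrative: the actual changes were made inside the makeinfo source code, not with a script like this. The names sgml_to_xml, fix_tag, quote_attrs, and the empty table are hypothetical, and the sketch assumes that every tag sits on a single line and that no comment or attribute value contains an angle bracket.

```sh
# Illustrative only: sgml_to_xml, fix_tag, quote_attrs, and the "empty"
# table are hypothetical names, not part of makeinfo.
sgml_to_xml() {
    awk '
    BEGIN {
        # Tags that enclose nothing and so get a trailing slash in XML.
        empty["xref"] = 1
        empty["colspec"] = 1
    }

    {
        out = ""
        rest = $0
        # Walk the line one tag at a time, copying text between tags as is.
        while (match(rest, /<[^<>]*>/)) {
            tag = substr(rest, RSTART, RLENGTH)
            out = out substr(rest, 1, RSTART - 1)
            rest = substr(rest, RSTART + RLENGTH)
            out = out fix_tag(tag)
        }
        print out rest
    }

    function fix_tag(tag,    body, name, attrs, i) {
        if (tag == "</>")                   # SGML short end tag
            return "</" stack[depth--] ">"
        if (tag ~ /^<\//) {                 # ordinary end tag
            depth--
            return tolower(tag)
        }
        if (tag ~ /^<[!?]/)                 # comments and declarations
            return tag

        # Start tag: separate the tag name from its attributes.
        body = substr(tag, 2, length(tag) - 2)
        i = match(body, /[ \t]/)
        name = tolower(i ? substr(body, 1, i - 1) : body)
        attrs = quote_attrs(i ? substr(body, i) : "")

        if (name in empty)
            return "<" name attrs "/>"
        stack[++depth] = name               # remember it for a later </>
        return "<" name attrs ">"
    }

    # Lowercase attribute names and quote any unquoted values.
    function quote_attrs(attrs,    out, val) {
        out = ""
        while (match(attrs, /=("[^"]*"|[^" \t]+)/)) {
            val = substr(attrs, RSTART + 1, RLENGTH - 1)
            if (val !~ /^"/)
                val = "\"" val "\""
            out = out tolower(substr(attrs, 1, RSTART)) val
            attrs = substr(attrs, RSTART + RLENGTH)
        }
        return out attrs
    }
    '
}

printf '%s\n' '<PARA>See <EMPH>this</> and <XREF LINKEND="intro"></PARA>' |
    sgml_to_xml
```

Given that one line of input, the filter should print <para>See <emph>this</emph> and <xref linkend="intro"/></para>. The stack of open tag names is what lets the short </> form be expanded into a full closing tag.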
Also, Philippe's code often generated a single DocBook tag for multiple
different Texinfo commands, when in fact DocBook has tags that correspond
to the original Texinfo commands. For example, it might produce
<literal> for both @command{} and @file{}.
This needed to be fixed, so that the generated output would contain
separate <command> and <filename> tags. In other words,
as much as possible, it was necessary to preserve the semantic-based
nature of the Texinfo markup in the generated DocBook.
This work was straightforward, and over a week or two, I did the bulk of
it, getting makeinfo to the point where it produced a basic
DocBook XML version of gawk.texi on which I could do further
post-processing.
The current release of Texinfo includes Philippe's original changes, as well
as my improvements. Philippe has gone further with the development, and besides
DocBook XML, makeinfo can produce a variant of XML that uses a
Texinfo DTD that is similar to the DocBook XML DTD. Indeed, most of the reformatting
fixes described below are no longer needed with the current version. For
further details, see the Texinfo
distribution.
Making Usable DocBook
Generating technically correct DocBook markup was just the beginning of the
process. While the file might go through an XML parser without any problems,
it would still need to be readable, so that O'Reilly's production editors could
work with it directly. It also needed to adhere to O'Reilly's markup conventions,
such as the id="..." parameter in <chapter>
and section tags, and in <xref> tags for cross references.
There was still a ways to go.
General Cleanups
First, the makeinfo output needed lots of simple cleanups. Some
of these related to anomalies in the output, others to removing Texinfo-specific
output features which were better expressed using different fonts in DocBook.
The first script, fixup.awk, evolved to handle many of these. This
section presents the most interesting of the changes that had to be made.
makeinfo generated some boilerplate material at the front of
the file that wasn't necessary for O'Reilly's DocBook tools. It looked like this:
<!-- This is /home/arnold/ORA/db/gawk.sgml, produced by makeinfo
version 4.0 from gawk.texi. --><para>
<!DOCTYPE book PUBLIC "-//Davenport//DTD DocBook V3.0//EN">
<book>
<title>The GNU Awk User's Guide</title>
</para>
Notice that the <para> and </para> tags
are misplaced. This early version of makeinfo was overzealous
about wrapping things in paragraph tags. The first part of fixup.awk
strips off this leading junk. The first rule looks for the
first <chapter> tag and sets a flag when it sees one. The
second rule checks the flag: if it hasn't been set yet, the next
statement discards the current line and moves on to the next line of input:
#! /bin/gawk -f
# strip leading gunk from file
/<chapter/ { chapter_seen = 1 }
! chapter_seen { next }
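As a quick way to try these two rules in isolation, the sketch below wraps them in a small shell function and feeds them a few lines of sample input. The function name strip_leading, the sample input, and the final print rule are assumptions added for the demonstration; the excerpt above does not show how fixup.awk eventually prints its lines.

```sh
# Hypothetical wrapper around the two rules shown above.
strip_leading() {
    awk '
        /<chapter/      { chapter_seen = 1 }   # first chapter tag sets the flag
        ! chapter_seen  { next }               # flag not set yet: drop the line
                        { print }              # assumed: pass the rest through
    '
}

printf '%s\n' '<!-- produced by makeinfo -->' '<book>' \
    '<chapter id="intro">' '<para>Hello</para>' | strip_leading
```

Only the <chapter id="intro"> line and the paragraph after it are printed. Because the flag is set before it is tested, the line containing the first <chapter> tag is itself kept.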
The next bit removes trailing white space (space and TAB characters) and removes
leading spaces inside lists and examples. The first rule uses the sub()
function to remove trailing white space wherever it occurs. (This is needed only
because I find such white space gets in the way when editing.)
The in_term variable indicates being inside the terms of a variable
list. Inside list item bodies or examples, the stripspaces
variable is true (non-zero), so the sub() function removes all
leading spaces. The closing tags decrement stripspaces, returning it
to false (zero) once the outermost list item or example ends:
# strip trailing white space
/[ \t]+$/ { sub(/[ \t]+$/, "") }
# strip leading spaces inside lists
/<listitem>/ { stripspaces++ ; in_term = 0 }
/<\/listitem>/ { stripspaces-- }
# fix up examples
/<screen>/ { in_screen++ ; stripspaces++ }
stripspaces != 0 { sub(/^ +/, "") }
/<\/screen>/ { in_screen-- ; stripspaces-- }
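The effect of these rules is easiest to see on a small sample. The sketch below is a hedged demonstration rather than part of fixup.awk: the function name strip_space, the sample input, and the final print rule are assumptions.

```sh
# Hypothetical wrapper around the rules shown above.
strip_space() {
    awk '
        # strip trailing white space
        /[ \t]+$/         { sub(/[ \t]+$/, "") }
        # strip leading spaces inside lists
        /<listitem>/      { stripspaces++ ; in_term = 0 }
        /<\/listitem>/    { stripspaces-- }
        # fix up examples
        /<screen>/        { in_screen++ ; stripspaces++ }
        stripspaces != 0  { sub(/^ +/, "") }
        /<\/screen>/      { in_screen-- ; stripspaces-- }
                          { print }            # assumed: emit the cleaned line
    '
}

printf '%s\n' '  <para>indented</para>   ' '<screen>' '    $ ls -l' \
    '</screen>' '  after' | strip_space
```

The first and last lines keep their leading spaces because stripspaces is zero there, while the line inside <screen> comes out flush left. Rule order matters: the </screen> rule comes after the stripping rule, so the closing tag's own line is still stripped before the counter drops back to zero.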
The Texinfo command @var{} is used to describe something that
is variable, such as a value that the user supplies. It corresponds to the
DocBook <replaceable> tag. In an O'Reilly book, <replaceable> items
get printed in a Constant Width Italic font. This is entirely
appropriate in most contexts, such as within examples, or in lists
where items represent a combination of a command and its parameters.
However, O'Reilly conventions indicate that variable items should be in regular italics when used in prose discussion. For example:
<!-- Correctly marked up DocBook XML -->
<variablelist>
<varlistentry><term>
<literal>ls -l</literal> <replaceable>file</replaceable>
</term>
<listitem><para>
The <command>ls</command> command with the <option>-l</option> option gives
extra information about <emphasis>file</emphasis>.
</para></listitem>
</varlistentry>
...
</variablelist>
The generated DocBook used <replaceable> everywhere. This next bit of code
makes the context-sensitive transformation for us: