Keeping Things Tidy
Whether your HTML gets generated automatically or you write it by hand, chances are it is in need of some tidying. And, appropriately enough, next up is tidy, a mild-mannered name for a super program. Tidy, as the man page says, "reads HTML, XHTML, and XML files and writes cleaned-up markup." The tidy project consists of two pieces: TidyLib (a library that can be used by other applications) and the command-line utility we'll be using.
Let's say the political website you posted that screed to has been getting a lot of responses (no wonder, with all your inflammatory statements!), including one sent to you in an email attachment -- an HTML file from a secret contact in the government with some damaging details you want to get on your site as quickly as possible. One problem: the secretstuff.html file was apparently generated by a word processor and has some of the worst markup you've ever seen. It'll take hours to clean up for your site!
Here's the basic command:
tidy -output secretstuff_tidy.html secretstuff.html
We've asked tidy to create a new HTML file (using -output) rather than dumping it to the Terminal. Alternatively, you could have it modify the file in place, as long as you either have a backup or are confident you won't be losing anything. The command would become:
tidy -modify secretstuff.html
As it works through the input file, tidy reports any errors and warnings it finds to the Terminal, along with line numbers, which are handy if there are errors tidy can't fix. Which brings up one of the best uses of the utility -- as a syntax checker that works on your local computer rather than depending on web-based validators. (You can even test certain accessibility aspects.)
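Here's a minimal sketch of that syntax-checker use. The sample.html file and its contents are made up for illustration; -errors and -quiet are tidy's own flags, and the exit status values are the ones listed in the man page.

```shell
# Hypothetical sample file with an unclosed <b> tag and no <title>.
cat > sample.html <<'EOF'
<html><body><p>Some <b>bold text</p></body></html>
EOF

# -errors (-e): report problems only, without writing cleaned-up markup.
# -quiet (-q): suppress the banner and other nonessential output.
tidy -errors -quiet sample.html
status=$?

# Per the man page: 0 = no problems, 1 = warnings, 2 = errors.
case $status in
  0) echo "clean" ;;
  1) echo "warnings only" ;;
  2) echo "errors found" ;;
  *) echo "tidy did not run (status $status)" ;;
esac
```

Because the result comes back as an exit status, the same check drops easily into a script that refuses to publish a page until tidy is happy with it.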
One of the test files I fed tidy while writing this article was a word-processor-generated HTML document, a 34KB file, which tidy told me had at least 71 warnings and 12 errors. And then a message came up:
This document has errors that must be fixed before using HTML tidy to generate a tidied up version.
Whoopsie! This may happen at times, when things are just so FUBAR even tidy can't handle them. The majority of the time, however, you won't have to worry about it.
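When it does happen, one thing worth trying is tidy's force-output option, which asks for a best-effort result even when errors were reported. A rough sketch follows; the broken.html file and its made-up <foo> element are only there to provoke an error, and the output may well need hand-checking.

```shell
# Hypothetical input: <foo> is not an HTML element, so tidy reports an error.
cat > broken.html <<'EOF'
<html><head><title>Broken</title></head>
<body><foo>mystery element</foo><p>Real content</body></html>
EOF

# --force-output yes tells tidy to write its best attempt despite the errors.
tidy --force-output yes -output broken_tidy.html broken.html
echo "tidy exit status: $?"
```

Treat the result with suspicion: tidy may have thrown away whatever it couldn't understand, so compare it against the original before you post it.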
Tidy has a ton of options for checking and converting markup, a list of which you can find at the HTML Tidy Configuration Options Quick Reference. (The man page is less than complete.) These options can be specified on the command line or via a config file. Tidy also deals with standard input and output like any other Unix utility, so you could take generated HTML from someplace else, pipe it through tidy, and then pipe that to the next utility for further processing.
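To make the two styles concrete, here is a small sketch. The file names and the particular settings are just illustrative choices; the option names (indent, wrap, tidy-mark, quiet) come from the quick reference, and the printf line stands in for whatever program is generating your HTML.

```shell
# Hypothetical config file: one "option: value" pair per line.
cat > tidy.conf <<'EOF'
indent: auto
wrap: 72
tidy-mark: no
quiet: yes
EOF

# Stand-in for HTML arriving from another program on standard input.
# With no input file named, tidy reads stdin and writes to stdout.
printf '<html><head><title>Test</title></head><body><p>Hello</body></html>\n' |
  tidy -config tidy.conf > generated_tidy.html

# The same settings given directly on the command line (--option value).
tidy --indent auto --wrap 72 --tidy-mark no -quiet \
  -output generated_tidy2.html generated_tidy.html
```

A config file earns its keep once you find yourself typing the same handful of options over and over.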
Here's an example. Let's combine what we've looked at so far. Use the following (on one line):
textutil -stdout -convert html screed.txt | tidy -output screed_tidy.html
First we have textutil convert a plain text file into HTML and (using -stdout) send it to standard output. The all-powerful pipe takes that output and hands it off (through standard input) to tidy, which reads standard input because we haven't named an input file. Tidy gives the HTML its usual checks, then outputs to the file screed_tidy.html.
Of course, if we leave off -output, tidy writes to standard output instead, and at that point we could pipe it to another command -- say, /usr/bin/zip for compression or /usr/bin/pbcopy to copy it all to the Mac OS X Clipboard.
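A quick sketch of both ideas follows. The generate_html function is a made-up stand-in for textutil (or any other HTML source) so the example runs anywhere; the archive name is arbitrary too.

```shell
# Hypothetical stand-in for a command that emits HTML on standard output.
generate_html() {
  printf '<html><head><title>Screed</title></head><body><p>Text</body></html>\n'
}

# No -output flag, so tidy sends the cleaned markup to standard output.
# Giving zip "-" as the file to add makes it compress standard input.
generate_html | tidy -quiet | zip screed_tidy.zip -

# pbcopy exists only on Mac OS X, so check for it before using it.
if command -v pbcopy >/dev/null 2>&1; then
  generate_html | tidy -quiet | pbcopy
fi
```

On the Mac, swapping generate_html for textutil -stdout -convert html screed.txt gives you the real thing. Note that tidy's warnings go to standard error, so they show up in the Terminal without polluting what gets piped along.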
Some GUI text editors, such as Smultron, have command-line components that let you interact from the Terminal to the GUI. So you could pipe that HTML to smultron. Alternatively, if you have Quicksilver and its qs command-line tool installed, you could send that HTML pretty much anywhere, to any application, or any person.