 Published on MacDevCenter (http://www.macdevcenter.com/)


HTML Tools on the Mac Command Line

by Robert Daeley
11/22/2005

In a recent blog entry here (The Tell-Tail Heart), I covered using the tail utility -- one of the many text-editing and manipulation programs available on the Mac OS X command line. And when it comes right down to it, HTML editing is text editing. There will be times -- such as SSHing into your server -- that being able to do things via the CLI will be invaluable. Even if you're working on your local development box, having some powerful utilities in your tool belt can't hurt at all.

I'll be focusing on how a few of these utilities can help while working with HTML on Mac OS X. If you haven't already, you'll need to install the Developer Tools, available on your install disks or from developer.apple.com. Also, the following assumes you are using Tiger (10.4) and are familiar with using the Terminal and bash shell. It may also apply to earlier system versions, but I don't have any of those available to confirm.

Not sure if you've got the utilities? Open up a Terminal window, type which foo, and hit return, where foo is the name of the utility in question. If you have foo, the system will respond with the pathname where it lives. In the case of the utilities below, they will probably be located in /usr/bin.
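If you'd rather check several utilities in one pass, a small shell function can do the asking for you. This is only a sketch: report_tools is a name I made up, and it uses the shell builtin command -v in place of which so that it behaves the same in bash and plain sh.

```shell
# report_tools is a hypothetical helper name, not a standard command.
# For each name given, print where the shell finds it, or say it is missing.
report_tools() {
    for tool in "$@"; do
        if location=$(command -v "$tool"); then
            echo "$tool: $location"
        else
            echo "$tool: not found"
        fi
    done
}

report_tools textutil tidy diff opendiff vimdiff
```

Anything reported as "not found" most likely means the Developer Tools still need installing.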

Utilitarian

textutil is something of a Rosetta stone for different file types. It allows you to convert amongst various text formats more or less instantly and without a fuss: txt, html, rtf, rtfd, doc, wordml, and webarchive. One of my favorite uses is to feed it a bunch of paragraphs of text from somebody and get nice HTML-tagged paragraphs out.

Basic usage is quite easy. Let's say you have a plain text file -- five paragraphs of political ranting called screed.txt -- that you want to post on a website. You're still white-hot angry over whatever it was you were ranting about and would rather not spend any more time HTMLizing the whole thing. Here's the command:

textutil -convert html screed.txt

This produces a file called screed.html in the same directory, leaving the original screed.txt file alone. That new file is a complete HTML document, even down to the DOCTYPE declaration at the beginning. You'll find a bit of CSS code, but otherwise it's fairly clean. Anything you don't want you can remove quickly in your favorite text editor. One thing I've noticed in this regard is that it will take blank lines between text paragraphs and attempt to replicate the space using a <p><br></p> combo -- you can eliminate the blank lines in the original .txt file, or simply find and replace the combos in the HTML.
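For the find-and-replace route, sed can drop those spacer paragraphs in one pass. A minimal sketch, assuming each spacer sits on a line of its own and may or may not carry a class attribute; strip_spacers and screed_clean.html are names I made up for the example.

```shell
# strip_spacers is a hypothetical helper name.
# Delete lines that hold nothing but an empty <p ...><br></p> spacer paragraph.
strip_spacers() {
    sed '/^[[:space:]]*<p[^>]*><br><\/p>[[:space:]]*$/d' "$@"
}

# Write a cleaned copy, leaving screed.html itself untouched.
if [ -f screed.html ]; then
    strip_spacers screed.html > screed_clean.html
fi
```

Because the function passes its arguments straight to sed, it also works in the middle of a pipe when given no file name at all.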

At this point, you could upload the file to your server, or (if you only need the content portion) copy and paste the <p></p> paragraphs wherever you need them.

There are quite a number of command-line options available for textutil, too many to go into in this article, but there are a couple of techniques I'd like to highlight. Check out man textutil to see them all.

First off is some metadata handling. Let's say I wanted to make sure my screed gets a proper title and attribution in the HTML head section. Here's one way. Type on one line:

textutil -convert html -title "Death to all extremists" -author "Robert Daeley" screed.txt

Now the <title>Death to all extremists</title> shows up in our HTML file, as well as an author metatag. You also have subject, keywords, comment, editor, and other metadata available to use.
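A quick way to confirm the title made it into the file without opening an editor is to pull it back out with sed. This is a sketch that assumes the title element sits on a single line of the generated HTML; html_title is a hypothetical helper name.

```shell
# html_title is a hypothetical helper name.
# Print the text between <title> and </title>, assuming both are on one line.
html_title() {
    sed -n 's:.*<title>\(.*\)</title>.*:\1:p' "$1"
}

if [ -f screed.html ]; then
    html_title screed.html
fi
```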

Another one of my favorite textutil uses is the wildcard functionality. If you have not one file but a directory of screed text files that you want to convert into a single HTML page, it's as simple as doing this (on one line):

textutil -cat html -output screed.html -title "Screedorama" -author "Robert Daeley" *.txt

The -cat argument takes the contents of all the *.txt files in the working directory and outputs them to screed.html, with the given title and author.

Keeping Things Tidy

Whether your HTML gets generated automatically or you do it by hand, chances are it is in need of some tidying. And, appropriately enough, next up is tidy, a mild-mannered name for a super program. Tidy, as the man page says, "reads HTML, XHTML, and XML files and writes cleaned-up markup." The tidy project consists of two pieces, TidyLib (a library that can be used by other applications) and the command-line utility we'll be using.

Let's say the political website you posted that screed to has been getting a lot of responses (no wonder, with all your inflammatory statements!), including one sent to you in an email attachment -- an HTML file from a secret contact in the government with some damaging details you want to get on your site as quickly as possible. One problem: the secretstuff.html file was apparently generated by a word processor and has some of the worst markup you've ever seen. It'll take hours to clean up for your site!

Unless...

Here's the basic command:

tidy -output secretstuff_tidy.html secretstuff.html

We've asked tidy to create a new HTML file (using -output) rather than dumping it to the Terminal. Alternatively, you could have it modify in place, as long as you either have a backup or are confident you won't be losing anything. The command would become:

tidy -modify secretstuff.html
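If you go the in-place route, making the backup part of the command means you can't forget it. A sketch only: tidy_in_place is a hypothetical wrapper, and the .bak suffix is just my choice.

```shell
# tidy_in_place is a hypothetical wrapper name.
# Copy the file to FILE.bak first, then let tidy rewrite the original.
# If the copy fails, stop before tidy touches anything.
tidy_in_place() {
    cp -p "$1" "$1.bak" || return 1
    tidy -modify "$1"
}

# Usage: tidy_in_place secretstuff.html
```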

As it works through the input file, tidy prints any errors it finds to the Terminal, along with line numbers, which are handy if there are errors tidy can't fix. That brings up one of the best uses of the utility -- as a syntax checker that works on your local computer rather than depending on web-based validators. (You can even test certain accessibility aspects.)
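To use tidy purely as a checker across a batch of files, you can lean on its exit status: 0 for a clean file, 1 for warnings, and 2 for errors. The following is a rough sketch rather than anything official; check_html is a hypothetical name, and it relies on tidy's -errors flag (report problems only, write no markup) and -quiet flag.

```shell
# check_html is a hypothetical helper name.
# Classify each file by tidy's exit status without writing any output files.
check_html() {
    for file in "$@"; do
        status=0
        tidy -quiet -errors "$file" >/dev/null 2>&1 || status=$?
        case $status in
            0) echo "$file: clean" ;;
            1) echo "$file: warnings" ;;
            *) echo "$file: errors (or tidy could not run)" ;;
        esac
    done
}

# Usage: check_html *.html
```

Run tidy -errors by hand on anything flagged to see the individual messages and line numbers.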

One of the test files I fed tidy while writing this article was a word-processor-generated HTML document, a 34KB file, which tidy told me had at least 71 warnings and 12 errors. And then a message came up:

This document has errors that must be fixed
before using HTML tidy to generate a tidied up version.

Whoopsie! This may happen at times, when things are just so FUBAR even tidy can't handle them. The majority of the time, however, you won't have to worry about it.

Tidy has a ton of options in checking and converting markup, a list of which you can find at the HTML Tidy Configuration Options Quick Reference. (The man page is less than complete.) These options can be specified on the command line or via a config file. Tidy also deals with standard input and output like any other Unix utility, so you could take generated HTML from someplace else, pipe it through tidy, and then pipe that to the next utility for further processing.
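As an illustration of the config-file route, here is a sketch that writes a small configuration and hands it to tidy with -config. The file name tidy_config.txt and the particular settings are my own picks for the example, not recommendations; consult the Quick Reference for what each option does.

```shell
# Write a small tidy configuration file (the file name is arbitrary).
cat > tidy_config.txt <<'EOF'
indent: auto
wrap: 72
tidy-mark: no
quiet: yes
EOF

# Apply it only if tidy is installed and the input file is present.
if command -v tidy >/dev/null 2>&1 && [ -f secretstuff.html ]; then
    tidy -config tidy_config.txt -output secretstuff_tidy.html secretstuff.html
fi
```

Keeping the options in a file saves retyping them and makes it easy to share one set of rules among everyone working on a site.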

Here's an example. Let's combine what we've looked at so far. Use the following (on one line):

textutil -stdout -convert html screed.txt | tidy -output screed_tidy.html

First we have textutil convert a plain text file into HTML and (using -stdout) send it to standard output. The all-powerful pipe takes that output and hands it off (through standard input) to tidy, which gives the html its usual checks, then outputs to the file screed_tidy.html.

Of course, if we left off -output so that tidy wrote to standard output instead, we could pipe the result to another command -- say, /usr/bin/zip for compression or /usr/bin/pbcopy to copy it all to the Mac OS X Clipboard.
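Here is what the Clipboard variation might look like wrapped up as a function. Treat it as a sketch: txt_to_clipboard is a hypothetical name, and tidy is run without -output so that its result goes to standard output for pbcopy to pick up.

```shell
# txt_to_clipboard is a hypothetical helper name.
# Convert a text file to HTML, tidy it, and put the result on the Clipboard.
txt_to_clipboard() {
    textutil -stdout -convert html "$1" | tidy -quiet | pbcopy
}

# Usage: txt_to_clipboard screed.txt
```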

Some GUI text editors, such as Smultron, have command-line components that let you interact from the Terminal to the GUI. So you could pipe that HTML to smultron. Alternatively, if you have Quicksilver and its qs command-line option installed, you could send that HTML pretty much anywhere, to any application, or any person.

Powerful stuff.

Vive la Difference!

As a website admin who oversees sites that multiple people work on, I often need to compare pairs of files to see what's different between them -- like between an edited file and a backup, for instance. Linux/Unix users will be familiar with the diff utility, which I'll be touching on here, along with some diff enhancements.

diff

Once again, diff has more options than this article can cover, but let's start with a basic scenario.

Let's say your screed.html file gets uploaded to the server. A few days later, one of your fellow politicos lets you know he's edited the server's version of the file, adding a paragraph of text and making a nit-picking edit somewhere else.

In a huff, you download the file from the server, saving it as screed_edit.html in the same local directory as the original file. Now, let's find out what your associate did to your fine prose:

diff screed.html screed_edit.html

Here's what diff said:

20c20
< elit. Fusce arcu eros, sollicitudin vitae
---
> rcu eros, sollicitudin vitae
71a72
> <p>Aliquam imperdiet nonummy risus. Vivamus in lectus.

Looks like he deleted a few words from a sentence on line 20 (the c means that line changed), and his new paragraph was added after line 71, which makes it line 72 of the edited file. Good to know.
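If the 20c20 and 71a72 notation is new to you, it is easier to read on a pair of tiny files. The sketch below builds two throwaway files in a temporary directory (the file names are invented for the demo) and diffs them.

```shell
# Build two tiny files so the change commands are easy to read.
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/old.txt"
printf 'one\n2\nthree\nfour\n' > "$workdir/new.txt"

# diff exits with status 1 when the files differ, so test it in an "if".
if diff "$workdir/old.txt" "$workdir/new.txt" > "$workdir/changes.txt"; then
    echo "files are identical"
else
    cat "$workdir/changes.txt"
fi
```

The output starts with 2c2, meaning line 2 of the first file was changed into line 2 of the second, followed by the old text marked with < and the new text marked with >. Next comes 3a4: a line was added after line 3 of the first file, and it appears as line 4 of the second.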

If you'd rather view the files side by side, expand your Terminal window and use this:

diff -y screed.html screed_edit.html

Using the -y flag, you'll get both files side by side in their entirety, with indicators in the whitespace between them where differences were found. This might be a little unwieldy with large files, so try this (on one line):

diff -y --suppress-common-lines screed.html screed_edit.html

This acts like a combination of the first two diff commands, presenting just the differences, but doing it side by side.

Now, let's really get busy. Remember our pipe command above? What if we wanted to compare the resulting html with an already existing file, like the edited one we downloaded? On one line, type:

textutil -stdout -convert html screed.txt | tidy | diff -y --suppress-common-lines - screed_edit.html

Phew! OK, let's take it step by step. The textutil section you already know, converting .txt to .html. This marked-up text is then piped to tidy, which does its magic on the HTML. And finally, the results are piped to diff, which compares the new code with an existing file called screed_edit.html (the hyphen by itself tells diff to use standard input -- i.e., from our pipe) and displays just the differences in two columns side by side.
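The lone hyphen is the part that tends to trip people up, so here it is in isolation. In this sketch printf stands in for the textutil and tidy stages, and the file names are made up for the demo.

```shell
workdir=$(mktemp -d)
printf '<p>Hello</p>\n<p>World</p>\n' > "$workdir/saved.html"

# printf stands in here for the textutil | tidy stages of the pipeline.
# The lone hyphen tells diff to read its first "file" from standard input.
if printf '<p>Hello</p>\n<p>World!</p>\n' | diff - "$workdir/saved.html" > "$workdir/changes.txt"; then
    echo "generated HTML matches the saved copy"
else
    cat "$workdir/changes.txt"
fi
```

Since the piped text takes the place of the first file, its lines are the ones marked with < in the output.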

opendiff and FileMerge

Apple's Developer Tools come with several apps that are useful both individually and in concert. One of those that works well even on its own is FileMerge, a GUI program that displays and merges two files together. And it has a CLI utility that can launch and feed it input, called opendiff.

With your working directory littered with .html files now, it might be a good idea to check things out:

opendiff screed_tidy.html screed_edit.html

That was easy. Now that FileMerge has been launched, we can view and edit both files in a nice GUI window. Even better, you can use its Actions popup to merge the two. Another handy feature is comparing two files to a common ancestor:

opendiff -ancestor screed.html screed_tidy.html screed_edit.html

vimdiff and ediff

Not to be outdone, my personal favorite editor vim can do something very similar with its vimdiff utility, which will present the two files in a split window with the differences highlighted.

vimdiff -O screed.html screed_edit.html

That's a capital letter O, by the way. This will present the two files side by side, in other words with a vertical split. To split the window horizontally instead, use -o.

In either case, a maximized Terminal window will make it a lot easier to see everything.

And while I'm not familiar enough with it to write about it, Emacs does have its own diff mode, known as ediff, which you can read about in the ediff user manual.

Final Thoughts

We've just scratched the surface of what's available to HTML jockeys on the Mac OS X command line. These utilities have a lot more depth that you'll enjoy playing with, and of course there are numerous other programs available in /usr/bin to help you out. Go exploring in there sometime and pull up a man page on ones that look interesting. Or just issue the command whatis foobar to get a brief description.

The combination of CLI and GUI tools, as well as "glue apps" like Quicksilver that combine the power of both, is one of the most exciting parts of working on Mac OS X.

Robert Daeley is a writer and programmer in Southern California. By day he is a mild-mannered server administrator and website developer; by night, in addition to his super-hero duties, he cooks, bikes, hikes, cheers on the Dodgers, and writes fiction.



Copyright © 2009 O'Reilly Media, Inc.