Structured grep and Python
05/23/2001
When text files are structured, like HTML, XML, or even news or mail files, you can take advantage of that structure in your search. You can search for words that appear within certain tags, like in the title element of an HTML document, or within the From field of a mail file. All you need is a tool that understands the structure of your text.
Jani Jaakkola and Pekka Kilpeläinen's structured text search and
index tool, sgrep, handles all structured text in a generic
way. Sgrep's expression language allows you to provide details about
the structure to sgrep so it can find exactly what you want. It also
has a printf-like format for printing the output. It's generic because you supply information about the structure with each search. Since typing out
all those details can be tedious, sgrep can use a preprocessor like m4 to read macros. For example, in the expression
sgrep -o"%f:%r\n" '"Stephen" IN HTML_TITLE' ~/public_html/*.html
HTML_TITLE is a macro for
(( ( "<TITLE>" or ( ("<TITLE " or "<TITLE\t" or \
"<TITLE\n") .. ">")) .. ( "</TITLE>" ) ))
This sgrep expression looks for the name "Stephen" in the contents
of title tags in the HTML documents in your public_html
directory. sgrep prints the file name, a colon, and then the text of
the matching region, as specified by the -o option.
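If you want to drive that same command from a script, a rough Python sketch might look like the following. The helper names (build_sgrep_command, run_sgrep, parse_matches) are my own inventions, not part of sgrep, and the parser assumes each result fits on one line and that file names contain no colons, which will not always hold.

```python
import subprocess


def build_sgrep_command(expression, files, output_format="%f:%r\\n"):
    """Return the argument list for an sgrep call like the one shown above."""
    # The format string is passed through untouched; sgrep itself
    # interprets %f (file name), %r (matching region), and \n.
    return ["sgrep", "-o" + output_format, expression] + list(files)


def run_sgrep(expression, files):
    """Run sgrep and return its standard output (requires sgrep on PATH)."""
    command = build_sgrep_command(expression, files)
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.stdout


def parse_matches(output):
    """Split 'file:region' lines into (file, region) pairs."""
    matches = []
    for line in output.splitlines():
        if not line:
            continue
        # Split on the first colon only, so colons in the region survive.
        filename, _, region = line.partition(":")
        matches.append((filename, region))
    return matches
```

Keeping the command builder and the parser apart from the call that actually runs sgrep means you can try them out on a machine where sgrep isn't installed.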
OK, it isn't pretty, and it isn't a quick thing to learn either. To define those macros you have to use m4, and sgrep's expressions are only slightly more understandable than the equivalent regular expressions would be. Your reward for learning it is a powerful command-line tool for searching your documents, and a set of macros that makes searching a breeze. You also have one tool that can work on any kind of structured text.
sgrep isn't new. The last version was released in
1998. What's new is Dave Kuhlman's PySgrep, a Python extension
module to call and control sgrep. PySgrep lends the power
of Python to sgrep, but it doesn't free you from understanding sgrep's
language (although you could write your own Python macros and avoid
m4). PySgrep uses a callback object to handle the results of a query. Whenever there is a carriage return in the result stream, PySgrep calls the callback object's write method.
Layering the power of Python over the fast searching power of
sgrep, you could use the callback to further refine the
search, perhaps using a regular expression you couldn't have used with
sgrep alone. You could put the information into a list
instead of pumping it to standard out. You could use it as the basis
for a search engine for your site. There are many possibilities.
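To make the callback idea concrete, here is a minimal sketch of an object with a write method that refines results with a regular expression and collects them in a list. The class name FilteringCollector is hypothetical, and I have left out the PySgrep call that would receive the object, since its exact signature isn't covered here. The loop at the bottom feeds in made-up sample strings that stand in for what sgrep would send.

```python
import re


class FilteringCollector:
    """Hypothetical callback: keeps only results matching a regular expression."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)
        self.results = []

    def write(self, text):
        # Strip the line ending, then keep the result only if the
        # regular expression is found somewhere in it.
        line = text.rstrip("\r\n")
        if self.pattern.search(line):
            self.results.append(line)


# Stand-in for the result stream; these strings are invented sample data.
collector = FilteringCollector(r"\bStephen\s+F\w+")
for chunk in [
    "index.html:Stephen Figgins\n",
    "notes.html:Stephen King\n",
    "about.html:Stephenson\n",
]:
    collector.write(chunk)

print(collector.results)
```

Because the object only needs a write method, the same class could just as easily write to a file or update an index instead of appending to a list.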
One of PySgrep's weak points, however, is how you access the files to be queried. PySgrep will take information from stdin, or it will open files named in a file. I would like to pass the file names directly to the query call, ideally as a Python data structure such as a list, but PySgrep doesn't currently support that.
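Until that changes, one possible workaround is to write your Python list out to a temporary file and hand over that file's name. This is only a sketch: write_file_list is a hypothetical helper, and the one-name-per-line layout is my assumption about what the file of names should look like, so check it against the PySgrep documentation.

```python
import os
import tempfile


def write_file_list(paths):
    """Write paths to a temporary file, one per line, and return its name."""
    # delete=False keeps the file on disk after it is closed, so another
    # program can open it by name; the caller must remove it afterwards.
    handle = tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False)
    with handle:
        for path in paths:
            handle.write(path + "\n")
    return handle.name


list_name = write_file_list(["index.html", "about.html"])
# ... pass list_name to the search here ...
os.remove(list_name)
```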
Darrel Gallion's swigSgrep was an
earlier attempt to match sgrep with Python. swigSgrep
had better support for handling files but did not support sgrep's m4
macro facility. Its interface is clunkier as well, being a straight
SWIG of the sgrep program. Kuhlman's effort seems clearer to me. If
you work with a variety of structured text files, you should take a
look at PySgrep.
Stephen Figgins administers Linux servers for Sunflower Broadband, a cable company.