Build Your Own Blogging Application, Part 2

by Matthew Russell

In part one, we worked through building a front end for our blogging app using Tcl/Tk and some XHTML fundamentals. In this continuation, I'll use two parts Perl and a sprinkle of Bash to show you how to build the back end. At the end of this lesson, you'll be blogging like a champion and have several new tools at your disposal.

Meet Perl

Perl isn't your grandmother's timid friend you met at her last birthday party. Perl is sassy, powerful, and prefers things to be short, sweet, and to the point. More accurately, Perl is the Practical Extraction and Report Language. Larry Wall released Perl in 1987, after awk (type man awk for a description) proved inadequate for a project he was working on at the time. You may not have realized it, but you already have a plethora of Perl documentation at your fingertips. In a terminal, type man perl for an overview of Perl or man perlintro for a nice introduction to programming in Perl. I won't summarize an entire introduction to Perl here, but I will highlight some of the particulars relevant to our discussion.

The canonical Hello World! program takes a single statement in Perl. Notice that the print statement takes an argument in double quotes and that statements end in a semicolon. The \n is the newline character that tells Perl to break the line after displaying the string. Single quotes also delimit strings, but \n is not interpreted inside them, so it would be printed literally.

#!/usr/bin/perl

print "Hello World\n";

The first line is a special directive that tells the shell to execute the program with Perl. In order for the file to be executable, however, permissions must be set correctly. In your terminal, while in the same directory as the script, type chmod u+x followed by the script's file name. The command chmod u+x changes the file's permissions so that its owner can execute it.

Type man chmod in the terminal for more info on file permissions. Depending on your terminal settings, executable files may also appear red in directory listings, which is a helpful setting. To see colorized directory listings, a great convenience, type ls -G whenever you want a directory listing. As with almost any command in the shell, man ls will give you more of the details.

Regular Expressions

An essential concept you'll need in order to exploit Perl's potential is that of a regular expression, commonly referred to as a 'regex.' Regexes describe patterns, and learning just a few rudiments will give you all the power you need to get up to speed. I'll cover some of the basics here, but you'll want to reference the tutorial obtained by typing man perlretut in your terminal if you need extra help.

Given some text, a regex determines if it matches a particular pattern designated by characters and special metacharacters. The basic operator for performing pattern matching is m//. Often, the m, which stands for 'match', is dropped, and the operator simply becomes //. The designated pattern to match the string value against is placed between the pair of forward slashes.

Given that we know a regex consisting of a single word matches any string that contains the word, we can deduce that the following statement from the Perl Regex Tutorial will display 'It matches'.


if ("Hello World" =~ m/World/) {
	print "It matches\n";
} else {
	print "It doesn't match\n";
}

There are a couple of important details to notice here. One is that 'World' is in fact a regular expression designated by the concatenation of the five characters that compose it. Whitespace is treated like any other character, so the pattern 'W o r l d' would not have matched. Values placed after one another are assumed to be concatenated in regexes unless special regex operators dictate otherwise.

Another important detail is that the binding operator =~ associates the regex on its right with the string value on its left and returns a boolean value based on whether or not the string matches the pattern. Finally, the syntax for an if statement is like some you've probably already seen, and Perl uses curly braces for groupings.
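To make the points about whitespace and the boolean result concrete, here is a minimal sketch. The helper name matches is hypothetical and exists only for this illustration; it wraps the binding operator and returns 1 or 0 so that the result is easy to print.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the blogging app): returns 1 if $text
# matches $pattern and 0 otherwise.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ m/$pattern/ ? 1 : 0;
}

print matches("Hello World", "World"), "\n";      # prints 1: the word appears in the string
print matches("Hello World", "W o r l d"), "\n";  # prints 0: spaces in the pattern are literal
print matches("Hello World", "o W"), "\n";        # prints 1: a space in the pattern matches a space
```

The second call fails to match because every space in the pattern must line up with a space in the string.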

A slightly more complex example applicable to our blogging app is contained in the following lines of code:

$_ = 
"<img src='/q/image.gif' height='1' width='5' />";

if (m/<img src='(.*\/)(.*?)'(.*) \/>/) {
	print "Group 1: $1\n"; #path
	print "Group 2: $2\n"; #filename
	print "Group 3: $3\n"; #attributes
} else {
	print "It doesn't match\n";
}

You'll be pleased to know that this example is actually very simple. A special variable in Perl, the $_ variable (scalar variables in Perl have a $ prefix), allows us to abbreviate this conditional statement by not explicitly using the binding operator.

In this example, the regex between the forward slashes of the m// operator gets compared with the string value contained in the variable $_ since no alternative is given by the binding operator. The terse convention involving the $_ variable is especially convenient when parsing a file because each line read by a while loop over a filehandle is automatically loaded into the $_ variable as well. Finally, note that the assignment statement is broken across a line, which is fine to do in Perl.
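As a rough sketch of how $_ behaves when reading line by line, the following counts the lines that contain an image tag. The helper name count_img_lines and the sample text are assumptions made for this illustration, and the string is opened as an in-memory file so that the example does not depend on any file on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the lines of $text that contain an <img tag.
sub count_img_lines {
    my ($text) = @_;
    my $count = 0;
    local $_;   # keep our use of $_ from leaking to the caller

    # Open the string as an in-memory file (illustration only).
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    while (<$fh>) {            # each line read is placed in $_
        $count++ if /<img /;   # no =~ needed: the match is tested against $_
    }
    close($fh);
    return $count;
}

my $sample = "<p>Hello</p>\n<img src='a.gif' />\n<p>Bye</p>\n<img src='b.gif' />\n";
print count_img_lines($sample), "\n";   # prints 2
```

With a real file, only the open call would change; the loop body stays the same.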

Decomposing the regular expression teaches us several new concepts about regexes: character groupings, regex metacharacters, escape characters, and greediness (a technical term). In the regex given, the parenthetical subexpressions define three character groupings. If the overall expression matches against the value of the $_ variable, Perl loads the special variables $1..$N, where N is the number of groupings, with the text matched by each parenthetical expression, and we can then use them for whatever purpose we like. In this example, you'll get the following output from the script if you run it:

Group 1: /q/
Group 2: image.gif
Group 3:  height='1' width='5'

The expression inside the second grouping is .*?. The . character is a wildcard, which means that it matches any character (except, by default, a newline). The trailing *? indicates that the wildcard can match zero or more characters for that particular grouping and that, when more than one match is possible, Perl should match as few characters as possible.

If the ? had been omitted, the regex .* would have been 'greedy' and matched the wildcard character maximally, taking in everything up to the last single quote in the string. Try removing the ? and notice the output you get. The second grouping swallows the height and width attributes, and the third grouping becomes empty, confirming our assertion about the greediness. As a subtlety, realize that the third grouping still matched the empty string, which is all that gets left over after the greedy .* matches. If the overall expression had not matched, the else statement would have executed. Finally, the first grouping contains the pattern .*\/.
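The difference is easiest to see with the two patterns side by side. This is only a sketch: the helper name groups is hypothetical, and it uses Perl's qr// operator, which is not covered in this article, to pass each pattern as a value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply a pattern and return the captured groups as a list.
sub groups {
    my ($text, $regex) = @_;
    return $text =~ $regex;   # in list context, a successful match returns ($1, $2, ...)
}

my $tag = "<img src='/q/image.gif' height='1' width='5' />";

my @lazy   = groups($tag, qr/<img src='(.*\/)(.*?)'(.*) \/>/);  # second group is non-greedy
my @greedy = groups($tag, qr/<img src='(.*\/)(.*)'(.*) \/>/);   # second group is greedy

print "Non-greedy group 2: [$lazy[1]] group 3: [$lazy[2]]\n";
print "Greedy     group 2: [$greedy[1]] group 3: [$greedy[2]]\n";
```

In the greedy case the second group runs up to the last single quote in the string, so the third group is left with the empty string even though the overall match still succeeds.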

The only new aspect here is that \/ designates a single literal forward slash. The backslash before the forward slash is used to escape it. Otherwise, Perl treats the forward slash as the closing delimiter of the m// operator, resulting in an error. See the man pages for examples of other characters that must be escaped. An editor like Vim can save your eyes a lot of strain by producing nice syntax highlighting for Perl and regexes.

The substitution operator, s///, works very similarly to the match operator. The difference is that in addition to specifying what value to check for a match, you also specify the value to replace the matching text. Evolving the previous code...

$_ = 
"<img src='/image.gif' height='1' width='5' />";

if (m/<img src='(.*\/)(.*?)'(.*) \/>/) {
	my $filePath = $1; #path, ending in a forward slash
	my $fileName = $2; #file name without the path
	s/<img src='(.*\/)(.*?)'(.*) \/>/<img src='$fileName'$3 \/>/;
	print "$_\n";
} else {
	print "It doesn't match\n";
}

Here, we explicitly rename the $1 and $2 values for clarity. The light bulb should be getting brighter if you've looked at the XHTML specification for the image tag. The first grouping represents a path, which ends with a forward slash. Everything after the forward slash but before the closing single quote is the file name.

The third grouping represents values such as the height and width attributes. If it is determined that $_ contains a string value matching the pattern specified in the m// operator, then we remove the path and leave only the filename in the single quotes with the substitution operator.
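The same substitution can be wrapped in a small subroutine so that it is easy to apply to one line at a time. This is a sketch under assumptions: the name strip_img_path is hypothetical, and it presumes at most one image tag per line, written with single quotes exactly as in the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $line with the path removed from the
# img src value, or the line unchanged if the pattern does not match.
sub strip_img_path {
    my ($line) = @_;
    # $2 is the file name and $3 holds the remaining attributes.
    $line =~ s/<img src='(.*\/)(.*?)'(.*) \/>/<img src='$2'$3 \/>/;
    return $line;
}

print strip_img_path("<img src='/q/image.gif' height='1' width='5' />"), "\n";
print strip_img_path("<p>No image here</p>"), "\n";
```

The first call prints the tag with src='image.gif' and the attributes intact; the second prints its input unchanged, because s/// leaves the string alone when nothing matches.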
