Build Your Own Blogging Application, Part 2
by Matthew Russell
11/12/2004
In part one, we worked through building a front end for our blogging app using Tcl/Tk and some XHTML fundamentals. In this continuation, I'll use two parts Perl and a sprinkle of Bash to show you how to build the back end. At the end of this lesson, you'll be blogging like a champion and have several new tools at your disposal.
Meet Perl
Perl isn't your grandmother's timid friend you met at her last birthday party.
Perl is sassy, powerful, and prefers things to be short, sweet, and to the
point. More accurately, Perl is the Practical Extraction and Report Language.
Larry Wall designed Perl in the mid 1980s when awk (type man awk for a description)
became inadequate for a project he was working on at the time. You may not
have realized it, but you already have a plethora of Perl documentation at
your fingertips. In a terminal, type man perl for an overview of Perl or man perlintro
for a nice introduction to programming in Perl. I won't summarize
an entire introduction to Perl here, but I will highlight some of the particulars
relevant to our discussion.
The canonical Hello World! program is a one-liner in Perl. Notice
that the print statement takes an argument in double quotes and that statements
end in a semicolon. The \n is the newline character that tells Perl to break
the line after displaying the string. Single quotes also work for plain text,
but Perl does not interpret escapes such as \n inside them.
#!/usr/bin/perl
print "Hello World\n";
The first line is a special directive that tells the system to execute the
program with Perl. In order for the file to be executable, however, permissions
must be set correctly. Save the program as helloWorld.pl and, in your terminal,
while in the same directory, type chmod u+x helloWorld.pl. The command chmod u+x
makes the file executable by its owner, so you can then run it by typing
./helloWorld.pl.
Type man chmod in the terminal for more info on file permissions. Depending
on your terminal settings, executable files may also appear red in directory
listings, which is a helpful setting. To see colorized directory listings, a
great convenience, type ls -G whenever you want a directory listing. As with
almost any command in the shell, man ls will give you more of the details.
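Returning briefly to the quoting remark from the Hello World example, the difference between the two quote styles is easiest to see side by side. The following is a minimal sketch; the variable names are illustrative and not part of the blogging app.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Double quotes interpolate variables and escapes; single quotes do not.
my $name = "World";

my $double = "Hello $name\n";   # variable and newline are both expanded
my $single = 'Hello $name\n';   # every character is kept literally

print $double;
print $single, "\n";

# length() makes the difference visible: "\n" is one character, '\n' is two.
print length("\n"), " vs ", length('\n'), "\n";
```

The first print displays Hello World, the second displays the literal text Hello $name\n, and the last line prints 1 vs 2.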
Regular Expressions
An essential concept you'll need in order to exploit Perl's potential is that
of a regular expression, commonly referred to as a 'regex.' Regexes describe
patterns, and learning just a few rudiments will give you all the power you
need to get up to speed. I'll cover some of the basics here, but you'll want
to reference the tutorial obtained by typing man perlretut in your terminal
if you need extra help.
Given some text, a regex determines if it matches a particular pattern designated
by characters and special metacharacters. The basic operator for performing
pattern matching is m//. Often, the m, which stands for 'match', is dropped,
and the operator simply becomes //. The designated pattern to match the string
value against is placed between the pair of forward slashes.
Given that we know a regex consisting of a single word matches any string
that contains the word, we can deduce that the following statement from the
Perl Regex Tutorial will display 'It matches'.
#!/usr/bin/perl
if ("Hello World" =~ m/World/) {
    print "It matches\n";
}
else {
    print "It doesn't match\n";
}
There are a couple of important details to notice here. One is that 'World' is in fact a regular expression designated by the concatenation of the five characters that compose it. Whitespace is treated like any other character, so the pattern 'W o r l d' would not have matched. Values placed after one another are assumed to be concatenated in regexes unless special regex operators dictate otherwise.
Another important detail is that the binding operator =~ associates the regex
on its right with the string value on its left and returns a boolean value
based on whether or not the string matches the pattern. Finally, the syntax
for an if statement is like some you've probably already seen, and Perl uses
curly braces for groupings.
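To make these points concrete, here is a small sketch that tries several patterns against the same string. The check helper and the list of patterns are hypothetical choices for illustration; only the first pattern comes from the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report whether a pattern matches a string, using the binding operator.
sub check {
    my ($text, $pattern) = @_;
    # The leading m is optional when the delimiters are forward slashes.
    return $text =~ /$pattern/ ? "matches" : "doesn't match";
}

# Illustrative patterns: exact word, spaced-out word, wrong case, and a
# pattern that spans the space between the two words.
for my $pattern ('World', 'W o r l d', 'world', 'o W') {
    print "$pattern => ", check("Hello World", $pattern), "\n";
}
```

Only World and o W match. The spaced-out pattern fails because whitespace counts as a character, and world fails because matching is case-sensitive by default.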
A slightly more complex example applicable to our blogging app is contained in the following lines of code:
#!/usr/bin/perl
$_ =
    "<img src='/q/image.gif' height='1' width='5' />";
if (m/<img src='(.*\/)(.*?)'(.*) \/>/) {
    print "Group 1: $1\n"; #path
    print "Group 2: $2\n"; #filename
    print "Group 3: $3\n"; #attributes
}
else {
    print "It doesn't match\n";
}
You'll be pleased to know that this example is actually very simple. A special
variable in Perl, the $_ variable (scalar variables in Perl have a $ prefix), allows
us to abbreviate this conditional statement by not explicitly using the binding
operator.
In this example, the regex between the forward slashes of the m// operator
gets compared with the string value contained in the variable $_ since no alternative
is given by the binding operator. The terse convention involving the $_ variable
is especially convenient when parsing a file because lines of text read from
a file in a while loop are automatically loaded into the $_ variable as well.
Finally, note that the assignment statement is broken across a line, which is
fine to do in Perl.
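The file-parsing convenience might look like the following sketch. To keep it self-contained, it opens a filehandle on a string instead of a file on disk; the sample markup and the variable names are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a file on disk: Perl can open a filehandle on a string.
my $page = "<p>First post</p>\n<img src='/q/image.gif' />\n<p>Bye</p>\n";
open(my $fh, '<', \$page) or die "Cannot open in-memory file: $!";

my $images = 0;
while (<$fh>) {          # each line read lands in $_
    chomp;               # chomp also defaults to $_
    if (/<img /) {       # and so does the match operator
        $images++;
        print "Found an image tag: $_\n";
    }
}
close($fh);

print "Image tags found: $images\n";
```

Nothing in the loop body names a variable explicitly, yet every operation works on the current line.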
Decomposing the regular expression teaches us several new concepts
about regexes: character groupings, regex metacharacters, escape characters,
and greediness (a technical term). The regex given contains three character
groupings, each defined by a pair of parentheses. As you might have inferred from
the code, when the overall expression matches against the value of the $_
variable, Perl loads the special variables $1..$N, where N is the number of
groupings, with the text matched by each parenthetical subexpression, and we can
then use them for whatever purpose we like. In this example, you'll get the following
output from the script if you run it:
Group 1: /q/
Group 2: image.gif
Group 3:  height='1' width='5'
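As an aside, the numbered variables are not the only way to get at the groupings. A match evaluated in list context returns the captured text directly, which the following sketch uses to name all three pieces in one step. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $tag = "<img src='/q/image.gif' height='1' width='5' />";

# In list context a successful match returns the captured groups,
# so they can be named at once instead of copying $1, $2, and $3.
my ($path, $file, $attributes) = $tag =~ m/<img src='(.*\/)(.*?)'(.*) \/>/;

if (defined $path) {
    print "Path: $path\n";
    print "File: $file\n";
    print "Attributes:$attributes\n";
}
else {
    print "It doesn't match\n";
}
```

If the pattern fails, the match returns an empty list and all three variables stay undefined, which is what the defined test checks.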
The expression inside the second grouping is .*?. The . character is
a wildcard, which means that it matches any character except a newline. The
trailing *? indicates that zero or more characters can match the wildcard for
that particular grouping, and that if it is possible to match in more than one
way, the match should be as short as possible.
If the ? had been omitted, the regex .* would have been 'greedy' and
matched the wildcard character maximally, consuming everything up to the last
single quote in the tag. Try removing the ? and notice the output you get. The
second grouping swallows the attributes and the third grouping becomes empty,
confirming our assertion about the greediness.
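The experiment can be run as a side-by-side comparison. This is a minimal sketch; the helper groups_two_and_three is a hypothetical name, and the two patterns differ only in the ? of the second grouping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $tag = "<img src='/q/image.gif' height='1' width='5' />";

# Return the second and third groupings for a given compiled pattern,
# or an empty list if the pattern does not match.
sub groups_two_and_three {
    my ($string, $regex) = @_;
    return $string =~ $regex ? ($2, $3) : ();
}

my @lazy   = groups_two_and_three($tag, qr/<img src='(.*\/)(.*?)'(.*) \/>/);
my @greedy = groups_two_and_three($tag, qr/<img src='(.*\/)(.*)'(.*) \/>/);

print "Lazy   group 2: [$lazy[0]] group 3: [$lazy[1]]\n";
print "Greedy group 2: [$greedy[0]] group 3: [$greedy[1]]\n";
```

The brackets in the output make the empty third grouping of the greedy version visible.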
As a subtlety, realize that the third grouping still matched the empty string,
which is all that gets left over after .* matches. If it had not matched,
the else statement would have executed. Finally, the first grouping contains
the pattern .*\/.
The only new aspect here is that \/ designates a single literal forward slash.
The backslash before the forward slash is used to escape the forward slash.
Otherwise, Perl treats the forward slash as the closing delimiter of the m//
operator, resulting in an error. See the man pages (man perlre, for example) for
other characters that must be
escaped. An editor like Vim can save your eyes a lot of strain by producing
nice syntax highlighting for Perl and regexes.
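One way to cut down on escaping is worth a quick sketch: when the m is written explicitly, Perl accepts other delimiters, such as a pair of braces, and forward slashes inside the pattern then need no backslash. The variable names below are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $tag = "<img src='/q/image.gif' height='1' width='5' />";

# The same pattern twice: once with slash delimiters and escaped slashes,
# once with brace delimiters where the slashes are written plainly.
my $with_slashes = $tag =~ m/<img src='(.*\/)(.*?)'(.*) \/>/ ? $2 : undef;
my $with_braces  = $tag =~ m{<img src='(.*/)(.*?)'(.*) />}   ? $2 : undef;

print "Slash delimiters: $with_slashes\n";
print "Brace delimiters: $with_braces\n";
```

Both lines report image.gif. Which form reads better is a matter of taste, but patterns that contain many slashes, such as paths and closing tags, tend to be easier on the eyes with alternative delimiters.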
The substitution operator, s///, works very similarly
to the match operator. The difference is that in addition to specifying what
value to check for a match, you also specify the value to replace the matching
text. Evolving the previous code...
#!/usr/bin/perl
$_ =
    "<img src='/image.gif' height='1' width='5' />";
if (m/<img src='(.*\/)(.*?)'(.*) \/>/) {
    $filePath = $1;
    $fileName = $2;
    s/<img src='(.*\/)(.*?)'(.*) \/>/<img src='$fileName'$3 \/>/;
    print "$_\n";
}
else {
    print "It doesn't match\n";
}
Here, we explicitly copy the $1 and $2 values into named variables for clarity. The light bulb
should be getting brighter if you've looked at the XHTML specification for
the image tag. The first grouping represents a path, which ends with a forward
slash. Everything after the forward slash but before the closing single quote
is the file name.
The third grouping represents values such as the height and width attributes.
If it is determined that $_ contains a string value matching the pattern specified
in the m// operator, then we remove the path and leave only the filename in
the single quotes with the substitution operator.
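The same idea can be wrapped in a small reusable routine. The sketch below is one possible shape, not the code the blogging app ends up using; strip_image_path is a hypothetical name, and the pattern is written with brace delimiters so the slashes need no escaping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strip the directory part from an image tag's src.
sub strip_image_path {
    my ($tag) = @_;
    # s/// only changes the string when the pattern matches, so no
    # separate m// test is needed; other text is returned unchanged.
    $tag =~ s{<img src='(.*/)(.*?)'(.*) />}{<img src='$2'$3 />};
    return $tag;
}

print strip_image_path("<img src='/q/image.gif' height='1' width='5' />"), "\n";
print strip_image_path("<p>No image here</p>"), "\n";
```

The first call prints the tag with src='image.gif', and the second prints its input untouched. Because the first grouping is greedy, a tag whose later attributes also contain a forward slash would confuse this pattern, so treat it as a starting point for well-behaved input.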

