In part one, we worked through building a front end for our blogging app using Tcl/Tk and some XHTML fundamentals. In this continuation, I'll use two parts Perl and a sprinkle of Bash to show you how to build the back end. At the end of this lesson, you'll be blogging like a champion and have several new tools at your disposal.
Perl isn't your grandmother's timid friend you met at her last birthday party.
Perl is sassy, powerful, and prefers things to be short, sweet, and to the
point. More accurately, Perl is the Practical Extraction and Report Language.
Larry Wall designed Perl in the mid 1980s when awk (type man awk for a description)
became inadequate for a project he was working on at the time. You may not
have realized it, but you already have a plethora of Perl documentation at
your fingertips. In a terminal, type man perl for an overview of Perl or 'man
perlintro' for a nice introduction to programming in Perl. I won't summarize
an entire introduction to Perl here, but I will highlight some of the particulars
relevant to our discussion.
The canonical Hello World! program is a one-liner in Perl. Notice
that the print statement takes an argument in double quotes and that
statements end in a semicolon. The \n is the newline
character that tells Perl to break the line after displaying the string.
Single quotes also delimit strings, but inside single quotes the \n is
printed literally as a backslash followed by an n, not as a newline.
#!/usr/bin/perl
print "Hello World\n";
The first line is a special directive that tells the shell to execute the
program with Perl. In order for the file to be executable, however, permissions
must be set correctly. In your terminal, while in the same directory as helloWorld.pl,
type chmod u+x helloWorld.pl. The command chmod u+x changes the file permissions
to make it executable.
Type man chmod in the terminal for more info on file permissions. Depending
on your terminal settings, executable files may also appear red in directory
listings, which is a helpful cue. To see colorized directory listings--a great
convenience--type ls -G whenever you want a directory listing. As with almost
any command in the shell, man ls will give you more of the details.
An essential concept you'll need in order to exploit Perl's potential is that
of a regular expression, commonly referred to as a 'regex.' Regexes describe
patterns, and learning just a few rudiments will give you all the power you
need to get up to speed. I'll cover some of the basics here, but you'll want
to reference the tutorial obtained by typing man perlretut in your terminal
if you need extra help.
Given some text, a regex determines if it matches a particular pattern designated
by characters and special metacharacters. The basic operator for performing
pattern matching is m//. Often, the m, which stands for 'match', is dropped,
and the operator simply becomes //. The designated pattern to match the string
value against is placed between the pair of forward slashes.
Given that we know a regex consisting of a single word matches any string
that contains the word, we can deduce that the following statement from the
Perl Regex Tutorial will display 'It matches'.
#!/usr/bin/perl
if ("Hello World" =~ m/World/) {
    print "It matches\n";
}
else {
    print "It doesn't match\n";
}
There are a couple of important details to notice here. One is that 'World' is in fact a regular expression designated by the concatenation of the five characters that compose it. Whitespace is treated like any other character, so the pattern 'W o r l d' would not have matched. Values placed after one another are assumed to be concatenated in regexes unless special regex operators dictate otherwise.
Another important detail is that the binding operator =~ associates the regex
on its right with the string value on its left and returns a boolean value
based on whether or not the string matches the pattern. Finally, the syntax
for an if statement is like some you've probably already seen, and Perl uses
curly braces for groupings.
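Both details are easy to check for yourself. The following is a minimal sketch, not part of the blogging app; the helper name matches is hypothetical and exists only to turn the result of the binding operator into a 1 or a 0.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the article's scripts): returns
# 1 if $text matches the regex $pattern, 0 otherwise.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ m/$pattern/ ? 1 : 0;
}

print matches("Hello World", "World"),     "\n";  # prints 1
print matches("Hello World", "W o r l d"), "\n";  # prints 0: the spaces must match literally
print matches("Hello World", "o W"),       "\n";  # prints 1: "o W" spans the space between the words
```

The last call shows that whitespace in a pattern is not ignored: it matches only because the string really does contain an o, a space, and a W in that order.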
A slightly more complex example applicable to our blogging app is contained in the following lines of code:
#!/usr/bin/perl
$_ =
    "<img src='/q/image.gif' height='1' width='5' />";
if (m/<img src='(.*\/)(.*?)'(.*) \/>/) {
    print "Group 1: $1\n"; #path
    print "Group 2: $2\n"; #filename
    print "Group 3: $3\n"; #attributes
}
else {
    print "It doesn't match\n";
}
You'll be pleased to know that this example is actually very simple. A special
variable in Perl, the $_ variable (scalar variables in Perl carry a $ prefix),
allows us to abbreviate this conditional statement by not explicitly using the
binding operator.
In this example, the regex between the forward slashes of the m// operator
gets compared with the string value contained in the variable $_ since no alternative
is given by the binding operator. The terse convention involving the $_ variable
is especially convenient when parsing a file because lines of text read from
a file are automatically loaded into the $_ variable as well. Finally, note
that the assignment statement is broken across a line, which is fine to do
in Perl.
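Here is a small sketch of that file-reading convention. To keep it self-contained, it reads from an in-memory string instead of a real file; the sample text and the helper name count_img_lines are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text standing in for a file on disk (an assumption for this sketch).
my $text = "first line\n<img src='/q/image.gif' />\nlast line\n";

# Count the lines containing an <img> tag. Each pass through the loop
# loads the next line into $_, and the bare m// tests $_ implicitly.
sub count_img_lines {
    my ($content) = @_;
    open(my $fh, '<', \$content) or die "Cannot open in-memory file: $!";
    my $count = 0;
    local $_;    # keep the caller's $_ untouched
    while (<$fh>) {
        $count++ if m/<img /;
    }
    close($fh);
    return $count;
}

print count_img_lines($text), "\n";  # prints 1
```

Notice that neither the loop nor the match ever names a variable; both fall back on $_.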
Decomposing the regular expression teaches us several new concepts
about regexes: character groupings, regex metacharacters, escape characters,
and greediness (a technical term). In the regex given, there are three character
groupings defined by the parenthetical subexpressions if the overall expression
matches against the value of the $_ variable. As you might have inferred from
the code, Perl loads the special variables $1..$N, where N is the number of
groupings, with the values of the parenthetical expressions, and we can then
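use them for whatever purpose we like. In this example, you'll get the following
output from the script if you run it (the first grouping keeps its trailing
slash, and the third grouping keeps the space that precedes the attributes):
Group 1: /q/
Group 2: image.gif
Group 3:  height='1' width='5'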
The expression inside the second grouping is .*?. The . character is
a wildcard, which means that it matches any character except a newline. The
trailing *? indicates that zero or more characters can match the wildcard for
that particular grouping, and if it is possible to match in more than one way,
to match as few characters as possible.
If the ? had been omitted, the regex .* would have been 'greedy' and
matched as many characters as possible, so the second grouping would run all
the way to the last single quote in the tag instead of stopping at the first
one. Try removing the ? and notice the output you get. The third grouping becomes empty, confirming our assertion about the greediness.
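If you would like to compare the two behaviors side by side, the following sketch runs both versions of the pattern against the same tag. The helper name groups_2_and_3 is hypothetical; it simply returns the second and third groupings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $tag = "<img src='/q/image.gif' height='1' width='5' />";

# Return the second and third groupings for a given pattern.
sub groups_2_and_3 {
    my ($text, $regex) = @_;
    my @groups = $text =~ $regex;   # list context returns the groupings
    return @groups[1, 2];
}

my @lazy   = groups_2_and_3($tag, qr/<img src='(.*\/)(.*?)'(.*) \/>/);
my @greedy = groups_2_and_3($tag, qr/<img src='(.*\/)(.*)'(.*) \/>/);

print "lazy:   [$lazy[0]] [$lazy[1]]\n";
print "greedy: [$greedy[0]] [$greedy[1]]\n";
```

With the ?, the second grouping is just image.gif. Without it, the second grouping swallows the attributes as well and the third grouping is left with the empty string.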
As a subtlety, realize that the third grouping still matched the empty string,
which is all that gets left over after .* matches. If it had not matched,
the else statement would have executed. Finally, the first grouping contains
the expression .*\/.
The only new aspect here is that \/ designates a single literal forward slash.
The backslash before the forward slash is used to escape the forward slash.
Otherwise, Perl treats the forward slash as the closing delimiter of the m//
operator, which ends the pattern early and, in this case, results
in an error. See the man pages for examples of other characters that must be
escaped. An editor like Vim can save your eyes a lot of strain by producing
nice syntax highlighting for Perl and regexes.
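Perl also lets you pick a different delimiter for the match operator, which sidesteps the escaping altogether. The sketch below is an alternative spelling of the same pattern and is not what the scripts in this article use; the helper name split_img_tag is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $tag = "<img src='/q/image.gif' height='1' width='5' />";

# Same pattern as before, delimited with braces so "/" needs no backslash.
# In list context the match returns the groupings directly.
sub split_img_tag {
    my ($text) = @_;
    return $text =~ m{<img src='(.*/)(.*?)'(.*) />};
}

my ($path, $file, $attributes) = split_img_tag($tag);
print "path: $path\n";   # prints path: /q/
print "file: $file\n";   # prints file: image.gif
```

When a match is written with braces, the m can no longer be dropped, since the bare // shorthand only works with forward slashes.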
The substitution operator, s///, works very similarly
to the match operator. The difference is that in addition to specifying what
value to check for a match, you also specify the value to replace the matching
text. Evolving the previous code...
#!/usr/bin/perl
$_ =
    "<img src='/image.gif' height='1' width='5' />";
if (m/<img src='(.*\/)(.*?)'(.*) \/>/) {
    $filePath = $1;
    $fileName = $2;
    s/<img src='(.*\/)(.*?)'(.*) \/>/<img src='$fileName'$3 \/>/;
    print "$_\n";
}
else {
    print "It doesn't match\n";
}
Here, we explicitly rename the $1 and $2 values for clarity. The light bulb
should be getting brighter if you've looked at the XHTML specification for
the image tag. The first grouping represents a path, which ends with a forward
slash. Everything after the forward slash but before the closing single quote
is the file name.
The third grouping represents values such as the height and width attributes.
If it is determined that $_ contains a string value matching the pattern specified
in the m// operator, then we remove the path and leave only the filename in
the single quotes with the substitution operator.
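Two more details about s/// are worth a quick sketch: a trailing g modifier replaces every match instead of only the first, and the operator returns the number of replacements it made. The helper name escape_spaces and the sample filename are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: backslash-escape every space in a name and
# report how many replacements were made.
sub escape_spaces {
    my ($name) = @_;
    my $count = ($name =~ s/ /\\ /g);   # s/// returns the number of substitutions
    return ($name, $count || 0);        # a failed s/// returns an empty string
}

my ($escaped, $count) = escape_spaces("my vacation photo.gif");
print "$escaped ($count replaced)\n";   # prints my\ vacation\ photo.gif (2 replaced)
```

Without the g, only the first space would have been escaped.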
Remember that note we made about being able to rename our image files (specified with an absolute path) relative to their location on the iDisk? This is it! Below is the complete script. I've added statements to escape spaces in the path and filename, so that the shell treats each one as a single argument, and to copy the image files to a location designated by the last argument given to the script.
The first argument is used by the <> (diamond) operator.
The diamond operator reads lines from the filenames contained in the command
line arguments array with the exception of the last one, which we removed with
the pop operator. In this case, we only need to specify two arguments: the
filename for the diamond operator to process and the location to copy the updated
file. The first argument is the file newentry referenced in the editor.tcl
script, and the last argument is the directory to copy the modified file. In
our case, this directory is our iDisk directory, which we can access as a local
mount point (more on that in just a moment).
#!/usr/bin/perl
#usage: ./processImages.pl inputFile copyLocation
#pop the copy location off of @ARGV
#before the diamond operator gets it
$copyLocation = pop(@ARGV);
#edit file in place and back up the original
local $^I = ".bak";
while (<>) {
    if (m/<img src='(.*\/)(.*?)'(.*) \/>/) {
        $filePath = $1;
        $fileName = $2;
        #rewrite the tag so it refers to the bare filename
        s/<img src='(.*\/)(.*?)'(.*) \/>/<img src='$fileName'$3 \/>/;
        #escape spaces so the shell sees each name as one argument
        $filePath =~ s/ /\\ /g;
        $fileName =~ s/ /\\ /g;
        system "cp $filePath$fileName $copyLocation";
    }
    print "$_"; #write lines back to the file
}
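The script hands the copy off to the shell's cp command, which is why the spaces need escaping. As an alternative sketch, and not what the script above does, the copy can be done from within Perl using the standard File::Copy module, so no shell quoting is involved. The helper name copy_image is hypothetical, and the temporary directories only stand in for a real image folder and the iDisk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: copy $fileName from $filePath into $copyLocation
# without going through the shell, so spaces need no escaping.
sub copy_image {
    my ($filePath, $fileName, $copyLocation) = @_;
    my $source = File::Spec->catfile($filePath, $fileName);
    my $target = File::Spec->catfile($copyLocation, $fileName);
    copy($source, $copyLocation)
        or warn "Could not copy $source to $copyLocation: $!\n";
    return (-e $target) ? 1 : 0;
}

# Demonstration with throwaway directories (assumptions for this sketch).
my $from = tempdir(CLEANUP => 1);
my $to   = tempdir(CLEANUP => 1);
open(my $fh, '>', File::Spec->catfile($from, "my image.gif"))
    or die "Cannot create sample file: $!";
print {$fh} "GIF89a";
close($fh);

print copy_image($from, "my image.gif", $to), "\n";  # prints 1
```

When the second argument to copy is an existing directory, the file is copied into it under the same name, which mirrors how cp behaves in the script.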
One last thing to do is to design a convention for organizing the multiple entries that will be written to the blog. A simple approach is to insert the newest entry at the top of the blog with each post. To do this, we need only define the boundaries of the blog's content and how to separate the entries. An elaborate implementation in XML with XSLT is the industrial-strength solution, but a much simpler solution, which serves our purpose, is available by using XHTML comment tags to designate these boundaries.
Let's agree on the following convention for our simple XHTML-based blog entries.
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head><title>Sample Blog Format</title></head>
<body>
<!--STARTBLOG-->
<!--ENDBLOG-->
</body>
</html>
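To see why a single marker is enough to keep the newest entry on top, consider this minimal sketch. It works on a string in memory, not on a file, and the helper name insert_entry and the sample entries are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: place $entry directly after the <!--STARTBLOG-->
# marker so the newest entry always lands at the top of the blog.
sub insert_entry {
    my ($page, $entry) = @_;
    my $startTag = "<!--STARTBLOG-->";
    # \Q...\E makes Perl treat the marker as literal text
    $page =~ s/\Q$startTag\E/$startTag\n$entry/;
    return $page;
}

my $page = "<body>\n<!--STARTBLOG-->\n<!--ENDBLOG-->\n</body>\n";
$page = insert_entry($page, "<p>First post</p>");
$page = insert_entry($page, "<p>Second post</p>");
print $page;
```

After the two calls, the second post sits immediately below the start marker and the first post follows it, so entries appear in reverse chronological order.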
The most relevant part to our application is the text enclosed between the body tags of the page. You can feel free to use whatever XHTML features or styles you'd like to spice other things up. HTML features will work as well, but based on our previous discussion, XHTML is the wave of the future, and it's really worth it to go the extra mile. No matter which decision you make, be aware of the decision's impact on the longevity and display of your site.
If you log into your .Mac account and go to your homepage section, you're presented with a lot of nice layouts for your page. If you click on the advanced tab, you're able to select an external HTML page like the very simple (and boring) one presented above. All that is required is that you copy the page to the Sites folder of your iDisk, which you can navigate to using Finder or in the terminal at /Volumes/<.Mac username>/Sites. A much more interesting option, however, is to choose your favorite layout already designed by the .Mac gurus, salvage out the parts like the 'Send me a message button' and the counter, and insert the lines
<!--STARTBLOG--> <!--ENDBLOG-->
just inside the body tags so that the blogging app will know where to post entries. Note that depending on how much of the page you do or don't salvage, your Document Type Definition may change from XHTML document version to some version of HTML. For example, if you view the source of the standard pages included with your .Mac membership, you'll notice the following line at the top of the page:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
This is obviously an HTML 3.2 compliant document, which is different from the tag at the top of the sample file that specified the XHTML 1.0 Transitional Document Type. More specifics about the XHTML document types can be found at W3Schools.
Since the Perl code for this task is much more intuitive than the previous Tcl/Tk code we developed, I've included inline comments in lieu of a separate discussion. Examine this file and think about how it fits into the project. Remember that the Perl man pages are always there if you need them.
#!/usr/bin/perl
#The blog entry must be contained in a file
#entitled "newentry"
#-------------------------------------------------
#some special values used in writing the entry
my $startTag ="<!--STARTBLOG-->";
my $endTag ="<!--ENDBLOG-->";
my $startEntry ="<!--STARTENTRY-->";
my $endEntry ="<!--ENDENTRY-->";
#----------------------------------------------
#get a timestamp of the form MMDDYYYYHHMMSS
#----------------------------------------------
sub getTimeStamp {
    #get all the values for the current time
    my ($Second, $Minute, $Hour, $Day, $Month,
        $Year, $WeekDay, $DayOfYear, $IsDST) =
        localtime(time);
    #months start at 0, so increment by 1
    $Month += 1;
    #pad with a 0 if necessary
    if ($Month < 10) {$Month = "0" . $Month;}
    #don't need to increment day but must pad it
    if ($Day < 10) {$Day = "0" . $Day;}
    #do the same for the hour, minute, and second
    if ($Hour < 10) {$Hour = "0" . $Hour;}
    if ($Minute < 10) {$Minute = "0" . $Minute;}
    if ($Second < 10) {$Second = "0" . $Second;}
    #add 1900 to the year
    $Year += 1900;
    #return value is the last value computed
    #in a Perl subroutine
    my $returnValue = $Month . $Day . $Year .
        $Hour . $Minute . $Second;
}
#----------------------------------------------
#get a timestamp of the form MM-DD-YYYY HH:MM
#----------------------------------------------
sub getFormattedTimeStamp {
    #get all the values for the current time
    my ($Second, $Minute, $Hour, $Day, $Month,
        $Year, $WeekDay, $DayOfYear, $IsDST) =
        localtime(time);
    #months start at 0, so increment by 1
    $Month += 1;
    #add 1900 to the year
    $Year += 1900;
    #pad each field with a leading 0 where needed so
    #the result really has the form MM-DD-YYYY HH:MM;
    #return value is the last value computed
    my $returnValue = sprintf("%02d-%02d-%04d %02d:%02d",
        $Month, $Day, $Year, $Hour, $Minute);
}
#-------------------------------------------------
#wrap the time stamp between its tags
#-------------------------------------------------
sub wrapTimeStamp {
    #last value computed is the value returned
    my $returnValue = "<!--TIME" . &getTimeStamp .
        "-->";
}
#------------------------------------------------
#open a file handle, get the input and wrap it
#in the appropriate tags
#------------------------------------------------
sub getNewEntry {
    my $returnValue = '';
    #stop with a message if the entry file is missing
    open(NEWENTRY, "<", "newentry")
        or die "Cannot open newentry: $!";
    while (<NEWENTRY>) {
        $returnValue .= $_;
    }
    close NEWENTRY;
    #last calc performed is automatically the
    #return value, but first replace any \n
    #with <br />\n
    $returnValue =~ s/\n/<br \/>\n/g;
    $returnValue =
        "$startEntry\n" . &wrapTimeStamp .
        "\n" . &getBlogHeader .
        "\n<p>$returnValue</p>\n" .
        "$endEntry\n";
}
#-------------------------------------------------
#a blog header that separates entries
#-------------------------------------------------
sub getBlogHeader {
    #XHTML requires <hr /> to be self-closing and
    #does not allow it inside a paragraph
    my $returnValue = "<hr />\n<p><strong>" .
        &getFormattedTimeStamp . "</strong></p>";
}
#-------------------------------------------------
#read through the file line by line.
#when the special tag $startTag is found,
#insert the new entry padded with appropriate
#tags. Then continue processing the rest of the
#file
#-------------------------------------------------
local $^I = '.bak'; #edit file in place and back up
while (<>) {
    print "$_";
    #\Q...\E treats the tag as literal text
    if (/\Q$startTag\E/) {
        print "\n" . &getNewEntry . "\n";
    }
}
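The padding logic in the script spells out each step, which is useful for learning. As a more compact sketch, and not the approach the script takes, the standard POSIX module provides strftime, which formats and pads the fields in one call. The Compact suffix on the names marks these as hypothetical variants; the optional arguments exist only so a fixed time can be passed in.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Timestamp of the form MMDDYYYYHHMMSS. Accepts an optional list in
# localtime order (sec, min, hour, mday, mon, year); defaults to now.
sub getTimeStampCompact {
    my @now = @_ ? @_ : localtime(time);
    return strftime("%m%d%Y%H%M%S", @now);
}

# Timestamp of the form MM-DD-YYYY HH:MM.
sub getFormattedTimeStampCompact {
    my @now = @_ ? @_ : localtime(time);
    return strftime("%m-%d-%Y %H:%M", @now);
}

print getTimeStampCompact(), "\n";
print getFormattedTimeStampCompact(), "\n";
```

Because strftime takes the raw values from localtime, there is no need to add 1 to the month or 1900 to the year by hand.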
Without further ado, here is the sprinkle of Bash, script updateBlog.sh,
that was promised in the last article, the last piece of information you
need to make your blogging app functional. It's intended to be an executable
Bash script, so you'll need to type chmod u+x updateBlog.sh in your terminal
while in the same directory as this file. Repeat those steps for processImages.pl
and postEntry.pl. The script updateBlog.sh simply acts as glue for the two
Perl scripts by executing them and passing arguments into them.
To put it all together one last time, do the following things to post your first entry to your blogging app:
Run wish editor.tcl, making sure your paths are set up to point to the version of the Wish shell you've chosen.
Wow! You've learned a lot and now have a blogging application you can use to muse poetry, found your own technology forum, or whisper your feelings to the world. Using your newly honed skills and the building blocks of this project, you are empowered to make your application as elaborate as you like. Add in some JavaScript to produce a calendar to navigate your blog or go out and learn about Cascading Style Sheets in order to produce a higher quality display for your readers. You could also use your new friend Perl to organize blog entries into weekly sections and to introduce a navigation bar that gives readers an alternative to scrolling down one really long page.
Matthew Russell is a computer scientist from middle Tennessee and serves Digital Reasoning Systems as the Director of Advanced Technology. Hacking and writing are two activities essential to his renaissance man regimen.
Copyright © 2009 O'Reilly Media, Inc.