Working on a paper, book, or thesis and need a nerdy
definition of one word, and alternatives to another?
You're writing a paper and getting sick of constantly looking up words
in your dictionary and thesaurus. As most of the hacks in this book have done,
you can scratch your itch with a little bit of Perl. This script uses the dict protocol (http://www.dict.org/) and Thesaurus.com (http://www.thesaurus.com/) to find all you
need to know about a word.
By using the dict protocol, DICT.org and several other
dictionary sites make our task easier, since we do not need to filter through
HTML code to get what we are looking for. A quick look through CPAN (http://www.cpan.org/) reveals that the dict protocol has already been implemented
as a Perl module (http://search.cpan.org/author/NEILB/Net-Dict/lib/Net/Dict.pod).
Reading through the documentation, you will find it is well-written and easy to
implement; with just a few lines, you have more definitions than you can shake a
stick at. Next problem.
Unfortunately, the thesaurus part of our program will not be as simple.
However, there is a great online thesaurus (http://www.thesaurus.com/) that we will use
to get the information we need. The main page of the site offers a form to look
up a word, and the results take us to exactly what we want. A quick look at the
URL shows this will be an easy hurdle to overcome: using LWP, we can grab the page we want and need to worry only
about parsing through it.
Since some words have multiple forms (noun, verb, etc.), there might be more
than one entry for a word; this needs to be kept in mind. Looking at the HTML
source, you can see that each row of the data is on its own line, starting with
some table tags, then the header for the line (Concept,
Function, etc.), followed by the content. The easiest way
to handle this is to go through each section individually, grabbing everything
from Entry to Source, and then parse out what's in between.
Since we want only synonyms for the exact word we searched for,
we will grab only sections in which the entry line contains nothing but
the word we are looking for, wrapped in the highlighting tag used by the
site. Once we have this, we can strip out those highlighting tags and proceed to
finding the synonym and antonym lines, either of which might be missing from a
given section. The easiest thing to do here is to throw all the words into an
array; this makes it easier to sort them, remove duplicates, and display the
results.
In cases in which you
are parsing through long HTML, you might find it easier to put the common HTML
strings in variables and use them in the regular expressions; it makes the code
easier to read. With a long list of all the words, we use the Sort::Array module
to get an alphabetical listing with the duplicates removed.
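The parsing steps just described can be tried offline before pointing the script at the live site. The following is a minimal sketch, not the hack's own listing: the sample markup is a hypothetical reconstruction based on the HTML fragments the script stores in its variables, and the name parse_thesaurus is made up for illustration. It shows one way the exact-word check could be written, along with the section-by-section walk from Entry to Source.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Common HTML fragments, kept in variables so the regexps stay readable.
my $middle_html    = ':</b> </td><td>';
my $end_html       = '</td></tr>';
my $highlight_html = '<b style="background: #ffffaa">';

# Return the synonyms and antonyms found in sections whose Entry line
# holds exactly $word, wrapped in the highlighting tag.
# (parse_thesaurus is a hypothetical helper name.)
sub parse_thesaurus {
    my ($word, $html) = @_;
    my (@synonyms, @antonyms);

    # Each section runs from its "Entry" row to its "Source" row.
    # /s lets "." cross newlines; /g walks through every section.
    while ($html =~ /Entry(.*?)<b>Source:<\/b>/sg) {
        my $section = $1;

        # Skip entries that merely contain the word, such as "hack writer".
        next unless $section =~
            /^\Q$middle_html$highlight_html\E\Q$word\E<\/b>\Q$end_html\E/;

        # Either line may be missing from a given section.
        if ($section =~ /Synonyms\Q$middle_html\E([^<]*)\Q$end_html\E/) {
            push @synonyms, split /,\s*/, $1;
        }
        if ($section =~ /Antonyms\Q$middle_html\E([^<]*)\Q$end_html\E/) {
            push @antonyms, split /,\s*/, $1;
        }
    }
    return (\@synonyms, \@antonyms);
}

# Hypothetical sample markup: one exact entry and one near miss.
my $sample = join "\n",
    '<tr><td><b>Entry:</b> </td><td><b style="background: #ffffaa">hack</b></td></tr>',
    '<tr><td><b>Function:</b> </td><td>verb</td></tr>',
    '<tr><td><b>Synonyms:</b> </td><td>chop, cut, hew</td></tr>',
    '<tr><td><b>Antonyms:</b> </td><td>join, unite</td></tr>',
    '<tr><td><b>Source:</b> </td><td>(sample)</td></tr>',
    '<tr><td><b>Entry:</b> </td><td><b style="background: #ffffaa">hack</b> writer</td></tr>',
    '<tr><td><b>Synonyms:</b> </td><td>scribbler</td></tr>',
    '<tr><td><b>Source:</b> </td><td>(sample)</td></tr>';

my ($syn, $ant) = parse_thesaurus('hack', $sample);
print "Synonyms: ", join(', ', @$syn), "\n";
print "Antonyms: ", join(', ', @$ant), "\n";
```

Wrapping the interpolated fragments in \Q...\E keeps any regex metacharacters in the HTML from being interpreted, which matters once the fragments grow beyond simple tags.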
The Code
Save the following code as dict.pl:
#!/usr/bin/perl -w
#
# Dict - looks up definitions, synonyms and antonyms of words.
# Comments, suggestions, contempt? Email adam@bregenzer.net.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
use strict; $|++;
use LWP;
use Net::Dict;
use Sort::Array "Discard_Duplicates";
use URI::Escape;
my $word = $ARGV[0]; # the word to look-up
die "You didn't pass a word!\n" unless $word;
print "Definitions for word '$word':\n";
# get the dict.org results.
my $dict = Net::Dict->new('dict.org');
my $defs = $dict->define($word);
foreach my $def (@{$defs}) {
    my ($db, $definition) = @{$def};
    print $definition . "\n";
}
# base URL for thesaurus.com requests
# as well as the surrounding HTML of
# the data we want. cleaner regexps.
my $base_url = "http://thesaurus.reference.com/search?q=";
my $middle_html = ":</b> </td><td>";
my $end_html = "</td></tr>";
my $highlight_html = "<b style=\"background: #ffffaa\">";
# grab the thesaurus results.
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');
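my $response = $ua->get($base_url . uri_escape($word));
die "Couldn't fetch thesaurus results: ", $response->status_line, "\n"
    unless $response->is_success;
my $data = $response->content;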
# holders for matches.
my (@synonyms, @antonyms);
# and now loop through them all. the /s modifier lets
# "." match newlines, since each row is on its own line.
while ($data =~ /Entry(.*?)<b>Source:<\/b>(.*)/s) {
    my $match = $1; $data = $2;
    # strip out the bold marks around the matched word.
    $match =~ s/${highlight_html}([^<]+)<\/b>/$1/;
    # push our results into our various arrays. a section
    # can hold both lines, so test for each independently.
    if ($match =~ /Synonyms${middle_html}([^<]*)${end_html}/) {
        push @synonyms, (split /, /, $1);
    }
    if ($match =~ /Antonyms${middle_html}([^<]*)${end_html}/) {
        push @antonyms, (split /, /, $1);
    }
}
# sort them with Sort::Array,
# and return unique matches.
if (@synonyms) {
    @synonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@synonyms,
    );
    print "Synonyms for $word:\n";
    print join(', ', @synonyms), "\n\n";
}
# same thing as above.
if (@antonyms) {
    @antonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@antonyms,
    );
    print "Antonyms for $word:\n";
    print join(', ', @antonyms), "\n";
}
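Sort::Array is a CPAN module rather than part of core Perl. If installing it is inconvenient, the same sorted, duplicate-free list can be produced with a hash and the built-in sort. This is a sketch of an alternative, not what the script above uses; the name unique_sorted is hypothetical, and sorting without regard to case is a choice made here, not necessarily how Sort::Array orders its results.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the distinct, non-empty items of a list in alphabetical order.
# (unique_sorted is a hypothetical helper name.)
sub unique_sorted {
    my %seen;
    return sort { lc($a) cmp lc($b) or $a cmp $b }
           grep { length($_) && !$seen{$_}++ } @_;
}

my @words = ('cut', 'chop', '', 'hew', 'cut', 'carve');
print join(', ', unique_sorted(@words)), "\n";
```

The %seen hash counts each word as it passes through grep, so only the first occurrence survives; empty strings are dropped by the length test.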
Running the Hack
Invoke the script on the command line, passing it one word at a time. As far
as I know, these sites know how to work with English words only. This script has
a tendency to generate a lot of output, so you might want to pipe it to less or redirect it to a file.
Here is an example where I look up the word "hack":
% perl dict.pl "hack"
Definitions for word 'hack':
<snip>
hack
<jargon> 1. Originally, a quick job that produces what is
needed, but not well.
2. An incredibly good, and perhaps very time-consuming, piece
of work that produces exactly what is needed.
<snip>
See also {neat hack}, {real hack}.
[{Jargon File}]
(1996-08-26)
Synonyms for hack:
be at, block out, bother, bug, bum, carve, chip, chisel, chop, cleave,
crack, cut, dissect, dissever, disunite, divide, divorce, dog, drudge,
engrave, etch, exasperate, fashion, form, gall, get, get to, grate, grave,
greasy grind, grind, grub, grubber, grubstreet, hack, hew, hireling, incise,
indent, insculp, irk, irritate, lackey, machine, mercenary, model, mold,
mould, nag, needle, nettle, old pro, open, part, pattern, peeve, pester,
pick on, pierce, pique, plodder, potboiler, pro, provoke, rend, rip, rive,
rough-hew, sculpt, sculpture, separate, servant, sever, shape, slash, slave,
slice, stab, stipple, sunder, tear asunder, tease, tool, trim, vex, whittle,
wig, workhorse
Antonyms for hack:
appease, aristocratic, attach, calm, cultured, gladden, high-class, humor,
join, make happy, meld, mollify, pacify, refined, sophisticated, superior,
unite
Hacking the Hack
There are a few ways you can improve upon this hack.
Using specific dictionaries
You can either use a different dict server or you can
use only certain dictionaries within the dict server. The DICT.org server uses 13
dictionaries; you can limit it to use only the 1913 edition of Webster's Revised Unabridged Dictionary by changing the $dict->define line to:
my $defs = $dict->define($word, 'web1913');
The $dict->dbs method will get you a list of
dictionaries available.
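To see which identifiers (such as web1913) a given server accepts, a short companion script can print the hash that $dict->dbs returns. This is a rough sketch under a few assumptions: the helper names fetch_dbs and format_dbs are made up for illustration, and the server is taken from the command line so you can try a dict server other than dict.org.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ask a dict server which dictionaries it offers.
# (fetch_dbs and format_dbs are hypothetical helper names.)
sub fetch_dbs {
    my ($host) = @_;
    require Net::Dict;    # loaded on demand
    my $dict = Net::Dict->new($host)
        or die "Couldn't connect to $host\n";
    return $dict->dbs;    # hash: identifier => title
}

# Turn that hash into one "identifier title" line per dictionary.
sub format_dbs {
    my (%dbs) = @_;
    return map { sprintf "%-12s %s\n", $_, $dbs{$_} } sort keys %dbs;
}

# Only contact a server when a hostname is given on the command line.
if (my $host = shift @ARGV) {
    print format_dbs(fetch_dbs($host));
}
```

Saved under a name of your choosing, it would be run with the server as its only argument, for example dict.org. Any identifier in the first column can then be passed as the second argument to $dict->define.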
Clarifying the thesaurus
For brevity, the thesaurus section prints all the synonyms
and antonyms for a particular word. It would be more useful if it separated them
according to the function of the word and possibly the
definition.
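One possible way to approach this is to key the results on the Function line of each section, collecting the words in a hash of arrays. The sketch below runs against hypothetical sample markup modeled on the HTML fragments used in dict.pl; the row layout and the name synonyms_by_function are assumptions, so check them against the real page source before relying on them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $middle_html = ':</b> </td><td>';
my $end_html    = '</td></tr>';

# Collect synonyms per function, e.g. { noun => [...], verb => [...] }.
# (synonyms_by_function is a hypothetical helper name.)
sub synonyms_by_function {
    my ($html) = @_;
    my %by_function;
    while ($html =~ /Entry(.*?)<b>Source:<\/b>/sg) {
        my $section = $1;
        my ($function) =
            $section =~ /Function\Q$middle_html\E([^<]*)\Q$end_html\E/;
        my ($synonyms) =
            $section =~ /Synonyms\Q$middle_html\E([^<]*)\Q$end_html\E/;
        # Skip sections that lack either line.
        next unless defined $function && defined $synonyms;
        push @{ $by_function{$function} }, split /,\s*/, $synonyms;
    }
    return \%by_function;
}

# Hypothetical sample markup with one verb entry and one noun entry.
my $sample = join "\n",
    '<tr><td><b>Entry:</b> </td><td>hack</td></tr>',
    '<tr><td><b>Function:</b> </td><td>verb</td></tr>',
    '<tr><td><b>Synonyms:</b> </td><td>chop, cut, hew</td></tr>',
    '<tr><td><b>Source:</b> </td><td>(sample)</td></tr>',
    '<tr><td><b>Entry:</b> </td><td>hack</td></tr>',
    '<tr><td><b>Function:</b> </td><td>noun</td></tr>',
    '<tr><td><b>Synonyms:</b> </td><td>drudge, hireling</td></tr>',
    '<tr><td><b>Source:</b> </td><td>(sample)</td></tr>';

my $groups = synonyms_by_function($sample);
for my $function (sort keys %$groups) {
    print "Synonyms for hack ($function): ",
        join(', ', @{ $groups->{$function} }), "\n";
}
```

The same pattern would extend to antonyms by adding a second hash, and to definitions by capturing whichever row of the section carries them.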
Tara Calishain is the creator of the site ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.