Understanding Network I/O: From Spectator to Participant
by George Belotsky
11/06/2003
This series of two articles will show you how to participate in the global Internet. The architects of this greatest advance in communications since the printing press have strived to make such participation possible. The Internet is — by design — a simple network, engineered only to route data from one location to another. All of the valuable services (such as the World Wide Web and email) are implemented at the endpoints, not inside of the Internet itself (see Further Reading for more information).
Because everybody is allowed to add value at the endpoints of the Internet, you can do it too! Today, this is easier than ever before. You do not need to be proficient in a difficult programming language, such as C or C++. For example, you will see that a useful web client can be about as complicated as a shopping list.
This article describes the basics of networked applications, providing information and sample code to get you started immediately. The second article will delve deeper into various techniques for network I/O, including important, practical results that help you choose the best method.
This article focuses on Internet clients. Clients — like your web browser — request information from servers (like the one from which you accessed this page). Typically, the client then presents the information to a person, although there are clients that talk to other computer programs instead. The next article will present ideas that are also applicable to developing servers and peer-to-peer systems.
With just a basic Internet connection, you can create your own clients (for personal or internal company use) in a leisurely afternoon. The following discussion presents the core techniques, illustrated with complete, working web clients. It is very important that you do not skip the last section of this article. The discussion there covers several simple things that you can do to avoid mystifying errors and security breaches in your network application.
A Simple Client
Here is a simple web client. It displays the current outdoor temperature in New York City and then exits.
import urllib # Library for retrieving files using a URL.
import re # Library for finding patterns in text.
# The NOAA web page, showing current conditions in New York.
url = 'http://weather.noaa.gov/weather/current/KNYC.html'
# Open and read the web page.
webpage = urllib.urlopen(url).read()
# Pattern which matches text like '66.9 F'. The last
# argument ('re.S') is a flag, which effectively causes
# newlines to be treated as ordinary characters.
match = re.search(r'(-?\d+(?:\.\d+)?) F', webpage, re.S)
# Print out the matched text and a descriptive message.
print 'In New York, it is now', match.group(1), 'degrees.'
Here is the output produced by the client.
Example 2. Output from the simple client
In New York, it is now 52.0 degrees.
The client is written in Python,
probably the easiest fully featured mainstream programming language available
today. Python is used by such organizations as Google, Yahoo, NASA, and
Lawrence Livermore National Laboratory. The language is open source (so you
can use it freely) and cross-platform. Note that any text in the example that
starts with a # is a descriptive comment for human readers of the
code; Python ignores these comments.
If you are unfamiliar with Python, the Further Reading appendix provides several helpful links. C++, Perl, Java, and Visual Basic programmers should find Python very easy. It is probably the ideal choice for beginners, as well.
To run the examples in this article, it is sufficient to save the code into
an ordinary text file and then issue the command python myfilename
from the shell (the DOS command
prompt, for Windows users). See the Installation Notes for easy directions on installing Python.
Now we are ready for a more detailed discussion of the example. The first two lines import libraries to use in our program. The Python distribution includes a library to download files based on a URL (e.g., from a web server). There is no need to do any low-level socket programming — the library does this work for you.
The program opens and reads a web page from the National Oceanic and
Atmospheric Administration (NOAA) web server. This server hosts pages with
current weather information from all over the world. The specific page
retrieved by our example shows the conditions in New York City. The information
is saved as one long string in the webpage variable.
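The example uses the Python 2 spelling, urllib.urlopen. As a minimal sketch, assuming a Python 3 interpreter, the same step might look like the following. The helper name fetch_page and the latin-1 default are illustrative assumptions, not part of the original client.

```python
import urllib.request  # Python 3 home of urlopen (assumes Python 3)

def fetch_page(url, encoding='latin-1'):
    """Return the body of url as one long string (hypothetical helper)."""
    # urlopen returns a file-like object; read() yields bytes in Python 3.
    response = urllib.request.urlopen(url)
    try:
        raw = response.read()
    finally:
        response.close()
    # Decode to text so that a string pattern can be applied to the result.
    # latin-1 is an assumption; it maps every byte to some character.
    return raw.decode(encoding)
```

With this helper, the line that fills the webpage variable would read webpage = fetch_page(url).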
Next, a regular expression is applied to the webpage
string. This is a quick, effective way to extract information from text without
implementing complex parsing logic yourself. Applications range from input
validation for security (e.g., in CGI scripts) to bioinformatics (e.g.,
searching for DNA sequences in a genome).
While regular expressions can be a difficult topic, you do not need to become an absolute expert to make good use of this technology in many situations. The Further Reading appendix lists several resources for learning about regular expressions.
The final line in the example prints out the temperature reading, captured as part of the regular expression match. The following discussion gives the details of the matching process.
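The final line assumes that the pattern was found. re.search returns None when nothing matches, so a slightly more defensive version of this last step might look like the sketch below. The function name extract_temperature and the sample strings are hypothetical; they are not part of the original client.

```python
import re

# The same pattern as in the simple client, compiled once for reuse.
TEMPERATURE = re.compile(r'(-?\d+(?:\.\d+)?) F')

def extract_temperature(page):
    """Return the reading as a float, or None if the page has no match."""
    match = TEMPERATURE.search(page)
    if match is None:
        # Nothing matched; let the caller decide what to report.
        return None
    return float(match.group(1))

reading = extract_temperature('Temperature: 52.0 F')          # made-up text
missing = extract_temperature('<html>nothing here</html>')    # made-up text
```

Returning None instead of raising an error is only one possible design; a client could just as well print a message and exit.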
Capturing a Temperature Reading with a Regular Expression
Here is, once again, the regular expression pattern from the simple client example.
Example 3. Temperature-reading pattern
r'(-?\d+(?:\.\d+)?) F'
The leading r specifies a raw Python string. This prevents
the interpreter from processing backslash escape sequences. In
a raw string, what you see is what you get: r'\n' is two
characters (a backslash followed by the letter n), whereas an
ordinary string '\n' would be interpreted as a linefeed. If you
are interested in the subtle details (not usually necessary in day-to-day
programming), read the string literals subsection of the Python Reference Manual.
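A quick way to convince yourself of the difference is to count the characters. This is only an illustrative sketch; the variable names are made up.

```python
# A raw string keeps the backslash; an ordinary string processes it.
raw = r'\n'
cooked = '\n'

print(len(raw))      # 2: a backslash and the letter n
print(len(cooked))   # 1: the linefeed
print(raw == '\\n')  # True: the same two characters, written without r
```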
The first character in the pattern string is the open bracket,
(. The bracket starts a grouping; text that matches the pattern
inside of the brackets will be saved, so you can retrieve it for later use. This
is the first grouping in the pattern; thus, the last line of the example retrieves the text captured by the grouping
with match.group(1).
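The sketch below shows the difference between the whole match and the saved grouping. The sample text is made up for illustration.

```python
import re

text = 'Temperature: 66.9 F (19.4 C)'   # made-up sample text
match = re.search(r'(-?\d+(?:\.\d+)?) F', text)

whole = match.group(0)     # everything the pattern matched: '66.9 F'
reading = match.group(1)   # only the text inside the first grouping: '66.9'
```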
The temperature reading might be negative, so the minus sign is next. The
question mark is a special character indicating that the preceding character
is optional (in this case, the temperature reading may or may not contain a
minus sign). If you need to match an actual question mark in the text, escape
it with a backslash like this: \? (similarly, the backslash
itself is matched by \\).
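Here is a small sketch of these three cases. The sample strings are invented for the purpose.

```python
import re

# The minus sign is optional: both texts match the same pattern.
below_zero = re.search(r'-?7', 'low of -7 tonight').group(0)   # '-7'
above_zero = re.search(r'-?7', 'high of 7 today').group(0)     # '7'

# An escaped question mark matches a literal '?' in the text.
literal_q = re.search(r'F\?', 'Is it 66.9 F?').group(0)        # 'F?'

# A doubled backslash in the pattern matches one backslash in the text.
literal_backslash = re.search(r'\\', r'C:\temp').group(0)
```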
Next, we expect to find some digits. A single digit is matched by a
\d. Note that d is a normal character, which usually
matches itself (i.e., the literal letter d in the text). Here, the
backslash is used to turn on a special meaning for d — that
of matching a single digit. This is a general technique in regular expressions;
some ordinary characters (like d) have a special meaning when
prefixed with a backslash, while special characters (such as ?)
become ordinary characters (which match themselves) when the backslash is
prepended.
The \d is followed by the special character +,
which specifies that the preceding item must occur at least once, but may
repeat. Thus, \d+ matches a run of digits of any length,
but there must be at least one digit, or the match will fail. If you
want to allow zero or more digits, use * instead of
+, like this: \d*.
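The following sketch contrasts the plain d with \d, and + with *. The sample text is made up.

```python
import re

text = 'id 2003'   # made-up sample text

plain_d = re.search(r'd', text).group(0)      # 'd': the literal letter
one_digit = re.search(r'\d', text).group(0)   # '2': a single digit
digits = re.search(r'\d+', text).group(0)     # '2003': one or more digits

# With *, zero digits is an acceptable match, so the search succeeds
# immediately at the start of the text with an empty string.
maybe_digits = re.search(r'\d*', text).group(0)

# With +, a text that has no digits at all fails to match.
no_match = re.search(r'\d+', 'no digits here')
```

Note how \d* happily matches nothing at all; this is one reason + is the safer choice when at least one digit is required.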
After the first string of digits, the temperature might contain a decimal
point, with more digits following. This whole sequence, however, is optional,
so a little more work is required to specify the pattern. We begin another
grouping, nested inside of our original one. This time, however, the grouping is
slightly different: (?: instead of just an opening bracket. The
?: sequence just inside of the bracket indicates that the grouping
should not be saved — only matched. We are already saving the entire
match in our first grouping, and having a copy of just the fractional part of
the reading is not needed for this example.
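To see what the ?: sequence changes, compare the saved groupings with and without it. This is a sketch on a made-up sample string.

```python
import re

text = '66.9 F'   # made-up sample text

saved = re.search(r'(-?\d+(\.\d+)?) F', text)       # inner grouping saved
unsaved = re.search(r'(-?\d+(?:\.\d+)?) F', text)   # inner grouping not saved

print(saved.groups())     # ('66.9', '.9')
print(unsaved.groups())   # ('66.9',)
```

Both patterns match the same text; only the list of saved groupings differs.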
Inside of the new grouping, we have \.\d+. The \d+ is
already familiar: it matches one or more digits. The \. matches
the decimal point. The backslash escape is required because . by
itself matches almost any single character. This is why, for example,
.* is often used to "match anything."
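A short sketch of the difference between the escaped and the unescaped dot follows. The sample strings are invented.

```python
import re

# An unescaped . matches almost any character, so '66x9' slips through.
loose = re.search(r'\d+.\d+', '66x9')

# An escaped \. matches only a literal decimal point.
strict = re.search(r'\d+\.\d+', '66x9')
exact = re.search(r'\d+\.\d+', '66.9')

# .* matches everything up to the end of the line.
anything = re.search(r'.*', 'anything at all').group(0)
```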
After the closing bracket of the nested grouping, there is a single question
mark. This question mark applies to the entire nested grouping. Thus,
we have an optional fractional part — the decimal point and at least one
digit — to the temperature reading. There are easier ways to specify this
fractional part, but they may allow malformed constructs (such as
65. with no digits after the decimal point) to slip through.
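As an illustration, here is one shorter way to write the fractional part, next to the pattern used in the client. The shorter pattern is an assumption chosen for this sketch, not something taken from the example files, and the sample texts are made up.

```python
import re

strict = r'(-?\d+(?:\.\d+)?) F'   # the pattern used in the simple client
loose = r'(-?\d+\.?\d*) F'        # a shorter variant (an assumption)

malformed = 'Temperature: 65. F'  # made-up text with a dangling point
well_formed = 'Temperature: 66.9 F'

strict_result = re.search(strict, malformed)           # None: rejected
loose_result = re.search(loose, malformed).group(1)    # '65.' slips through

# Both patterns agree on a well-formed reading.
strict_ok = re.search(strict, well_formed).group(1)    # '66.9'
loose_ok = re.search(loose, well_formed).group(1)      # '66.9'
```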
Next comes the closing bracket of our top-level grouping. After that, the
last part of the pattern will match an ordinary space followed by the
F character (which stands for "Fahrenheit"). While only the part
inside of the outer brackets (the temperature reading itself) will be saved, the
single space and the letter F must follow in the text in order for
the match to succeed. This is a simple case of including extra information in a
pattern, in order to identify just the right data. In the current example, here
is the result if the "space F" sequence is left out of the regular
expression.
Example 4. Matching the wrong thing
In New York, it is now 3 degrees.
When it is winter in New York, you might think that this reading is
correct! In fact, it comes from the W3C substring, which occurs
at the start of the returned HTML page.
Example 5. Start of an HTML document
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"hmpro6.dtd">
In the given example, just adding the "space F" yields the right result. Depending on the complexity and variability of the data being processed, you may need more identifying information in your pattern, in order to select the correct text. Of course, if you do not control the generation of the data you are processing, it is always possible that its format will be changed without warning. You should always consider this possibility, as well as the potential impact on your application.
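The effect of the "space F" can be checked directly against the start of the page shown in Example 5. In this sketch, the last line of the sample text is made up to stand in for the rest of the document; the real page layout may differ.

```python
import re

# Start of the returned page (Example 5), followed by a made-up line
# standing in for the rest of the document.
webpage = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
           '"hmpro6.dtd">\n'
           'Temperature: 52.0 F\n')

# Without the "space F", the first digit in the page wins: the 3 in W3C.
without_f = re.search(r'(-?\d+(?:\.\d+)?)', webpage, re.S).group(1)

# With the "space F", the digits in W3C, 4.01, and hmpro6 are all skipped.
with_f = re.search(r'(-?\d+(?:\.\d+)?) F', webpage, re.S).group(1)

print(without_f)   # 3
print(with_f)      # 52.0
```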
Another interesting variation is writing a web client for a site that you control. For example, the client may interact with the CGI scripts on the web site. In this case, you would have effectively defined a network protocol of your own, layered on top of HTTP.
In the next section, we provide a Graphical User Interface (GUI) for our simple client. This is quite easy to do in Python.