Document Mathematics: Count Your Words

03/30/2001

OK, finally they convinced you to use HTML as the document format of your choice. Sure, it's a great format, as more and more applications make use of it -- not to mention the Web. But, to be honest, working with HTML documents is like living in the pre-word-processing age in many cases. Have you ever tried to find out the number of words or letters in an element or the whole document? Or ever tried to include this number dynamically into your document? A DOM implementation, a bit of JavaScript, and this article will tell you -- by the way, this document contains words.

Introduction

OK, let's say we want to create an application that counts the number of words in a paragraph, table, or the whole document, and prints this number exactly where we want it to. We want to be able to format the result, of course. Naturally, we need a reusable solution. This is what we basically need to solve this task:
The second sub-task is an easy one; a few lines of code and some regular expressions will do it. We'll talk about that later. Since we want to be good guys, we're only using standards-compliant methods to work with documents when solving the first sub-task. We'll not use some (well, I have to admit it -- handy) property of Instead, I'll show you how to climb any DOM tree. You'll find this much handier, because you can recycle the techniques described for a lot of other tasks. I'll end our journey through the DOM with some final words and talk about the drawbacks of a different solution or buggy behavior.

A quick note...

Sometimes the term DOM might seem confusing. Depending on the context in which it's used, it can refer to:
To keep things simple, I'll give only a very basic overview of the concepts behind the Document Object Model; we'll only talk about the things we need to solve our task.

Climb the DOM

The DOM is a (very) basic system to organize your elements in a set of parent-child relations, with every element having one parent element (except the root element) and exactly one element being the root element. To put it simply for our case, an element is created by some markup, indicated by the two angle brackets, When you visualize such a model, you'll find it forms a flipped tree, i.e., the root is at the top. Visualizing this model on your own is pretty easy, as you're already familiar with this system. Every time you mark up a document with HTML you're using it, as you're nesting elements within elements, thus creating parent-child relations. Picture 1 shows some HTML code and the corresponding part of the DOM tree.
As you can see, the code translates into the graph almost automatically.

The node object
In addition to these properties, the DOM also defines a lot of methods for the individual objects. I'll introduce them only when necessary for our purposes. In case you're interested in the complete specs, you should read not only the descriptive texts, but also the ECMAScript Language Binding. The binding describes what a DOM implementation should look like from the JavaScripter's side. Now that we have the description of the node object, how do we connect to a browser's live DOM tree?

The document object
To find an individual element for which the ID attribute is defined, use the
And access the tree:
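As an illustration, here is a minimal sketch of reading a node's standard properties once you hold a reference to it. The helper name `describeNode` is hypothetical and not part of the DOM or of this article's script; the properties it reads (`nodeName`, `nodeType`, `childNodes`, `parentNode`) are defined by DOM Level 1 Core.

```javascript
// Hypothetical helper (not from the article's script): summarize a node
// using only standard DOM Level 1 properties.
function describeNode(node) {
  // The document node has no parent, so guard against null.
  var parentName = node.parentNode ? node.parentNode.nodeName : "(none)";
  return node.nodeName +
    " (type " + node.nodeType + "), " +
    node.childNodes.length + " child node(s), parent: " + parentName;
}
```

In a browser you might call it as `describeNode(document.documentElement)`, or pass the result of `document.getElementById("intro")`, where `intro` is an assumed ID that would have to exist in your page.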
OK, back to our word counter problem. We're looking for text, which is always stored in a text node (don't worry about your JavaScript code, as this goes into a different kind of node), so we can compare the
Now things get a bit complicated. Our solution must handle a DOM tree of any shape, because we want to count all words under a given element (remember the flipped tree model). The answer is actually simple: recursion.

Simple recursion

Recursion is one of the most powerful mechanisms in programming. It tends to be quite confusing, and has not been necessary for most pre-DOM JavaScript applications. However, we only need some basic information about it, so I think its use will be clear for this application. Basically, a recursive function is a function that calls itself while it's executing. The path a recursive function takes through the code must be well-defined to avoid endless recursion. The interesting part is: What happens when a function interrupts itself at some point by calling itself? You get the same behavior that you'd expect from any other function call. After returning from some method, you'll find your local variables untouched. Even after returning from a call to the same method, you'll find them untouched, as every function call operates with its own local variables. Let's look at some non-recursive code to retrieve all (child) text nodes of some element:
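A minimal sketch of such a non-recursive loop, not necessarily identical to the original listing. The function name `getChildText` and the constant `TEXT_NODE` are names chosen here for illustration; the value 3 is the node type the DOM assigns to text nodes.

```javascript
var TEXT_NODE = 3; // nodeType value of a text node in the DOM

// Illustrative helper: concatenate the text of the *direct* child
// text nodes of `element`. Text nested deeper in the tree is ignored.
function getChildText(element) {
  var words = ""; // start with an empty string so += appends text
  for (var i = 0; i < element.childNodes.length; i++) {
    var child = element.childNodes[i];
    if (child.nodeType === TEXT_NODE) {
      words += child.nodeValue;
    }
  }
  return words;
}
```

Note the limitation: for a paragraph like `<p>Hello <em>big</em> world</p>`, the text inside the `em` element is skipped, because it is a grandchild rather than a child of the paragraph.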
To retrieve all text nodes below an element, not just its direct children, you need code that calls itself for every child node found, as each of these could contain more elements or text nodes. The solution is obvious: We wrap the code snippet above into a function:
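One way to sketch such a recursive function is shown below. It is an assumption-laden illustration rather than the original listing: the name `collectText` is hypothetical, and this variant returns the collected string from each call instead of appending to a shared variable.

```javascript
var TEXT_NODE = 3; // nodeType value of a text node in the DOM

// Illustrative recursive walk: return the concatenated text of `node`
// and of every node below it, in document order.
function collectText(node) {
  var words = ""; // local to this call; nested calls get their own copy
  if (node.nodeType === TEXT_NODE) {
    words += node.nodeValue;
  }
  // The function calls itself for each child; the loop resumes where it
  // left off once the nested call returns.
  for (var i = 0; i < node.childNodes.length; i++) {
    words += collectText(node.childNodes[i]);
  }
  return words;
}
```

The recursion ends on its own because every call moves one level down the tree, and a node without children never enters the loop.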
That's it, a recursive function to walk through any DOM tree, no matter which form it takes. Adding text was shifted outside the loop to catch all nodes and, instead of this, the function calls itself in the loop. Because each opened loop is continued after returning from a (nested) method call, this algorithm catches all nodes under the root node given in the first call to

You may wonder why I initialized the words variable. Take line 5 of the previous example without an initialized string variable. This code would add a node's value to an undefined variable for the first run, thus yielding

We're almost finished now. Before wrapping the new code into another function, let's crack the next nut.

Regular expressions

As I mentioned before, finding words in a text is an easy task using regular expressions (sometimes called regexps). A regexp is a pattern to match against a string. This pattern contains literal characters, i.e., characters that are not interpreted, and special characters. These special characters can be used to group and build classes of characters, but can also act as quantifiers for their leading character. With quantifiers, you can write patterns that say: "I need at least one letter of that" (
In addition to that, the regular expression language defines a couple of special characters. A special character is indicated by a backslash (
You probably noticed that I already introduced everything we need. This is a simple definition of a word: Find a word boundary, at least one letter, and then a word boundary. Which translates into this regexp:
A regexp is defined in between a pair of slashes (/), so I've included them here. To find all occurrences of this pattern inside a string, we have to add a modifier to make the search global:
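To see a global word pattern at work, here is a minimal sketch. It assumes the pattern `/\b\w+\b/g`, which is one way to spell "word boundary, at least one word character, word boundary"; the function name `countWords` is chosen here for illustration.

```javascript
// Assumed pattern: \b is a word boundary, \w+ is one or more word
// characters (letters, digits, underscore), g makes the search global.
var wordPattern = /\b\w+\b/g;

// Illustrative helper: count the matches of wordPattern in `text`.
function countWords(text) {
  // With the g modifier, match() returns an array of all matches,
  // or null when nothing matches.
  var matches = text.match(wordPattern);
  return matches ? matches.length : 0;
}
```

With this definition, `countWords("Count your words, please!")` returns 4. Be aware that `\w` is a rough notion of a letter: digits count as word characters, and an apostrophe splits a word, so `countWords("it's")` returns 2.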
With JavaScript's regexps, you have two options on how to invoke pattern matching. You can either invoke matching directly on some
For a detailed introduction to JavaScript's regular expression syntax, you should consult the JavaScript Guide.

Putting it together

Now, we know how to retrieve all text nodes and create a string of them, as well as how to count the words. The next step is to create another function where we wrap the recursive function and the variables described above. This method also handles the pattern matching. It expects an element to start the search at and returns the number of words; see

Since we want the result to appear in the document, we need to define where it should appear. You can just use any HTML tag with its own ID. This ID is supplied to the method, which finds the appropriate element by
One more new method and we're done.
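One standards-based way to put a number into the document is to create a text node and append it to the target element. The sketch below assumes that approach; the function name `showCount` and its `doc` parameter are hypothetical, while `createTextNode` and `appendChild` are standard DOM Level 1 methods.

```javascript
// Illustrative helper: display `count` inside `target`.
// `doc` is any object providing createTextNode(), for example the
// browser's `document`; passing it in keeps the function easy to test.
function showCount(doc, target, count) {
  var textNode = doc.createTextNode(String(count));
  target.appendChild(textNode); // the number becomes the last child of target
  return textNode;
}
```

In a browser this could be called as `showCount(document, document.getElementById("wordcount"), 42)`, where `wordcount` is an assumed ID of an element in your page.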
We're finally done! To make this a handy real-world application, I've created another function, showWords(), which takes two arguments: an element to display the result in and an element in which to count the words. This function checks its arguments (and makes heavy use of JavaScript's loose typing), which means that you can supply either an element's ID or an element. When no element to search is supplied, this function selects the document's root element. See the source code of

Now you can easily include this function in your own documents. Just include the script and call the method after the DOM tree has been built, i.e., after the document has been loaded. You can find the complete script and instructions in the O'Reilly JavaScript Library.

Some final words

As I already mentioned, one might use a different initial approach to create a working solution. Here are some possible alternatives, together with an explanation of why I didn't use them. Perhaps you looked at the ECMAScript Binding and found a method of
However, this does not work and is a non-standard way to access the DOM anyway.

Internet Explorer 5.x for Windows contains a bug that deletes white space after inserting a new node into the tree. You can avoid this behavior by using the HTML entity (

There is another point to mention. The classical way to insert your own text into a document is by using

The DOM implementations of the two major players, Mozilla/Netscape 6 and Internet Explorer, also store the textual content of the DOM elements in properties of the

Claus Augusti is O'Reilly Network's JavaScript editor.