Most systems these days can generate log files to store data about the activity of the system. What about when you are asked to transform all of that data into usable information? I will show you how to use regular expressions and .NET's XML classes to turn your log files into a dataset to allow you to search, sort, and report on your data.
Topics covered:
Regex class and regular expression capture groups
XmlTextWriter class
DataSet class
System.Text.RegularExpressions namespace

Taking a look at one of the log files on my system, I see the following:
25/05/2002 21:49 Search Dozer Anita1
25/05/2002 21:51 Update Dozer Anita1
26/05/2002 11:02 Search Manda Gerry2k
26/05/2002 11:12 Update Manda Gerry2k
27/05/2002 15:34 Search Anka Anita1
.
.
.
12/08/2002 10:14 Search Amber Huarez
Each line is made up of five tab-delimited columns: date, time, action, record, and user.
We need to transform this blob of text into something a little more structured. You might be thinking, "Hmm, why not import the file into Access using the tab-delimited wizard?" That solution would be totally OK if we had one file or just a few files. The solution here requires a little more automation. Plus, had the log files been written in a different format, for example, several lines per log entry, we'd be in trouble. What we need here is a structured data format; enter XML.
We can see several benefits from transforming these files into XML. With XML, we can:
If you've worked with regular expressions before, you know that using them
is one of the fastest ways of parsing text. In the .NET Framework, the main
class to be used in this area is the System.Text.RegularExpressions.Regex
class. One of the most powerful features of this class is the ability to specify,
within the search pattern, named groups that make it easy to parse and retrieve
parts of the text.
For example, given the text 17/08/1975 and the regular expression
(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2})),
I can write code to retrieve any part of the text in the date by
name, like so:
const string pattern = @"(?<day>\d{1,2})/" +
    @"(?<month>\d{1,2})/" +
    @"(?<year>(?:\d{4}|\d{2}))";
string GivenDate = @"17/08/1975";
Match match = Regex.Match(GivenDate, pattern);
if (match.Success)
{
    Console.WriteLine(string.Format("Day:{0},Month:{1},Year:{2}",
        match.Groups["day"].Value,
        match.Groups["month"].Value,
        match.Groups["year"].Value));
}
This yields:
Result: Day:17,Month:08,Year:1975
Note: If you don't understand the code above, you should refer to the two articles mentioned at the beginning of this article.
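The same named-group technique extends to a whole log line. What follows is only a minimal sketch, assuming the five tab-delimited columns seen in the sample log (date, time, action, record, user). The LogLineDemo class, its Describe method, and the exact pattern are hypothetical names made up for illustration; they are not part of the article's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogLineDemo
{
    // Hypothetical pattern for one tab-delimited log line.
    // Each column gets its own named group.
    private const string Pattern =
        @"^(?<date>\d{1,2}/\d{1,2}/(?:\d{4}|\d{2}))\t" +
        @"(?<time>\d{2}:\d{2})\t" +
        @"(?<action>[^\t]*)\t(?<record>[^\t]*)\t(?<user>\w*)";

    // Returns the five columns joined by single spaces,
    // or null when the line does not look like a log entry.
    public static string Describe(string line)
    {
        Match m = Regex.Match(line, Pattern);
        if (!m.Success)
        {
            return null;
        }
        return string.Format("{0} {1} {2} {3} {4}",
            m.Groups["date"].Value,
            m.Groups["time"].Value,
            m.Groups["action"].Value,
            m.Groups["record"].Value,
            m.Groups["user"].Value);
    }

    public static void Main()
    {
        string line = "25/05/2002\t21:49\tSearch\tDozer\tAnita1";
        // Prints: 25/05/2002 21:49 Search Dozer Anita1
        Console.WriteLine(Describe(line));
    }
}
```

Using [^\t]* for the middle columns, rather than .*, keeps each group from running past its own tab, which is one possible way to make the match less dependent on backtracking.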
Now suppose we want to turn the parsed date into XML that looks like this:
<Date>
    <day>17</day>
    <month>08</month>
    <year>1975</year>
</Date>
Outputting this kind of XML used to be a pretty easy, but pretty
error-prone, task. Sure, you could just slap each string into a memory buffer
with XML tags, but the number of errors you could introduce makes this approach pretty untenable.
The XmlTextWriter class in the System.Xml namespace
relieves us of a lot of details here, and very conveniently abstracts away all of the
"boilerplate" code you would otherwise need to write, allowing us to concentrate on the content
we wish to write in our XML document.
To show just how easy it is to use this class, here's a class that takes in
the Match object from the last example, writes an XML document
with this data into a memory buffer, and returns this XML output:
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public class XMLUtil
{
    public static string ToXML(Match regexMatch)
    {
        StringBuilder output = new StringBuilder();
        // Write the XML into an in-memory string buffer
        XmlTextWriter writer =
            new XmlTextWriter(new StringWriter(output));
        // Make the XML more readable
        writer.Formatting = Formatting.Indented;
        // Write the start of a standard XML document
        writer.WriteStartDocument();
        // Create the opening node for our date element
        writer.WriteStartElement("Date");
        // Write out each date element value as a separate node
        writer.WriteElementString("day",
            regexMatch.Groups["day"].Value);
        writer.WriteElementString("month",
            regexMatch.Groups["month"].Value);
        writer.WriteElementString("year",
            regexMatch.Groups["year"].Value);
        // Close the date element
        writer.WriteEndElement();
        // Finish the document; closes any elements still open
        writer.WriteEndDocument();
        // Close the writer
        writer.Close();
        return output.ToString();
    }
}
The output looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Date>
<day>17</day>
<month>08</month>
<year>1975</year>
</Date>
As you can see, it's a very easy job to write XML using this
class. I first create an in-memory StringBuilder
that will house the created XML. I then hand it off to the constructor of a
StringWriter, which is used to construct our
XmlTextWriter object.
I could just as easily have passed in any
System.IO.TextWriter-derived object; thus, I have
the flexibility of writing to pretty much anything I want.
I then call the WriteStartDocument method, which
creates the <?xml version="1.0" ...?> declaration at the beginning of the
XML text.
(I don't have to call it, though. I can just start writing out
elements right away.) I then open a new element tag that will contain sub-elements,
using the WriteStartElement method.
Then I proceed to write the actual values as sub-nodes in the
open element using the WriteElementString method,
passing in the name of the node and the value inside of it.
To finish, I call the WriteEndDocument
method, which closes all open elements in the XML.
Had I wanted to just close the current Date
element, I would call the WriteEndElement method
and continue writing more elements.
Now, if you're trying out the code to produce this XML, you might find a little
surprise in the generated XML file. The first line might read
<?xml version="1.0" encoding="utf-16"?> instead
of <?xml version="1.0" encoding="utf-8"?>, and as a consequence,
you might have some problems reading in the XML file later on. In order to control
the encoding with which the XmlTextWriter writes the XML file,
you'll need to specify the encoding in the XmlTextWriter's constructor. This
also means that it is simpler to just pass in a file name to the constructor rather
than use an in-memory buffer, which would then be written to a file anyway. Here's
the code to initialize the writer with a file name and the proper encoding:
//Create an XML textWriter object instance with a file name
XmlTextWriter xmlFile =
new XmlTextWriter(FileName + ".xml",Encoding.UTF8);
This should solve our problem, and since we are writing to a file, we can get
rid of the code that writes the StringBuilder into the file.
Actually, we can make the writing function much more generic, by automatically going through all of the groups of a given match and writing their names and values as XML. The following bit of code shows how to do this:
// Write out all the groups of a match to XML
Regex reg = new Regex(pattern);
Match regexMatch = reg.Match(inputString);
if (regexMatch.Success)
{
    // Group 0 is the whole match, so start at 1
    for (int i = 1; i < regexMatch.Groups.Count; i++)
    {
        writer.WriteElementString(reg.GroupNameFromNumber(i),
            regexMatch.Groups[i].Value);
    }
}
In order to achieve this, we need to have an instance of the
Regex class to play with.
We have to use this same instance to receive the
Match object. Then we can use the Regex
instance to retrieve the name of a group, based on its
number:
reg.GroupNameFromNumber(i)
Don't ask me why the group name is not a property of the Group
class.
This means that for this functionality to work, we can't use the
static Match() method of the Regex class,
which makes things a bit more cumbersome. That's why, for
the remainder of the code samples, I'll use the earlier version
of the code, although it's less generic.
You can still implement the generic version in your own
programs, if you wish.
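For readers who do want the generic version, here is one way it might be packaged as a self-contained helper. This is a sketch under assumptions, not the article's own code: the GenericMatchWriter class, its ToXml method, and the rootName parameter are hypothetical names, and the sketch assumes every capturing group in the pattern is named, so that each group name is a usable element name.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public static class GenericMatchWriter
{
    // Hypothetical helper: matches 'input' against 'pattern' and writes
    // every named group as a child element of <rootName>.
    public static string ToXml(string pattern, string input, string rootName)
    {
        // An instance (not the static Regex.Match) is needed so that
        // GroupNameFromNumber can be called on the same Regex object.
        Regex reg = new Regex(pattern);
        Match regexMatch = reg.Match(input);

        StringBuilder output = new StringBuilder();
        using (XmlTextWriter writer = new XmlTextWriter(new StringWriter(output)))
        {
            writer.WriteStartElement(rootName);
            if (regexMatch.Success)
            {
                // Group 0 is the entire match, so start at 1
                for (int i = 1; i < regexMatch.Groups.Count; i++)
                {
                    writer.WriteElementString(reg.GroupNameFromNumber(i),
                        regexMatch.Groups[i].Value);
                }
            }
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    public static void Main()
    {
        string pattern = @"(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2}))";
        // Prints: <Date><day>17</day><month>08</month><year>1975</year></Date>
        Console.WriteLine(ToXml(pattern, "17/08/1975", "Date"));
    }
}
```

The sketch leaves Formatting at its default, so the elements come out on a single line; setting writer.Formatting = Formatting.Indented would give the indented layout shown earlier.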
OK. We know how to parse, and we know how to output to XML. Let's try to wrap this up using a class that takes in a single log file and transforms it into an XML file. This class should receive the name of the log file to read, parse it line by line, and generate a [logFileName].xml file:
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public class LogConverter
{
    public static void ConvertLogFile(string FileName)
    {
        string Pattern = @"(?<date>(?<day>\d{1,2})/" +
            @"(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2}))" +
            @"(?x))\t(?<time>(?<hour>\d{2}):(?<minutes>\d{2}))\t" +
            @"(?<action>.*)\t(?<record>.*)\t(?<user>\w*)";
        string line = String.Empty;
        // Open the log file for reading
        TextReader reader =
            new StreamReader(File.OpenRead(FileName));
        // Create an XmlTextWriter that writes straight to
        // [FileName].xml as UTF-8, so the encoding named in the
        // XML declaration matches the bytes written to disk
        XmlTextWriter xmlFile =
            new XmlTextWriter(FileName + ".xml", Encoding.UTF8);
        // Initialize the XML writer
        xmlFile.Formatting = Formatting.Indented;
        xmlFile.WriteStartDocument();
        xmlFile.WriteStartElement("Entries");
        // Read each line in the file
        while ((line = reader.ReadLine()) != null)
        {
            // Try to match the line using regular expressions
            Match parsed = Regex.Match(line, Pattern);
            if (parsed.Success)
            {
                // If we get a regex Match, we pass
                // the XML writer off to a method that will
                // use the Match groups to generate XML data
                // inside our XML document
                WriteAsXML(parsed, xmlFile);
            }
        }
        // Finish off any open elements and flush the file
        xmlFile.WriteEndDocument();
        xmlFile.Close();
        // Release the log file
        reader.Close();
    }

    private static void WriteAsXML(Match regexMatch,
        XmlTextWriter writer)
    {
        // Open a new 'Entry' element
        writer.WriteStartElement("Entry");
        // Write out each date element value as a separate node
        // Date: full format, and separated into day, month, year
        writer.WriteElementString("date",
            regexMatch.Groups["date"].Value);
        writer.WriteElementString("day",
            regexMatch.Groups["day"].Value);
        writer.WriteElementString("month",
            regexMatch.Groups["month"].Value);
        writer.WriteElementString("year",
            regexMatch.Groups["year"].Value);
        // Time: full format, hours, and minutes
        writer.WriteElementString("time",
            regexMatch.Groups["time"].Value);
        writer.WriteElementString("hour",
            regexMatch.Groups["hour"].Value);
        writer.WriteElementString("minutes",
            regexMatch.Groups["minutes"].Value);
        // Actions, records, and users
        writer.WriteElementString("action",
            regexMatch.Groups["action"].Value);
        writer.WriteElementString("record",
            regexMatch.Groups["record"].Value);
        writer.WriteElementString("user",
            regexMatch.Groups["user"].Value);
        // Close the 'Entry' element
        writer.WriteEndElement();
    }
}
This class is pretty straightforward. Here's what's taking place:
The ConvertLogFile method creates an XmlTextWriter object and
initializes it to the proper settings. It then creates an open
Entries element inside of it, into which all of the
child Entry elements (one for each line) will be written.
For each line read from the log file, it calls the
Regex.Match method on that line, using a pattern
that matches each sub-group we identified at the beginning of
this article.
If the line matches, it passes the XmlTextWriter instance and the Match object
to a separate method, which writes the group names and values
into the XML writer instance.
The resulting file looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Entries>
<Entry>
<date>25/05/2002</date>
.
.
.
.
</Entries>
Now that we have our data stored as structured XML, we can
use it to let the user easily search through it. To do this,
we'll use a very easy technique already given to us inside the
.NET framework. We'll use a DataSet object to load
our XML data, then we'll Select data from the
dataset using a filter that can be specified by the
user. We can then display the resulting DataRows to
the user.
The DataSet class has a ReadXml method, which allows
us to pass it a file name and have it automatically load the data into a table
structure inside the dataset. For our purposes, we can send in the file name
without any additional parameters. The dataset will then hold
a table that contains a collection of DataRows,
each one holding a set of columns that corresponds to the set of
properties we created in the log file -- Date, Time, Hour, Action,
and so on. Once we have this table in place, we can use the DataTable's
Select method to retrieve any number of DataRow objects
that match the filter we provide. Here's the code to do this:
private void LoadXMLFile()
{
    // Load the XML file into the dataset
    m_ds.ReadXml(txtFileName.Text);
    // Show All log entry Rows at first load
    // by passing in a 'true' filter
    // this is just like specifying
    // SELECT * FROM ENTRIES WHERE true
    RefreshResults("true");
}

private void RefreshResults(string filter)
{
    try
    {
        // Clear the result list view
        lvResults.Items.Clear();
        // Get the first datatable inside the dataset
        // we know this one contains the data we need
        DataTable table = m_ds.Tables[0];
        // Get the datarows that match the user's filter
        // the filter can be any valid SQL filter
        DataRow[] rows = table.Select(filter);
        foreach (DataRow row in rows)
        {
            // Add an item to the list view
            ListViewItem item =
                new ListViewItem(row["date"].ToString());
            item.SubItems.Add(row["time"].ToString());
            item.SubItems.Add(row["record"].ToString());
            item.SubItems.Add(row["action"].ToString());
            item.SubItems.Add(row["user"].ToString());
            lvResults.Items.Add(item);
        }
    }
    catch (Exception e)
    {
        // The user might pass invalid filter expressions,
        // in which case we get an exception notifying
        // the filter parsing error in question
        MessageBox.Show(e.Message);
    }
}
Using this straightforward code, we can let the user load any
XML file and filter its contents based on a SQL-like
filter expression.
Basically, if you have written SQL code, you can write the filter
the way you would write a WHERE clause to select the rows you want.
We receive an array of DataRows, and since we know
beforehand the names of the columns for each DataRow (the same as the
XML elements in our log file), we can just display the values for
each column.
We could just as easily have looped through all of the available
columns and displayed each one's value to the user, without even
knowing what kind of data was inside of our DataRow. We could
dynamically add columns to our ListView corresponding to the name
of each DataColumn in the DataRow, and
voila -- you have yourself a more generic searching mechanism for
practically any simple XML file.
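The generic approach described above can be sketched without any UI at all. This is a rough illustration, not the article's code: the GenericRowDump class and its Dump method are hypothetical names, and the XML is read from an in-memory string rather than a file so the sample stands on its own. It walks table.Columns instead of naming each column, so it does not need to know what the log entries contain.

```csharp
using System;
using System.Data;
using System.IO;
using System.Text;

public static class GenericRowDump
{
    // Hypothetical helper: loads simple XML into a DataSet, applies a
    // filter, and returns every column of every matching row as text.
    public static string Dump(string xml, string filter)
    {
        DataSet ds = new DataSet();
        // ReadXml also accepts a TextReader, so no file is needed here
        ds.ReadXml(new StringReader(xml));
        DataTable table = ds.Tables[0];

        StringBuilder sb = new StringBuilder();
        foreach (DataRow row in table.Select(filter))
        {
            // Loop over whatever columns the XML produced
            foreach (DataColumn col in table.Columns)
            {
                sb.Append(col.ColumnName).Append('=')
                  .Append(row[col]).Append(';');
            }
            sb.AppendLine();
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string xml =
            "<Entries>" +
            "<Entry><date>25/05/2002</date><action>Search</action>" +
            "<record>Dozer</record><user>Anita1</user></Entry>" +
            "<Entry><date>25/05/2002</date><action>Update</action>" +
            "<record>Dozer</record><user>Anita1</user></Entry>" +
            "</Entries>";
        // Only the 'Search' entry should be listed
        Console.Write(Dump(xml, "action = 'Search'"));
    }
}
```

In a Windows Forms application, the inner loop could add one ListView column per DataColumn instead of appending to a string.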
Roy Osherove has spent the past 5+ years developing data-driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topics (most of which can be found on his weblog), and loves discovering new things every day.
Copyright © 2009 O'Reilly Media, Inc.