C# Regular Expressions, Revisited
by Brad Merrill, coauthor of C# Essentials, 2nd Edition
03/11/2002
It's been a year since my original article on the .NET Framework's RegEx classes, and since O'Reilly is now releasing the 2nd edition of C# Essentials, it seems worthwhile to update this topic.
Updates since Beta1
The following is a brief list of the major changes to the RegEx class library since Beta1:
- The RegularExpression assembly is now merged with the main Framework class library. This means you no longer need to reference the assembly separately. You still need to import the System.Text.RegularExpressions namespace with a using directive in order to use the classes by their short names.
- In Beta1, the Group method was used to retrieve a group by index. Now a match exposes a Groups property, a GroupCollection that can be indexed directly.
- In Beta1, the RegEx modifiers were specified as character codes. Now there is an enum called RegexOptions that provides access to the modifier functionality.
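The last two changes can be seen together in a minimal sketch. The class name, pattern, and input string are made up for illustration; only Regex, RegexOptions, Match, and the Groups indexer come from the class library.

```csharp
using System;
using System.Text.RegularExpressions;

class GroupsAndOptionsDemo   // hypothetical name, for illustration only
{
    static void Main()
    {
        // Modifiers are passed as RegexOptions enum flags,
        // not as character codes.
        Regex r = new Regex(@"(\w+)\s+(fish)", RegexOptions.IgnoreCase);
        Match m = r.Match("Red FISH");

        // Groups is a GroupCollection: index 0 is the whole match,
        // indexes 1..n are the parenthesized groups.
        Console.WriteLine("Whole=[" + m.Groups[0] + "]");   // Whole=[Red FISH]
        Console.WriteLine("Group1=[" + m.Groups[1] + "]");  // Group1=[Red]
        Console.WriteLine("Group2=[" + m.Groups[2] + "]");  // Group2=[FISH]
    }
}
```

Several options can be combined with the | operator, as the cookbook samples later in this article do.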
Compiled RegEx
The biggest new feature added in Beta2 is compiled regular expressions. A RegEx can now be compiled into its own assembly, and new assemblies can then be built that reference the separately compiled RegEx. This is similar to how Lex programs are often used to create separate parsers.
Let's look at a small (somewhat contrived) example. Say we have a pattern that stays fixed 90% of the time we use it. That pattern is a good candidate for prebuilding as a precompiled assembly.
Here's the first part of the solution, which expresses the matching pattern, and generates the assembly:
namespace MyApp
{
    using System;
    using System.Reflection;
    using System.Text.RegularExpressions;

    class GenFishRegEx
    {
        static void Main()
        {
            // create the pattern
            string pat = @"(\w+)\s+(fish)";
            // create the compile info
            RegexCompilationInfo rci = new RegexCompilationInfo(
                pat, RegexOptions.IgnoreCase, "FishRegex", "MyApp", true);
            // setup to compile
            AssemblyName an = new AssemblyName();
            an.Name = "FishRegex";
            RegexCompilationInfo[] rciList = { rci };
            // compile the regular expression
            Regex.CompileToAssembly(rciList, an);
        }
    }
}
In this sample, the compiled expression pat will match a word preceding the word fish.
We now compile and run as:
csc GenFishRegEx.cs
GenFishRegEx
We have now created FishRegex.dll, and we can use this assembly in a new program. The specifics worth noting are the RegexCompilationInfo, which specifies the name of the new type (FishRegex) and the namespace to create it in (MyApp), and the AssemblyName, which specifies the name of the newly created assembly.
You can now use this new assembly as:
// build as:
// csc /r:fishregex.dll UseFishRegEx.cs
namespace MyApp
{
    using System;
    using System.Reflection;
    using System.Text.RegularExpressions;

    class UseFishRegEx
    {
        public static void Main()
        {
            string text = "One fish two fish red fish blue fish";
            int matchCount = 0;
            FishRegex f = new FishRegex();
            foreach (Match m in f.Matches(text))
            {
                Console.WriteLine("Match" + (++matchCount));
                for (int i = 1; i <= 2; i++)
                {
                    Group g = m.Groups[i];
                    Console.WriteLine("Group" + i + "='" + g + "'");
                    CaptureCollection cc = g.Captures;
                    for (int j = 0; j < cc.Count; j++)
                    {
                        Capture c = cc[j];
                        System.Console.WriteLine(
                            "Capture" + j + "='" + c + "', Position=" + c.Index);
                    }
                }
            }
        }
    }
}
When we need a new instance of the RegEx, we simply instantiate it with new FishRegex() and then use it normally, since it inherits from Regex.
As a check on the above, it is worth building and running both programs and then examining their public methods with the ildasm tool.
Performance Considerations
If you do need to create compiled regex assemblies, note that you pay a cost for the initial assembly load. This cost can be minimized by using the NGEN tool to create a pre-JIT'ed assembly, which drastically reduces the initial load cost.
Updated Cookbook
I have updated the C# Cookbook samples for RTM. The updates consisted only of applying the changes since Beta1 outlined above to each code fragment.
// Roman Numbers
string p1 = "^m*(d?c{0,3}|c[dm])"
+ "(l?x{0,3}|x[lc])(v?i{0,3}|i[vx])$";
string t1 = "vii";
Match m1 = Regex.Match(t1, p1);
Console.WriteLine("Match=[" + m1 + "]");
// Swap first two words
string t2 = "the quick brown fox";
string p2 = @"(\S+)(\s+)(\S+)";
Regex x2 = new Regex(p2);
string r2 = x2.Replace(t2, "$3$2$1", 1);
Console.WriteLine("Result=[" + r2 + "]");
// Keyword = Value
string t3 = "myval = 3";
string p3 = @"(\w+)\s*=\s*(.*)\s*$";
Match m3 = Regex.Match(t3, p3);
Console.WriteLine("Group1=[" + m3.Groups[1] + "]");
Console.WriteLine("Group2=[" + m3.Groups[2] + "]");
// Line of at least 80 chars
string t4 = "********************"
+ "******************************"
+ "******************************";
string p4 = ".{80,}";
Match m4 = Regex.Match(t4, p4);
Console.WriteLine("Line >= 80 chars=[" + m4.Success + "]");
// MM/DD/YY HH:MM:SS
string t5 = "01/01/01 16:10:01";
string p5 = @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
Match m5 = Regex.Match(t5, p5);
Console.WriteLine("M5=" + m5);
for (int i = 1; i <= 6; i++)
Console.WriteLine("Group" + i + "=[" + m5.Groups[i] + "]");
// Changing directories (for Windows)
string t6 = @"C:\Documents and Settings\user1\Desktop\";
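// backslashes are escaped in the pattern but are literal in the replacement
string r6 = Regex.Replace(t6, @"\\user1\\", @"\user2\");
Console.WriteLine("r6=" + r6);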
// expanding (%nn) hex escapes
string t7 = "%41";
string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
string r7 = Regex.Replace(t7, p7, HexConvert);
Console.WriteLine("R7=" + r7);
// deleting C comments (imperfectly)
string t8 = @"
/*
* this is an old cstyle comment block
*/
foo();
";
string p8 = @"
/\* # match the opening delimiter
.*? # match a minimal number of characters
\*/ # match the closing delimiter
";
string r8 = Regex.Replace(t8, p8, "",
RegexOptions.IgnorePatternWhitespace
| RegexOptions.Singleline);
Console.WriteLine("r8="+r8);
// Removing leading and trailing whitespace
string t9a = " leading";
string p9a = @"^\s+";
string r9a = Regex.Replace(t9a, p9a, "");
Console.WriteLine("r9a=" + r9a);
string t9b = "trailing ";
string p9b = @"\s+$";
string r9b = Regex.Replace(t9b, p9b, "");
Console.WriteLine("r9b=" + r9b);
// turning \ followed by n into a real newline
string t10 = @"\ntest\n";
string r10 = Regex.Replace(t10, @"\\n", "\n");
Console.WriteLine("r10=" + r10);
// IP address
string t11 = "55.54.53.52";
string p11 = "^" +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])" +
"$";
Match m11 = Regex.Match(t11, p11);
Console.WriteLine("M11=" + m11);
Console.WriteLine("Group1=" + m11.Groups[1]);
Console.WriteLine("Group2=" + m11.Groups[2]);
Console.WriteLine("Group3=" + m11.Groups[3]);
Console.WriteLine("Group4=" + m11.Groups[4]);
// removing leading path from filename
string t12 = @"c:\file.txt";
string p12 = @"^.*\\";
string r12 = Regex.Replace(t12, p12, "");
Console.WriteLine("r12=" + r12);
// joining lines in multiline strings
string t13 = @"this is
a split line";
string p13 = @"\s*\r?\n\s*";
string r13 = Regex.Replace(t13, p13, " ");
Console.WriteLine("r13=" + r13);
// extracting all numbers from a string
string t14 = @"
test 1
test 2.3
test 47
";
string p14 = @"(\d+\.?\d*|\.\d+)";
MatchCollection mc14 = Regex.Matches(t14, p14);
foreach (Match m in mc14)
Console.WriteLine("Match=" + m);
// finding all caps words
string t15 = "This IS a Test OF ALL Caps";
string p15 = @"(\b[^\Wa-z0-9_]+\b)";
MatchCollection mc15 = Regex.Matches(t15, p15);
foreach (Match m in mc15)
Console.WriteLine("Match=" + m);
// find all lowercase words
string t16 = "This is A Test of lowercase";
string p16 = @"(\b[^\WA-Z0-9_]+\b)";
MatchCollection mc16 = Regex.Matches(t16, p16);
foreach (Match m in mc16)
Console.WriteLine("Match=" + m);
// find all initial caps
string t17 = "This is A Test of Initial Caps";
string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
MatchCollection mc17 = Regex.Matches(t17, p17);
foreach (Match m in mc17)
Console.WriteLine("Match=" + m);
// find links in simple html
string t18 = @"
<html>
<a href=""first.htm"">first tag text</a>
<a href=""next.htm"">next tag text</a>
</html>
";
string p18 = @"<A[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>";
MatchCollection mc18 = Regex.Matches(t18, p18,
RegexOptions.Singleline
| RegexOptions.IgnoreCase);
foreach (Match m in mc18)
{
Console.WriteLine("Match=" + m);
Console.WriteLine("Group1=" + m.Groups[1]);
}
// finding middle initial
string t19 = "Hanley A. Strappman";
string p19 = @"^\S+\s+(\S)\S*\s+\S";
Match m19 = Regex.Match(t19, p19);
Console.WriteLine("Initial=" + m19.Groups[1]);
// changing inch marks to quotes
string t20 = @"2' 2"" ";
string p20 = "\"([^\"]*)";
string r20 = Regex.Replace(t20, p20, "``$1''");
Console.WriteLine("Result=" + r20);
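The hex-escape sample above passes a method named HexConvert to Regex.Replace but does not show its body. The following is one plausible sketch of such a method, not necessarily the original: it assumes HexConvert is a MatchEvaluator-compatible method that turns the two captured hex digits into the corresponding character. The class name and the input string are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

class HexEscapeDemo   // hypothetical name, for illustration only
{
    // Assumed implementation: parse the two hex digits captured in
    // group 1 and return the character with that code.
    static string HexConvert(Match m)
    {
        int code = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber);
        return ((char)code).ToString();
    }

    static void Main()
    {
        string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
        // The evaluator is called once per match, and its return
        // value replaces the matched text.
        string r7 = Regex.Replace("%41%42c", p7,
            new MatchEvaluator(HexConvert));
        Console.WriteLine("R7=" + r7);   // R7=ABc
    }
}
```

Wrapping the method in an explicit new MatchEvaluator(...) keeps the sketch compilable on compilers that do not convert method names to delegates implicitly.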
Interesting Patterns?
If you come across any frequently used RegEx patterns, I encourage you to share them with your fellow pattern builders. In the future, I hope to collect a repository for these patterns, which we will be able to share among all of the languages used in the .NET Framework. After all, C# is but one of the many languages you can use, and the RegEx classes can be used from them all.

