Regular Expressions

String Method Alternatives | System.Text.RegularExpressions | Static vs Instance Regular Expressions | Regex Class | MatchEvaluator Delegate | .NET Regular Expression Grammar | Webpage Screen Scraper using Regular Expressions

"Regular expressions provide a powerful, flexible, and efficient method for processing text. The concepts behind regular expressions have evolved over the past 70+ years and there are currently several flavors of regular expressions which exist in programming languages and software tools. The .NET languages implement a flavor of the Perl Compatible Regular Expressions (PCRE) de facto standard."

A regular expression is a sequence of characters that forms a search pattern. Regular expressions use text pattern matching for tasks such as:

  1. Validating user input, e.g. validating email addresses, web addresses, and password formats.
  2. Editing, replacing, or deleting text, e.g. to clean, reformat, or change content.
  3. Extracting substrings from a text stream, e.g. extracting specific sections of an HTML page.

In the 1940s Warren McCulloch and Walter Pitts developed logical expressions to model the neurological activity of the brain. Stephen Kleene built on these expressions with his work on regular sets and coined the term regular expressions in the 1950s. Variations of Kleene's expressions were used by Bell Labs in the 1970s within several components of the Unix operating system. In the 1980s more complicated regular expressions arose in Perl.

There are currently several flavors of regular expressions. Several languages implement regular expressions similar to those found in Perl 5, commonly referred to as Perl-Compatible Regular Expressions (PCRE). Languages with Perl-style regular expressions include the .NET languages, Java, JavaScript, Python, and Ruby. However, even these Perl-derived flavors differ slightly from one another.

While PCRE has become a de facto standard for regular expressions, a more formal standard exists as part of the POSIX specification for Unix-like operating system environments. The IEEE POSIX standard specifies two levels of compliance: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). Many flavors of regular expressions are extensions of ERE, which by today's standards is rather bare bones. WinGrep is an established grep tool for Windows which supports a limited flavor of POSIX ERE.

Several regular expression libraries and references help sort through the many implementations of regular expressions, such as:

  1. Regular-Expression.info - regular expression reference with many examples.
  2. RegExLib.com - regular expression library with online tester.
  3. Regular Expression Language - Quick Reference - .NET Framework 4.5 reference.
  4. Regular Expression Examples - .NET Framework 4.5 examples.
  5. Regex Pal - free online utility for building and testing regular expressions.
  6. Expresso - free utility for building and testing regular expressions.
  7. RegexBuddy - utility for building and testing regular expressions (currently $39.95).
  8. XRegExp - JavaScript regular expression library.
  9. PCRE - Perl Compatible Regular Expressions, an open source regex library.
  10. Ultrapico - Regular Expression Resources.

Note: While the .NET regex engine is based on the PCRE model, the Regex class offers an option (RegexOptions.ECMAScript) that makes it behave like the ECMAScript (i.e. JavaScript) regex engine. This option allows JavaScript regular expressions to run unchanged. More information about how these behaviors differ is given under Regular Expression Options.
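One observable difference between the two behaviors is the meaning of \d. The sketch below is a minimal illustration, not an exhaustive comparison; the class and method names (EcmaScriptDemo, IsAllDigits) are made up for this example. By default \d matches any Unicode decimal digit, while under RegexOptions.ECMAScript it is restricted to 0-9.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EcmaScriptDemo
{
    // Returns true when the whole input consists of characters matched by \d.
    public static bool IsAllDigits(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, @"^\d+$", options);
    }

    public static void Main()
    {
        string arabicIndic = "\u0661\u0662\u0663"; // Arabic-Indic digits 1, 2, 3

        Console.WriteLine(IsAllDigits("123", RegexOptions.None));             // True
        Console.WriteLine(IsAllDigits(arabicIndic, RegexOptions.None));       // True
        Console.WriteLine(IsAllDigits(arabicIndic, RegexOptions.ECMAScript)); // False
    }
}
```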

[Screenshot: Expresso Regular Expression Utility]

[Screenshot: RegexBuddy]



String Method Alternatives

The String class contains several simple search and comparison methods which provide basic pattern matching capabilities. These methods can be a convenient alternative to regular expressions when searching for a particular literal string. However, regular expressions offer more powerful pattern matching and text manipulation capabilities. Some of the simple String search and comparison methods include:

  1. String.Contains()

                // Prints: Attention Queen
                string s1 = "The troll queen must have attention.";
                Console.WriteLine(s1.Contains("attention") ? "Attention Queen" : "Drama Queen");

  2. String.EndsWith
  3. String.StartsWith
  4. String.IndexOf

                // Prints: Mike = 0, solo = 23
                string s1 = "Mike's minions mimic misologists.";
                Console.WriteLine("Mike = {0}, solo = {1}",
                        s1.ToUpper().IndexOf("MIKE"),
                        s1.ToUpper().IndexOf("SOLO"));

  5. String.IndexOfAny
  6. String.LastIndexOf
  7. String.LastIndexOfAny

String also contains the Split() method, which is similar to, but less powerful than, Regex.Split().
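The difference shows up as soon as the delimiter is a pattern rather than a fixed character. The following is a minimal sketch with made-up names (SplitDemo, SplitWithString, SplitWithRegex) and a made-up input string: String.Split can only cut at the listed characters, while Regex.Split can also absorb the white space around each delimiter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Splits on the literal characters ',' and ';' only.
    public static string[] SplitWithString(string input)
    {
        return input.Split(new[] { ',', ';' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Splits on ',' or ';' together with any surrounding white space.
    public static string[] SplitWithRegex(string input)
    {
        return Regex.Split(input, @"\s*[,;]\s*");
    }

    public static void Main()
    {
        string input = "chalk ,  it;up , to";

        Console.WriteLine(string.Join("|", SplitWithString(input))); // chalk |  it|up | to
        Console.WriteLine(string.Join("|", SplitWithRegex(input)));  // chalk|it|up|to
    }
}
```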





System.Text.RegularExpressions

The System.Text.RegularExpressions namespace contains classes that provide access to the .NET regular expression engine. Some of these commonly used classes include:

  1. Regex - represents a regular expression.
  2. Capture - represents the result of a single successful subexpression capture.
  3. CaptureCollection - the set of captures made by a single capturing group.
  4. Group - represents the results from a single capturing group.
  5. GroupCollection - returns the set of captured groups in a single match.
  6. Match - represents the results from a single regular expression match.
  7. MatchCollection - the set of successful matches found by repeatedly applying a regular expression to the input string.

The RegexOptions enumeration contains a set of regular expression options:

  1. Compiled - Specifies that the regular expression is compiled to MSIL rather than interpreted. This yields faster execution but increases startup time.
  2. IgnoreCase - Specifies case-insensitive matching.
  3. Multiline - Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
  4. RightToLeft - Specifies that the search will be from right to left instead of from left to right.

Note: Regular expression options can also be specified inline in the regular expression as (?flag), where flag is one of:

  1. (?i) turns case-insensitive matching on; (?-i) turns it off.
  2. (?s) turns single-line mode on.
  3. (?m) turns multiline mode on.
  4. (?x) turns free-spacing mode on (white space in the pattern is ignored).
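The inline flags can be tried without touching RegexOptions at all. The sketch below uses a made-up helper (InlineOptionsDemo.CountMatches) and a made-up two-line input to show how the match count changes as flags are added to the pattern itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineOptionsDemo
{
    // Counts the matches of a pattern; all options come from inline flags.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string text = "Mike's minions\nmimic misologists.";

        Console.WriteLine(CountMatches(text, @"mi"));        // 4 (case-sensitive)
        Console.WriteLine(CountMatches(text, @"(?i)mi"));    // 5 (also "Mi" in Mike's)
        Console.WriteLine(CountMatches(text, @"^mi"));       // 0 (string starts with "Mi")
        Console.WriteLine(CountMatches(text, @"(?m)^mi"));   // 1 (start of the second line)
        Console.WriteLine(CountMatches(text, @"(?im)^mi"));  // 2 (start of both lines)

        // Free-spacing mode: pattern white space is ignored and # starts a comment.
        Console.WriteLine(CountMatches(text, @"(?x) \b mi \w*   # a word that starts with mi")); // 3
    }
}
```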

Also included in the System.Text.RegularExpressions namespace is the MatchEvaluator delegate, which represents the method called each time a regular expression match is found during a Replace operation. The delegate provides a way to implement custom verification or manipulation for each match found by a replacement method, such as Regex.Replace(). The delegate method receives the Match and returns the string that Replace() substitutes for the matched text.





Static vs Instance Regular Expressions

Usually the most important factor affecting regular expression performance is the way the Regex engine is used. Typically an interpreted regular expression is used, through either a static method or an instance, unless better performance is required. Below are four different ways the regular expression can be coupled to the Regex engine:

  1. Use a static method that does not require instantiating a regular expression object (e.g. Regex.Match(String, String)).
  2. Instantiate a Regex object and call a pattern-matching method of an interpreted regular expression.
  3. Instantiate a Regex object and call a pattern-matching method of a compiled regular expression.
  4. Create a special purpose Regex object, compile it, and save it to a standalone assembly (Regex.CompileToAssembly(), which is supported on .NET Framework only).

Static regular expression methods are recommended as an alternative to repeatedly instantiating a regular expression object with the same regular expression. The static methods use a cache to improve performance. By default, the 15 most recently used static regular expression patterns are cached. For applications that require a larger number of cached static regular expressions, the size of the cache can be adjusted by setting the Regex.CacheSize property.
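Adjusting the cache is a single property assignment. The sketch below is illustrative only; the helper name (CacheSizeDemo.EnlargeCache) and the value 50 are arbitrary choices for this example, not recommended settings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CacheSizeDemo
{
    // Raises the static-method cache limit if needed and returns the value now in effect.
    public static int EnlargeCache(int newSize)
    {
        if (Regex.CacheSize < newSize)
            Regex.CacheSize = newSize;
        return Regex.CacheSize;
    }

    public static void Main()
    {
        Console.WriteLine("Current cache size: {0}", Regex.CacheSize);
        Console.WriteLine("New cache size:     {0}", EnlargeCache(50));
    }
}
```

An application that cycles through more distinct patterns than the cache holds would typically set this once at startup, before the first static Regex call.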

By default both static and instance regular expressions are interpreted. The Regex engine converts the regular expression to a set of operation codes. When called, the operation codes are converted to MSIL and executed by the JIT compiler. Interpreted regular expressions reduce startup time at the cost of slower execution time. Because of this, they are best used when the regular expression is used in a small number of method calls, or if the exact number of calls to regular expression methods is unknown but is expected to be small.

Regular expressions can be compiled by using the Compiled option. When compiled, the Regex engine converts the regular expression to an intermediary set of operation codes, which it then converts to MSIL. When a method is called, the JIT compiler executes the MSIL. In contrast to interpreted regular expressions, compiled regular expressions increase startup time but execute individual pattern-matching methods faster. This improves performance of a frequently used regular expression.
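One common way to benefit from the Compiled option is to construct the Regex once and reuse it for every call, so the compilation cost is paid a single time. The following is a minimal sketch of that idea; the class name (WordCounter) and the word pattern are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Built once when the class is first used; every call reuses the compiled pattern.
    private static readonly Regex WordPattern =
        new Regex(@"\b\w+\b", RegexOptions.Compiled);

    public static int CountWords(string input)
    {
        return WordPattern.Matches(input).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWords("Cliquey clowns cluck clueless claptrap.")); // 5
    }
}
```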

Below is a comparison between using a static occurrence vs instances of regular expressions for finding a simple match. In this case the static occurrence has better performance than repeatedly instantiating a regular expression object with the same regular expression.

Static vs Instance Regex

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace RegularExpressionExample
{
    class Program
    {
        static void Main()
        {
            const int MAX_LOOP = 10000000;
            long staticCount = 0;
            long instanceCount = 0;
            Match staticMatch;
            Match instanceMatch;

            string s1 = "Mike's minions' mimic misologists.";
            string pattern = @"mi";

            Console.WriteLine("RegEx - Static vs Instance\n");

            // Static Regex: the parsed pattern is served from the static cache
            Stopwatch staticWatch = new Stopwatch();
            staticWatch.Start();
            for (int i = 0; i < MAX_LOOP; i++)
            {
                // Start a fresh search on every iteration
                staticMatch = Regex.Match(s1, pattern, RegexOptions.IgnoreCase);
                while (staticMatch.Success)
                {
                    staticCount++;
                    staticMatch = staticMatch.NextMatch();
                }
            }
            staticWatch.Stop();
            Console.WriteLine("Static   Time: {0}", staticWatch.Elapsed);

            // Instance Regex: a new Regex object is constructed on every iteration
            Stopwatch instanceWatch = new Stopwatch();
            instanceWatch.Start();
            for (int i = 0; i < MAX_LOOP; i++)
            {
                Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);

                instanceMatch = myRegex.Match(s1);
                while (instanceMatch.Success)
                {
                    instanceCount++;
                    instanceMatch = instanceMatch.NextMatch();
                }
            }
            instanceWatch.Stop();
            Console.WriteLine("Instance Time: {0}\n", instanceWatch.Elapsed);

            // Both loops should report the same number of matches
            Console.WriteLine("Matches: static = {0}, instance = {1}", staticCount, instanceCount);
        }
    }
}





Regex Class

The Regex class is at the heart of .NET's regular expression model. Regex is thread safe, so an instance can be created on any thread and shared between threads. Regex contains overloaded constructors that accept either a string or a serialized stream as the data source. Versions of the constructor also accept various options, and in .NET 4.5 there is a new constructor which accepts a time-out value. The Compiled option causes the regular expression to execute more quickly at the cost of increased startup time, as explained in the "Static vs Instance Regular Expressions" section above. The other options for Regex include:

  1. RegexOptions.IgnoreCase - use case-insensitive matching.
  2. RegexOptions.Multiline - use multiline mode, where ^ and $ match the beginning and end of each line (instead of the beginning and end of the input string).
  3. RegexOptions.Singleline - use single-line mode, where the period (.) matches every character (instead of every character except \n).
  4. RegexOptions.ExplicitCapture - do not capture unnamed groups. The only valid captures are explicitly named or numbered groups of the form (?<name> subexpression).
  5. RegexOptions.IgnorePatternWhitespace - exclude unescaped white space from the pattern, and enable comments after a number sign (#).
  6. RegexOptions.ECMAScript - enable ECMAScript-compliant behavior for the expression.
  7. RegexOptions.CultureInvariant - ignore cultural differences in language.
  8. RegexOptions.None - instructs the regex engine to use its default behavior which includes:
    • The pattern is interpreted as a canonical rather than an ECMAScript regular expression.
    • The regular expression pattern is matched in the input string from left to right.
    • Comparisons are case-sensitive.
    • The ^ and $ language elements match the beginning and end of the input string.
    • The . language element matches every character except \n.
    • Any white space in a regular expression pattern is interpreted as a literal space character.
    • The conventions of the current culture are used when comparing the pattern to the input string.
    • Capturing groups in the regular expression pattern are implicit as well as explicit.
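Options are combined with the bitwise OR operator and passed to the constructor, optionally together with a time-out. The sketch below is a minimal illustration under assumed names (RegexOptionsDemo, GroupCount) and a made-up phone-style pattern; it shows how RegexOptions.ExplicitCapture changes the number of groups in a match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Sample pattern for illustration: one unnamed group and one group named "line".
    private const string PhonePattern = @"(\d{3})-(?<line>\d{4})";

    // Returns the number of groups in the match, including group 0 (the whole match).
    public static int GroupCount(RegexOptions options)
    {
        // The third argument is the match time-out accepted by the .NET 4.5 constructor.
        Regex regex = new Regex(PhonePattern, options, TimeSpan.FromSeconds(1));
        return regex.Match("555-1234").Groups.Count;
    }

    public static void Main()
    {
        Console.WriteLine(GroupCount(RegexOptions.None));            // 3
        Console.WriteLine(GroupCount(RegexOptions.ExplicitCapture)); // 2
        Console.WriteLine(GroupCount(RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase)); // 2
    }
}
```

If a match runs longer than the time-out, the engine throws a RegexMatchTimeoutException, which callers that accept untrusted patterns or input may want to catch.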

Various methods exist in the Regex class which control the operation of the regex engine. These methods include:

  1. IsMatch - indicates whether the specified regular expression finds a match in the specified input string.
  2. Match - searches the specified input string for the first occurrence of the regular expression specified in the Regex constructor.
  3. Matches - searches the specified input string for all occurrences of the regular expression and returns them as a MatchCollection.
  4. Replace - in a specified input string, replaces all strings that match a regular expression pattern with a specified replacement string.
  5. Split - splits an input string into an array of substrings at the positions defined by a regular expression pattern specified in the Regex constructor.
  6. CompileToAssembly - compiles one or more specified Regex objects to a named assembly.
  7. GetGroupNames - returns an array of capturing group names for the regular expression.
  8. GetGroupNumbers - returns an array of capturing group numbers that correspond to group names in an array.
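The two group-inspection methods are easiest to understand side by side. The following sketch uses a made-up pattern with two named groups and a made-up class name (GroupNamesDemo); group 0 always stands for the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupNamesDemo
{
    // Sample pattern for illustration, with two named capturing groups.
    public static readonly Regex Phone = new Regex(@"(?<area>\d{3})-(?<line>\d{4})");

    public static string Names()
    {
        return string.Join(",", Phone.GetGroupNames());
    }

    public static string Numbers()
    {
        return string.Join(",", Phone.GetGroupNumbers());
    }

    public static void Main()
    {
        Console.WriteLine(Names());   // 0,area,line
        Console.WriteLine(Numbers()); // 0,1,2

        Match match = Phone.Match("Call 555-1234 today");
        Console.WriteLine(match.Groups["area"].Value); // 555
        Console.WriteLine(match.Groups["line"].Value); // 1234
    }
}
```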

The following program uses the Regex class to: test for a match (IsMatch), find the first match (Match), find all matches (Matches), replace matches with text (Replace) and divide text into substrings based on a delimiter (Split).

Regex Examples

using System;
using System.Text.RegularExpressions;

namespace RegexExample
{
    class Program
    {
        static void Main()
        {
            string s1 = "Cliquey clowns cluck clueless claptrap.";
            string pattern = @"\w*cl\w*";

            Console.WriteLine("-- Original string --");
            Console.WriteLine("{0}\n",s1);

            // Test for match
            Console.WriteLine("-- Text search --");
            Console.WriteLine(Regex.IsMatch(s1, pattern) ? "Found it\n" : "Not Found\n");

            // Find first match
            Console.WriteLine("-- First match --");
            Console.WriteLine(Regex.Match(s1, pattern));
            Console.WriteLine();

            // Find all matches
            Console.WriteLine("-- List of all matches --");
            MatchCollection myMatches = Regex.Matches(s1, pattern);
            foreach (Match match in myMatches)
                Console.WriteLine(match);
            Console.WriteLine();

            // Replace Text
            Console.WriteLine("-- Replaced text --");
            string replacement = "****";
            string pattern2 = @"\w*cla\w*";
            Console.WriteLine(Regex.Replace(s1, pattern2, replacement));
            Console.WriteLine();

            // Split text
            Console.WriteLine("-- Split text --");
            string s2 = "Chalk,it,up,to,chatty,chums";
            string delim = ",";
            string[] substrings = Regex.Split(s2, delim);
            foreach (string myString in substrings)
                Console.WriteLine(myString);
            Console.WriteLine();           
        }
    }
}





MatchEvaluator Delegate

The MatchEvaluator delegate is called each time a regular expression match is found during a Replace operation. This allows custom verification or manipulation of each match found by a replacement method. The following program replaces each non-ASCII character, here the copyright symbol, with its numeric HTML character reference.

Escape Unicode Characters

using System.Text.RegularExpressions;

namespace MatchEvaluatorDelegate
{
    class Program
    {
        static void Main()
        {
            string htmlString = "© 2014";

            // Escape Unicode Characters
            string escapedString = Regex.Replace(htmlString , @"[\u0080-\uFFFF]",
                s => @"&#" + ((int) s.Value[0]).ToString() + ";");
            System.Console.WriteLine(escapedString); // Prints &#169; 2014
        }
    }
}
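The delegate does not have to be a lambda. As a minimal sketch under assumed names (MatchEvaluatorDemo, Capitalize, CapitalizeWords), the version below passes a named method as the MatchEvaluator and uses it to capitalize every word that starts with a lowercase letter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchEvaluatorDemo
{
    // Called once per match; the returned string replaces the matched text.
    public static string Capitalize(Match match)
    {
        string word = match.Value;
        return char.ToUpperInvariant(word[0]) + word.Substring(1);
    }

    public static string CapitalizeWords(string input)
    {
        // Matches each word that begins with a lowercase ASCII letter.
        return Regex.Replace(input, @"\b[a-z]\w*", new MatchEvaluator(Capitalize));
    }

    public static void Main()
    {
        Console.WriteLine(CapitalizeWords("cliquey clowns cluck")); // Cliquey Clowns Cluck
    }
}
```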





.NET Regular Expression Grammar

The Visual Studio article Regular Expressions (C++) discusses the grammars of the various regular expression engines. The article contains a features comparison chart which includes the following grammars: BRE, ERE, ECMA, grep, egrep, and awk.

  1. Elements - are the characters, operators, and constructs that are used to define regular expressions. They can be any of the following:
    • Ordinary character. It will match the same character(s) in the target sequence.

                  Match match = Regex.Match("It is cold outside.", @"cold");
                  Console.WriteLine(match); // Prints: cold

    • Wildcard character, such as a period '.', which matches any character except a newline.

                  string s1 = "It is cold outside. It is better inside than outside.";
                  MatchCollection myMatches = Regex.Matches(s1, @"I.");
                  foreach (Match match in myMatches)
                      Console.Write("{0} ", match); // Prints: It It

    • An anchor - Anchor '^' matches the beginning of the input and anchor '$' matches the end.

                  string s1 = "It is cold outside. It is better inside than outside.";
                  MatchCollection myMatches = Regex.Matches(s1, @"^I.");
                  foreach (Match match in myMatches)
                      Console.Write("{0} ", match); // Prints: It

    • Bracket Expression - In the form "[expr]", matches any single character listed inside the brackets; "[^expr]" matches any single character that is not listed inside the brackets. Any of the following may be placed inside the bracket expression:
      • Individual Character

                    string s1 = "It is cold outside. It is better inside than outside.";
                    MatchCollection myMatches = Regex.Matches(s1, @"[^aeiou]");
                    foreach (Match match in myMatches)
                        Console.Write("{0}", match);
                    // Prints: It s cld tsd. It s bttr nsd thn tsd.

      • Character Range

                    string s1 = "It is cold outside. It is better inside than outside.";
                    MatchCollection myMatches = Regex.Matches(s1, @"[^a-kA-K]");
                    foreach (Match match in myMatches)
                        Console.Write("{0}", match);
                    // Prints: t s ol outs. t s ttr ns tn outs.

  2. Repetition - In the form "{c}", where c is the exact repetition count, "{min,max}", or "{min,}"; matches the preceding element the specified number of times.

                string s1 = "tt ttttt";
                MatchCollection myMatches = Regex.Matches(s1, @"t{2}");
                foreach (Match match in myMatches)
                    Console.Write("{0}", match);
                // Prints: tttttt  (6 t's)
              
                myMatches = Regex.Matches(s1, @"t{2,}");
                foreach (Match match in myMatches)
                    Console.Write("{0}", match);
                // Prints: ttttttt (7 t's)
               
                myMatches = Regex.Matches(s1, @"t{3,5}");
                foreach (Match match in myMatches)
                    Console.Write("{0}", match);
                // Prints: ttttt (5 t's)

  3. Alternation - the pipe character "|" indicates alternation (logical OR) inside a regular expression.

                string s1 = "acdcababbcde";
                MatchCollection myMatches = Regex.Matches(s1, @"ab|cd");           
                foreach (Match match in myMatches)
                    Console.Write("{0}", match); // Prints: cdababcd

  4. Metacharacters - are characters which have a special meaning in a regular expression, such as the "+" metacharacter, which matches the preceding element one or more times. To use a metacharacter as a literal character, it must be escaped with a backslash. For example: 1 \+ 1 = 2

    Metacharacters

    Metacharacter   Description
    .               Period
    ^               Caret
    $               Dollar sign
    \               Backslash
    |               Pipe, or vertical bar
    ?               Question mark
    *               Asterisk
    +               Plus sign
    [               Opening square bracket
    ( )             Opening and closing parentheses
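Escaping can be done by hand or delegated to Regex.Escape(), which escapes the metacharacters in a string so that it matches literally. The sketch below is a small illustration; the class and method names (EscapeDemo, ContainsLiteral) are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Treats "literal" as plain text by escaping it before matching.
    public static bool ContainsLiteral(string input, string literal)
    {
        return Regex.IsMatch(input, Regex.Escape(literal));
    }

    public static void Main()
    {
        string input = "1 + 1 = 2";

        // Unescaped, " +" means "one or more spaces", so the plus sign never matches.
        Console.WriteLine(Regex.IsMatch(input, @"1 + 1"));   // False

        // Escaped by hand with a backslash.
        Console.WriteLine(Regex.IsMatch(input, @"1 \+ 1"));  // True

        // Escaped by Regex.Escape().
        Console.WriteLine(ContainsLiteral(input, "1 + 1"));  // True
    }
}
```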


  5. Quantifiers - specify how many instances of the previous element must be present in the input string for a match to occur.

    Quantifiers

    Quantifier  Description                                                                     Pattern    Matches
    *           Matches the previous element zero or more times.                                \d*\.\d    ".0", "19.9", "219.9"
    +           Matches the previous element one or more times.                                 be+        "bee" in "been", "be" in "bent"
    ?           Matches the previous element zero or one time.                                  rai?n      "ran", "rain"
    *?          Matches the previous element zero or more times, but as few times as possible.  \d*?\.\d   ".0", "19.9", "219.9"
    +?          Matches the previous element one or more times, but as few times as possible.   be+?       "be" in "been", "be" in "bent"
    ??          Matches the previous element zero or one time, but as few times as possible.    rai??n     "ran", "rain"
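The difference between a greedy quantifier and its lazy counterpart is clearest when both are run against the same input. This is a minimal sketch with a made-up HTML fragment and made-up names (GreedyLazyDemo, FirstMatch).

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyLazyDemo
{
    // Returns the text of the first match of the pattern.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string html = "<b>bold</b> and <i>italic</i>";

        // Greedy: .+ grabs as much as possible, up to the last '>'.
        Console.WriteLine(FirstMatch(html, @"<.+>"));  // <b>bold</b> and <i>italic</i>

        // Lazy: .+? stops at the first '>' that completes a match.
        Console.WriteLine(FirstMatch(html, @"<.+?>")); // <b>

        Console.WriteLine(Regex.Matches(html, @"<.+?>").Count); // 4
    }
}
```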


  6. Character Class - defines a set of characters for matching.

    Character Classes

    Class     Description                                                                       Pattern  Matches
    \d        Matches any decimal digit.                                                        \d       "1", "2", "3" in "a1b2c3.?!"
    \D        Matches any character other than a decimal digit.                                 \D       "a", "b", "c", ".", "?", "!" in "a1b2c3.?!"
    \w        Matches any word character.                                                       \w       "a", "1", "b", "2", "c", "3" in "a1b2c3.?!"
    \W        Matches any non-word character.                                                   \W       ".", "?", "!" in "a1b2c3.?!"
    \s        Matches any white-space character.                                                \s       each of the three spaces in "This is a test"
    \S        Matches any non-white-space character.                                            \S       every letter in "This is a test" (the spaces are skipped)
    \p{name}  Matches any single character in the Unicode general category or named block.      \p{Sm}   "+", "=" in "2 + 2 = 4"
    \P{name}  Matches any single character not in the Unicode general category or named block.  \P{Sm}   every character except "+" and "=" in "2 + 2 = 4"

    For the available category and block names, see Unicode Classes.
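The rows of the table can be checked directly by concatenating every single-character match. The sketch below assumes a made-up helper (CharClassDemo.Collect) that does exactly that.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Concatenates every match of the pattern into one string.
    public static string Collect(string input, string pattern)
    {
        StringBuilder result = new StringBuilder();
        foreach (Match match in Regex.Matches(input, pattern))
            result.Append(match.Value);
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Collect("a1b2c3.?!", @"\d"));          // 123
        Console.WriteLine(Collect("a1b2c3.?!", @"\D"));          // abc.?!
        Console.WriteLine(Collect("a1b2c3.?!", @"\w"));          // a1b2c3
        Console.WriteLine(Collect("a1b2c3.?!", @"\W"));          // .?!
        Console.WriteLine(Collect("This is a test", @"\S"));     // Thisisatest
        Console.WriteLine(Regex.Matches("This is a test", @"\s").Count); // 3
        Console.WriteLine(Collect("2 + 2 = 4", @"\p{Sm}"));      // +=
    }
}
```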


  7. Anchors - enable the regular expression to be fixed to a position in the input string.

    Anchors

    Anchor  Description and example
    ^       Matches the position at the beginning of the input string. In multiline mode (the m flag), ^ also matches the position following each \n.
            String: "The first \nthe second"    Pattern: @"(?im)^t"    Matches: "T", "t"
    $       Matches the position at the end of the input string (or before a final \n). In multiline mode (the m flag), $ also matches the position preceding each \n.
            String: "The first and the second"  Pattern: @"d$"         Matches: "d"
    \b      Matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string).
            String: "The first and the second"  Pattern: @"\bthe\b"    Matches: "the"
    \B      Matches a non-word boundary.
            String: "The first and the second"  Pattern: @"\Bh\B"      Matches: "h", "h"
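The effect of multiline mode on ^ and $ can be seen by running the same anchored patterns with and without RegexOptions.Multiline. This is a minimal sketch with a made-up two-line input and made-up names (AnchorDemo, Find).

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Returns the text of every match as an array.
    public static string[] Find(string input, string pattern, RegexOptions options)
    {
        MatchCollection matches = Regex.Matches(input, pattern, options);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            values[i] = matches[i].Value;
        return values;
    }

    public static void Main()
    {
        string text = "first line\nsecond line";

        Console.WriteLine(string.Join(",", Find(text, @"^\w+", RegexOptions.None)));      // first
        Console.WriteLine(string.Join(",", Find(text, @"^\w+", RegexOptions.Multiline))); // first,second
        Console.WriteLine(string.Join(",", Find(text, @"\w+$", RegexOptions.None)));      // line
        Console.WriteLine(string.Join(",", Find(text, @"\w+$", RegexOptions.Multiline))); // line,line
    }
}
```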


  8. Matched SubExpression - a grouping construct that uses parentheses to capture a regular expression pattern. Captures that use parentheses are numbered automatically from left to right based on the order of the opening parentheses in the regular expression, starting from one. The capture that is numbered zero is the text matched by the entire regular expression pattern.

    Capture Groups to Report Duplicate Words

                string pattern = @"(\w+)\s(\1)";
                string input = "He said that that was the the correct answer.";
                foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
                    Console.WriteLine("Duplicate '{0}' found at positions {1} and {2}.",
                                      match.Groups[1].Value, match.Groups[1].Index, match.Groups[2].Index);

                // Prints:
                //       Duplicate 'that' found at positions 8 and 13.
                //       Duplicate 'the' found at positions 22 and 26.

  9. Zero-width Positive Lookahead Assertion - is a grouping construct defined as:



    (?= subexpression )



    where subexpression is any regular expression pattern. For a match to be successful, the input string must match the regular expression pattern in subexpression, although the matched substring is not included in the match result.

    Zero-width Positive Lookahead Assertion

                //zero-width positive lookahead assertion to match
                //the word that precedes the word "are"
                string pattern = @"\b\w+(?=\sare\b)";
                string[] inputs = {
                   "People are funny.",
                   "Dogs are usually friendly."};

                foreach (string input in inputs)
                {
                    Match match = Regex.Match(input, pattern);
                    if (match.Success)
                        Console.WriteLine(match.Value);
                }
                // Prints:
                //       People
                //       Dogs

  10. Zero-width Negative Lookahead Assertion - is a grouping construct defined as:



    (?! subexpression )

    where subexpression is any regular expression pattern. For the match to be successful, the input string must not match the regular expression pattern in subexpression, although the matched string is not included in the match result.

    Zero-width Negative Lookahead Assertion

                // zero-width negative lookahead assertion at the beginning of the regular
                // expression to match words that do not begin with "new".
                string pattern = @"\b(?!new)\w+\b";
                string input = "newton Newman Wendy Weller";
                foreach (Match match in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
                    Console.WriteLine(match.Value); // Prints: Wendy Weller

  11. Zero-width Positive Lookbehind Assertion - is a grouping construct defined as:



    (?<= subexpression )



    where subexpression is any regular expression pattern. For a match to be successful, subexpression must occur at the input string to the left of the current position, although subexpression is not included in the match result.

    For example, the following code matches the last two digits of each year in the twentieth century (that is, it requires that the digits "19" precede the matched string).

    Zero-width Positive Lookbehind Assertion

                // Zero-width Positive Lookbehind Assertion that matches the
                // last two digits in the 1900's
                string input = "1957 1958 1865 2112";
                string pattern = @"(?<=\b19)\d{2}\b";

                foreach (Match match in Regex.Matches(input, pattern))
                    Console.WriteLine(match.Value); // Prints 57 58





Webpage Screen Scraper using Regular Expressions



"This program uses regular expressions to extract the title, meta, and anchor tags from the Webpage specified by a URL passed in as the first program argument. Extracted tags and a summary of the results are written to a file."

The WebClient class is used to download the Webpage specified as the first argument on the command tail. Regular expressions are used to extract the title tag, all the meta tags, and all the anchor tags (links). The results are timestamped and appended to the file specified as the second program argument, or to the file "msdefault.txt" if no file is specified. Additionally, a count of each type of extracted HTML tag is written to the file along with the program's elapsed time. The program uses the following C# features:

  1. Regex - Regular expression class used to validate the URI and extract HTML tags from the Webpage.
  2. WebClient - Used to download the Webpage into a string for processing.
  3. FileStream - A backing store stream for the output file.
  4. TextWriter - A stream adapter for writing text.
  5. Stopwatch - A diagnostics class used to measure elapsed time.

Screen Scraper to Extract HTML Tags from Webpages

using System;
using System.Diagnostics;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace ManualSpider
{
    /*********************************************************************************
     * Name: ManualSpider                                                            *
     *                                                                               *
     * Syntax: ManualSpider URI [OutputFileName]                                     *
     * Example 1: ManualSpider www.kcshadow.net  (Output File = msdefault.txt)       *
     * Example 2: ManualSpider www.kcshadow.net kcshadow.txt                         *
     *                                                                               *
     * Description: A screen scraper that parses the webpage specified as the first  *
     *              parameter on the command tail.  Extracts the title tag, all      *
     *              meta tags and all anchor tags and writes them to an output file. *
     *              After extraction a summary of tag counts, program run time, and  *
     *              extraction data is also printed to the output file.              *
     *                                                                               *
     *********************************************************************************/
    class Program
    {
        // Validate URI passed in as first argument
        static void ProcessURI(string[] args)
        {
            // Check for URI on command tail
            if (args.Length == 0)
            {
                Console.WriteLine("\nSyntax is: ManualSpider URI OutputFileName");
                Console.WriteLine("Example:   ManualSpider www.kcshadow.net myOutputFile.txt\n");
                Environment.Exit(-1);
            }

            // Check for valid URI
            if (!Regex.Match(args[0], @"([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?", RegexOptions.IgnoreCase).Success)
            {
                Console.WriteLine("Invalid URI: {0}", args[0]);
                Environment.Exit(-1);
            }
        }


        static void Main(string[] args)
        {           
            DateTime currentTimestamp = DateTime.Now;
            string outputfileName;
            string pageHTML = default(string);
            int titleTagCount = 0;
            int metaTagCount = 0;
            int anchorTagCount = 0;

            // Start Stopwatch
            Stopwatch stopWatch1 = new Stopwatch();
            stopWatch1.Start();

            // Validate URI
            ProcessURI(args);

            Console.WriteLine("Program is running ...");

            // Check for output file name on command tail
            if (args.Length > 1)
                outputfileName = args[1];
            else
                outputfileName = "msdefault.txt";

            // Compiled Regular Expressions (\b stops <a from also matching tags such as <abbr> or <article>)
            Regex metaTags = new Regex(@"<meta\b(.*?)>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            Regex anchorTags = new Regex(@"<a\b(.*?)>", RegexOptions.IgnoreCase | RegexOptions.Compiled);

            // Get Webpage entered on command tail
            using (WebClient wc = new WebClient())
            {
                try
                {
                    Byte[] pageData = wc.DownloadData(@"http://" + args[0]);
                    pageHTML = Encoding.ASCII.GetString(pageData);
                }
                catch (WebException e)
                {
                    Console.WriteLine("WebClient Exception: {0}", e.Message);
                    System.Environment.Exit(-1);
                }
            }

            // Write results to text file
            using (FileStream fs = new FileStream(outputfileName, FileMode.Append, FileAccess.Write, FileShare.None))
            using (TextWriter writer = new StreamWriter(fs))
            {
                writer.WriteLine("\n----------------------- Start: {0} ------------------------------", args[0]);

                foreach (Match m in Regex.Matches(pageHTML, @"<title>(.*?)</title>", RegexOptions.IgnoreCase))
                {
                    writer.WriteLine(m.Value);
                    titleTagCount++;
                }

                writer.WriteLine("\n------------------------ Meta Tags -----------------------------------");

                foreach (Match m in metaTags.Matches(pageHTML))
                {
                    writer.WriteLine(m.Value);
                    metaTagCount++;
                }

                writer.WriteLine("\n----------------------- Anchor Tags ----------------------------------");
                foreach (Match m in anchorTags.Matches(pageHTML))
                {
                    writer.WriteLine(m.Value);
                    anchorTagCount++;
                }

                writer.WriteLine("\n--------------------------- Summary ----------------------------------");
                writer.WriteLine("Title  Tag Count: {0}", titleTagCount);
                writer.WriteLine("Meta   Tag Count: {0}", metaTagCount);
                writer.WriteLine("Anchor Tag Count: {0}", anchorTagCount);
                // Format and display run time.
                stopWatch1.Stop();
                TimeSpan ts = stopWatch1.Elapsed;
                string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:000}",
                    ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds);
                writer.WriteLine("Program Run Time: {0}", elapsedTime);
                writer.WriteLine("Program Run Date: {0}", currentTimestamp);
                writer.WriteLine("\n------------------------ End: {0} ---------------------------------", args[0]);
            }

            // Write finished message
            Console.WriteLine("\nResults saved in: {0}\n", outputfileName);
        }
    }
}
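The program above writes out whole tags. When only the link targets are of interest, a capturing group can pull the href value out of each anchor tag. The following is a rough sketch under stated assumptions: the names (HrefExtractor, Extract) and the sample HTML are made up, attribute values are assumed to be quoted, and a regular expression of this kind will not handle every form of real-world HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // <a\b       : an anchor tag, but not <abbr>, <article>, ...
    // [^>]*?     : any other attributes inside the same tag
    // href="..." : the quoted value, captured in group 1
    private static readonly Regex Href = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    public static List<string> Extract(string html)
    {
        List<string> links = new List<string>();
        foreach (Match match in Href.Matches(html))
            links.Add(match.Groups[1].Value);
        return links;
    }

    public static void Main()
    {
        string html = "<a href=\"http://example.com\">x</a> "
                    + "<abbr title='t'>y</abbr> "
                    + "<A class='c' HREF='/about'>z</A>";

        foreach (string link in Extract(html))
            Console.WriteLine(link);
        // Prints:
        //       http://example.com
        //       /about
    }
}
```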



