.NET Regex Tips

A common problem with regex is that patterns get complicated quickly. A newcomer trying to understand a pattern can easily get lost in the complexity of the syntax. Just as in any programming language, it's important to format and comment your code. A rarely used feature of regex allows for exactly this.

Let's say you want to write an app that converts a dollar amount into words. We can use regex to locate the parts and replace them with the English equivalent words, up to the hundred thousands.

Currency Matching Pattern

We can use the following pattern to match any amount up to $999,999.99. I used this nice little online tool to help me design it; it gives you some handy tool tips. http://gskinner.com/RegExr/

\$([0-9]{1,3})?,?([0-9]{1,3})?\.?([0-9]{2})?

Now this isn’t perfect; it’s making the assumption that the dollar amount will be in a valid format. Let’s break it down:

  • First we match the $ character with \$. You need to escape it because $ is a special character in regex that matches the end of a string.
  • Then we match the thousands set of digits.
    ([0-9]{1,3})?

    • We want to match a character between 0 and 9, repeated 1 to 3 times.
    • This should be optional, so we create a capture group using parentheses and make the group optional with the ? character.
  • Then we have the comma between the thousands and hundreds, which should also be optional.
  • Next is the hundreds group, which uses the same sub-pattern as the thousands group. There is a bit of repetition here. A backreference will not remove it, because a backreference such as \1 matches the text the earlier group captured rather than reusing its sub-pattern.
  • We then have the decimal point, which again is optional and needs to be escaped because the “.” character is also a special character in regex.
  • Last we have the optional cents group.
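On the repetition in the digit groups: one way to avoid typing the sub-pattern twice is to assemble the pattern from a shared string fragment. The sketch below does that and also shows what a backreference does instead. The names PatternParts, DigitGroup and MatchesWithBackreference are made up for this illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternParts
{
    // Hypothetical names, introduced only for this sketch.
    private const string DigitGroup = "([0-9]{1,3})?";

    // The same pattern as above, assembled from the shared fragment.
    public const string CurrencyPattern =
        @"\$" + DigitGroup + ",?" + DigitGroup + @"\.?([0-9]{2})?";

    // \1 only matches the exact text that group 1 captured.
    public static bool MatchesWithBackreference(string input)
    {
        return Regex.IsMatch(input, @"^\$([0-9]{1,3}),\1$");
    }

    static void Main()
    {
        Console.WriteLine(CurrencyPattern);
        Console.WriteLine(MatchesWithBackreference("$123,123")); // True: same digits twice
        Console.WriteLine(MatchesWithBackreference("$123,456")); // False: different digits
    }
}
```

The backreference version only accepts amounts whose two digit groups are identical, which is why string composition is the safer way to share a sub-pattern here.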

Capture Groups

You can reference a capture group in C# by using its index number.

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        private const string CurrencyPattern = @"\$([0-9]{1,3})?,?([0-9]{1,3})?\.?([0-9]{2})?";

        static void Main(string[] args)
        {
            var amount = args[0];

            var regex = new Regex(CurrencyPattern);
            var match = regex.Match(amount);

            if (match.Success)
            {
                Console.WriteLine("Index   Value");
                for (int i = 0; i < match.Groups.Count; i++)
                {
                    Console.WriteLine("{0}:    {1}", i, match.Groups[i]);
                }
            }
            Console.Read();
        }
    }

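The output lists each group's index next to its value. For an input of $123,456.78 it will be

    Index   Value
    0:    $123,456.78
    1:    123
    2:    456
    3:    78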

Group 0 is the entire matched string, followed by each capture group from left to right. This can be fine in simple patterns, but in complex patterns it can be hard to follow. We can help solve this by giving a group a name: add a ? followed by the name in <> at the start of the group. For example (?<Digit>[0-9]). The result looks like

\$(?<Thousands>[0-9]{1,3})?,?(?<Hundreds>[0-9]{1,3})?\.?(?<Cents>[0-9]{2})?

In code we can now reference the group via the name.

    class Program
    {
        private const string CurrencyPattern = @"\$(?<Thousands>[0-9]{1,3})?,?(?<Hundreds>[0-9]{1,3})?\.?(?<Cents>[0-9]{2})?";

        static void Main(string[] args)
        {
            var amount = args[0];

            var regex = new Regex(CurrencyPattern);
            var match = regex.Match(amount);

            if (match.Success)
            {
                Console.WriteLine("Index   Value");
                for (int i = 0; i < match.Groups.Count; i++)
                {
                    Console.WriteLine("{0}:    {1}", i, match.Groups[i]);
                }
                Console.WriteLine("You have {0} cents", match.Groups["Cents"]);
            }
            Console.Read();
        }
    }
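Because every group in this pattern is optional, a named group may not take part in a match at all. The following is a minimal sketch of checking for that with Group.Success. The class name OptionalGroups, the helpers MatchAmount and Describe, and the sample input are my own choices for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionalGroups
{
    private const string CurrencyPattern =
        @"\$(?<Thousands>[0-9]{1,3})?,?(?<Hundreds>[0-9]{1,3})?\.?(?<Cents>[0-9]{2})?";

    public static Match MatchAmount(string amount)
    {
        return Regex.Match(amount, CurrencyPattern);
    }

    // Returns the captured text, or a marker when the group did not participate.
    public static string Describe(Match match, string groupName)
    {
        Group group = match.Groups[groupName];
        return group.Success ? group.Value : "(not matched)";
    }

    static void Main()
    {
        Match match = MatchAmount("$12.50");
        foreach (string name in new[] { "Thousands", "Hundreds", "Cents" })
        {
            Console.WriteLine("{0}: {1}", name, Describe(match, name));
        }
    }
}
```

For $12.50 the integer part lands in the Thousands group and Hundreds does not match, which is one of the format assumptions the pattern makes.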

Comments

We can make this more readable by adding comments to the pattern.

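To use single-line comments, enable the x pattern modifier, which makes the regex engine ignore unescaped white space in the pattern and treat everything from a # to the end of the line as a comment. Add (?x) at the start of the pattern. You can turn the modifier off for part of the pattern with a (?-x: ... ) group.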

    class Program
    {
        private const string CurrencyPattern = @"(?x)
                                                # matches the dollar character
                                                    \$
                                                # matches an optional sequence of one to three digits, followed by an optional comma, for the Thousands figure
                                                    (?<Thousands>[0-9]{1,3})?,?
                                                # matches an optional sequence of one to three digits, followed by an optional period, for the Hundreds figure
                                                    (?<Hundreds>[0-9]{1,3})?\.?
                                                # matches an optional two-digit sequence for the Cents figure
                                                    (?<Cents>[0-9]{2})?";

        static void Main(string[] args)
        {
            var amount = args[0];

            var regex = new Regex(CurrencyPattern);
            var match = regex.Match(amount);

            if (match.Success)
            {
                Console.WriteLine("Index   Value");
                for (int i = 0; i < match.Groups.Count; i++)
                {
                    Console.WriteLine("{0}:    {1}", i, match.Groups[i]);
                }
                Console.WriteLine("You have {0} cents", match.Groups["Cents"]);
            }
            Console.Read();
        }
    }
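The same behaviour as (?x) is also available from code through RegexOptions.IgnorePatternWhitespace. Here is a short sketch of that variant; the class name CommentedPattern, the Currency field and the sample input are assumptions of mine. Under this option a literal space or # in the pattern would have to be escaped.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentedPattern
{
    // No (?x) prefix here; the option passed to the constructor enables it.
    private const string CurrencyPattern = @"
        \$                              # the dollar sign
        (?<Thousands>[0-9]{1,3})? ,?    # optional thousands digits and comma
        (?<Hundreds>[0-9]{1,3})?  \.?   # optional hundreds digits and decimal point
        (?<Cents>[0-9]{2})?             # optional cents";

    public static readonly Regex Currency =
        new Regex(CurrencyPattern, RegexOptions.IgnorePatternWhitespace);

    static void Main()
    {
        Match match = Currency.Match("$123,456.78");
        Console.WriteLine(match.Groups["Thousands"].Value); // 123
        Console.WriteLine(match.Groups["Hundreds"].Value);  // 456
        Console.WriteLine(match.Groups["Cents"].Value);     // 78
    }
}
```

Passing the option keeps the pattern string free of modifiers, which can be handy when the same pattern text is shared with other tools.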

You can also use an inline comment, which does not require the x pattern modifier, by writing (?# my comment).
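As a rough illustration of the (?# ) form, the sketch below annotates the currency pattern without any modifier. The class name InlineComments, the CentsOf helper and the sample input are hypothetical. Each comment ends at the first closing parenthesis, so the comment text itself must not contain one.

```csharp
using System;
using System.Text.RegularExpressions;

static class InlineComments
{
    // (?# ...) comments are skipped by the regex engine; no x modifier is needed.
    private const string CurrencyPattern =
        @"\$(?# dollar sign)" +
        @"(?<Thousands>[0-9]{1,3})?,?(?# thousands)" +
        @"(?<Hundreds>[0-9]{1,3})?\.?(?# hundreds)" +
        @"(?<Cents>[0-9]{2})?(?# cents)";

    // Returns the text captured by the Cents group, or an empty string.
    public static string CentsOf(string amount)
    {
        return Regex.Match(amount, CurrencyPattern).Groups["Cents"].Value;
    }

    static void Main()
    {
        Console.WriteLine("You have {0} cents", CentsOf("$1,234.56"));
    }
}
```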
