Monday, April 14, 2008 2:00 AM bart

Pattern Matching in C# - Part 6

Monday morning: The Return of The Pattern Matcher. After an awesome weekend (well, a Saturday at least) plenty of sun here in Seattle, we'll dive into even more pattern matching fun. This time around we'll investigate ways to match collections. Last time we wrapped our heads around ways to match object initialization patterns but as many of you know, the latest language wave introduced richer collection initialization language features as well. You might want to check out my "C# 3.0 Feature Focus - Part 3 - Collection Initializers" post on the topic, but let's recap first.

Collection and object initializers - anything special?

In the pre C# 3.0 era, you could write stuff like this:

int[] primes = new int[] { 2, 3, 5, 7, 11 };

but something that ought to be richer, namely (generic) collections, had quite redundant noise as illustrated below:

List<int> primes = new List<int>();
primes.Add(2);
primes.Add(3);
primes.Add(5);
primes.Add(7);
primes.Add(11);

It's a pattern dude, and you know what language designers do with those kind of common patterns? Exactly:

List<int> primes = new List<int>() { 2, 3, 5, 7, 11 };

which internally translates into the old verbose code. Well, almost true... Take a look at the IL result (feel free to use much more handy tools - but way too advanced for me - like Reflector) - click to enlarge:

image 

Two assignments, huh? This illustrates the core difference between the old C# 2.0 fragment and the new C# 3.0 fragment: the battle of statements versus expressions. The following is an expression:

new List<int>() { 2, 3, 5, 7, 11 }

just like

new Person() { Name = "Bart", Age = 25 }

So, either you have an object that matches all the criteria of the initialization, or you don't have anything. That is, every time you look at me, I should be called "Bart" and have an age of 25 (till February 11th that is). Similarly, the collection of primes should always have 2, 3, 5, 7 and 11. So, you shouldn't ever see an intermediate state, hence the atomic assignment. I explained this behavior for object initializers in the past (C# 3.0 Object Initializers Revisited) but the same holds true for collection initializers.

The key take-away from this is the fact those are expressions, not abbreviated statement sequences (although internally, we ultimately rely on statements to get the work done, but on the surface it's only the result that counts).

Collection initializers in their expression tree costume

Expressions can be represented as expression trees (duh!) so how does it look like? A sample:

Expression<Func<List<int>>> primes = () => new List<int>() { 2, 3, 5, 7, 11 }

Notice the generic declaration on the left hand side is mandatory (read: no 'var') since lambda expressions don't have a pre-defined type: they are either delegates to anonymous methods or expression tree generators. The C# 3.0 compiler has a new error code for this: error CS0815: Cannot assign lambda expression to an implicitly-typed local variable. And no, you can't write:

Expression<var> primes = () => new List<int>() { 2, 3, 5, 7, 11 }

Anyway, we're more interested in what it gives us back. We could go the IL route, but let's make it less painful:

image

Hooray, a ListInitExpression. What does it consist of? Two things:

image

A NewExpression (where did we see this before?) and a set of Initializers which are of type ElementInit and represent the Add method calls. What about regular arrays? Somewhat similar:

image

but with its own dedicated type NewArrayExpression.

What do we want to match?

Anything to match? You bet! But how? Obviously, we'll have some limitations to deal with since - as usual - we need to trace back construction to runtime matching expressions. Let's make that concrete: consider the following:

.Match((int x) => new[] { x, 1 }, x => { ... })

means we're looking for an array with two elements, the second one being equal to 1, asking for the first one. Notice the wording: second one, first one. We're considering ordering, which gives us the power to match like this:

int[] array = o as int[];
if (array != null && array.Length == 2 && array[1] == 1)
{
   int x = array[0];
   ...
}

Thus, the indexer becomes our vehicle to match and extract (i.e. the opposite of Add). The same principle can be applied to List<T> which also has the pair of Add and indexer members. But what about the following?

.Match((int x) => new HashSet<int>() { x, 1 }, x => { ... })

No indexer this time since sets are unordered. So, what do we mean to say by this? Maybe a set of two elements that contains 1, binding the remainder element to x? Yes, but it gets quite complicated if you extend upon this:

.Match((int x, int y) => new HashSet<int>() { x, 1, 2, y, 2, 1, 3 }, x => { ... })

So, a set with 7 (?) elements, containing 1 and 2 both twice (?), as well as 3, the remainder elements being bound to x and y. Not really, sets can contain an element only once, so our matcher would emit redundant checks and a check for the length would be more difficult to infer (exercise: why do we need a check for the length?). So far, this might still be expressible in some higher-order matching condition (using set operations) but the real crucial problem is the set's unordered characteristic: there's no guarantee which of the two remaining elements will be bound to x and which one to y. Sets don't give any guarantees, so let's not try to rely on those non-existing things. If you really feel this is a severe restriction, you can cook your own matcher (exercise) for sets limiting the number of unbound variables to 1 (which is deterministic).

A more interesting one might be the following:

.Match((int bartAge, int johnAge) => new Dictionary<string, int>() {
    { "Bart", bartAge },
    { "John", johnAge }
}, (bartAge, johnAge) => { ... })

Notice how collection initializers allow to be layered on top of Add methods with more than one parameter (you can even mix different overloads for Add methods in one collection initializer). What would a match condition look like? This raises an interesting question: do we want to check for the number of entries (meaning, we only want to extract information from dictionaries with fixed lengths inferred from the initializer's usage) or do we want to allow additional flexibility (e.g. extract two names from a telephone book using pattern matching)? I'm tempted to opt for the latter one (exercise: why not for T[] or List<T>; consider pros and cons). So here's a sample match:

IDictionary<string, int> dictionary = o as IDictionary<string, int>;
int bartAge, johnAge;
if (dictionary != null && dictionary.TryGetValue("Bart", out bartAge) && dictionary.TryGetValue("John", out johnAge))
{
   ...
}

In this post, we'll focus on T[] and List<T> matching. Dictionary<TKey, TValue> will be treated separately next time. There are more challenging variants, but let's keep their discussion for another time.

Arrays

Match = unify + extract. Given the commutativity of the operation we can start either way, but let's stick with the western habits of left to right processing (we're in compilation land anyhow). As usual, we start by extending MatchEntry.Compile:

NewExpression ne;
ConstantExpression ce;
ParameterExpression pe;
MemberInitExpression mie;
NewArrayExpression na;

Expression match;
if ((ne = Match.Body as NewExpression) != null)
    match = CompileNewExpression(o, target, ne, bindings);
else if ((ce = Match.Body as ConstantExpression) != null)
    match = CompileConstantExpression(o, target, ce, bindings);
else if ((pe = Match.Body as ParameterExpression) != null)
    match = CompileParameterExpression(o, target, pe, bindings);
else if ((mie = Match.Body as MemberInitExpression) != null)
    match = CompileMemberInitExpression(o, target, mie, bindings);
else if ((na = Match.Body as NewArrayExpression) != null)
    match = CompileNewArrayExpression(o, target, na, bindings);
else
    throw new
NotSupportedException("Unsupported expression type in match clause.");

This illustrates how easy it is to extend our matcher. Notice though that for an optimal matcher, you'd need a recursive approach (I'll talk about this later) since expressions can be arbitrarily nested (composability is key), as in:

.Match((int x, int y) => new Bar() { Foo = x, FooBar = new[] { y, 1 }, (x, y) => { ... })

Anyhow, there's something more important in the above, the type of bindings will need to change to:

//
// List of bindings.
//

var bindings = new Dictionary<ParameterExpression, Expression>();

What does this mean? Remember, a binding is used to extract information from the object being matched in order to feed it into the match body. So far, we've been extracting just properties and fields on objects, but now we need more extraction: elements of arrays. So, a MemberInfo value in the dictionary no longer suffices, hence the change to an Expression that will capture the way to get the data out of the original object. For the existing code, this is a rather simple fix: everywhere we stored a MemberInfo, it has to be replaced by:

bindings[pe] = Expression.PropertyOrField(
    Expression.Convert(
        o,
        target
    ),
    member.Name
);

which creates the mapping straight away. Other changes follow easily from this. Back to our array matching, time to take a look at CompileNewArrayExpression. Our check consists of a few things: checking the type, the length and all indices for which a constant is specified. We restrict ourselves to constants as we've done before. Here's how it looks:

private Expression CompileNewArrayExpression(ParameterExpression o, Type target, NewArrayExpression na, Dictionary<ParameterExpression, Expression> bindings)
{
    Expression check = Expression.AndAlso(
        Expression.TypeIs(
            o,
            target
        ),
        Expression.Equal(
            Expression.ArrayLength(
                Expression.Convert(
                    o,
                    target
                )
            ),
            Expression.Constant(na.Expressions.Count)
        )
    );

    int i = 0;
    foreach (var e in na.BLOCKED EXPRESSION
    {
        ParameterExpression pe;
        ConstantExpression ce;

        if ((pe = e as ParameterExpression) != null)
        {
            if (bindings.ContainsKey(pe))
            {
                check = Expression.AndAlso(
                    check,
                    GetEqualityCheck(o, target, i, bindings[pe])
                );
            }
            else
                bindings[pe] = Expression.ArrayIndex(
                    Expression.Convert(
                        o,
                        target
                    ),
                    Expression.Constant(i)
                );
        }
        else if ((ce = e as ConstantExpression) != null)
        {
            check = Expression.AndAlso(
                check,
                GetEqualityCheck(o, target, i, ce)
            );
        }
        else
            throw new
NotSupportedException("Can only match constants.");

        i++;
    }

    return check;
}

The first part of the check outside the foreach-loop checks the type using the is operator equivalent TypeIs and checks the array length which has a nice expression counterpart listening to the lovely name of ArrayLength which is matched against the number of expressions (which by the way stand for the individual entries in the comma-separated initialization list). Notice the compiler will make sure collection initializers have the same number of initialization expressions as the specified array rank (if specified at all).

Next, inside the loop, we try to bind all of the expressions in the initializer expression. The idea is completely analogue to what we've done before: if it's a constant, we emit an equality check that will be added to the match clause expression; if it's a parameter, it becomes a binding for extraction of values that need to be fed into the match body. A breakdown analysis:

        if ((pe = e as ParameterExpression) != null)
        {
            if (bindings.ContainsKey(pe))
            {
                check = Expression.AndAlso(
                    check,
                    GetEqualityCheck(o, target, i, bindings[pe])
                );
            }
            else
                bindings[pe] = Expression.ArrayIndex(
                    Expression.Convert(
                        o,
                        target
                    ),
                    Expression.Constant(i)
                );
        }

For parameters, we need to check first whether or not we've already found a binding to the parameter since we need to emit equality checks in that case. A sample of such an occurrence is:

.Match((int x, int y) => new[] { x, y, x }, (x, y) => { ... })

which should emit a condition like a[0] == a[2]. We keep the original binding (since we enforce equality in the check for bindings to the same name, it doesn't matter which one we retrieve obviously) but chain the extra equality check to the current check. If there isn't a binding yet, we create one using the ArrayIndex factory method on Expression which takes in an array and an index. The code for this speaks for itself.

Constants are even more trivial:

         else if ((ce = e as ConstantExpression) != null)
        {
            check = Expression.AndAlso(
                check,
                GetEqualityCheck(o, target, i, ce)
            );
        }

once we know what's hiding behind the GetEqualityCheck method. So let's reveal this little secret:

private Expression GetEqualityCheck(ParameterExpression o, Type target, int i, Expression e)
{
    return Expression.Call(
        typeof(object),
        "Equals",
        new Type[0],
        Expression.Convert(
            Expression.ArrayIndex(
                Expression.Convert(
                    o,
                    target
                ),
                Expression.Constant(i)
            ),
            typeof(object)
        ),
        Expression.Convert(
            e,
            typeof(object)
        )
    );
}

Again we rely on System.Object.Equals(a, b) in order to do equality checking but you could refine this if you want (exercise). The core part is ArrayIndex and it looks all familiar, so it doesn't require much more attention. Just don't forget about the conversions to System.Object for both operands to the equality check since those are to be bound to the parameters of the System.Object.Equals method.

That's it! We've done the unification, we extracted information, all the rest happens automagically in our Compile method we discussed before. Time for a sample:

var onBisectXY = new Pattern<bool>()
    .Match((int x) => new[] { x, x, 0 }, x => true)
    .Else(() => false);

onBisectXY.Compile();

Random rand = new Random();
for (int i = 0; i < 100; i++)
{
    var p = new[] { rand.Next(0, 5), rand.Next(0, 5), rand.Next(0, 5) };
    if (onBisectXY.Execute(p))
        Console.WriteLine(p[0] + "," + p[1] + "," + p[2]);
}

which produces:

image

Three matches out of 100, not bad given the restrictive condition:

image

List<T>

Our next candidate is List<T>. On the surface of it, it looks pretty similar:

var onBisectXY = new Pattern<bool>()
    .Match((int x) => new List<int> { x, x, 0 }, x => true)
    .Else(() => false);

onBisectXY.Compile();

Random rand = new Random();
for (int i = 0; i < 100; i++)
{
    var p = new List<int> { rand.Next(0, 5), rand.Next(0, 5), rand.Next(0, 5) };
    if (onBisectXY.Execute(p))
        Console.WriteLine(p[0] + "," + p[1] + "," + p[2]);
}

but the way to retrieve data is slightly different since we no longer have an ArrayInitExpression. Instead we're faced with a ListInitExpression. Once more, start by adding the right match handler:

NewExpression ne;
ConstantExpression ce;
ParameterExpression pe;
MemberInitExpression mie;
NewArrayExpression na; 
ListInitExpression lie;

Expression match;
if ((ne = Match.Body as NewExpression) != null)
    match = CompileNewExpression(o, target, ne, bindings);
else if ((ce = Match.Body as ConstantExpression) != null)
    match = CompileConstantExpression(o, target, ce, bindings);
else if ((pe = Match.Body as ParameterExpression) != null)
    match = CompileParameterExpression(o, target, pe, bindings);
else if ((mie = Match.Body as MemberInitExpression) != null)
    match = CompileMemberInitExpression(o, target, mie, bindings);
else if ((na = Match.Body as NewArrayExpression) != null)
    match = CompileNewArrayExpression(o, target, na, bindings);

else if ((lie = Match.Body as ListInitExpression) != null)
    match = CompileListInitExpression(o, target, lie, bindings);
else
    throw new
NotSupportedException("Unsupported expression type in match clause.");

Let's introduce CompileListInitExpression piecemeal. First let's do some checking to allow just List<T>:

private Expression CompileListInitExpression(ParameterExpression o, Type target, ListInitExpression lie, Dictionary<ParameterExpression, Expression> bindings)
{
    if (lie.NewExpression.Arguments.Count != 0)
        throw new NotSupportedException("Collection initializers in match expressions should use the default constructor.");

    if (lie.Type.GetGenericTypeDefinition() != typeof(List<>))
        throw new NotSupportedException("Only List<T> is supported.");

    return CompileListInitExpressionForListOfT(o, target, lie, bindings);
}

We restrict ourselves to List<T> but also make sure the constructor isn't used (since collection initializers can be combined with regular constructor calls) to simplify unification. Notice we don't loose much since the constructors for List<T> take either a capacity (which is irrelevant since we're using the "object construction" only for matching purposes) and an IEnumerable<T> (which could only be one of our supported types, like T[] or List<T> which would be redundant).

The method (with the rather long name) being called starts by checking the type and length, this time using the Count property of the list:

private Expression CompileListInitExpressionForListOfT(ParameterExpression o, Type target, ListInitExpression lie, Dictionary<ParameterExpression, Expression> bindings)
{
    Expression check = Expression.AndAlso(
        Expression.TypeIs(
            o,
            target
        ),
        Expression.Equal(
            Expression.Property(
                Expression.Convert(
                    o,
                    target
                ),
                "Count"
            ),
            Expression.Constant(lie.Initializers.Count)
        )
    );

The rest if fairly predictable but nevertheless interesting:

    int i = 0;
    foreach (var e in lie.Initializers)
    {
        //
        // Note: We know we're processing List<T>, which only has one Add overload,
        //       so we omit additional checks for now.
        //
        Expression arg = e.Arguments[0];

        ParameterExpression pe;
        ConstantExpression ce;

        if ((pe = arg as ParameterExpression) != null)
        {
            if (bindings.ContainsKey(pe))
            {
                check = Expression.AndAlso(
                    check,
                    GetEqualityCheckForListOfT(o, target, i, bindings[pe])
                );
            }
            else
                bindings[pe] = Expression.Call(
                    Expression.Convert(
                        o,
                        target
                    ),
                    target.GetMethod("get_Item"),
                    Expression.Constant(i)
                );
        }
        else if ((ce = arg as ConstantExpression) != null)
        {
            check = Expression.AndAlso(
                check,
                GetEqualityCheckForListOfT(o, target, i, ce)
            );
        }
        else
            throw new
NotSupportedException("Can only match constants.");

        i++;
    }

    return check;
}

Notice the skeleton of the code is the same: iterate, type check, accumulate the check expression, etc. Being so similar, I've highlighted the main differences. The type of the initializers are ElementInit objects which have an AddMethod and Arguments property. We don't do additional checking but in the general case any well-suited overload of Add could be used. List<T> only have one, so we simply extract the first parameter to the call and treat it as the expression being used. Again, it needs to be either constant or a parameter, resulting a match condition or a binding respectively. The binding is where things become interesting since we can't use ArrayIndex but rather need to call the indexer on the List<T> which is known as a method with name get_Item. Other than that, there's no relevant difference.

The helper method GetEqualityCheckForListOfT is plain simple to once we know how to call the indexer:

private Expression GetEqualityCheckForListOfT(ParameterExpression o, Type target, int i, Expression e)
{
    return Expression.Call(
        typeof(object),
        "Equals",
        new Type[0],
        Expression.Convert(
            Expression.Call(
                Expression.Convert(
                    o,
                    target
                ),
                target.GetMethod("get_Item"),
                Expression.Constant(i)
            ),
            typeof(object)
        ),
        Expression.Convert(
            e,
            typeof(object)
        )
    );
}

The resulting match condition you ask? Here it is:

image

Can you still parse it yourself? :-) Anyhow, more important: it works:

image

Next time: matching dictionaries.

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

Filed under: , ,

Comments

# Pattern Matching in C# - Part 6

Monday, April 14, 2008 2:05 AM by DotNetKicks.com

You've been kicked (a good thing) - Trackback from DotNetKicks.com

# re: Pattern Matching in C# - Part 6

Monday, April 14, 2008 7:47 AM by Steve

"Last team we wrapper our heads around", hehe, it's monday after all. Can't wait to get my hands on the complete solution :)

# re: Pattern Matching in C# - Part 6

Monday, April 14, 2008 10:23 AM by bart

Hi Steve,

Thanks for the correction :-). Maybe I'm on a team building some wrapper? Anyway, I've corrected the text.

-Bart

# Adventures in F# - F# 101 Part 8 (Mutables and Reference Cells)

Tuesday, April 15, 2008 4:09 PM by Matthew Podwysocki's Blog

Time for another adventure in F#, covering some of the basics of functional programming and F# in particular

# Adventures in F# - F# 101 Part 8 (Mutables and Reference Cells)

Tuesday, April 15, 2008 4:17 PM by Community Blogs

Time for another adventure in F#, covering some of the basics of functional programming and F# in particular

# Reflective Perspective - Chris Alcock &raquo; The Morning Brew #74

Pingback from  Reflective Perspective - Chris Alcock  &raquo; The Morning Brew #74