Monday, November 03, 2008 11:11 PM bart

C# 4.0 Feature Focus - Part 3 - Intermezzo: LINQ's new Zip operator

After named parameters and optional parameters, we'll take a little breadth and deviate a bit from the language specifics to present a new LINQ operator: Zip. Just like a zipper zips two streams of materials together, LINQ's Zip operator can zip together two sequences. Here's the signature of the new method:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func);

 

Sample

Given two sequences and a function that combines two elements from both sequences, a sequence of zipped pairs is produced. Here's a sample:

string[] names = { "Bart", "John" };
int[] ages = { 25, 60 };

names.Zip(ages, (name, age) => name + " is " + age + " years old.");

This produces a sequence with the sentences "Bart is 25 years old." and "John is 60 years old.". The lambda syntax for the passed-in function should speak for itself, and notice we're using extension method invocation here, so names is propagated to become the left-hand side of the method call.

 

How Zip works

Previously I've implemented the Standard Query Operators for reference purposes on the Codeplex site at http://www.codeplex.com/LINQSQO. I won't update that sample library just yet, but here's an illustration on how easy Zip is to implement, ignoring exception handling (which is more subtle than you might think, see further):

static class Enumerable
{
    public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
    {
        var ie1 = first.GetEnumerator();
        var ie2 = second.GetEnumerator();

        while (ie1.MoveNext() && ie2.MoveNext())
            yield return func(ie1.Current, ie2.Current);
    }
}

The Zip operation is implemented using iterators inside ("yield return") and stops as soon as one of the sequences runs out of juice (the case of the asymmetric zipper).

 

Zip without Zip?

As a curiosity, is it actually possible to build Zip out of existing LINQ operators, ignoring performance worries? Unsurprisingly, it turns out this is the case. Here's how:

static class Enumerable
{
    public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
    {
        return first.Select((x, i) => new { X = x, I = i }).Join(second.Select((x, i) => new { X = x, I = i }), o => o.I, i => i.I, (o, i) => func(o.X, i.X));
    }
}

(Exercise for the reader: think of more alternatives.) How does this work? To explain this we need a bit of vocabulary, so let's refer to Wikipedia for a second and apply an isomorphism between textile sciences and computer sciences (something in me screams "zip, zipper, zippest"):

The bulk signature of a zipper Zip consists of two strips sequences of fabric element taypes, each affixed to one of the two pieces sequence instances to be joined, carrying tens or hundreds or anything below OutOfMemoryException conditions of specially regularly shaped allocated metal reference- or plastic value-typed teeth elements. (...) The slider Func<TFirst, TSecond, TResult> delegate, operated by hand executing Zip, moves along the rows sequences of teeth elements. Inside the slider Func<TFirst, TSecond, TResult> delegate is a Y-shaped strongly-typed channel function that meshes together or separates the opposing rows sequences of teeth elements, depending on the direction of its movement. The friction binding and vibration application of the slider Func<TFirst, TSecond, TResult> delegate against on the teeth elements causes a characteristic buzzing callvirt'ing noise, which is probably not the origin of the name zipper. The name also may have originated in the greater speed and ease with which the two sides of a zipper Zip can be joined, compared to the time needed for fastening executing laces or buttons the method above.

Well, that's exactly what happens: opposing rows sequences of teeth elements are combined by a Y-shaped strongly-typed channel function. How to determine opposing sequence elements? Using Select's overload that provides (besides the element itself) an index denoting the element's position in the original sequence. Matching the opposing elements is a matter of joining both sequences, extracting the keys (marked as I in the anonymous type) and combining the selected pairs of elements from both sequences (marked as X in the anonymous type) by feeding them in to the slider Func<TFirst, TSecond, TResult> delegate. I told you the explanation was straightforward, or did I?

 

Iterators and exceptions

Most of the LINQ operators are implemented using iterators. You might wonder this this is relevant at all. Obviously we need it as LINQ operators need to be lazy. Only when you start fetching results by iterating of the sequence, the internal machinery should kick in. Declaring a query doesn't cause any execution whatsoever. Internally iterators are implemented as state machines. Every state can do on "yield", after which the state machine is suspended till the consumer asks for the next element in the sequence being produced by calling MoveNext on the IEnumerator<T> object.

Our Zip implementation above is turned into an equivalent piece of code (slightly simplified for illustrative purposes):

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
{
    return new ZipIterator<TFirst, TSecond, TResult>(first, second, func);
}

private sealed class ZipIterator<TFirst, TSecond, TResult> : IEnumerable<TResult>, IEnumerator<TResult>
{
    //
    // Captured iterator method parameters.
    //

    private IEnumerable<TFirst> _first;
    private Func<TFirst, TSecond, TResult> _func;
    private IEnumerable<TSecond> _second;

    //
    // Iterator's enumerator state and thread affinity.
    //

    private State _state;
    private int _initialThreadId;

    //
    // Value currently yielded by the iterator's enumerator.
    //

    private TResult _current;

    //
    // Local variables used by the iterator's enumerator.
    //
    private IEnumerator<TFirst> _ieFirst;
    private IEnumerator<TSecond> _ieSecond;

    //
    // Captured iterator method parameters used by enumerator.
    //
    private IEnumerable<TFirst> first;
    private Func<TFirst, TSecond, TResult> func;
    private IEnumerable<TSecond> second;

    //
    // Public constructor to create an iterator in the initial ready state.
    //
    public ZipIterator(IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
        : this()
    {
        this._first = first;
        this._second = second;
        this._func = func;
    }

    //
    // Private constructor to set state and thread affinity.
    //
    private ZipIterator()
    {
        this._state = State.Before;
        this._initialThreadId = Thread.CurrentThread.ManagedThreadId;
    }

    //
    // Gets a new enumerator over the iterator.
    //
    IEnumerator<TResult> IEnumerable<TResult>.GetEnumerator()
    {
        ZipIterator<TFirst, TSecond, TResult> iterator;

        //
        // Can reuse the current enumerator if thread affinity and state permit it.
        //
        if (Thread.CurrentThread.ManagedThreadId == _initialThreadId && _state == State.Before)
        {
            iterator = this;
        }
        else
        {
            iterator = new ZipIterator<TFirst, TSecond, TResult>();
        }

        iterator.first = this._first;
        iterator.second = this._second;
        iterator.func = this._func;

        return iterator;
    }

    //
    // Gets a new enumerator over the iterator.
    //
    IEnumerator IEnumerable.GetEnumerator()
    {
        return ((IEnumerable<TResult>)this).GetEnumerator();
    }

    //
    // Advances the iterator one yield return at a time.
    //
    bool IEnumerator.MoveNext()
    {
        switch (this._state)
        {
            case State.Before:
                this._ieFirst = this.first.GetEnumerator();
                this._ieSecond = this.second.GetEnumerator();
                goto case State.Suspended;

            case State.Suspended:
                if (this._ieFirst.MoveNext() && this._ieSecond.MoveNext())
                {
                    this._current = this.func(this._ieFirst.Current, this._ieSecond.Current);
                    this._state = State.Suspended;
                    return true;
                }

                break;
        }

        this._state = State.After;
        return false;
    }

    //
    // Retting the iterator enumerator is not supported; create a new one instead by calling GetEnumerator.
    // The foreach statement does this anyway.
    //
    void IEnumerator.Reset()
    {
        throw new NotSupportedException();
    }

    //
    // Gets the value currently yielded from the iterator.
    //
    TResult IEnumerator<TResult>.Current
    {
        get
        {
            return _current;
        }
    }

    //
    // Gets the value currently yielded from the iterator.
    //
    object IEnumerator.Current
    {
        get
        {
            return _current;
        }
    }

    //
    // IDisposable implementation.
    //
 
   void IDisposable.Dispose()
    {
    }

    //
    // Internal iterator enumerator states.
    //
    private enum State
    {
        Before,
        Running,
        Suspended,
        After
    }

}

Notice that the code above holds the middle between the specification (paragraph 10.14 of the C# 3.0 specification) and the actual implementation in the Visual C# 2008 compiler. More specifically, I've made the discrete states (which are internally represented as integers) match the ones in the specification, but haven't gone all the way in matching the code up with the

Of all this machinery, the MoveNext method is the most interesting one from the iterator's point of view. Here a state machine is built, rewriting the original iterator block by splitting it into discrete blocks. To perform this transformation, yield return statements are replaced as follows:

 yield a;

becomes

this._current = a;
this._state = State.Suspended;
return true;

Similar rewrites happen for yield break statements, but that's not relevant here. Furthermore all the iterator block code is placed in a MoveNext method, switching on the current state. The specification only mentions the four states I've implemented above, but we're lucky the number of states for our sample matches up with the number of states that's specified. In cases where there are more yield statements, more states are needed to suspend the machine and resume at the same point during the next MoveNext call. Other transformations needed include rewriting of loops (things like while don't really play well with suspension points and need to be rewritten in terms of if/goto, where gotos require additional states). Applying all those tricks gives us:

bool IEnumerator.MoveNext()
{
Begin:
    switch (this._state)
    {
        case State.Before:
            this._ieFirst = this.first.GetEnumerator();
            this._ieSecond = this.second.GetEnumerator();
            this._state = State.Running;
            goto Begin;

        case State.Running:
        case State.Suspended:
            if (this._ieFirst.MoveNext() && this._ieSecond.MoveNext())
            {
                this._current = this.func(this._ieFirst.Current, this._ieSecond.Current);
                this._state = State.Suspended;
                return true;
            }

            break;
    }

    this._state = State.After;
    return false;
}

Here the local variables used in the iterator block are captured in the initial pass through MoveNext, when state is still set to Before. Strictly speaking we move from Before to Running all the way to the first point where the code gets suspended, but in our case states can be merged quite a bit. In fact the code really produced by the compiler looks like this:

bool IEnumerator.MoveNext()
{
    switch (this._state)
    {
        case 0:
            this._state = -1;
            this._ieFirst = this.first.GetEnumerator();
            this._ieSecond = this.second.GetEnumerator();
            while (this._ieFirst.MoveNext() && this._ieSecond.MoveNext())
            {
                this.current = this.func(this._ieFirst.Current, this._ieSecond.Current);
                this._state = 1;
                return true;
            Resume:
                this._state = -1;
            }
            break;

        case 1:
            goto Resume;
    }
    return false;
}

But why am I telling you all this? Because it's fascinating? Yes, for that reason too. But I promised to tell you something about exceptions, right? Let me quote from the C# 3.0 specification paragraph 10.14.4.1 first:

(...) The precise action performed by MoveNext depends on the state of the enumerator object when MoveNext is invoked:

  • If the state of the enumerator object is 'before', invoking MoveNext:
    • Changes the state to 'running'.
    • Initializes the parameters (including this) of the iterator block to the argument values and instance value saved when the enumerator object was initialized.
    • Executes the iterator block from the beginning until execution is interrupted (as described below).

(...)

The red line is what's of interest to us, and you should read it in reverse:

The code from the beginning of the iterator block till the place where execution is interrupted (i.e. a yield occurs) is executed by MoveNext when transitioning from 'before' to 'running'.

What if that code contains exception throwing statements?

static class Enumerable
{
    public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
    {
        if (first == null)
            throw new ArgumentNullException("first");

        if (second == null)
            throw new ArgumentNullException("second");

        var ie1 = first.GetEnumerator();
        var ie2 = second.GetEnumerator();

        while (ie1.MoveNext() && ie2.MoveNext())
            yield return func(ie1.Current, ie2.Current);
    }
}

As you can guess, this code won't get executed till the very first passage through the iterator's enumerator object's MoveNext method. Recall what foreach corresponds to:

foreach (V v in x) embedded-statement

becomes (with C the collection type, E the enumerator type, V the local variable type and T the element type)

{
    E e = ((C)(x)).GetEnumerator();
    try {
        V v;
        while (e.MoveNext()) {
            v = (V)(T)e.Current;
            embedded-statement
        }
    }
    finally {
        ...
    }
}

So just calling the iterator does nothing that can cause the exceptions to be thrown, until a call is made to MoveNext, e.g. by using a foreach loop. As we want the exception to be thrown straight away when calling Zip with invalid arguments, the right way to implement our Zip method is:

static class Enumerable
{
    public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
    {
        if (first == null)
            throw new ArgumentNullException("first");

        if (second == null)
            throw new ArgumentNullException("second");

        return ZipInternal(first, second, func);
    }

    private static IEnumerable<TResult> ZipInternal<TFirst, TSecond, TResult>(this IEnumerable<TFirst> first, IEnumerable<TSecond> second, Func<TFirst, TSecond, TResult> func)
    {

        var ie1 = first.GetEnumerator();
        var ie2 = second.GetEnumerator();

        while (ie1.MoveNext() && ie2.MoveNext())
            yield return func(ie1.Current, ie2.Current);
    }
}

 

Conclusion

The addition of Zip to LINQ is a nice one, but not a mind-blower. So I hope you'll accept my apologies for using this as a lame excuse to nag about the implementation of iterators and their subtle impact on exceptions. Next time: generics co- and contra-variance in C# 4.0.

Del.icio.us | Digg It | Technorati | Blinklist | Furl | reddit | DotNetKicks

Filed under: , ,

Comments

# re: C# 4.0 Feature Focus - Part 3 - Intermezzo: LINQ's new Zip operator

Tuesday, November 04, 2008 1:03 AM by TraumaPony

So it's just basically a Func`3 version of map?

# re: C# 4.0 Feature Focus - Part 3 - Intermezzo: LINQ's new Zip operator

Tuesday, November 04, 2008 7:41 AM by Jelle Hissink

IEnumerable<T> may implement the IDisposable interface. Your wrapper should ensure Dispose() gets called on the contained enumerators (you need to implement IDisposable on your own enumerator).

For more information read msdn.microsoft.com/.../aa288257(VS.71).aspx.

Regards,

Jelle Hissink

# re: C# 4.0 Feature Focus - Part 3 - Intermezzo: LINQ's new Zip operator

Tuesday, November 04, 2008 11:37 AM by bart

Hi Jelle,

You're definitely right about this - rest assured our LINQ to Objects implementation does the right thing wherever required :-). There's another thing here that's missing, a null-check for the func parameter.

The LINQSQO project historically doesn't wrap the iterator blocks in using clauses; I have it on the to-do list for a future release though.

Thanks for the feedback,

-Bart

# re: C# 4.0 Feature Focus - Part 3 - Intermezzo: LINQ's new Zip operator

Tuesday, November 04, 2008 11:48 AM by bart

Hi TraumaPony,

You can think of it that way if you like. Zip could well be called Select, taking in another sequence as its argument. The most relevant fact here is that the shortest sequence determines the length of the outcome, i.e. the zipper function is not called with default(TShorterSequence) arguments.

Thanks,

-Bart

# Arjan`s World &raquo; LINKBLOG for November 4, 2008

Tuesday, November 04, 2008 1:40 PM by Arjan`s World » LINKBLOG for November 4, 2008

Pingback from  Arjan`s World    &raquo; LINKBLOG for November 4, 2008

# Reflective Perspective - Chris Alcock &raquo; The Morning Brew #216

Wednesday, November 05, 2008 12:21 AM by Reflective Perspective - Chris Alcock » The Morning Brew #216

Pingback from  Reflective Perspective - Chris Alcock  &raquo; The Morning Brew #216

# re: C# 4.0 Feature Focus - Part 3 - Intermezzo: LINQ's new Zip operator

Thursday, November 06, 2008 1:55 PM by David

Hi Bart,

Could you comment on Zip from a PLINQ perspective? Could we call list1.Zip(list2).AsParallel() with your implementation above, or do we have to worry about synchronization between the two IEnumerable(OfT) sequences?

David

# fastening for textiles | Bookmarks URL

Saturday, November 08, 2008 2:37 AM by fastening for textiles | Bookmarks URL

Pingback from  fastening for textiles | Bookmarks URL

# InvocationExpression and LINQ to Entities

Sunday, December 07, 2008 6:19 AM by meek

I talked a little bit about patterns using InvocationExpression in a previous post (you might want to

# C# 4.0 : New Extension Method “Zip”

Friday, February 27, 2009 7:26 PM by Wriju's BLOG

While browsing the features I found an interesting extension method called “Zip”. How it works, let’s

# PowerShell - Build a Zip Function, zipping two lists together

Saturday, February 28, 2009 5:03 AM by PowerShell - Build a Zip Function, zipping two lists together

Pingback from  PowerShell - Build a Zip Function, zipping two lists together

# C# 4.0 Feature Focus – Part 4 – Co- and Contra-Variance for Generic Delegate and Interface Types

Monday, April 13, 2009 12:00 PM by B# .NET Blog

Last time around in this series, I promised to talk about generic co- and contra-variance. So that’s

# C# 4.0 Feature Focus – Part 4 – Co- and Contra-Variance for Generic Delegate and Interface Types

Monday, April 13, 2009 12:00 PM by B# .NET Blog

Last time around in this series, I promised to talk about generic co- and contra-variance. So that’s

# ?????????????? ???????????????????????? ???????????????? ??Bizzy Burdy?? ????????????????, ????????????, ???????????????? ???????????? ????????????

Pingback from  ?????????????? ???????????????????????? ???????????????? ??Bizzy Burdy?? ????????????????, ????????????, ???????????????? ???????????? ????????????

# InvocationExpression and LINQ to Entities

Tuesday, June 09, 2009 11:19 PM by VS2010学习

I talked a little bit about patterns using InvocationExpression in a previous post (you might want to

# Social comments and analytics for this post

Friday, December 04, 2009 12:13 PM by uberVU - social comments

This post was mentioned on Twitter by JustinAngel: Silverlight 4 Hidden feature: Use LINQ's new Zip method to iterate over 2 collections simultaneously. http://is.gd/58hci

# Twitted by brian_henderson

Wednesday, December 09, 2009 10:46 AM by Twitted by brian_henderson

Pingback from  Twitted by brian_henderson

# JP Labs &raquo; Test

Tuesday, January 12, 2010 10:19 AM by JP Labs » Test

Pingback from  JP Labs &raquo; Test

# Lambda Expressions: Zip

Tuesday, May 04, 2010 9:33 PM by Deborah's Developer MindScape

One of the new features in C# 4 and VB 10 (.NET 4.0) is the Zip extension method on IEnumerable. This

# Comparing Two Open XML Documents using the Zip Extension Method

Tuesday, July 27, 2010 6:39 AM by Eric White's Blog

Sometimes we want to compare two word processing documents to see if they contain the same content. I&rsquo;m

# circles under eyes

Thursday, March 10, 2011 6:32 PM by circles under eyes

Pingback from  circles under eyes

# traffic travis

Monday, April 18, 2011 9:13 PM by traffic travis

Pingback from  traffic travis

# Right-hand side Enumerable.Zip

Monday, June 13, 2011 2:46 PM by Chasing state of the art

With Reactive Extensions for .NET (Rx) and .NET Framework 4 a new LINQ operator was introduced – Zip ( Bart De Smet gives excellent explanation about the idea and implementation details behind Zip operator ). In a nutshell it merges two sequences into

# How to unifiy two arrays in a dictionary? | Jptab

Thursday, November 14, 2013 8:02 AM by How to unifiy two arrays in a dictionary? | Jptab

Pingback from  How to unifiy two arrays in a dictionary? | Jptab