Angel “Java” Lopez on Blog

May 28, 2015

SparkSharp, Spark in C# (2) Implementing Map and Reduce

Filed under: .NET, C Sharp, Open Source Projects, Spark, SparkSharp — ajlopez @ 10:15 am


The project repo is at:

https://github.com/ajlopez/SparkSharp

I’m using a TDD (Test-Driven Development) workflow to write the application, so the code evolves with the use cases I add as tests: I only write the code needed to make the tests pass, and when a new use case appears, I add the functionality it needs. As an example, the original project has a SparkContext class with factory methods to create datasets. I don’t need that class yet, so in the current code the datasets are created directly with the new operator.

Born after a refactor, the abstract class for all Datasets is BaseDataset. Partial code:

public abstract class BaseDataset<T> : IEnumerable<T>
{
    public abstract IEnumerable<T> Elements { get; }

    public BaseDataset<S> Map<S>(Func<T, S> map)
    {
        return new EnumDataset<S>(this.ApplyMap(map));
    }

    public S Reduce<S>(Func<S, T, S> reduce)
    {
        S result = default(S);

        foreach (var elem in this)
            result = reduce(result, elem);

        return result;
    }
    
    // ...

    private IEnumerable<S> ApplyMap<S>(Func<T, S> map)
    {
        foreach (var elem in this)
            yield return map(elem);
    }
    
    // ...
}

The enumeration of the dataset elements must be implemented by each concrete subclass. The implementation of Map and Reduce is generic, shared by all datasets, so those methods are defined in the abstract class. Thanks to C#, they can receive a lambda expression or any Func delegate.
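To make that shape concrete, here is a trimmed, self-contained sketch of how a concrete subclass can plug into the base class. The EnumDataset and the delegation through the Elements property are my assumptions based on the description in this post, not the project’s actual source:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

var ds = new EnumDataset<int>(new[] { 1, 2, 3 });

foreach (var n in ds)
    Console.WriteLine(n); // prints 1, 2, 3

// Trimmed base class sketch: enumeration is delegated to the Elements
// property that each concrete subclass must implement.
public abstract class BaseDataset<T> : IEnumerable<T>
{
    public abstract IEnumerable<T> Elements { get; }

    public IEnumerator<T> GetEnumerator()
    {
        return this.Elements.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return this.GetEnumerator();
    }
}

// Hypothetical concrete dataset wrapping an in-memory collection.
public class EnumDataset<T> : BaseDataset<T>
{
    private readonly IEnumerable<T> elements;

    public EnumDataset(IEnumerable<T> elements)
    {
        this.elements = elements;
    }

    public override IEnumerable<T> Elements
    {
        get { return this.elements; }
    }
}
```

The subclass only has to say where its elements come from; everything else (Map, Reduce, enumeration) lives in the base class.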

In the ApplyMap method I’m using the C# yield return statement to return an element while suspending the execution of the foreach loop; execution resumes when the consumer asks for the next element of the enumerable collection. In this way the generation of the elements is lazy: each element is produced only on demand. A note: C#’s lambdas and delegates are examples of good, useful features added to a programming language. In contrast, the Java world has Scala, which in my opinion is a bit “too much”. I prefer the evolution of C# over Scala.
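This laziness can be seen in isolation with a small standalone snippet (it uses LINQ’s Select in place of the project’s Map, and the names are just for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var log = new List<string>();

// Build the pipeline: nothing runs yet, the iterator body is not executed.
var mapped = Numbers().Select(n => n * 10);

// Elements are produced only while the consumer iterates, one at a time.
foreach (var n in mapped)
    log.Add("consumed " + n);

Console.WriteLine(string.Join(" | ", log));
// producing 1 | consumed 10 | producing 2 | consumed 20

IEnumerable<int> Numbers()
{
    log.Add("producing 1");
    yield return 1;
    log.Add("producing 2");
    yield return 2;
}
```

Note the interleaving: each element is produced, mapped, and consumed before the next one is generated, so the whole collection never needs to exist in memory at once.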

There are no tests using the abstract class (it was born from a refactor), but there are tests on the concrete ones. Here is a test of the Map method using EnumDataset (a dataset that wraps an IEnumerable collection):

[TestMethod]
public void MapIncrement()
{
    EnumDataset<int> ds = new EnumDataset<int>(new int[] { 1, 2, 3 });
    BaseDataset<int> mapds = ds.Map(i => i + 1);
    var enumerator = mapds.GetEnumerator();

    for (int k = 1; enumerator.MoveNext(); k++)
        Assert.AreEqual(k + 1, enumerator.Current);

    Assert.AreEqual(3, mapds.Count());
}

And a Reduce test:

[TestMethod]
public void ReduceSum()
{
    EnumDataset<int> ds = new EnumDataset<int>(new int[] { 1, 2, 3 });
    var result = ds.Reduce<int>((x, y) => x + y);

    Assert.IsNotNull(result);
    Assert.AreEqual(6, result);
}
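For comparison, the same Map-then-Reduce pipeline can be expressed with the standard LINQ operators Select and Aggregate, which these dataset methods closely mirror (this is just an illustrative equivalent, not part of the project):

```csharp
using System;
using System.Linq;

var source = new int[] { 1, 2, 3 };

// Map: increment each element (lazy, like the yield-based ApplyMap above).
var mapped = source.Select(i => i + 1);

// Reduce: fold the elements into one value; the explicit 0 seed matches
// the default(S) seed used by the Reduce method in BaseDataset.
var sum = mapped.Aggregate(0, (acc, n) => acc + n);

Console.WriteLine(sum); // prints 9 (2 + 3 + 4)
```

One design note on that seed: Reduce starts the accumulator at default(S) (0 for int, null for reference types), so a reducer over reference types has to tolerate a null first argument on the first call.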

Next topics: more BaseDataset methods, concrete classes, datasets with keys, etc…

Stay tuned!

Angel “Java” Lopez

http://www.ajlopez.com

http://twitter.com/ajlopez
