Angel \”Java\” Lopez on Blog

May 28, 2015

SparkSharp, Spark in C# (2) Implementing Map and Reduce

Filed under: .NET, C Sharp, Open Source Projects, Spark, SparkSharp — ajlopez @ 10:15 am

Previous Post

The project repo is at:

https://github.com/ajlopez/SparkSharp

I’m using TDD (Test-Driven Development) workflow to write the application, so the code is evolving with the use cases I added as tests. I only wrote the code needed to pass the tests. When I had new use cases, I will add new functionality. As an example, the original project has an Spark Context, with factory methods to create Datasets. I don’t need that class, yet. So in the current conde, the Datasets are created as public objects using new operator..

Born after a refactor, the abstract class for all Datasets is BaseDataset. Partial code:

public abstract class BaseDataset<T> : IEnumerable<T>
{
    public abstract IEnumerable<T> Elements { get; }

    public BaseDataset<S> Map<S>(Func<T, S> map)
    {
        return new EnumDataset<S>(this.ApplyMap(map));
    }

    public S Reduce<S>(Func<S, T, S> reduce)
    {
        S result = default(S);

        foreach (var elem in this)
            result = reduce(result, elem);

        return result;
    }
    
    // ...

    private IEnumerable<S> ApplyMap<S>(Func<T, S> map)
    {
        foreach (var elem in this)
            yield return map(elem);
    }
    
    // ...
}

The enumeration of the dataset elements should be implemented by the concrete subclass. The implementation of Map and Reduce is general, for all datasets. Those methods are defined in the abstract class. Thanks to C#, those methods can receive a lambda or a Func (a function).

In the ApplyMap method I’m using the C# yield operator to return an element suspending the executiong of the foreach. That command will resume when the consumer needs the next element of the enumerable collection. In this way, the generation of the elements is lazy, each element is produced only under demand. A note: C# has lambdas and delegate functions, and they are examples of good and useful features added to a programming language. In contrast, Java world has Scala, that in my opinion, it a bit “too much”. I prefer the evolution of C# instead of Scala.

There are no tests using the abstract class (it was born as a refactor), but they are tests on concrete ones. A test of Map method using EnumDataset (a Dataset that is a wrapper around an IEnumerable collection):

[TestMethod]
public void MapIncrement()
{
    EnumDataset<int> ds = new EnumDataset<int>(new int[] { 1, 2, 3 });
    BaseDataset<int> mapds = ds.Map(i => i + 1);
    var enumerator = mapds.GetEnumerator();

    for (int k = 1; enumerator.MoveNext(); k++)
        Assert.AreEqual(k + 1, enumerator.Current);

    Assert.AreEqual(3, mapds.Count());
}

And a Reduce test:

[TestMethod]
public void ReduceSum()
{
    EnumDataset<int> ds = new EnumDataset<int>(new int[] { 1, 2, 3 });
    var result = ds.Reduce<int>((x, y) => x + y);

    Assert.IsNotNull(result);
    Assert.AreEqual(6, result);
}

Next topics: more BaseDataset methods, concrete classes, datasets with keys, etc…

Stay tuned!

Angel “Java” Lopez

http://www.ajlopez.com

http://twitter.com/ajlopez

May 23, 2015

SparkSharp, Spark in C# (1) First Ideas

Filed under: .NET, C Sharp, Open Source Projects, Spark, SparkSharp — ajlopez @ 11:20 pm

Next Post

In these days, I visited Apache Spark project:

https://spark.apache.org/

And started to think about implementing some of its ideas in C#.

The original project has Datasets, that can be consumed, item by item, and processed by methods like map and reduce. A dataset can consume a text file, local files or distributed ones. The jobs to run over datasets, applying transformations, can be launched in many distributed nodes (I should review the consolidation of results).

I started a new C# project:

https://github.com/ajlopez/SparkSharp

To me, it is important to start with small steps, using TDD (Test-Driven Development) workflow. So, in my first commits, I wrote datasets that implement IEnumerable. They have methods like Map, Reduce, Split, Take, Skip. Those methods were implemented writing the tests that express the expected API and behavior.

A dataset can be a simple wrapper of any IEnumerable, or it can read a text file, reading lines.

All these datasets are local, reside in the same machine. My idea is to implement a dataset wrapper, to expose the dataset content to other machines, and write a client wrapper that runs in each machine. The client wrapper looks like a regular dataset, but when the client program needs the next item of the dataset, that item come from the remote original machine.

The remote dataset gives the next item to any client. Each item is delivered only to ONE client. So, the items can be consumed and processed by n remote clients, without having an item processed twice.

To implement such pair server/client, I should implement serialization/deserialization of an arbitrary type T. I will use my previous work in AjErl and Aktores to have such feature. Using TDD, I could assert the expected behavior of the serializaction/deserialization process. If in the future, I have a better idea for such process, like using an external robust open source serialization library, all the TDD tests will help me to make the switch.

But, baby steps. Next steps: improve current local datasets, maybe add a new variant of dataset, and write keyed datasets, created using MapToKey method (to implement)

Stay tuned!

Angel “Java” Lopez
http://www.ajlopez.com
http://twitter.com/ajlopez

Blog at WordPress.com.