Angel \”Java\” Lopez on Blog

May 23, 2015

SparkSharp, Spark in C# (1) First Ideas

Filed under: .NET, C Sharp, Open Source Projects, Spark, SparkSharp — ajlopez @ 11:20 pm

Next Post

In these days, I visited Apache Spark project:

https://spark.apache.org/

And started to think about implementing some of its ideas in C#.

The original project has Datasets, that can be consumed, item by item, and processed by methods like map and reduce. A dataset can consume a text file, local files or distributed ones. The jobs to run over datasets, applying transformations, can be launched in many distributed nodes (I should review the consolidation of results).

I started a new C# project:

https://github.com/ajlopez/SparkSharp

To me, it is important to start with small steps, using TDD (Test-Driven Development) workflow. So, in my first commits, I wrote datasets that implement IEnumerable. They have methods like Map, Reduce, Split, Take, Skip. Those methods were implemented writing the tests that express the expected API and behavior.

A dataset can be a simple wrapper of any IEnumerable, or it can read a text file, reading lines.

All these datasets are local, reside in the same machine. My idea is to implement a dataset wrapper, to expose the dataset content to other machines, and write a client wrapper that runs in each machine. The client wrapper looks like a regular dataset, but when the client program needs the next item of the dataset, that item come from the remote original machine.

The remote dataset gives the next item to any client. Each item is delivered only to ONE client. So, the items can be consumed and processed by n remote clients, without having an item processed twice.

To implement such pair server/client, I should implement serialization/deserialization of an arbitrary type T. I will use my previous work in AjErl and Aktores to have such feature. Using TDD, I could assert the expected behavior of the serializaction/deserialization process. If in the future, I have a better idea for such process, like using an external robust open source serialization library, all the TDD tests will help me to make the switch.

But, baby steps. Next steps: improve current local datasets, maybe add a new variant of dataset, and write keyed datasets, created using MapToKey method (to implement)

Stay tuned!

Angel “Java” Lopez
http://www.ajlopez.com
http://twitter.com/ajlopez

Blog at WordPress.com.

%d bloggers like this: