You can see and download my presentation from my SkyDrive.
My idea was to show some concepts and some experiments I did with Node.js, porting previous work from other technologies, and to show code with concrete examples.
First, I explained what a distributed application is, in the context of my talk. A distributed application is an application that executes on many computers, linked by a network, that interact to solve a problem. See http://en.wikipedia.org/wiki/Distributed_computing
It goes beyond the classic client/server model. We could have many computers crawling a site, or resolving the rendering of a 3D movie. The computers that participate in the run don't need to have the same software or program: maybe some of them work only on some step of the problem. Nor do they need to share the same platform or programming language.
I thought of objects being like biological cells and/or individual computers on a network, only able to communicate with messages
Until real software engineering is developed, the next best practice is to develop with a dynamic system that has extreme late binding in all aspects
The big idea is “messaging”
It is interesting to see Kay mentioning biological cells: our execution nodes can be seen as cells that form an organism, our distributed application. As Kay says, the big topic is "messaging", the sending of messages between the cells. Distributed applications give us the chance to forget the concept of function/subroutine, where we wait for a result, and even to forget the callback: an element of our distributed application sends a message without expecting an answer.
Feynman was once asked by a Caltech faculty member to explain why spin 1/2 particles obey Fermi-Dirac statistics. He gauged his audience perfectly and said "I’ll prepare a freshman lecture on it." But a few days later he returned and said, "You know, I couldn’t do it. I couldn’t reduce it to the freshman level. That means we really don’t understand it."
Adapting his phrase, I could say "if we cannot explain it, then we don't understand it", as a personal motivation to give talks. Feynman studied on his own the quantum physics of his time; see my Spanish post Feynman por Dyson. I take his experience as a light justification to experiment with code, instead of directly adopting what is already built.
Questions we have to answer when we build a distributed application:
What elements of our application create messages?
What elements consume messages?
Will a message be consumed by only one element, or can it be processed by many interested elements?
Are the machines and programs to execute determined from the start, or can they change? An example: I would like to add forty machines to my distributed application at night. Can that be done? Do we need such a dynamic feature?
Is a message generated by an element on one machine consumed locally, or can it travel to another machine? Which part decides that? The programmer? A supervisor program?
And given a message that should reach another machine, how does that work? What transport is used?
If a message should be processed remotely, how is the target machine chosen? By the program? By load balancing?
These questions are answered in different ways, depending on each experiment or example.
A recurring example is a web crawler, logically composed as follows:
To retrieve the pages of a website, we first send an initial URL to a Resolver element. This element keeps a list of the already visited pages. If the URL points to a page not yet processed, that URL is sent to a Downloader element. This logical element could be implemented on many machines. Its responsibility is to retrieve the content of a page. Then it sends the retrieved content to a Harvester element, which extracts the links in that page. Those links are sent back to the Resolver, which examines whether each link belongs to the initial website and whether the page was already processed.
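The message flow between the three elements can be sketched in-process; this is only an illustration of the logic (the element names come from the description above, but `createCrawler` and `fetchPage` are hypothetical names, not the code of my modules):

```javascript
// In-process sketch of the crawler elements.
// Each element is just a function that receives a message and may send new ones.
function createCrawler(fetchPage) {
  const visited = new Set();
  const pages = [];

  // Resolver: keeps the list of already visited pages.
  function resolver(url) {
    if (visited.has(url)) return;   // already processed, drop it
    visited.add(url);
    downloader(url);                // send the URL to a Downloader
  }

  // Downloader: retrieves the content of a page.
  function downloader(url) {
    const content = fetchPage(url); // injected, so the sketch stays offline
    harvester(url, content);
  }

  // Harvester: extracts the links and sends them back to the Resolver.
  function harvester(url, content) {
    pages.push(url);
    const links = content.match(/href="([^"]+)"/g) || [];
    for (const link of links)
      resolver(link.slice(6, -1));  // strip the href="..." wrapper
  }

  return { start: resolver, pages };
}
```

In a real distributed run, each of the three functions would live on a different machine and the direct calls would become messages over the network.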
To start with my experiments, I wrote a module, ObjectStream (it was not the first one I wrote, but it emerged from a refactor):
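The core idea, sending objects over a byte stream, can be sketched as newline-delimited JSON. This is a simplified illustration of that framing, not the actual module code:

```javascript
// Sketch of object-over-stream framing: one JSON document per line.
// frame() turns an object into wire data; ObjectParser rebuilds objects
// from arbitrary chunk boundaries (a chunk may hold a partial object).
function frame(obj) {
  return JSON.stringify(obj) + '\n';
}

class ObjectParser {
  constructor() { this.buffer = ''; }

  // Feed a chunk of wire data; returns the complete objects found so far.
  feed(chunk) {
    this.buffer += chunk;
    const lines = this.buffer.split('\n');
    this.buffer = lines.pop();        // the last piece may be incomplete
    return lines.filter(l => l.length).map(JSON.parse);
  }
}
```

The important detail is the buffering: a TCP socket delivers chunks, not messages, so the parser must tolerate an object split across two chunks.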
Based on ObjectStream, I wrote:
It allows the communication of two processes, using bidirectional channels to interchange messages. You can launch a server and many clients. After the connection is established, the message interchange is symmetric: either part (client or server) can send a message to the other. It is the basis for writing distributed applications.
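The symmetric contract can be shown with an in-process sketch (the real module does this over TCP sockets; `createChannelPair` and its API are illustrative names, not the module's actual API):

```javascript
// In-process sketch of a symmetric message channel: once connected,
// either endpoint can send a message and the other endpoint receives it.
function createChannelPair() {
  const a = { handlers: [] };
  const b = { handlers: [] };
  a.peer = b; b.peer = a;
  for (const end of [a, b]) {
    end.onMessage = fn => end.handlers.push(fn);
    end.send = msg => end.peer.handlers.forEach(fn => fn(msg));
  }
  return [a, b];
}
```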
But sometimes we need more than sending a simple message. For example, we need to call an object that is running on another machine. Then we need a Remote Procedure Call (RPC). I wrote a module:
Based on SimpleMessages, the client and the server can each expose an object, and the other party can invoke it as if it were a local object. The return value is delivered asynchronously: the programmer should provide a callback to process the answer of the call.
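The callback-based calling convention looks like this sketch; here the "remote" side is a local object and the round trip is simulated, so the example only illustrates the contract, not SimpleRemote's implementation:

```javascript
// Sketch of callback-style RPC stubs: for each method of the exposed
// target, the stub offers a method whose last argument is a callback
// receiving (error, result), as a remote call cannot return directly.
function createStub(target) {
  const stub = {};
  for (const name of Object.keys(target)) {
    stub[name] = (...args) => {
      const callback = args.pop();   // by convention, the last argument
      try {
        callback(null, target[name](...args));
      } catch (err) {
        callback(err);
      }
    };
  }
  return stub;
}
```

In the real module the arguments and the result travel as messages between processes, but the caller's code looks the same.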
There are many alternatives in Node.js module ecosystem. See:
Dnode has clients for different languages. You can also check something more agnostic: http://en.wikipedia.org/wiki/JSON-RPC
I had a use case where I needed not only to send a message from machine A to machine B, but also to broadcast the message. Then, I wrote:
In this way, a client can send a message to a server, and then the server can distribute the message to the other connected clients. The received message is not sent back to the original client, only to the rest of the clients. You could have a "star" of servers: each server attends its own clients and acts as another client of the other servers.
This arrangement allows scaling by adding servers to the star. It also has the concept of a repeater server (see code and tests). In this module I had the collaboration of Fernando Lores.
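The relay behavior of a single server can be sketched in-process (illustrative names; the real module works over sockets):

```javascript
// Sketch of broadcast: the server relays each message to every
// connected client except the one that sent it.
function createBroadcastServer() {
  const clients = [];
  return {
    connect(onMessage) {
      const client = {
        onMessage,
        send(msg) {
          for (const other of clients)
            if (other !== client) other.onMessage(msg);
        }
      };
      clients.push(client);
      return client;
    }
  };
}
```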
In other use cases, instead of sending messages directly to their destination, I needed message processors to consume them from a queue. I could have used a queue written in another language, but I wanted to stick with Node.js, so I wrote:
At the beginning, I implemented an in-memory queue, totally local. Then I exposed it to other machines using SimpleRemote.
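The heart of such a queue is small; this sketch (hypothetical API, not the module's code) shows the two symmetric cases, a message waiting for a consumer and a consumer waiting for a message:

```javascript
// Sketch of an in-memory queue: messages wait until a consumer asks
// for them, and pending consumers wait until a message arrives.
function createQueue() {
  const messages = [];
  const consumers = [];
  return {
    push(message) {
      if (consumers.length) consumers.shift()(message);
      else messages.push(message);
    },
    consume(callback) {
      if (messages.length) callback(messages.shift());
      else consumers.push(callback);
    }
  };
}
```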
Another use case I had was: I need to publish messages, but I don't know which elements are interested in them. It is like Node.js events: an element emits an event, but it does not care who is subscribed to it. Thinking about this case, I wrote:
The subscribers and the publishers can reside in different machines.
In the talk, I showed an example, Market, where the subscription to messages can be described with a "query by example", that is, by giving a message with the field values we are interested in. Or we can give a predicate to indicate which messages we want to receive. The predicate can be defined on one machine and sent to another machine.
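Both subscription styles can be sketched together; this is an illustration of the idea, not the Market example's code (field names like `stock` are made up):

```javascript
// Sketch of publish/subscribe with "query by example": a subscription is
// an example object, and a message matches when it has the same values
// for every field present in the example. A predicate function also works.
function createPubSub() {
  const subscriptions = [];
  const matches = (filter, msg) =>
    typeof filter === 'function'
      ? filter(msg)
      : Object.keys(filter).every(key => msg[key] === filter[key]);
  return {
    subscribe(filter, handler) { subscriptions.push({ filter, handler }); },
    publish(msg) {
      for (const sub of subscriptions)
        if (matches(sub.filter, msg)) sub.handler(msg);
    }
  };
}
```

Note that an example object serializes trivially, so it can travel between machines; shipping a predicate requires sending code, which is a stronger demand on the system.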
Most of the use cases I tackled were motivated by implementations in other technologies. The work most influential on my experiments was Fabriq, by @asehmi:
See Remember Fabriq.
Based on that project, I wrote AjFabriqNode:
AjFabriqNode is based on having many executing nodes, each of which declares a tree of message processors. Based on the message content, a message processor is chosen. Each node can connect to other nodes in the network, and they interchange which message processors are available in each node. Then we can run a distributed web crawler where one node hosts a Resolver, and the others run Downloaders and Harvesters.
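Choosing a processor by message content can be sketched like this; the two-level tree and the `application`/`processor` field names are illustrative, not the actual AjFabriqNode schema:

```javascript
// Sketch of routing a message down a tree of processors, choosing the
// branch by message content.
function createNode(tree) {
  return {
    process(message) {
      const app = tree[message.application];
      const processor = app && app[message.processor];
      if (!processor) return false;   // this node cannot handle the message
      processor(message);
      return true;
    }
  };
}
```

A node that returns `false` would forward the message to another node that advertised the missing processor.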
In other scenarios, you prefer to send the message directly to a specific element. Based on the actor implementation of Akka/Scala, see:
I implemented something simpler:
In Akka, a message is sent to an actor, identified by its address.
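A minimal in-process sketch of that address-based style (loosely after Akka's model; the names and the synchronous mailbox are simplifications, not my module's code):

```javascript
// Sketch of address-based actors: each actor has an address and a
// mailbox, and messages are sent by address rather than by holding a
// direct reference to the object.
function createActorSystem() {
  const actors = new Map();
  return {
    spawn(address, behavior) {
      actors.set(address, { behavior, mailbox: [] });
    },
    send(address, message) {
      const actor = actors.get(address);
      if (!actor) return;             // unknown address: drop the message
      actor.mailbox.push(message);
      // Process the mailbox one message at a time.
      while (actor.mailbox.length)
        actor.behavior(actor.mailbox.shift());
    }
  };
}
```

Because only the address travels, the actor behind it could later live in another process or machine without changing the sender's code.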
There is a Java project, Storm:
There you can build a topology with logical elements that receive and emit tuples (messages). There are tuple producers, the Spouts, and tuple processors, the Bolts, which can also emit more tuples. Each logical element can be executed in many physical machines/processes. So we can have a web crawler topology with a Resolver, a Downloader, and a Harvester, but with 20 Downloaders in 10 machines, and 5 Harvesters in the same machines or others.
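The "one logical bolt, many physical instances" idea can be sketched in-process; the round-robin choice stands in for instances spread over machines, and all names here are illustrative, not Storm's API:

```javascript
// Sketch of a Storm-like topology: tuples emitted to a named bolt are
// handed to one of its instances, chosen round-robin.
function createTopology() {
  const bolts = {};
  return {
    addBolt(name, instanceCount, process) {
      const instances = [];
      for (let id = 0; id < instanceCount; id++)
        instances.push({ id, process });
      bolts[name] = { next: 0, instances };
    },
    emit(name, tuple) {
      const bolt = bolts[name];
      const instance = bolt.instances[bolt.next];
      bolt.next = (bolt.next + 1) % bolt.instances.length;
      // An instance may emit further tuples to other bolts.
      instance.process(instance.id, tuple, (target, t) => this.emit(target, t));
    }
  };
}
```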
So I wrote something along those lines, for Node.js:
After some experiments, I saw the common needs of such applications. Many times, I want to seed code to many machines in the network, and then interchange messages. So I wrote:
A new node added to the network dynamically receives the code to execute, and the information related to the other nodes.
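A toy illustration of the code-seeding idea, assuming the code travels as source text (a real system would also need sandboxing and versioning; this is only the concept, not my module's implementation):

```javascript
// Sketch of a worker node that receives processor code dynamically:
// the source arrives as text and is compiled into a function.
function createWorkerNode() {
  const processors = {};
  return {
    receiveCode(name, source) {
      // Compile the received source into a message processor.
      processors[name] = new Function('message', source);
    },
    process(name, message) {
      return processors[name](message);
    }
  };
}
```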
Two final examples:
An implementation of a distributed genetic algorithm:
And a fractal that can be calculated in the browser, or using a Node.js+Express server, alone or helped by many additional worker nodes:
At the end of the talk, I mentioned many reasons to research distributed applications: it's an interesting topic, with practical uses (big data, distributed computation, parallel computation, etc.). But I put emphasis on distributed applications and artificial intelligence: our nervous system runs as a distributed application of neurons (remember Alan Kay talking about biological cells communicating with messages).
More distributed app experiments are coming.