Angel “Java” Lopez on Blog

March 9, 2009

Distributed Web Crawler using AjMessages

Filed under: .NET, AjMessages, C Sharp — ajlopez @ 8:09 am

Last January, I updated my AjMessages project to support communication using DSS/CCR and Windows Communication Foundation. AjMessages is a program that can run in a distributed way, sending asynchronous messages from one logical node to another. The logical nodes can be hosted on one or more physical machines. More info about AjMessages at:

AjMessages- a message processor
Distributed Applications with AjMessages using DSS/CCR

You can download the source code from

http://www.codeplex.com/ajmessages

I had the idea of writing a distributed message processor. Years ago, I wrote my first attempt, named AjServer:

Hacia el AjServer (Spanish Post)

After that project, I came across the Fabriq ideas and implementation, where logical message handlers can be distributed in a transparent way, via configuration.

In 2007, I wrote the first version of AjMessages as a proof of concept: I could reproduce most of the Fabriq ideas, while trying to raise the level of abstraction a bit, in order to have a little more flexibility. Fabriq was more SOA-oriented. AjMessages is oriented to arbitrary message processing, to support grid-like applications.

This is the current AjMessages solution:

There is a core project and two transport projects: one for WCF support, and another for DSS/CCR (Microsoft Robotics technology). If you don’t have DSS/CCR, you can remove the related projects. The core of AjMessages has no dependencies on those technologies. AjMessages.SampleApp contains a simple handler: a decrement handler that takes a message with an integer and produces a new message with the integer decremented. AjMessages.WebCrawler implements the messages and handlers that make up a distributed web crawler application. You can run it on a single machine or on many hosts.
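To make the decrement handler concrete, here is a minimal sketch in Python (not the project's C#, and not the actual AjMessages.SampleApp code). The dictionary message shape, the function names, and the rule of stopping at zero are assumptions made for illustration only.

```python
from collections import deque


def decrement_handler(message):
    """Return a new message whose integer body is one less than the input's."""
    return {**message, "body": message["body"] - 1}


def run_until_zero(first_message):
    """Re-post each produced message until the counter reaches zero.

    The stop rule is an assumption; the post only says that the handler
    produces a new message with the decremented integer.
    """
    pending = deque([first_message])
    bodies = []
    while pending:
        message = pending.popleft()
        bodies.append(message["body"])
        if message["body"] > 0:
            pending.append(decrement_handler(message))
    return bodies


print(run_until_zero({"body": 3}))  # prints [3, 2, 1, 0]
```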

Each AjMessage message has:

Body: an arbitrary payload
Headers: additional information, key-value.
Action: describing the target of the message, using pattern Application/Node/Action

(more detail in the mentioned post AjMessages- a message processor)
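The three parts of a message can be pictured with a small Python sketch. This is only an illustration of the Body/Headers/Action shape and of the Application/Node/Action pattern; the class names, the `parse_action` helper, and the `Depth` header are hypothetical and are not taken from the AjMessages source.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, NamedTuple


class Target(NamedTuple):
    application: str
    node: str
    action: str


def parse_action(action: str) -> Target:
    """Split an 'Application/Node/Action' string into its three parts."""
    parts = action.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Application/Node/Action, got {action!r}")
    return Target(*parts)


@dataclass
class Message:
    action: str                                   # target of the message
    body: Any = None                              # arbitrary payload
    headers: Dict[str, str] = field(default_factory=dict)  # key-value extras

    @property
    def target(self) -> Target:
        return parse_action(self.action)


message = Message(
    action="WebCrawler/Dispatcher/Dispatch",
    body="http://ajlopez.zoomblog.com",
    headers={"Depth": "0"},  # hypothetical header, for illustration only
)
print(message.target.node)  # prints Dispatcher
```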

Let’s explore the distributed web crawler example. This application is composed of nodes that can be orchestrated to visit a link and fetch, from that initial page, all the links and related pages, up to a given depth. Logically, it can be described as:

The first message is sent to the node Dispatcher, action Dispatch. It contains the initial page to visit. The message is enriched and passed to the node Controller, action Resolve. This action is in charge of keeping a list of visited pages and of controlling the depth of processing. If the link is approved, a message is sent to the node Downloader, action Download. The content of the page is downloaded, added to the message, and forwarded to the node Harvester, action Harvest. This action analyzes the content of the page and emits many messages, one for each link found in it. The recipient of the new links is the node Controller, action Resolve. The process continues until all pages have been visited (with a limit on the depth of processing).
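The flow of messages between the four actions can be simulated in a single process. The sketch below is Python, not the AjMessages handlers: it uses an in-memory queue instead of real transports, a made-up dictionary of pages instead of HTTP downloads, and a simple regular expression for link extraction. All names and the `max_depth` parameter are assumptions.

```python
import re
from collections import deque

# Hypothetical in-memory "web" (URL -> HTML) so the sketch runs offline.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a> <a href="http://d.example/">d</a>',
    "http://c.example/": "no links here",
    "http://d.example/": '<a href="http://e.example/">e</a>',
}

LINK_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(start_url, max_depth):
    visited = []      # list of visited pages, kept by the Resolve step
    queue = deque()   # stands in for asynchronous message delivery

    def post(action, body):
        queue.append((action, body))

    def dispatch(body):
        # Enrich the initial message with a depth counter and pass it on.
        post("Resolve", {"url": body["url"], "depth": 0})

    def resolve(body):
        # Approve a link only if it is new and within the depth limit.
        if body["url"] in visited or body["depth"] > max_depth:
            return
        visited.append(body["url"])
        post("Download", body)

    def download(body):
        # Add the page content to the message and forward it.
        post("Harvest", {**body, "content": PAGES.get(body["url"], "")})

    def harvest(body):
        # Emit one message per link found in the page.
        for link in LINK_PATTERN.findall(body["content"]):
            post("Resolve", {"url": link, "depth": body["depth"] + 1})

    handlers = {"Dispatch": dispatch, "Resolve": resolve,
                "Download": download, "Harvest": harvest}

    post("Dispatch", {"url": start_url})
    while queue:
        action, body = queue.popleft()
        handlers[action](body)
    return visited


print(crawl("http://a.example/", max_depth=1))
```

With `max_depth=1` the sketch visits the start page and the pages it links to directly; links found on those pages are rejected by the Resolve step.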

This application can be described using an XML configuration file. An example:

<?xml version="1.0" encoding="utf-8" ?>
<AjMessages>
  <Application Name="WebCrawler">
    <Node Name="Dispatcher">
      <Handler Name="DispatcherHandler" Type="AjMessages.WebCrawler.Handlers.Dispatcher, AjMessages.WebCrawler"/>
      <Action Name="Dispatch" Handler="DispatcherHandler"/>
    </Node>
    <Node Name="Harvester">
      <Handler Name="HarvesterHandler" Type="AjMessages.WebCrawler.Handlers.Harvester, AjMessages.WebCrawler"/>
      <Action Name="Harvest" Handler="HarvesterHandler"/>
    </Node>
    <Node Name="Downloader">
      <Handler Name="DownloaderHandler" Type="AjMessages.WebCrawler.Handlers.Downloader, AjMessages.WebCrawler"/>
      <Action Name="Download" Handler="DownloaderHandler"/>
    </Node>
    <Node Name="Controller">
      <Handler Name="ResolverHandler" Type="AjMessages.WebCrawler.Handlers.Resolver, AjMessages.WebCrawler"/>
      <Action Name="Resolve" Handler="ResolverHandler"/>
    </Node>
  </Application>
</AjMessages>

An application is composed of nodes. Each node can be viewed as a “logical class”. Each node handles actions, which are the targets of messages. An action can be composed of one or more steps, each one a message handler. This is the extensibility point of the application: you must provide the steps (the message handlers) and write a configuration file to orchestrate the message processing.
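One way to read the application configuration is as a lookup table from a message target to the handler type that serves it. The following Python sketch builds that table with the standard `xml.etree.ElementTree` module from a shortened copy of the configuration. It is an illustration of the file's structure, not how AjMessages loads it; `load_routes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the application configuration (two of the four nodes).
CONFIG = """
<AjMessages>
  <Application Name="WebCrawler">
    <Node Name="Dispatcher">
      <Handler Name="DispatcherHandler" Type="AjMessages.WebCrawler.Handlers.Dispatcher, AjMessages.WebCrawler"/>
      <Action Name="Dispatch" Handler="DispatcherHandler"/>
    </Node>
    <Node Name="Controller">
      <Handler Name="ResolverHandler" Type="AjMessages.WebCrawler.Handlers.Resolver, AjMessages.WebCrawler"/>
      <Action Name="Resolve" Handler="ResolverHandler"/>
    </Node>
  </Application>
</AjMessages>
"""


def load_routes(xml_text):
    """Map 'Application/Node/Action' to the Type string of its handler."""
    routes = {}
    root = ET.fromstring(xml_text)
    for application in root.findall("Application"):
        for node in application.findall("Node"):
            # Handlers are declared per node and referenced by name from actions.
            types = {h.get("Name"): h.get("Type") for h in node.findall("Handler")}
            for action in node.findall("Action"):
                key = "/".join(
                    [application.get("Name"), node.get("Name"), action.get("Name")]
                )
                routes[key] = types[action.get("Handler")]
    return routes


for target, handler_type in load_routes(CONFIG).items():
    print(target, "->", handler_type)
```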

But the application is one thing, and its distribution over physical machines is another. You can have two hosts, and install different logical nodes on each one:

In this diagram, the Dispatcher and Resolver are in one host, and the Downloader and Harvester in the other one. But you can put harvesters in each host, or in twenty machines. It’s up to you how to distribute the workload. When a message is sent to a target, AjMessages forwards it to an appropriate host, one that has a node capable of handling the message’s action.

The distribution of logical nodes across physical hosts is defined via configuration. An example:

<?xml version="1.0" encoding="utf-8" ?>
<AjMessages>
  <Host Name="Server1" Address="http://localhost:50002/AjMessages" Activate="true">
    <Application Name="AjMessages">
      <Node Name="Administration"/>
    </Application>
    <Application Name="WebCrawler">
      <Node Name="Controller"/>
      <Node Name="Dispatcher"/>
      <Node Name="Harvester"/>
      <Node Name="Downloader"/>
    </Application>
  </Host>
  <Host Name="Server2" Address="http://localhost:50003/AjMessages">
    <Application Name="AjMessages">
      <Node Name="Administration"/>
    </Application>
    <Application Name="WebCrawler">
      <Node Name="Dispatcher"/>
      <Node Name="Harvester"/>
      <Node Name="Downloader"/>
    </Application>
  </Host>
</AjMessages>
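The host configuration can be turned into a table that answers the question "which hosts can attend this target?". The Python sketch below does that for the WebCrawler part of the configuration. The post does not say how AjMessages chooses among several capable hosts, so the round-robin policy here is purely an assumption, and `load_placements` and `Router` are hypothetical names.

```python
import itertools
import xml.etree.ElementTree as ET

# WebCrawler part of the host configuration.
HOSTS = """
<AjMessages>
  <Host Name="Server1" Address="http://localhost:50002/AjMessages" Activate="true">
    <Application Name="WebCrawler">
      <Node Name="Controller"/>
      <Node Name="Dispatcher"/>
      <Node Name="Harvester"/>
      <Node Name="Downloader"/>
    </Application>
  </Host>
  <Host Name="Server2" Address="http://localhost:50003/AjMessages">
    <Application Name="WebCrawler">
      <Node Name="Dispatcher"/>
      <Node Name="Harvester"/>
      <Node Name="Downloader"/>
    </Application>
  </Host>
</AjMessages>
"""


def load_placements(xml_text):
    """Map 'Application/Node' to the addresses of the hosts that run it."""
    placements = {}
    for host in ET.fromstring(xml_text).findall("Host"):
        for application in host.findall("Application"):
            for node in application.findall("Node"):
                key = f"{application.get('Name')}/{node.get('Name')}"
                placements.setdefault(key, []).append(host.get("Address"))
    return placements


class Router:
    """Pick a host for each message target (assumed policy: round-robin)."""

    def __init__(self, placements):
        self._cycles = {
            key: itertools.cycle(addresses) for key, addresses in placements.items()
        }

    def host_for(self, action):
        application, node, _ = action.split("/")
        return next(self._cycles[f"{application}/{node}"])


router = Router(load_placements(HOSTS))
print(router.host_for("WebCrawler/Harvester/Harvest"))
print(router.host_for("WebCrawler/Harvester/Harvest"))
print(router.host_for("WebCrawler/Controller/Resolve"))
```

In this configuration only Server1 hosts the Controller node, so every Resolve message goes to the same host, while Harvest messages can alternate between the two.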

Running the Web Crawler example

You can test the program by launching two hosts on the same machine. Run the console program AjMessage.Console. Then enter the command

fork

This command launches a second host. Now, you are ready to configure the two hosts. In the first one, enter:

load ConfigurationServer1.xml

This file loads the initial AjMessages application and defines a WCF endpoint to listen for messages. Then, enter:

load ConfigurationWebCrawler.xml

This command loads the Web Crawler node definitions (no deployment info yet). The third command defines the distribution across hosts:

load ConfigurationWebCrawlerNode1.xml

Now, go back to the second host’s console. Enter the corresponding commands:

load ConfigurationServer2.xml
load ConfigurationWebCrawler.xml
load ConfigurationWebCrawlerNode2.xml

You are ready to launch the first web crawl. Go to the first console, and enter:

send WebCrawler/Dispatcher/Dispatch http://ajlopez.zoomblog.com

(that is my Spanish non technical blog… ;-)

The web crawl starts running in both consoles. A typical view on the first console:

Next steps

I should fix some problems in the DSS/CCR transport (I could run a simpler app, but not the web crawler). Some points to implement:

- Remote configuration of nodes
- Distribution of message handlers bits to remote nodes
- One configuration for all the nodes (now, each node is configured separately)

I began to experiment with a more abstract way of processing messages, one that should support use cases like:

- A simple hello world
- Decrement example, as in the AjMessages sample app (see AjMessages.SampleApp)
- Distributed web crawler
- Enterprise Service Bus-like app

My first steps at:

http://code.google.com/p/ajcodekatas/source/browse#svn/trunk/AjProcessor

But the project is still in its infancy. Stay tuned!

Angel “Java” Lopez
http://www.ajlopez.com/
http://twitter.com/ajlopez

5 Comments »

  1. An ode to Fabriq. Thanks!

    BTW I like your other post on WCF complexity. It’s solving complex issues, but really ought to be simpler. Anyhow, I really hope Oslo doesn’t go the same way (some of the same chiefs are now working on that). I’m sticking with DSS.

    Comment by Arvindra Sehmi — March 10, 2009 @ 7:12 pm

  2. Interesting blogg- thanks !

    I am currently reading up on this, having been in web admin a loooong time ago, 1996, when the UNIX cluster at the Uni did these tasks to create a web nanny of banned domains every night..

    since then it has been marketing for me. Funny how things come full circle. Good luck with your work

    Comment by PB StreetGang — October 29, 2009 @ 9:59 pm

  3. [...] Distributed Web Crawler using AjMessages [...]

    Pingback by Web Crawler using Agents and AjSharp « Angel “Java” Lopez on Blog — February 22, 2010 @ 10:33 am

  4. [...] Distributed Web Crawler using AjMessages Web Crawler distribuido usando AjMessages [...]

    Pingback by Web Crawler usando Agentes y AjSharp - Angel "Java" Lopez — February 23, 2010 @ 10:16 am

  5. [...] Web Crawler Using AjAgents and AjSharp Distributed Web Crawler using AjMessages [...]

    Pingback by Web Crawler using the new AjAgents « Angel “Java” Lopez on Blog — November 6, 2010 @ 11:13 am

