Web Crawler example using DSS (Decentralized Software Services)

Some weeks ago, I wrote an example Scribble app using DSS. Today, I wrote another example: a web crawler application managed from Visual Programming Language (VPL). There is an additional VPL program that reads a web page aloud, using a Text to Speech control. You can download the example from my Skydrive:

DssWebCrawler2008May.zip

The solution

The solution has three projects: a DSS assembly, a class library with some utilities to parse HTML content, and a class library that tests the HTML parser. The test library uses NUnit. You can remove it from the solution if you want: it is only used for testing and is not needed in the final solution.

The DSS assembly is named DssWebCrawler. I defined five DSS service components:

Dispatcher: It receives an initial URL to download and dispatches it to the Resolver.

Resolver: This service maintains a list of visited URLs and checks that each new URL to download is valid and belongs to the same domain as the first page. It has a hardcoded maximum depth of 3 levels of links to explore.

Downloader: This service downloads the content of a URL. The content is returned as part of the response message.

Harvester: It examines the received content and harvests new URLs to examine and download. For each of these URLs, it sends a notification to any service interested in that information.

Reader: It uses the simple HTML parser I wrote for this project. It can obtain the title of the page, or the body, discarding HTML tags and scripts.
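As a rough illustration of the Resolver rules listed above, here is a minimal sketch in plain Python. It is not the DSS code from the download (that is a C# service); the class name `Resolver`, the method `should_download`, and the convention that the first page sits at depth 0 are assumptions made only for this example.

```python
from urllib.parse import urlparse

MAX_DEPTH = 3  # the post mentions a hardcoded limit of 3 levels of links


class Resolver:
    """Hypothetical stand-in for the Resolver service: decides whether
    a harvested URL should be handed to the downloader."""

    def __init__(self, start_url, max_depth=MAX_DEPTH):
        # Domain of the first page; later URLs must match it.
        self.domain = urlparse(start_url).netloc.lower()
        self.max_depth = max_depth
        self.visited = {start_url}

    def should_download(self, url, depth):
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            return False  # not a valid web URL
        if parts.netloc.lower() != self.domain:
            return False  # outside the domain of the first page
        if depth > self.max_depth:
            return False  # too many link levels away (assumed depth rule)
        if url in self.visited:
            return False  # already seen
        self.visited.add(url)
        return True
```

A caller would create one `Resolver` per crawl, for example `Resolver("http://example.com/")`, and ask `should_download(url, depth)` for every URL the Harvester reports.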

The VPL Program

There is a VPL program named VPLWebCrawler. It consists of three diagrams. The first one defines the kickoff process. The first URL to download is entered in a dialog window:

 

The second diagram handles the Harvester notifications of new URLs:

The third diagram is a bonus: it processes the Downloader notifications of new content, extracts the title, and forwards it to a Text to Speech component:

 

To launch the application, go to the Run -> Start menu. A window appears, prompting you to enter the URL of the page where crawling begins:

Enter a valid URL, and the crawling process begins:

After some seconds, the titles of the downloaded pages are posted to the Text to Speech service: you can hear the crawling process.

Reading Pages

In another VPL program, VPLWebReader, you can read the content of a web page aloud, using the Text to Speech service:

Interestingly, this program uses the same service components as the previous example. With VPL composition, we can use them for another purpose.
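To give an idea of what the Reader component contributes to both programs, here is a small Python sketch that extracts the title and the visible body text of a page while discarding tags and scripts. It uses the standard `html.parser` module rather than the parser included in the download; the names `PageText` and `read_page` are hypothetical, and skipping `style` blocks as well as scripts is an assumption of this example.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collects the <title> text and the visible <body> text of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True
        elif tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "title":
            self.in_title = False
        elif tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.skip_depth:
            return  # ignore script and style content
        if self.in_title:
            self.title_parts.append(data)
        elif self.in_body:
            self.body_parts.append(data)


def read_page(html):
    """Return (title, body_text) with whitespace collapsed."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    body = " ".join(" ".join(parser.body_parts).split())
    return title, body
```

In the crawler, only the title would be forwarded to Text to Speech; in the reader program, the body text would be forwarded instead.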

You can use it to read my experiments in “Anglish” (Angel’s English) at http://ajlopezen.zoomblog.com.

Conclusions

The service components were written to be used with VPL orchestration. They don’t have partners or direct connections with other service components in the project. This is a new way of programming: you must plan the request and response messages that will be wired together in the VPL diagrams. The notification feature is a plus: you can send the same outgoing messages to different target components.
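The notification idea can be pictured with a tiny publish/subscribe sketch in Python. This is only an analogy, not how DSS implements subscriptions; the class `Notifier` and its methods are invented for this example. The point is that the publisher does not know its targets, so the same outgoing message can reach several of them.

```python
class Notifier:
    """Hypothetical publisher: forwards each message to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        # handler is any callable that accepts one message
        self._subscribers.append(handler)

    def notify(self, message):
        # Copy the list so a handler may subscribe others while we iterate.
        for handler in list(self._subscribers):
            handler(message)


# The harvester publishes a URL once; two independent targets receive it.
harvester_output = Notifier()
to_resolve, log = [], []
harvester_output.subscribe(to_resolve.append)
harvester_output.subscribe(lambda url: log.append("found " + url))
harvester_output.notify("http://example.com/about")
```

In the VPL diagrams, the wiring done here by `subscribe` corresponds to connecting a notification output of one service to the input of another.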

You can play a little more: put some of the components on another node or machine, using the new VPL features.

I hope you’ll find this example useful. I had fun writing it.

Thanks to Fernando Tubio for his initial ideas for a web crawler implementation.

Angel “Java” Lopez
http://www.ajlopez.com/en

19 thoughts on “Web Crawler example using DSS (Decentralized Software Services)”

  1. Arvindra Sehmi

    Hi Angel,

    Nice little demo which I got working under the April CTP of MRDS and VS2008 after upgrading the VS solution and changing a number of the path references in the DssWebCrawler.csproj file. I also had to change the contract identifier (2008/04 -> 2008/05) to allow it to work alongside a previous version of the same Web Crawler contract I had on my machine. For good measure I also re-constructed the VPL diagrams from scratch because the VPL didn’t appear to ‘forget’ the old contract until I deleted and recreated the services. Most downloaders of this demo will never encounter these contract collision issues, but it is worth pointing them out.

    Thanks and good luck with your Argentina RAF presentation tomorrow!

    – Arvindra


