Web Crawler using Agents and AjSharp

Last year, I wrote a web crawler example using messages; the details are in this post:

Distributed Web Crawler using AjMessages

Before that, I wrote other examples, using DSS/CCR, included in Microsoft Robotics Developer Studio:

Distributed Agents using DSS/VPL
Web Crawler example using DSS (Decentralized Software Services)

Now, I wrote a local (not distributed) example, using agents in AjSharp. Remember, each agent runs in its own thread, and its invocations are queued, so they are executed one by one. More about agents in AjSharp:

Agents in AjSharp (Part 1)
Agents in AjSharp (Part 2)
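The agent model described above (each agent owns a thread, and invocations on it are queued and executed one by one) can be sketched in Python. This is a minimal illustration of the idea, not AjSharp's actual implementation; the class and method names here are hypothetical:

```python
import queue
import threading

class Agent:
    """Sketch of the agent model: one worker thread per agent,
    method invocations enqueued and executed one at a time."""
    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:  # sentinel: stop the worker thread
                break
            method, args = message
            method(*args)

    def post(self, method, *args):
        # Enqueue an invocation instead of running it on the caller's thread
        self._queue.put((method, args))

    def stop(self):
        self._queue.put(None)
        self._thread.join()

class Printer(Agent):
    def __init__(self):
        super().__init__()
        self.lines = []
    def process(self, text):
        self.lines.append(text)

printer = Printer()
printer.post(printer.process, "hello")
printer.post(printer.process, "world")
printer.stop()
print(printer.lines)  # invocations ran in order: ['hello', 'world']
```

Because each agent serializes its own invocations, agents never need locks for their internal state, even though different agents run concurrently.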

Yesterday, I took code from my previous examples and rearranged it into AjSharp agents. This is the result. First, an object that builds the agent network and launches the process:

// Build and launch agents
object WebCrawler
  sub Process(url, fn)
    uri = new System.Uri(url);
    downloader = new Downloader();
    harvester = new Harvester();
    resolver = new Resolver(uri,5);
    processor = new Processor(fn);
    downloader.Harvester = harvester;
    downloader.Processor = processor;
    harvester.Resolver = resolver;
    resolver.Downloader = downloader;
    downloader.Process(uri, 0);

The downloader takes a URI, downloads its content, and sends it to two associated agents: Processor and Harvester. The depth parameter is used later, indicating the depth of this URI in the crawling process:

// Downloads a page
agent Downloader 
  sub Process(uri,depth)
    client = new System.Net.WebClient();
    content = client.DownloadString(uri);
    PrintLine("Downloaded: " + uri);
    this.Harvester.Process(uri, depth, content);
    this.Processor.Process(uri, content);

The Processor executes a user function/routine, receiving the URI and its retrieved content:

// Process the content retrieved
agent Processor
  function Processor(fn)
    this.fn = fn; // function to invoke
  sub Process(uri, content)
    // Add your logic
    this.fn(uri, content);

The Harvester detects other links in the content and sends them, one by one, to a new agent, the Resolver:

// Get links from page
agent Harvester
  sub Process(uri,depth,content)
    matches = System.Text.RegularExpressions.Regex.Matches(content, "href=\\s*\"([^&\"]*)\"");
    results = new List();
    foreach (match in matches) {
      value = match.Groups[1].Value;
      if (!results.Contains(value))
        results.Add(value);
    }
    foreach (result in results) 
      if (result.StartsWith("http"))
        this.Resolver.Process(new System.Uri(result), depth+1);

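The harvesting step (extract `href` values with a regular expression, dedupe them, and keep the absolute links) can be sketched in Python using the same regex as the agent above; `harvest_links` is an illustrative name, not part of AjSharp:

```python
import re

# Same pattern idea as the Harvester agent: capture href="..." values
HREF_PATTERN = re.compile(r'href=\s*"([^&"]*)"')

def harvest_links(content):
    results = []
    for value in HREF_PATTERN.findall(content):
        if value not in results:  # skip duplicates, as the agent does
            results.append(value)
    # Keep only absolute http(s) links; relative ones are dropped
    return [r for r in results if r.startswith("http")]

html = ('<a href="http://example.com/a">A</a> '
        '<a href="/local">B</a> '
        '<a href="http://example.com/a">A again</a>')
print(harvest_links(html))  # ['http://example.com/a']
```

Note that this simple pattern ignores relative links and entity-containing URLs; a production crawler would resolve relative links against the page's base URI.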
The Resolver keeps a list of processed URIs and filters out those that do not belong to the original host (I keep the web crawling process within the site of the first URI):

// Filter invalid or already processed links
agent Resolver
  var processed = new List();  
  function Resolver(uri,maxdepth)
    this.host = uri.Host;
    this.maxdepth = maxdepth;
  sub Process(uri,depth) 
    if (depth > this.maxdepth)
      return;
    if (uri.Host != this.host)
      return;
    if (uri.Scheme != System.Uri.UriSchemeHttp && uri.Scheme != System.Uri.UriSchemeHttps)
      return;
    if (processed.Contains(uri))
      return;
    processed.Add(uri);
    PrintLine("New Link: " + uri);
    this.Downloader.Process(uri, depth);
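The Resolver's filtering rules (drop links that are too deep, off-host, non-http(s), or already seen) can be sketched in Python. The class and method names here are illustrative, not AjSharp's:

```python
from urllib.parse import urlparse

class Resolver:
    """Sketch of the Resolver's filtering: depth limit, same-host,
    http(s)-only, and a set of already processed URIs."""
    def __init__(self, start_url, maxdepth):
        self.host = urlparse(start_url).hostname
        self.maxdepth = maxdepth
        self.processed = set()

    def accept(self, url, depth):
        parts = urlparse(url)
        if depth > self.maxdepth:
            return False
        if parts.hostname != self.host:       # stay on the original site
            return False
        if parts.scheme not in ("http", "https"):
            return False
        if url in self.processed:             # skip already visited links
            return False
        self.processed.add(url)
        return True

resolver = Resolver("https://ajlopez.wordpress.com", 5)
print(resolver.accept("https://ajlopez.wordpress.com/about", 1))  # True
print(resolver.accept("https://ajlopez.wordpress.com/about", 2))  # False (seen)
print(resolver.accept("http://ajlopez.zoomblog.com/", 1))         # False (off-host)
print(resolver.accept("https://ajlopez.wordpress.com/deep", 9))   # False (too deep)
```

Keeping all these checks inside one agent means the processed-URI set needs no locking: the Resolver's queue serializes every invocation.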

Finally, an example of use, creating and launching two agent sets dedicated to web crawling:

// Example
WebCrawler.Process("https://ajlopez.wordpress.com", function(uri,content) { PrintLine("From ajlopez.wordpress "+uri);});
WebCrawler.Process("http://ajlopez.zoomblog.com", function(uri,content) { PrintLine("From ajlopez.zoomblog "+uri);});

You can download AjSharp from trunk:


The code is in Examples/WebCrawler.ajs in the AjSharp.Console project. After compiling it, you can run the web crawler from the command line:

AjSharp.Console Examples/WebCrawler.ajs

Partial output:

You can see the example with beautiful formatting at Pastie http://pastie.org/835926

Next step: use distributed agents. There are two ways to explore: one, declare some agents as distributed, via additional code or configuration, without changing the original code; two, make the distribution explicit in code.

Keep tuned!

Angel “Java” Lopez


