Angel “Java” Lopez on Blog

February 22, 2010

Web Crawler using Agents and AjSharp

Filed under: .NET, AjSharp, Open Source Projects, Programming Languages — ajlopez @ 10:30 am

Last year, I wrote a web crawler example using messages; the details are in this post:

Distributed Web Crawler using AjMessages

Before that, I wrote other examples using DSS/CCR, included in Microsoft Robotics Developer Studio:

Distributed Agents using DSS/VPL
Web Crawler example using DSS (Decentralized Software Services)

Now I have written a local (not distributed) example, using agents in AjSharp. Remember, each agent runs in its own thread, and its invocations are queued, so they are executed one by one; a minimal illustration follows the links below. More about agents in AjSharp:

Agents in AjSharp (Part 1)
Agents in AjSharp (Part 2)
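
To show the execution model, here is a minimal sketch. It uses the same AjSharp syntax as the crawler code below; the Greeter agent is only an illustration of mine, not part of the crawler:

// Minimal agent: each call to Greet is queued and runs on the agent's thread
agent Greeter
{
  sub Greet(name)
  {
    PrintLine("Hello, " + name);
  }
}

greeter = new Greeter();
greeter.Greet("world"); // enqueued; the caller does not wait
greeter.Greet("again"); // executed after the previous invocation completes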

Yesterday, I took code from my previous examples and rearranged it into AjSharp agents. This is the result. First, an object that builds the agent network and launches the process:

// Build and launch agents
object WebCrawler
{
  sub Process(url, fn)
  {
    uri = new System.Uri(url);
    downloader = new Downloader();
    harvester = new Harvester();
    resolver = new Resolver(uri,5);
    processor = new Processor(fn);
    
    downloader.Harvester = harvester;
    downloader.Processor = processor;
    harvester.Resolver = resolver;
    resolver.Downloader = downloader;
    
    downloader.Process(uri, 0);
  }
}
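
Note the wiring: the Downloader feeds the Harvester and the Processor, the Harvester feeds the Resolver, and the Resolver feeds back into the Downloader, closing the loop. The second argument to the Resolver constructor (5 here) is the maximum crawling depth.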

The Downloader takes a URI, downloads its content, and sends it to two associated agents: the Processor and the Harvester. The depth parameter is used later; it indicates the depth of this URI in the crawling process:

// Downloads a page
agent Downloader 
{
  sub Process(uri,depth)
  {
    client = new System.Net.WebClient();
    content = client.DownloadString(uri);
    PrintLine("Downloaded: " + uri);
    this.Harvester.Process(uri,depth,content);
    this.Processor.Process(uri, content);
  }
}
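
Note that WebClient.DownloadString raises an exception if the download fails; error handling is omitted here to keep the example short.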

The Processor executes a user-supplied function or routine, receiving the URI and its retrieved content:

// Process the content retrieved
agent Processor
{
  function Processor(fn)
  {
    this.fn = fn; // function to invoke
  }
  
  sub Process(uri, content)
  {
    // Add your logic
    this.fn(uri, content);
  }
}
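
In the final example below, fn is an anonymous function created with the function(uri, content) { ... } syntax.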

The Harvester detects the links in the content and sends them, one by one, to another agent, the Resolver (a variation that resolves relative links is sketched after the code):

// Get links from page
agent Harvester
{
  sub Process(uri,depth,content)
  {
    matches = System.Text.RegularExpressions.Regex.Matches(content, "href=\\s*\"([^&\"]*)\"");
    results = new List();
    
    foreach (match in matches) {
      value = match.Groups[1].Value;
      
      if (!results.Contains(value))
        results.Add(value);
    }
    
    foreach (result in results) 
      if (result.StartsWith("http"))
        this.Resolver.Process(new System.Uri(result), depth+1);
  }
}
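
The code above discards relative links (those that don't start with "http"). A possible variation, of my own and not part of the original example, resolves each link against the page URI using the System.Uri(baseUri, relative) constructor from .NET; it assumes the extracted hrefs are well formed:

// Variation (not in the original): resolve relative and absolute links
// against the current page URI; the Resolver still filters by scheme and host
foreach (result in results)
  this.Resolver.Process(new System.Uri(uri, result), depth+1);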

The Resolver keeps a list of processed URIs and filters out those that do not belong to the original host (I keep the web crawling process within the site of the first URI):

// Filter invalid or already processed links
agent Resolver
{
  var processed = new List();  
  
  function Resolver(uri,maxdepth)
  {
    this.host = uri.Host;
    this.maxdepth = maxdepth;
  }
  
  sub Process(uri,depth) 
  {
    if (depth > this.maxdepth)
      return;
      
    if (uri.Host != this.host)
      return;
    
    if (uri.Scheme != System.Uri.UriSchemeHttp && uri.Scheme != System.Uri.UriSchemeHttps)
      return;
      
    if (processed.Contains(uri))
      return;
      
    processed.Add(uri);
      
    PrintLine("New Link: " + uri);
    this.Downloader.Process(uri,depth);     
  }
}
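
The depth and host checks come first, so off-site links never enter the processed list. Note that the list grows with the crawl and List.Contains is a linear scan; for a larger crawl, a hash-based set would be a natural replacement.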

Finally, an example of use, creating and launching two agent sets dedicated to web crawling:

// Example
WebCrawler.Process("http://ajlopez.wordpress.com", function(uri,content) { PrintLine("From ajlopez.wordpress "+uri);});
WebCrawler.Process("http://ajlopez.zoomblog.com", function(uri,content) { PrintLine("From ajlopez.zoomblog "+uri);});

You can download AjSharp from trunk:

http://code.google.com/p/ajcodekatas/source/browse/#svn/trunk/AjLanguage

The code is in Examples/WebCrawler.ajs in the AjSharp.Console project. After compiling the project, you can run the web crawler from the command line:

AjSharp.Console Examples/WebCrawler.ajs


You can see the example with nice formatting at Pastie: http://pastie.org/835926

Next step: use distributed agents. There are two ways to explore: one, declare some agents as distributed, via additional code or configuration, without changing the original code; two, write all the distribution explicitly in the code.

Keep tuned!

Angel “Java” Lopez

http://www.ajlopez.com

http://twitter.com/ajlopez
