Last year, I wrote a web crawler example using messages; the details are in this post:
Distributed Web Crawler using AjMessages
Before that, I wrote other examples using DSS/CCR, the libraries included in Microsoft Robotics Developer Studio:
Distributed Agents using DSS/VPL
Web Crawler example using DSS (Decentralized Software Services)
Now I have written a local (not distributed) example, using agents in AjSharp. Remember, each agent runs in its own thread, and its invocations are queued, so they are executed one by one. More about agents in AjSharp:
Agents in AjSharp (Part 1)
Agents in AjSharp (Part 2)
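The agent model described above (one thread per agent, calls queued and executed one at a time) can be sketched outside AjSharp. The following Python is only an illustration of that idea, not how AjSharp implements agents; the names `Agent`, `post`, `join` and `Recorder` are hypothetical and do not come from AjSharp.

```python
import queue
import threading


class Agent:
    """Hypothetical base class: one worker thread plus a FIFO queue of calls."""

    def __init__(self):
        self._calls = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, method, *args):
        """Queue a call and return at once; the caller does not wait for the work."""
        self._calls.put((method, args))

    def join(self):
        """Block until every queued call has been handled."""
        self._calls.join()

    def _run(self):
        # Single consumer: calls are executed strictly in arrival order.
        while True:
            method, args = self._calls.get()
            try:
                method(*args)
            except Exception as exc:  # keep the worker alive after a failed call
                print("Agent call failed:", exc)
            finally:
                self._calls.task_done()


class Recorder(Agent):
    """Hypothetical agent that records the values it receives."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def process(self, value):
        self.seen.append(value)


recorder = Recorder()
for n in range(5):
    recorder.post(recorder.process, n)  # returns immediately
recorder.join()
print(recorder.seen)  # [0, 1, 2, 3, 4]
```

Because only the agent's own thread touches `seen`, no lock is needed around it. That is the main attraction of the model: state is private to the agent and calls never overlap.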
Yesterday, I took the code from my previous examples and rearranged it into AjSharp agents. This is the result. First, an object that builds the agent network and launches the process:
    // Build and launch agents
    object WebCrawler
    {
        sub Process(url, fn)
        {
            uri = new System.Uri(url);

            downloader = new Downloader();
            harvester = new Harvester();
            resolver = new Resolver(uri,5); // second argument is the maximum depth
            processor = new Processor(fn);

            downloader.Harvester = harvester;
            downloader.Processor = processor;
            harvester.Resolver = resolver;
            resolver.Downloader = downloader;

            downloader.Process(uri, 0);
        }
    }

The Downloader takes a URI, downloads its content, and sends it to two associated agents: the Processor and the Harvester. The depth parameter, used later by the Resolver, indicates the depth of this URI in the crawling process:
    // Downloads a page
    agent Downloader
    {
        sub Process(uri,depth)
        {
            client = new System.Net.WebClient();
            content = client.DownloadString(uri);
            PrintLine("Downloaded: " + uri);
            this.Harvester.Process(uri,depth,content);
            this.Processor.Process(uri, content);
        }
    }

The Processor executes a user-supplied function, passing it the URI and the retrieved content:
    // Process the content retrieved
    agent Processor
    {
        function Processor(fn)
        {
            this.fn = fn; // function to invoke
        }

        sub Process(uri, content)
        {
            // Add your logic
            this.fn(uri, content);
        }
    }

The Harvester detects the links in the content and sends them, one by one, to another agent, the Resolver:
    // Get links from page
    agent Harvester
    {
        sub Process(uri,depth,content)
        {
            matches = System.Text.RegularExpressions.Regex.Matches(content, "href=\\s*\"([^&\"]*)\"");
            results = new List();

            foreach (match in matches)
            {
                value = match.Groups[1].Value;

                if (!results.Contains(value))
                    results.Add(value);
            }

            foreach (result in results)
                if (result.StartsWith("http"))
                    this.Resolver.Process(new System.Uri(result), depth+1);
        }
    }

The Resolver keeps a list of processed URIs and filters out the links that do not belong to the original host (I keep the crawling process within the site of the first URI):
    // Filter invalid or already processed links
    agent Resolver
    {
        var processed = new List();

        function Resolver(uri,maxdepth)
        {
            this.host = uri.Host;
            this.maxdepth = maxdepth;
        }

        sub Process(uri,depth)
        {
            if (depth > this.maxdepth)
                return;

            if (uri.Host != this.host)
                return;

            if (uri.Scheme != System.Uri.UriSchemeHttp && uri.Scheme != System.Uri.UriSchemeHttps)
                return;

            if (processed.Contains(uri))
                return;

            processed.Add(uri);
            PrintLine("New Link: " + uri);
            this.Downloader.Process(uri,depth);
        }
    }

Finally, an example of use, creating and launching two sets of agents dedicated to web crawling:
    // Example
    WebCrawler.Process("http://ajlopez.wordpress.com", function(uri,content) { PrintLine("From ajlopez.wordpress "+uri);});
    WebCrawler.Process("http://ajlopez.zoomblog.com", function(uri,content) { PrintLine("From ajlopez.zoomblog "+uri);});

You can download AjSharp from trunk:
http://code.google.com/p/ajcodekatas/source/browse/#svn/trunk/AjLanguage
The code is in Examples/WebCrawler.ajs in the AjSharp.Console project. After compiling it, you can run the web crawler from the command line:
AjSharp.Console Examples/WebCrawler.ajs
Partial output (console screenshot, image not preserved).
You can see the example, nicely formatted, at Pastie: http://pastie.org/835926
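For readers who want to try the Harvester and Resolver rules without AjSharp, here is a minimal Python sketch of the same link extraction and filtering. It is an approximation, not a port: the function `harvest`, the method `accept`, and the `example.com` and `other.org` URLs are hypothetical names chosen for illustration, and Python's `urlparse` stands in for `System.Uri`.

```python
import re
from urllib.parse import urlparse

# Same pattern as the Harvester. Note that an href value containing '&'
# does not match at all, so such links are skipped.
HREF = re.compile(r'href=\s*"([^&"]*)"')


def harvest(content):
    """Return the distinct absolute links in content, in order of appearance."""
    links = []
    for value in HREF.findall(content):
        if value.startswith("http") and value not in links:
            links.append(value)
    return links


class Resolver:
    """Accept a link once, only on the starting host, only up to max_depth."""

    def __init__(self, start_url, max_depth):
        self.host = urlparse(start_url).hostname
        self.max_depth = max_depth
        self.processed = set()

    def accept(self, url, depth):
        parts = urlparse(url)
        if depth > self.max_depth:
            return False
        if parts.hostname != self.host:
            return False
        if parts.scheme not in ("http", "https"):
            return False
        if url in self.processed:
            return False
        self.processed.add(url)
        return True


# Hypothetical page content used only to exercise the rules.
page = (
    '<a href="http://example.com/a">A</a> '
    '<a href="http://example.com/a">A again</a> '
    '<a href="http://other.org/">B</a> '
    '<a href="/relative">C</a> '
    '<a href="http://example.com/q?x=1&y=2">D</a>'
)

resolver = Resolver("http://example.com", max_depth=5)
accepted = [link for link in harvest(page) if resolver.accept(link, 1)]
print(accepted)  # ['http://example.com/a']
```

The duplicate, the relative link and the link containing `&` are dropped during harvesting, and the link to the other host is rejected by the resolver, which mirrors how the two agents divide the work.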
Next step: use distributed agents. There are two ways to explore: one, declare some agents as distributed, via additional code or configuration, without changing the original code; two, make the distribution fully explicit in the code.
Stay tuned!
Angel “Java” Lopez
http://www.ajlopez.com