December 10, 2008
CrawLS is a basic web crawler that walks a single domain and reports the total number of unique hyperlinks it finds there. Each unique URL is displayed in real time as it is discovered.
- Message passing is used to signal which types of links to include in or exclude from the crawl.
- The GUI is built with the PLT Graphics Toolkit.
- Implementing the small parser required a solid command of regular expressions.
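The message-passing and regular-expression points above can be illustrated with a minimal sketch. It assumes modern Racket (the successor to PLT Scheme), and every name in it (`extract-hrefs`, `link-kind`, `make-link-filter`, and the messages `exclude!`, `include!`, `keep?`) is hypothetical rather than taken from CrawLS. The filter is a closure that dispatches on a message symbol, and the link kinds it recognizes are only examples.

```racket
#lang racket
;; Hypothetical sketch; none of these names come from CrawLS itself.

;; extract-hrefs : string -> (listof string)
;; Returns the value of every double-quoted href attribute in raw HTML.
(define (extract-hrefs html)
  (regexp-match* #px"(?i:href)\\s*=\\s*\"([^\"]*)\"" html
                 #:match-select cadr))

;; link-kind : string -> symbol
;; Classifies a link with a few illustrative regular expressions.
(define (link-kind href)
  (cond [(regexp-match? #rx"^mailto:" href) 'mailto]
        [(regexp-match? #rx"^#" href) 'fragment]
        [(regexp-match? #px"(?i:\\.(png|jpe?g|gif))$" href) 'image]
        [else 'page]))

;; make-link-filter : -> procedure
;; Message-passing object: the returned closure dispatches on a symbol.
(define (make-link-filter)
  (define excluded '())                 ; link kinds currently skipped
  (lambda (msg . args)
    (case msg
      [(exclude!) (set! excluded (cons (car args) (remove (car args) excluded)))]
      [(include!) (set! excluded (remove (car args) excluded))]
      [(keep?)    (not (memq (link-kind (car args)) excluded))]
      [else (error 'link-filter "unknown message: ~a" msg)])))

;; Usage: skip mail and image links, keep everything else.
(define link-filter (make-link-filter))
(link-filter 'exclude! 'mailto)
(link-filter 'exclude! 'image)

(define sample-html
  (string-append "<a href=\"/about.html\">About</a> "
                 "<a HREF = \"mailto:me@example.com\">Mail</a> "
                 "<a href=\"logo.PNG\">Logo</a>"))

(define kept
  (filter (lambda (h) (link-filter 'keep? h)) (extract-hrefs sample-html)))
```

With the sample input, `kept` holds only `"/about.html"`. A regular expression like this one is enough for a small parser but misses single-quoted or unquoted attributes, which is one reason a crawler might also lean on a real HTML parser.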
This application interfaces with multiple Scheme libraries: net/url, xml, scheme/path, and Alex Shinn's html-parser. It also depends on something easy to take for granted: an Internet connection, since the crawler downloads each page as it finds it.
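As an illustration of where net/url fits, the sketch below resolves a link found on a page against that page's address and checks whether two URLs share a host, which is how a crawler can stay inside one domain. It assumes modern Racket; the helper names `resolve-link` and `same-host?` are hypothetical, and whether CrawLS uses the library this way is not stated in the source.

```racket
#lang racket
(require net/url)

;; resolve-link : string string -> string
;; Turns a possibly relative href into an absolute URL string,
;; using the page it was found on as the base.
(define (resolve-link base href)
  (url->string (combine-url/relative (string->url base) href)))

;; same-host? : string string -> boolean
;; True when both URLs name the same host, i.e. the link stays on-site.
(define (same-host? a b)
  (equal? (url-host (string->url a))
          (url-host (string->url b))))

;; Usage with a placeholder domain.
(define page "http://example.com/docs/index.html")
(define absolute (resolve-link page "about.html"))
(define on-site? (same-host? page absolute))
```

Resolving every link to an absolute form before comparing it also keeps the same page from being counted twice under different relative spellings.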
Crawler designs are closely guarded secrets at some well-known organizations, such as Google and Yahoo. Besides helping me maintain my site and gather statistics about it, I wanted to build something that might explain why these companies consider crawlers a crucial ingredient of their business.
Technology Used Block Diagram
Contrary to common practice, I implemented a path-descending crawler rather than the more widely accepted path-ascending kind. The application has room for a wide variety of features and enhancements, and since I use it myself, I plan to keep upgrading it.
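The source does not spell out the traversal, so the following is only a sketch under one loose reading of "path descending": start at the top of the site and follow links downward, reporting each URL the first time it is seen. It assumes modern Racket. The procedure `crawl`, its parameters `links-of` and `report`, and the in-memory `site` table standing in for real page downloads are all hypothetical.

```racket
#lang racket
;; crawl : string (string -> (listof string)) (string -> any) -> natural
;; Breadth-first walk from start. Calls report once per newly seen URL
;; and returns the total number of unique URLs seen.
(define (crawl start links-of report)
  (report start)
  (let loop ([queue (list start)] [seen (set start)])
    (cond
      [(null? queue) (set-count seen)]
      [else
       ;; Links on the current page that have not been seen before.
       (define fresh
         (remove-duplicates
          (filter (lambda (u) (not (set-member? seen u)))
                  (links-of (car queue)))))
       (for-each report fresh)
       (loop (append (cdr queue) fresh)
             (set-union seen (list->set fresh)))])))

;; A tiny fake site: each path maps to the links found on that page.
(define site
  (hash "/"    '("/a" "/b")
        "/a"   '("/" "/a/x")
        "/b"   '("/a")
        "/a/x" '()))

;; Usage: print each unique URL as it is found, then keep the total.
(define total
  (crawl "/" (lambda (u) (hash-ref site u '())) displayln))
```

Passing `links-of` in as an argument keeps the traversal separate from downloading and parsing, so the same loop can run against real pages or a test table like `site`.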