Patrick Lozzi
December 10, 2008

Overview

CrawLS is a basic web crawler that walks a single domain and reports the total number of unique hyperlinks it finds there. It displays each unique URL in real time as it is discovered.
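The counting loop described above can be sketched roughly as follows. This is a minimal Python illustration, not the original Scheme code: the `PAGES` dictionary is a made-up in-memory site that stands in for real downloads, and the names `crawl`, `fetch`, and `on_found` are hypothetical.

```python
import re
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

# Hypothetical in-memory "site" used instead of real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.test/b.html": '<a href="http://other.test/x">off-site</a>',
}

HREF = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def crawl(seed, fetch, on_found=print):
    """Breadth-first crawl of the seed's host; returns the set of unique URLs seen."""
    host = urlparse(seed).netloc
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for href in HREF.findall(fetch(page)):
            # Resolve relative links and drop "#fragment" so duplicates collapse.
            url, _fragment = urldefrag(urljoin(page, href))
            if url in seen:
                continue
            seen.add(url)
            on_found(url)  # report each new URL as soon as it is found
            if urlparse(url).netloc == host:
                queue.append(url)  # only pages on the same host are visited
    return seen


reported = []
found = crawl("http://example.test/", lambda u: PAGES.get(u, ""), reported.append)
print(len(found), "unique URLs")
```

The `seen` set is what makes the count "unique": a URL is reported and queued only the first time it appears, however many pages link to it.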

Screenshot

Concepts Demonstrated

  • Message Passing is used to signal which type of links to include/exclude in the crawl.
  • The GUI is built through the PLT Graphics Toolkit.
  • Mastery of regular expressions was required for implementation of a small parser.
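As a rough illustration of the first and third points, the sketch below combines a message-passing link filter with a small regular-expression parser for `href` attributes. It is written in Python rather than Scheme, and the message names (`"exclude"`, `"include"`, `"accepts?"`) and link kinds (`"mailto"`, `"image"`, `"page"`) are assumptions made for the example; the write-up does not say which link types CrawLS distinguishes.

```python
import re

# Small regex "parser": captures the href value of each <a ...> tag.
HREF = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def make_link_filter():
    """Return a dispatch procedure that reacts to messages, in the SICP style."""
    excluded = set()  # link kinds currently excluded from the crawl

    def kind_of(url):
        # Hypothetical classification of a link by its form.
        if url.startswith("mailto:"):
            return "mailto"
        if re.search(r"\.(png|jpe?g|gif)$", url, re.IGNORECASE):
            return "image"
        return "page"

    def dispatch(message, arg=None):
        if message == "exclude":
            excluded.add(arg)
        elif message == "include":
            excluded.discard(arg)
        elif message == "accepts?":
            return kind_of(arg) not in excluded
        else:
            raise ValueError(f"unknown message: {message}")

    return dispatch


def extract_links(html, link_filter):
    """Return the hrefs found in html that the filter currently accepts."""
    return [u for u in HREF.findall(html) if link_filter("accepts?", u)]


link_filter = make_link_filter()
link_filter("exclude", "image")
html = '<a href="index.html">home</a> <a href="logo.PNG">logo</a>'
print(extract_links(html, link_filter))
```

The filter's state lives in the closure, so the rest of the crawler only ever talks to it by sending messages; a GUI checkbox could, for instance, send `"include"` or `"exclude"` without knowing how links are classified.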

External Technology

This application uses several Scheme libraries: net/url, xml, scheme/path, and Alex Shinn's html-parser. One dependency that is easy to take for granted is the Internet itself: the application needs a network connection to download pages as it finds them.

Innovation

Crawler designs are closely guarded secrets at some well-known organizations, such as Google and Yahoo. Besides helping me maintain my own site and gather statistics about it, I wanted to build something that might explain why these companies consider crawlers a crucial ingredient of their business.

Technology Used Block Diagram

Additional Remarks

Unlike the more widely used path-ascending designs, CrawLS is a path-descending crawler. The application has room for a wide variety of features and enhancements, and since I use it myself, I plan to keep upgrading it.
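The write-up does not define "path descending" precisely. One plausible reading, assumed here, is that the crawler only follows links at or below the directory of the starting URL, whereas a path-ascending crawler also climbs to every parent directory. A minimal Python sketch of that check, with hypothetical helper names `base_dir` and `is_descendant`:

```python
import posixpath
from urllib.parse import urlparse


def base_dir(seed):
    """Directory part of the seed URL's path, always ending in '/'."""
    path = urlparse(seed).path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)  # strip the file name
        if not path.endswith("/"):
            path += "/"
    return path


def is_descendant(seed, url):
    """True if url is on the seed's host and at or below the seed's directory."""
    s, u = urlparse(seed), urlparse(url)
    if (s.scheme, s.netloc) != (u.scheme, u.netloc):
        return False
    return (u.path or "/").startswith(base_dir(seed))


seed = "http://example.test/docs/index.html"
print(is_descendant(seed, "http://example.test/docs/api/x.html"))
print(is_descendant(seed, "http://example.test/about.html"))
```

Comparing against a directory that ends in `/` keeps sibling paths such as `/docs-old/` from being mistaken for descendants of `/docs/`.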

Page last modified on December 11, 2008, at 12:58 PM