December 7, 2014
Interactive Research Comrade, or IRCle, is an artificial intelligence whose life purpose is to educate the masses through the medium of Internet Relay Chat (IRC). IRCle will provide information to users in plain English, lowering the technical hurdle involved in learning other technologies such as Google, Wikipedia or Wolfram Alpha.
- The higher-order functions map, filter and reduce were used to parse the contents of an HTML page and scrape only the desired content.
- Data abstraction techniques (list selectors) were used to parse HTML "objects".
- The IRC bot we created has an abstraction layer implemented as an object based on the message passing technique, which makes it simpler to interface with.
- Currying and function composition were used extensively.
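As an illustration of the message passing technique mentioned in the list above, here is a minimal sketch of an object implemented as a dispatching closure. It is not the project's actual bot layer: the names make-bot, say and sent are hypothetical, and the "outbox" stands in for a real IRC connection.

```racket
#lang racket

;; Hypothetical bot object: a closure that dispatches on a message symbol.
;; The outbox list stands in for a real IRC connection.
(define (make-bot nick)
  (define outbox '())
  (define (say! text)
    (set! outbox (cons (string-append nick ": " text) outbox)))
  (define (dispatch message . args)
    (case message
      [(nick) nick]                 ; return the bot's nickname
      [(say)  (apply say! args)]    ; queue a line of text
      [(sent) (reverse outbox)]     ; everything queued so far, oldest first
      [else (error 'bot "unknown message: ~a" message)]))
  dispatch)

(define bot (make-bot "IRCle"))
(bot 'say "hello")
(bot 'say "world")
(bot 'sent) ; → '("IRCle: hello" "IRCle: world")
```

Callers only ever see the dispatch procedure, so the internal state can change without affecting code that talks to the bot.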
External Technology and Libraries
This project interfaced with a number of external libraries. Below is the list of libraries and their use:
- net/url, net/uri-codec -- Provided a low-level way to get a "pure-port" for an HTML page and parse that page at a very low level. Used by our link-farming procedures and the library described next.
- neil/html-parsing -- An HTML parsing library that generates a Scheme "object" out of the page contents that is far easier to parse than the built-in utilities that Racket provides. See Eric's "Favorite Lines of Code" section.
- irc -- An IRC client for Racket that we used to interface with the user.
- Racket native -- Allowed us to create a library to interface Racket with Python code using low-level system calls in C.
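To give a feel for the net/url and net/uri-codec libraries listed above, the following sketch builds a Wikipedia search URL from a topic string. The helper topic->search-url is a hypothetical name, and percent-encoding the topic is an assumption made for illustration rather than something the project is known to do. Fetching the page is left out so the example runs without network access.

```racket
#lang racket
(require net/url net/uri-codec)

;; Hypothetical helper: build a Wikipedia search URL for a topic.
;; uri-encode escapes characters such as spaces that are unsafe in a query.
(define (topic->search-url topic)
  (string->url
   (string-append "http://en.wikipedia.org/w/index.php?search="
                  (uri-encode topic))))

(uri-encode "alan turing")                   ; → "alan%20turing"
(url-host (topic->search-url "alan turing")) ; → "en.wikipedia.org"
```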
Favorite Lines of Code
The following bullets contain Santiago and Eric's favorite lines of code and the reasons why:
- Santiago Paredes:
    (define (topic-summarize topics depth)
      (define (wiki-split string depth)
        (define parsed (string-split string "\n\n"))
        (define final "")
        (define wiki-depth (expt 2 depth))
        (when (>= wiki-depth (length parsed))
          (set! wiki-depth (length parsed)))
        (for ((i wiki-depth))
          (set! final (string-append final " " (list-ref parsed i))))
        final)
      (map (lambda (x) (wiki-split (cadr x) depth)) topics))
This routine capitalizes on the way Wikipedia articles are already organized into quasi-summarized topics. Eric passes me a simple data structure: a list of lists, where each inner list contains two elements, the wiki-article-title and the wiki-article-text. After being processed by my partner's web scraper, the text arrives in one distinct format: each section of an article is separated by two newlines. Ignoring the wiki-article-title, the function maps over all the wiki-article-texts, summarizing each one and coalescing its sections as it moves from one list to the next. It also lets users choose the depth at which they would like to learn about a subject: topic-summarize takes a depth parameter, and the amount of text returned grows with it. The code keeps the first 2^depth sections of each article (capped at the number of sections available), so 0 is the shallowest option and returns the smallest possible amount of information, 1 is a deeper option and returns slightly more, and so on.
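The effect of the depth parameter can be sketched on a small hand-made input. The version below is an alternative formulation, not Santiago's code: summarize-text and summarize-topics are hypothetical names, it uses take and string-join instead of a mutable accumulator, and unlike the original it does not prepend a leading space to each result.

```racket
#lang racket

;; Hypothetical variant of the inner helper: keep the first 2^depth sections,
;; capped at the number of sections the article actually has.
(define (summarize-text text depth)
  (define sections (string-split text "\n\n"))
  (define keep (min (expt 2 depth) (length sections)))
  (string-join (take sections keep) " "))

;; Same input shape as described above: a list of (title text) lists.
(define (summarize-topics topics depth)
  (map (lambda (topic) (summarize-text (cadr topic) depth)) topics))

;; Made-up sample data, with sections separated by two newlines.
(define sample-topics
  (list (list "Racket" "Intro.\n\nHistory.\n\nUsage.\n\nSee also.")
        (list "IRC" "Overview.\n\nProtocol.")))

(summarize-topics sample-topics 0) ; → '("Intro." "Overview.")
(summarize-topics sample-topics 1) ; → '("Intro. History." "Overview. Protocol.")
(summarize-topics sample-topics 3) ; → '("Intro. History. Usage. See also." "Overview. Protocol.")
```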
- Eric Marcoux:
    (define (get-topic-contents topic)
      (let* ([root ((compose1 html->xexp get-pure-port-with-redirection string->url)
                    (string-append "http://en.wikipedia.org/w/index.php?search=" topic))]
             [html (car (filter (curry tag-eq? 'html) (cdr root)))]
             [body (cddar (filter (curry tag-eq? 'body) html))]
             [mw-body (cddar (filter (curry is-div-with-class? "mw-body") body))]
             [mw-body-content (cddar (filter (curry is-div-with-class? "mw-body-content") mw-body))]
             [mw-content-text (cddar (filter (curry is-div-with-id? "mw-content-text") mw-body-content))]
             [paragraphs (filter (curry tag-eq? 'p) mw-content-text)]
             [sanatized-paragraphs (map sanatize-paragraph paragraphs)]
             [flattened-paragraphs (flatten-paragraph-to-string sanatized-paragraphs)])
        flattened-paragraphs))

This one procedure does all of our Wikipedia web scraping. It downloads the page in a structured form (think Scheme list of lists, i.e. an object) using the neil/html-parsing library. It then uses a series of filters to traverse its way down the divs that make up a page until it reaches the div that actually holds the content of the page. Finally, it sanitizes that output into a string; in other words, it removes all references to formatting. The functions tag-eq?, is-div-with-class?, and is-div-with-id? are defined elsewhere and provide an abstraction barrier for parsing the content given back in the object.
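The filter-and-curry traversal described above can be tried on a small hand-written page without downloading anything. This is only a sketch under stated assumptions: the definition of tag-eq? shown here is a guess at what the real helper does (its source is not shown), node-text and paragraph-texts are hypothetical names, and the sample page is made up to resemble the list structure that html->xexp produces.

```racket
#lang racket

;; Assumed definition: a node matches when it is a list headed by the tag.
(define (tag-eq? tag node)
  (and (pair? node) (eq? (car node) tag)))

;; Hypothetical sanitizer: concatenate the text of a node, dropping
;; attribute lists (headed by @) and all formatting tags.
(define (node-text node)
  (cond [(string? node) node]
        [(and (pair? node) (eq? (car node) '@)) ""]
        [(pair? node) (string-append* (map node-text (cdr node)))]
        [else ""]))

;; Made-up page in the nested-list form described above.
(define page
  '(*TOP*
    (html (head (title "Demo"))
          (body (@ (class "mw-body"))
                (p "Hello " (b "world") ".")
                (div "not a paragraph")
                (p "Second paragraph.")))))

;; Walk down html -> body, then keep only the paragraph nodes.
;; cddar skips the body tag and its attribute list.
(define (paragraph-texts root)
  (let* ([html (car (filter (curry tag-eq? 'html) (cdr root)))]
         [body (cddar (filter (curry tag-eq? 'body) html))]
         [paragraphs (filter (curry tag-eq? 'p) body)])
    (map node-text paragraphs)))

(paragraph-texts page) ; → '("Hello world." "Second paragraph.")
```

Currying tag-eq? with the tag turns it into a one-argument predicate, which is exactly the shape filter expects.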