WikiPath

Anthony Salani
May 2, 2014

WikiPath: A Wikipedia-Based Pathfinder

Overview

WikiPath is a program that determines some of the shortest possible paths between any two user-provided Wikipedia pages. Upon completion, the shortest results found are displayed to the user in both textual and pictorial form.

Screenshot

Finding the shortest distance between "List of unexplained sounds" and "Smash Mouth"

Concepts Demonstrated

  • Higher-order functions: Used to convert article titles (nodes) into lists of pairs of article titles (edges) pointing either into or out of that node (see the sketch after this list).
  • Message passing: Used to make the wikipedia.rkt component of the program entirely replaceable. If someone were so inclined, they could write a twitter.rkt that implements the four basic generic operations ('suggest, 'f-links, 'b-links, and 'picture), replace all three references to "wikipedia" with "twitter", and the program would then compute the degree of connectedness between Twitter personalities (see the same sketch below).
  • Abstraction: No component (frontend.rkt, pathfinder.rkt, wikipedia.rkt) needs to concern itself with how the others operate. For example, pathfinder.rkt just takes a start point, an end point, and a function that produces the edges out of a node, and returns a collection of paths. In fact, there are no references to Wikipedia at all in pathfinder.rkt.
  • Databases/Hash Tables: Used to store the paths found so far during the search.
  • Lists/List Manipulation: Completed paths are represented as lists. Most Wikipedia API-related functions return lists of article names.
  • Let/Local Variables: Used to prevent the same web page from being loaded multiple times.
  • Tail Recursion: Most functions (including the search algorithm) use tail recursion when possible.
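
For illustration, here is a minimal sketch (with stubbed-out internals, not the actual wikipedia.rkt code) of the message-passing dispatch and the edge-converting higher-order function described above:

  #lang racket

  ;; Stubs standing in for the real API-backed implementations.
  (define (suggest prefix) (list prefix))
  (define (f-links title) '("Some article"))
  (define (b-links title) '("Another article"))
  (define (picture title) #f)

  ;; The module answers the four generic messages through one dispatcher,
  ;; so a twitter.rkt exposing the same messages could drop right in.
  (define (wikipedia msg . args)
    (case msg
      [(suggest) (apply suggest args)]
      [(f-links) (apply f-links args)]
      [(b-links) (apply b-links args)]
      [(picture) (apply picture args)]
      [else (error 'wikipedia "unknown message: ~a" msg)]))

  ;; A higher-order function: given a link-fetching function, produce a
  ;; function from a node to its list of edges (pairs of article titles).
  (define ((edges-via links) node)
    (for/list ([next (in-list (links node))])
      (cons node next)))

  ;; e.g. ((edges-via (lambda (t) (wikipedia 'f-links t))) "Smash Mouth")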

External Technology

  • net/url: Used for requesting web pages. net/uri-codec was used in conjunction with it to sanitize user input.
  • json: Used for parsing the retrieved JSON documents. All this does is convert them to nested hash tables and lists (see the first sketch after this list).
  • racket/gui: Used to present the user with a form instead of the REPL.
  • racket/draw: Used to create the visual representations of paths to display in the GUI.
  • threads: Not really an external technology, but threading was not covered in class and is integral to the functionality of the program. Its primary use was downloading multiple pages simultaneously, for a significant decrease in running time (see the second sketch after this list).
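
As a rough illustration, a single request for the links out of an article might look like the following. This assumes the standard MediaWiki endpoint and is a sketch, not the actual wikipedia.rkt code:

  #lang racket
  (require net/url net/uri-codec json)

  ;; Fetch up to 500 links out of an article and return their titles.
  (define (links-out title)
    (define u
      (string->url
       (string-append
        "https://en.wikipedia.org/w/api.php"
        "?action=query&prop=links&pllimit=max&format=json"
        "&titles=" (uri-encode title))))
    ;; read-json converts the response into nested hash tables and lists.
    (define doc (call/input-url u get-pure-port read-json))
    (for*/list ([page (in-hash-values (hash-ref (hash-ref doc 'query) 'pages))]
                [link (in-list (hash-ref page 'links '()))])
      (hash-ref link 'title)))

  ;; e.g. (links-out "Smash Mouth")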
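
And a sketch of the parallel-download pattern (the names here are hypothetical, not the program's own): spawn one thread per page, then collect each reply from the main thread's mailbox:

  #lang racket

  ;; Fetch several pages at once. Replies arrive in completion order,
  ;; so each result is paired with its title.
  (define (fetch-all titles fetch-one)
    (define parent (current-thread))
    (for ([t (in-list titles)])
      (thread (lambda () (thread-send parent (cons t (fetch-one t))))))
    (for/list ([_ (in-list titles)])
      (thread-receive)))

  ;; e.g. (fetch-all '("Smash Mouth" "List of unexplained sounds") links-out)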

Innovation

Traditional (read: sane/practical) pathfinding techniques rely on heuristics to keep the search time as low as possible. For instance, most implementations of A* use some function of the estimated distance from the current node to the finish and the length of the path from the start to determine where to search next. WikiPath does not have the pleasure of being able to use these techniques. In fact, even if some valid heuristic existed, the extra web requests it required would most likely double the execution time anyway.

The innovation in WikiPath is that instead of using a heuristic to avoid visiting all of Wikipedia's 4.5 million articles, it employs a variety of techniques to get close enough to the shortest path while still being able to complete the computation sometime this year. The search algorithm is a modified breadth-first search that expands from both the start and end points on the graph simultaneously. To avoid having to pull down an exponentially increasing number of pages, the algorithm dedicates only a fixed amount of time (about 30 seconds) to each breadth out into the graph. This does little harm to the minimum path length of the result, however: there are somewhere between 100 and 600 links to or from any given article, so using only a fraction of the nodes available doesn't impact path length significantly. Basically, the primary innovation in WikiPath is a kind of fuzzy graph search algorithm that balances fast enough against short enough.
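
In simplified form (illustrative names, not the actual pathfinder.rkt interface), the time-boxed bidirectional search looks something like this. Each hash table maps a node to the path (a list of nodes) leading back to that side's origin, and the two frontiers are grown in turn until they share a node:

  #lang racket

  ;; Expand the neighbors of every node in the table, abandoning the
  ;; rest of the level once the time budget runs out.
  (define (grow! table links budget-ms)
    (define deadline (+ (current-inexact-milliseconds) budget-ms))
    (for* ([node (in-list (hash-keys table))]
           #:when (< (current-inexact-milliseconds) deadline)
           [next (in-list (links node))]
           #:unless (hash-has-key? table next))
      (hash-set! table next (cons next (hash-ref table node)))))

  ;; Search from both ends, looping until the frontiers meet. (The real
  ;; program also downloads link lists in parallel; that is omitted here.)
  (define (search start end f-links b-links [budget-ms 30000])
    (define fwd (make-hash (list (cons start (list start)))))
    (define bwd (make-hash (list (cons end (list end)))))
    (let loop ()
      (define meet
        (for/first ([node (in-hash-keys fwd)]
                    #:when (hash-has-key? bwd node))
          node))
      (cond
        [meet (append (reverse (hash-ref fwd meet))
                      (rest (hash-ref bwd meet)))]
        [else (grow! fwd f-links budget-ms)
              (grow! bwd b-links budget-ms)
              (loop)])))

The per-level time cap is what makes the search "fuzzy": whichever neighbors happen to be fetched before the deadline are the ones considered.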

Technology Used Block Diagram

Additional Remarks

The Wikipedia API is remarkably inaccurate. The API for requesting the links in and out of articles often returns results that simply are not valid; I believe it is caching the results. Regardless, this adds about one extra page request for the links out of a node and about 25 more page requests for the links into a node. Had this not been the case, the running time could potentially have been reduced tenfold, or the minimum path length decreased slightly.

Also, due to the time-boxed nature of the pathfinding algorithm, the length of the paths increases as the internet connection quality degrades: a slower connection means fewer pages are fetched within each per-level time budget, so fewer nodes are explored at each breadth.

Statistics:

  • Approximate time to find a valid path: Close to 2 minutes on a good connection.
  • Approximate path length: Around 6 edges or 7 nodes.
  • Approximate number of web requests per computation: Around 500.
  • Lines of code: 713 across 4 files
  • Time spent: Maybe 40 hours?