December 10, 2010
The Search Engine Ranking Algorithm simulates Google's PageRank Algorithm using value iteration along with keyword evaluation. A webcrawler follows the links from a seed URL and downloads the content of the pages it visits. The downloaded content is then processed to rank the visited webpages.
Links collected by the webcrawler are evaluated along with page text.
The webcrawler visits each link to download and collect more links.
- The webcrawler follows the links on a page in a manner similar to Breadth-First Search. Links are added to a list as they are collected and are then visited in that order.
- The PageRank Algorithm is computed using value iteration. The PageRank formula is similar to the Bellman Update.
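The breadth-first visiting order described above can be sketched as follows. This is a minimal illustration, not the project's actual crawler: the class name `BfsCrawlSketch`, the method `crawlOrder`, and the in-memory `links` map (which stands in for downloading a page and extracting its links) are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: the link graph is supplied as a map instead of being
// fetched over the network, so the visiting order can be shown in isolation.
final class BfsCrawlSketch {

    /** Returns the pages in the order a breadth-first crawl would visit them. */
    static List<String> crawlOrder(Map<String, List<String>> links, String seed, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        Queue<String> frontier = new ArrayDeque<>(); // first in, first out gives BFS order

        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url);
            // Queue every newly discovered link behind the ones found earlier.
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }
}
```

The `maxPages` parameter plays the role of a page limit that bounds processing time; the `seen` set keeps a page from being queued twice when several pages link to it.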
The webcrawler is written to favor simplicity and ease of understanding.
This project uses a search engine built from scratch to rank webpages. The keyword evaluation procedure is the most innovative feature of this project.
This project also combines many processes that would normally be separate and runs them as one process containing smaller functions.
All the processing is done without using a database.
The entire project is implemented in Java.
Technology Used Block Diagram
Evaluation of Results
The webcrawler visits the initial website and stores the content for PageRank calculations and keyword evaluation. Additional links are collected and visited from the initial page.
Currently, the webcrawler visits only 10 pages to limit processing time, although more pages can be visited.
The entries of the transition matrix are multiplied by the PageRank array, and the result is added to the damping factor term, which is used to calculate the reward. The final PageRank value is a floating-point number that is added to the results of the keyword evaluation.
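As a rough illustration of this step, the sketch below uses the commonly cited form of the PageRank update, rank(i) = (1 - d) / N + d * sum over j of T[i][j] * rank(j), repeated for a fixed number of iterations. The project's exact formula is not given in this write-up, so the update rule, the handling of pages without outgoing links, and the names `PageRankSketch`, `transitionMatrix`, and `pageRank` are assumptions.

```java
import java.util.Arrays;

// Hypothetical sketch of PageRank by repeated updates over a transition matrix.
final class PageRankSketch {

    /** Builds a matrix where t[i][j] is the chance of moving from page j to page i. */
    static double[][] transitionMatrix(int[][] outLinks) {
        int n = outLinks.length;
        double[][] t = new double[n][n];
        for (int j = 0; j < n; j++) {
            if (outLinks[j].length == 0) {
                // Assumption: a page with no links jumps to every page equally.
                for (int i = 0; i < n; i++) {
                    t[i][j] = 1.0 / n;
                }
            } else {
                for (int i : outLinks[j]) {
                    t[i][j] += 1.0 / outLinks[j].length;
                }
            }
        }
        return t;
    }

    /** Applies the update a fixed number of times, starting from uniform ranks. */
    static double[] pageRank(double[][] t, double damping, int iterations) {
        int n = t.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += t[i][j] * rank[j]; // matrix row times the rank array
                }
                next[i] = (1.0 - damping) / n + damping * sum;
            }
            rank = next;
        }
        return rank;
    }
}
```

For example, `pageRank(transitionMatrix(new int[][] {{2}, {2}, {0}}), 0.85, 100)` ranks page 2 highest, since both other pages link to it. Like a Bellman update, each pass recomputes every value from the previous pass's values until they settle.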
The keyword frequencies are sorted from highest to lowest and the PageRank values are added to them.
The final result is used to rank the pages from top to bottom in the results page.
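One way to picture the combined score is the sketch below, which counts keyword occurrences in each page's text, adds the page's PageRank value, and sorts the pages by the total. The write-up does not say how keywords are matched, so the case-insensitive whole-word counting and the names `CombinedScoreSketch`, `keywordCount`, and `rankPages` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: score = keyword frequency + PageRank, highest first.
final class CombinedScoreSketch {

    /** Counts case-insensitive whole-word occurrences of the keyword (an assumed matching rule). */
    static int keywordCount(String text, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    /** Returns the page URLs ordered from highest to lowest combined score. */
    static List<String> rankPages(Map<String, String> pageText,
                                  Map<String, Double> pageRank,
                                  String keyword) {
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, String> page : pageText.entrySet()) {
            double rank = pageRank.getOrDefault(page.getKey(), 0.0);
            score.put(page.getKey(), keywordCount(page.getValue(), keyword) + rank);
        }
        List<String> urls = new ArrayList<>(score.keySet());
        urls.sort(Comparator.comparingDouble((String url) -> score.get(url)).reversed());
        return urls;
    }
}
```

Because keyword counts are whole numbers and PageRank values are fractions below 1, in this sketch the PageRank term mainly breaks ties between pages with the same keyword frequency.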
Normally, the search engine results would be returned to the user from the server hosting the search engine. It is also possible to display the results by saving an HTML file to the hard drive and opening it in a browser.
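Saving the results as a local HTML file might look like the sketch below. The page layout, the escaping helper, and the names `ResultsPageSketch`, `render`, and `save` are assumptions; the project's actual results page is not shown in this write-up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch: turn a ranked list of URLs into a simple HTML page.
final class ResultsPageSketch {

    /** Escapes characters that have special meaning in HTML. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds an ordered list of links, best-ranked page first. */
    static String render(List<String> rankedUrls) {
        StringBuilder html = new StringBuilder("<html><body><ol>\n");
        for (String url : rankedUrls) {
            String safe = escape(url);
            html.append("<li><a href=\"").append(safe).append("\">")
                .append(safe).append("</a></li>\n");
        }
        return html.append("</ol></body></html>\n").toString();
    }

    /** Writes the page to disk so it can be opened in a browser. */
    static void save(List<String> rankedUrls, Path file) throws IOException {
        Files.write(file, render(rankedUrls).getBytes(StandardCharsets.UTF_8));
    }
}
```

Calling `save(rankedUrls, Path.of("results.html"))` would write the file to the working directory; the file name here is only an example.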