December 10, 2010
Smart Reader is a text classifier that uses a naive Bayes algorithm with a trained language model to recognize the category to which a text file belongs by computing the probabilities of all its features.
- I used a Bayesian network to create a bag-of-words model of 23,357 conditionally independent features per category.
- The naive Bayes algorithm assumes that the probabilities are conditionally independent even if they are not, hence the term naive.
- I used Laplacian smoothing to account for features that were absent from the training data set to create a more complete, and hopefully accurate, model.
Statistical algorithms and Bayesian networks are emerging as an effective way to solve AI problems in general and specifically those in natural language. They are used to recognize speech by inferring hidden Markov models and in text classification tasks such as language identification, spam filtering, sentiment analysis for film or product reviews, as well as text mining and machine learning.
Technology Used Block Diagram
Evaluation of Results
I was able to correctly classify 7 out of 10 test files which are similar to the ones on which the classifier was trained. I expect its accuracy to decrease as the text it's classifying differs from the training set for two reasons. First, the training set consists of Reuters' newswires; therefore, the language model does not consider categories that are inconsistent with the 78 business wire ones. Second, this classifier does not account for words it has not seen previously and therefore does not learn. It throws away features that were not included in the training data set.
This project forced me to carefully evaluate the benefits of the naive Bayes approach to text classification. I realized that the entire process is dependent upon the data set. This inductive approach is valid as it allows for accurate language models given enough data. However, after listening to Marvin Minsky's talk I doubt that it could be considered truly intelligent as there is not anything in the algorithm or model that expects a particular outcome. It merely yields the most likely one.