December 12, 2011
My project, "Classification of User Input on iSENSE Using Bayes Nets," is an attempt to simplify user interaction when contributing data to the iSENSE system online. Using data from three different sources, I have trained a Bayesian Network to classify the headers in a data file into the types that an experiment expects, reducing the need for user intervention.
- A Bayesian Network is used to classify words in a data file into "Types" for an experiment.
- Laplacian Smoothing is used to help classify words that may not have been seen before.
- I have implemented my own optimizations to skip or restrict calculations that are unnecessary.
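As a rough illustration of the first two points, here is a minimal sketch of classifying a header word into a type with Laplacian smoothing. It is not the project's actual code: it assumes a simple naive-Bayes-style model, P(type | word) proportional to P(word | type) P(type), and the names `train`, `posterior`, and the sample words are hypothetical.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count how often each word was labelled with each type.

    examples: iterable of (word, type_name) pairs (hypothetical training data).
    """
    word_counts = defaultdict(Counter)  # word_counts[type][word] -> count
    type_counts = Counter()             # type_counts[type] -> total words seen
    vocab = set()
    for word, type_name in examples:
        w = word.lower()
        word_counts[type_name][w] += 1
        type_counts[type_name] += 1
        vocab.add(w)
    return word_counts, type_counts, vocab

def posterior(word, word_counts, type_counts, vocab, k=1.0):
    """Return P(type | word) for every type, using Laplacian smoothing with parameter k."""
    w = word.lower()
    total = sum(type_counts.values())
    n_types = len(type_counts)
    v = len(vocab) + 1  # one extra slot so unseen words get some probability mass
    scores = {}
    for t, n_t in type_counts.items():
        prior = (n_t + k) / (total + k * n_types)
        likelihood = (word_counts[t][w] + k) / (n_t + k * v)
        scores[t] = prior * likelihood
    z = sum(scores.values())
    if z == 0:  # only possible when k == 0 and the word was never seen
        return {t: 0.0 for t in scores}
    return {t: s / z for t, s in scores.items()}

examples = [("s", "Time"), ("seconds", "Time"), ("time", "Time"),
            ("temp", "Temperature"), ("celsius", "Temperature")]
model = train(examples)
print(posterior("seconds", *model))      # "Time" gets the larger share
print(posterior("humidity", *model))     # unseen word: smoothing keeps every type above zero
```

With k set to 0 an unseen word gets zero probability for every type, which is the case that smoothing is meant to avoid.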
My innovation is mostly on the iSENSE side of the project. Where the current implementation of column matching on the website simply does a string/substring match of the experiment headers against the file headers, I have created a Bayesian Network to supplement the matching. For example, under the current implementation the experiment header "Time" would only be matched to file headers containing the word "time" in some form. In the new implementation, words such as "s", "seconds", "data", "time", etc. would all be matched to "Time".
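To make the limitation concrete, the existing behavior can be approximated as a case-insensitive substring test. This is only a sketch of the idea as described here, not the website's actual code; `substring_match` and the sample headers are hypothetical.

```python
def substring_match(experiment_header, file_headers):
    """Return the file headers that contain the experiment header as a substring."""
    needle = experiment_header.lower()
    return [h for h in file_headers if needle in h.lower()]

file_headers = ["Time (s)", "seconds", "s", "Temp"]
print(substring_match("Time", file_headers))  # ['Time (s)']
```

Headers such as "seconds" or "s" are missed because they do not literally contain "time", which is the gap the Bayes Net is meant to fill.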
My innovations on the Artificial Intelligence side of the project include several optimizations that I have implemented. Because the algorithm knows which experiment a user is trying to contribute data to, it can restrict the searches through the Bayes Net to only the classifiers that are contained in that experiment. Also, because it can be assumed that the data needed for the experiment exists in the file, I can be more confident in the matches made by the Bayes Net and thus can lower the threshold for matching. Finally, I had the idea of doing dynamic Laplacian smoothing. When a word already exists in the dictionary, better matches are made without Laplacian smoothing. Therefore, a quick check to see whether a word is in the dictionary, followed by adjusting K accordingly, could lead to better matches.
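The three ideas above could be combined roughly as follows. This is a sketch under stated assumptions, not the implemented algorithm: it uses a uniform prior over the experiment's types, and the table `COUNTS`, the functions `choose_k` and `classify`, the threshold, and the two values of K are all hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical training counts: COUNTS[type][word] = times the word was labelled as that type.
COUNTS = {
    "Time": Counter({"time": 4, "s": 2, "seconds": 3}),
    "Temperature": Counter({"temp": 3, "celsius": 2, "c": 1}),
    "Distance": Counter({"m": 2, "meters": 2, "s": 1}),
}

def choose_k(word, counts, k_seen=0.0, k_unseen=1.0):
    """Dynamic smoothing: no smoothing for known words, some smoothing for unknown ones."""
    seen = any(word in c for c in counts.values())
    return k_seen if seen else k_unseen

def classify(word, counts, experiment_types, threshold=0.5):
    """Score the word only against the types the target experiment contains."""
    word = word.lower()
    k = choose_k(word, counts)
    vocab = set().union(*counts.values())
    v = len(vocab) + 1  # extra slot for unseen words
    scores = {}
    for t in experiment_types:  # restriction: skip every type outside the experiment
        n_t = sum(counts[t].values())
        scores[t] = (counts[t][word] + k) / (n_t + k * v)
    z = sum(scores.values())
    if z == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    p = scores[best] / z
    # Accept the best type only if its normalized score clears the threshold.
    return (best, p) if p >= threshold else (None, p)

# "s" is ambiguous across all types, but not within a Time/Temperature experiment.
print(classify("s", COUNTS, ["Time", "Temperature", "Distance"]))
print(classify("s", COUNTS, ["Time", "Temperature"]))
```

Restricting the candidate types both saves work and removes competitors such as "Distance" that the experiment does not expect, which is one reason a lower threshold may be acceptable.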
Evaluation of Results
In testing my implementation, I used three data sources to train the network. I realized that the data on the iSENSE system is not necessarily the best data to train on, because most of the experiments were created in the presence of an iSENSE team member, which may influence the results. Because of this, I created two additional sets of training data. The first I created entirely myself and used for testing. The second was created by starting a Google Docs form and sending it out to all of my friends with very lax instructions. This data set seems to work very well.
I ran several test files through the matcher based on the Bayes Net. These tests were explicitly the ones that would fail the current checks on the website. Each of these files passed all of the tests and was classified correctly.
Below is a graph showing the effect of K on the probabilities created by the Bayes Net. I think that this is proof of the need for Dynamic Smoothing, and I hope to complete this soon.
Although I was not able to build it into the iSENSE system (because iSENSE is a live site), I am interested in doing so in the future. I think that this is a good application of the technology and that iSENSE and its users can benefit from it.