Using the Python NLTK Bayesian Classifier for word sense disambiguation - 92% accuracy

November 30, 2010

Today's article will be going over some basic word sense disambiguation using the NLTK toolkit in Python and Wikipedia. Word sense disambiguation is the process of trying to determine if you mention the world "apple" are you talking about Apple the company or apple the fruit? I've read a few white papers on the subject and decided to try out some of my own tests to compare results. I also wanted to make it so no humans would be needed to see the initial data set and it could be done with data openly available. There are many example of classifiers out there but they all seem to focus on movie reviews so I figured another example may be helpful. Trained NLP professionals will perhaps balk on the simplistic approach but this is meant as more of an intro to NLTK and some of the things you can do with it.

I will demo this approach against real tweets from searching twitter for tweets with the word "apple" in them and creating a data set to test against. I suggest a winner take all vote off between 3 classification/similarity metrics. Jaccard Co-efficient, TF-IDF and Bayesian Classifiers. To the extent that if you were to run all three against an input tweet, whoever pulled 2 or more votes would win and give you a reasonable level of confidence. Although probably not the fastest solution, my goal is accuracy vs performance but your mileage may vary and also not trying to spend weeks developing involved solutions.

Here is a sample tweet: "I guess I'm making a stop at the Apple store along with my quest to find an ugly sweater tomorrow. boo!" It's easy for a human to determine we're talking about Apple the company in this tweet, however to a computer it's not so easy. We first need to find a dataset to seed our algorithms to compare against. Wikipedia has over 2 million ambiguous word definitions so it's important to not require manual training for each word or we'd never get anywhere. My first idea is to look at wikipedia itself. If you look at the disambiguation page for "apple" you can see there are a couple entries of importance: Apple, Inc and apple the fruit. To seed my dataset I suggest grabbing each wikipedia page and storing the complete text of that topic page, along with following each link in the first paragraph and storing the text from each link against the Apple company corpus. So we're grabbing the text from ,, ,, along with all the other wiki links that are in the first paragraph of the Apple topic page. This would be something you could easily script out by looking at the openly available wiki dump pages. So this approach could be used for all the seed data for ambiguous words.

This file: contains a corpus of text for Apple the company, I will do the same with apple the fruit and create a corpus of apple the fruit terms by going to the apple wiki topic and following the links in the first paragraph as well to create this file: . So we now have two corpuses of text that can programmatically be created. The next step we want to do is take in a tweet, tokenize it and try and find some similarity between the tweet and our corpus. For our tokenization we'll grab all the unigrams as well as what NLTK determines to be the most significant bigrams as well. We'll apply porter stemming to each word and also use the WordPunctTokenizer to break up words without punctuation.

First we'll try and train a simple Naive Bayesian Classifier using the NLTK toolkit to try and determine what label we should give a tweet, "company" or "fruit"? We're first going to take each blob of training data and use it to seed our classifier with unigrams and bigrams (two word combinations). We're going to use the NLTK classes to do some of the heavy lifting for us. We will also be porter stemming each word to it's root sense. So "clapping" becomes just "clap". This is to minimize the number of variances of words in the corpus.

Here is a sample file of around 100 random tweets I found with the word apple in it. We'll use this to see how well our classifier is doing. I also hand curated two training files just to verify how accurate our classifier is. We have the following training files available with tweets that were curated into fruit or company buckets. All I did was search "apple" on twitter and grabbed the first tweets I could find, the tweets are picked to increase accuracy, just random apple company and fruit tweets.

Training files:

If you uncomment out the line #run_classifier_tests(classifier) you'll see based on this training data our trained classifier can accurately guess the sense of a tweet with 92.13% accuracy. Not bad for a few hours of work. There are many improvements we can make to the classifier such as clustering around the common hashtags used in tweets it was accurately able to classify, adding trigrams, playing around with other features found in tweets, trying out different classifiers, etc....

Here is the complete classifier code:

If there is interest I'll post the Jaccard Coefficient script and TF-IDF ones as well. The Jaccard script was about 91-93 percent accurate as well.

hit me up on twitter with any comments: @jimplush

** UPDATE **
Oreilly had lead me to this PDF which also discusses using Wikipedia for word sense disambiguations:
it seems to also conclude that this approach is accurate as well as having increased value in the future as wikipedia gets smarter and you retrain your classifiers.


RSS feed for comments on this post.

Leave a Comment

Line and paragraph breaks automatic, HTML allowed: <a href="" title="" rel=""> <abbr title=""> <acronym title=""> <b> <code> <em> <i> <strike> <strong>

Comments disabled due to spammers being losers that lead sad lives.