Using the Python NLTK Bayesian Classifier for word sense disambiguation - 92% accuracy

November 30th, 2010

Today's article goes over some basic word sense disambiguation using the NLTK toolkit in Python and Wikipedia. Word sense disambiguation is the process of working out which meaning of a word is intended: if you mention the word "apple", are you talking about Apple the company or apple the fruit? I've read a few white papers on the subject and decided to try out some of my own tests to compare results. I also wanted to make it so no humans would be needed to build the initial data set and it could be done with data openly available. There are many examples of classifiers out there but they all seem to focus on movie reviews, so I figured another example may be helpful. Trained NLP professionals will perhaps balk at the simplistic approach, but this is meant as more of an intro to NLTK and some of the things you can do with it.

I will demo this approach against real tweets, searching Twitter for tweets with the word "apple" in them and creating a data set to test against. I suggest a winner-take-all vote-off between 3 classification/similarity metrics: Jaccard coefficient, TF-IDF and a Bayesian classifier. If you run all three against an input tweet, whichever label pulls 2 or more votes wins, which gives you a reasonable level of confidence. This is probably not the fastest solution; my goal is accuracy over performance, your mileage may vary, and I'm also not trying to spend weeks developing involved solutions.
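The two-out-of-three vote can be sketched in a few lines. This is only an illustration: the function name and the three sample labels are made up here, standing in for whatever the Jaccard, TF-IDF and Bayesian steps return for one tweet.

```python
from collections import Counter

def majority_vote(labels, min_votes=2):
    """Return the label with at least min_votes votes, or None if nothing wins."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None

# Hypothetical outputs of the three metrics for a single tweet.
print(majority_vote(["company", "company", "fruit"]))
```

Returning None when no label reaches the threshold leaves room for a metric that abstains or for more than two senses.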

Here is a sample tweet: "I guess I'm making a stop at the Apple store along with my quest to find an ugly sweater tomorrow. boo!" It's easy for a human to determine we're talking about Apple the company in this tweet, but for a computer it's not so easy. We first need to find a dataset to seed our algorithms to compare against. Wikipedia has over 2 million ambiguous word definitions, so it's important not to require manual training for each word or we'd never get anywhere. My first idea is to look at Wikipedia itself. If you look at the disambiguation page for "apple", http://en.wikipedia.org/wiki/Apple_(disambiguation), you can see there are a couple of entries of importance: Apple Inc. and apple the fruit. To seed my dataset I suggest grabbing each Wikipedia topic page and storing its complete text, then following each link in the first paragraph and adding the text from each linked page to the same corpus. So for the Apple company corpus we're grabbing the text from http://en.wikipedia.org/wiki/Apple_Inc. , http://en.wikipedia.org/wiki/NASDAQ, http://en.wikipedia.org/wiki/Multinational_corporation , http://en.wikipedia.org/wiki/Consumer_electronics, along with all the other wiki links that are in the first paragraph of the Apple topic page. This is something you could easily script by working from the openly available Wikipedia dumps, so the approach could be used to build the seed data for any ambiguous word.

This file: http://litfuel.net/plush/files/disambiguation/apple-company.txt contains a corpus of text for Apple the company. I will do the same for apple the fruit, going to the apple wiki topic and following the links in its first paragraph to create this file: http://litfuel.net/plush/files/disambiguation/apple-fruit.txt . So we now have two corpuses of text that can be created programmatically. The next step is to take in a tweet, tokenize it, and try to find some similarity between the tweet and each corpus. For our tokenization we'll grab all the unigrams as well as what NLTK determines to be the most significant bigrams. We'll apply Porter stemming to each word and use the WordPunctTokenizer to split words from punctuation.
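To make the tokenization step concrete, here is a rough standard-library stand-in. It is not the NLTK code from the linked script: the regex copies the splitting rule WordPunctTokenizer uses, but every adjacent word pair is kept as a bigram (NLTK's collocation scoring is skipped) and no stemming is applied. The function name is my own.

```python
import re

# Runs of word characters or runs of punctuation become separate tokens,
# the same rule WordPunctTokenizer applies.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def tweet_features(text):
    """Build a feature dict of lowercase unigrams and adjacent-word bigrams."""
    words = [t.lower() for t in TOKEN_RE.findall(text) if t.isalnum()]
    features = {word: True for word in words}
    for first, second in zip(words, words[1:]):
        features[first + " " + second] = True
    return features

print(sorted(tweet_features("Stop at the Apple store!")))
```

The dict-of-True shape is the kind of feature set NLTK classifiers accept, so the same function could feed a real classifier once stemming is added.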

First we'll train a simple Naive Bayesian Classifier using the NLTK toolkit to determine what label we should give a tweet: "company" or "fruit". We take each blob of training data and use it to seed our classifier with unigrams and bigrams (two-word combinations), using the NLTK classes to do some of the heavy lifting for us. We also Porter-stem each word to its root form, so "clapping" becomes just "clap". This minimizes the number of variants of each word in the corpus.
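For readers who want to see what a Naive Bayes classifier does under the hood, here is a toy version in plain Python. It is a sketch under stated assumptions, not NLTK's implementation and not the code behind the 92% figure: it uses whitespace tokens only, add-one smoothing, equal priors for both labels, and a four-line made-up training set.

```python
import math
from collections import Counter

def train(labeled_texts):
    """labeled_texts: iterable of (text, label). Returns word counts per label."""
    counts = {}
    for text, label in labeled_texts:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the label with the highest add-one-smoothed log likelihood."""
    vocab = set().union(*counts.values())
    best_label, best_score = None, float("-inf")
    for label, words in counts.items():
        total = sum(words.values()) + len(vocab)
        score = sum(math.log((words[w] + 1) / total) for w in text.lower().split())
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data standing in for the two Wikipedia corpora.
counts = train([
    ("iphone store launch stock", "company"),
    ("mac ipad store nasdaq", "company"),
    ("pie tree orchard juice", "fruit"),
    ("eat juice cider tree", "fruit"),
])
print(classify(counts, "stop at the apple store"))
```

Words that appear in neither corpus contribute the same penalty to both labels, so the decision rests on the words the corpora actually disagree about.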

Here is a sample file of around 100 random tweets I found with the word apple in them: http://litfuel.net/plush/files/disambiguation/apple-tweets.txt We'll use this to see how well our classifier is doing. I also hand curated two training files just to verify how accurate our classifier is. We have the following training files available, with tweets that were curated into fruit or company buckets. All I did was search "apple" on Twitter and grab the first tweets I could find; the tweets are not picked to increase accuracy, they are just random apple company and fruit tweets.

Training files:
http://litfuel.net/plush/files/disambiguation/apple-fruit-training.txt
http://litfuel.net/plush/files/disambiguation/apple-company-training.txt

If you uncomment the line #run_classifier_tests(classifier) you'll see that, based on this training data, our trained classifier can guess the sense of a tweet with 92.13% accuracy. Not bad for a few hours of work. There are many improvements we can make to the classifier, such as clustering around the common hashtags used in tweets it was able to classify accurately, adding trigrams, playing around with other features found in tweets, trying out different classifiers, etc.

Here is the complete classifier code: http://pastebin.com/4B1xHHht

If there is interest I'll post the Jaccard Coefficient script and TF-IDF ones as well. The Jaccard script was about 91-93 percent accurate as well.
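Until that script is posted, here is the textbook definition of the Jaccard coefficient as a minimal sketch. The token sets below are invented for illustration and are not taken from the corpus files.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical token sets for one tweet and the two corpora.
tweet = {"apple", "store", "sweater"}
company = {"apple", "store", "iphone", "nasdaq"}
fruit = {"apple", "tree", "pie", "orchard"}
print(jaccard(tweet, company), jaccard(tweet, fruit))
```

The tweet gets the label of whichever corpus scores higher.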

hit me up on twitter with any comments: @jimplush

** UPDATE **
O'Reilly led me to this PDF, which also discusses using Wikipedia for word sense disambiguation: http://www.cse.unt.edu/~rada/papers/mihalcea.naacl07.pdf
It also seems to conclude that this approach is accurate, and that its value will increase in the future as Wikipedia gets smarter and you retrain your classifiers.


Hadoop and Python Streaming

May 21st, 2010

I've been starting to write some hadoop and python streaming jobs and there isn't all that much documentation regarding it out there. Things like, how do I pass environment variables, how do I pass along modules that my scripts might need, etc...

here's a couple of quick tips... to pass environment variables to your tasknodes use this command line param when launching a hadoop job:
/Users/Hadoop/hadoop/bin/hadoop jar /Users/Hadoop/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
-mapper /Users/hadoop/code/traffic/mapper.py \
-reducer /Users/hadoop/code/traffic/reducer.py \
-input insights-input-small/* \
-output insights-output-traffic \
-cmdenv PYTHONPATH=$PYTHONPATH:/Users/jim/Code \
-cmdenv MYAPP__PATH=/Users/jim/Code \
-cmdenv MYAPP_ENVIRONMENT=development
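On the script side, values passed with -cmdenv show up as ordinary environment variables. A minimal sketch of a mapper reading them might look like this; the function names, the defaults and the tab-separated output format are my own assumptions, not the actual traffic mapper.

```python
import os

def load_settings(environ=os.environ):
    """Read the values passed with -cmdenv, falling back to defaults when unset."""
    return {
        "environment": environ.get("MYAPP_ENVIRONMENT", "development"),
        "app_path": environ.get("MYAPP__PATH", ""),
    }

def map_lines(lines, settings):
    """Emit one tab-separated key/value pair per non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield settings["environment"] + "\t" + line

# A real mapper would iterate over sys.stdin; a list keeps the demo self-contained.
for record in map_lines(["first\n", "\n", "second\n"], load_settings({})):
    print(record)
```

Passing the environment in as a parameter makes the mapper easy to test locally without launching a Hadoop job.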


if you want to distribute your modules to the tasknodes instead of having them installed on the target task nodes then you can zip up your module file, rename it to mymodule.mod and use this command line param

-file /Users/jim/Code/mymodule.mod

then in your script you can import it straight from the archive with zipimport (no unzipping needed)

import zipimport
# mymodule.mod is the zip archive shipped with -file
importer = zipimport.zipimporter('mymodule.mod')
insights = importer.load_module('mymodule')
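If you want to try the zip trick locally first, the sketch below builds a throwaway archive and imports from it. It swaps in a different technique from the snippet above: the archive is put on sys.path and imported normally, since load_module is deprecated in newer Python 3 releases. The archive and module names are made up for the demo.

```python
import importlib
import os
import sys
import tempfile
import zipfile

def import_from_archive(archive_path, module_name):
    """Put the archive on sys.path and import a module from inside it."""
    if archive_path not in sys.path:
        sys.path.insert(0, archive_path)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)

# Build a throwaway archive so the sketch runs anywhere (hypothetical names).
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "demo_shipped.mod")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_shipped.py", "def greet():\n    return 'hello from the archive'\n")

module = import_from_archive(archive, "demo_shipped")
print(module.greet())
```

The file extension does not matter to Python's zip import machinery, which is why renaming the archive to .mod works.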


hope that helps someone :)



How to script multiple telnet sessions

August 4th, 2006

I have a problem: a pain in the ass in my day whenever I need to deploy code to testing racks. Here's the situation....
We have multiple testing racks in the building to test entertainment systems. We also have multiple web servers on those racks that are basically stand-alone web boxes used to interact with the system. Let's say we have 4 racks with 3 terminals each, and every terminal has to be loaded by hand (don't ask me why, man!). So when I need to update my code for the developers and testers to use, I have to go through this little process:

telnet to rack's gateway box
from gateway box, telnet to the web terminal
ncftp back to the gateway box
retrieve my latest code tar file
exit ncftp
untar new code
place code in proper directory
exit

That's EACH terminal. As you can imagine that could take a while with 4 racks, 3 terminals each. Silly me didn't think it was possible to actually script multiple telnet sessions, e.g. telnet to box1, then from box1 telnet to box2. I was wrong. Using the "Expect" program (http://expect.nist.gov/), which comes standard on most Linux and Mac systems (there's a Windows version too), you can script this without blinking an eye. From what I've found this is the only really decent way of doing it; everything else is just a hack and doesn't work reliably.


#!/usr/bin/expect
#---------------- TELNET TO INITIAL BOX1 -------------------------#
spawn telnet 203.288.183.144

#LOGIN
expect -re "login"
send "myUser\r"
expect -re "Password:"
send "myPassword\r"
expect "*box1]$*"

# TELNET TO BOX2 FROM BOX1
send "telnet 172.17.0.33\r"
expect "/ #"

# you now have command over box2 to do whatever commands you wish!
send "ls\r"
send "exit\r"
expect -re "box1"
send "exit\r"


Basically...spawn telnet 203.288.183.144 means start a new telnet session at this IP address

Now I expect I'm going to get prompted for a login right off the bat
expect -re "login"

I use the -re flag, which means use regular expressions. So somewhere in the output the terminal is showing, I expect the word "login"

So I get "login", and now I want to send it my login ID
send "myUser\r"

I'm sending my login name "myUser\r" notice the \r which acts as a user hitting the RETURN key on the keyboard.

Now I expect after I enter my username it wants my password
expect -re "Password:"

So I send it my password
send "myPassword\r"

Now I'm logged in and I expect to have a command prompt on box1
expect "*box1]$*"

Now I'm on box1 and I want to telnet to box2 from box1
send "telnet 172.17.0.33\r"
expect "/ #"

and boom, I'm on the command line of the 2nd box in my script and I can execute anything I want just using send and expect.


Using this simple technique you can automate all your interactive tasks such as telnet, ftp, ssh, etc. It's a whole new world. I can take more vacation now. Hope this helps someone out there looking to script telnet sessions :)
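With 4 racks and 3 terminals each, one option is to generate an expect script per terminal from a template instead of editing IPs by hand. The sketch below does that in Python; the addresses, the function name and the assumption that every gateway shows a "box1" prompt are all hypothetical.

```python
from string import Template

# Mirrors the login / hop sequence from the script above.
# string.Template uses $name placeholders, so a literal $ is written as $$.
EXPECT_TEMPLATE = Template("""#!/usr/bin/expect
spawn telnet $gateway
expect -re "login"
send "$user\\r"
expect -re "Password:"
send "$password\\r"
expect "*box1]$$*"
send "telnet $terminal\\r"
expect "/ #"
send "exit\\r"
expect -re "box1"
send "exit\\r"
""")

def render_scripts(racks, user, password):
    """Return one expect script per (gateway, terminal) pair."""
    scripts = {}
    for gateway, terminals in racks.items():
        for terminal in terminals:
            scripts[(gateway, terminal)] = EXPECT_TEMPLATE.substitute(
                gateway=gateway, terminal=terminal, user=user, password=password)
    return scripts

# Hypothetical addresses: 4 racks with 3 terminals each.
racks = {"10.0.%d.1" % r: ["172.17.%d.%d" % (r, t) for t in (33, 34, 35)]
         for r in range(1, 5)}
scripts = render_scripts(racks, "myUser", "myPassword")
print(len(scripts))
```

Each rendered string can be written to a file and run with expect; the deployment commands would go between the second telnet and the first exit.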

How to get your Firefox Extension working on Minimo

June 1st, 2006

One of my fun upcoming tasks is to get a fully event-driven, desktop-like application working on a handheld device like an IPAQ. Ideally, I was hoping to reuse my Firefox extension that acts as a socket server, listening for events from the outside system and having the app respond to them. For example: you're on a plane and someone wants to order a drink, so they hit the service button on their seat. That button sends a message that is broadcast out to the rest of the system, and if the crew member assigned to your section is signed in on their IPAQ they should get a little alert on their device, all asynchronously.

Enter Minimo...
Since I can't reuse my extension with Pocket IE, Minimo was my obvious choice. Minimo is the "mini" Firefox browser for handheld devices. Although it is still VERY much in its infancy, I was able to get my extension running with the help of the Mozilla forums.

This is a very basic tutorial on how to get a basic extension working in Minimo. Since there is no packaging mechanism for Minimo extensions you kind of have to hack your way through it. First things first: you have to install Minimo on a handheld device. The good news is that was very simple, just download and follow the installer on Windows. Next let's look at the file structure of the Minimo application on the handheld device. Open up File Explorer on the handheld and you can see it's:

\Program Files\Minimo\chrome
\Program Files\Minimo\components
\Program Files\Minimo\greprefs
\Program Files\Minimo\res

The only thing we're going to do is drop 2 files into the Chrome directory
1. mytest.manifest
2. mytest.jar

There is no XPI file. We're just going to create an extension that shows an alert() message from the XUL overlay file. This will get you started and at least let you know if you're able to install an extension.

Let's take a look at the first file we need, mytest.manifest.
This file contains the path information so Minimo knows what to do with our extension. It's a very simple file with two lines:

content mytest jar:mytest.jar!/content/ xpcnativewrappers=no
overlay chrome://minimo/content/minimo.xul chrome://mytest/content/overlay.xul

On the 1st line we're saying our jar file is mytest.jar and our junk can be found in the /content/ directory of our jar file.
On the 2nd line we're saying overlay the overlay.xul file in my content directory over the minimo.xul file.

Take a careful look at the paths and make sure your extension matches what's going on here.

So lets take a look at our extension folder on our local machine:
C:\mytest\content\overlay.xul
C:\mytest\mytest.manifest

So now we have to create our overlay.xul file:

<?xml version="1.0"?>
<overlay id="pacsock_overlay"
         xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul">
  <script type="application/x-javascript">
    alert('hey dude!');
  </script>
  <!-- Context Menu -->
</overlay>


That's it, just a simple alert message that says "hey dude!". So now we have the two files we need. The next step is to right click on the content directory inside the C:\mytest directory and choose SEND TO -> COMPRESSED FILE, which will create a .zip file next to the content directory. Rename that file to mytest.jar
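If you would rather script the packaging than right-click, the jar is just a zip whose entries keep the content/ prefix. Here is a minimal sketch; the function name is mine and the demo uses a temporary directory in place of the real extension folder.

```python
import os
import tempfile
import zipfile

def build_jar(extension_dir, jar_name="mytest.jar"):
    """Zip extension_dir/content into extension_dir/jar_name, keeping the content/ prefix."""
    jar_path = os.path.join(extension_dir, jar_name)
    content_dir = os.path.join(extension_dir, "content")
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        for root, _dirs, files in os.walk(content_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Entry names must be relative and use forward slashes.
                arcname = os.path.relpath(full_path, extension_dir).replace(os.sep, "/")
                jar.write(full_path, arcname)
    return jar_path

# Demo with a throwaway directory standing in for the mytest folder.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "content"))
with open(os.path.join(demo_dir, "content", "overlay.xul"), "w") as fh:
    fh.write("<overlay/>")
jar_path = build_jar(demo_dir)
print(zipfile.ZipFile(jar_path).namelist())
```

Keeping the content/ prefix inside the archive is what makes the jar:mytest.jar!/content/ path in the manifest resolve.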

Now we're ready to drop that sucker into Minimo. Open up Microsoft Active Sync and click Explore


Next click on My Windows Mobile-Based Device, Program Files, Minimo and you should be looking at a screen that looks like this:


Now go back to your local machine's "mytest" directory and copy the mytest.manifest file and the mytest.jar file over to the "chrome" directory in the ActiveSync file explorer page. Your chrome directory should now contain both mytest.manifest and mytest.jar.


That's it! Now just close Minimo and reopen it, and it should prompt you with an alert that says "hey dude!"







AJAX and Unit Testing - it's time to mingle

February 13th, 2006

I've decided to write a little two-part introduction to unit testing your AJAX applications with JSUnit. AJAX applications are adding a new complexity to our development lives: they introduce business logic into our presentation tier. It is no longer enough to write some ad hoc javascript form validation functions that work most of the time. You now need to take accountability for your javascript code, as it can affect your business logic on the server side.

As briefly as I can describe it, unit testing gives a developer freedom. "Freedom?? But I have to write more code, how the hell is that free?" Unit testing can free you as a developer by making sure your code is accounted for with automated tests. If you're in a multi-developer environment this is crucial to the success and bug management of your project. Imagine you spend weeks writing this great class library: all the code is pretty, well documented, works great! Now Kevin comes in and decides he wants to change the functionality of one of your methods to match something he's working on. His code now works, but that change just broke 10 other areas where the method was being used. F'ing Kevin! You will never catch this unless you're monitoring every change list that developers submit, or you have an automated testing harness that runs the code nightly (at least) and reports failures. If you had a properly tested library, this would have been caught as soon as Kevin ran the unit test suite: he would realize that the code he changed did have a purpose, and he could look at the failing test to see what that purpose was. That, my friend, is the power of the unit test.

Ok so with that out of the way let's talk about how to harness your new great AJAX code!


The first thing you're going to need is a unit test suite for javascript. The current gold standard is JSUnit (http://www.edwardh.com/jsunit/). JSUnit allows you to write html pages with javascript assertions built in. Download the latest JSUnit package to your hard drive and unzip it; it unpacks into a jsunit/ directory. The main thing you will be concerned with is the jsUnitCore.js file in the jsunit/app directory. That's the file that makes the magic happen, and you will need to include it in your testing scripts. The other file of note is jsunit/testRunner.html. This will be the main test runner page that you will use to view the status of your tests. Mmmmm the Green bar!

Lets take a look at this image to give us an overview of how we want our testing structure setup to start with.


What this relationship shows is that we have our main testRunner.html page which we'll access through the browser. If you take a look at this UI shot of the testRunner interface, you'll notice a UI text box to put in a test script.


This test script can be a single page or a "suite" of pages. The nice part about the suite is you can have suites of suites. Say you have an inventory module you are working on and an order entry module you just finished. The inventory module has 2 javascript classes in 2 files, so you have created two test scripts so far. You can roll those into what's called a "suite" so you only have to run one page to run both sets of tests. You could also have a "TOTAL PACKAGE" suite that runs all your modules. Sometimes, though, you might just want to test the module you're working on for speed's sake.

Let's jump into our first script just to make sure we're up and running. That is all this post covers; I will post another write-up on how to actually test your ajax application.

Environment setup:
Let's assume we downloaded jsunit to our root directory so our web structure looks like this:

Htdocs/jsunit/

First things first. Let's make sure everything unpackaged correctly and go to: http://localhost/jsunit/testRunner.html

(If you installed to a different location, point your browser to the appropriate place. On Unix servers paths are case sensitive, so take note of the capital R in testRunner.html.)

You should now be looking at the same UI screen that I've shown at the beginning of this page.

Now let's write our first script in the TDD (test driven development) fashion where tests come before code. We just want to make sure we're working properly so here we go.

Step 1:
Create a file called testOfJSUnit.htm

<html>
<head>
<title>Test Page for Inventory Module</title>
<script language="javascript" src="http://localhost/jsunit/app/jsUnitCore.js"></script>
</head>
<body>
<script language="javascript">

function testSetupWorks() {
    assertEquals("Should have equaled true for setup", 1, 1);
}
</script>
</body>
</html>


So all we're trying to do is assert that 1 does in fact equal 1 which will give us a passing score. We just want to make sure we're all setup properly. So now we have a bonafide JSUnit test script. Here's what we did:

You'll notice at the top this line
<script language="javascript" src="http://localhost/jsunit/app/jsUnitCore.js"></script>


That's the meat and potatoes. We want to include the core library for JSUnit so we have access to all the assert methods.

function testSetupWorks() {
    assertEquals("Should have equaled true for setup", 1, 1);
}


The function above is a test function. Any function whose name begins with "test" gets run by JSUnit. We're using the function assertEquals, which here takes 3 arguments: a comment to show if the test fails, the expected value, and the actual value to check against it. So we're testing that 1 is equal to 1.
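JSUnit itself is JavaScript, but the two conventions just described (run whatever is named "test...", and compare an expected value with an actual one) are easy to see in a tiny Python imitation. Everything below is illustrative and none of it is JSUnit's API.

```python
def assert_equals(comment, expected, actual):
    """Three-argument form: failure comment, expected value, actual value."""
    if expected != actual:
        raise AssertionError("%s: expected %r, got %r" % (comment, expected, actual))

def setup_works():
    assert_equals("Should have equaled 1 for setup", 1, 1)

def deliberately_broken():
    assert_equals("Shows what a red bar looks like", 1, 2)

def run_page(functions):
    """Run only the entries whose name starts with 'test'; map name -> passed."""
    results = {}
    for name, func in functions.items():
        if not name.startswith("test"):
            continue
        try:
            func()
            results[name] = True
        except AssertionError:
            results[name] = False
    return results

# Hypothetical test page: two test functions and one helper that is skipped.
page = {
    "testSetupWorks": setup_works,
    "testBroken": deliberately_broken,
    "helperNotRun": deliberately_broken,
}
results = run_page(page)
print(results)
```

A single False in the results corresponds to the red bar in the test runner; all True is the green bar.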

Now go back to your browser and make sure you're at http://localhost/jsunit/testRunner.html

You'll see that text input where it says: "Enter the filename of the Test Page to be run:" In there type in the path to the file you created, so for me I would type in:
"localhost/Projects/test/testOfJSUnit.htm". Notice they already add the http:// for you. Click the run button and you should see a green bar for a passing test like this below:



If you see the green bar congrats! You have successfully installed JSUnit. This week I'll be doing a write up on how to test some common ajax functionality as well as setting up a test suite to run all our tests. If you didn’t get JSUnit installed correctly visit their mailing list here: http://groups.yahoo.com/group/jsunit/

Set up an apache virtual host in 30 seconds for windows

September 14th, 2005

If you're a developer working with apache and have multiple folders in your web root for client projects, you really need to make use of virtual hosts. A virtual host allows you to type in clienta.localhost and map that to C:\inetpub\wwwroot\clients\clienta, and likewise map clientB.localhost to C:\inetpub\wwwroot\clients\old\clientB

In your browser you can just type in clientB.localhost and you're all set. So here's how to do it in 30 seconds on windows

First thing we need to do is edit our hosts file so the machine knows where to go when you type in that address. Open it up in notepad:
C:\Windows\system32\drivers\etc\hosts (may be different on your system)

add these lines and save the file
127.0.0.1 clienta.localhost
127.0.0.1 clientB.localhost

now open up the httpd.conf file in C:\Program Files\Apache Group\conf\httpd.conf
and find the VIRTUAL HOST section. (search for "VirtualHost example:")

add these lines
NameVirtualHost 127.0.0.1

<VirtualHost 127.0.0.1>
DocumentRoot "C:\inetpub\wwwroot\clients\clienta"
ServerName clienta.localhost
</VirtualHost>

<VirtualHost 127.0.0.1>
DocumentRoot "C:\inetpub\wwwroot\clients\old\clientB"
ServerName clientB.localhost
</VirtualHost>

Save that file and restart apache. Now open up your browser of choice, type in clientB.localhost and boom, you're virtual.
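Once you have more than a couple of clients, the hosts lines and VirtualHost blocks are repetitive enough to generate. Here is a small sketch that prints both from one mapping; the function names are made up, and the output is meant to be pasted into the hosts file and httpd.conf by hand.

```python
def hosts_lines(sites, ip="127.0.0.1"):
    """One hosts-file line per hostname."""
    return ["%s %s" % (ip, name) for name in sites]

def vhost_blocks(sites, ip="127.0.0.1"):
    """NameVirtualHost directive followed by one <VirtualHost> block per site."""
    parts = ["NameVirtualHost %s" % ip]
    for name, docroot in sites.items():
        parts.append(
            '<VirtualHost %s>\nDocumentRoot "%s"\nServerName %s\n</VirtualHost>'
            % (ip, docroot, name))
    return "\n\n".join(parts)

# Hostname -> document root, using the two example clients from this post.
sites = {
    "clienta.localhost": r"C:\inetpub\wwwroot\clients\clienta",
    "clientB.localhost": r"C:\inetpub\wwwroot\clients\old\clientB",
}
print("\n".join(hosts_lines(sites)))
print(vhost_blocks(sites))
```

Generating both files from the same mapping keeps each ServerName in step with its hosts entry.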

PHP Bitwise Tutorial - Bits, Bytes, Binary Math and Use Cases

April 16th, 2005

I created a tutorial on PHP's bitwise operators and how they work. It's a helpful tutorial for anyone who is looking to understand any part of bits, bytes, binary math or the PHP bitwise operators and why they can be beneficial.

http://www.litfuel.net/tutorials/bitwise.htm

hope it can help someone out there :)

What this tutorial will cover:
1. What are bits and bytes
2. PHP's Bitwise Operators
3. A simple use case for why you would want to use bitwise operators
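As a taste of the kind of use case meant here, permission flags are a common one: each permission gets its own bit, and a single integer stores any combination. The sketch below is in Python, whose |, & and ~ operators behave like PHP's for this purpose; the flag names are invented and this is not the example from the tutorial itself.

```python
# Each flag occupies its own bit: 0b001, 0b010, 0b100 (hypothetical names).
READ, WRITE, DELETE = 1, 2, 4

def grant(perms, flag):
    """Turn a bit on with OR."""
    return perms | flag

def revoke(perms, flag):
    """Turn a bit off by ANDing with the inverted flag."""
    return perms & ~flag

def has(perms, flag):
    """A flag is set when ANDing leaves it intact."""
    return (perms & flag) == flag

perms = grant(grant(0, READ), WRITE)
print(perms, has(perms, WRITE), has(perms, DELETE))
```

The parentheses in has() matter if you port this to PHP, where == binds tighter than &.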

BBLOG PHP Syntax Highlighting Tutorial

March 30th, 2005

I'll quickly describe how to get PHP Syntax highlighting using BBCode in your BBLOG application using GESHI (you can apply this same technique to any application). What we're going to do is enable our blog to support [ php] tags

Step 1.
Download Geshi from http://sourceforge.net/project/showfiles.php?group_id=114997

Step 2.
Unzip the files to your webdirectory that your blog is in. For example mine is located in litfuel.net/plush/bblog/bBlog_plugins/geshi/

Step 3.
open up the file modifier.bbcode.php script in bblog/bBlog_plugins/ folder. At the top of this file add the following
// this is the path to the geshi.php file yours will obviously be different so change this!
$include_path = "/plush/bblog/bBlog_plugins/geshi/";
// include main geshi file
include_once($include_path."geshi.php");


Step 4.
Add this block of code RIGHT BEFORE THE RETURN statement in the smarty_modifier_bbcode function. So locate
return (nl2br($ret));
and add the code below right above it.

/*------------------------------------------------------------------------------------------*/
// Jim Plush Add on for PHP Syntax Highlighting questions? jiminoc@gmail.com
// CHECK FOR PHP TAGS
$regex = '/\[php\](.*?)\[\/php\]/si';
// GRAB THE PHP CODE WE WANT TO HIGHLIGHT IN $matches[1]
preg_match_all($regex, $ret, $matches);
// set path to the geshi FILES folder - notice I have the double geshi now this is where the php.php file is located
$path = "/litfuel.net/plush/bblog/bBlog_plugins/geshi/geshi/";
// now we have to loop through all our matches because we can have multiple php brackets in our post
$cnt = count($matches[1]);
for($i=0; $i < $cnt; $i++)
{
    // Create a GeSHi object where php is the language we want to use
    $geshi = new GeSHi($matches[1][$i], 'php', $path);
    // lets enable line numbers so people can comment based on the line
    $geshi->enable_line_numbers(GESHI_NORMAL_LINE_NUMBERS);
    $phpcode = $geshi->parse_code();
    $ret = str_replace($matches[0][$i], $phpcode , $ret);
}
/*------------------------------------------------------------------------------------------*/


That's it! Now whenever you want PHP syntax highlighting in a blog post, just wrap the code in [ php ] my php stuff [/ php ] tags and use the BBCode Entry Modifier :)
There are tons of advanced features you could add with geshi but I tried to keep it super simple for the moment.
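Since the same technique applies to any application, here is the find-and-replace core of it in Python. The regex matches the one in Step 4 (case-insensitive, dot matches newlines, non-greedy), while highlight() is only a placeholder that escapes the code and wraps it in a pre tag; it stands in for GeSHi and does no real highlighting.

```python
import html
import re

PHP_TAG = re.compile(r"\[php\](.*?)\[/php\]", re.IGNORECASE | re.DOTALL)

def highlight(code):
    """Stand-in highlighter (not GeSHi): escape the code and wrap it in <pre>."""
    return '<pre class="php">' + html.escape(code) + "</pre>"

def render_php_tags(text):
    """Replace every [php]...[/php] block with its highlighted form."""
    return PHP_TAG.sub(lambda m: highlight(m.group(1)), text)

post = "Before [php]echo '<b>hi</b>';[/php] after"
print(render_php_tags(post))
```

Using a substitution callback handles multiple tagged blocks in one pass, which is what the loop over $matches does in the PHP version.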

SimpleXML and RSS Grabbing

March 30th, 2005

With PHP5 comes the easiest XML tool yet.. SimpleXML. Here is an example of how to print out information from an RSS feed (v 2.0 + 1.0)

<?php
// lets set our default to yahoo news stories
$url = 'http://rss.news.yahoo.com/rss/topstories';
/*---------------------------------------------------------------------------------------*/
// grab the contents of the rss feed
$rss_file = file_get_contents($url);

// load up our simple xml object
if(!$feed = simplexml_load_string($rss_file))
{
    die("Cannot load RSS Feed. This application supports RSS 1.0 and 2.0");
}
/*---------------------------------------------------------------------------------------*/

// print out the title of the feed
echo '<p>';
echo $feed->channel->title;
echo '</p>';

// check for RSS Version: 2.0 nests items under channel, 1.0 keeps them at the root
$items = ($feed['version'] != '') ? $feed->channel->item : $feed->item;

// PRINT OUT ITEMS
/*---------------------------------------------------------------------------------------*/
foreach($items as $item)
{
    echo '<a href="'.$item->link.'">'.$item->title.'</a><BR>';
    echo $item->description.'<BR><BR>';
}
/*---------------------------------------------------------------------------------------*/
?>
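For comparison, the same idea in Python's standard library looks like the sketch below. It covers RSS 2.0 only (RSS 1.0 needs namespace handling that is left out), and it parses an inline sample feed rather than fetching the Yahoo URL, so the feed contents are made up.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed so the example runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <description>Summary one</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
      <description>Summary two</description>
    </item>
  </channel>
</rss>"""

def parse_rss2(xml_text):
    """Return (feed title, list of item dicts) for an RSS 2.0 document."""
    channel = ET.fromstring(xml_text).find("channel")
    items = [
        {field: item.findtext(field, "") for field in ("title", "link", "description")}
        for item in channel.findall("item")
    ]
    return channel.findtext("title", ""), items

title, items = parse_rss2(SAMPLE_FEED)
print(title)
for item in items:
    print(item["title"], item["link"])
```

findtext with a default of "" keeps the loop from breaking on items that omit a field.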

Regular Expression Magic! Pattern Naming and Comments

March 22nd, 2005

I wrote up a little tutorial last night on some advanced but EASY and HELPFUL features you can add to your regex arsenal.

I focus briefly on Pattern Naming and Commenting your regular expressions. If you don't know about these two items already you MUST read this little guide, you'll love it :)

http://www.litfuel.net/tutorials/regex.htm
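If you just want a quick feel for the two features before clicking through, here they are in Python's re module, which supports both. This is my own small example, not one from the tutorial: named groups use (?P<name>...) and the VERBOSE flag lets you spread the pattern out and comment it.

```python
import re

DATE_RE = re.compile(r"""
    (?P<month>[A-Z][a-z]+)      # month name, e.g. March
    \s+
    (?P<day>\d{1,2})            # day of month
    (?:st|nd|rd|th)? ,          # optional ordinal suffix, then a comma
    \s+
    (?P<year>\d{4})             # four-digit year
""", re.VERBOSE)

match = DATE_RE.search("March 22nd, 2005")
print(match.group("month"), match.group("day"), match.group("year"))
```

Pulling values out by name instead of by position means the pattern can gain or lose groups without breaking the code that reads the match.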