Wednesday, January 30, 2013

Machine Learning: Natural Language Processing Using NLTK


Introduction

Last year (2012), I explored the use of Pig, a data flow and processing language that is part of the Hadoop ecosystem. In this article, I will be exploring the topic of document classification in natural language processing. A few months ago (in the run-up to the 2012 U.S. Presidential Elections), I was exploring R and came across a YouTube video showing how R could be used to ingest a set of speeches by the Presidential candidates and then predict the speaker of speeches it had not seen before.

That video got me started exploring natural language processing (aka NLP), and here I am now, blogging an article about it!

NLP can be looked at from various perspectives and has many applications. In this article, I will focus on classification and walk through a practical example using Python and a natural language processing module called nltk (the Natural Language Toolkit). So what is classification? In the simplest sense, it involves taking a set of documents (any document or text, e.g. emails, books, speeches, etc.) and labeling, classifying or grouping them as appropriate. For example, you can take news articles and classify them into sports, political or business articles. You need a sufficient number of documents so that they can be statistically analyzed. Once a program has "learnt enough" about those documents, it can be fed a set of documents that haven't been classified and attempt to determine their label or group, i.e. classify them based on the statistical analysis done on the earlier set of documents. The program can then be iteratively adjusted to tune certain parameters and improve its accuracy.

This process is based on quite a bit of statistics and quantitative analysis, but I will not delve into the mathematical background as it has been handled quite well elsewhere. Instead, I will walk through the program (see the bottom of this article) so that you can use it as is or customize/improve it for a variety of document classification use cases.
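To give a feel for what the classifier in the program does, here is a minimal, self-contained sketch using nltk's NaiveBayesClassifier with made-up featuresets and labels; the real program builds its featuresets from n-grams extracted from the training files, as described later.

# Minimal sketch: train a Naive Bayes classifier on tiny, made-up featuresets
from nltk.classify import NaiveBayesClassifier

# Hypothetical training data: (featureset, label) pairs
train_set = [
    ({'economy': True, 'inflation': True}, 'business'),
    ({'election': True, 'senate': True}, 'political'),
    ({'touchdown': True, 'playoffs': True}, 'sports'),
    ({'budget': True, 'economy': True}, 'business'),
]
classifier = NaiveBayesClassifier.train(train_set)

# Classify an unseen "document" represented by its features
print classifier.classify({'inflation': True, 'budget': True})   # expected: business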

Pre-requisites for the Python Program

Ensure that you have an installation of Python 2.5 and the nltk Python package. You can check that your setup is fine by running just the "import" statements from the program. If you do not see any errors, your setup and installation are complete. The authors of nltk have also made available a very good online book/tutorial. The Python program makes use of the Naive Bayes classifier; if you are interested in the details, you can read Wikipedia and this article/chapter from a good online book on language processing and information retrieval.
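For convenience, here are the nltk imports from the program collected into a quick check you can paste into a Python shell. Note that the English stopwords corpus must also be available; if it has not been downloaded yet, running nltk.download('stopwords') should fetch it.

# Quick sanity check: these are the nltk imports the program below uses
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
from nltk.classify import NaiveBayesClassifier
print "nltk imports OK"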

And to try out the program, you will need two sets of documents. The first set, called the training document set, contains documents whose classification is already known. The second set is the set of documents that you will provide as a "test" to the program to see how well it performs. Inspired by the YouTube video, I chose to use US Presidents' economic reports available here. I selected only those presidents that had 5 or more reports to ensure there is a sufficient statistical sample available to process. Consequently, I took all the speeches from Harry Truman, Dwight Eisenhower, Lyndon Johnson, Ronald Reagan, Bill Clinton and George Bush Jr, set aside two speeches for the test set and put the remaining in the training set. I just "selected" the text from the browser and saved it as is - no magic or additional processing done.

Next you will need to arrange the documents as follows. Create a top-level directory for the training data (say, directory name = Train). Next create a sub-directory under that for each document type (e.g. the name of each president in the case of the economic reports). Then copy/move all the training documents into their respective document type directories. The document type is referred to as the label or classification in NLP parlance. As for the test documents, you can dump all of them into a single directory and let the Python program try to classify them (hopefully correctly!). See the diagram below for what I have done.
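In text form, the layout looks roughly like this (the directory names are the labels I used; the file names are just illustrative):

    Train/
        Truman/
            truman_report_1.txt
            truman_report_2.txt
            ...
        Eisenhower/
            ...
        Johnson/
            ...
        Reagan/
            ...
        Clinton/
            ...
        Bush/
            ...
    Test/
        unclassified_speech_1.txt
        unclassified_speech_2.txt
        ...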




Inside the Program

The program is small but effective. The key parts are the two loops that process the training and test directories and the "get_featureset" function. The get_featureset function takes a file as an argument, tokenizes the file content, removes English stopwords and then finds various kinds of bigrams and trigrams. The n-grams are de-duped and a set of "features" is created for each file. The featureset is essentially a dictionary where each n-gram is a key and the boolean value True is its value. Note that in this approach, all n-grams are considered of equal weight within each input training file. The featureset from each training file is labeled with the name of the directory the file came from. The labeled featuresets are used to train the program. The program is then tested against a series of test files contained in the directory provided as the second parameter to the program.
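For illustration, here is what a (hypothetical, heavily trimmed) featureset for a single training file might look like, paired with its label the way the program does before training:

# Hypothetical featureset for one file; keys are bigram/trigram tuples
example_featureset = {
    ('economic', 'growth'): True,
    ('federal', 'budget'): True,
    ('gross', 'national', 'product'): True,
}
# The program pairs each featureset with its sub-directory name (the label)
labeled_featureset = (example_featureset, 'Truman')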

As you can see, the program is quite simple, yet effective. Below are two screenshots of the program used with the Presidents' economic report speeches, showing the "training" output and the "test" output.

As you can see, the program did quite well and was able to predict the correct classification for all but one of the speeches.
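In case the screenshots are hard to read, the test portion of the output has this general shape (the file names below are just placeholders; the labels come from the training sub-directory names):

    Label for truman_speech_1.txt = Truman
    Label for clinton_speech_1.txt = Clinton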





Python Code

#-----------------------------------------------
# Import preamble
#-----------------------------------------------
import sys
import os
import re
import pickle
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
from nltk.classify import NaiveBayesClassifier
#-----------------------------------------------

#-----------------------------------------------
def print_script_usage():
    ''' Prints usage of this script.

    Script requires two parameters.
    First parameter (referred to as training_data_dir)
    is a top-level directory name that contains 
    sub-directories of documents for "training/learning".
    The name of each sub-directory is considered a
    "label" (classification) and all documents 
    under that sub-directory are classified 
    as having that label (or classification).

    The second parameter is a directory that
    contains test documents that need to be
    classified. 
    '''
    print "\n\nUsage: " + script_name + " <input_train_data_dir> <test_data_dir>\n\n"
#-----------------------------------------------

#-----------------------------------------------
def validate_dir(dir_name):
    ''' Checks that the input dir is a valid, existing directory.'''
    if not os.path.isdir(dir_name):
        print dir_name + " is not a directory."
        print_script_usage()
        sys.exit(1)
#-----------------------------------------------

#-----------------------------------------------
def get_featureset(file, single_words=False, bigrams=True, trigrams=True):
    '''Function to extract featureset from a file.

    This is the main function. It takes a file
    as input parameter and generates a featureset for
    the file. 
    The featureset consists of a set (dictionary) of
    key/value pairs where the key is a single word, 
    bigram or trigram from the text in the file and 
    the value is the boolean value "True".
    '''
    #-----------------------------------------------
    # Word tokenization of input file
    #-----------------------------------------------
    try:
        f = open(file)
        all_text = f.readlines()
        f.close()
    except IOError:
        print "Error in opening file " + file + "."
        print sys.exc_info()
        sys.exit(1)
    all_text = " ".join(all_text).lower()
    wp_tokenizer = WordPunctTokenizer()
    tokens = wp_tokenizer.tokenize(all_text)
    english_stopwords = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in english_stopwords and len(token) > 4]
    total_word_tokens = len(filtered_tokens)
    n_gram_limit = int(total_word_tokens*0.1)
    #-----------------------------------------------
    # Bigrams from word tokens
    #-----------------------------------------------
    if bigrams == True:
        bigram_finder = BigramCollocationFinder.from_words(filtered_tokens, 5)
        bigrams_chi_sq = bigram_finder.nbest(BigramAssocMeasures.chi_sq, n_gram_limit)
        bigrams_raw_freq = bigram_finder.nbest(BigramAssocMeasures.raw_freq, n_gram_limit)
        bigrams_likelihood_ratio = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, n_gram_limit)
        bigrams_poisson_stirling = bigram_finder.nbest(BigramAssocMeasures.poisson_stirling, n_gram_limit)
    else:
        bigrams_chi_sq = []
        bigrams_raw_freq = []
        bigrams_likelihood_ratio = []
        bigrams_poisson_stirling = []
    #-----------------------------------------------
    # Trigrams from word tokens
    #-----------------------------------------------
    if trigrams == True:
        trigram_finder = TrigramCollocationFinder.from_words(filtered_tokens)
        trigrams_chi_sq = trigram_finder.nbest(TrigramAssocMeasures.chi_sq, n_gram_limit)
        trigrams_raw_freq = trigram_finder.nbest(TrigramAssocMeasures.raw_freq, n_gram_limit)
        trigrams_likelihood_ratio = trigram_finder.nbest(TrigramAssocMeasures.likelihood_ratio, n_gram_limit)
        trigrams_poisson_stirling = trigram_finder.nbest(TrigramAssocMeasures.poisson_stirling, n_gram_limit)
    else:
        trigrams_chi_sq = []
        trigrams_raw_freq = []
        trigrams_likelihood_ratio = []
        trigrams_poisson_stirling = []
    #-----------------------------------------------
    # Consolidated list of words, bigrams and trigrams
    #-----------------------------------------------
    if single_words == False:
        filtered_tokens = []
    
    all_tokens = list(set(filtered_tokens + bigrams_chi_sq + bigrams_raw_freq + bigrams_likelihood_ratio + bigrams_poisson_stirling + trigrams_chi_sq + trigrams_raw_freq + trigrams_likelihood_ratio + trigrams_poisson_stirling))
    #-----------------------------------------------
    # Featureset for the input file
    # The featureset consists of a key-value pair
    # where key is the word, bigram or trigram and 
    # the value is True. 
    #-----------------------------------------------
    return dict([(token, True) for token in all_tokens if token != tuple()])
#-----------------------------------------------


#-----------------------------------------------
script_name = os.path.basename(sys.argv[0]) or 'this_script'

if len(sys.argv) != 3:
    print_script_usage()
    sys.exit(1)

training_data_dir, test_data_dir = sys.argv[1], sys.argv[2]
validate_dir(training_data_dir)
validate_dir(test_data_dir)
#-----------------------------------------------



#-----------------------------------------------
print "\n\n\n\n**********************************************************************"
print "\t\t\tT R A I N I N G"
print "**********************************************************************"
print "\n\nProcessing training data from:", training_data_dir

labels = [dir_name for dir_name in os.listdir(training_data_dir) if os.path.isdir(os.path.join(training_data_dir, dir_name))]

training_featureset = []

for label in labels:
    dir = os.path.join(training_data_dir, label)
    if not os.path.isdir(dir):
        print "\tUnexpected error in resolving directory - " + dir
        sys.exit(1)
    print "\nProcessing sub-directory: " + label 
    for file in os.listdir(dir):
        file_path = os.path.join(dir, file)
        if not os.path.isfile(file_path):
            print "\tfile " + file + "(" + file_path + ") does not seem to be a file"
            continue
        featureset = get_featureset(file_path)
        print "\tCompleted file: " + file
        #print "file = " + file + "  label = " + label + " featureset length = " + str(len(featureset))
        labeled_featureset = (featureset, label)
        training_featureset.append(labeled_featureset)

nb_classifier = NaiveBayesClassifier.train(training_featureset)

print "\n\nTraining data directories:", nb_classifier.labels()
print "Found", str(len(training_featureset)), "files in", str(len(nb_classifier.labels())), "training data directories."
print "\n\n**********************************************************************"
#-----------------------------------------------


#-----------------------------------------------
print "\n\n\n\n**********************************************************************"
print "\t\t\tT E S T I N G"
print "**********************************************************************"
print "\n\nProcessing test data from:", test_data_dir, "\n\n"
for file in os.listdir(test_data_dir):
    infile = os.path.join(test_data_dir, file)
    if not os.path.isfile(infile):
        continue
    featureset = get_featureset(infile)
    print "\tLabel for", file, "=", nb_classifier.classify(featureset)
print "\n\n**********************************************************************"
#-----------------------------------------------
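Assuming the script is saved as, say, classify_speeches.py (the file name is just an example) and the Train and Test directories are arranged as described earlier, it would be run like this:

    python classify_speeches.py Train Test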

