Wednesday, January 30, 2013

Machine Learning: Natural Language Processing Using NLTK


Introduction

Last year (2012), I explored the use of Pig, a data flow and processing language that is part of the Hadoop ecosystem. In this article, I will explore the topic of document classification in natural language processing. A few months ago (in the run-up to the 2012 U.S. Presidential Elections), I was exploring R and came across a YouTube video showing how R could ingest a set of speeches by the presidential candidates and then predict which candidate gave each speech in another set of speeches that R had not seen.

That video got me started exploring natural language processing (NLP), and here I am now, blogging an article about it!

NLP can be looked at from various perspectives and has many applications. In this article, I will focus on classification and walk through a practical example using Python and a natural language processing module called nltk (the Natural Language Toolkit). So what is classification? In the simplest sense, it involves taking a set of documents (any document or text - e.g. emails, books, speeches, etc.) and labeling, classifying or grouping them as appropriate. For example, you can take news articles and classify them into sports, political or business articles. You need a sufficient number of documents so that they can be statistically analyzed. Once a program or software system has "learnt enough" about the documents, it can be fed a set of documents that haven't been classified to see if it can determine their label or group, i.e. classify them based on the statistical analysis done on the earlier set of documents. The software can then be iteratively adjusted, tuning certain parameters to improve its accuracy.
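To make this concrete, here is a minimal, hand-made sketch (the toy words and labels below are invented purely for illustration) of how NLTK's Naive Bayes classifier is trained on labeled documents and then asked to label an unseen one. The full program later in this article does essentially the same thing, only with featuresets extracted from real speech text.

from nltk.classify import NaiveBayesClassifier

# Hypothetical, hand-made featuresets: each training item is a
# (features, label) pair, where features is a dict of word -> True.
train_set = [
    ({"touchdown": True, "quarterback": True}, "sports"),
    ({"goal": True, "penalty": True}, "sports"),
    ({"election": True, "senate": True}, "politics"),
    ({"congress": True, "vote": True}, "politics"),
]

classifier = NaiveBayesClassifier.train(train_set)

# Ask the classifier to label a "document" it has not seen before.
print classifier.classify({"quarterback": True, "goal": True})   # expected: sports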

This process is based on quite a bit of statistics and quantitative analysis, but I will not delve into the mathematical background as it has been handled quite well elsewhere. Instead, I will walk through the program (see the bottom of this article) so that you can use it as is or customize/improve it for a variety of document classification use cases.

Pre-requisites for the Python Program

Ensure that you have an installation of Python 2.5 and the nltk Python package. You can check that your setup is fine by running just the "import" statements from the program. If you do not see any errors, your setup and installation are complete. The authors of nltk have also made available a very good online book/tutorial. The Python program makes use of the Naive Bayes classifier and if you are interested in the details, you can read the Wikipedia entry and this article/chapter from a good online book on language processing and information retrieval.
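As a quick sanity check, something along these lines should run without errors (a minimal sketch; the nltk.download call is only needed if the "stopwords" corpus used by the program has not already been installed):

# Quick sanity check of the setup (Python 2 syntax, matching the program below).
import nltk
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier   # imported only to verify it loads

try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')   # one-time download of the stopwords corpus

print "Setup looks fine. nltk version:", nltk.__version__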

And to try out the program, you will need two sets of documents. The first set, called the training document set, is one where each document is formally classified. The second set is the set of documents that you will provide as a "test" to the program to see how well it performs. Inspired by the YouTube video, I chose to use US Presidents' economic reports available here. I selected only those presidents that had 5 or more reports to ensure there is a sufficient statistical sample available to process. Consequently, I took all the speeches from Harry Truman, Dwight Eisenhower, Lyndon Johnson, Ronald Reagan, Bill Clinton and George Bush Jr., set aside two speeches for the test set and put the remaining in the training set. I just "selected" the text from the browser and saved it as is - no magic or additional processing done.

Next, you will need to arrange the documents as follows. Create a top-level directory for the training data (say, directory name = Train). Next, create a sub-directory under that for each document type (e.g. the name of each president in the case of the economic reports). Then copy/move all the training documents into their respective document type directories. The document type is referred to as the label or classification in NLP parlance. As for the test documents, you can dump all of them into a single directory and let the Python program try to classify them (hopefully correctly!). See the diagram below for what I have done.
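For example, a layout along these lines would work (the directory and file names here are just illustrative placeholders):

Train/
    Truman/
        report_1.txt
        report_2.txt
    Reagan/
        report_1.txt
        report_2.txt
Test/
    unknown_speech_1.txt
    unknown_speech_2.txt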




Inside the Program

The program is small but effective. The key parts are the two loops that process the training and test directories and the "get_featureset" function. The get_featureset function takes a file as an argument, tokenizes the file content, removes English stopwords and then finds various kinds of bigrams and trigrams. The n-grams are de-duped and a set of "features" is created for each file. The featureset is essentially a dictionary where each n-gram is a key and the boolean value True is its value. Note that in this approach, all n-grams are considered of equal weight within each input training file. The featureset from each training file is labeled with the name of the directory it came from. The labeled featuresets are used to train the program. The program is then tested against a series of test files contained in the directory provided as the second parameter to the program.
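To illustrate, a featureset for a training file sitting under a sub-directory named, say, Reagan might look something like this (a hand-made sketch, not actual program output):

# Hypothetical featureset extracted from one training file: n-grams as keys,
# the boolean True as the value for every key ("this n-gram occurs in the file").
featureset = {
    ('federal', 'deficit'): True,
    ('interest', 'rates'): True,
    ('economic', 'recovery', 'program'): True,
}

# Paired with its directory name to form a labeled featureset for training:
labeled_featureset = (featureset, 'Reagan')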

As you can see, the program is quite simple, yet effective. Below are two screenshots of the program used with the Presidents' economic report speeches, showing the "training" output and the "test" output.

As you can see, the program did quite well and was able to predict the correct classification for all but one of the speeches.





Python Code

#-----------------------------------------------
# Import pre-amble
#-----------------------------------------------
import sys
import os
import re
import pickle
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
from nltk.classify import NaiveBayesClassifier
#-----------------------------------------------

#-----------------------------------------------
def print_script_usage():
    ''' Prints usage of this script.

    Script requires two parameters.
    First parameter (referred to as training_data_dir)
    is a top-level directory name that contains 
    sub-directories of documents for "training/learning".
    The name of each sub-directory is considered a
    "label" (classification) and all documents 
    under that sub-directory are classified 
    as having that label (or classification).

    The second parameter is a directory that
    contains test documents that need to be
    classified. 
    '''
    print "\n\nUsage: " + script_name + " <input_train_data_dir> <test_data_dir>\n\n"
#-----------------------------------------------

#-----------------------------------------------
def validate_dir(dir_name):
    ''' Checks if the input dir is a valid, existing directory.'''
    if not os.path.isdir(dir_name):
        print dir_name + " is not a directory."
        print_script_usage()
        sys.exit(1)
#-----------------------------------------------

#-----------------------------------------------
def get_featureset(file, single_words=False, bigrams=True, trigrams=True):
    '''Function to extract featureset from a file.

    This is the main function. It takes a file
    as input parameter and generates a featureset for
    the file. 
    The featureset consists of a set (dictionary) of
    key/value pairs where the key is a single word, 
    bigram or trigram from the text in the file and 
    the value is the boolean value "True".
    '''
    #-----------------------------------------------
    # Word tokenization of input file
    #-----------------------------------------------
    try:
        f = open(file)
        all_text = f.readlines()
        f.close()
    except IOError:
        print "Error in opening file " + file + "."
        print sys.exc_info()
        return {}   # skip files that cannot be read instead of failing below
    all_text = " ".join(all_text).lower()
    wp_tokenizer = WordPunctTokenizer()
    tokens = wp_tokenizer.tokenize(all_text)
    english_stopwords = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in english_stopwords and len(token) > 4]
    total_word_tokens = len(filtered_tokens)
    n_gram_limit = int(total_word_tokens*0.1)
    #-----------------------------------------------
    # Bigrams from word tokens
    #-----------------------------------------------
    if bigrams == True:
        bigram_finder = BigramCollocationFinder.from_words(filtered_tokens, 5)
        bigrams_chi_sq = bigram_finder.nbest(BigramAssocMeasures.chi_sq, n_gram_limit)
        bigrams_raw_freq = bigram_finder.nbest(BigramAssocMeasures.raw_freq, n_gram_limit)
        bigrams_likelihood_ratio = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, n_gram_limit)
        bigrams_poisson_stirling = bigram_finder.nbest(BigramAssocMeasures.poisson_stirling, n_gram_limit)
    else:
        bigrams_chi_sq = []
        bigrams_raw_freq = []
        bigrams_likelihood_ratio = []
        bigrams_poisson_stirling = []
    #-----------------------------------------------
    # Trigrams from word tokens
    #-----------------------------------------------
    if trigrams == True:
        trigram_finder = TrigramCollocationFinder.from_words(filtered_tokens)
        trigrams_chi_sq = trigram_finder.nbest(TrigramAssocMeasures.chi_sq, n_gram_limit)
        trigrams_raw_freq = trigram_finder.nbest(TrigramAssocMeasures.raw_freq, n_gram_limit)
        trigrams_likelihood_ratio = trigram_finder.nbest(TrigramAssocMeasures.likelihood_ratio, n_gram_limit)
        trigrams_poisson_stirling = trigram_finder.nbest(TrigramAssocMeasures.poisson_stirling, n_gram_limit)
    else:
        trigrams_chi_sq = []
        trigrams_raw_freq = []
        trigrams_likelihood_ratio = []
        trigrams_poisson_stirling = []
    #-----------------------------------------------
    # Consolidated list of words, bigrams and trigrams
    #-----------------------------------------------
    if single_words == False:
        filtered_tokens = []
    
    all_tokens = list(set(filtered_tokens + bigrams_chi_sq + bigrams_raw_freq + bigrams_likelihood_ratio + bigrams_poisson_stirling + trigrams_chi_sq + trigrams_raw_freq + trigrams_likelihood_ratio + trigrams_poisson_stirling))
    #-----------------------------------------------
    # Featureset for the input file
    # The featureset consists of a key-value pair
    # where key is the word, bigram or trigram and 
    # the value is True. 
    #-----------------------------------------------
    return dict([(token, True) for token in all_tokens if token != tuple()])
#-----------------------------------------------


#-----------------------------------------------
script_name = os.path.basename(sys.argv[0]) or 'this_script'

if len(sys.argv) != 3:
    print_script_usage()
    sys.exit(1)

training_data_dir, test_data_dir = sys.argv[1], sys.argv[2]
validate_dir(training_data_dir)
validate_dir(test_data_dir)
#-----------------------------------------------



#-----------------------------------------------
print "\n\n\n\n**********************************************************************"
print "\t\t\tT R A I N I N G"
print "**********************************************************************"
print "\n\nProcessing training data from:", training_data_dir

labels = [dir_name for dir_name in os.listdir(training_data_dir) if os.path.isdir(os.path.join(training_data_dir, dir_name))]

training_featureset = []

for label in labels:
    dir = os.path.join(training_data_dir, label)
    if not os.path.isdir(dir):
        print "\tUnexpected error in resolving directory - " + dir
        sys.exit(1)
    print "\nProcessing sub-directory: " + label 
    for file in os.listdir(dir):
        file_path = os.path.join(dir, file)
        if not os.path.isfile(file_path):
            print "\tfile " + file + "(" + file_path + ") does not seem to be a file"
            continue
        featureset = get_featureset(file_path)
        print "\tCompleted file: " + file
        #print "file = " + file + "  label = " + label + " featureset length = " + str(len(featureset))
        labeled_featureset = (featureset, label)
        training_featureset.append(labeled_featureset)

nb_classifier = NaiveBayesClassifier.train(training_featureset)

print "\n\nTraining data directories:", nb_classifier.labels()
print "Found", str(len(training_featureset)), "files in", str(len(nb_classifier.labels())), "training data directories."
print "\n\n**********************************************************************"
#-----------------------------------------------


#-----------------------------------------------
print "\n\n\n\n**********************************************************************"
print "\t\t\tT E S T I N G"
print "**********************************************************************"
print "\n\nProcessing test data from:", test_data_dir, "\n\n"
for file in os.listdir(test_data_dir):
    infile = os.path.join(test_data_dir, file)
    featureset = get_featureset(infile)
    print "\tLabel for", file, "=",  nb_classifier.classify(featureset)
print "\n\n**********************************************************************"
#-----------------------------------------------
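Assuming the directory layout described earlier and the script saved as, say, classify_docs.py (the file name is arbitrary), the program is invoked with the training directory as the first argument and the test directory as the second:

python classify_docs.py Train Test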


Wednesday, January 16, 2013

A Detour: Data Mining, Machine Learning and Artificial Intelligence


Data mining, machine learning and artificial intelligence - these seem to be some of the buzzwords in the "big data" space coming from people, companies, institutions, research analyst groups, media and others. However, I feel these terms are misnomers and would like to take this opportunity to put forth my thoughts on them.

Let's take the term "mining" as an example. Traditionally, mining is associated with the process or activity of extracting minerals from the earth. The mineral is the object of interest that is mixed with dirt, mud, sand, clay and other throw-away material excavated from mines. So, for example, when one refers to iron-ore mining, gold mining, etc., it indicates the extraction of the "object of interest" (e.g. gold) and not the "throw-away object" (e.g. mud). Hence gold mining is the term for extracting gold out of the mixture of gold and mud. In the same sense, when we mine data for information and insight, it should be referred to as information mining or data crawling or something more appropriate, but not data mining.

Now let's look at machine learning and artificial intelligence (AI). AI was in vogue in the 80s when there was FUD (Fear, Uncertainty and Doubt) being spread in the U.S. about AI and robotics advances in Japan. See Appendix B of this book. There was so much hype about AI and robotics that well-respected people and media were forecasting everyday use/appearances of robotics at home and in the workplace. I don't see that happening even when the 32-bit epoch timestamp doomsday dawns upon us! From my perspective, we teach children to walk, talk, read and write by using a variety of things like repeated actions, coercion, reward, enticement, "positive reinforcement", etc. to impart that learning. Children in turn apply that learning in a variety of ways. For example, we just teach children to read and write, but we do not read every book for them - they use their learning and apply it to read a book (or any book). The same goes for, say, walking. Once a child learns to walk, he/she not only goes for treks, but also adapts to walking in space and/or under water! Once the basic skill is learnt, a child uses his or her own "feedback" mechanism to adjust, adapt, improve and enhance that learning. However, when we use statistical/quantitative and other techniques for supervised and/or unsupervised "machine learning", we are just using computers and software to do some mundane, repetitive task at high speed. The computers/software themselves don't learn anything (at least that's what I believe!). It is the human developer that is "learning" and acting as the "intelligence", "controller" and the "feedback loop" to adjust the computer programs/software. Of course, developers often "automate" the feedback loop by additional programming of rules and "fuzzy logic".

Finally, referring to software or computer systems as "Artificial Intelligence" takes away credit from the people who designed, developed and created them. The software system is the embodiment of the collective knowledge, intelligence and smarts of the people who created it and of others (e.g. the developers/designers in turn may have used software modules developed by others).

All the same, I do not intend to "swim against the tide" and will (unwillingly) adopt these misnomers and use them - this was just my digression to voice a disagreement with those terms.

Thursday, January 3, 2013

Big Data:2013


2013 is forecast by some to be the year of Big Data, so I thought it would be nice to start off the year by defining some of the interesting categories of challenges and problems to solve in the big data space. Note that this is just my perspective; yours might differ :-)

First, there is the set of problems to do with data itself - data processing, data storage, retrieval and caching, and data analysis.

Data Processing: This is the set of challenges to do with processing large volumes of loosely structured data with a fluid schema, where the processing needs are evolving and changing all the time. Typical examples include processing large volumes of continuously streaming Twitter data, web logs and web crawler logs. The challenge here is that the volume, the pace of development and the clarity of specifications for data processing are not what they are in, say, traditional ETL. Furthermore, ETL tools like Informatica, Ab Initio, etc. are either not applicable, not scalable, not affordable or not flexible enough to address these needs.

Data Storage, Retrieval and Caching: Similar to traditional ETL tools, traditional data storage and retrieval technologies like RDBMSes are not suitable for big data storage and retrieval because of the scalability, performance and throughput requirements and the loose data structure and schema. And even if the data volume were within the realms of traditional RDBMSes, the software cost and potentially the hardware cost of a massive single machine (or set of large machines) would be prohibitive for many companies. With traditional RDBMSes and traditional data volumes, it is often possible to cache a significant portion of the data in the RDBMS buffer cache or filesystem buffers and thereby significantly improve performance and throughput. With big data, this is not practical, at least not with a single-machine setup. So often, when there is an online, interactive application sitting on top of big data, there is a need to add a caching layer to improve read or write response times.

Data Analysis: In the traditional data warehousing/BI and OLTP application space, reporting and analysis needs are often well defined. In fact, data warehouses and/or data marts are often built to store data for and help with specific business analysis and reporting needs. And OLTP applications have their own set of well-defined and structured reporting needs. However, in the case of big data, the objective is to store as much data as possible and then later examine it from a variety of angles to see if any kind of information, insight or perspective can be extracted out of it - either on its own or by combining it with other data sets and/or knowledge. For example, given web server logs, there is only so much analysis that can be done - e.g., web server response time analysis, geographical and temporal analysis of web requests, popular pages and frequent visitor analysis. However, combining that data with additional information about content (e.g. content/page categories and subjects), user/visitor profiles and information on relevant external events (e.g. information on business, finance and news events along with their timings) yields better insights into the web server logs. And the knowledge gained can be looped back to further enhance or supplement the data sources. For example, one can now enhance a user/visitor's profile with the kind of information/content preferred by that user.

The second set of problems associated with big data has to do with systems and infrastructure architecture - designing and implementing a scalable, flexible, reliable and fault-tolerant data storage system, data backups (if possible), data replication (if required), monitoring and operationalizing the data storage system, and high availability. Rather than try to fulfill these needs separately, such storage systems are built from the ground up with many of those features. However, one still needs to ensure that as sub-systems are bolted together, their features complement or enhance the overall system and single points of failure (SPOFs) are avoided (or minimized). I would like to point the reader to a very interesting article that I read a while ago on how a company designed high availability and reliability into their system. Big data relies on distributed systems, and for the more adventurous and brave of heart, I would like to point you to an excellent presentation by a genius - he is incidentally my cousin and lived a very short but very memorable life of 26 years.

And finally, the third set of problems has to do with permissions, data integrity, security, privacy and encryption (as needed). These problems or needs often arise when there is personally identifiable, financial or legally sensitive data to be handled/managed. Authentication and authorization, along with data integrity, are important in the big data space because users of the data (analysts, data scientists, developers, admins, etc.) work directly with the data, unlike in a data warehouse or an OLTP application where the application layer takes care of many of those functions.

Rather than solve many of these problems over and over again, people are building stacks and ecosystems that inherently address many of the above problems - e.g. the Hadoop ecosystem consisting of Hadoop, HBase, Pig, Hive, Sqoop, Mahout and other subsystems for data storage, processing and analysis. These are then complemented with systems like Lucene and Solr for effective indexing and searching, Memcached for caching, etc.

Due to the nature of my professional work, I have a particular interest in data processing and analysis, and I have already blogged a couple of articles on Pig. Lately, I have been interested in data analysis and will post some articles in that space over the next few months.