Saturday, March 30, 2013

A Little Teaser on NLP, Machine Learning, Data Mining, Data Science and Artificial Intelligence


While I touched on document classification in the previous articles, NLP on its own can be, and is, applied to solve a variety of problems. Here are some examples.

Email Processing & Categorization
    - Spam vs. non-spam
    - Prioritization
    - Email distillation, summaries and even semantic linking/threading
      (e.g. summary of important unread emails by applying named entity recognition and other techniques)

Social Media Analysis 
    - Facebook messages that evoke most likes
    - Twitter hashtag re-tweet propensity
    - Gender identification
    - Socioeconomic analysis
       (e.g. work done at a data dive for the World Bank)


Document Clustering and Classification 
(Supervised as well as Unsupervised learning)
    - Sorting/grouping documents 
    - Identifying similar/related documents
      (presentation by .....)
    - News feed analysis
    - Financial report analysis (e.g. 10K document analysis)
    - Data analysis, information distillation and presentation
      (e.g. see Narrative Science)

Language Analysis and Processing
    - Automated grading of essays and answers
    - Checking for plagiarism
    - Post-processing of speech-to-text output
      (e.g. in medical transcription and dictation systems)

Having whetted your creative side, let me venture into machine learning (ML), which can be considered a superset of NLP. From my perspective, NLP, ML, data science, data mining (DM) and data analysis (DA) are all related fields. They all require data wrangling/processing, statistical data exploration and visualization, and I will refer to these collective fields of study as Data Science (DS). Unlike standard software development, where each software module or component has well-defined input, output and processing requirements, DS is highly iterative and the "objectives" are refined over several iterations. It is a mixture of "science", "math", "art" and "black magic"!

In a nutshell, machine learning is a journey of discovery and learning, with a very fluid process built around the following skeleton steps -
  • Define/refine problem or quest
  • Determine what data are needed (it's ok not to have all the data upfront)
  • Inspect and visualize available data (using appropriate sampling if necessary)
  • Enumerate and analyze applicable approaches, algorithms and techniques
  • Determine if any additional features can be obtained or refined from existing or new data

I am still exploring this "amazing and magical science" and, as I "peel the layers", I am amazed by its capabilities and potential. It seems that data science with an automated feedback loop and some more bells and whistles becomes AI (Artificial Intelligence). The Google Driverless Car and a recent quadrocopter ball-juggling feat are examples of that. If the links seem to have put you to sleep, these videos will wake you up.

Wednesday, February 13, 2013

Machine Learning: Natural Language Processing Using NLTK - Part 2

So in the previous post, I explored the use of the nltk python module and the Naive Bayes classifier to classify or predict documents. The results of the classification were quite good, with 11 out of 12 documents being correctly classified. The incorrectly classified document was from George Bush Jr., and it was classified as being from Harry Truman. Naive Bayes arrives at its result using conditional probability - the probabilities of the different n-grams occurring in each President's significant n-gram set.
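For intuition, Naive Bayes scores each candidate label by combining the label's prior probability with the conditional probabilities of the observed features given that label, and picks the highest-scoring label. Here is a rough sketch of that idea (illustrative only - the prior and cond_prob inputs are hypothetical placeholders, and NLTK's actual implementation handles smoothing and unseen features differently):

import math

def naive_bayes_score(features, label, prior, cond_prob):
    # Illustrative scoring: log P(label) + sum of log P(feature | label)
    # prior maps label -> probability; cond_prob maps label -> {feature: probability}
    score = math.log(prior[label])
    for feature in features:
        # Unseen features get a tiny probability here; real implementations smooth properly
        score += math.log(cond_prob[label].get(feature, 1e-6))
    return score

The label with the highest score becomes the predicted classification.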

Based on casual observations, I had noticed that the documents from Harry Truman were exceptionally long, and the feature set of n-grams was consequently quite large, with significant overlap. George Bush Jr's documents, on the other hand, were not so long and had a variety of n-grams across the documents. So I thought of taking a different approach - using the Jaccard similarity of the n-grams (see Mining of Massive Datasets).
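Jaccard similarity is simply the size of the intersection of two sets divided by the size of their union. A minimal sketch (the n-gram sets below are made up purely for illustration):

def jaccard_similarity(set_a, set_b):
    # |A intersect B| / |A union B|; define it as 0.0 if both sets are empty
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / float(len(set_a | set_b))

truman_ngrams = set([("economic", "report"), ("full", "employment"), ("price", "controls")])
test_ngrams = set([("economic", "report"), ("tax", "relief")])
print jaccard_similarity(truman_ngrams, test_ngrams)   # 1 shared n-gram out of 4 distinct -> 0.25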

And apparently that turned out to be a better approach - it got us 12 out of 12 correct classifications of the documents. The program is similar to the previous one, with a very small change.

Python Code

#-----------------------------------------------
# Import pre-amble
#-----------------------------------------------
import sys
import os
import re
import pickle
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
#from nltk.classify import NaiveBayesClassifier
#-----------------------------------------------

#-----------------------------------------------
def print_script_usage():
    ''' Prints usage of this script.

    Script requires two parameters.
    First parameter (referred to as training_data_dir)
    is a top-level directory name that contains 
    sub-directories of documents for "training/learning".
    The name of each sub-directory is considered a
    "label" (classification) and all documents 
    under that sub-directory are classified 
    as having that label (or classification).

    The second parameter is a directory that
    contains test documents that need to be
    classified. 
    '''
    print "\n\nUsage: " + script_name + " <input_train_data_dir> <test_data_dir>\n\n"
#-----------------------------------------------

#-----------------------------------------------
def validate_dir(dir_name):
    ''' Checks if input dir is a valid and existing directory.'''
    if not os.path.isdir(dir_name):
        print dir_name + " is not a directory."
        print_script_usage()
        sys.exit(1)
#-----------------------------------------------

#-----------------------------------------------
def get_featureset(file, single_words=False, bigrams=True, trigrams=True):
    '''Function to extract featureset from a file.

    This is the main function. It takes a file
    as input parameter and generates a featureset for
    the file. 
    The featureset consists of a set (dictionary) of
    key/value pairs where the key is a single word, 
    bigram or trigram from the text in the file and 
    the value is the boolean value "True".
    '''
    #-----------------------------------------------
    # Word tokenization of input file
    #-----------------------------------------------
    try:
        f = open(file)
        all_text = f.readlines()
        f.close()
    except:
        print "Error in opening file " + file + "."
        print sys.exc_info()
        sys.exit(1)
    all_text = " ".join(all_text).lower()
    wp_tokenizer = WordPunctTokenizer()
    tokens = wp_tokenizer.tokenize(all_text)
    english_stopwords = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in english_stopwords and len(token) > 4]
    total_word_tokens = len(filtered_tokens)
    n_gram_limit = int(total_word_tokens*0.1)
    #-----------------------------------------------
    # Bigrams from word tokens
    #-----------------------------------------------
    if bigrams == True:
        bigram_finder = BigramCollocationFinder.from_words(filtered_tokens, 5)
        bigrams_chi_sq = bigram_finder.nbest(BigramAssocMeasures.chi_sq, n_gram_limit)
        bigrams_raw_freq = bigram_finder.nbest(BigramAssocMeasures.raw_freq, n_gram_limit)
        bigrams_likelihood_ratio = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, n_gram_limit)
        bigrams_poisson_stirling = bigram_finder.nbest(BigramAssocMeasures.poisson_stirling, n_gram_limit)
    else:
        bigrams_chi_sq = []
        bigrams_raw_freq = []
        bigrams_likelihood_ratio = []
        bigrams_poisson_stirling = []
    #-----------------------------------------------
    # Trigrams from word tokens
    #-----------------------------------------------
    if trigrams == True:
        trigram_finder = TrigramCollocationFinder.from_words(filtered_tokens)
        trigrams_chi_sq = trigram_finder.nbest(TrigramAssocMeasures.chi_sq, n_gram_limit)
        trigrams_raw_freq = trigram_finder.nbest(TrigramAssocMeasures.raw_freq, n_gram_limit)
        trigrams_likelihood_ratio = trigram_finder.nbest(TrigramAssocMeasures.likelihood_ratio, n_gram_limit)
        trigrams_poisson_stirling = trigram_finder.nbest(TrigramAssocMeasures.poisson_stirling, n_gram_limit)
    else:
        trigrams_chi_sq = []
        trigrams_raw_freq = []
        trigrams_likelihood_ratio = []
        trigrams_poisson_stirling = []
    #-----------------------------------------------
    # Consolidated list of words, bigrams and trigrams
    #-----------------------------------------------
    if single_words == False:
        filtered_tokens = []
    
    all_tokens = list(set(filtered_tokens + bigrams_chi_sq + bigrams_raw_freq + bigrams_likelihood_ratio + bigrams_poisson_stirling + trigrams_chi_sq + trigrams_raw_freq + trigrams_likelihood_ratio + trigrams_poisson_stirling))
    #-----------------------------------------------
    # Featureset for the input file
    # The featureset consists of a key-value pair
    # where key is the word, bigram or trigram and 
    # the value is True. 
    #-----------------------------------------------
    return dict([(token, True) for token in all_tokens if token != tuple()])
#-----------------------------------------------


#-----------------------------------------------
script_name = os.path.basename(sys.argv[0]) or 'this_script'

if len(sys.argv) != 3:
    print_script_usage()
    sys.exit(1)

training_data_dir, test_data_dir = sys.argv[1], sys.argv[2]
validate_dir(training_data_dir)
validate_dir(test_data_dir)
#-----------------------------------------------


#-----------------------------------------------
print "\n\n\n\n**********************************************************************"
print "\t\t\tT R A I N I N G"
print "**********************************************************************"
print "\n\nProcessing training data from:", training_data_dir

labels = [dir_name for dir_name in os.listdir(training_data_dir) if os.path.isdir(os.path.join(training_data_dir, dir_name))]

training_featureset = []

for label in labels:
    dir = os.path.join(training_data_dir, label)
    if not os.path.isdir(dir):
        print "\tUnexpected error in resolving directory - " + dir
        sys.exit(1)
    print "\nProcessing sub-directory: " + label 
    for file in os.listdir(dir):
        file_path = os.path.join(dir, file)
        if not os.path.isfile(file_path):
            print "\tfile " + file + "(" + file_path + ") does not seem to be a file"
            continue
        featureset = get_featureset(file_path)
        print "\tCompleted file: " + file
        #print "file = " + file + "  label = " + label + " featureset length = " + str(len(featureset))
        labeled_featureset = tuple([featureset, label])
        training_featureset.append(labeled_featureset)

#-----------------------
# This is where we deviate from the NLTK approach. Instead of using the Naive Bayes algorithm,
# we combine the n-grams for the same label across different documents to get a single n-gram set per label.
# This becomes the "training_data" set. We then apply the Jaccard similarity test to each test document to predict
# the label for the test document.
#-----------------------
training_data = {}
for featureset, label in training_featureset:
    label_set = set(feature for feature in featureset)
    #-----------------------
    # Add each training document's tokens to the training data dictionary.
    # Note that we need to "append" the new training document's tokens to the label's existing tokenset
    #-----------------------
    if label in training_data:
        training_data[label] = training_data[label].union(label_set)
    else:
        training_data[label] = label_set

#-----------------------------------------------

#-----------------------------------------------
print "\n\n\n\n**********************************************************************"
print "\t\t\tT E S T I N G"
print "**********************************************************************"
print "\nProcessing test data from:", test_data_dir
for file in os.listdir(test_data_dir):
    infile = os.path.join(test_data_dir, file)
    featureset = get_featureset(infile)
    label_set = set(feature for feature in featureset)
    print "\n\nProcessing test file", file
    max_value = 0.0
    max_value_label = ''
    file_score = {}
    #-----------------------
    # For each test file, we examine the overlap of the n-grams between the test file/document
    # and each of the labeled datasets (training data) using Jaccard similarity.
    # This gives us a file_score dictionary where the key is the label and the value is the
    # Jaccard similarity. We then select the key (or label) with the max value as the
    # label prediction for the test file.
    # Note that without the "1.0" multiplier below, file_score[key] would be computed with
    # integer division (all the other values are integers) and would come out as zero.
    #-----------------------
    for key in training_data:
        file_score[key] = len(training_data[key].intersection(label_set))*1.0/(len(training_data[key].union(label_set)))
        if file_score[key] > max_value:
            max_value_label = key
            max_value = file_score[key]  
    print "\tLabel for", file, "is", max_value_label, "with score of", max_value

print "\n\n**********************************************************************"
#-----------------------------------------------






Wednesday, January 30, 2013

Machine Learning: Natural Language Processing Using NLTK


Introduction

Last year (2012), I explored the use of Pig, a data flow and processing language that is part of the Hadoop eco-system. In this article, I will explore the topic of document classification in natural language processing. A few months ago (in the run-up to the 2012 U.S. Presidential Elections), I was exploring R and came across a YouTube video showing how R could be used to ingest a set of speeches by the Presidential candidates and then predict the authorship of another set of speeches it had not seen.

That video got me started on exploring natural language processing - aka NLP - and here I am now, blogging an article about it!

NLP can be looked at from various perspectives and has many applications. In this article, I will focus on classification and walk through a practical example using Python and a natural language processing module called nltk (the Natural Language Toolkit). So what is classification? In the simplest sense, it involves taking a set of documents (any document or text - e.g. emails, books, speeches, etc.) and labeling, classifying or grouping them as appropriate. For example, you can take news articles and classify them into sports, political or business articles. You need a sufficient number of documents so that they can be statistically analyzed. Once a program has "learnt enough" about the documents, it can be fed a set of unclassified documents to see whether it can determine their label or group - i.e. classify them based on the statistical analysis done on the earlier set of documents. The software can then be iteratively tuned to improve its accuracy.
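As a toy illustration of the idea using nltk's Naive Bayes classifier (the features and labels below are made up, and real feature extraction is more involved - see the program at the end of this article):

from nltk.classify import NaiveBayesClassifier

# Tiny hand-made training set: each item is a (featureset, label) pair,
# where a featureset maps feature names to values (here, just True)
train = [
    ({"touchdown": True, "quarterback": True}, "sports"),
    ({"election": True, "senate": True}, "politics"),
    ({"earnings": True, "merger": True}, "business"),
]
classifier = NaiveBayesClassifier.train(train)

# Classify the features extracted from an unseen document
print classifier.classify({"quarterback": True, "playoffs": True})   # expected: sports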

This process is based on quite a bit of statistics and quantitative analysis, but I will not delve into the mathematical background as it has been covered quite well elsewhere. Instead, I will walk through the program (see the bottom of this article) so that you can use it as is or customize/improve it for a variety of document classification use cases.

Pre-requisites for the Python Program

Ensure that you have an installation of Python 2.5 and the nltk python package. You can check that your setup is fine by running just the "import" statements from the program. If you do not see any errors, your setup and installation is complete. The authors of nltk have also made available a very good online book/tutorial. The python program makes use of the Naive Bayes classifier; if you are interested in the details, you can read Wikipedia and this article/chapter from a good online book on language processing and information retrieval.
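As a quick sanity check of the setup (in addition to running the program's import statements), something along these lines should work - note that the stopwords and words corpora used by the program may need a one-time download:

import nltk
from nltk.corpus import stopwords

print nltk.__version__

# One-time download of the corpora used by the program below
nltk.download('stopwords')
nltk.download('words')

print stopwords.words('english')[:5]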

And to try out the program, you will need two sets of documents. The first set, called the training document set, is one where each document is already classified. The second set is the set of documents that you provide as a "test" to the program to see how well it performs. Inspired by the YouTube video, I chose to use US Presidents' economic reports available here. I selected only those presidents that had 5 or more reports, to ensure there is a sufficient statistical sample to process. Consequently I took all the speeches from Harry Truman, Dwight Eisenhower, Lyndon Johnson, Ronald Reagan, Bill Clinton and George Bush Jr, set aside two speeches for the test set and put the rest in the training set. I just "selected" the text from the browser and saved it as is - no magic or additional processing done.

Next, you will need to arrange the documents as follows. Create a top-level directory for the training data (say, directory name = Train). Next, create a sub-directory under that for each document type (e.g. the name of each president in the case of the economic reports) and copy/move all the training documents into their respective document type directories. The document type is referred to as the label or classification in NLP parlance. As for the test documents, you can dump all of them into a single directory and let the python program try to classify them (hopefully correctly!). See the layout below for what I have done.
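In text form, the layout looks something like this (the directory and file names are made up for illustration; the sub-directory names become the labels):

Train/
    Harry-Truman/
        economic-report-1947.txt
        economic-report-1948.txt
    Bill-Clinton/
        economic-report-1994.txt
        economic-report-1995.txt
Test/
    unknown-speech-1.txt
    unknown-speech-2.txt

The program is then invoked with the two directories as its arguments, e.g. python classify_speeches.py Train Test (the script name here is just a placeholder).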




Inside the Program

The program is small but effective. The key parts are the two loops that process the training and test directories and the "get_featureset" function. The get_featureset function takes a file as an argument, tokenizes the file content, removes English stopwords and then finds various kinds of bigrams and trigrams. The n-grams are de-duped and a set of "features" is created for each file. A featureset is essentially a dictionary where each n-gram is a key and the boolean value True is its value. Note that in this approach, all n-grams within an input training file are given equal weight. The featureset from each training file is labeled with its directory name, and the labeled featuresets are used to train the program. The program is then tested against the files contained in the directory provided as the second parameter.
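For example, a (shortened, made-up) featureset and its labeled form would look like this:

# A featureset: n-gram keys (tuples of words) mapped to True
featureset = {
    ("economic", "growth"): True,
    ("federal", "reserve"): True,
    ("balanced", "budget", "amendment"): True,
}

# Pairing it with the name of its training sub-directory gives the labeled
# featureset that gets appended to the training_featureset list
labeled_featureset = (featureset, "Bill-Clinton")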

As you can see, the program is quite simple, yet effective. Below are two screenshots of the program used with the Presidents' economic report speeches - one showing the "training" output and the other the "test" output.

As you can see, the program did quite well and was able to predict the correct classification for all but one of the speeches.





Python Code

#-----------------------------------------------
# Import pre-amble
#-----------------------------------------------
import sys
import os
import re
import pickle
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures
from nltk.classify import NaiveBayesClassifier
#-----------------------------------------------

#-----------------------------------------------
def print_script_usage():
    ''' Prints usage of this script.

    Script requires two parameters.
    First parameter (referred to as training_data_dir)
    is a top-level directory name that contains 
    sub-directories of documents for "training/learning".
    The name of each sub-directory is considered a
    "label" (classification) and all documents 
    under that sub-directory are classified 
    as having that label (or classification).

    The second parameter is a directory that
    contains test documents that need to be
    classified. 
    '''
    print "\n\nUsage: " + script_name + " <input_train_data_dir> <test_data_dir>\n\n"
#-----------------------------------------------

#-----------------------------------------------
def validate_dir(dir_name):
    ''' Checks if input dir is a valid and existing directory.'''
    if not os.path.isdir(dir_name):
        print dir_name + " is not a directory."
        print_script_usage()
        sys.exit(1)
#-----------------------------------------------

#-----------------------------------------------
def get_featureset(file, single_words=False, bigrams=True, trigrams=True):
    '''Function to extract featureset from a file.

    This is the main function. It takes a file
    as input parameter and generates a featureset for
    the file. 
    The featureset consists of a set (dictionary) of
    key/value pairs where the key is a single word, 
    bigram or trigram from the text in the file and 
    the value is the boolean value "True".
    '''
    #-----------------------------------------------
    # Word tokenization of input file
    #-----------------------------------------------
    try:
        f = open(file)
        all_text = f.readlines()
        f.close()
    except:
        print "Error in opening file " + file + "."
        print sys.exc_info()
        sys.exit(1)
    all_text = " ".join(all_text).lower()
    wp_tokenizer = WordPunctTokenizer()
    tokens = wp_tokenizer.tokenize(all_text)
    english_stopwords = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in english_stopwords and len(token) > 4]
    total_word_tokens = len(filtered_tokens)
    n_gram_limit = int(total_word_tokens*0.1)
    #-----------------------------------------------
    # Bigrams from word tokens
    #-----------------------------------------------
    if bigrams == True:
        bigram_finder = BigramCollocationFinder.from_words(filtered_tokens, 5)
        bigrams_chi_sq = bigram_finder.nbest(BigramAssocMeasures.chi_sq, n_gram_limit)
        bigrams_raw_freq = bigram_finder.nbest(BigramAssocMeasures.raw_freq, n_gram_limit)
        bigrams_likelihood_ratio = bigram_finder.nbest(BigramAssocMeasures.likelihood_ratio, n_gram_limit)
        bigrams_poisson_stirling = bigram_finder.nbest(BigramAssocMeasures.poisson_stirling, n_gram_limit)
    else:
        bigrams_chi_sq = []
        bigrams_raw_freq = []
        bigrams_likelihood_ratio = []
        bigrams_poisson_stirling = []
    #-----------------------------------------------
    # Trigrams from word tokens
    #-----------------------------------------------
    if trigrams == True:
        trigram_finder = TrigramCollocationFinder.from_words(filtered_tokens)
        trigrams_chi_sq = trigram_finder.nbest(TrigramAssocMeasures.chi_sq, n_gram_limit)
        trigrams_raw_freq = trigram_finder.nbest(TrigramAssocMeasures.raw_freq, n_gram_limit)
        trigrams_likelihood_ratio = trigram_finder.nbest(TrigramAssocMeasures.likelihood_ratio, n_gram_limit)
        trigrams_poisson_stirling = trigram_finder.nbest(TrigramAssocMeasures.poisson_stirling, n_gram_limit)
    else:
        trigrams_chi_sq = []
        trigrams_raw_freq = []
        trigrams_likelihood_ratio = []
        trigrams_poisson_stirling = []
    #-----------------------------------------------
    # Consolidated list of words, bigrams and trigrams
    #-----------------------------------------------
    if single_words == False:
        filtered_tokens = []
    
    all_tokens = list(set(filtered_tokens + bigrams_chi_sq + bigrams_raw_freq + bigrams_likelihood_ratio + bigrams_poisson_stirling + trigrams_chi_sq + trigrams_raw_freq + trigrams_likelihood_ratio + trigrams_poisson_stirling))
    #-----------------------------------------------
    # Featureset for the input file
    # The featureset consists of a key-value pair
    # where key is the word, bigram or trigram and 
    # the value is True. 
    #-----------------------------------------------
    return dict([(token, True) for token in all_tokens if token != tuple()])
#-----------------------------------------------


#-----------------------------------------------
script_name = os.path.basename(sys.argv[0]) or 'this_script'

if len(sys.argv) != 3:
    print_script_usage()
    sys.exit(1)

training_data_dir, test_data_dir = sys.argv[1], sys.argv[2]
validate_dir(training_data_dir)
validate_dir(test_data_dir)
#-----------------------------------------------



#-----------------------------------------------
print "\n\n\n\n**********************************************************************"
print "\t\t\tT R A I N I N G"
print "**********************************************************************"
print "\n\nProcessing training data from:", training_data_dir

labels = [dir_name for dir_name in os.listdir(training_data_dir) if os.path.isdir(os.path.join(training_data_dir, dir_name))]

training_featureset = []

for label in labels:
    dir = os.path.join(training_data_dir, label)
    if not os.path.isdir(dir):
        print "\tUnexpected error in resolving directory - " + dir
        sys.exit(1)
    print "\nProcessing sub-directory: " + label 
    for file in os.listdir(dir):
        file_path = os.path.join(dir, file)
        if not os.path.isfile(file_path):
            print "\tfile " + file + "(" + file_path + ") does not seem to be a file"
            continue
        featureset = get_featureset(file_path)
        print "\tCompleted file: " + file
        #print "file = " + file + "  label = " + label + " featureset length = " + str(len(featureset))
        labeled_featureset = tuple([featureset, label])
        training_featureset.append(labeled_featureset)

nb_classifier = NaiveBayesClassifier.train(training_featureset)

print "\n\nTraining data directories:", nb_classifier.labels()
print "Found", str(len(training_featureset)), "files in", str(len(nb_classifier.labels())), "training data directories."
print "\n\n**********************************************************************"
#-----------------------------------------------


#-----------------------------------------------
print "\n\n\n\n**********************************************************************"
print "\t\t\tT E S T I N G"
print "**********************************************************************"
print "\n\nProcessing test data from:", test_data_dir, "\n\n"
for file in os.listdir(test_data_dir):
    infile = os.path.join(test_data_dir, file)
    featureset = get_featureset(infile)
    print "\tLabel for", file, "=",  nb_classifier.classify(featureset)
print "\n\n**********************************************************************"
#-----------------------------------------------


Wednesday, January 16, 2013

A Detour: Data Mining, Machine Learning and Artificial Intelligence


Data mining, machine learning and artificial intelligence - these seem to be some of the "big data space" buzzwords from people, companies, institutions, research/analyst groups, the media and others. However, I feel these terms are misnomers and would like to take this opportunity to put forth my thoughts on them.

Let's take the term "mining" as an example. Traditionally, mining is associated with the process or activity of extracting minerals from the earth. The mineral is the object of interest that is mixed with the dirt, mud, sand, clay and other throw-away material excavated from mines. So, for example, when one refers to iron-ore mining, gold mining, etc., it indicates the extraction of the "object of interest" (e.g. gold) and not the "throw-away object" (e.g. mud). Hence gold mining is the term for extracting gold out of the mixture of gold and mud. In the same sense, when we mine data for information and insight, it should be referred to as information mining or data crawling or something more appropriate, but not data mining.

Now let's look at machine learning and artificial intelligence (AI). AI was in vogue in the 80s, when FUD (Fear, Uncertainty and Doubt) was being spread in the U.S. about AI and robotics advances in Japan. See Appendix B of this book. There was so much hype about AI and robotics that well-respected people and media were forecasting everyday use/appearances of robots at home and in the workplace. I don't see that happening even when the 32-bit epoch timestamp doomsday dawns upon us! From my perspective, we teach children to walk, talk, read and write using a variety of things like repeated actions, coercion, reward, enticement, "positive reinforcement" etc. to impart that learning. Children in turn apply that learning in a variety of ways. For example, we just teach children to read and write, but we do not read every book for them - they use their learning and apply it to read a book (or any book). Same goes for, say, walking. Once a child learns to walk, he/she not only goes for treks, but also adapts to walking in space and/or under-water! Once the basic skill is learnt, a child uses his or her own "feedback" mechanism to adjust, adapt, improve and enhance that learning. However, when we use statistical/quantitative and other techniques for supervised and/or unsupervised "machine learning", we are just using computers and software to do some mundane, repetitive task at high speed. The computers/software themselves don't learn anything (at least that's what I believe!). It is the human developer who is "learning" and acting as the "intelligence", "controller" and "feedback loop" to adjust the computer programs/software. Of course, developers often "automate" the feedback loop with additional programming of rules and "fuzzy logic".

Finally, referring to software or computer systems as "Artificial Intelligence" takes credit away from the people who designed, developed and created them. The software system is the embodiment of the collective knowledge, intelligence and smarts of the people who created it and of others (e.g. the developers/designers may in turn have used software modules developed by others).

All the same, I do not intend to "swim against the tide" and will (unwillingly) adopt these misnomers and use them - this was just my digression to voice a disagreement with those terms.

Thursday, January 3, 2013

Big Data:2013


2013 is forecast by some to be the year of Big Data, so I thought it would be nice to start off the year by defining some of the interesting categories of challenges and problems to solve in the big data space. Note that this is just my perspective; yours might differ :-)

First, there is the set of problems to do with data itself - data processing, data storage, retrieval and caching, and data analysis.

Data Processing: This is the set of challenges to do with processing large volumes of loosely structured data with a fluid schema, where the processing needs keep evolving and changing. Typical examples include processing large volumes of continuously streaming/flowing Twitter data, web logs and web crawler logs. The challenge here is that the volume, the pace of development and the clarity of specifications for data processing are not what you get in, say, traditional ETL. Furthermore, ETL tools like Informatica, Ab Initio etc. are either not applicable, not scalable, not affordable or not flexible enough to address these needs.

Data Storage, Retrieval and Caching: Similar to traditional ETL tools, traditional data storage and retrieval technologies like RDBMSes are not suitable for big data storage and retrieval because of scalability, performance and throughput limitations and the loose data structure and schema. And even if the data volume were within the realm of traditional RDBMSes, the software cost, and potentially the hardware cost of a massive single machine (or set of large machines), would be prohibitive for many companies. With traditional RDBMSes and traditional data volumes, it is often possible to cache a significant portion of the data in the RDBMS buffer cache or filesystem buffers and thereby significantly improve performance and throughput. With big data, this is not practical, at least not with a single-machine setup. So often, when an online, interactive application sits on top of big data, a caching layer needs to be added to improve read or write response times.

Data Analysis: In the traditional data warehousing/BI and OLTP application space, reporting and analysis needs are often well defined. In fact, data warehouses and/or data marts are often built to store and serve specific business analysis and reporting needs, and OLTP applications have their own set of well-defined and structured reporting needs. However, in the case of big data, the objective is to store all/as much data as possible and then later examine it from a variety of angles to see if any kind of information, insight or perspective can be extracted out of it - either on its own or by combining it with other data sets and/or knowledge. For example, given web server logs, there is only so much analysis that can be done - e.g. web server response time analysis, geographical and temporal analysis of web requests, popular pages and frequent visitor analysis. However, combining that data with additional information about content (e.g. content/page categories and subjects), user/visitor profiles and relevant external events (e.g. business, finance and news events along with their timings) yields better insights into the web server logs. And the knowledge gained can be looped back to further enhance or supplement the data sources - for example, one can now enhance a user/visitor's profile with the kind of information/content preferred by that user.

The second set of problems associated with big data is to do with systems and infrastructure architecture - designing and implementing a scalable, flexible, reliable and fault-tolerant data storage system, data backups (if possible), data replication (if required), monitoring and operationalizing the data storage system, and high availability. Rather than try to fulfill these separately, such storage systems are built from the ground up with many of those features. However, one still needs to ensure that, as sub-systems are bolted together, the features complement or enhance the overall system and single points of failure (SPOFs) are avoided (or minimized). I would like to point the reader to a very interesting article that I read a while ago on how a company designed high availability and reliability into their system. Big data relies on distributed systems, and for the more adventurous and brave of heart, I would like to point you to an excellent presentation by a genius - he is incidentally my cousin and lived a very short but very memorable life of 26 years.

And finally, the third set of problems is to do with permissions, data integrity, security, privacy and encryption (as needed). These problems or needs often arise when there is personally identifiable, financial or legally sensitive data to be handled/managed. Authentication and authorization, along with data integrity, are important in the big data space because users of the data (analysts, data scientists, developers, admins, etc.) work directly with the data, unlike in a data warehouse or an OLTP application where the application layer takes care of many of those functions.

Rather than solve many of these problems over and over again, people are building stacks and ecosystems that inherently address many of the above problems - e.g. the Hadoop ecosystem consisting of Hadoop, HBase, Pig, Hive, Sqoop, Mahout and other subsystems for data storage, processing and analysis. These are then complemented with systems like Lucene and Solr for effective indexing and searching, Memcached for caching, etc.

Due to the nature of my professional work, I have a particular interest in data processing and analysis - and have already blogged a couple of articles on Pig. Of late, I have been interested in data analysis and will post some articles in that space over the next few months.