Not So Random Stuff on Big Data: March 2013

While I touched on document classification in the previous articles, NLP on its own can be, and is, applied to solve a variety of problems. Here are some examples.

Email Processing & Categorization
- Spam vs. non-spam
- Prioritization
- Email distillation, summaries and even semantic linking/threading
(e.g. summary of important unread emails by applying named entity recognition and other techniques)
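
To make the spam vs. non-spam item a little more concrete, here is a minimal sketch of a naive Bayes text classifier in plain Python. The four training messages, the "spam"/"ham" labels and the function names (tokenize, train, classify) are all made up for illustration; a real system would need far more data and better tokenization.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; an assumption made for brevity.
    return text.lower().split()

def train(labeled_messages):
    # Count words per label and messages per label.
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in labeled_messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    # Assumes both labels occur at least once in the training data.
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Log prior for the label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(TRAINING)
```

With this toy model, classify("free prize money", *model) comes out as "spam" and classify("team meeting monday", *model) as "ham". The same skeleton could plausibly be adapted to prioritization by swapping the labels.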

Social Media Analysis
- Facebook messages that evoke most likes
- Twitter hashtag re-tweet propensity
- Gender identification
- Socioeconomic analysis
(e.g. work done at a data dive for the World Bank)

Document Clustering and Classification
(Supervised as well as Unsupervised learning)
- Sorting/grouping documents
- Identifying similar/related documents
(presentation by .....)
- News feed analysis
- Financial report analysis (e.g. 10K document analysis)
- Data analysis, information distillation and presentation
(e.g. see Narrative Science)
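
One simple way to approach "identifying similar/related documents" is to compare word-count vectors with cosine similarity. The sketch below assumes whitespace tokenization and raw term counts (no TF-IDF weighting, stemming or stop-word removal), and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    # Bag-of-words vector: word -> number of occurrences.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Dot product over shared terms divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, documents):
    # Return the document whose term vector is closest to the query's.
    query_vec = term_vector(query)
    return max(documents, key=lambda d: cosine_similarity(query_vec, term_vector(d)))

# Hypothetical sample documents.
DOCS = [
    "quarterly earnings report for the bank",
    "football match results and scores",
    "bank earnings beat forecasts",
]
best = most_similar("bank earnings", DOCS)
```

Here best ends up being "bank earnings beat forecasts", since it shares both query words and is shorter than the first document. Grouping documents (clustering) can be built on top of the same similarity measure.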

Language Analysis and Processing
- Automated grading of essays and answers
- Checking for plagiarism
- Post-processing of speech-to-text output
(e.g. in medical transcription and dictation systems)

Having whetted your creative appetite, let me venture into machine learning (ML), which can be considered a superset of NLP. From my perspective, NLP, ML, data science, data mining (DM) and data analysis (DA) are all related topics. They all require data wrangling/processing, statistical data exploration and visualization. I will refer to these fields collectively as Data Science (DS). Unlike standard software development, where each software module or component has well-defined inputs, outputs and processing requirements, DS is highly iterative, and the "objectives" are refined over several iterations. It is a mixture of "science", "math", "art" and "black magic"!

In a nutshell, machine learning is a journey of discovery and learning, with a very fluid process built around the following skeleton steps:

- Define/refine the problem or quest
- Determine what data are needed (it's OK not to have all the data upfront)
- Inspect and visualize available data (using appropriate sampling if necessary)
- Enumerate and analyze applicable approaches, algorithms and techniques
- Determine if any additional features can be obtained or refined from existing or new data
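
As a small illustration of the "inspect available data, using appropriate sampling" step, here is a sketch that draws a random sample from a list of numbers and summarizes it with Python's standard library. The function name sample_summary, the sample size and the fixed seed are my own assumptions, not a prescribed method.

```python
import random
import statistics

def sample_summary(values, sample_size, seed=0):
    # Use a seeded generator so the sample is reproducible between runs.
    rng = random.Random(seed)
    # Sample without replacement only when the data is larger than the sample size.
    if len(values) > sample_size:
        sample = rng.sample(values, sample_size)
    else:
        sample = list(values)
    return {
        "n": len(sample),
        "min": min(sample),
        "max": max(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        # Sample standard deviation needs at least two points.
        "stdev": statistics.stdev(sample) if len(sample) > 1 else 0.0,
    }

# Hypothetical data: 1000 integer measurements.
summary = sample_summary(list(range(1000)), sample_size=100)
```

A quick summary like this is often enough to spot obvious problems (odd ranges, skew, missing scale) before deciding which approaches are worth trying, and it gets re-run as the problem definition is refined.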

I am still exploring this "amazing and magical science" and, as I "peel the layers", I am amazed by its capabilities and potential. It seems that data science with an automated feedback loop and some more bells and whistles becomes AI (Artificial Intelligence). The Google Driverless Car and a recent quadrocopter ball-juggling feat are examples of that. If the links have put you to sleep, these videos will wake you up.

Quadrocopter video
Google driverless car videos (#1, #2, #3)

Saturday, March 30, 2013

A Little Teaser on NLP, Machine Learning, Data Mining, Data Science and Artificial Intelligence