Saturday, August 25, 2012

Pignalytics - Big Data Analytics Using Pig: Introduction


Today, Hadoop is widely used as a big data storage and processing framework: it provides a scalable, highly available storage and online archival solution, and it serves as a platform for parallelizing the processing of that big data using the map-reduce paradigm. The "batch-oriented" nature of Hadoop is complemented by HBase, a high-volume, high-throughput key-value datastore. There are mainly three ways to harness the highly scalable processing capabilities of Hadoop: (1) Java-based map-reduce programming, (2) Hadoop streaming, and (3) higher-level languages and tools like Hive, Pig and Sqoop.

Although the whole Hadoop framework and ecosystem is written in Java, I find using Java a big deterrent - probably because I come from a traditional data warehousing, ETL tools and Unix scripting background. Hive provides some solace, but limits me to encapsulating all my processing into a single HiveQL statement, forcing a store-and-forward approach if I need multiple statements. Pig, however, has enamored me with its simple, yet powerful and extensible data-processing-flow approach. For the uninitiated (as I was about 3 months ago), Pig is a scripting language/tool that facilitates (1) script-based, (2) pipeline- or flow-oriented, (3) batch processing of data. The scripting language naturally fits how one thinks about a data processing or ETL flow - e.g. a data flow for processing web log files, as shown below.

[Diagram: an example Pig data-flow pipeline for processing web log files]
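
To make that concrete, here is a minimal sketch of such a log-processing flow. The file names and the (ip, url, bytes) schema are hypothetical, assumed purely for illustration:

-- load a tab-delimited access log (hypothetical file name and schema)
raw_logs   = LOAD 'access_log.txt' AS (ip:chararray, url:chararray, bytes:long) ;
-- drop malformed records
valid_logs = FILTER raw_logs BY url IS NOT NULL ;
-- count the hits per url
by_url     = GROUP valid_logs BY url ;
url_hits   = FOREACH by_url GENERATE group AS url, COUNT(valid_logs) AS hits ;
STORE url_hits INTO 'url_hits_output' ;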
I should note that although Pig was conceived for, and is closely associated with, the Hadoop framework, Hadoop is not a requirement: Pig works just as effectively with data in Cassandra or on local filesystems. As the diagram and sketch above illustrate, Pig lets you build a data processing pipeline by iterating through one or more datasets, referred to as relations, specifying processing logic on each row of a dataset and passing the result to the next step in the pipeline. At each juncture of the pipeline, you have one result relation and one or more input relations. Relations are just logical concepts and are not necessarily persisted, even transiently.

One uses Pig Latin - a simple, yet very powerful instruction set - to specify the processing. Pig compiles the Pig Latin program into a series of map-reduce jobs, allowing one to focus on the processing. So, for example, if one were to write a word-count program (the Hadoop equivalent of Java's "hello world"), here's one way to write it -

-- load each line of the file as a single chararray field
lines = LOAD 'my_text_file.txt' USING TextLoader() ;
-- split each line into a bag (collection) of words
words = FOREACH lines GENERATE
        TOKENIZE($0) AS bag_of_words ;
-- flatten the bag so that each word becomes its own row
individual_words = FOREACH words GENERATE
                   FLATTEN(bag_of_words) AS word ;
DESCRIBE individual_words ;
-- group identical words together and count the occurrences in each group
word_groups = GROUP individual_words BY word ;
word_count = FOREACH word_groups GENERATE group, COUNT(individual_words.word) AS word_occurrences ;
DESCRIBE word_count ;
-- sort by count and print the results to the console
ordered_word_count = ORDER word_count BY word_occurrences ASC ;
DUMP ordered_word_count ;
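
Incidentally, the two DESCRIBE statements print the schemas of the intermediate relations; their output looks roughly like the following (the exact formatting may vary by Pig version):

individual_words: {word: chararray}
word_count: {group: chararray,word_occurrences: long}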

The same program written as a Java map-reduce job would be at least 5 times larger, with the number of Java "import" statements alone exceeding the size of the above program.

It's easy to see that the above Pig program reads in lines from a file and "tokenizes" each line into a "bag" (collection) of words. Next, identical words are grouped together and a count is made of the number of times each word occurs. The words are then "dumped" in ascending order of their counts.
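
For example, given a hypothetical two-line input file containing "the cat sat" and "the dog sat", the DUMP would print something like this (the relative order of words with equal counts is not guaranteed):

(cat,1)
(dog,1)
(sat,2)
(the,2)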

The beauty of Pig is that this same program can
- run on an input file in Hadoop DFS or on a local filesystem (see the commands after this list)
- scale from a few lines of input data to thousands, millions and billions of lines of input data
- scale from a single node to a thousand-node Hadoop cluster
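
To illustrate the first point: the pig launcher's -x flag selects the execution mode. Assuming the script above is saved as wordcount.pig (a file name chosen just for illustration), it can be run either way:

pig -x local wordcount.pig        # run against the local filesystem; no cluster needed
pig -x mapreduce wordcount.pig    # run on the Hadoop cluster (mapreduce is the default mode)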

At the same time, I would like to point out that the above Pig program generates a pipeline of multiple map-reduce jobs, and there is some room for optimization in that regard. However, the trade-off - a saving in development time and effort for slightly increased computational requirements - is often worth it. Isn't that why we use higher-level and scripting languages in the first place? If not, everyone would still be programming in assembly language!
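
If you are curious about the jobs Pig generates, the EXPLAIN operator prints the logical, physical and map-reduce plans for a relation without actually running anything:

-- show the plans Pig would compile for this relation
EXPLAIN ordered_word_count ;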

There is plenty of introductory information on Pig out there that explains it very well. Pig Latin (the Pig programming language) is small, succinct and effective, and quite easy to understand. What I intend to share in the next few blog entries are things I have learnt on my Pig journey - some cool things, best practices, design patterns, quirks/gotchas etc.