Thursday, January 3, 2013

Big Data: 2013


2013 is forecast by some to be the year of Big Data, so I thought it would be nice to start off the year by defining some of the interesting categories of challenges and problems to solve in the big data space. Note that this is just my perspective; yours might differ :-)

First, there is the set of problems to do with data itself - data processing, data storage, retrieval and caching, and data analysis.

Data Processing: This is the set of challenges to do with processing large volumes of loosely structured data with fluid schemas, where the processing needs themselves keep evolving and changing. Typical examples include processing large volumes of continuously streaming/flowing Twitter data, web logs and web crawler logs. The challenge here is that the data volumes are larger, development cycles are faster, and the specifications for the processing are far less clear than in, say, traditional ETL. Furthermore, traditional ETL tools like Informatica and Ab Initio are often not applicable, not scalable, not affordable, or not flexible enough to address these needs.
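
As a rough illustration, here is a minimal sketch of the kind of web log processing described above, written against Hadoop's Java MapReduce API - it counts requests per URL path from Apache-style access logs. The log format and the field position of the request path are assumptions for illustration; real logs and real processing needs will differ.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageHitCount {

  // Mapper: parse each log line loosely and emit (url, 1).
  public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(" ");
      if (fields.length > 6) {   // skip malformed lines instead of failing the job
        url.set(fields[6]);      // assumed position of the request path in the log format
        context.write(url, ONE);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts per url.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "page hit count");
    job.setJarByClass(PageHitCount.class);
    job.setMapperClass(LogMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}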

Data Storage, Retrieval and Caching: Similar to traditional ETL tools, traditional data storage and retrieval technologies like RDBMSes are not suitable for big data storage and retrieval because of the scalability, performance and throughput requirements, and because of the loose structure and schema of the data. And even if the data volume were within the realm of traditional RDBMSes, the software cost, and potentially the hardware cost of a massive single machine (or set of large machines), would be prohibitive for many companies. With traditional RDBMSes and traditional data volumes, it is often possible to cache a significant portion of the data in the RDBMS buffer cache or filesystem buffers and thereby significantly improve performance and throughput. With big data, this is not practical, at least not with a single-machine setup. So often, when there is an online, interactive application sitting over big data, there is a need to add a caching layer to improve read or write response times.
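
To make the caching layer idea concrete, here is a minimal read-through cache sketch in plain Java. The BigDataStore interface is hypothetical and stands in for whatever backing store is actually used (HBase, some other key-value store, etc.); in practice the cache would typically be something like Memcached shared across machines rather than an in-process map.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReadThroughCache {

  // Hypothetical interface over the underlying (slower, distributed) store.
  public interface BigDataStore {
    String fetch(String key);   // expensive read from the backing store
  }

  private final Map<String, String> cache = new ConcurrentHashMap<String, String>();
  private final BigDataStore store;

  public ReadThroughCache(BigDataStore store) {
    this.store = store;
  }

  // Serve from the cache when possible; otherwise read through to the
  // store and populate the cache so later reads of the same key are fast.
  public String get(String key) {
    String value = cache.get(key);
    if (value == null) {
      value = store.fetch(key);
      if (value != null) {
        cache.put(key, value);
      }
    }
    return value;
  }
}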

Data Analysis: In the traditional data warehousing/BI and OLTP application space, reporting and analysis needs are often well defined. In fact, data warehouses and/or data marts are often built to store data for, and help with, specific business analysis and reporting needs. And OLTP applications have their own set of well-defined and structured reporting needs. In the case of big data, however, the objective is to store all (or as much) data as possible and then later examine it from a variety of angles to see what information, insight and perspectives can be extracted from it - either on its own or by combining it with other data sets and/or knowledge. For example, given web server logs, there is only so much analysis that can be done - e.g., web server response time analysis, geographical and temporal analysis of web requests, popular pages and frequent visitor analysis. However, combining that data with additional information about content (e.g. content/page categories and subjects), user/visitor profiles, and information on relevant external events (e.g. business, finance and news events along with their timings) yields better insights into the web server logs. And the knowledge gained can be looped back to further enhance or supplement the data sources. For example, one can now enhance a user/visitor's profile with the kind of information/content preferred by that user.
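
As a small illustration of the kind of enrichment described above, the sketch below joins per-visitor page views against a page-to-category lookup to build a rough "preferred categories" profile per visitor. The record fields and in-memory maps are assumptions purely for illustration; at real data volumes this would be a distributed join (e.g. in Pig or Hive) rather than in-process maps.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VisitorProfileEnricher {

  // Simplified log record: which visitor requested which page.
  public static class PageView {
    final String visitorId;
    final String pageUrl;
    public PageView(String visitorId, String pageUrl) {
      this.visitorId = visitorId;
      this.pageUrl = pageUrl;
    }
  }

  // Returns visitor -> (content category -> view count), built by joining
  // page views against a page -> category lookup table.
  public static Map<String, Map<String, Integer>> enrich(
      List<PageView> views, Map<String, String> pageCategories) {
    Map<String, Map<String, Integer>> profiles = new HashMap<String, Map<String, Integer>>();
    for (PageView view : views) {
      String category = pageCategories.get(view.pageUrl);
      if (category == null) {
        continue;   // page not categorized; skip it
      }
      Map<String, Integer> counts = profiles.get(view.visitorId);
      if (counts == null) {
        counts = new HashMap<String, Integer>();
        profiles.put(view.visitorId, counts);
      }
      Integer current = counts.get(category);
      counts.put(category, current == null ? 1 : current + 1);
    }
    return profiles;
  }
}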

The second set of problems associated with big data has to do with systems and infrastructure architecture - designing and implementing a scalable, flexible, reliable and fault-tolerant data storage system, data backups (where possible), data replication (where required), monitoring and operationalizing the data storage system, and high availability. Rather than bolt these on separately, big data storage systems are built from the ground up with many of these features. However, one still needs to ensure that, as sub-systems are put together, their features complement or enhance the overall system and single points of failure (SPOFs) are avoided (or minimized). I would like to point the reader to a very interesting article that I read a while ago on how a company designed high availability and reliability into their system. Big data relies on distributed systems, and for the more adventurous and brave of heart, I would like to point you to an excellent presentation by a genius - he is incidentally my cousin and lived a very short but very memorable life of 26 years.
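
As one small, client-side illustration of avoiding a SPOF when sub-systems are put together, the sketch below tries each replica of a storage service in turn and fails over on error. The ReplicaClient interface is hypothetical and stands in for whatever storage/service client is actually in use; real systems typically layer retries, timeouts and health checks on top of this basic idea.

import java.util.List;

public class FailoverReader {

  // Hypothetical client for a single replica of the storage service.
  public interface ReplicaClient {
    byte[] read(String key) throws Exception;
  }

  private final List<ReplicaClient> replicas;

  public FailoverReader(List<ReplicaClient> replicas) {
    this.replicas = replicas;
  }

  // Try replicas in order; any single replica can be down without the
  // read failing, as long as at least one replica answers.
  public byte[] read(String key) throws Exception {
    Exception last = null;
    for (ReplicaClient replica : replicas) {
      try {
        return replica.read(key);
      } catch (Exception e) {
        last = e;   // remember the failure and try the next replica
      }
    }
    throw last != null ? last : new Exception("no replicas configured");
  }
}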

And finally, the third set of problems has to do with permissions, data integrity, security, privacy and encryption (as needed). These problems or needs often arise when there is personally identifiable, financial or legally sensitive data to be handled/managed. Authentication and authorization, along with data integrity, are especially important in the big data space because users of the data (analysts, data scientists, developers, admins, etc.) work directly with the data, unlike in a data warehouse or an OLTP application, where the application layer takes care of many of those functions.
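
As a simple illustration of the kind of check that has to move into the data layer when users work directly with the data, here is a sketch of a per-dataset access control list consulted before any read. The names here (DatasetAcl, canRead, etc.) are illustrative assumptions, not a real API; in the Hadoop world this role is played by mechanisms like HDFS file permissions and Kerberos-based authentication.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DatasetAcl {

  // dataset -> set of users allowed to read it
  private final Map<String, Set<String>> readers = new HashMap<String, Set<String>>();

  public void grantRead(String dataset, String user) {
    Set<String> users = readers.get(dataset);
    if (users == null) {
      users = new HashSet<String>();
      readers.put(dataset, users);
    }
    users.add(user);
  }

  // Called by the data access path before handing raw records to a user.
  public boolean canRead(String dataset, String user) {
    Set<String> users = readers.get(dataset);
    return users != null && users.contains(user);
  }
}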

Rather than solve these problems over and over again, people are building stacks and ecosystems that address many of them out of the box - e.g. the Hadoop ecosystem consisting of Hadoop, HBase, Pig, Hive, Sqoop, Mahout and other subsystems for data storage, processing and analysis. These are then complemented with systems like Lucene and Solr for effective indexing and searching, Memcached for caching, and so on.

Due to the nature of my professional work, I have a particular interest in data processing and analysis, and I have already blogged a couple of articles on Pig. Of late, I have been interested in data analysis and will post some articles in that space over the next few months.
