Using Hadoop in big data analysis

Big data abstract

As the quantity of data collected by businesses continues to expand, new approaches to data management are emerging to identify commercial opportunities, and big data analysis is becoming a core business function.

It's well understood that data has value, but extracting that value is proving to be difficult. A survey by technology services firm Avanade showed that 85% of respondents reported obstacles in managing and analysing data, including the sheer volume of data, security concerns and a lack of dedicated staff for the analysis. Also, 63% of stakeholders felt their company needed to develop new skills to turn data into business insights.

"Big data has gained a top spot on the agenda of business leaders for the real value it has begun to create," said Tyson Hartman, the company's global CTO and corporate vice president. "Today, the technologies and skills used to leverage big data for business purposes have reached a tipping point – new types of data supported by better tools to leverage it, enable companies to find financial and competitive benefits."

The most widely used tool in mass data analysis is currently Hadoop, an open source software framework that supports running applications on large clusters of commodity hardware. It enables the management and analysis of almost any kind of data, from log files to video, and can facilitate the analysis of decentralised data across a number of storage systems.

IBM has identified a number of business advantages for Hadoop. Firstly, it is scalable: new nodes can be added as needed, without having to change data formats, how data is loaded, how jobs are written or the applications on top.

Secondly, it is cost-effective, making it possible to run parallel computing on commodity servers and sharply cutting the cost per terabyte of storage. In turn, this makes it affordable to model all your data.

It is also flexible, operating free of any fixed schema and able to absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.

Finally, it is fault tolerant, so that when you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
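
That resilience comes largely from replication: HDFS keeps multiple copies of every data block on different nodes, so losing one machine does not mean losing data. As a minimal sketch, the replication factor is set through the dfs.replication property in the hdfs-site.xml configuration file (three copies is the usual default):

    <!-- hdfs-site.xml: how many copies HDFS keeps of each data block -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>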

In its eBook about understanding big data, IBM states: "Hadoop is generally seen as having two parts: a file system (the Hadoop Distributed File System, or HDFS) and a programming paradigm (MapReduce). One of the key components of Hadoop is the redundancy built into the environment. To understand Hadoop, you must understand the underlying infrastructure of the file system and the MapReduce programming model."
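
To make the MapReduce paradigm concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop Java API: the mapper emits a count of one for every word it sees, and the reducer sums the counts for each word. The input and output HDFS paths are placeholder arguments, not a real deployment.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: split each line into words and emit (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }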

Business case

We're now at the point where, when business and IT managers look at upgrading their data management and analysis systems, they're asking whether Hadoop is the answer.

The key to successfully deploying the framework is to clearly understand the goals for the installation. IT managers need to be vigilant that Hadoop does not become another highly complex system to manage that yields few real insights, and it is vitally important to understand its ecosystem.

For instance, most installations of Hadoop will use the Flume framework to collect streaming data, such as log files, and move it into Hadoop for processing. Sqoop, a tool for transferring bulk data between Hadoop and structured datastores, is used to connect Hadoop with standard SQL databases. This makes it easier to query large data silos using familiar tools.
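
As an illustration, a Sqoop import of a relational table into HDFS looks roughly like the following; the connection string, username, table name and target directory are placeholders, not a real deployment:

    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username analyst -P \
      --table orders \
      --target-dir /data/sales/orders \
      --num-mappers 4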

In addition, ZooKeeper provides centralised coordination for services that could be spread over a large number of data silos, keeping configuration, naming and synchronisation consistent across clusters. These tools are freely available.
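
For a flavour of how that coordination works, the sketch below uses the standard ZooKeeper Java client to publish a small piece of shared configuration as a znode that every node in a cluster can read. The host name, znode path and value are hypothetical, and the sketch assumes the parent /config znode already exists.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ConfigPublisher {
      public static void main(String[] args) throws Exception {
        // Connect to the ensemble (placeholder host) with a 3s session timeout.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, event -> {});

        // Publish a shared setting as a persistent znode.
        zk.create("/config/batch-size", "64".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client connected to the ensemble now reads the same value.
        byte[] data = zk.getData("/config/batch-size", false, null);
        System.out.println(new String(data));

        zk.close();
      }
    }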

Digital IQs

PwC's fourth annual Digital IQ survey found that companies need, more than ever, to make the technology they employ work harder.

"Raising a firm's Digital IQ means improving the way it leverages digital technologies and channels to meet customer needs," said John Sviokla, principal at PwC.

"The core of the ecosystem for innovation has moved from inside the firm to out in the marketplace. Customer and employee expectations are being shaped by this new, dynamic and exciting environment—if you miss this trend you will be increasingly irrelevant to the market."

In an age of big data, Hadoop is becoming a key technology that can deliver real value to its users; but it is not a panacea. There are security issues, and it requires specific skills to set up and maintain an environment using Hadoop.

There are alternative systems, but none take the same holistic approach as Hadoop, which has emerged from the integration of a group of open source big data analysis projects.

Dell's white paper, Hadoop Enterprise Readiness, provides a good snapshot of how important the framework is to businesses that need robust data analysis.

"In short, leveraging big data analytics in the enterprise presents both benefits and challenges," it says. "From the holistic perspective, big data analytics enable businesses to build processes that encompass a variety of value streams (customers, business partners, internal operations, etc.).

"The technology offers a much broader set of data management and analysis capabilities and data consumption models. For example, the ability to consolidate data at scale and increase visibility into it has been a desire of the business community for years. Technologies like Hadoop finally make it possible."