How to use Hadoop to mine business value from new types of data
- 26 September, 2013 16:10
This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.
Apache Hadoop has emerged as the leading technology for helping companies mine big data. And while every organization is different, their big data demands are often similar. Hadoop enables companies to collect and process massive amounts of data that were once thought too expensive or unwieldy to store and analyze. Companies have learned that new data types such as clickstream, sentiment, server log, sensor and location data are valuable sources of insight and business advantage. Let's take a look at how Hadoop is used to mine value from these new types of data.
* Clickstream Data. Analyzing the clickstream (a succession of mouse clicks) can reveal how users research products and, more importantly, how they complete their online purchases. Online marketers can then optimize product web pages and promotional content to improve the likelihood that a visitor will learn more about certain products and then click the buy button. There are tools that help web teams analyze clickstreams, but Hadoop adds three key benefits:
  - Hadoop can join clickstream data with other data sources like CRM data on customer demographics, sales data from brick-and-mortar stores, or information on advertising campaigns. This additional data provides a more comprehensive view of customer patterns than an isolated analysis of clickstream alone.
  - Hadoop scales easily so you can store years of data without much incremental cost, allowing you to perform temporal or year-over-year analysis on clickstream data. You can save years of data on commodity machines and find deeper patterns that your competitors may miss.
  - Hadoop makes website analysis easier. Without Hadoop, clickstream data is typically very difficult to process and structure. With Hadoop, even a beginning web business analyst can organize clickstream data by user session and then refine it and feed it to analytics or visualization tools.
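To make "organize clickstream data by user session" concrete, here is a minimal sketch in plain Python rather than any Hadoop API. It assumes each click is a (user_id, timestamp, url) record and that a session ends after 30 minutes of inactivity; the record layout, the `sessionize` name and the 30-minute cutoff are illustrative assumptions, not details from a specific tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity gap that closes a session (an illustrative convention).
SESSION_GAP = timedelta(minutes=30)


def sessionize(events, gap=SESSION_GAP):
    """Group (user_id, timestamp, url) clicks into per-user sessions.

    Returns a list of (user_id, [(timestamp, url), ...]) tuples, one per session.
    """
    # Step 1: collect every click under its user (the "group by key" step).
    by_user = defaultdict(list)
    for user_id, ts, url in events:
        by_user[user_id].append((ts, url))

    # Step 2: walk each user's clicks in time order and split on long gaps.
    sessions = []
    for user_id, clicks in by_user.items():
        clicks.sort()
        current = [clicks[0]]
        for prev, nxt in zip(clicks, clicks[1:]):
            if nxt[0] - prev[0] > gap:
                sessions.append((user_id, current))
                current = []
            current.append(nxt)
        sessions.append((user_id, current))
    return sessions
```

On a cluster, the grouping step would typically be handled by the framework's shuffle on the user key, with the gap-splitting logic running once per user.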
* Sentiment Data. Sentiment data is unstructured data on opinions, emotions and attitudes contained in social media posts, blogs, online product reviews and customer support interactions. Enterprises use sentiment analysis to understand how the public feels about something and track how those opinions change over time.
Sentiment analysis quantifies the qualitative views expressed in social media. Researchers need big data to do this reliably. With Hadoop, social media posts can be organized and scored for sentiment with advanced machine learning methodologies. Here's how it works: Words and phrases are assigned a polarity score of positive, neutral or negative. By scoring and aggregating millions of interactions, analysts can judge candid sentiment at scale, in real time.
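The scoring-and-aggregating step can be sketched in a few lines of Python. This is a deliberately simple word-lexicon approach, not the machine learning methods mentioned above, and the tiny `LEXICON` and the function names are hypothetical, chosen only to illustrate the idea of assigning polarity and then counting.

```python
import re
from collections import Counter

# Hypothetical polarity lexicon: +1 is positive, -1 is negative.
# Words that are not listed are treated as neutral (0).
LEXICON = {"love": 1, "great": 1, "fast": 1, "hate": -1, "broken": -1, "slow": -1}


def score_post(text):
    """Label one post as 'positive', 'negative' or 'neutral'."""
    words = re.findall(r"[a-z']+", text.lower())
    total = sum(LEXICON.get(word, 0) for word in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


def aggregate(posts):
    """Count how many posts fall under each polarity label."""
    return Counter(score_post(post) for post in posts)
```

Scoring each post is independent of every other post, which is why this kind of work spreads naturally across many machines; only the final counts need to be combined.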
After scoring sentiment, it's important to join the social data with other sources of data. Hadoop makes that easy and reproducible. CRM, ERP and clickstream data can be used to attribute what was previously anonymous or semi-anonymous sentiment to a particular customer or segment of customers. The results can be visualized with business intelligence tools like Microsoft Excel, Platfora, Splunk or Tableau.
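As a rough illustration of that join, the Python sketch below attributes scored posts to customer segments. It assumes the social handle is the shared key between the two data sets and that sentiment has already been reduced to a number per post; the handle key, the segment names and the `sentiment_by_segment` function are all hypothetical.

```python
from collections import defaultdict


def sentiment_by_segment(scored_posts, crm):
    """Average sentiment per customer segment.

    scored_posts: iterable of (handle, score) pairs, score in {-1, 0, 1}.
    crm: dict mapping a handle to a customer segment (assumed join key).
    """
    totals = defaultdict(lambda: [0, 0])  # segment -> [score sum, post count]
    for handle, score in scored_posts:
        segment = crm.get(handle)
        if segment is None:
            continue  # no CRM match, so this post stays anonymous
        totals[segment][0] += score
        totals[segment][1] += 1
    return {segment: s / n for segment, (s, n) in totals.items()}
```

Posts with no CRM match are skipped here; in practice an analyst might keep them in a separate "unattributed" bucket instead.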
* Server Log Data. Server log data, which reports EKG-like information on the operations of enterprise networks, often holds the evidence needed to explain a security breach. Server logs are the first place the IT team looks when there's a problem with the network. However, the sheer volume of this data makes it difficult and expensive to store and even more difficult to analyze.
When security fails, Hadoop helps enterprises understand and repair the vulnerability quickly and facilitates root cause analysis to create lasting protection. Often, companies don't know of system vulnerabilities until they've already been exploited. So rapid detection, diagnosis and repair of the intrusion are critical.
Hadoop can make forensic analysis faster. If IT administrators know that server logs are always flowing into Hadoop, where they can be joined with other types of data, they can establish standard, recurring processes to flag any abnormalities. They can also prepare and test data exploration queries and transformations in advance, so they are ready when an intrusion is suspected.
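One example of a standard, recurring check is counting failed logins per source address and flagging the addresses that exceed a cutoff. The Python sketch below assumes a simplified three-field log line ("timestamp ip event"), a `LOGIN_FAILED` event name and a threshold of three; real server logs vary widely, so all of these are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed log layout, for illustration only: "<timestamp> <ip> <event>"
LINE = re.compile(r"^(\S+) (\S+) (\S+)$")

# Hypothetical cutoff for how many failures count as abnormal.
FAILED_LOGIN_THRESHOLD = 3


def flag_suspicious_ips(lines, threshold=FAILED_LOGIN_THRESHOLD):
    """Return the sorted IPs with at least `threshold` failed logins."""
    failures = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match and match.group(3) == "LOGIN_FAILED":
            failures[match.group(2)] += 1
    return sorted(ip for ip, count in failures.items() if count >= threshold)
```

Because the check is just a count per key, it can be rerun on a schedule over each new batch of logs and the flagged addresses compared from one run to the next.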
* Sensor and Location Data. Hadoop solves two big challenges that currently limit the use of sensor data--its volume and its structure. Sensors measure and transmit small bits of data efficiently, but they are always on. As the number of sensors increases and time passes, the data from each sensor can add up to petabytes. Hadoop stores this data more efficiently and economically, turning big sensor data into an asset.
Using specific algorithms that identify previously invisible patterns, Hadoop can also be used for predictive analytics and proactive maintenance. The ability to predict equipment failure is valuable because it's far less expensive to do preventative maintenance than it is to pay for emergency repair or replacement equipment.
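The article does not name a particular algorithm, so the following Python sketch uses one simple stand-in: flag any sensor reading that sits far outside the range of the readings just before it. The window size, the three-standard-deviation cutoff and the `flag_drift` name are assumptions made for illustration; production predictive maintenance typically relies on richer models.

```python
from statistics import mean, stdev


def flag_drift(readings, window=5, k=3.0):
    """Return indexes of readings that deviate sharply from recent history.

    A reading is flagged when it lies more than k standard deviations from
    the mean of the `window` readings immediately before it.
    """
    flagged = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # A perfectly flat baseline has sigma == 0; skip it to avoid
        # flagging every tiny change after a constant stretch.
        if sigma > 0 and abs(readings[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged
```

A flagged index is only a prompt for inspection; deciding whether it signals impending failure would require historical failure data for the equipment in question.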
Doctors can now track more than 1 billion individual data measurements to diagnose and predict medical episodes with greater precision. Hadoop makes it much easier to refine and explore this data to find the meaningful patterns. Tools can be used to join various data sets together, combine that with data on health outcomes, and then refine it all into a master dataset that includes the important patterns and excludes the trivial ones.
Location data is a sub-variant of sensor data since the device senses its location and transmits data on its latitude and longitude at pre-defined intervals. This is truly a new form of data, since it did not exist (outside of highly specialized military and aerospace applications) until 10 years ago.
Today, smartphones can capture and transmit precise longitude and latitude at regular time intervals--the sensor is connected to the communication network in the same device. Consumer-driven businesses want to use this data to understand where potential customers congregate during certain times of the day. In addition, delivery vehicles use location data to optimize driver routes, improving delivery times, lowering fuel costs and reducing the risk of accidents.
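Finding where potential customers congregate at certain times of day can be approximated by snapping each location reading to a grid cell and counting distinct devices per cell and hour. The Python sketch below assumes readings arrive as (device_id, timestamp, latitude, longitude) and uses a 0.01-degree grid; the record layout, the grid size and the `busiest_cells` name are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime

# Hypothetical grid size in degrees (roughly 1 km of latitude).
CELL_DEGREES = 0.01


def busiest_cells(pings, top_n=3):
    """Rank (hour, grid cell) pairs by the number of distinct devices seen.

    pings: iterable of (device_id, timestamp, latitude, longitude).
    Returns a list of ((hour, (lat_index, lon_index)), device_count).
    """
    seen = set()
    for device_id, ts, lat, lon in pings:
        cell = (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))
        # The set removes repeat pings from one device in the same cell and hour.
        seen.add((ts.hour, cell, device_id))
    counts = Counter((hour, cell) for hour, cell, _ in seen)
    return counts.most_common(top_n)
```

Counting distinct devices rather than raw pings keeps one frequently reporting phone from making a location look crowded.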
With the introduction of these new data sources, enterprises are forced to think differently. The most commonly seen emerging data architecture introduces Apache Hadoop to handle these new types of data in an efficient and cost-effective manner. Hadoop does not replace the traditional data repositories used in the enterprise, but rather complements them. With the availability of enterprise-ready Apache Hadoop distributions, enterprises are embarking on a wide variety of Hadoop-based big data projects and creating a next-generation data architecture.
Hortonworks is the only 100% open source software provider to develop, distribute and support an Apache Hadoop platform explicitly architected, built and tested for enterprise-grade deployments.