How to use visualization tools to derive data intelligence from unstructured data
- 19 September, 2013 20:59
Like their bigger brethren, mid-sized companies are struggling to manage tens of terabytes of data about their customers, markets and products: a veritable gold mine of information, if only they knew how to excavate it.
In the last two years alone, businesses have generated more data than we saw in the previous 60 years. Thanks to innovations in deduplication, compression, incremental increases in hard drive density and falling solid state drive (SSD) prices, companies are finding ways to store the massive influx of data. The real challenge, however, goes beyond storage. This data is rich with intelligence that could inform business strategy, reduce costs and drive growth, but few smaller companies have the budgets or the staff to unlock its potential. These businesses need a solution that can provide answers and intelligence without breaking budgets or requiring a data scientist.
The volume of data is daunting enough; the type of data, however, magnifies the challenge. Structured data only accounts for about 20 percent of stored information. The rest, unstructured data, includes social media feeds, emails, blogs, Microsoft Office documents, photos, videos and more.
This data typically lives in a variety of locations across an organization and is rarely managed explicitly. The companies that do attempt to manage these unstructured sources generally use document management systems, which often end up as yet another information silo, like email applications, file shares and corporate intranets. In a study conducted last year, IDC found that these silos are part of the reason why information workers lose nearly 20 percent of their time to inefficiencies.
It's easy to see why unstructured data lies dormant. It is often created and consumed ad hoc and isn't organized for ease of access. It doesn't have a clearly defined schema, and companies typically lack the tools and expertise to massage, visualize and manipulate this data to identify valuable information and inform decisions.
This challenging data is at the core of the big data problem with which companies of all sizes are grappling. At the large enterprise end of the spectrum, the solution comes in the form of a cluster of servers and one or more data scientists (and the large costs associated with them). Some smaller companies, in their scramble to keep up, have tried to train their business analysts to do data scientist-level work. What these smaller companies really need is a solution that can automatically transform their data into intelligence and present the right data to the right people at the right time.
Data intelligence questions
The first priority in solving the unstructured, big data problem for smaller companies is putting the data in context. For example, who in my company knows about Customer X? Where is the latest version of the contract, and who should have read it but hasn't? These companies need to know what data is being produced by which departments and individuals, and they need to know how their teams are using that information. Answering these questions raises numerous other ones, including:
1. Who owns the data? This question is critical in organizations where IT is responsible for purchasing and planning infrastructure but doesn't participate in the utilization and management. Compliance and legal staff are also interested in understanding data ownership and custodians.
2. How do you capture this data in real time and share it? There aren't attractive options for doing this with unstructured data across an organization. Some of the current solutions periodically scan file shares to identify new or changed data and copy the contents to big data processing systems such as Hadoop. This approach puts unnecessary load on the file server and results in at least two copies of the original data, ballooning storage costs and management overhead.
3. Which properties of unstructured data should be identified? The answer to this question depends heavily on the company. For example, there are compliance issues that cut across many industries, such as Health Insurance Portability and Accountability Act (HIPAA), Payment Card Industry Data Security Standard (PCI DSS), and privacy laws that prevent the unencrypted storage of personally identifiable information. Automatically identifying when and where this sensitive data is created and consumed is both challenging and crucial.
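To make the third question concrete, here is a minimal sketch of how text might be checked for one kind of sensitive data, payment card numbers. It is an illustration only, not a description of any particular product: the pattern and the function names (`CARD_CANDIDATE`, `luhn_valid`, `find_card_numbers`) are assumptions made for this example. It looks for 13- to 16-digit sequences and keeps those that pass the Luhn checksum used by card numbers.

```python
import re

# Assumed pattern for this sketch: 13 to 16 digits, optionally separated
# by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return card-like numbers found in text that pass the Luhn check."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


if __name__ == "__main__":
    sample = "Invoice note: card 4111 1111 1111 1111 kept on file."
    print(find_card_numbers(sample))
```

A check like this only flags candidates. A real compliance workflow would cover other identifiers as well and would still need human review, since a checksum match does not prove that a number is a live card.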
When questions like these can be answered efficiently, a whole new set of use cases emerges, saving time and making better use of existing investments. For example, you could quickly locate the newest version of the proposal document or identify who in the marketing department is most knowledgeable about Product X. It is often tremendously difficult for companies to meet these basic information needs.
Visualizing what's possible: data intelligence that works
Visualization has long been an important tool for helping users understand and act on complex information. For example, suppose you had a database of the stock prices of all companies, taken at one-second intervals across all of 2008. Looking only at the individual rows of data in a table, how long would it take you to determine how the market did overall and how several specific stocks did relative to the market?
Stock charts are one category of visualization. They can rapidly show performance over time, compare the relative performance of multiple securities, and show additional derived information such as moving average, relative strength index, and more.
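As a small illustration of the derived information mentioned above, the sketch below computes a simple moving average over a list of prices. The function name and the sample prices are made up for this example; the relative strength index is not shown.

```python
def simple_moving_average(prices, window):
    """Return the simple moving average of prices over a sliding window.

    The first value in the result corresponds to prices[window - 1].
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    if len(prices) < window:
        return []

    running = sum(prices[:window])
    averages = [running / window]
    for i in range(window, len(prices)):
        # Slide the window: add the newest price, drop the oldest one.
        running += prices[i] - prices[i - window]
        averages.append(running / window)
    return averages


if __name__ == "__main__":
    closes = [10, 11, 12, 13, 14]   # hypothetical closing prices
    print(simple_moving_average(closes, 3))
```

Plotted alongside the raw prices, a series like this smooths out short-term noise so the overall trend is easier to see.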
Visualization tools like this are extremely valuable, but they have traditionally operated exclusively on highly structured data, like stock prices and sales records. As organizations create and consume massive amounts of unstructured data, there is tremendous value in extending visualization functionality to include unstructured data.
When attempting to visualize unstructured data, it is crucial for organizations to create and maintain a rich set of metadata: structured data that describes the unstructured data. This metadata can be directly fed into visualization tools, or it can be used to link the unstructured data with the existing structured data. It is worth noting that the process of manually extracting and identifying metadata is costly and impractical, so some form of automatic annotation should be part of the solution.
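A rough sketch of what the simplest form of automatic annotation could look like follows. It derives a structured record from each file using only what the file system already knows. The field names and the functions `extract_metadata` and `scan_tree` are assumptions for illustration; content-derived metadata such as topics, people or sensitive-data flags would need additional processing that is not shown here.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def extract_metadata(path):
    """Build a structured metadata record for a single file."""
    p = Path(path)
    info = p.stat()
    mime, _ = mimetypes.guess_type(p.name)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": p.name,
        "extension": p.suffix.lower(),
        # Fall back to a generic type when the extension is unknown.
        "mime_type": mime or "application/octet-stream",
        "size_bytes": info.st_size,
        "modified_utc": modified.isoformat(),
    }


def scan_tree(root):
    """Return metadata records for every regular file under root."""
    return [
        extract_metadata(p)
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    ]


if __name__ == "__main__":
    for record in scan_tree("."):
        print(record)
```

Records like these can be loaded into a table and charted directly, for example data volume by file type or by month of last modification.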
When looking for visualization tools, note that the ability to support data from multiple sources and join or overlay the results is a crucial feature. This will keep the content in context and anchor the results. Based on the stock example above, imagine the value of combining related unstructured information (news articles) over the same time period.
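The join described above can be sketched in a few lines. This example assumes that each news article has already been reduced to a publication date (one piece of metadata) and that prices are keyed by ISO-format day strings. All names and values are hypothetical.

```python
from collections import Counter


def overlay_news_on_prices(prices_by_day, article_dates):
    """Join daily closing prices with the number of articles per day.

    prices_by_day: dict mapping an ISO date string to a closing price.
    article_dates: iterable of ISO date strings, one per article.
    Returns a list of (day, close, article_count) tuples sorted by day.
    Days that have articles but no price are left out.
    """
    counts = Counter(article_dates)
    return [
        (day, close, counts.get(day, 0))
        for day, close in sorted(prices_by_day.items())
    ]


if __name__ == "__main__":
    prices = {"2008-09-16": 95.5, "2008-09-15": 100.0}      # made-up values
    articles = ["2008-09-15", "2008-09-15", "2008-09-17"]   # made-up dates
    for row in overlay_news_on_prices(prices, articles):
        print(row)
```

Each resulting row carries both series for the same day, so a charting tool can draw the price as a line and the article count as bars on a shared time axis.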
Imagine plotting news volume and key topics and concepts. It would then be much easier to see what happened by looking at the structured data, and gain insight into why it happened by looking at the related unstructured data. This exploration can work in both directions; I could look for documents describing corporate acquisitions and see which direction the stock price of the acquiring organization tends to go.
Small and medium-sized businesses have large data challenges. These companies see their data usage swelling and they see the intelligence in that data remaining untapped. In order to derive big value from big data, companies need to bridge silos and visualize information, so they can answer difficult questions.
With limited resources, these organizations can't build complex systems and hire data scientists. Instead, they need a new generation of data management solutions to help them extract information from stored data and compete in a marketplace where data intelligence is becoming a game changer.
Kearns is director of product management at DataGravity (http://datagravity.com/), which is focused on defining and delivering data intelligence. He has spoken at conferences around the world about the power of search and analytics and has worked with many of the world's most successful companies and government agencies implementing these technologies.
Read more about software in Network World's Software section.