A company today can buy a terabyte of enterprise-class disk storage for about $US5000. Eight years ago, it would have cost $US200,000. Even the dramatic drop in the cost of processing doesn't happen that fast.
The cost of storage is plummeting just in time for consumers to save all those digital photos, videos and songs they've developed an appetite for - and just in time for companies to comply with new regulations on document retention. And while they're at it, companies might as well hold on to all of their sales transactions and do some data mining and analysis.
But there's more going on here than a linear extrapolation of capabilities. Size matters, and users with big storage systems are likely to find themselves at a tipping point, empowered with fundamentally new capabilities.
"Since storage is almost free, you can kind of keep everything now," a professor of electrical engineering and computer science at Stanford University, Kunle Olukotun, said. And "everything" was just what was needed for an increasingly popular class of techniques called statistical machine learning, he said. As the name suggests, the idea is for a system to develop its own rules of logic by discovering patterns and relationships in data rather than having a programmer hard-code the rules in advance.
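The contrast Olukotun draws can be seen in a minimal sketch: a toy text classifier whose "rules" (which words signal which label) are never written by a programmer but counted out of labelled examples. The training data and labels below are invented for illustration.

```python
from collections import Counter

# Toy labelled examples (hypothetical). No hand-coded rules anywhere:
# the word-to-label associations are learned purely from these counts.
training = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("project status meeting", "ham"),
]

counts = {"spam": Counter(), "ham": Counter()}
for text, label in training:
    counts[label].update(text.split())

def classify(text):
    # Score each label by how often its training examples used these words.
    scores = {label: sum(c[w] for w in text.split())
              for label, c in counts.items()}
    return max(scores, key=scores.get)

print(classify("free money"))      # words seen mostly in spam examples
print(classify("status meeting"))  # words seen mostly in ham examples
```

With more data, the counts become better estimates and the learned rules sharpen, which is exactly the "more data, better rules" dynamic described below.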
"There's this notion of using large amounts of data to do things that previously were done by clever algorithms - for example, language translation," Olukotun said. Traditionally, automated language translation has been accomplished via bilingual dictionaries and databases of linguistic rules. Google uses that method for translating among English, Spanish, German and French. But it's using machine learning in experimental translation engines for Arabic, Chinese and Russian.
"The Google guys said, 'Why don't we just throw massive amounts of data at it and look at lots of examples of the source and destination languages to come up with the rules?'" Olukotun said. "The more data you have, the better the rules get." Access to vast amounts of data would put the solutions to some kinds of problems within reach, senior vice-president for research at Microsoft, Rick Rashid, predicted.
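The statistical translation approach Olukotun describes can be caricatured in a few lines: given sentence pairs in two languages, count which words co-occur, and let the word correspondences fall out of the counts. The aligned corpus here is made up, and real systems align phrases over millions of pairs, but the principle is the same.

```python
from collections import Counter, defaultdict

# Hypothetical aligned English/Spanish sentence pairs.
pairs = [
    ("the house", "la casa"),
    ("my house", "mi casa"),
    ("the cat", "el gato"),
    ("the white house", "la casa blanca"),
    ("a white cat", "un gato blanco"),
]

# Count how often each English word appears alongside each Spanish word.
cooc = defaultdict(Counter)
for en, es in pairs:
    for e in en.split():
        for s in es.split():
            cooc[e][s] += 1

def translate_word(word):
    # The "rule" (house -> casa) emerges from co-occurrence, not a dictionary.
    return cooc[word].most_common(1)[0][0]

print(translate_word("house"))  # casa
print(translate_word("cat"))    # gato
```

The more sentence pairs are thrown at the counts, the less noise survives, which is the "throw massive amounts of data at it" bet in the quote above.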
"We can think about analysing huge amounts of epidemiological data to find solutions to medical problems, or think about traffic flow and urban planning and energy usage," he said.
"Where there are large stores of data, whether the company realises it or not, there is a gold mine," vice-president of engineering at AdMob in San Mateo, Kevin Scott, said.
Collaborative filtering

Scott, recently a senior engineering manager at Google, said companies with large volumes of sales transaction data could use it for collaborative filtering, a way to infer information about a shopper by comparing his transactions with those of similar users. It's how Amazon.com guesses that you might be persuaded to buy a book about digital photography when you order Adobe Photoshop.

The data growing most quickly in volume today is unstructured information, which can't be readily parsed and analysed by automated means.
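The collaborative filtering Scott describes reduces, in its simplest form, to co-purchase counting: recommend whatever most often appears in the baskets of shoppers who bought the same item. The purchase histories and product names below are invented for illustration.

```python
from collections import Counter

# Hypothetical purchase baskets, one set of items per customer.
baskets = [
    {"photoshop", "photo_book", "tripod"},
    {"photoshop", "photo_book"},
    {"photoshop", "sd_card"},
    {"novel", "cookbook"},
]

def also_bought(item):
    # "Customers who bought X also bought..." as a simple co-purchase count.
    c = Counter()
    for basket in baskets:
        if item in basket:
            c.update(basket - {item})
    return [x for x, _ in c.most_common()]

print(also_bought("photoshop"))  # photo_book ranks first (bought together twice)
```

Production recommenders weight these counts by similarity and popularity, but the Photoshop-to-photography-book inference in the article is this pattern at scale.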
More than 80 per cent of all digital information in companies is in unstructured documents, according to research firm IDC. That provided a huge opportunity for data mining via techniques akin to machine learning, Scott said.
For example, a large company could pass many thousands of free-form performance reviews through algorithms that learn which data represents an employee name, an annual raise and so on, he said. The systems would extract structure from unstructured information. Then a query tool could answer a question such as, "What is my average employee rating by geographic region each year for the past 10 years?"
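A bare-bones sketch of the extract-then-query pipeline Scott outlines: pull fields out of free-form text, then aggregate them. The review snippets, field names and the fixed regular expression are all assumptions for illustration; a real system would use a learned extractor rather than hand-written patterns.

```python
import re
from collections import defaultdict

# Hypothetical free-form performance-review snippets.
reviews = [
    "2022 review for A. Smith (APAC region): overall rating 4 of 5.",
    "2022 review for B. Jones (EMEA region): overall rating 3 of 5.",
    "2023 review for C. Wu (APAC region): overall rating 5 of 5.",
]

pattern = re.compile(r"(\d{4}) review .*\((\w+) region\).*rating (\d)")

# Extract structure: (year, region) -> list of ratings.
ratings = defaultdict(list)
for text in reviews:
    m = pattern.search(text)
    if m:
        year, region, score = m.groups()
        ratings[(year, region)].append(int(score))

# Query the now-structured data: average rating by region and year.
for key, vals in sorted(ratings.items()):
    print(key, sum(vals) / len(vals))
```

Once the structure is extracted, the "average rating by region per year" question becomes an ordinary aggregation.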
Techniques such as machine learning and text mining won't be used widely until sometime well into the future, but many organisations already need to store all of their data for other purposes.
"We have this explosion in rich content, and it's not just consumers with digital photos and videos and music," IDC analyst, Richard Villars, said. "It's hospitals moving to electronic records and X-rays and MRIs, and banks going to video surveillance, and then archiving that for years at a time."
In fact, some companies save individual pieces of data multiple times.
US-based Intellidyn has 70TB of data, covering things such as demographics, lifestyles, credit histories and property transactions, on 200 million adults.
Duplicated data

CEO, Peter Harvey, said multiple credit agencies sent files that had 90 per cent overlap, but it was cheaper and easier to store the extra data than to purge it. And he said Intellidyn itself duplicated data by setting up private data marts for clients.
Harvey said Intellidyn's storage would grow to 2 petabytes within two years. "So what?" he shrugged. "Storage is getting cheaper every day."
But while raw disk space may be, as Olukotun said, "almost free", a doubling in disk capacity should not be confused with a doubling in the performance of a disk system.
"We've got this huge mismatch between the transfer bandwidth and latency of the disk and how much you can store on the disk," he said.
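The mismatch Olukotun points to is easy to put numbers on. Using illustrative round figures (a 1TB drive and a 100MB/s sustained read rate are assumptions, not quoted specifications), reading an entire disk takes hours even though it fits in one enclosure:

```python
# Back-of-envelope: capacity has grown far faster than transfer bandwidth.
capacity_gb = 1000     # assumed 1TB drive
transfer_mb_s = 100    # assumed sustained sequential read rate

seconds = capacity_gb * 1000 / transfer_mb_s
print(f"{seconds / 3600:.1f} hours to read the whole disk end to end")
```

Doubling the capacity at the same transfer rate doubles that read time, which is the sense in which bigger disks are not faster disks.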
A traditional way to get data on and off a disk faster has been to increase the rotational speed of the disk, but mechanical limits cap that speed at about 15,000rpm, Villars said. That could open the door to newer technologies, he said, such as persistent flash memory, which had no moving parts.
In the meantime, AdMob's Scott said users would have to compensate for the transfer bandwidth bottleneck by being clever designers.
"We are entering an era where programmers will have to be fairly savvy about the performance of the programs they write," Scott said.
"They won't be able to make silly decisions like, 'Oh, I have a 1TB disk, so I will write a program to do a linear scan through 500GB of data to find an email'. That is not going to work."
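Scott's linear-scan complaint can be sketched concretely: scanning every record to find one email versus looking it up through an index built once. The record layout below is invented, and the dataset is shrunk to something that runs instantly, but the asymptotic point is the same.

```python
# Hypothetical mail store: 100,000 small records standing in for 500GB.
emails = [{"id": i, "subject": f"msg {i}"} for i in range(100_000)]

def find_linear(msg_id):
    # The "silly decision": touch every record until one matches.
    for e in emails:
        if e["id"] == msg_id:
            return e

# The savvy alternative: pay once to build an index, then look up directly.
index = {e["id"]: e for e in emails}

def find_indexed(msg_id):
    return index[msg_id]

assert find_linear(99_999) == find_indexed(99_999)
```

At 500GB, the linear scan means hours of disk reads per query, while the indexed lookup touches a handful of blocks, which is the gap between capacity and bandwidth made visible in code.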