Deduplication stemming the data flood
- 20 August, 2008 12:46
If you say ‘backup’ repeatedly and with increasing speed it ends up sounding a little like the classic Australian refrain ‘bugger’ – try it if you don’t believe me. And while it’s not something you want your employees saying in front of clients – perhaps not at all – it is arguably the most appropriate reflection of the state of storage affairs for many corporations.
Organisations continue to back up data in an uncontrolled manner, resulting in multiple copies which clog up storage space and complicate disaster recovery (DR) plans.
But in the midst of this well-documented data flood, deduplication, or dedupe as it is commonly known, is stemming the tide.
“It’s hot – industry hype is very high at the moment,” Sun Microsystems storage product marketing manager, Steve Stavridis, said. “I think the reality is it is a fact of life.”
Some, like Independent Data Solutions Group (IDS-G) managing director, Gerard Hackwill, claim eight out of 10 customers are interested in deduplication technology.
IDC puts this appeal down to the massive creation of data of all types but particularly unstructured data like videos and PowerPoint files.
“There is obviously great demand not simply on storage capacity but more importantly on the ability to recover these documents once they have been archived,” IDC program director for Asia-Pacific storage research, Simon Piff, said. “I know there is a lot of interest from the dedupe vendors in saying, ‘this is going to help you optimise your storage’. But I think the real message is why are you trying to optimise it? It’s not because the cost of storage is so expensive; the price of storage and dollar per gigabyte is going down year on year. It’s more a requirement of customers – if not to be able to reproduce that data, then know where the data actually resides should the requirement to reproduce the data for compliance and legal requirements arise. That is really driving the markets.”
Looking at most vendor offerings it is clear dedupe has become an integral feature in storage management.
“You have anywhere between five and 20 copies of information stored on your infrastructure,” EMC product marketing manager backup, recovery and archive, Shane Moore, said. “It makes sense that data deduplication gets used there. It’s not just reducing duplicate files – data deduplication actually goes to the sub-file level so it actually turns out there is a huge amount of deduplication there. Backup and recovery is where we are seeing it applied most in the industry.”
Flood of variety
For the unfamiliar, the multi-layered characteristics of dedupe can appear somewhat complex and potentially prompt a few more mumbled curses. To cut a long story short, dedupe involves the detection of identical data at the file, block or bit level by a software application or appliance. At the file level, when two identical files are detected, only one is saved and a pointer to the original replaces subsequent copies.
At the block and bit level dedupe goes into files and saves unique instances of each bit or block. For example, if you change the title of a document, only that change is saved and not a new copy of the entire document. To achieve this, every piece of data is processed using a hash algorithm, such as MD5 or SHA-1, which generates a unique number for each block or bit. This is then compared against an index of existing hash numbers – those with numbers that already exist are not re-saved; new ones are added to the index and saved.
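The block-level mechanism described above can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's implementation: it assumes fixed-size chunks and an in-memory index, where real products typically use variable-size chunking and persistent, disk-backed indexes. The class and method names are invented for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size chunking; real systems often use variable-size chunks


class DedupeStore:
    """Minimal in-memory block store keyed by SHA-1 hash (illustrative only)."""

    def __init__(self):
        self.index = {}   # hash -> stored block (unique blocks only)
        self.files = {}   # filename -> list of block hashes (the 'pointers')

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            h = hashlib.sha1(block).hexdigest()
            if h not in self.index:
                self.index[h] = block   # new block: store it once
            hashes.append(h)            # duplicate block: keep only the pointer
        self.files[name] = hashes

    def read(self, name):
        # Reassemble the file by following the pointers back to unique blocks.
        return b"".join(self.index[h] for h in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.index.values())
```

Writing two largely identical documents to such a store illustrates the effect: the logical data may be several times larger than the unique blocks physically kept on disk.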
While there have been reported occurrences of hash collisions – where the hash algorithm produces the same number for two different chunks and fails to store the new data – the technology has been instrumental in reducing storage requirements.
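One common defence against the collision risk just described is to compare the actual bytes whenever a hash matches, at the cost of an extra read. A hedged sketch, with an invented function name and a deliberately simple in-memory index:

```python
import hashlib


def safe_lookup(index, block):
    """Check whether a block is already stored, verifying bytes on a hash hit
    so a collision cannot silently discard new data. Returns (hash, seen_before).
    """
    h = hashlib.sha1(block).hexdigest()
    if h in index:
        if index[h] == block:
            # Genuine duplicate: safe to store only a pointer.
            return h, True
        # Same hash, different bytes: a real system would fall back to a
        # secondary key rather than raise, but it must never drop the block.
        raise ValueError("hash collision detected for %s" % h)
    index[h] = block
    return h, False
```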
“There have been compression techniques and things like this over many years, so like a lot of things in our industry we have seen similar forms of this before,” NetApp systems engineering director, Michael DeLandre, said. “What is different in the last couple of years is the massive, almost uncontrollable increase some customers [have experienced] in the amount of storage. That has resulted in a lot of investment in cutting the amount of physical storage required.”
There are essentially two times when dedupe occurs: source and target. A source-based approach cuts down the amount of data being sent across the network by doing all the work on the fly with client software communicating with a backup server (using the same software) to determine if data has already been stored.
Conversely a target-based approach eliminates replicated data after it is sent across the network and can use a virtual tape library or NAS with existing backup systems. In target-based systems dedupe can be executed inline (during the backup process) or post-process (after data hits the disk).
The question of which is better is a matter of debate and is unlikely to be resolved in the near future. Some vendors like HP, however, have taken the view that large enterprises, which store immense amounts of data, generally should look at post-process methods while inline – or disk to disk – is appropriate for SMBs.
“One of the things about deduplication is it is fairly intensive to look at the data as it is coming down the wire and understanding whether you have a copy of that data already or not,” HP product marketing manager StorageWorks, Mark Nielsen, said.
“One of the things we do with the enterprise-class products is we actually don’t get in the way of the backup. That allows us to blast the data down as quickly as possible on to the disk and then once we have the data on disk and finished our backup we go back and look at the data and deduplicate it on the virtual library system itself.”
Regardless of which of the countless techniques is deployed or vendor approach adopted, deduplication is a technology that allows IT decision makers to show significant savings to their reporting line – which increasingly includes the CFO or CEO.
“It is one of the few very techie technologies that has the wow factor for business people,” Hitachi Data Systems chief technologist, Simon Elisha, said. “They love the fact that you can point to an appliance and say ‘this is now storing what used to take up all this space in your datacentre’. Or ‘guess what? We’ve just kept all of your file systems online now for an extra 90 days for recovery and it is much faster’. They love that and it is one of those technologies that delivers genuine value.”
It is also one of those technologies where the ‘horses for courses’ rule applies – the benefits and costs can vary widely depending on the existing storage infrastructure, the kind of data being stored and also the culture of the organisation. As a result vendor approaches also exhibit significant differences and claims, particularly when it comes to dedupe compression: A simple search on the Internet will produce advertised ratios of up to 500:1 down to 10:1 and lower.
“While we are looking at the advertised ratios in terms of what data we can actually save by implementing data deduplication, it is important to exercise a little bit of healthy skepticism,” Sun’s Stavridis said.
“The level of reduction you will achieve is dependent on the type of data and amount of data you have got. That will dictate what that reduction ratio will be, which varies from environment to environment.”
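A quoted ratio is simply logical data (everything presented to the backup system) divided by the unique data physically written, which is why it swings so wildly with data type and retention. A worked example, under assumed figures chosen purely for illustration:

```python
def dedupe_ratio(logical_bytes, physical_bytes):
    """Reduction ratio as commonly advertised, e.g. 20:1.
    logical_bytes: total data backed up before deduplication
    physical_bytes: unique data actually written to disk
    """
    return logical_bytes / physical_bytes


# Assumed scenario: 30 daily full backups of a 1TB data set where roughly
# 2 per cent of the data changes each day. Only the first full plus the
# daily changes are unique.
logical = 30 * 1.0           # TB presented to the backup system
physical = 1.0 + 29 * 0.02   # TB of unique blocks on disk
ratio = dedupe_ratio(logical, physical)   # roughly 19:1
```

Change the retention window or the daily change rate and the ratio moves accordingly, which is Stavridis’ point: the figure is a property of the environment, not the product.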
Another debate that has arisen – or more accurately has raised its head again – as a result of dedupe’s popularity is the question of tape’s future in the storage landscape. Some suggest the boost to disk-based storage that deduplication provides is a reason to give tape the flick for good.
“It [tape] is a media we have all loved to hate over the years,” IDS-G’s Hackwill said. “It has been a necessary evil because there has been no other option. But deduplication technology is offering another option where we can be storing many terabytes of data on a much smaller disk footprint than we could on a tape footprint.”
Fujitsu’s national sales manager products, Julian Badell, agreed and added deduplication was marking the beginning of the end of tape for some customers.
“A lot of customers are very keen to move off what they consider a dated media,” he said. “And really the cost of storage is helping make it more accessible. But the significant gains we get from deduplication in saving the physical storage we need is really making the entry point very close if not equivalent to tape.”
A good example can be found in one of Hitachi’s Melbourne-based enterprise customers, which has gone for a completely disk-based storage infrastructure. But not everyone thinks tape has lost its place.
“There are a lot of comments being made that because we are putting more data online we don’t need tape technology,” Sun’s Stavridis said. “Data deduplication is obviously allowing us to put more data on disk which means we can put less data on tape. But I think it is important to really look at that and understand that disk hasn’t replaced tape completely; far from it in fact. Tape has found its place in the datacentre and that is in long-term archives.”
Yet, whatever your position on the future of tape, the uptake of deduplication and disk has some implications for the older technology.
“If you don’t have tape in your environment or you want to reduce it as much as possible you need to backup to some target device, probably at your local site and then replicate that to another site,” EMC’s Moore said.
“So now you have two copies. In the past we have had solutions like virtual tape where people backup to virtual tape and then backup from one site to another. But that is the whole backup. With dedupe only the truly unique elements get replicated from one site to another. That has huge implications for having that offsite copy of your data and the replication of your backup.”
While uptake continues to improve – NetApp, for example, claims 2500 customers globally are using its offerings – deduplication is still in embryonic stages as a technology in the storage space.
“It is early days. Although they sound like big numbers if you put it in the context of NetApp’s total customers we are still early in the curve,” NetApp’s DeLandre noted. “At the moment it is a key feature and something that is important to us because of the fact that we provide it on primary and provide it free; it is a fantastic extra tool in the kit bag for our partners.
“Partners who have really got their head around applying this technology and using it are absolutely in a great position to compete in their space when they have got something like this that they can take to market.”
EMC’s Moore agreed and claimed the technology would eventually be adopted by most organisations, especially if the channel can articulate the benefits.
“The bottom line is the total cost of ownership [TCO] is much better for the organisation,” he said. “They [partners] need to understand how to calculate that and articulate the benefits for the customer from a financial and business point of view.”
First, though, the channel needs to get across the technology and familiarise themselves with the differences in vendor offerings to provide benefit to clients, Symantec director systems engineers, Paul Lancaster, claimed.
“Storage, data, the recovery process, it’s all getting very complex now and it’s not just a case of ensuring your tape is in the backup library each night,” he said. “One of the key things is understanding what the customer’s needs are and understanding the data.”
Additionally, as dedupe continues to attract attention and find its way into storage infrastructures it is an opportunity to assist customers in re-evaluating their business goals and processes; much like virtualisation before it.
“Customers are still getting their heads around the reality that we need to do backup to disk. I think deduplication is probably where virtualisation was two years ago,” HP’s Nielsen said. “It will take a little bit of time for customers to understand they need deduplication but it is fair to say a lot of our enterprise customers are focused on reducing their storage footprint.”