Profiling unstructured data to streamline storage management
- 21 June, 2013 16:19
This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.
New data profiling technology makes it possible for organizations to reclaim storage capacity, archive data with business value, delete aged and abandoned data with no business value, tier content to other classes of storage and even manage storage chargebacks with reliable statistics.
Data profiling takes all forms of unstructured data and provides a searchable "map" of the metadata, including the type of information that exists, where it is located, who owns it, whether it is a duplicate and when it was last accessed. Optionally, data profiling can look beyond metadata to run full-text searches and locate specific files, uncovering sensitive content such as Social Security or credit card numbers.
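As a rough illustration of the full-text side, the sketch below scans a string for Social Security and credit card number patterns. It is a minimal example, not how any particular product works: the regular expressions, the Luhn checksum filter and the function names `luhn_valid` and `find_pii` are all assumptions made for this illustration, and production detectors apply much stricter rules.

```python
import re

# Illustrative patterns only (assumed for this sketch).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> dict:
    """Return candidate SSNs and Luhn-valid card numbers found in text."""
    ssns = SSN_RE.findall(text)
    cards = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            cards.append(digits)
    return {"ssn": ssns, "card": cards}
```

The Luhn check is there to cut false positives: a 16-digit order number matches the card pattern but usually fails the checksum.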
Using this technology, data can be classified and disposition can be determined.
Starting the assessment
Typical data profiling deployments manage petabytes of unstructured user data and process content using high-speed, low-impact technology that is incrementally updated so the index stays current.
Data profiling uses high-speed NFS or CIFS crawling to index user files and email repositories. Metadata is the default indexing approach; however, full content indexing is available if keyword and PII searches (Social Security and credit card numbers) are required.
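A metadata-only crawl of this kind can be sketched in a few lines. The example below assumes the NFS or CIFS share is already mounted as a local path and simply walks it with the Python standard library; the function name `index_metadata` and the record fields are hypothetical, and a real crawler would add parallelism, incremental updates and owner-name lookups.

```python
import os
from datetime import datetime, timezone


def index_metadata(root: str) -> list[dict]:
    """Walk a directory tree and record basic metadata for each file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable or vanished file; skip it
            records.append({
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": st.st_size,
                "owner_uid": st.st_uid,  # numeric owner ID; 0 on Windows
                "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
                "accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
            })
    return records
```

Because only `os.stat` is called, file contents are never read, which is what keeps a metadata pass lighter than full content indexing.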
Once data is indexed, high-level summary reports and interactive filters allow instant insight into enterprise storage, providing new knowledge about data assets. The tools should integrate with your Active Directory environment and be able to inherit the security schema to provide advanced insight into abandoned and ex-employee data.
Through this process, mystery data can be managed and classified, including content that has outlived its business value and content that is owned by ex-employees and now sits abandoned on the network.
Unstructured data classifications typically fit into one of seven buckets:
- Abandoned: Owned by ex-employees and not accessed in many years
- Aged: Not accessed in three or more years, created by active users
- Redundant: Duplicate content, identified by matching MD5 hashes
- Personal: Multimedia files such as iTunes libraries, photos and movies
- Risk: Sensitive content such as PII and e-discovery and legal hold data
- Archive: Data with long-term business value that must be preserved
- Active: Data still in use, managed in place until its future disposition is determined
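The buckets above can be expressed as a simple rule chain. The sketch below is one possible reading, not a vendor algorithm: the `FileRecord` fields, the list of personal file extensions, the order in which rules are tested, and the use of the same three-year threshold for abandoned data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed extension list for the "Personal" bucket.
PERSONAL_EXTENSIONS = (".mp3", ".m4a", ".jpg", ".png", ".mov", ".mp4")


@dataclass
class FileRecord:
    path: str
    owner: str
    last_accessed: datetime
    md5: str
    has_pii: bool = False      # set by a content scan, if one was run
    must_retain: bool = False  # flagged as having long-term business value


def classify(rec, now, active_owners, seen_hashes, aged_years=3):
    """Assign one bucket to a record; the precedence order is an assumption."""
    is_duplicate = rec.md5 in seen_hashes
    seen_hashes.add(rec.md5)  # remember this hash for later records
    stale = now - rec.last_accessed >= timedelta(days=365 * aged_years)

    if rec.has_pii:
        return "Risk"
    if rec.must_retain:
        return "Archive"
    if is_duplicate:
        return "Redundant"
    if stale and rec.owner not in active_owners:
        return "Abandoned"
    if stale:
        return "Aged"
    if rec.path.lower().endswith(PERSONAL_EXTENSIONS):
        return "Personal"
    return "Active"
```

In this sketch `active_owners` would come from a directory lookup such as Active Directory, and `seen_hashes` is a set shared across the whole run so that the second and later copies of a file land in the Redundant bucket.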
Data profiling tools provide a way to copy, delete and archive data. Additionally, full path and filename text files typically can be downloaded, allowing the use of third-party tools and utilities to manage disposition. This would include options to encrypt and tier data to the cloud or less-expensive storage platforms.
The tools should also offer a validation process for disposition to ensure that content is purged reliably and that it has not changed since it was profiled. Validation checks the last modified date or optionally the signature of the document. As disposition of the data is performed, including defensible deletion, logs should be maintained that detail the date and disposition of the document, including the user that executed the disposition.
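A validation step of this kind might look like the sketch below. It re-checks the last modified time and, optionally, an MD5 signature recorded at profiling time, deletes only if both still match, and appends a log entry either way. The function names, the log format and the choice of an in-memory list for the log are assumptions for illustration; a real tool would write to a tamper-evident audit store.

```python
import hashlib
import os
from datetime import datetime, timezone


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def unchanged_since_profile(path, profiled_mtime, profiled_md5=None):
    """Return True if the file still matches what was recorded at profiling."""
    if os.stat(path).st_mtime != profiled_mtime:
        return False
    if profiled_md5 is not None and file_md5(path) != profiled_md5:
        return False
    return True


def delete_if_unchanged(path, profiled_mtime, profiled_md5, user, log):
    """Delete a file only if validation passes; log the outcome either way."""
    ok = unchanged_since_profile(path, profiled_mtime, profiled_md5)
    if ok:
        os.remove(path)
    log.append({
        "path": path,
        "action": "deleted" if ok else "skipped_changed",
        "user": user,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return ok
```

Checking the modified time first is cheap; the hash comparison costs a full read of the file, which is why the source describes it as optional.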
Through this process, mystery and legacy data that was once difficult to understand can now be deciphered. Accurate chargebacks can now occur among the business units. Data center migrations can be streamlined by purging content with no business value rather than moving it to the new platform.
Unstructured files and email that once presented a compliance risk and a storage nightmare can be reviewed and their disposition decided. Some will move to archive, others to less expensive storage, and a shocking amount of duplicates, personal files and aged data can be purged.
Index Engines is a leading provider of enterprise information management and archiving solutions.