There’s an often-repeated stat that 90% of all data that exists today has been created in the last two years.
The provenance of that figure is murky and disputed, and it dates back nearly 10 years, so even if it was true then, the percentage is likely even higher today. But what can’t be disputed is that the exponential growth of information continues. IDC predicts the total sum of the world’s data will be 175 zettabytes by 2025 – up from 33 zettabytes in 2018.
Within that overall data growth, the increasing amount of unstructured data creates security and compliance risks. Up to 90% of data that organisations own is now estimated to be unstructured and growing at 55-65% each year. This includes things like documents, spreadsheets, photos, videos, audio, web pages, text files, social media and slide presentations, which can contain sensitive or personally identifiable information (PII) that is difficult to track and manage.
For example, people might keep passwords for all of their applications in an unencrypted Excel file with no password protection and then store that in a folder on OneDrive because they think it’s secure. Or someone might take a photo or scan of their passport, which contains lots of PII, for a job or visa application and share it with HR, which consequently stores it on OneDrive or SharePoint. We’ve all done these things and it’s so easy to do without really thinking about it.
Data classification and compliance risk
The problem with this unstructured data is that it doesn’t live in a database or have a pre-defined data model or schema. Whereas structured data in a database is more easily classified and managed, it’s difficult to know what the content of a video or a spreadsheet is and whether it contains passwords or PII.
This creates a data governance risk – particularly in highly regulated industries such as healthcare, financial services and government that have a duty to comply with data protection legislation and regulations such as the US Health Insurance Portability and Accountability Act (HIPAA) and the Sarbanes-Oxley Act (SOX).
There is also a security risk. Many existing data classification tools can’t tell you if, for example, a Word file is infected by a macro virus. So not only do you need to be able to classify your unstructured data across your cloud environment and identify any files containing PII or sensitive data, but you also need to be able to scan those files for security threats.
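To make the macro point concrete, here is a minimal sketch of a first-pass check, not a virus scanner. It relies on the fact that modern Office files (.docx, .docm, .xlsm and so on) are ZIP containers, and that macro-enabled ones carry a part named vbaProject.bin. The function name `contains_vba_macros` is our own, and finding a macro only tells you a file deserves a closer look, not that it is malicious.

```python
import zipfile


def contains_vba_macros(source) -> bool:
    """Return True if an Office Open XML file embeds a VBA project.

    `source` may be a file path or a binary file-like object.
    Files that are not ZIP containers (for example legacy .doc files)
    return False, which here means "not checked", not "safe".
    """
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        # Macro-enabled documents store their code in a vbaProject.bin part.
        return any(
            name.lower().endswith("vbaproject.bin")
            for name in archive.namelist()
        )
```

A real scanning tool would go further and inspect what the macro actually does, and would also handle the older binary Office formats that this sketch skips.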
Cloud data governance challenges
Another factor that creates data governance challenges around unstructured data is the widespread adoption of cloud. Instead of all this data being stored on laptops, PCs, file servers and network attached storage (NAS), it is now being stored in cloud platforms such as Office 365 or Google Workspace as organisations move away from on-premises infrastructure. Many companies are even configuring laptops and systems so staff can only save data onto OneDrive.
But in the process of this cloud migration, organisations are not taking the time to sort their data out. It is just lift and shift. They are literally moving a pile of unstructured data from one place to another. This just moves the problem rather than addressing the core issues around lack of visibility.
Of course, there are data classification tools that have been on the market for many years. But they haven’t kept pace with the times and weren’t designed for the cloud, so there are gaps in functionality and capability. One tool might report on your data but not fix it. Another will fix it and organise it into a more structured fashion, but only works on one platform.
A lot of these older products also aren’t compatible with the latest file formats and can’t do optical character recognition (OCR). For example, if you want to check and classify something like a photo of a passport, you need OCR to automatically parse the image and capture the name, passport number, address and other PII as text rather than as an image.
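Once OCR has turned an image into text, classification becomes a text-matching problem. The sketch below assumes the OCR step has already happened and only shows what comes next: scanning the extracted text for patterns that suggest PII. The pattern set and the `classify_text` name are illustrative assumptions, deliberately simple, and not how any particular product works.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
# Real classifiers use far richer rules and validation.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "password_field": re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
    "passport_number": re.compile(
        r"(?i)\bpassport\s+(?:no|number)\.?\s*[:#]?\s*[A-Z0-9]{6,9}\b"
    ),
}


def classify_text(text: str) -> set[str]:
    """Return the label of every pattern that matches somewhere in the text."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}


# Example: text as it might come back from OCR on a scanned form.
ocr_text = "Passport No: X1234567\nEmail: jo@example.com"
labels = classify_text(ocr_text)
```

Here `labels` would contain `passport_number` and `email`, which is enough to flag the file for review even though the original was an image.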
You also need a modern tool that can redact any personal or sensitive information as it classifies. The tool needs to detect and flag that information for the administrator but redact it so it can’t be viewed.
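As a rough illustration of flagging and redacting in one pass, the sketch below masks each match and records only where it was found, so an administrator sees that sensitive data exists without seeing the value. It assumes the content is already available as text (for example after OCR). The email pattern, the `redact` function and the `[REDACTED ...]` placeholder are our own assumptions for the example.

```python
import re

# Illustrative pattern only (an assumption): a simple email matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text, pattern=EMAIL, label="EMAIL"):
    """Mask every match and return (redacted_text, findings).

    Each finding records the label and the position in the original text,
    never the sensitive value itself.
    """
    findings = []

    def _mask(match):
        findings.append(
            {"label": label, "start": match.start(), "end": match.end()}
        )
        return f"[REDACTED {label}]"

    return pattern.sub(_mask, text), findings


redacted, findings = redact("Send to jo@example.com today")
```

With this input, `redacted` reads "Send to [REDACTED EMAIL] today", and `findings` holds a single entry giving the label and character offsets of the match.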
This growth in unstructured data is only going to continue, and organisations must get to grips with classification and governance across their cloud environments so they can identify and protect sensitive information and avoid costly or damaging security and compliance breaches.
By Phil Maynard, VP, Barracuda Networks