Data classification projects are a dark cloud for data teams. These projects can appear unexpectedly, are difficult to conduct, and pull the data team's time away from other work. Therefore, most data teams try to avoid them. In this post, we explore data classification projects and examine why they are so difficult.
What is Data Classification?
Data classification is the process of organizing data into categories for easier storage, management, and security. Its benefits are that it makes data more readily available to authorized teams within the organization while also improving the security of that data.
Through the data classification process, data teams apply security measures, enable adequate storage and access controls, and make data more easily accessible to authorized users.
Even with all of these benefits, data teams still avoid data classification projects. This is because there are a number of difficulties surrounding data classification projects.
Difficult to Define
Data classification projects often emerge from some other necessary function or project, which makes the entire process difficult to define.
You may be working on a data security project, trying to meet a compliance requirement such as passing an audit, or even reacting to a data breach, when you realize that you must first classify the data before you can complete the other necessary components. In these cases, the data classification project is large and necessary, yet it isn't even defined as its own project within the larger security effort.
Data Discovery and Inventory
In order to take inventory of your data or complete a data discovery project (this could include creating a data inventory or a data dictionary; learn more about the difference here), you first need to know what sensitive data you have and where it is located. As part of the data governance process, you also need to classify the data in order to understand who owns it, what the authorization requirements for keeping it are, and how access to it should be enabled (for example, through data masking).
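To make those governance attributes concrete, here is a minimal sketch of what a single entry in such an inventory might record. The field names and example values are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class InventoryEntry:
    """One record in a data inventory: where a data asset lives,
    how it is classified, and who is accountable for it."""
    location: str               # e.g. "warehouse.sales.customers.email"
    classification: str         # e.g. "PII", "PHI", "non-sensitive"
    owner: str                  # team or person accountable for the data
    authorized_roles: list[str] = field(default_factory=list)
    masking_policy: str | None = None  # e.g. "show last 4 characters only"

# Hypothetical example entry.
entry = InventoryEntry(
    location="warehouse.sales.customers.email",
    classification="PII",
    owner="data-governance@example.com",
    authorized_roles=["analyst", "support"],
    masking_policy="mask local part",
)
```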
Discovering Sensitive Data
We have explored the methodology of data classification projects in the past. In summary, there are two basic ways in which data classification is performed, each with its own difficulties:
- Manually, using delegated questionnaires, which is often time-consuming and inaccurate.
- By scanning all the data in your databases, data warehouses, and data lakes. This is in many cases disruptive to the business, expensive (as a lot of data is scanned), and ineffective in companies where data changes often (see the sketch after this list).
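To illustrate the scanning approach, here is a minimal sketch assuming a generic Python DB-API connection and two illustrative regex rules. It samples rows rather than reading entire tables, which is one common way to bound cost; the rule set and sampling strategy are assumptions, not how any particular product works:

```python
import re

# Illustrative detection rules; production scanners use many more.
RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_table(conn, table: str, sample_rows: int = 100) -> dict[str, set[str]]:
    """Sample rows from a table over a DB-API connection and report
    which rules matched in each column. Sampling bounds the cost, at
    the price of possibly missing rare sensitive values. The table
    name is assumed to come from a trusted catalog."""
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table} LIMIT {sample_rows}")
    columns = [desc[0] for desc in cur.description]
    findings: dict[str, set[str]] = {col: set() for col in columns}
    for row in cur.fetchall():
        for col, value in zip(columns, row):
            for rule, rx in RULES.items():
                if isinstance(value, str) and rx.search(value):
                    findings[col].add(rule)
    cur.close()
    return {col: matched for col, matched in findings.items() if matched}
```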
The Classification Itself Is Hard
Classification is not simply identifying sensitive data; in most cases it also means classifying that data according to a taxonomy comprising sensitive PII, PHI, sensitive operational data, and so on. The process (or, more commonly, the software) that decides what type of data resides in each location also has the responsibility of classifying it, and this is difficult in terms of both false positives and false negatives.
For example, if you have a long number in your system, how do you determine whether it is trivial data, such as a session ID, or data that must be secured, such as a credit card number or SSN?
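For this particular case, one common heuristic is the Luhn checksum, which valid credit card numbers satisfy and most random identifiers do not. The sketch below is a minimal illustration of that idea, not a complete detector; a Luhn match alone still yields false positives:

```python
def luhn_checksum_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum.

    Valid credit card numbers pass this check; most random numeric
    IDs do not, which reduces (but does not eliminate) false positives.
    """
    digits = [int(d) for d in number if d.isdigit()]
    if not 13 <= len(digits) <= 19:  # typical payment card lengths
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_checksum_valid("4111111111111111"))  # True (a known test card number)
print(luhn_checksum_valid("1234567812345678"))  # False (likely a plain ID)
```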
Finding Your Sensitive Data When It Moves or Changes
There are two questions that need to be addressed within a data classification project:
- How do you maintain continuous compliance?
- How do you ensure data classification results are not stale?
In an organization where data changes often, data classification that is done in an ad-hoc manner loses accuracy quickly.
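One way to keep results from going stale, sketched below under assumed names (`CATALOG` stands in for a real metadata store and `classify_table` for a real scanner), is to re-classify only data that has changed since the last run:

```python
import datetime

# Toy in-memory catalog standing in for a real metadata store
# (table name -> last-modified timestamp). Hypothetical data.
CATALOG = {
    "sales.orders": datetime.datetime(2024, 1, 15),
    "hr.employees": datetime.datetime(2024, 3, 2),
}

def classify_table(table: str) -> str:
    """Placeholder classifier; a real one would scan the table's data."""
    return "PII" if "employees" in table else "non-sensitive"

def incremental_scan(last_run: datetime.datetime) -> dict[str, str]:
    """Re-classify only tables modified since the previous run, keeping
    results fresh without the cost of re-scanning everything."""
    return {
        table: classify_table(table)
        for table, modified in CATALOG.items()
        if modified > last_run
    }

# Only hr.employees changed since February, so only it is re-classified.
print(incremental_scan(datetime.datetime(2024, 2, 1)))
```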
Locating Sensitive Data Across Different Data Platforms
An additional challenge is that most organizations have a non-trivial data store architecture. Sensitive data is spread across many (sometimes unexpected) places in databases, data warehouses, and data lakes.
This can complicate the project even when the same technology is used in different places (such as Amazon Redshift clusters used by different teams, or separated for other reasons). It becomes even more complicated when the data is spread across different platforms (such as Amazon Redshift and Snowflake).
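One way a scanner can cope, sketched below, is to lean on metadata that both platforms expose in the same way: Redshift and Snowflake each provide a standard `information_schema.columns` view, so column discovery can share one code path. The connection setup shown in the comments is a hypothetical example:

```python
# A minimal sketch of discovering columns across platforms through the
# standard information_schema, which both Redshift and Snowflake expose.

COLUMNS_QUERY = """
    SELECT table_schema, table_name, column_name, data_type
    FROM information_schema.columns
    WHERE LOWER(table_schema) NOT IN ('information_schema', 'pg_catalog')
"""

def list_columns(conn) -> list[tuple]:
    """Return (schema, table, column, type) rows from any DB-API
    connection whose platform exposes information_schema.columns."""
    cur = conn.cursor()
    cur.execute(COLUMNS_QUERY)
    rows = cur.fetchall()
    cur.close()
    return rows

# Hypothetical usage; connection setup is the only platform-specific part:
#   import redshift_connector, snowflake.connector
#   rs = redshift_connector.connect(host="...", database="...", user="...", password="...")
#   sf = snowflake.connector.connect(account="...", user="...", password="...")
#   all_columns = list_columns(rs) + list_columns(sf)
```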
Locating Sensitive Data Within Semi-Structured Data
In many cases, some of the data does not have a structured schema and is stored either in semi-structured files (such as JSON) or in semi-structured columns within otherwise structured tables (such as Snowflake's VARIANT or Amazon Redshift's SUPER types). This complicates data classification, as data now has to be identified within locations that have no fixed schema and whose shape may differ from one data item to the next.
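A minimal sketch of what this involves: recursively walking nested JSON and flagging keys or values that match sensitive patterns. The key list and regex below are illustrative assumptions, not an exhaustive detector:

```python
import json
import re

# Illustrative rules; real classifiers use far richer detection logic.
SENSITIVE_KEYS = {"ssn", "email", "credit_card", "phone"}
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def find_sensitive(node, path="$"):
    """Recursively walk parsed JSON, yielding paths of fields that look
    sensitive either by key name or by value pattern."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key.lower() in SENSITIVE_KEYS:
                yield child  # flagged by key name; skip the value
                continue
            yield from find_sensitive(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_sensitive(item, f"{path}[{i}]")
    elif isinstance(node, str) and SSN_PATTERN.match(node):
        yield path  # flagged by value pattern

doc = json.loads('{"user": {"name": "A", "ssn": "123-45-6789"}, "ids": ["987-65-4321"]}')
print(list(find_sensitive(doc)))  # ['$.user.ssn', '$.ids[0]']
```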
Custom Sensitive Data
Determining, defining, and locating custom sensitive data is another obstacle within data classification projects. Some sensitive data can be specific to a company or even to a single business unit. Custom sensitive data (such as employee IDs or specific types of links) also needs to be discovered across all data stores. In these custom cases, it is significantly more difficult to locate and then classify the data.
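One common way to handle this, sketched below, is to let teams register their own detection patterns alongside the built-in ones. The employee-ID format here is a made-up assumption:

```python
import re

# Built-in classifiers shipped with a scanner (illustrative subset).
CLASSIFIERS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def register_classifier(name: str, pattern: str) -> None:
    """Let a team add a company-specific pattern, e.g. an internal
    employee-ID format, so custom data is scanned like built-in types."""
    CLASSIFIERS[name] = re.compile(pattern)

def classify_value(value: str) -> list[str]:
    """Return the names of all classifiers matching the value."""
    return [name for name, rx in CLASSIFIERS.items() if rx.match(value)]

# Hypothetical employee-ID format: "EMP-" followed by six digits.
register_classifier("employee_id", r"^EMP-\d{6}$")
print(classify_value("EMP-004217"))   # ['employee_id']
print(classify_value("123-45-6789"))  # ['ssn']
```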
Data Classification Is Easier With Satori
Those challenges are the reason we built Satori's data classification and sensitive data discovery. With Satori, data (including semi-structured data) is continuously discovered and classified across all your data stores, without any impact on the data stores or data consumers. Learn more about our data classification capabilities, including a short demo video, here, or book a meeting with one of our experts.