Data Classification Best Practices - Part 1

Note: This is part 1 of the guide. In part 2, we discuss the questions you should ask when performing data classification projects and approaches to performing them.

In the past, I worked as part of a team responsible for handling data and generating awesome value out of it (you’ll have to trust me on this). Life was peaceful and quiet until, one day, we received a request from our data governance and security team to provide them with a report of all the data the team owned, along with a breakdown of the sensitive data in our repositories. This project was not our definition of exciting, as we had to deal with diverse data across several platforms.


Eventually, we overcame this challenge by breaking the task down into small parts, assigning owners to each section, and persevering through the tedious task. Some of the work was done manually (yikes!) by having someone actually go through the data, and some was automated using a tool which we then validated. For that point in time, the result that we produced was pretty good, but it was far from perfect, and the task distracted us from other projects. Furthermore, the greater problem was that, after a few days, our report’s accuracy decreased, as data gradually changed over time.


So, although the project of manually classifying data was not a total flop, it felt wrong at the time, and it definitely does not look right in retrospect, now that there are much more advanced, streamlined processes for automatic data classification.


Importance of Data Classification

Back when we were doing the manual classification project, we did not doubt the importance of data classification. We fully understood the need for it, and the request made perfect sense. We understood so well how crucial it is to know what data you have that we were willing to work long and hard to execute the task.


As such, I think it is important to elaborate on the main reasons why you need to know where sensitive data is:

Prioritizing Placement of Security Controls

Yes, everything needs to be properly secured, but we also need to be rational about our resources. Classifying data helps avoid a “peanut butter approach” in which you spread your resources too thin. Data classification helps determine a starting point and suggests where you should allocate the most security resources. Based on risk analysis, the greatest need for security tends to be where sensitive data is located.

Monitoring and Enforcing Access Controls Specific to PII

Similar to the last point, in many cases, it is beneficial to have specific auditing and access controls when accessing sensitive data. For example, you may apply automatic data masking when sensitive data is being accessed. Classifying your data allows you to enforce these additional controls on specific data.
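
To make the masking idea concrete, here is a minimal sketch in Python. It assumes classification results are available as a simple mapping from column name to label. The CLASSIFICATION mapping, the "pii" label, and the mask_value and mask_row helpers are all hypothetical names invented for this illustration, not part of any specific product or platform.

```python
# Hypothetical classification result: column name -> sensitivity label.
CLASSIFICATION = {
    "email": "pii",
    "phone": "pii",
    "product_id": "public",
}


def mask_value(value):
    """Hide all but the last two characters of a value."""
    text = str(value)
    if len(text) <= 2:
        return "*" * len(text)
    return "*" * (len(text) - 2) + text[-2:]


def mask_row(row, classification, can_see_pii):
    """Return a copy of row with PII columns masked unless the caller may see PII."""
    if can_see_pii:
        return dict(row)
    # Columns missing from the classification pass through unmasked here;
    # a stricter policy could mask unclassified columns by default.
    return {
        column: mask_value(value) if classification.get(column) == "pii" else value
        for column, value in row.items()
    }


row = {"email": "ben@example.com", "phone": "555-6672", "product_id": 42}
masked = mask_row(row, CLASSIFICATION, can_see_pii=False)
print(masked)
```

The point of the sketch is that the masking logic itself is trivial once you know which columns are sensitive. The classification is what makes the control possible.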

Limiting Resource Access to Specific Individuals

When you know where sensitive data is, you gain an increased ability to limit access to those resources. For example, if you have classified data as sensitive, you will think twice about granting access to this data to other business units or entities outside of your company. You can even grant access to some of the data while keeping the sensitive data secure.

Data Classification for Compliance

The requirements for compliance vary based on the types of data stored, your industry, and other factors, but it may be that access to sensitive data is to be audited and retained for a specific period of time or that permissions to access the data need to be controlled. Regardless of the specifics, compliance requirements around access to sensitive data require knowing where the sensitive data is stored and how it is being accessed.

Data Classification for Data Protection & Privacy Acts/Regulations

Due to data protection and privacy acts and regulations, there may be limitations on how you use data based on its sensitivity. These limitations can include supporting “the right to be forgotten” for users’ data or complying with regional privacy requirements. Knowing where different data types are located helps you scope out such projects and ensure you comply with the regulations.

Data Classification for Contractual Reasons

As with compliance requirements, you are often obligated to treat certain data differently based on customer commitments. For example, a SaaS company may have an obligation towards businesses in a specific region not to move their sensitive data out of that region.

Why Is Data Classification Hard?

Going back to the data classification project I performed, as I wrote previously, we had a perfect understanding of the task’s importance, yet the project was very difficult and time-consuming. After discussing the issue with a lot of data engineers and data owners, I have summarized the common hardships surrounding data classification below:

Data Is a Moving Target

Data is often a moving target due to ETL or ELT processes in which data is moved to enrich it, anonymize it, or apply other transformations to it. These movements can occur within the same platform (such as from one Snowflake database to another), but they can also be across different public clouds or data platforms, which can get very complicated to track.

Data Itself Is Changing

Not only is data a frequent flier in terms of travel as it moves from one place to another, but it also changes. You may have a table that does not have any sensitive data in it, until someone changes something somewhere, and, all of a sudden, you are dealing with sensitive PII. For example, I was once dealing with a product table that was not supposed to have any sensitive data in it, but then an application added custom hidden products that contained the customer name as a custom field.
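
One way to catch this kind of change is to rescan sampled values periodically instead of trusting column names. The sketch below is a rough illustration under several assumptions: the PATTERNS dictionary and the scan_column and newly_sensitive helpers are hypothetical, the regular expressions are deliberately simplistic, and it looks for email addresses and phone numbers because those are easy to pattern-match. Detecting customer names, as in my product table story, would need a different technique.

```python
import re

# Deliberately simple, illustrative patterns; real detectors must be far more robust.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
}


def scan_column(values):
    """Count how many sampled values match each pattern."""
    hits = {name: 0 for name in PATTERNS}
    for value in values:
        text = str(value)
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                hits[name] += 1
    return hits


def newly_sensitive(previous_hits, current_hits):
    """Return pattern names that matched in this scan but not in the previous one."""
    return sorted(
        name
        for name, count in current_hits.items()
        if count > 0 and previous_hits.get(name, 0) == 0
    )


# Two scans of the same hypothetical product-description column.
last_week = scan_column(["Blue widget", "Red widget"])
today = scan_column(["Blue widget", "Custom order for ben@example.com"])
changed = newly_sensitive(last_week, today)
print(changed)
```

Comparing consecutive scans like this is what turns a one-time report into something that stays accurate as the data changes.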

Data Is Spread Across Different Platforms

If having the data move around and change continuously wasn’t challenging enough, one of the hardships in the project I was running, as well as in other projects, was that the data was not all stored on the same platform. Some of it may be in Parquet files stored on S3 and retrieved using Athena, some in AWS Redshift, and some in Postgres.

Classifying Semi-Structured Data Is a Challenge

Semi-structured data (such as data stored in JSON files or in other semi-structured data objects in data warehouses or data lakes) can add complexity to data classification. It makes it harder to classify and discover sensitive data, maintain a report on it, and monitor it. For example, a column named event_data in a Snowflake table may contain different types of semi-structured objects depending on the type of event, and, in some cases, one of the items contains sensitive data. Iterating through the data to discover sensitive data becomes much more difficult with semi-structured data. For example, it can look like the example below, with sensitive data in the customer_details.first_name and customer_details.last_name fields:

{
  event_type: "complaint",
  ts: "<timestamp>",
  tech_details: [
    {item_id: "item1", ...},
    {item_id: "item2", ...}
  ],
  customer_details: {
    first_name: "Ben",
    last_name: "Herzberg"
  }
}



However, the sensitive data can just as well be in a totally different location (this time in the matching_results.phone and matching_results.blood_type fields):

{
  event_type: "checkup",
  ts: "<timestamp>",
  tech_details: [
    {item_id: "item1", ...},
    {item_id: "item2", ...}
  ],
  matching_results: {
    phone: "555-6672",
    blood_type: "AB"
  }
}



This is a relatively simple example, but often, semi-structured data is far more hierarchical, including lists, and can be much larger in size. In many cases it is collected without proper knowledge of what it may contain, which adds to the complexity of performing data classification on semi-structured data.
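
To illustrate what iterating through such objects involves, here is a minimal Python sketch that walks a nested object and reports the dotted paths of fields that look sensitive. It assumes detection by key name only. The SENSITIVE_KEYS set and the find_sensitive_paths function are hypothetical names made up for this example, and the two events are simplified versions of the ones above with the elided item details left out.

```python
# Hypothetical set of key names treated as sensitive for this illustration.
SENSITIVE_KEYS = {"first_name", "last_name", "phone", "blood_type"}


def find_sensitive_paths(node, prefix="", found=None):
    """Recursively collect dotted paths whose final key is in SENSITIVE_KEYS."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if key in SENSITIVE_KEYS:
                found.add(path)
            find_sensitive_paths(value, path, found)
    elif isinstance(node, list):
        # All list items share one path, marked with [], so repeats collapse.
        for item in node:
            find_sensitive_paths(item, f"{prefix}[]", found)
    return found


complaint = {
    "event_type": "complaint",
    "ts": "<timestamp>",
    "tech_details": [{"item_id": "item1"}, {"item_id": "item2"}],
    "customer_details": {"first_name": "Ben", "last_name": "Herzberg"},
}
checkup = {
    "event_type": "checkup",
    "ts": "<timestamp>",
    "tech_details": [{"item_id": "item1"}, {"item_id": "item2"}],
    "matching_results": {"phone": "555-6672", "blood_type": "AB"},
}

for event in (complaint, checkup):
    print(event["event_type"], sorted(find_sensitive_paths(event)))
```

Matching on key names is the weakest part of this sketch. Keys are often generic (such as a custom field), so a real scan would also need to inspect the values, and it would need to run over every object shape that ends up in the column.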


Performing Data Classification Projects

In part 2, I discuss performing data classification projects. This will be in a separate blog post, to be posted in a couple of weeks. If you’d like to ask any questions, feel free to reach out to me. If you’d like to learn more about how Satori solves data classification in an agile way that does not hinder performance, schedule a demo.