Data Classification Best Practices – Part 2

By Ben Herzberg

Chief Scientist

July 15, 2021

In the first part of this article, we discussed the many reasons why you should perform data classification, as well as some of the difficulties you are likely to encounter in such a project. This part covers the questions to ask before you start, including how often classification needs to run and how to prioritize it against other projects, and then the main ways to carry it out.

How To Perform Data Classification

The following are questions you should ask before starting a data classification project:

What Is the Motivation Behind This Data Classification Project?

In many cases, the driver for data classification is a demand from another team (such as GRC, legal, privacy, or security). In these situations, it is important to understand the reason for the request as well as the end goal. Sometimes, the requesting team already knows that it needs a specific quality or granularity (e.g., all data types in a certain data store, or a mapping of columns to sensitive data types). Discussing these requirements with that team also helps you understand how urgent the project is and prioritize it against other projects.

What Level of Granularity Is Required?

The required granularity has two dimensions: how precisely you describe where the classified data is located, and how precisely you describe the types of data found. Let’s look at each:

Location Level Granularity of Data Classified

The requirement may be to locate sensitive data at the level of a specific data store, database, schema, table, or column. It can be even more granular, requiring the location of different data types within semi-structured data stored in a single column.

Data Types Granularity of Data Classified

The requirement can be boolean, which means only distinguishing the locations that hold sensitive data from the locations that do not. However, in most cases, there is a requirement to at least define the categories of the classified data, such as PII or PHI. In many cases, the requirement is even more specific: classifying the data as particular types such as phone numbers, names, blood types, patient IDs, or social security numbers.
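The two kinds of granularity can be combined in a single inventory. The following is a minimal Python sketch, not a definitive implementation: the Finding record, the dotted location format, and the TYPE_TO_CATEGORY mapping are all assumptions made for illustration. It records findings at the most specific level (column and data type) and rolls them up to coarser location levels, to categories, or to a boolean answer.

```python
from typing import NamedTuple

# Hypothetical mapping from specific data types to broader categories.
TYPE_TO_CATEGORY = {
    "phone_number": "PII",
    "name": "PII",
    "social_security_number": "PII",
    "blood_type": "PHI",
    "patient_id": "PHI",
}


class Finding(NamedTuple):
    location: str   # assumed format: "database.schema.table.column"
    data_type: str  # most specific data type detected at that location


def roll_up(findings, depth=4):
    """Group findings by the first `depth` parts of the location and
    report the categories (not the specific types) found there."""
    result = {}
    for f in findings:
        key = ".".join(f.location.split(".")[:depth])
        category = TYPE_TO_CATEGORY.get(f.data_type, "UNKNOWN")
        result.setdefault(key, set()).add(category)
    return result


def has_sensitive_data(findings, prefix):
    """Boolean granularity: is anything classified at or under `prefix`?"""
    return any(
        f.location == prefix or f.location.startswith(prefix + ".")
        for f in findings
    )
```

Calling roll_up(findings, depth=3) answers the question at table level, while depth=2 answers it at schema level. Storing the most specific result and deriving the coarser views keeps one inventory usable for teams with different requirements.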

How Often Does the Data Change?

Some data stores are relatively static, with constant additions of the same types of data. Others change continuously, often through contributions from many different teams. These changes include new data being ingested and existing data being transformed, which can lead to ongoing shifts in the data types being stored, processed, and accessed. In these situations, it is important to understand that the results of a one-off data classification project can become stale very quickly.

Where Does the Data Come From?

In many cases, data is not produced and stored in place; it is brought in from another location through an ETL/ELT process. Sometimes the classification of the source data is already known, and you can take this knowledge into account when planning a data classification project. If you can get the inventory or catalog information about the source data, you can prioritize “following the sensitive data.” However, keep in mind that sensitive data is often added in unexpected places, or without any conscious decision being made.
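As a rough illustration of “following the sensitive data,” the sketch below carries known source classifications over to the columns that are loaded from them. It is written under stated assumptions: the source_tags and lineage dictionaries are hypothetical inputs that you would build from your own catalog and ETL/ELT definitions, and only a single hop of lineage is followed.

```python
def propagate(source_tags, lineage):
    """Return the data types each target column inherits from its sources.

    source_tags: {source_column: set of data types already known}
    lineage:     {target_column: [source_columns it is derived from]}
    """
    inherited = {}
    for target, sources in lineage.items():
        tags = set()
        for source in sources:
            tags |= source_tags.get(source, set())
        if tags:
            # Targets with no known sensitive source are left out; they
            # still need to be scanned, since sensitive data can arrive
            # in unexpected places.
            inherited[target] = tags
    return inherited
```

The output is a list of places to look first, not a finished classification. Transformations may mask or drop sensitive values, and columns without a tagged source may still contain them.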

How Diverse Is the Data?

It is one thing to handle data that is largely consistent and another to handle data that is not. The inconsistency can be in the data platforms (e.g., some of the data is stored in S3 buckets and queried with AWS Redshift Spectrum, some in MS-SQL, and some in Snowflake). It can also mean that the data structures themselves are very different from one another, often due to semi-structured data. The more diverse the data is, the more difficult a data classification project becomes.

The Data Classification Project

Once you have answered the questions above, you have good background knowledge about the data classification project and can make an informed decision about the best path to completion. There are three main paths you can take at this point:

Manual Data Classification

A manual data classification project is performed without any specific tools by accessing the data and preparing an inventory of the types of data and their locations, depending on the level of granularity required, as discussed above. This path is taken mainly when the data stack is too complicated or outdated to run automated classifications or when running automated data classification is not an option for various other reasons. If the data is changing, or if it is important for the data classification to remain up to date, a manual data classification is not a good option. Nevertheless, even though it is often not a very efficient strategy, manual classification is still quite popular and is often completed by distributing the work across the data owners.

Automated Data Classification Tools

The more streamlined alternative to manual classification is running an automated data classification. Automated classification is implemented with data classification tools (or sometimes homegrown scripts) that access the stored data (either by reading the files or by sending queries), analyze the data returned, and suggest a classification for it. This process should be planned carefully so that scanning the data does not create operational problems such as data scan costs or performance impact. Data classification tools use algorithms to identify different data types, and, depending on how they operate and on your answers to the questions in the section above, their results may require manual validation to weed out false positives. An automated data classification is accurate as of the time the data was scanned, but any changes made to the data after the scan are not reflected in the results. It is therefore important to understand the motivation for the project and how often the data changes.
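To make the “analyze the data returned and suggest a classification” step concrete, here is a minimal sketch of the kind of homegrown script mentioned above. It is an illustration only: the two regular expressions and the 0.8 threshold are assumptions, and real tools rely on far richer detection logic. The function takes a sample of values from one column and suggests a type only when most of the non-empty values match.

```python
import re

# Illustrative patterns only; both are deliberately simplistic.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(values, threshold=0.8):
    """Suggest a data type for a sample of column values, or None.

    A type is suggested only if at least `threshold` of the non-empty
    values match its pattern, which limits false positives caused by
    a few stray matches.
    """
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return None
    best_type, best_ratio = None, 0.0
    for name, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.match(v))
        ratio = matches / len(non_empty)
        if ratio > best_ratio:
            best_type, best_ratio = name, ratio
    return best_type if best_ratio >= threshold else None
```

Working on a bounded sample rather than the full column is one way to keep scan cost and performance impact under control. The result is still only a suggestion, which is why manual validation may be needed.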

Continuous Data Classification

A continuous data classification process scans data on the fly, as it is being accessed. This is the most fitting method for organizations whose data changes rapidly, and for any situation where you want to keep your data classification information up to date. Because the data is classified as part of being accessed, no separate scan of the data store is needed. At Satori, we chose this method because it is the most suitable one for DataSecOps: it is continuous (not ad hoc), and it ensures that even if sensitive data “found its way” to a new location, it will be discovered. You can always manually override the data classifications performed by Satori.
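The general idea of classifying data as it is accessed can be sketched as follows. This is a simplified illustration of the concept and not a description of how Satori implements it: the AccessObserver class, its single SSN pattern, and the table.column location format are all assumptions. The observer sits between the data store and the consumer, inspects rows that are already being returned, and passes them through unchanged.

```python
import re
from collections import defaultdict

# Illustrative pattern only.
SSN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


class AccessObserver:
    """Accumulates classification findings from result sets in transit."""

    def __init__(self):
        # "table.column" -> set of data types detected so far
        self.seen = defaultdict(set)

    def observe(self, table, columns, rows):
        """Yield rows unchanged while inspecting their values."""
        for row in rows:
            for column, value in zip(columns, row):
                if isinstance(value, str) and SSN.match(value):
                    self.seen[f"{table}.{column}"].add("us_ssn")
            yield row
```

Because observe is a generator, findings accumulate only as the consumer actually reads the rows, so the inventory reflects data in use. A location that nobody queries never appears in it.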

Detecting Data Which Is Not Accessed

Continuous data classification focuses on data in use, because data is scanned in real time as it is accessed. However, in some cases, you may want to initiate partial or full data scans in addition to the continuous scans. The way to do this is straightforward: run a SQL query that reads from the locations you want to scan, so that the data passes through the same classification as any other access. For more information on this method, feel free to contact us.
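A scan of this kind can be as small as one bounded SELECT per table. The sketch below uses Python's built-in sqlite3 module purely so that it runs anywhere; the sample_table helper and the row limit are assumptions, and with a real data platform you would send the equivalent query through whatever connection your classification sits on.

```python
import sqlite3


def sample_table(conn, table, limit=100):
    """Read up to `limit` rows from `table` so its data gets accessed."""
    # Identifiers cannot be bound as query parameters, so quote the
    # table name and escape any embedded double quotes.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()
```

Looping this over the tables that have not been accessed recently gives a partial scan. Raising or removing the limit moves it toward a full scan, at a correspondingly higher cost.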

Conclusion

If you have any significant amount of data, you will inevitably need to classify it sooner or later, for the many reasons we described in the first part of this article. We hope that, by reading this article, you have gained insight into the importance of data classification, why it is not an easy problem to solve, what questions you should ask yourself before performing it, and the main classification methods. If you would like more information about Satori’s continuous data classification and how it helps streamline DataSecOps, reach out to us for a quick demo.
