How to Automate a Data Inventory for AWS Redshift

Many modern organizations use AWS Redshift clusters to fulfill some or all of their data warehousing needs. With Redshift Spectrum, organizations can also use Redshift as a query engine for data lake needs, pulling and processing data from S3 storage buckets. Using Redshift can help organizations utilize more data and increase their data-driven value.


Storing and processing data is one thing, but, with ever-growing and ever-changing data, it is becoming increasingly difficult to gain visibility into what you (as an organization) have and to answer questions such as:

  • What personally identifiable information (PII) do we have in our data warehouses and data lake?

  • What other types of sensitive data (healthcare data, financial data, etc.) do we possess?

  • What new data types have been added in a certain time period?

  • What sensitive data has been recently added to our data repositories?

In a physical warehouse, where you store a large number of items, it is crucial to maintain an inventory. This inventory provides vital information about what you have in the warehouse and where it is located. The same is true for a data warehouse.


This is especially true because growing operations and a large number of data consumers, both inside and outside the organization, make data operations much more agile and intensive. More people have access to data, and more people are adding data of various types, so the inventory should be as up-to-date and adaptable as possible.


In this article, we will discuss:

  • What is a data inventory, and how does it differ from a data catalog?

  • Why is it important to inventory your data?

  • Why is it so difficult to maintain a data inventory for Redshift?

  • The options for creating a Redshift data inventory: manually, with a database scanner, or with Satori.

What Is a Data Inventory?

Just as an inventory for a physical warehouse contains information about the items in the warehouse, where to find them, and their nature, a data inventory is a repository of metadata about the datasets that an organization collects and maintains. It includes the location of each dataset and the type of data each contains.

When we use the term ‘metadata’ throughout this blog post, we refer to it in its broader meaning: for example, the fact that a certain column contains social security numbers, and not strictly that column’s numeric data type.

The metadata in the inventory is collected for different reasons: so that consumers know where to find what they are looking for, so that the organization knows where specific types of data are kept (this is often a compliance requirement), and to define relevant access policies for each dataset.
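To make this concrete, here is a minimal sketch (in Python) of what a single column-level inventory record might hold; the field names are illustrative assumptions, not a standard:

from dataclasses import dataclass, field
from typing import List

@dataclass
class InventoryRecord:
    # One column-level entry in a data inventory (illustrative, not a standard)
    database: str      # e.g. "dev"
    schema: str        # e.g. "public"
    table: str         # e.g. "customer"
    column: str        # e.g. "c_phone"
    data_type: str     # the technical type, e.g. "varchar(15)"
    classifications: List[str] = field(default_factory=list)  # broader metadata, e.g. ["PII"]
    owner: str = ""    # who is responsible for the dataset

# Example: recording that a column holds PII (phone numbers)
record = InventoryRecord(
    database="dev", schema="public", table="customer", column="c_phone",
    data_type="varchar(15)", classifications=["PII", "phone number"],
    owner="data-eng",
)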


How Does It Differ from a Data Catalog?

The terms ‘data catalog’ and ‘data inventory’ are often used interchangeably in organizations, and sometimes a data inventory of one organization is used in the same way as a data catalog in another organization. However, ideally, the inventory is a more technical metadata collection that refers to specific datasets and is granular to the level of columns and data types stored.


In contrast, the catalog is less technical and mainly serves as a directory for the data, providing a high-level idea of where to locate specific data and who owns it within the organization. The two occasionally overlap: different sections of the data catalog may refer to specific locations within an inventory.


Why Is It Important to Inventory Your Data?

Maintaining up-to-date knowledge of what data you have and where it is located is fundamental. If you do not know what you have, it becomes much harder to derive value from your data, and it takes consumers longer to understand what data they need, what data the organization already has, and where it is stored. Lacking this clear understanding can delay projects and incur costs; for example, it can lead to multiple copies of the same data being created and processed by different teams.


In addition, having a data inventory is important for complying with data protection and privacy regulations. Such regulations, including GDPR, require companies to know where their sensitive information, especially PII, is stored.


Why Is It So Difficult to Maintain a Data Inventory for Redshift?

Keeping a data inventory up to date is not a walk in the park. Typical uses of Redshift are often complicated: in many cases there are multiple users (often from different teams) storing data within the warehouse, and the data changes rapidly as it is copied from one location to another with varying levels of sensitive information.


This process is even more elaborate when Redshift serves as the query engine for a data lake (via Redshift Spectrum), as this increases the amount of data involved and the data’s complexity (there are different types of files, often with semi-structured data).


Creating a Redshift Data Inventory

Regardless of the reason behind it, creating a data inventory for your Redshift clusters can be done in different ways, each with its own advantages and disadvantages. Let’s consider the options:

  1. Creating a Manual Redshift Data Inventory

  2. Creating a Redshift Database Scanner-Based Data Inventory

  3. Using Satori for an Autonomous Data Inventory

Creating a Manual Redshift Data Inventory

Often, a very straightforward way to build an ad-hoc data inventory is by querying the database’s metadata. Thankfully, because Redshift is based on the PostgreSQL engine, the column metadata is kept in a single centralized location, namely the information_schema.columns view, so we do not need to iterate over several tables. Rather, we get all of our columns from there by, for example, running the following query:

SELECT column_name, table_name, table_schema, table_catalog
FROM information_schema.columns
WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
ORDER BY table_catalog, table_schema, table_name, ordinal_position;


The query above retrieves each column’s name and location (table, schema, and database), with rows ordered by each column’s ordinal position within its table.


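For illustration, against a TPC-H schema loaded into the public schema of the default dev database (an assumption for this example), the first few rows would look like this:

column_name | table_name | table_schema | table_catalog
c_custkey   | customer   | public       | dev
c_name      | customer   | public       | dev
c_address   | customer   | public       | dev
c_nationkey | customer   | public       | dev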


You can use this data as a basis for adding more metadata. Depending on the use case, this method may require writing a script or an application to pull the data from different clusters and make sense of it all by adding information about owners, data types, and descriptions. In some cases, this will involve copying the data into a spreadsheet. In other cases, there is no need to drill down to the column level, so only tables are pulled from the metadata.
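A minimal sketch of such a script, assuming Amazon’s redshift_connector driver (pip install redshift-connector) and hypothetical connection details, might look like this:

# Pull column metadata from several Redshift clusters into a single CSV.
# The cluster list and credentials below are placeholders.
import csv

import redshift_connector

CLUSTERS = [
    {"host": "analytics.example.us-east-1.redshift.amazonaws.com",
     "database": "dev", "user": "inventory_reader", "password": "..."},
    # ...one entry per cluster
]

QUERY = """
SELECT column_name, table_name, table_schema, table_catalog
FROM information_schema.columns
WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
ORDER BY table_catalog, table_schema, table_name, ordinal_position;
"""

with open("redshift_inventory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # owner and description are left blank for manual enrichment
    writer.writerow(["host", "column_name", "table_name", "table_schema",
                     "table_catalog", "owner", "description"])
    for cluster in CLUSTERS:
        conn = redshift_connector.connect(port=5439, **cluster)
        cur = conn.cursor()
        cur.execute(QUERY)
        for row in cur.fetchall():
            writer.writerow([cluster["host"], *row, "", ""])
        conn.close()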


Keep in mind that the list is static and therefore only correct at the moment of its creation. If someone creates or modifies a schema at a later point in time, the metadata becomes outdated. If you want a more agile solution, you probably need to write a script to pull updates from your clusters or choose a different option.
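As a sketch of what such an update script could check, the snippet below diffs two metadata snapshots (CSV files produced by the script above; the file names are hypothetical) and reports added or removed columns:

# Compare two metadata snapshots and report added or removed columns.
import csv

def load_snapshot(path):
    # Each snapshot row identifies one column by catalog, schema, table, and name.
    with open(path, newline="") as f:
        return {(row["table_catalog"], row["table_schema"],
                 row["table_name"], row["column_name"])
                for row in csv.DictReader(f)}

old = load_snapshot("inventory_monday.csv")
new = load_snapshot("inventory_tuesday.csv")

for col in sorted(new - old):
    print("added:  ", ".".join(col))
for col in sorted(old - new):
    print("removed:", ".".join(col))

Run on a schedule, a report like this flags new columns that still need to be classified and added to the inventory.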


Creating a Redshift Database Scanner-Based Data Inventory

There are multiple open-source and commercially available offerings for this purpose, including Alation, Collibra, Informatica, and others. These products can be used to build, enhance, and structure the inventories of your databases. Alternatively, you can utilize AWS Glue to crawl Redshift and retrieve the metadata. Either way, you can schedule these tools to periodically scan your clusters and refresh the metadata.
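As a rough sketch of the AWS Glue route, the snippet below (using boto3) creates a crawler that catalogs a Redshift schema over JDBC and runs nightly; the connection name, IAM role, and path are assumptions, and the Glue connection to the cluster must be created beforehand:

import boto3

glue = boto3.client("glue")

# Define a crawler that scans the dev database's public schema through a
# pre-created Glue JDBC connection and writes the metadata to the Data Catalog.
glue.create_crawler(
    Name="redshift-inventory-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="redshift_inventory",  # Data Catalog database to populate
    Targets={"JdbcTargets": [{
        "ConnectionName": "redshift-connection",  # hypothetical pre-created connection
        "Path": "dev/public/%",  # database/schema/tables to crawl
    }]},
    Schedule="cron(0 2 * * ? *)",  # nightly at 02:00 UTC
)
glue.start_crawler(Name="redshift-inventory-crawler")

Scheduling the crawler keeps the catalog in step with schema changes without manual re-scans.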


Scanning the data with these solutions also contributes to data stewardship and to using the metadata as a system of record for data access policies. Keep in mind that, if the data changes frequently, the integrations become more complex to manage, and manual data access request audits can be challenging to supervise at scale.


This complexity, especially because data is ever-changing, led us to design our autonomous data inventory in a way that does not require additional work as the data evolves.


Using Satori for an Autonomous Data Inventory

After speaking with dozens of data engineers, architects, stewards, and governance professionals about the benefits of an up-to-date data inventory that includes the locations of sensitive data in warehouses and lakes, we found that organizations need such inventories but often lack them (especially good ones that include data classification).


We took this feedback into account when we built Satori’s data inventory. We wanted it to work in an autonomous manner that would not burden data operations teams or become an ad-hoc project. Our goal is to have the data inventory continuously update without requiring any configuration or integration time.


It works by continuously monitoring data access, identifying the location of each dataset and classifying the data in each column as it is being accessed. The result is a self-populating data inventory, which also classifies the data for you and allows you to add your own customized classifications on top:

[Screenshot: Satori’s data inventory]


If our approach sounds intriguing, you can read more about how we implemented Satori’s Autonomous Data Inventory here.


Conclusion

A data inventory solves a lot of problems. It makes your data more accessible and helps you maintain governance, meet compliance requirements, and reduce security risks. You need to decide how you want to solve this problem: with a homemade manual solution, an integrated scanning solution, or Satori’s autonomous data inventory.


If you would like to set up a demo of Satori, click here and select a 30-minute time slot.