Data Inventory for Snowflake: Manual vs Automated
The Snowflake Data Cloud helps organizations quickly transition from having a lot of data to making value from a lot of data. We’ve detailed the reasons why in previous blog posts, and we’ve spoken with many data engineers who have said that their organizations are increasingly adopting Snowflake, as it allows more teams to store and process data in a simple way.
This is all very useful for extracting more value from the data, but it also compromises an organization’s control over the metadata, and it’s becoming significantly harder to answer questions such as:
Who is the owner of each Database, Schema and Table?
Where, within our Petabytes of data, across hundreds or thousands of tables, do we have specific types of data? For example, where do we store certain types of PII, PHI or financial information?
Who has access to what type of data?
Does a certain group of people have unwanted access to sensitive data?
What new data was introduced to our Snowflake Data Cloud over the last week, month or other period of time? Of this data, what was PII, PHI, or other types of sensitive data?
There are clearly many metadata questions that need to be answered. The accuracy, simplicity and freshness of metadata can have a positive impact on the data’s usability, as well as on meeting compliance requirements (and the amount of resources required to do so), and mitigating security threats, as improving your metadata’s visibility allows you to proactively identify and address risks.
To address these needs and keep a complete and accurate record of the data stored, you need to maintain an up-to-date repository with the inventory of your data. This is aptly called a data inventory.
What is a Data Inventory?
A data inventory is a central metadata collection of all the datasets an organization collects and maintains. It includes the location of each dataset and the type of data each one contains. Data analysts use the data inventory to identify what data is available and how to access it, and data stewards define relevant data access policies for each data set.
Data Inventory vs. Data Catalog
The terms data inventory and data catalog are sometimes used interchangeably, but ideally a data inventory is a more technical collection of metadata that applies specifically to the data store and is granular enough to point to specific columns where data of certain types are contained.
A data catalog is less technical, primarily serving as a directory for the data in the organization, relaying the owners, data stewards and data experts who can answer questions about and provide access to the data.
In other words, a certain data point in the data inventory can be referenced in several sections of the data catalog.
Going a step further, the distinction between data inventory and data catalog is different, or even non-existent, within many organizations (some of them maintain one or the other, which has different levels of granularity).
The main reasons for the differences in implementations and scopes of data inventories and data catalogs include:
Differences in data that is stored by different organizations, and its sensitivity
Differences in data architecture
Difficulties keeping track of the metadata, especially when Big Data is involved, which is a large volume of data that changes frequently
Why is it Important to Perform a Data Inventory?
There are two main reasons for performing and maintaining a data inventory. First, it provides data consumers, such as analysts and data scientists, with a starting point for data access and discovery. A data inventory is the foundation for broad and streamlined data access, which in turn fosters an effective approach to data usage and operations.
Second, a data inventory is foundational for complying with privacy regulations. Most regulations, including GDPR, require companies to know where their sensitive information is stored, and although not explicitly mentioned, they also require a de-facto data inventory.
How to Create a Data Inventory for Snowflake
If you’re running a Snowflake Data Cloud and managing sensitive customer information to which a variety of data analysts and other data consumers have access, you may need a data inventory.
Let’s consider your options when building one:
Manual Data Inventory
Database Scanner Based Inventory
Autonomous Data Inventory
Manual Data Inventory
The most straightforward way to immediately build a data inventory is to query your data warehouse metadata tables. In Snowflake, you can get this metadata from your information_schema for each table, or centralized (per account) from your snowflake.account_usage views. This is a sample query of querying column data, sorted by table, schema and database:
SELECT column_name, data_type, comment, table_name, table_schema, table_catalog FROM snowflake.account_usage.columns ORDER BY table_catalog, table_schema, table_name, column_name
You can enrich this data by pulling it to an external application or database, or by using the comments parameter for the different columns (which is often not a good idea, as it may contain other comments). We’ve seen some organizations export the data this way (or sometimes just in the scope of tables, without the columns) to a spreadsheet and create an ad-hoc data inventory, often for a compliance audit, which requires preparing such an inventory.
This list is static, and it’s only true for the moment of its generation. This means that if someone in your company creates a new table or modifies an existing one, it won’t be part of your data inventory until the next time you run and re-build this script. If you need a more agile solution, you will likely have to implement a periodic update mechanism or utilize a third party solution for metadata management and data cataloging.
Database Scanner Based Inventory
Multiple open source and commercial offerings are available for this purpose. These products can be used to automatically build, enhance and structure an inventory of your enterprise data sets. Vendors such as Alation, Collibra, Informatica and others build custom integrations and connectors with leading data platforms, such as Snowflake, to periodically scan and maintain a complete data inventory, putting lifecycle processes and monitoring functions around the manual querying and analysis of the platform’s metadata tables.
Scanning based data inventory solutions are also used for data stewardship purposes, acting as a system of record for data access policies often used for the manual audit of data access requests. If the data changes frequently, integrations become complex to manage and manual data access request audits are quite difficult to manage at scale. There is another option available, one that is not complex to deploy and is integrated into your data access policy enforcement layer: an autonomous data inventory.
Autonomous Data Inventory
We spoke with dozens of data engineers, data architects, data stewards and data governance professionals about the process of building and maintaining a data inventory. Two problems surfaced above all others: the burden of staying up-to-date with a moving target and constantly identifying the data types in each column of each table.
That is why when we built Satori’s data inventory, we wanted this to work in an autonomous manner that would not be a burden for data operations teams or become an ad-hoc project. We wanted the data inventory to continuously update and require no configuration or integration time.
It works by continuously monitoring the access to the data, identifying the location of each data set and classifying the data in each column as it's being accessed. You can read more about how we implemented Satori’s Autonomous Data Inventory here.
At the end of the day, your data inventory requirements depend on your data architecture and what you plan to do with the metadata you gather. We recommend either integrating with a data catalog platform or using our autonomous data inventory, as maintaining a manual inventory of your data is usually not cost-effective.
If you would like to set up a demo of Satori, click here and select a 30 mins time slot.