Data dictionaries, inventories, and catalogs are terms often used interchangeably. While they are all critical to effective data management systems, they are not the same. What are the differences between these structures and their functions?
For each structure, we include guidelines for building it along with examples of its use. We also unpack when each structure is necessary and the critical factors involved in effective data governance.
What Is a Data Inventory?
A data inventory is a centralized metadata collection, indicating all of the datasets an organization collects and maintains. This document (or collection of documents) pinpoints each dataset's location and the type of data it contains. On a practical level, this inventory allows data analysts to determine what data is available to them and how it is accessed. Data stewards maintain these data inventories and define the relevant data access policies for each dataset.
Conducting a data inventory can be a mammoth task, especially when an organization undertakes this task for the first time. Here, it is important to distinguish between a data inventory and a data catalog. While these terms are often used interchangeably, they are not the same and perform different functions.
A data inventory is a unique set of data detailing the location and type of each data point in a company's collection. A data catalog allows users to locate those datasets by referencing them in various categories. While each entry in a data inventory is unique, a data catalog can refer to the same data point in various entries. For this reason, a data inventory is far more granular, detailed, and technical than a data catalog.
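To make the distinction concrete, here is a minimal sketch in Python. The dataset names, field names, and helper functions are hypothetical and chosen only for illustration; the point is that the inventory holds exactly one entry per dataset, while the catalog may reference the same dataset under several categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryEntry:
    """One dataset in the inventory (field names are illustrative)."""
    location: str   # where the dataset lives, e.g. "warehouse.sales.orders"
    data_type: str  # kind of data it holds, e.g. "transactions"


def build_inventory(entries):
    """Index entries by location and reject duplicates: each dataset appears once."""
    inventory = {}
    for entry in entries:
        if entry.location in inventory:
            raise ValueError(f"duplicate inventory entry: {entry.location}")
        inventory[entry.location] = entry
    return inventory


inventory = build_inventory([
    InventoryEntry("warehouse.sales.orders", "transactions"),
    InventoryEntry("warehouse.crm.customers", "customer profiles"),
])

# A catalog groups datasets by category; the same dataset may appear in several.
catalog = {
    "customer spending": ["warehouse.sales.orders", "warehouse.crm.customers"],
    "lead generation": ["warehouse.crm.customers"],
}


def categories_for(location, catalog):
    """Return every catalog category that references the given dataset."""
    return sorted(cat for cat, locs in catalog.items() if location in locs)
```

Here, `categories_for("warehouse.crm.customers", catalog)` returns both categories, while the inventory still lists that dataset only once.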
How to Conduct a Data Inventory?
Before conducting a data inventory, we must understand its function. This document provides data consumers with a starting point for data discovery and access. When performed effectively, the data inventory facilitates broad, streamlined data access, making data usage and related operations more effective and efficient.
Data protection regulations, such as the GDPR (Europe's General Data Protection Regulation), require companies to know the location of all sensitive data they collect and store. By implication, this requires a detailed and up-to-date data inventory.
The primary challenge in creating a data inventory is keeping it current. Therefore, you need an agile system to create and maintain your data inventory. The most efficient way to do this is either by automation or by maintaining a continuously updated data inventory (such as the one Satori creates).
For examples and further references, read our guide to creating manual and automated data inventories in Snowflake.
What Is a Data Dictionary?
Data dictionaries outline information about naming and defining data assets. These are stored as repositories and serve to support data engineering operations. In these dictionaries, we can find names, settings, and any other important attributes of data assets found in specific databases or data pipelines. These dictionaries offer a high-level understanding of the elements involved, allowing for more effective interpretation and guidance. This information helps define the scopes and application rules of each dataset, as outlined by various stakeholders.
When employed effectively, data dictionaries help prevent inconsistencies and conflicts in data assets when they are used in projects. They further allow for precise, uncomplicated definition conventions and consistency in enforcing roles and uses.
Additionally, data dictionaries often hold centralized definitions for terms surrounding data assets, relationships, and metadata on the origin, use, and data schema. In short, data dictionaries explain how specific types of data fit into the bigger picture. They are closely related to data warehouses, relational databases, and data management systems.
How to Create a Data Dictionary?
There are two types of data dictionaries:
- Active Data Dictionary
- Static Data Dictionary
Active data dictionaries are automatically updated as the data repository to which they are linked grows and changes. Conversely, static data dictionaries are not bound to any specific database and therefore must be updated manually. This manual process is challenging because any delay in carrying out updates leaves the dictionary's metadata out of sync with the data. The speed at which databases typically change adds to the problem, making it ever more difficult to keep a dictionary up to date manually. For this reason, we advocate for implementing automatic, agile procedures to ensure that all data dictionaries remain updated and accurate.
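The difference can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module and an in-memory database purely as a stand-in for a real repository; the table, its columns, and the `describe_table` helper are assumptions made for the example. The static dictionary is a snapshot that goes stale once the schema changes, whereas the active one is regenerated from the database's own metadata.

```python
import sqlite3


def describe_table(conn, table):
    """Build dictionary entries for one table from the database's own metadata."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return {
        name: {"type": col_type, "required": bool(notnull)}
        for _cid, name, col_type, notnull, _default, _pk in rows
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER NOT NULL, email TEXT)")

# Static dictionary: a snapshot that someone has to remember to refresh.
static_dictionary = describe_table(conn, "customers")

# The schema evolves after the snapshot was taken.
conn.execute("ALTER TABLE customers ADD COLUMN country TEXT")

# Active dictionary: regenerated from live metadata on every read.
active_dictionary = describe_table(conn, "customers")
stale_columns = sorted(set(active_dictionary) - set(static_dictionary))
```

After the `ALTER TABLE`, `stale_columns` lists the column that the static snapshot no longer reflects.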
Data Dictionary Example
Data dictionaries typically contain the following elements:
- Data asset name
- Format type
- Relationships with other data entities and assets
- Reference data
- Data quality rules
- Elemental data asset hierarchy
- Datastore location
- System-level diagrams
- Missing data and quality-indicator codes
- Business rules (data quality validation and schema objects)
- Entity-relationship diagrams
The metadata in data dictionaries mainly focuses on the data asset's business attributes. It typically facilitates communication between business stakeholders and technical users, ensuring that all information, contents, and formats meet requirements. Data dictionaries further serve as a valuable tool in defining project requirements for data pipelines or products.
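As a rough illustration of how a few of the elements listed above might fit together, the sketch below models a single dictionary entry in Python and checks values against it. The class, its field names, and the example rule are hypothetical; a real dictionary would typically carry many more attributes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DictionaryEntry:
    name: str                      # data asset name
    format_type: type              # expected Python type of the value
    datastore: str                 # datastore location
    missing_codes: set = field(default_factory=set)  # codes meaning "no value"
    allowed_range: Optional[tuple] = None  # optional (min, max) data quality rule


def validate(entry, value):
    """Return a list of rule violations for one value (empty list means valid)."""
    if value in entry.missing_codes:
        return []  # a documented missing-data code is not an error
    if not isinstance(value, entry.format_type):
        return [f"{entry.name}: expected {entry.format_type.__name__}"]
    if entry.allowed_range is not None:
        low, high = entry.allowed_range
        if not low <= value <= high:
            return [f"{entry.name}: {value} outside [{low}, {high}]"]
    return []


age = DictionaryEntry("customer_age", int, "warehouse.crm.customers",
                      missing_codes={-999}, allowed_range=(0, 120))
```

With this entry, `validate(age, 34)` and `validate(age, -999)` both return an empty list, while `validate(age, 150)` reports a range violation.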
Most often, database management systems and information systems created by computer-aided software engineering contain active data dictionaries. Teams can use these dictionaries as a starting point to create their own data dictionaries. If you are unable to generate a machine-readable data dictionary automatically, you can use a single-source dictionary, such as one maintained in a spreadsheet.
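A spreadsheet-based dictionary can still be made machine-readable with very little effort. The sketch below assumes the sheet has been exported as CSV with `name`, `type`, and `description` columns; those headers and the sample rows are illustrative, not a prescribed layout.

```python
import csv
import io

# Stand-in for a spreadsheet exported as CSV; column headers are assumptions.
SHEET = """name,type,description
customer_id,integer,Unique customer identifier
signup_date,date,Date the account was created
"""


def load_dictionary(csv_text):
    """Parse the exported sheet into {asset name: {type, description}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["name"]: {"type": row["type"], "description": row["description"]}
        for row in reader
    }


dictionary = load_dictionary(SHEET)
```

Keeping the spreadsheet as the single source and loading it this way means other tools can look up definitions without a second copy to maintain.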
What Is a Data Catalog?
A data catalog is a centralized metadata repository that organizations use to manage their data. Here, a company outlines information about the organization, use, and management of data resources. This catalog supports data engineering, analytics operations, and data science functions.
Why Have a Data Catalog?
When executed effectively, data catalogs facilitate improved data management. They provide high-level, categorized information on the datasets available in an organization, thus offering high-level insights and analytics. This asset enables stakeholders to efficiently find relevant datasets of any type stored in various locations, such as data lakes, warehouses, and other databases.
Data catalogs underpin data engineering operations by keeping track of data schema changes to facilitate transformations and aggregations in data pipelines. Here, the data catalog helps data engineers check that incoming data adhere to the expected schema by triggering alerts when changes occur.
Without effective data catalogs in place, these changes would likely go unnoticed and cause silent failures. This is a common problem in pipelines that process data from various sources: gaps open up between the data and its metadata. When the data changes unexpectedly, the pipeline can then fail or produce incorrect output.
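The kind of schema check described here can be sketched in a few lines of Python. The schemas are represented as plain column-to-type mappings, and the column names are invented for the example; in practice, the expected schema would come from the catalog and the incoming one from the pipeline.

```python
def schema_changes(expected, incoming):
    """Compare two {column: type} mappings and describe every difference."""
    changes = []
    for column in sorted(expected.keys() | incoming.keys()):
        if column not in incoming:
            changes.append(f"missing column: {column}")
        elif column not in expected:
            changes.append(f"unexpected column: {column}")
        elif expected[column] != incoming[column]:
            changes.append(
                f"type changed: {column} {expected[column]} -> {incoming[column]}"
            )
    return changes


expected = {"order_id": "int", "total": "float", "placed_at": "timestamp"}
incoming = {"order_id": "int", "total": "str", "coupon": "str"}

alerts = schema_changes(expected, incoming)
```

A non-empty `alerts` list is the signal to raise before the pipeline runs, rather than discovering the mismatch in a broken report later.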
This centralized, up-to-date repository allows organizations to track data assets efficiently and lets stakeholders quickly and easily find relevant datasets while adapting to the changing data landscape. Here, data teams can uncover previously unknown potential benefits, effectively apply data governance policies, and ensure regulatory compliance.
How to Build a Data Catalog?
Data catalogs are generally stored separately from the datasets to which they refer, either in data warehouses or in data lakes. While building a data catalog, make sure you keep the primary goal in mind: a data catalog makes data management easy and effective by sharing knowledge and information on the data collected and stored in your organization. It outlines the data flow in various pipelines and offers a bird's-eye view of your data landscape.
Building a data catalog can be time-consuming, especially when done manually. Here are the key steps to follow when building an effective data catalog:
- Data Capture
- Assign Data Owners
- Knowledge Documentation
- Regular Updates
Data Capture
In the data capture phase, determine which metadata is relevant and how it should be captured. Answering these questions helps develop the shape and structure of your data catalog by giving you an understanding of your data's shape, structure, and semantics.
First, determine the data most relevant to your company's growth, such as:
- Customers who have made at least one purchase
- Customer acquisition costs
These are just a few examples of a potentially extensive list. The next step is to find where this relevant data resides and in what shape it is stored. Some of these data points could come from a combination of sources and potentially have varying shapes. This means that the data is not necessarily only contained in neat tables. A well-designed data catalog supports various data types, including table and streaming data.
Streaming data and other forms of nested data are becoming more commonplace as the global data landscape evolves. For this reason, your data catalog must be configured to support these types of data. Additionally, the catalog should function logically, supporting hierarchical data relationships and connecting with how users think.
Data lineage is another essential aspect of data catalogs; this involves an understanding of where data comes from and where it is going. The lineage provides context for data users, as in the following examples:
- Understanding which downstream processes would be impacted if you update a specific data pipeline.
- Knowing where PII (personally identifiable information) goes after it enters the data lake.
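Both examples above amount to following the lineage graph downstream from a given asset. The sketch below shows one way to do that with a breadth-first traversal; the graph and its asset names are hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed as its value.
lineage = {
    "raw.signups": ["lake.users"],
    "lake.users": ["warehouse.customers", "ml.churn_features"],
    "warehouse.customers": ["dashboard.revenue"],
    "ml.churn_features": [],
    "dashboard.revenue": [],
}


def downstream(asset, graph):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen, queue, order = {asset}, deque([asset]), []
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Calling `downstream("lake.users", lineage)` lists everything that would be affected by a change to that asset, which is also the set of places PII entering at that point could end up.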
Implementing systems that automatically update data catalogs is essential because it eliminates thousands of hours wasted on manual updates and ensures that the catalog is always current and relevant. Nearly all databases and data stores have tools that facilitate metadata extraction in your desired shape and semantics.
Some circumstances do not facilitate direct database connections, such as when you are protecting sensitive data or using a privately managed database. Here, sample files and extracts will suffice in place of a direct database connection.
In addition to automatic updates and data extraction, a manual human check is also necessary. Databases change and evolve rapidly, and automated processes and machine learning algorithms cannot always keep up with these changes. Thus, you need to regularly check that the processes are functioning correctly.
Assign Data Owners
Once the data is captured, the organization must assign ownership over this data. This distinction provides a contact person for data users who require additional information, and it assigns someone the responsibility for ensuring that the data and documentation are complete and accurate.
The most important data owners are the data stewards and technical owners. The data steward manages and addresses business-related queries, while the technical owner is responsible for resolving technical issues.
Knowledge Documentation
Data documentation, even at a small scale, is potentially overwhelming. It is often not feasible to catalog all of your data at once. For this reason, you should follow a practical, logical approach, ensuring that the most important data is cataloged first, followed by the second most important data, and going down the hierarchy from there.
Typically, one of three methodologies is followed:
- Document it when you find it. Here, everyone is responsible for updating the data catalog when they learn something new that has not been previously documented.
- Update documentation when you change the code. In this case, updating the data documentation is part of the protocol followed when releasing new features.
- Regularly set aside time. Set aside an hour per week, or 15 minutes per day, during which each team member documents the data assets they are familiar with or researches one they do not know well.
The ability to quickly find information in a data catalog is essential. Here, it is helpful to add rich-text documentation within the catalog, which highlights critical insights. It also helps to group assets into common datasets for easy extraction.
Documenting questions about the data, along with answers to these questions, is also useful. Think of this process as akin to the "frequently asked questions" section on a website.
Regular Updates
Datasets change constantly. Identifying these changes and updating your data catalog accordingly is essential. This process should ideally be automated to save time, but a fair portion of user interaction remains essential to ensure that all updates are sensible. Here, governance actions automatically prompt user intervention when appropriate.
A data catalog is a tool that enables your teams to interact with your data effectively. Understanding these teams' needs and optimizing your standards and norms accordingly paves the way for optimal data interactions. To this end, develop the following norms:
- Standardize documentation formats for all in-house databases, schemas, fields, and data lineage.
- Identify essential learning plans, such as new employee onboarding, and tag the relevant assets.
- Reinforce norms surrounding the data catalog and embed it into your organization's data culture.
When Should You Have a Data Dictionary?
A data dictionary is essential when you have a large amount of quantitative data that is accessed by many users because it prevents data redundancy and ambiguity. When used correctly, a data dictionary promotes efficiency. While this document may take time to prepare, the long-term results are worthwhile.
When Should You Have a Data Inventory?
Data inventories are directly or indirectly required according to data governance regulations, such as the GDPR. This is especially relevant when collecting personally identifiable information (PII).
When your organization has an extensive data collection, understanding what information you have and why it is useful can be a daunting task. Having a data inventory at hand simplifies this task considerably since it provides granular detail on what data you have and where it is located. This information simplifies and streamlines data tracking, which, in turn, improves efficiency, since your data is now inherently searchable.
When Should You Have a Data Catalog?
It is best to have a data catalog when you have data across multiple data dictionaries that is accessed by multiple users. The data catalog organizes this data into a simple, easily digested form, streamlining data extraction and processing.
Data Inventory vs. Data Dictionary
A data inventory details all of the datasets available in your organization and displays all relevant metadata. In contrast, a data dictionary outlines the rules for those datasets, indicating their proper format, shape, and schema.
Data Catalog vs. Data Inventory
A data catalog offers a bird's-eye view of all of the data available in your organization and where to find it. Here, the data is organized according to regular business functions, such as understanding the lead generation pipeline, managing purchasing and inventory, or tracking customer spending habits.
On the other hand, a data inventory contains metadata on all of the datasets in the company, making these datasets inherently searchable. It is granular in nature, providing detailed information on single datasets. The information contained in a data inventory is always unique, while one dataset could be featured in multiple entries in a data catalog.
Key Factors in Creating Data Inventories, Catalogs, and Dictionaries for the Modern Data Stack
Creating data inventories, catalogs, and dictionaries is an essential function in modern data processing. Nevertheless, there are some common pitfalls inherent to these processes, specifically when dealing with sensitive data and unstructured or semi-structured data.
Here, data inventories, catalogs, and dictionaries work together, forming the basis of understanding and protecting this data.
Special Care for Sensitive Data
The risks surrounding sensitive data are a sore point for many organizations because the threats of cybercrime are currently at an all-time high. Therefore, taking special care of this data is essential. This data should be labeled, cataloged, and inventoried accurately, as knowing where the data is and how sensitive it is enables further data protection measures.
Organizations should assign ownership over this sensitive data because knowing who is responsible for the data creates urgency in protecting it.
Lastly, restrict access to this data and update the usage and access guidelines accordingly in the data catalog.
Continuous Sensitive Data Discovery
Outline and implement protocols to continuously discover sensitive data in your organization's data structures. If you do not know that sensitive data is there, you cannot begin to protect it.
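As a deliberately simplified sketch of what such a discovery step might look like, the Python example below scans sampled rows for values that match two regular expressions. The patterns, labels, and sample rows are illustrative assumptions; production-grade detection needs far broader and better-validated rules than these.

```python
import re

# Illustrative patterns only; real detection needs broader, validated rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify_columns(rows):
    """Map each column name to the sorted sensitive types found in its values."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for label, pattern in PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    findings.setdefault(column, set()).add(label)
    return {column: sorted(labels) for column, labels in findings.items()}


sample = [
    {"id": 1, "contact": "ana@example.com", "note": "SSN 123-45-6789 on file"},
    {"id": 2, "contact": "n/a", "note": "called twice"},
]
```

Running `classify_columns(sample)` on a schedule, or whenever new data arrives, would flag the `contact` and `note` columns so they can be labeled in the inventory and catalog.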
Ensure Semi-Structured Data Is Classified and Updated
Semi-structured data does not fit into a rigidly defined structure or schema. Instead, it is organized through tags that allow its elements to be grouped and structured. These non-relational, or NoSQL, data types are often difficult to capture, classify, and update, but they form an essential part of good data governance.
Implement processes that identify and catalog this type of data to make sure that your organization does not create a lake filled with dark data.
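One possible starting point is to infer which fields exist, and with which types, directly from sampled documents. The sketch below does this for nested, JSON-like records in Python; the function names and sample events are hypothetical, and lists are treated as leaf values for simplicity.

```python
def field_types(document, prefix=""):
    """Yield (dotted_path, type_name) for every leaf value in a nested dict."""
    for key, value in document.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from field_types(value, prefix=f"{path}.")
        else:
            yield path, type(value).__name__


def infer_schema(documents):
    """Union the observed types per field path across all documents."""
    schema = {}
    for document in documents:
        for path, type_name in field_types(document):
            schema.setdefault(path, set()).add(type_name)
    return {path: sorted(types) for path, types in sorted(schema.items())}


events = [
    {"user": {"id": 7, "email": "ana@example.com"}, "tags": ["new"]},
    {"user": {"id": "7b"}, "amount": 12.5},
]
```

The result of `infer_schema(events)` shows, for instance, that `user.id` arrives as both an integer and a string, which is exactly the kind of inconsistency worth recording in the catalog.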
Metadata Management with Satori
Satori maintains a continuously updated data inventory, which is automatically updated as data is being accessed. This includes classification of sensitive data, so you are aware when sensitive data is added to your repositories. Furthermore, Satori integrates with data catalogs, such as the Collibra data catalog, to simplify setting security policies according to data catalog definitions and to populate and classify the metadata in the data catalog.
Data dictionaries, data catalogs, and data inventories are integral aspects of good data governance practices. Having these in place and up to date ensures effective and efficient data interactions, allowing teams to streamline their operations and gain valuable insights from data far more quickly.
The structures also allow for data disambiguation, ensuring that all stakeholders adhere to the same definitions and protocols. Here, you can minimize data redundancy and eliminate incorrect information gained from data.
The greatest challenge in managing metadata, regardless of its application, is keeping it current. The speed and volume of our data collection pipelines are astronomical, necessitating automated and agile protocols for updating these structures.
Since the data landscape is continuously changing, regular user intervention is necessary. Effective metadata handling protocols enable automatic processes while identifying instances that require user intervention.
Satori’s Continuously Updated Data Inventory
Satori continuously updates a data inventory, which also contains data classification that is performed as data is being accessed. To learn more, read about our continuous data discovery & classification capability, or schedule a demo by filling out the form below.