The more data a company collects and stores, the more likely that data is to turn into a data swamp if proper procedures for managing it are not in place. In this article, we discuss what a data swamp is, how organizations typically land in one, the risks it poses, and what you can do to prevent your data lake from becoming one.
What Is a Data Swamp? What Is a Data Lake?

A data swamp occurs when data is stored without proper organization. The data lacks the metadata needed to make retrieval simple. Metaphorically, you can think of the data as lost in a dark swamp, with little hope of finding it. Or, even worse, you can imagine yourself reaching for the data and sinking into the swamp.

The best way to understand a data swamp is to compare it with a functioning data lake. A data lake is a central repository that stores all types of data, structured or unstructured, at scale. The data in a data lake does not need to be structured, but it does need to be organized and assigned appropriate metadata so that it can be retrieved easily for later reports and analytics.

Data lakes differ from older forms of data storage such as relational databases. Their value is that raw data can be added without spending a lot of resources on creating schemas and ETL pipelines. However, because data lakes store raw data of any type, relational or not, some degree of organization is required to prevent a lake from quickly becoming a swamp.
How Organizations Land in a Data Swamp

The value of a data lake compared to more traditional storage methods is that any type of data can be fed into it, including data from social media, log files, clickstreams, mobile apps, IoT devices, and so on. Unlike a data warehouse, whose schema is known before any data is written (schema-on-write), a data lake's schema is determined when the data is read (schema-on-read). This creates tremendous flexibility for analytics and reporting, allowing organizations to draw correlations and conclusions that could not be anticipated before the data was received. Because they are not tied to a particular schema, data lakes let organizations store data at scale and keep adapting to it so that deeper insights can be obtained, which leads to faster decision-making.

This flexibility comes at a price: mechanisms to categorize data, as well as security policies to ensure that sensitive data is not inadvertently leaked, need to be in place. In addition, the fact that a data lake can store any type of data does not mean it should store every bit of data, so you need a policy that prevents unnecessary dumping and indiscriminate storage. Organizations land in a data swamp when these mechanisms and policies are missing.
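To make the idea of a schema that is determined at read time more concrete, here is a minimal sketch in Python. The sample events, the `read_with_schema` helper, and the field names are all hypothetical and exist only for illustration; a real data lake would typically rely on a query engine rather than hand-written parsing.

```python
import json

# Raw events land in the lake as-is; no schema is enforced at write time.
raw_events = [
    '{"user": "a1", "action": "click", "ts": 1700000000}',
    '{"user": "b2", "action": "purchase", "amount": "19.99"}',
    '{"device": "sensor-7", "temp_c": 21.5}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep records that contain every
    requested field and cast each field to the requested type."""
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            yield {name: cast(record[name]) for name, cast in schema.items()}

# Two different views over the same raw data, each chosen by the reader.
clicks = list(read_with_schema(raw_events, {"user": str, "action": str}))
purchases = list(read_with_schema(raw_events, {"user": str, "amount": float}))
```

The same raw lines serve both readers, which is the flexibility described above. It also shows why organization matters: without metadata, a reader has no way of knowing which fields to ask for.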
The Risks of a Data Swamp

An organization with a data swamp may not know that some of the data it holds is intertwined with sensitive data. Because of this, sensitive data can be inadvertently leaked, resulting in compliance violations or security risks. In general, compliance can become a nightmare when data swamps are involved. For example, it can take significant effort just to find out where sensitive data is and who is accessing it. Also, without proper governance to establish how long each piece of data should be stored, the data lake quickly becomes overgrown with outdated and redundant data.
Key Ways to Avoid Landing in a Data Swamp

The following requirements are essential to avoid turning a data lake into a data swamp:
Relevance

Data governance and risk teams should set policies for data expiration so that outdated data is purged or access to it is limited. They also need policies that determine which types of data are restricted and for which users, along with a way to enforce that access control.
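As a rough illustration of an expiration policy, the sketch below checks a dataset's age against a retention period per category. The categories, retention periods, and the `retention_action` function are assumptions made up for this example, not a recommended policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data category.
RETENTION = {
    "clickstream": timedelta(days=90),
    "app_logs": timedelta(days=30),
}

def retention_action(category, created_at, now):
    """Return 'keep', 'expired', or 'review' for one dataset."""
    limit = RETENTION.get(category)
    if limit is None:
        # No policy is defined: flag the dataset for the governance team.
        return "review"
    return "expired" if now - created_at > limit else "keep"

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old_logs = datetime(2024, 1, 1, tzinfo=timezone.utc)
recent_clicks = datetime(2024, 5, 1, tzinfo=timezone.utc)
```

A dataset reported as expired would then be purged or have its access limited, depending on the policy. Returning "review" for categories without a policy keeps undefined cases visible instead of silently keeping the data forever.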
Discovery of Sensitive Data

Because data is continually added and modified within the data lake, sensitive data must be discovered continuously: upstream operations may introduce sensitive data into the lake at any time.
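One simple way to picture such discovery is a scan that runs on every incoming record. The sketch below uses two illustrative regular expressions; the patterns, the labels, and the `find_sensitive` function are assumptions for this example, and real discovery needs much broader and more careful detection.

```python
import re

# Illustrative patterns only; real detection needs far broader coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_sensitive(record):
    """Return a dict mapping each field name to the sorted list of
    pattern labels that matched its value."""
    findings = {}
    for field_name, value in record.items():
        labels = sorted(
            label for label, pattern in PATTERNS.items()
            if pattern.search(str(value))
        )
        if labels:
            findings[field_name] = labels
    return findings

record = {"note": "contact jane@example.com", "id": "123-45-6789", "qty": 3}
findings = find_sensitive(record)
```

Running a check like this at ingestion time, and again whenever data is modified, is what makes the discovery continuous rather than a one-time audit.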
Metadata

Because a data lake can accept all types of data objects, the type of data being fed into it is not known in advance. Additional metadata therefore needs to be included when storing data to make future retrieval easier. There are several categories of metadata:
- Descriptive metadata: describes the source of the data.
- Structural metadata: describes the data’s structure and any links it might have to other data.
- Administrative metadata: includes permissions data and information about how the data should be managed.
- Reference metadata: includes high-level information about numeric values.
- Statistical metadata: describes how the asset was collected and processed.
- Legal metadata: any data relevant from a legal perspective, such as copyright and licensing data.
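The categories above could be attached to each stored object as a simple record. The following is only a sketch: the `ObjectMetadata` class, its field names, and the sample values are hypothetical and merely mirror the list.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ObjectMetadata:
    # Field names are hypothetical; they mirror the categories listed above.
    descriptive: dict = field(default_factory=dict)
    structural: dict = field(default_factory=dict)
    administrative: dict = field(default_factory=dict)
    reference: dict = field(default_factory=dict)
    statistical: dict = field(default_factory=dict)
    legal: dict = field(default_factory=dict)

    def missing_categories(self):
        """List the categories that were left empty, so gaps are visible."""
        return [name for name, value in asdict(self).items() if not value]

meta = ObjectMetadata(
    descriptive={"source": "mobile-app"},
    administrative={"owner": "analytics", "retention_days": 90},
)
```

Calling `meta.missing_categories()` here reports the four categories that were not filled in, which gives a quick way to spot objects that entered the lake with incomplete metadata.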
Automation

Manually tagging every bit of data that comes in would be impossible. The whole point of data lakes is that they are well suited to processing big data, which means billions of data points. It is therefore necessary to have an automated system that categorizes and stores the incoming data, so that it receives the appropriate metadata and is handled according to the relevant data governance policies.
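A very small sketch of what automated tagging might look like is shown below. It assigns tags based on an object's key prefix; the rules, tag names, and the `tag_object` function are assumptions for illustration, and a production system would likely combine several signals rather than rely on prefixes alone.

```python
# Hypothetical rules: (key prefix, tags to attach to matching objects).
RULES = [
    ("logs/", {"category": "app_logs", "sensitivity": "low"}),
    ("crm/", {"category": "customer", "sensitivity": "high"}),
]

def tag_object(key):
    """Return metadata tags for an object key. Keys that match no rule
    are marked unclassified and quarantined for later review."""
    for prefix, tags in RULES:
        if key.startswith(prefix):
            return dict(tags, key=key)
    return {"key": key, "category": "unclassified", "quarantine": True}

tagged = [tag_object(k) for k in ("crm/2024/contacts.csv", "misc/dump.bin")]
```

Quarantining objects that match no rule is one possible design choice: it keeps unknown data from silently accumulating, which is how a lake starts to turn into a swamp.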
Summary

Data lakes lend themselves well to big data and large transfers of data in real time. They can store both relational and non-relational data, opening the door to deep analytics opportunities that are not available with a company's typical datastores. A data swamp, on the other hand, is a marshland of facts stored in a disorganized manner that provides little or no business value. To keep a data lake from turning into a data swamp, proper categorization of data and sensible data governance policies are a must.

Satori helps you prevent your organization's data from turning into a swamp. Here are some examples of how Satori can help you achieve these best practices:
- Satori continuously monitors data access, so you can locate sensitive data being accessed. Satori also creates a continuously updated data inventory with the sensitive data in your repositories.
- Satori saves your data engineering teams time on granting access to data, allowing both data owners and data consumers to work with data more efficiently. This in turn leaves data teams with more time to organize the infrastructure and to organize your lake with relevant datasets configured in Satori.
- Satori enables a high level of visibility into the data accessed in your organization, helping reduce the “data opacity” level.