In this article, I will give an overview of data access orchestration (not to be confused with data orchestration) and then discuss the benefits of data access control orchestration as well as its limitations. I will also compare data access orchestration with a data access control proxy. Lastly, I will explain why we, at Satori, decided to design a proxy data access control platform rather than an orchestration platform.
What Is Data Orchestration?
Data orchestration is a process that consolidates data from numerous storage locations and combines it in a rational manner so that companies can use the data in their analysis and management platforms.
Data orchestration is usually backed by the use of software platforms, which connect various storage systems and enable connections with other applications when required.
What Is Data Access Orchestration?
Unlike data orchestration, in data access orchestration, the thing being orchestrated is access to the data, rather than the data itself. Instead of configuring data access manually in the data stores themselves (e.g. databases, data warehouses, and data lakes), access policies are defined using a single tool which then implements the security policies in the various data stores.
For example, when a policy states that certain data types (such as PII) must not be exposed to certain groups of users in a Snowflake account, the data access orchestration solution “translates” that policy into a dynamic masking policy in Snowflake. Similarly, when access is authorized to a dataset that contains several tables, the tool executes SQL GRANT commands to allow access to the objects within the dataset.
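To make the translation step more concrete, here is a minimal Python sketch of how a tool might render one declarative rule as Snowflake-style SQL. The `MaskingRule` class, the `translate` function, and every role, table, and policy name are hypothetical; real orchestration products generate their own statements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskingRule:
    """Hypothetical policy: mask `column` for everyone outside `allowed_roles`."""
    table: str            # fully qualified table name
    column: str
    allowed_roles: tuple  # roles that may see clear-text values


def translate(rule: MaskingRule, reader_role: str) -> list[str]:
    """Render the SQL an orchestration tool might run (illustrative only)."""
    policy = f"orch_mask_{rule.column}"  # assumed naming scheme
    roles = ", ".join(f"'{r.upper()}'" for r in rule.allowed_roles)
    return [
        f"CREATE MASKING POLICY {policy} AS (val STRING) RETURNS STRING -> "
        f"CASE WHEN CURRENT_ROLE() IN ({roles}) THEN val ELSE '*****' END;",
        f"ALTER TABLE {rule.table} MODIFY COLUMN {rule.column} "
        f"SET MASKING POLICY {policy};",
        f"GRANT SELECT ON TABLE {rule.table} TO ROLE {reader_role};",
    ]


rule = MaskingRule("orgdata.public.customers_demographics", "last_name", ("pii_reader",))
for statement in translate(rule, "analyst"):
    print(statement)
```

The point of the sketch is that the policy is defined once, in the tool, while the statements it emits are specific to the target platform.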
In many cases, access is controlled by creating an abstraction layer. For example, to control access to a certain schema, the tool can create a second schema containing a set of database views in which the access logic is applied. Users query the views, which return the data with the restrictions applied.
A simplified example appears as follows. Let’s say you want to access a table that is in the public schema and would normally access it using the following command:
SELECT * FROM orgdata.public.customers_demographics LIMIT 100;
You would need to create a view, which looks like this:
create view orgdata.public.customers_demographics_view as
select
    orgdata.orchestration."ORCH_MASK_COLUMN_[ORGDATA].[PUBLIC].[CUSTOMERS_DEMOGRAPHICS].[FIRST_NAME]"(first_name) as first_name,
    orgdata.orchestration."ORCH_MASK_COLUMN_[ORGDATA].[PUBLIC].[CUSTOMERS_DEMOGRAPHICS].[LAST_NAME]"(last_name) as last_name,
    orgdata.orchestration."ORCH_MASK_COLUMN_[ORGDATA].[PUBLIC].[CUSTOMERS_DEMOGRAPHICS].[CC_NUM]"(cc_num) as cc_num,
    orgdata.orchestration."ORCH_MASK_COLUMN_[ORGDATA].[PUBLIC].[CUSTOMERS_DEMOGRAPHICS].[ADDRESS]"(address) as address,
    orgdata.orchestration."ORCH_MASK_COLUMN_[ORGDATA].[PUBLIC].[CUSTOMERS_DEMOGRAPHICS].[ZIP_CODE]"(zip_code) as zip_code,
    orgdata.orchestration."ORCH_MASK_COLUMN_[ORGDATA].[PUBLIC].[CUSTOMERS_DEMOGRAPHICS].[SALARY]"(salary) as salary
from orgdata.public.customers_demographics
/* There may also be access control conditions, for example for row-level security */
;
The exact abstractions depend on the data platform, and sometimes on additional details such as your subscription plan. For example, you may require different implementations depending on your Snowflake edition. Sometimes those abstraction layers will be created by the orchestration tool itself (for example when policies can be applied by the native data platform).
As part of applying access controls, users’ direct access to the underlying objects is revoked, and they are instead granted access to the abstraction objects (the views).
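The view-based abstraction can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a real data platform. The table contents, the `mask` function, and the view name are invented for illustration. SQLite has no GRANT or REVOKE, so the revocation step is only noted in a comment.

```python
import sqlite3


def mask(value):
    """Hypothetical masking function: keep the first character, hide the rest."""
    if value is None:
        return None
    text = str(value)
    return text[:1] + "*" * (len(text) - 1)


conn = sqlite3.connect(":memory:")
conn.create_function("mask", 1, mask)  # expose the Python function to SQL

conn.executescript("""
    CREATE TABLE customers_demographics (first_name TEXT, last_name TEXT, zip_code TEXT);
    INSERT INTO customers_demographics VALUES ('Ada', 'Lovelace', '90210');
    INSERT INTO customers_demographics VALUES ('Alan', 'Turing', '10001');

    -- The abstraction layer: consumers query the view, not the table.
    CREATE VIEW customers_demographics_view AS
    SELECT first_name, mask(last_name) AS last_name, zip_code
    FROM customers_demographics
    WHERE zip_code LIKE '9%';  -- stand-in for a row-level security condition
""")

# On a real platform, direct access to the table would be revoked here and
# access to the view granted instead.
rows = conn.execute("SELECT * FROM customers_demographics_view").fetchall()
print(rows)
```

Querying the view returns only the rows that pass the filter, with the last name masked, while the underlying table is left unchanged.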
Benefits of Data Access Orchestration
The primary benefit of using a data access orchestration platform is that data engineering teams responsible for setting and revoking user access to data can perform access control operations using a single tool. In certain cases, the data engineering team can enable additional users (such as data stewards or data owners) to provide user access without involving data engineering, which can otherwise become a bottleneck.
Additional benefits of this system include reduced effort when data is accessed from multiple platforms. When orchestrating data access in multiple platforms, your orchestration solution can simplify changes which can otherwise require an additional learning curve from data engineers (because they need to figure out how to accomplish the same results in different platforms).
Limitations of Data Access Orchestration
A data access orchestration platform has many advantages over directly applying data access controls on the data platform itself. For example, it allows scaling of data access, especially for organizations with a large and growing number of data consumers, as well as satisfying various security, compliance, and privacy requirements. However, there are also certain limitations to such platforms.
One of the main disadvantages of this approach is the lack of visibility into the data being accessed. An orchestration tool sets configuration and restrictions, and is most often combined with log collection, but it cannot analyze the data itself as it is accessed. As a result, you must perform periodic scans to discover sensitive information, since it will not be discovered at the moment of access.
Administrative User Operations
Another issue with data access orchestration tools is that admin users’ actions may interfere with their operation and break the security controls. This can happen by mistake (e.g. an administrator changing or creating a view outside the scope of the orchestration tool) or on purpose. Access controls that some users in the system can work around may not be acceptable in all scenarios and can leave certain risks unhandled.
Performance and Complexity Degradation
In some cases, when orchestration tools are used to enforce complicated logic, the process may cause degradation in the datastore’s performance. Such degradation can occur because these tools often add many objects to the data store, and views often contain many function calls (e.g. for masking data) along with lookup or entitlement tables, “CASE..WHEN” statements, and more. Such operations may therefore, in addition to requiring added computation, also eliminate some datastore optimizations.
In addition to possible performance effects, complexity is often introduced when additional objects such as roles, views, and entitlement tables are added. This is especially true when several access controls work together (for example, applying data localization together with row-level security, decryption, and dynamic masking).
Re-Creating Objects May Remove Controls
In some data access orchestration tools, an object that is dropped and re-created (sometimes for perfectly legitimate operational reasons) may lose the access controls that the orchestration tool had applied to it. This may lead to operational problems or to security and compliance risks.
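One way to catch this kind of drift is to periodically compare the controls the tool expects with what the data store actually reports. The Python sketch below shows only the comparison. `expected` and `actual` are hypothetical snapshots mapping a column name to a masking policy name, and how a real tool would collect them is platform-specific and not shown.

```python
def find_drift(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare expected column->policy assignments with the observed ones."""
    missing = sorted(col for col in expected if col not in actual)
    changed = sorted(
        col for col in expected
        if col in actual and actual[col] != expected[col]
    )
    return {"missing": missing, "changed": changed}


# Hypothetical snapshots: one policy is missing and one differs from what
# the tool configured.
expected = {
    "customers.last_name": "mask_pii",
    "customers.cc_num": "mask_card",
    "customers.salary": "mask_salary",
}
actual = {
    "customers.cc_num": "mask_pii",
    "customers.salary": "mask_salary",
}

drift = find_drift(expected, actual)
print(drift)
```

A check like this only detects the problem after the fact, so there is still a window between the object being re-created and the control being restored.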
Dependency on Native Capabilities
A data access orchestration platform relies on the native capabilities of the data store, which can be inherently limiting. In the earlier example (under “What Is Data Access Orchestration?”), I mentioned the use of dynamic masking policies in Snowflake. However, not all platforms have dynamic masking capabilities, and, even in Snowflake, a different solution may be required if some of your accounts are on the Standard edition.
Changing Existing Queries
In many cases, when a company starts working with such platforms, existing queries have to be altered, usually because the data must now be read from different objects (such as the views described above) rather than from the original tables. Where existing dashboards, reports, and scripts access the data, this can place a significant burden on multiple teams. The burden is compounded because a similar migration is often needed again when offboarding or moving to a different platform.
Use of Over-Privileged Roles/Users
When creating views, changing user access, and in many cases also reading data (as part of periodic scanning to discover sensitive data), such platforms require a high level of permission on your data stores. In Snowflake, this is often the ACCOUNTADMIN role (or a slightly less powerful role); on other platforms it is a comparably privileged user. In many cases, the platform also authenticates with a username and password rather than a stronger method (such as key-pair authentication).
Data Access Control Orchestration vs. Data Access Proxy
When we, at Satori, started to grapple with the challenge of creating the first DataSecOps platform for organizations, we considered different options for enforcing data access and discovering sensitive data. Instead of building an orchestration solution, we chose to follow a different approach that would allow us to provide more flexibility and value to our customers.
We wanted to build a platform that would have the lowest level of intrusion possible and also allow for a gradual onboarding process. In today’s data-driven world, where data is ever-changing, sensitive data is often found in unexpected places, and it is hard to track where sensitive data is across your data stores, it was important for us to discover sensitive data as it is being accessed rather than as part of a periodic scan. Finally, we wanted to deliver an identical experience to users regardless of the underlying platform.
The Challenges of a Data Access Proxy
The main challenge with building a data access proxy is creating a truly reliable service. The product needs to scale as data access grows and to “understand” the different protocols involved in data access. By leveraging battle-tested and scalable technologies such as Kubernetes, Nginx, and cloud-native architecture, we built our product with reliability and scalability as its cornerstones. Of course, it helped to have a team with experience in building such reliable and scalable products.
Key Advantages of a Data Access Proxy
A data access proxy can analyze data as it is being queried. This way, you can identify sensitive data while it is accessed, allowing you to act fast. For example, you can create access policies based on data types, which apply as soon as such data is discovered, so you can limit access to PII even where it was not pre-configured. A data access proxy can also update a data catalog such as Collibra when new sensitive data is discovered.
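As a rough illustration of discovery at access time (not Satori's implementation), the Python sketch below inspects rows as they pass through and masks values that look like PII for callers who lack a hypothetical `pii_reader` entitlement. The regular expressions, the `filter_rows` function, and the entitlement name are all assumptions made for this example; production classifiers are far more elaborate.

```python
import re

# Deliberately simple, hypothetical detectors.
DETECTORS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify(value):
    """Return the name of the first detector that matches, or None."""
    if not isinstance(value, str):
        return None
    for name, pattern in DETECTORS.items():
        if pattern.match(value):
            return name
    return None


def filter_rows(rows, entitlements):
    """Mask detected PII unless the caller holds the `pii_reader` entitlement.

    Returns the (possibly masked) rows and the set of column indexes in which
    sensitive data was seen, which could feed a data catalog.
    """
    can_read_pii = "pii_reader" in entitlements
    sensitive_columns = set()
    result = []
    for row in rows:
        out = []
        for index, value in enumerate(row):
            if classify(value) is not None:
                sensitive_columns.add(index)
                if not can_read_pii:
                    value = "*****"
            out.append(value)
        result.append(tuple(out))
    return result, sensitive_columns


rows = [(1, "ada@example.com", "free text"), (2, "n/a", "123-45-6789")]
masked, found = filter_rows(rows, entitlements={"analyst"})
print(masked, found)
```

Because classification happens on the values themselves, no column has to be tagged in advance: the policy applies the first time matching data flows through.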
Feature Parity Across Different Data Store Types
Implementing the logic as part of the proxy enables Satori to deliver capabilities regardless of their availability in the underlying data platforms. For example, Satori can provide additional capabilities such as row-level security or dynamic masking on semi-structured data or leverage certain attributes in an ABAC policy.
Easy to Onboard
One of the most important aspects of any product is the ability to implement it with the least amount of friction for the customer. As a transparent proxy, Satori does not need to create any objects on the data platforms, and no pre-configuration is necessary. The only change you need to make is to the hostname through which users connect to the data platform.
Enables Client-Side Encryption
In certain cases, where there are requirements like decrypting parts of the data only once it leaves the data store, a data access proxy can decrypt data on-the-fly and simplify access to such highly sensitive data.
Data access orchestration platforms are powerful tools that enable data access, especially when there is a large number of data consumers and sensitive data involved. However, we decided to follow a more difficult path by creating a data access proxy, as it allows us to provide additional value to our customers by enabling them to make more use out of their data.