A few years ago, when I first began using Athena, it was a magical and revolutionary experience for me. This service was a great enabler for data analytics and research. We would throw a lot of (mainly Parquet) files into S3 buckets, containing many different types of data, and then we could, almost instantaneously, write SQL queries to analyze this data. The pricing model, in which you pay mainly for the amount of data scanned, was also very appealing.
This fascination with Athena was not unique to me or my former team. Amazon Athena can be a vital component of an AWS Data Lake House architecture for analytics, research, and even for relieving some of the ETL pain.
The Promise of Quick & Accessible Data for All
The promise of a data lake house or a data lake is in part its ability to let more people utilize the data and harvest value from it. Data accessibility is increased by enabling easier data ingestion, so there is little overhead other than pouring data into buckets, and storage costs are low. Data consumption in AWS can be done directly, usually with Amazon Athena or Amazon Redshift Spectrum. Consumption can also be indirect and executed by loading the data into a more structured data warehouse like Amazon Redshift or Snowflake. The data that is being moved is usually the data we know we are going to use.
We have extensively discussed maintaining security, governance, and privacy in cloud data warehouse platforms. A good starting point to learn more about these topics would be our Snowflake security guide and Amazon Redshift security guide.
So, assuming we have this background, let’s get back to the data that we are not yet sure we are going to use, or that we are consuming with Athena for other reasons. Let’s see how, by using Satori, we can solve some of the challenges surrounding security, governance, and privacy when accessing data with Athena.
Security, Governance, and Privacy Challenges in Athena
Below are some of the security, governance, and privacy challenges that are resolved by adding Satori to the Athena-S3 stack. All of these controls can be defined in a simple manner in Satori’s UI, with our API, or by using our Terraform provider.
S3 Sensitive Data Discovery
One of the issues I have personally encountered is that, when you are pouring a lot of raw data into S3 buckets to be queried, it is difficult to know exactly where sensitive data such as PII is intertwined with the other data. This is especially true when inputs are coming from a multitude of sources and even more true when some of these input sources are outside of your control.
When data is accessed using Amazon Athena with Satori, data discovery is done in real-time, so you know exactly which types of data are being pulled and by whom. This data is then available to the organization as a continuously updating data inventory as well as in a data access audit log, complete with the identities of users accessing the sensitive data.
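To make the idea of classifying query results as they are pulled more concrete, here is a minimal sketch in Python. It is not Satori's implementation: the two regular expressions, the labels, and the function names `classify_value` and `audit_result_set` are all assumptions made for illustration, and a real classifier would cover far more data types.

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns for two sensitive data types (illustration only).
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_value(value):
    """Return the label of the first pattern the value matches, or None."""
    text = str(value)
    for label, pattern in PATTERNS.items():
        if pattern.match(text):
            return label
    return None


def audit_result_set(user, rows):
    """Build one audit entry describing which sensitive types a user pulled."""
    found = set()
    for row in rows:
        for column, value in row.items():
            label = classify_value(value)
            if label is not None:
                found.add((column, label))
    return {
        "user": user,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sensitive_columns": sorted(found),
    }
```

Calling `audit_result_set` on each result set would produce entries that could feed both a data inventory and an access audit log, since every entry ties the detected data types to the identity of the user who ran the query.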
Dynamic Data Masking in Amazon Athena
Knowing that there is sensitive data present is important. However, taking this knowledge a step further, Satori also allows you to set dynamic masking for data types you deem necessary to anonymize. This can be as simple as an order to “always mask any sensitive data (PII, PHI, PCI, and Operational Data)” or as elaborate as the request to “mask usernames in emails, hash addresses, and redact all the rest.” You can also define which teams or users will receive each masking profile, so that your security policies stay effective across all access to your data lake.
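The more elaborate request quoted above (“mask usernames in emails, hash addresses, and redact all the rest”) can be sketched in a few lines of Python. This is only an illustration of the concept, not how Satori applies masking: the column names, the `MASKING_PROFILE` mapping, and the helper functions are hypothetical.

```python
import hashlib


def mask_email_username(email: str) -> str:
    """Hide the username part of an email address but keep the domain."""
    local, sep, domain = email.partition("@")
    if not sep:
        return "*" * len(email)  # not an email address: hide everything
    return "*" * len(local) + "@" + domain


def hash_value(value: str) -> str:
    """Replace a value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def redact(value: str) -> str:
    """Replace a value with a fixed placeholder."""
    return "REDACTED"


# Hypothetical profile: which rule applies to which sensitive column.
MASKING_PROFILE = {"email": mask_email_username, "address": hash_value}
# Hypothetical set of columns classified as sensitive.
SENSITIVE_COLUMNS = {"email", "address", "phone"}


def apply_masking(row: dict) -> dict:
    """Mask sensitive columns; columns without a specific rule are redacted."""
    masked = {}
    for column, value in row.items():
        if column not in SENSITIVE_COLUMNS:
            masked[column] = value
        else:
            rule = MASKING_PROFILE.get(column, redact)
            masked[column] = rule(str(value))
    return masked
```

Because the masking is applied to each row as it is returned, the data stored in S3 stays unchanged, and different teams could be given different profiles by swapping the mapping.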
Row-Level Security in Athena
Row-level security is usually considered one of the “luxuries” included in having a data warehouse, but do not underestimate the importance it may have for a data lake as well. For example, when opening up access to usage telemetry data, you can give regional teams access only to the data for their specific region.
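The regional example can be illustrated with a small Python sketch. The team names, the `region` column, and the `TEAM_REGIONS` mapping are assumptions for the sake of the example, and the filtering here is done in plain Python rather than by any Satori or Athena feature.

```python
# Hypothetical mapping of teams to the regions whose rows they may see.
TEAM_REGIONS = {
    "emea-analytics": {"EMEA"},
    "apac-analytics": {"APAC"},
    "global-security": {"EMEA", "APAC", "AMER"},
}


def filter_rows_for_team(rows, team):
    """Return only the rows whose region the given team is allowed to see."""
    allowed = TEAM_REGIONS.get(team, set())  # unknown teams see nothing
    return [row for row in rows if row.get("region") in allowed]
```

Defaulting unknown teams to an empty set means a missing policy results in no rows being returned, which is usually the safer failure mode.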
Simplified Data Access to S3 Buckets
By using Satori with Athena, you can simplify access to data stored on S3 (whether this is JSON, Parquet, Avro, or other structured or semi-structured formats). You can define datasets in Satori that map to Athena tables, and set the data stewards or data owners of this data. From that point forward, those data owners will be able to manage access to the data, including by assigning temporary access or even providing self-service data access. This capability lifts the burden from the data infrastructure or data engineering teams and increases the value harvested from the data.
Identity Propagation from BI tools
Another serious security and compliance issue arises when configuring access to Athena in BI tools: you want a data access log that is correlated with the real user who accessed the data. Configuring this log is very simple with Satori, where you can get an audit log including full details of the types of data pulled, the identity of the data consumers, and additional useful metadata.