Snowflake Security: Best Practices for Stages

One of the toughest problems in DataOps is having robust access to data. We’ve discussed this in length, and specific to Snowflake we’ve discussed snowflake roles and user management strategies, and ways to optimize them using Satori. However, today, what I’d like to discuss is an earlier stage of the data operations, which is data ingestion to Snowflake and specifically Snowflake stages.

 

If you’re an organization that is either looking into or using Snowflake as your data warehouse, data lake, or data lakehouse solution, you probably know that the data that you’re using comes from somewhere. Let’s look at the main sources from which data is imported to Snowflake:

  • Data lake - Your data might come from your own data lake. The data lake may hold a lot of data, and you may only import parts of it to Snowflake for different reasons.
  • Other data warehouses - You may have other data warehouses and databases in your organization, which you’d like to send data from to Snowflake. For example, you may have databases close to your applications, which send aggregated data to Snowflake for further analytics.
  • Applications - Sometimes applications send data to Snowflake as part of their operation.
  • Telemetry - Results of telemetries may be sent from applications and devices.
  • Security logs and other logs - You may use your Snowflake (using Snow Alert or building your own) as a security data lake, or use it for analyzing operational logs.
  • Manual loading of data - Sometimes teams upload their own data without going through a DataOps process, by simply preparing files and ingesting them.
  • Data Sharing - Data is sometimes ingested from shared data. This can be done by using a Snowflake secure data share (such as a private exchange or from a public Snowflake data marketplace listing), but can also be data that is shared from a partner using another datastore.

Let’s look at how data is loaded into Snowflake and some of the things you need to pay attention to for a secure operation of stages.

 

Loading data into Snowflake

Data is loaded to Snowflake from stages. Stages are either internal, or external (public cloud storage buckets), where data can either be put for the sole purpose of importing it to Snowflake or as part of other processes. An example of this is an S3 bucket that is also used as part of your data lake operation, where some of the data is ingested to Snowflake. Note that another good reason to monitor stages is because data can also be exported to them.

 

A stage is a Snowflake object that you can manage using SQL. This can be as simple as a CREATE STAGE <stagename> or customized with optional parameters to configure the parameters.

 

Snowflake internal stage vs external stage

The Snowflake stages can either be internal and part of your Snowflake account storage (for example, if you have a Snowflake deployment in AWS it will be in S3 buckets managed by Snowflake), or it can be an external stage whereby it will be using storage buckets. Currently supported are GCP, GCS (Google Cloud Storage), AWS S3 buckets, and Azure blobs. These buckets may be your own storage buckets, but they may also be buckets shared by other companies or even public storage buckets.

More on loading the data using stages can be found in Snowflake documentation.

 

Snowflake stage security best practices

Let’s now discuss some of the important aspects of data operations security when it comes to loading data from stages.

 

Loading encrypted data to Snowflake

Let’s say that a fictitious company called Sandwich Engineering Inc wants to use a S3 bucket as a stage. To better protect the data at rest, we can store it encrypted. As an example, we can use the following command to use AWS SSE (Server-Side Encryption) on the files in the stage:

CREATE STAGE sandwich_hr
  url='s3://sandwich/hr/'
  credentials=(aws_key_id='<KEY_ID>' aws_secret_key='<SECRET_KEY>')
  encryption=(type = 'AWS_SSE_S3');

 

By storing the data encrypted, we lower the security risk of exposure of the objects stored in the bucket. An even better way to do this is to use storage integration so you don’t need to pass the credentials by the users, and can lock down the stages to a specific set of buckets.

 

Eliminate stages available to the public role

Stages which are owned by the public role in Snowflake are available to all users within the account. That means that all users have access to these objects. This can have unintended consequences, like access to sensitive data by users the data is not intended for. It is best to avoid having stage access to the public role (or to other widespread roles you don’t want to have access to the raw data). Ideally, you should have dedicated roles handling data ingestion as part of your data operations.

 

The following command (which requires access to the Snowflake account usage schema) will show you all stages (regardless of type) that are owned by the public role:

SELECT stage_name, stage_catalog, stage_schema, stage_type, stage_url, created
FROM snowflake.account_usage.stages
WHERE STAGE_OWNER = 'PUBLIC'
AND DELETED IS NULL
ORDER BY CREATED;

You can also directly view the stage files from Snowflake without importing them into tables by using the following command, (this example pulls only the 2 first columns from the file):

// lists the 2 first columns from the file
SELECT $1, $2 FROM @<stage name>/<path>;

Eliminate stages left open

Another common issue is a stage holding data, possibly sensitive, that is left stale after it’s no longer needed. As an example, data analysts from Sandwich Engineering are uploading a list of customers for analysis of the different ingredients people want to avoid for allergy reasons, and they want to conduct this as an ad-hoc project. They upload the data to a stage, then copy it to a table in Snowflake and run the analysis, joined with other tables in the account. Once this ad-hoc project is over, the table is dropped, and everybody forgets about this project. However, in many cases, the stages are not removed, and the files in them may contain sensitive data. This is why stages should be monitored, as part of a healthy DataSecOps mindset.

 

The following SQL query will help Sandwich Engineering (and you) find the open external stages, sorted from oldest to newest:

SELECT stage_name, stage_catalog, stage_schema, stage_type, stage_url, created
FROM snowflake.account_usage.stages
WHERE DELETED IS NULL
ORDER BY CREATED

Monitoring import and export operations

Another important security aspect is to keep in mind (and monitoring) that stages can also be used for exporting data from Snowflake, making it something we would want to monitor closely. Ideally, we need to make sure that only a handful of specific roles should have the privileges to do so, but regardless, this should also be monitored. An example of such monitoring is to watch the query_history view in Snowflake Account Usage for such operations.

 

For example, the following command retrieves a list of the copy into operations done, which can be customized to exclude specific roles or locations:

SELECT start_time, user_name, role_name, query_text
FROM snowflake.account_usage.query_history
WHERE query_type = 'COPY'
ORDER BY start_time DESC;

 

Conclusion

Data ingestion should be handled with care as part of the organization’s DataSecOps mindset. In the same way that you don’t leave unattended S3 buckets, you should not leave unattended stages (which are in many instances the same thing, as stages are often kept in S3 buckets). The main challenge with stages is that they’re often “off the radar”, as, unlike tables, they don’t appear as part of the database schema or data inventories.

 

For more information about how to achieve secure data operations over Snowflake and other data platforms in a simple and scalable way, schedule a demo to see how Satori can help you.