Analyzing Data Access Logs? Here is What You’re Missing
I appreciate data access logs as much as the next person. I would even say that I appreciate logs more than most, especially when using them analytically to provide a more complete picture. That said, in this blog post I will discuss why relying solely on data access logs can be insufficient and why a universal data access control layer is so appealing. I will focus my examples on cloud data warehouses, but the same concepts apply, with minor variations, to databases as well.
Data access log content
“Native logs”, or data access logs, are generated by the database engine and record information about database transactions. The main questions they answer are:
When did the transaction occur? (usually the start and end times).
Who was accessing the data? The answer is usually the user who sent the query to the database, although it may contain additional information including the client application used, the client IP address, or other identifying information. Sometimes, the log also contains information about the role used.
What data was accessed? In most cases, this means the text of the query that was sent to the engine.
Were there any errors issued for the transaction (with varying levels of details)?
How much data was scanned, and what was the timing of specific parts of the query? The log provides this operational data, as well as other details, which can be useful for optimizations and cost analytics, especially when there is a pay-per-query element (e.g. in the Snowflake and BigQuery data warehouses or in the AWS Athena query engine).
As the hedged wording of these points suggests, data access logs are not standardized, and different databases record different data. The level of granularity also varies, which can be an added challenge, especially when an organization has several different data stores and is trying to obtain a unified view or report of organization-wide data access.
Furthermore, data access logs are not always “on by default”. In some cases they have to be set up and configured, which includes “babysitting” them by setting up dedicated ETL processes and often brings hidden costs such as storage and analytic processing. The complexity of maintenance depends on the following factors:
What the organization is planning to do with the data access logs (or, in some unfortunate cases, what it finds out it wants to do retroactively). Is it looking to keep the data access logs as an audit for compliance purposes? Is it trying to use the data access logs as part of a data breach investigation or another type of incident response? Is it trying to optimize costs?
The compliance and regulatory requirements the organization adheres to. These may require that data access logs be kept separate from the data store itself, or be retained for longer than the data store natively provides. They may also require that the log be enriched with additional data.
Common Uses of Data Access Logs
Logs are very important and useful in general. Data access logs are especially critical, as they shed light on data access, and data is, in most cases, the business’s biggest asset (and liability). Here are common uses of data access logs:
To fulfill requirements. We need logs because we are required to have them. Different regulations and compliance frameworks, as well as security frameworks which organizations comply with for legal reasons, commercial reasons, or risk reduction reasons, consistently require retaining access logs. Some of these requirements necessitate that we crunch the data access log for different reports, such as invalid access attempts and administrative access attempts. Other guidelines impose different retention or storage restrictions for the logs, but, regardless of the specific protocol, requiring some type of audit for data access operations is very common. Examples of such requirements are NIST Cybersecurity Framework PR.PT-1, and PCI DSS 4.2.
Incident response and forensic investigations. In these cases, we are trying to understand more about events that occurred, and a log that records those events is often a key element in the investigation. The investigation may be performed simply to confirm that a certain incident was contained to one environment and did not have further implications (e.g. an endpoint was compromised, and we want to make sure that the credentials were not used to access the data warehouse), or it may be part of an extensive data breach investigation which is trying to establish the exact impact.
Gaining visibility. Logs can help create dashboards or reporting capabilities on an otherwise “black box” system, revealing who the active users and roles are and what actions are being taken. In most cases, this requires effort to transform the millions of lines in the logs into something that highlights and quantifies the main activities that took place.
Reduce over-permissions. By analyzing data access logs, you can answer questions such as, “Which users are exposed to data that they are not actually using?” and then reduce this risk by removing the unused access. Depending on the desired depth, this process can require a significant investment in data analysis. (At Satori, we recently added this as an out-of-the-box capability for our customers who use Snowflake.)
Proactively find threats. By analyzing logs, you can locate anomalous behavior by data consumers which may indicate a security risk.
Operational efficiency. Analyzing the costs can be valuable, especially in pay-per-query engines, although it is not always straightforward. Cost analysis is sometimes performed for internal billing of different teams accessing the data, to find anomalies, and to correct expensive data consumption habits.
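To make one of these uses concrete, here is a minimal sketch of the over-permissions analysis, assuming you already have two inputs: the tables each user has been granted, and the (user, table) pairs observed in the access log over a review window. The user names, table names, and the `unused_grants` helper are all hypothetical, and extracting reliable table names from a real log is a separate problem.

```python
from collections import defaultdict

# Hypothetical inputs: granted tables per user, and (user, table) pairs
# observed in the data access log during the review window.
granted = {
    "ben": {"sales.orders", "sales.customers", "hr.salaries"},
    "dana": {"sales.orders"},
}
access_log = [
    ("ben", "sales.orders"),
    ("ben", "sales.orders"),
    ("dana", "sales.orders"),
]


def unused_grants(granted, access_log):
    """Return, per user, the granted tables that never appear in the log."""
    used = defaultdict(set)
    for user, table in access_log:
        used[user].add(table)
    # Set difference: what the user may access minus what they did access.
    return {user: tables - used[user] for user, tables in granted.items()}


report = unused_grants(granted, access_log)
```

In this toy data, `report["ben"]` contains the two tables Ben can read but never queried, which makes them candidates for review rather than automatic revocation.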
Native Data Access Logs Are Insufficient
Data access logs are a great thing to have. The problem is not the logs themselves, but the belief that they contain information which they do not, something you may only discover in retrospect, when you are already standing on thin ice. Here are the main disadvantages of native access logs:
Not enough information
Sometimes the log does not contain information you expect to find, depending on the exact platform and what you are looking for. For example, sometimes the data access user is not actually the data consumer but instead a generic user employed by an analytic framework. Finding out who was actually sending the query may require correlating logs from other systems, which is sometimes very complex or even impossible.
Furthermore, in most cases, the data access log does not contain the actual locations the data was pulled from (databases, schemas, tables), and working this out from the query is a difficult task, as some data consumers are not running a “SELECT * FROM table”, but rather a 1,000-line analytic query including multiple subqueries.
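As a small illustration of why this is hard, here is a deliberately naive sketch that treats whatever follows FROM or JOIN as a table name. The regular expression, the `naive_tables` helper, and the sample queries are my own assumptions for the example, not how any particular product parses SQL.

```python
import re

# Naive heuristic: whatever identifier follows FROM or JOIN is a table.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def naive_tables(sql):
    """Return the set of identifiers that directly follow FROM or JOIN."""
    return {name.lower() for name in TABLE_REF.findall(sql)}


simple = "SELECT * FROM sales.orders"

# Hypothetical analytic query using a common table expression (CTE).
with_cte = """
WITH recent AS (
    SELECT * FROM sales.orders WHERE order_date > '2021-01-01'
)
SELECT c.name, r.total
FROM recent r
JOIN sales.customers c ON c.id = r.customer_id
"""

simple_tables = naive_tables(simple)
cte_tables = naive_tables(with_cte)
```

The heuristic works for the simple query, but for the second one it also reports `recent`, which is a CTE name and not a table in the data store. Views, quoted identifiers, and dialect-specific syntax make the problem worse, which is why a real solution needs a proper SQL parser for each dialect.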
Multiple data stores
Collecting logs from different data stores sometimes requires substantial work to unify the logs, either for audit reporting or to answer questions in a reasonable manner. Such questions can include, “What tables did Ben access, across all of our data stores?” or “What were all queries to any table containing customer information?”
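A minimal sketch of what that unification work looks like, assuming two data stores whose logs use different field names, user formats, and timestamp formats. Every field name here (`USER_NAME`, `principal`, `ts_epoch`, and so on) is invented for the example, and the sketch assumes the accessed tables are already available, which is often not the case in practice.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessRecord:
    """One normalized data access event, independent of the source store."""
    store: str
    user: str
    started_at: datetime
    query: str
    tables: tuple


def from_store_a(row):
    # Hypothetical store A: upper-case keys, ISO 8601 timestamps.
    return AccessRecord(
        store="store_a",
        user=row["USER_NAME"].lower(),
        started_at=datetime.fromisoformat(row["START_TIME"]),
        query=row["QUERY_TEXT"],
        tables=tuple(t.lower() for t in row["TABLES"]),
    )


def from_store_b(row):
    # Hypothetical store B: e-mail style principals, epoch timestamps.
    return AccessRecord(
        store="store_b",
        user=row["principal"].split("@")[0].lower(),
        started_at=datetime.fromtimestamp(row["ts_epoch"], tz=timezone.utc),
        query=row["sql"],
        tables=tuple(t.lower() for t in row["referenced_tables"]),
    )


def tables_accessed_by(records, user):
    """Answer: which (store, table) pairs did this user touch?"""
    return {(r.store, t) for r in records if r.user == user for t in r.tables}


records = [
    from_store_a({
        "USER_NAME": "BEN",
        "START_TIME": "2021-03-01T10:15:00+00:00",
        "QUERY_TEXT": "SELECT * FROM sales.orders",
        "TABLES": ["SALES.ORDERS"],
    }),
    from_store_b({
        "principal": "ben@example.com",
        "ts_epoch": 1614593700,
        "sql": "SELECT * FROM crm.customers",
        "referenced_tables": ["crm.customers"],
    }),
]
ben_tables = tables_accessed_by(records, "ben")
```

Even in this toy version, each store needs its own adapter, and the mapping from a store-specific principal to a single person is a guess that would need to be backed by an identity source.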
Native data access logs are lacking context
When a data consumer connects to a data store and sends queries, there is often crucial context that can only be understood by inspecting the entire transaction. This includes information about the user added from the client application or identity provider, as well as metadata which can only be comprehended by analyzing the data retrieved, not only the query itself. This can indicate, for example, that the data retrieved contains certain types of personally identifiable information (PII), or other significant data types.
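To show the kind of context that only the returned data can provide, here is a simplified sketch that tags result columns by matching values against two patterns. The patterns, labels, and the `classify_columns` helper are illustrative assumptions; real classification is considerably more involved and this is not a description of any specific product's method.

```python
import re

# Simplified, illustrative patterns; real-world detection needs far more care.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_columns(columns, rows):
    """Map each column name to the set of data types seen in its values."""
    found = {name: set() for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            if not isinstance(value, str):
                continue  # only string values are pattern-matched here
            for label, pattern in PATTERNS.items():
                if pattern.match(value):
                    found[name].add(label)
    return found


# Hypothetical result set for "SELECT id, contact, note FROM tickets".
columns = ["id", "contact", "note"]
rows = [
    (1, "ben@example.com", "renewal due"),
    (2, "dana@example.com", "123-45-6789"),
]
tags = classify_columns(columns, rows)
```

Nothing in the query text hints that the free-text `note` column can carry a social security number; that only becomes visible by looking at the values that came back.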
Using logs is inherently an offline operation
I was thinking of not even stating the obvious, but this fact is not always clear when securing data access, so it is important to point it out. Logs are records of historical events, and you cannot use them to effectively enforce data access restrictions. This means that even if you analyze the logs and gather the necessary context from other sources, you cannot act in real time: by the time an event appears in the log, the access has already happened.
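The difference can be sketched in a few lines, assuming a hypothetical policy that maps roles to the tables they may read. Both functions and the policy itself are made up for the example.

```python
# Hypothetical policy: which roles may read which tables.
ALLOWED = {
    "analyst": {"sales.orders"},
    "hr_admin": {"sales.orders", "hr.salaries"},
}


def audit_after_the_fact(log):
    """Offline: flag violations among events that have already happened."""
    return [e for e in log if e["table"] not in ALLOWED.get(e["role"], set())]


def authorize(role, table):
    """Inline: decide before the query ever reaches the data store."""
    return table in ALLOWED.get(role, set())


log = [
    {"role": "analyst", "table": "hr.salaries"},
    {"role": "hr_admin", "table": "hr.salaries"},
]
violations = audit_after_the_fact(log)
blocked = not authorize("analyst", "hr.salaries")
```

The audit function can only report that the analyst has already read the salaries table, while the inline check has the chance to stop the same request before any data is returned.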
What Does Satori Do Differently?
These shortcomings call for improvements, and we, at Satori, as people who are passionate about helping organizations secure their data, are doing just that. Since our architecture is that of a context-aware layer controlling access to the data stores in real time, Satori addresses many of these flaws so that organizations can focus on actually getting value out of the data.
Adding context to data access
Satori is also integrated with other systems, such as identity providers, and adds contextual information to the data being logged as part of the transaction. We also analyze the data retrieved from the database to add further context, such as the types of data being accessed.
Data access compliance
Satori takes care of auditing data access separately from the data store itself, which prevents many complications when meeting compliance requirements and also provides much more value. Because Satori also identifies and adds context to the data, the audits and reports we provide can give information that otherwise takes extensive effort to generate, such as answers to questions like, “Who was accessing which PII type?” and “Who in the organization made changes to PII?” In addition, our audit logs can be easily exported, so they can be included in reports or aggregated.
Multiple data stores
With Satori, you gain a unified experience managing access to multiple data stores, as you decouple the access control (as well as access control logging) from the data store infrastructure. That way, setting access control policies, as well as auditing them, is unified across different platforms (e.g. applying the same access policies and logs for Redshift and Snowflake). This is especially useful for organizations using multiple data stores or switching from one to another.
The native log is still in there
Since Satori is transparent and does not disrupt normal database activities, you still receive all the value you need, which is sometimes platform specific, from the native data access logs.
If you are interested in learning more or seeing this in action, please contact us to schedule a demo.