Effective Data Protection and Governance in Data Science Environments

We recently hosted a panel featuring Satori Cyber’s advisory board about the overlapping responsibilities of security, privacy and data teams tasked with securing new data science environments. In this post, we’re going to cover the main takeaways provided by our distinguished panel of speakers, which included Andy Roth (CPO at Intuit), Colin Anderson (CISO at Levi Strauss & Co.), Sounil Yu (CISO-in-Residence at YL Ventures) and Eldad Chai (CEO and Co-Founder at Satori Cyber).


The field is evolving

While data protection and data governance aren't new concepts, they’re constantly subjected to new approaches as their fields evolve. The recent upsurge in self-service data models is an excellent example, as these models require a strategic overhaul to effectively secure them.

Andy has identified three major drivers of change in the field of data governance:

  1. Regulatory compliance - There are philosophical differences between the European and American approaches to privacy. The former acknowledges it as a human right, while the latter views it as a vertical requirement, with the exception of CCPA (and WPA).

  2. Customer tech compliance - Today’s customers require that their data remain segregated and be handled with their own delineated set of access policies and use controls.

  3. The strategic value of data - Organizations have caught on that there are much stronger incentives to govern data properly than avoiding fines. The most successful data-driven organizations are the ones that make their data as accessible as possible. Today, maturity tends to occur in organizations that evolve from a mere data science mindset to one focused on data democratization.

Sounil believes that there’s a wide industry push for data governance because of the possibilities it opens up for organizations to drive innovation. In his eyes, effective data governance controls give an organization’s employees the freedom to “swim wherever they’d like in the data lake”.

 

...But it remains highly complex

Data science environments are plagued by multiple layers of complexity: the multitude of teams connecting to data, the variety of supporting data technologies, a host of data silos and data types, and a seemingly unending pool of data access tools.

 

According to Colin, it’s helpful to approach data like water, a substance that flows according to the path of least resistance. 

 

To this end, he points out how the public cloud unleashed a set of new access capabilities beyond reports, which were the traditional gatekeepers of data. Today, many people across an enterprise have direct access to data. In cloud environments, this means that it’s particularly difficult to understand the following:

  1. Where’s the data?

  2. Who’s accessing the data?

  3. What are they doing with it?

  4. What controls are in place to prevent data leaks?

So, while security requirements haven’t changed, the cloud has unleashed a great deal of containment complexity. With the questions above in consideration, the biggest challenge to protecting data lies in “fine-grained” access control - understanding what data specific users have access to.
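To make the idea of “fine-grained” access control a little more concrete, here is a minimal Python sketch. It is not taken from the panel or from any product: the `ColumnPolicy` class, the role names and the column names are all hypothetical, and it only illustrates answering “what data does this user have access to?” at the column level.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnPolicy:
    """Hypothetical policy: which roles may read which columns of a dataset."""
    dataset: str
    # Maps a column name to the set of roles allowed to read it.
    allowed_roles: dict = field(default_factory=dict)

    def readable_columns(self, user_roles):
        """Return the sorted list of columns that any of the given roles may read."""
        roles = set(user_roles)
        return sorted(
            column
            for column, allowed in self.allowed_roles.items()
            if roles & set(allowed)
        )


# Illustrative data only: these names are assumptions, not a real schema.
customers_policy = ColumnPolicy(
    dataset="customers",
    allowed_roles={
        "customer_id": {"analyst", "support", "privacy"},
        "purchase_total": {"analyst"},
        "email": {"support", "privacy"},
    },
)

print(customers_policy.readable_columns(["analyst"]))
```

In this sketch an analyst can read identifiers and purchase totals but not email addresses, which is the kind of per-user, per-field answer that coarse, table-level permissions cannot give.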

 

Sounil agrees, pointing out that this is likely why data is the most difficult asset to secure in an enterprise environment. To make matters worse, data is a moving target and changes all the time. A server, a workstation or a network are somewhat fixed, whereas data is ever-changing. Its very definition can change - take PII, for example.

 

Andy suspects that one of the biggest challenges to enterprise data protection is legacy infrastructure (meaning systems older than 6 months). After investing so much into data lakes and existing architecture, it’s very difficult to rationalize architecting and investing in new ones every time new privacy requirements pass into law.

 

This is why Eldad believes that tying data governance controls to data infrastructure architecture is unsustainable. He is adamant that organizations must decouple data access governance from data infrastructure.
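One way to picture this decoupling is a governance layer that sits in front of every data store and knows nothing about how each store works. The following Python sketch is an illustration under assumptions, not a description of how Satori or any panelist implements it: the backend functions, the `MASKED_FIELDS` set and the `govern` and `query` helpers are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


# Hypothetical backends: each one is just a function that returns rows.
def warehouse_rows() -> List[Row]:
    return [{"name": "Ada", "email": "ada@example.com", "spend": 120}]


def lake_rows() -> List[Row]:
    return [{"name": "Lin", "email": "lin@example.com", "spend": 80}]


# The governance rule lives in one place and is independent of the backends.
MASKED_FIELDS = {"email"}


def govern(rows: List[Row], can_see_pii: bool) -> List[Row]:
    """Return copies of the rows, masking sensitive fields unless PII access is granted."""
    if can_see_pii:
        return [dict(row) for row in rows]
    return [
        {key: ("***" if key in MASKED_FIELDS else value) for key, value in row.items()}
        for row in rows
    ]


def query(backend: Callable[[], List[Row]], can_see_pii: bool) -> List[Row]:
    """Fetch rows from any backend and apply the same governance rule to them."""
    return govern(backend(), can_see_pii)


print(query(warehouse_rows, can_see_pii=False))
print(query(lake_rows, can_see_pii=True))
```

If a new privacy requirement adds another sensitive field, only `MASKED_FIELDS` changes in this sketch; neither backend has to be re-architected.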

 


 

So, what’s the best strategy?

We’ve seen the rapid development of security frameworks and privacy laws over the past few years and can expect the momentum of change to continue in the coming years. This must be taken into consideration when forming a data strategy and each team within an enterprise must be proactive about emerging requirements and regulations.

 

Andy warns against attempts to mitigate risk by doing the bare minimum to satisfy a specific regulation. Down the line, he says, this will only create larger risks. He shares that, in anticipation of future policies, Intuit wisely decided to implement CCPA across all states rather than for California alone.

 

Colin acknowledges that, for global enterprises, applying every regulation is an enormous and difficult endeavor. Security frameworks, he says, continue to evolve, and today’s main drivers are data evolutions and attack surfaces. He shares that Levi’s started with security frameworks before augmenting them according to what was required to be “good corporate citizens”.

 

How can teams align?

Our design partner meetings at Satori rarely take place with a single part of an organization, as multiple teams are often tasked with owning the protection and governance of data science environments. Data, security and privacy teams are all part of a joint effort on this front.

 

Sounil believes that this can lead to communication roadblocks, as each team can only know so much outside of its own scope of expertise. He encourages teams to focus on doing what they’re best at. Data teams are excellent at mining data and deriving analytics out of it. All they require is a clear boundary and guardrails that keep them from stepping on any toes and allow them to drive innovation as quickly as possible. The same holds in the application security space, where security isn’t a developer’s priority: developers simply want to build new features fast.

 

This isn’t to say that teams aren’t aligned over the importance of data protection and security. Colin points out that most cross-team conflicts only take place over 5-10% of issues, and usually over the “how” of securing data. What’s required is a method to reduce the friction, and this can be done by focusing on the 90-95% of issues on which there’s agreement - at the end of the day, everyone wants their company to be successful and its data protected.

 

Harmony can also be achieved by rethinking stringent guards for access, according to Sounil. He points to SharePoint, one of the most restrictive data repositories on the market today, where it’s almost impossible to “share” anything, and the tensions that can arise on its account. He explains that this is what made the concept of data lakes so attractive for data scientists in the first place.

 

Andy shares that Intuit has reached data governance maturity, allowing the company to open new conversations around building a privacy engineering team. He doesn’t believe that trade-offs need to occur between privacy and security, and he argues that what has enabled their success is undoubtedly that their programs are funded and sponsored.

 

Practical advice for companies starting out

Colin advises companies to start by knowing what data they have. Next, it’s important to align on the company’s intention for its data and appreciate the key opportunities that lie in harnessing it. Ask if the main draw is to save costs, improve operations, detect new revenue generation streams, etc.

 

He also warns that data becomes a liability at a certain stage, usually when organizations need to phase in data hygiene and data cleansing. This can be avoided by asking common sense questions like: if you’re at a consumer business, why do you have to retain the PII of a consumer with whom you haven’t interacted for five years? Why do you still need to keep their PII if it generates no business value?
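Colin’s question can be turned into a simple, repeatable check. The Python sketch below is only an illustration: the five-year window comes from his example (approximated here as 5 × 365 days), while the `stale_customer_ids` helper, the record layout and the customer IDs are hypothetical.

```python
from datetime import date, timedelta

# Assumption: five years, as in Colin's example, approximated as 5 * 365 days.
RETENTION_WINDOW = timedelta(days=5 * 365)


def stale_customer_ids(last_interaction, today):
    """Return the sorted IDs of customers whose last interaction predates the retention window."""
    cutoff = today - RETENTION_WINDOW
    return sorted(
        customer_id
        for customer_id, last_seen in last_interaction.items()
        if last_seen < cutoff
    )


# Illustrative records only: customer ID mapped to the date of the last interaction.
records = {
    "c-001": date(2015, 3, 1),
    "c-002": date(2023, 6, 15),
}

print(stale_customer_ids(records, today=date(2024, 1, 1)))
```

Customers flagged this way are candidates for a review of whether their PII still generates any business value, not an automatic deletion list.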

 

Sounil warns that the source of the friction between teams usually lies in the potential liability inherent in data collection and processing. This must remain front and center when designing data infrastructure. Companies should be thoughtful about how data can remain useful while still being ephemeral and privacy-preserving.

 

Andy shares that many big tech companies waste a great deal of energy on cleaning data, despite it being far more effective to integrate this approach into data schemes from the start. To this end, he encourages companies to curate their data, keep data destruction in mind, rethink data minimization and decide what data is really required from the start. The sooner these concepts are embraced, the better, as this will help alleviate the burden of legacy infrastructure. He also warns that privacy isn’t a tradeoff for security and that data governance isn’t only meant to comply with the law - it’s a strategic imperative for companies to thrive.

 

Final words

Colin: “As a security professional, go into discussion with your data teams with a positive mindset. Nine times out of ten, they’re looking for a way to grow the business. Your job is to look out for that ten percent chance when it’s a really bad idea and help the data team protect them from themselves.”

 

Sounil: “Try to avoid the problem altogether and think about the nature of the data. Create guardrails for data science teams and try to minimize changes to it - think about a road, everyone hates construction and detours. Foster harmony by removing security and privacy requirements from the data team’s attention, by making sure that the guardrails are there and that they are clear.”

 

Find the full recording of the panel here