Applying DataSecOps Principles on MLOps

Machine learning has been commoditized in the last decade or so, and I love it. What was once the “dark magic” of classification, clustering, and prediction is no longer magic, nor is it dark. With tools like SciPy, scikit-learn, and TensorFlow, more and more users can and do utilize data for machine learning through simple mechanisms.

Today, many of the people who use ML algorithms with these libraries and tools are not data scientists or data analysts. Instead, they are business users who understand their domains’ business needs and can draw insights from data relatively easily.

 

What Is MLOps?

Machine Learning Operations, or MLOps, is a process aimed at deploying and maintaining machine learning models in production. MLOps is yet another X-Ops field, derived from DevOps, which applies DevOps practices to machine learning models. The concept behind this application is that machine learning models are trained and tested in dedicated environments (in the same way that software is developed in a dedicated environment), and, when algorithms are “production-grade,” they undergo a DevOps-like process to deploy them to production systems.

The same principles of DevOps, such as a focus on automation and the use of incremental and continuous processes, apply to MLOps as well. Machine Learning Operations is becoming a holistic field that applies not only to the deployment of machine learning models to production but also to the processes surrounding it.

 

MLSecOps: Applying DataSecOps Principles to MLOps

The principles of DataSecOps make a lot of sense for organizations with MLOps processes. The main reason is the great similarity between DataOps, which is aimed at driving organizational data flow between data producers and data consumers, and holistic MLOps, which is aimed at turning data into production ML models and algorithms. (This is the TL;DR version. For more information, read our introduction to DataSecOps.)

Embedding security into MLOps processes is important, especially when considering the bigger picture, in which MLOps is defined to include the processes that obtain production data from data producers and apply processes such as retraining to that data.

Let’s take a look at some of the different principles of DataSecOps and how they fit into the world of MLOps:

 

Security Is a Continuous Part of Data Operations, not an Afterthought

Although it should not be, it is often easy to exclude security teams from projects. This exclusion is mostly fueled by a fear that security teams (and similar governance, compliance, and privacy teams) will slow down projects, and by a misconception that bypassing them will prevent delays. This mindset is short-sighted, as it introduces risks ranging from an unplanned blocker set by one of these teams, to an audit failure, to a security incident.

Security is essential in ML operations. For example, when data is gathered for retraining from multiple sources and enriched to generate additional features for an ML model, this complicated process may contaminate non-sensitive data with sensitive data. If security is a continuous consideration in the process, this contamination can be avoided.

The bottom line is that the security team needs to be part of the collaborative team handling such projects: it needs to understand the plans and be integrated at the architecture level.
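To make the contamination example above concrete, here is a minimal sketch of one way to keep track of sensitivity while data is enriched. The level names, the `Dataset` class, and the `enrich` function are all hypothetical and are not part of any particular tool; the idea is simply that a derived dataset inherits the strictest label of anything merged into it.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = {"public": 0, "internal": 1, "sensitive": 2}


@dataclass
class Dataset:
    name: str
    columns: dict  # column name -> sensitivity level


def enrich(base, extra, name):
    """Combine the columns of two datasets into a new one.

    If both datasets carry the same column, the stricter label wins.
    """
    merged = dict(base.columns)
    for column, level in extra.columns.items():
        current = merged.get(column, level)
        merged[column] = max(current, level, key=LEVELS.__getitem__)
    return Dataset(name, merged)


def overall_level(dataset):
    """The dataset as a whole is as sensitive as its most sensitive column."""
    return max(dataset.columns.values(), key=LEVELS.__getitem__, default="public")


metrics = Dataset("metrics", {"latency_ms": "public", "region": "internal"})
crm = Dataset("crm", {"email": "sensitive", "region": "internal"})
features = enrich(metrics, crm, "features")
```

In this sketch, `overall_level(features)` reports that the enriched feature set is now sensitive, even though the dataset it started from was not, which is exactly the point at which a security review should be triggered.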

 

Always Prefer Continuous, Iterative Processes to Ad-Hoc Projects

Let’s suppose that an organization has MLOps processes that include ingestion of data from sources for the creation of production models. Since data changes frequently, newly pulled data may contain sensitive data that did not exist before and that the existing cleaning steps do not handle.

If, for example, sensitive data is mapped only in an ad-hoc project for a yearly audit (surprise, surprise), the project may discover the pollution in the data used for ML models too late. It is therefore important to plan processes so that activities such as sensitive data discovery are continuous.
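As a rough illustration of what continuous discovery could look like, the sketch below scans every ingested batch rather than waiting for a yearly audit. The two regular expressions and the `scan_batch` function are assumptions made for this example; real sensitive data discovery covers far more data types and is considerably more robust than pattern matching.

```python
import re

# Hypothetical, deliberately simple patterns used only for illustration.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_batch(rows):
    """Return {column: set of pattern names} for string values that match."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # this sketch only inspects text values
            for label, pattern in PATTERNS.items():
                if pattern.search(value):
                    findings.setdefault(column, set()).add(label)
    return findings


batch = [
    {"user_id": 1, "note": "call me at jane@example.com"},
    {"user_id": 2, "note": "renewed the plan"},
]
findings = scan_batch(batch)  # the free-text "note" column is flagged
```

Running a check like this on each ingestion, and failing or quarantining the batch when `findings` is not empty, surfaces new sensitive data when it first appears instead of months later.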

 

Separation of Environments, Testing, and Automation

This is a pretty basic concept in the DevOps world, but, sometimes, the basics are overlooked. For example, we need to answer questions like: are models trained and data processed in a separate environment, or is data downloaded to machines manually? Are there tests and automations in place to ensure that the data is sanitized and that data access is only available to the applications that should have access?
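The questions above can be turned into automated checks that run before data reaches the training environment. The following is only a sketch: the approved column list, the access policy, and the application names are hypothetical placeholders, and in practice they would come from your own governance tooling rather than from constants in code.

```python
# Hypothetical allowlist of columns approved for the training environment.
APPROVED_COLUMNS = {"latency_ms", "region", "plan_tier"}

# Hypothetical access policy: dataset name -> applications allowed to read it.
ACCESS_POLICY = {"training_data": {"trainer-service"}}


def unapproved_columns(columns):
    """Return, sorted, any columns that are not on the approved list."""
    return sorted(set(columns) - APPROVED_COLUMNS)


def can_access(app, dataset):
    """True only if the application is explicitly granted access to the dataset."""
    return app in ACCESS_POLICY.get(dataset, set())


def gate(columns, app, dataset):
    """Raise if the data is not sanitized or the caller should not have access."""
    extra = unapproved_columns(columns)
    if extra:
        raise ValueError(f"unapproved columns in {dataset}: {extra}")
    if not can_access(app, dataset):
        raise PermissionError(f"{app} may not read {dataset}")


# Passes silently: approved columns, approved application.
gate(["latency_ms", "region"], "trainer-service", "training_data")
```

A gate like this can run as a pipeline step or as part of a test suite, so that a manual download to a personal machine or an unexpected column fails loudly instead of going unnoticed.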

 

Prioritization Is Key, Mostly for Sensitive Data

When an organization has extensive machine learning operations, it often has limited resources for keeping data secure, private, and well-governed. Prioritization therefore needs to be considered. For example, if some ML processes run on customers’ data while others run on performance metrics, it makes sense to focus resources on the former first, as customer data carries greater security and compliance risk.
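One simple way to express this kind of prioritization is to rank ML pipelines by the type of data they touch. The weights, the `Pipeline` fields, and the pipeline names below are assumptions chosen for illustration only; an actual ranking would reflect your own risk and compliance requirements.

```python
from dataclasses import dataclass

# Hypothetical weights: a higher number means the data type deserves attention first.
SENSITIVITY_WEIGHT = {"customer_data": 3, "performance_metrics": 1}
DEFAULT_WEIGHT = 2  # unknown data types are treated as medium priority


@dataclass
class Pipeline:
    name: str
    data_type: str
    consumers: int  # how many applications or users read the output


def prioritize(pipelines):
    """Order pipelines by data sensitivity first, then by number of consumers."""
    return sorted(
        pipelines,
        key=lambda p: (SENSITIVITY_WEIGHT.get(p.data_type, DEFAULT_WEIGHT), p.consumers),
        reverse=True,
    )


pipelines = [
    Pipeline("capacity-forecast", "performance_metrics", consumers=10),
    Pipeline("churn-model", "customer_data", consumers=4),
    Pipeline("ad-hoc-experiment", "unclassified", consumers=1),
]
ranked = [p.name for p in prioritize(pipelines)]
```

Here the pipeline that processes customer data is ranked first even though it has fewer consumers, which matches the reasoning above: sensitivity of the data drives where limited security resources go first.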

 

Conclusion

Though DataSecOps is mainly applied to DataOps processes, it is useful to apply the same concepts to similar MLOps processes to make sure that your data security keeps pace with today’s rapidly changing data.

