It took a while for data to follow applications and move to the cloud, at least en masse. But cloud adoption, for data as well as applications, is now a done deal. Sure, it’s not 100% of organizations or 100% of the data, but data is in the cloud. And I’m not just talking about the RDBMSs, key-value stores, and other databases that moved to the cloud to support migrating applications.
The elasticity and the ability to store and process extremely large amounts of data without a prior investment in servers have been among the best catalysts for data-driven innovation in the last decade. It’s no wonder that organizations are moving or building large-scale data warehouses, data lakes, and data lakehouses in the cloud. In many cases (such as with Redshift, BigQuery, Athena, and Snowflake), data consumers only have to write “select” queries to pull data from huge tables, as if they were querying relatively small databases. In other words, with quite a basic skillset, aided by the capabilities of powerful BI tools, vast numbers of people can make use of the organization’s data, in what’s called “data democratization.”
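To make that “select-only” skillset concrete, here is a minimal sketch in Python. It uses the standard library’s sqlite3 purely as a stand-in for a cloud warehouse connection, and the reservations table and its columns are invented for illustration:

```python
import sqlite3

# Stand-in for a cloud warehouse (Redshift, BigQuery, Athena, Snowflake);
# sqlite3 is used only so this example is self-contained and runnable.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations (hotel TEXT, nights INTEGER, cancelled INTEGER)"
)
conn.executemany(
    "INSERT INTO reservations VALUES (?, ?, ?)",
    [("north", 2, 0), ("north", 1, 1), ("south", 3, 0)],
)

# The consumer's entire skillset requirement: a plain SELECT, identical in
# shape whether the table holds three rows or three billion.
rows = conn.execute(
    "SELECT hotel, COUNT(*) AS stays, SUM(cancelled) AS cancellations "
    "FROM reservations GROUP BY hotel ORDER BY hotel"
).fetchall()
print(rows)  # [('north', 2, 1), ('south', 1, 0)]
```

The point of the sketch is that the query itself carries no infrastructure knowledge; the elasticity lives entirely behind the connection.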
The Need for DataOps Processes

These processes, in which more people and teams continuously add data to the organization’s datastores and consume it, are causing data teams to evolve and adopt a DataOps mindset. This means that data ingestion, preparation, processing, and consumption are done in a more agile way, which requires the teams handling data, such as data engineering teams with different skill sets, to have more skills in scripting, automation, testing, integration, and production deployment.
Let’s take an example. Suppose a data scientist at a hotel chain wanted to know which features predict that a guest will cancel their stay in the 24 hours prior to arrival; such a prediction would let the chain do a lot (optimize processes, plan capacity better, etc.). To build it, they would need a lot of data to learn from. This data may be held by several different departments, and even across different geographies.
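As an illustration of that cross-department data assembly step, here is a minimal pure-Python sketch; every field name and value is hypothetical, and a real project would of course pull these feeds from separate systems rather than inline literals:

```python
# Hypothetical inputs: the reservations feed might live with the booking
# team, and loyalty data with marketing, possibly in different regions.
reservations = [  # from the booking department
    {"guest_id": 1, "lead_time_days": 40, "rate_type": "flexible",
     "cancelled_late": 1},
    {"guest_id": 2, "lead_time_days": 3, "rate_type": "nonrefundable",
     "cancelled_late": 0},
]
loyalty = {  # from the marketing department, keyed by guest_id
    1: {"tier": "gold", "past_cancellations": 2},
    2: {"tier": "none", "past_cancellations": 0},
}

# Join the two sources into a single feature table a model could learn from;
# guests missing from the loyalty feed get neutral defaults.
features = [
    {**r, **loyalty.get(r["guest_id"],
                        {"tier": "none", "past_cancellations": 0})}
    for r in reservations
]
print(features[0]["past_cancellations"])  # 2
```

Trivial as the join is, getting permission to run it across departments is exactly where the time goes in a non-democratized organization.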
Before going through a data democratization process, answering such questions would take a lot of time and resources, and many parts of the project would require manual work. In most cases, this would make it a one-time effort to understand and learn from.
In a data-democratized organization, the data science team may get to work within a couple of days, or even the same day, depending on the company’s data access maturity level. Moreover, this can easily become a continuous project, to the point where guests making a reservation might get different information or rates based on some of these features. The hotel chain now gets a lot of additional value, but its data operations need to enable such activities.
What is DataSecOps?

DataSecOps is an evolution in the way organizations treat security as part of their data operations. It is an understanding that security should be a continuous part of the data operations processes and not something that is added as an afterthought. In fact, DataSecOps should be viewed as the enabler of data democratization processes.
Security is not something that can be done in an ad hoc way, once a year or once a quarter: data changes at a much faster pace, more consumers are constantly added, and data access keeps changing. Just as product cycles in application development became much shorter, so did data processing cycles (from the time a team wants to collect, process, or analyze data until they are able to do so).
Data democratization means that more people are able to access more data; if security is not a constant part of operations, the risk that such broad data exposure poses to the organization would be too high.
How to Successfully Enable DataSecOps

Production data needs separate staging and testing environments, so that changes are automated and tested before reaching production, and so that this separation helps ensure production data remains secure.
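One way such a separation can look in practice, sketched in Python under the assumption that email addresses are the PII to protect (the masking scheme, field names, and data are illustrative, not a prescription):

```python
import hashlib

# Sketch of one DataSecOps practice: production data is masked before it is
# copied into staging, and an automated check verifies no raw PII leaks
# through. All records and field names here are hypothetical.
def mask_email(email: str) -> str:
    # Deterministic pseudonym, so joins on email still work in staging.
    digest = hashlib.sha256(email.encode()).hexdigest()[:12]
    return f"user_{digest}@masked.example"

production_rows = [
    {"guest_id": 1, "email": "ada@example.com"},
    {"guest_id": 2, "email": "grace@example.com"},
]

staging_rows = [
    {**row, "email": mask_email(row["email"])} for row in production_rows
]

# The automated gate: fail the copy pipeline if any raw email survives.
leaked = [r for r in staging_rows if not r["email"].endswith("@masked.example")]
assert not leaked, f"PII leaked into staging: {leaked}"
```

The check at the end is the part that makes this DataSecOps rather than a one-off script: it runs on every copy, not once a quarter.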
A DataSecOps mindset is an inclusive one in terms of the teams dealing with data, and it may involve more stakeholders than DevOps processes do. These still include engineering, DevOps, and IT teams, but they may now also include data stewards and data owners from marketing, finance, customer support, sales, and many other teams. This means that an understanding of the importance of security (and specifically data security) must be shared by a larger set of people. It is thus even more important than in DevOps to have a collaborative framework, one where security is not an issue for just the security teams but a concern for everybody; otherwise, it can become a showstopper.
Shared Data Ownership, Shared Responsibility

DataSecOps means that not only is there a shared mindset around security but also that there is a shared responsibility regarding security between different teams, whenever data is involved.
DataSecOps is also the understanding that many organizations now handle a lot of sensitive data, and that this information in the wrong hands can cause a lot of harm. Such data is not just a huge asset for the company but also a huge liability, because of privacy and data protection regulations as well as the risk of data exposure.
DataSecOps is also the understanding of the importance of time-to-value in data-driven organizations, and that security can’t be a delaying factor in the data processing lifecycle. For example, clear guidelines around risk vs. value should be established so that data can be used without needing specific approvals for every project.
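As a sketch of what such pre-approved guidelines could look like in code, here is a hypothetical sensitivity-tier policy table; the tier names and rules are invented for illustration and are not a standard:

```python
# Hypothetical pre-approved guidelines: each sensitivity tier maps to a rule
# that can be applied without a per-project approval meeting.
POLICY = {
    "public":       {"access": "anyone",         "masking": None},
    "internal":     {"access": "any employee",   "masking": None},
    "confidential": {"access": "data consumers", "masking": "pseudonymize PII"},
    "restricted":   {"access": "named owners",   "masking": "deny by default"},
}

def access_rule(sensitivity: str) -> dict:
    # Unknown or unclassified tiers fall back to the most restrictive rule.
    return POLICY.get(sensitivity, POLICY["restricted"])

print(access_rule("confidential")["masking"])  # pseudonymize PII
```

The design point is the default: anything not explicitly classified is treated as restricted, so speed for common cases never comes at the cost of accidental exposure.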
On the other hand, DataSecOps is also the understanding that even though data and data-driven value are important to companies, and everybody wants to get more out of the data and finish projects faster, there shouldn’t be careless compromises on the level of security.
The Security Part of DataSecOps

DataSecOps asserts that good and agile data governance is part of healthy and secure data operations. This means that data should have clear owners and that it should be made accessible in a secure but simple way.
Furthermore, DataSecOps acknowledges that automation and testing are a strong part of what separates a successful and secure data operation from a failing one. Because small teams handle a large volume of data operations, manual work means either a bottleneck or increased risk.
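Here is a tiny Python sketch of the kind of automated check that replaces manual review, with hypothetical grant records and group names; a real version would read grants from the warehouse’s metadata rather than a literal list:

```python
# Hypothetical grant records: which group can read which table, and whether
# the table is classified as sensitive.
grants = [
    {"table": "reservations", "grantee": "analysts",      "sensitive": True},
    {"table": "guest_pii",    "grantee": "all_employees", "sensitive": True},
    {"table": "room_catalog", "grantee": "all_employees", "sensitive": False},
]

# Groups considered too broad to hold sensitive data (illustrative names).
BROAD_GROUPS = {"all_employees", "public"}

def audit(grants):
    # Return every sensitive table that is granted to a broad group.
    return [g["table"] for g in grants
            if g["sensitive"] and g["grantee"] in BROAD_GROUPS]

violations = audit(grants)
print(violations)  # ['guest_pii']
```

Run on a schedule or on every permission change, a check like this scales with the data estate in a way a quarterly manual review cannot.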
Finally, DataSecOps is holistic, as it deals with every process around data—not only when certain teams are involved. For example, it is not only about the data engineering team and does not mean that once another team pulls data they are “out of scope” for DataSecOps processes.
If you believe there are other important ingredients of DataSecOps that I’ve overlooked, or if you have any comments, I’d be happy to hear them. Finally, I’d like to suggest the following definition of DataSecOps:
An agile, holistic, security-embedded approach to coordination of the ever-changing data and its users, aimed at delivering quick data-to-value, while keeping data private, safe and well-governed.