Organizations are complex, evolve rapidly, and continuously seek growth. The data they own and use is often messy, inaccessible, and difficult to make sense of.
Hearing the word “decentralization” usually evokes images of diminished authority and lax controls. In systems of governance especially, decentralization, while considered essential for development, is embraced reluctantly, and the idea of empowering smaller units by devolving resources and authority often meets resistance. Unlike the world at large, the concept is much easier to implement in the domain of data architecture, through the data mesh paradigm.
But before we explain what the data mesh is, it is important to understand why it is required.
The term data mesh was coined in 2019 by Zhamak Dehghani, a computer scientist who is currently the founder and CEO of Nextdata. She explains its significance as follows:
“Data mesh is the nudge that puts us on a new trajectory in how we approach data: how we imagine data, how we capture and share it, and how we create value from it, at scale and in the fields of analytics and AI. This new trajectory moves us away from the centralization of data and its ownership toward a decentralized model. It aims to enable organizations to get value from data at scale, despite the messiness and organizational complexity.”
Source: Preface, Data Mesh: Delivering Data-Driven Value at Scale by Zhamak Dehghani.
In other words, data mesh is needed because it allows organizations to scale their data management solutions. It also eases the struggles of all data users – analysts, scientists, and engineers – by providing timely access to high-quality data while keeping them in close contact with the business.
Analytical data is at the core of data-driven decision making. It is used for predictive use cases, visualizations, reports, and training machine learning models that add intelligence to the business. Analytical data is the stimulus for organizations moving from gut-based to data-driven insights, and it powers the technology of the future. Data mesh is a decentralized approach to sharing, accessing, and managing analytical data in large-scale organizations. (Dehghani, 2021, p. 3)
Data mesh enables organizations to get value from analytical data at scale, and it alters the way they manage, use, and own that data.
Data mesh rests on four principles, which are designed to deliver its objectives: innovation, agility, and gaining value from data at scale.
The first pillar of data mesh, domain ownership, puts domain experts in charge of their data. According to this principle, the domain experts not only control the data but also leverage it. They have context and know the business priorities, so they can maintain data quality and make the data readily available for consumption throughout the organization.
The second principle, data as a product, ensures that data is always measured by the value it brings to the people who use it. While domain experts oversee and control data, product thinking is applied to ensure the data roadmap meets the accessibility, governance, and usability needs of the organization.
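Treating data as a product implies that each dataset ships with an explicit contract: an owner, a public schema, and service-level expectations. The sketch below illustrates one way such a descriptor might look; the field names and example values are illustrative assumptions, not a standard data mesh schema.

```python
from dataclasses import dataclass

# A minimal sketch of a data product descriptor. Field names are
# illustrative assumptions, not part of any standard specification.
@dataclass
class DataProduct:
    name: str
    domain: str            # owning business domain
    owner: str             # accountable domain team contact
    output_port: str       # where consumers read the data, e.g. a table or topic
    schema: dict           # column name -> type: the product's public contract
    freshness_sla_hours: int = 24   # how stale the data is allowed to be

    def describe(self) -> str:
        return f"{self.domain}/{self.name} owned by {self.owner}"

# Example: a hypothetical "orders" product published by a sales domain.
orders = DataProduct(
    name="orders",
    domain="sales",
    owner="sales-data-team@example.com",
    output_port="warehouse.sales.orders_v1",
    schema={"order_id": "string", "amount": "decimal", "placed_at": "timestamp"},
)
print(orders.describe())
```

Making the schema and SLA part of the product, rather than tribal knowledge, is what lets consumers in other domains rely on the data without contacting the producing team first.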
The third principle calls for a self-serve data platform. Self-service data infrastructure enables teams to access and manage data products without depending on centralized data teams. Access to data by all stakeholders becomes quick and seamless, since a self-serve infrastructure removes the barriers that slow down the creation and usage of data products.
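One concrete self-serve capability is a shared catalog through which domain teams publish products themselves and consumers discover them directly, instead of filing tickets with a central team. The toy sketch below shows the idea; the API shape and storage locations are assumptions for illustration.

```python
# Toy sketch of a self-serve catalog: domain teams register data products
# themselves, and any consumer can search without a central intermediary.
# Class and method names are illustrative assumptions.
class DataCatalog:
    def __init__(self):
        self._products = {}  # "domain/name" -> physical location

    def register(self, domain: str, name: str, location: str) -> None:
        """Domain teams publish their own products - no central gatekeeper."""
        self._products[f"{domain}/{name}"] = location

    def discover(self, keyword: str) -> list:
        """Consumers search the catalog directly."""
        return [pid for pid in sorted(self._products) if keyword in pid]

catalog = DataCatalog()
catalog.register("radiology", "scan_reports", "s3://mesh/radiology/scan_reports")
catalog.register("genomics", "variant_calls", "s3://mesh/genomics/variant_calls")
print(catalog.discover("radiology"))  # ['radiology/scan_reports']
```

In a real platform the catalog would also handle access requests, lineage, and monitoring, but the shift is the same: publishing and discovery become self-service operations rather than requests to a central team.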
The last principle, federated computational governance (FCG), creates a data governance operating model with a team composed of domain representatives and experts on legal, compliance, and security matters. Computational governance means using software code to automate governance processes and ensure compliance with policies and regulations. FCG helps create a more transparent, accountable, and effective decision-making process while enabling agility, innovation, and efficiency and reducing the risk of errors.
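"Governance as code" can be as simple as a set of global policy checks that run automatically against every data product's metadata. The sketch below shows the pattern under assumed policies (owner required, no raw PII exposed, retention declared); the rules and metadata fields are hypothetical examples, not a prescribed policy set.

```python
# Sketch of computational governance: global policies expressed as code and
# applied automatically to every data product's metadata. The specific rules
# and field names below are illustrative assumptions.

PII_COLUMNS = {"ssn", "date_of_birth", "patient_name"}  # example denylist

def check_policies(product: dict) -> list:
    """Return the list of policy violations for one data product."""
    violations = []
    # Policy 1: every product must name an accountable owner.
    if not product.get("owner"):
        violations.append("missing owner")
    # Policy 2: raw PII columns must not be exposed on output ports.
    exposed_pii = PII_COLUMNS & set(product.get("columns", []))
    if exposed_pii:
        violations.append(f"exposes PII columns: {sorted(exposed_pii)}")
    # Policy 3: a retention period is required for compliance audits.
    if "retention_days" not in product:
        violations.append("missing retention policy")
    return violations

# Example: a hypothetical product that trips two of the three policies.
product = {
    "name": "patient_visits",
    "owner": "oncology-domain-team",
    "columns": ["visit_id", "ssn", "diagnosis_code"],
}
print(check_policies(product))
```

Because the policies are code, they can run in CI against every product the domains publish, which is how the federated model keeps global standards enforced without a central review bottleneck.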
Years ago, organizations used to have on-premises data architectures. Generation 1 (Gen 1) on-premises data architectures typically refer to the early stages of data infrastructure where organizations store and manage their data within their own physical premises, such as data centers or servers located on-site. In this architecture, data is often stored in relational databases, and traditional Extract, Transform, Load (ETL) processes are used for data integration. These architectures are characterized by their reliance on hardware and infrastructure maintained and operated by the organization itself, requiring significant upfront investment and ongoing maintenance. Gen 1 on-premises data architectures may lack the flexibility and scalability associated with more modern cloud-based solutions.
Generation 2 (Gen 2) cloud data architectures represent an evolution beyond on-premises solutions, leveraging cloud computing platforms for storing, processing, and managing data. In Gen 2 cloud data architectures, organizations utilize cloud-based storage services, computing resources, and scalable databases. This shift to the cloud provides advantages such as flexibility, scalability, and reduced infrastructure management overhead. Gen 2 architectures often incorporate distributed computing technologies, Big Data tools, and serverless computing models. This generation embraces the cloud's elasticity, enabling organizations to adapt quickly to changing data requirements and achieve cost-efficiency through pay-as-you-go models.
Generation 3 (Gen 3) hybrid data architectures build upon the advancements of cloud technologies while incorporating a hybrid approach that combines both on-premises and cloud-based components. In Gen 3, organizations seamlessly integrate their on-premises data infrastructure with cloud services to create a cohesive and flexible data environment. This architecture enables the movement of data between on-premises and cloud environments, allowing organizations to leverage the benefits of both.
Key characteristics of Gen 3 hybrid data architectures include:
1. Data Mobility
2. Interoperability
3. Scalability and Flexibility
4. Security and Compliance
5. Cost Optimization
Overall, Gen 3 hybrid data architectures offer a balanced approach that combines the strengths of on-premises and cloud solutions, allowing organizations to create a dynamic and adaptable data infrastructure.
[Diagram: a current data mesh architecture. Image source: https://www.datamesh-architecture.com/]
At Memorial Sloan Kettering Cancer Center (MSK), data projects that once took the center weeks or even months now sometimes take hours or days, and faster access to data speeds up the research process.
Before adopting data mesh as a data management approach, MSK struggled to derive timely insights from large amounts of data while being restricted by regulations designed to keep medical records private. The organization treats and conducts research on more than 400 different types of cancer. Every year, it sees 20,000 inpatient visits and 700,000 outpatient visits. In addition, it must adhere to more than 1,800 research protocols.
Every lab test, visit, and research study translates into data. Without effective management, all that data is meaningless, whether it relates to treating an individual patient or furthering research. These are not simple datasets, and there are regulatory concerns: the data must be handled carefully and patient privacy protected.
Initially, the center took a centralized approach to data governance. Its data was gathered into an on-premises data warehouse, where data engineers and other intermediaries transformed it before transferring it to a data lake. Once the data was in the lake, data managers and software engineers had access to it and could send it to end users on request.
However, this did not make much difference. The data was spread across categories such as clinical, genome research, radiology, and pathology. It was both structured and unstructured: manual inputs needed correction, data was incomplete, different versions of the same data existed, and all of it had to be kept private. Turning this complex data into meaningful datasets was not easy.
Meanwhile, even once all that data had been transformed and placed in the data lake, the request submission and fulfillment procedure was glacial. Furthermore, after data was handed to end users, the center had no control over it. Pasha pointed out that users exchanged Excel files over email, which resulted in redundant datasets and exposed security flaws.
Given all these challenges, the center decided to try something different. They wanted a decentralized architecture.
They wanted to eliminate the cumbersome extract, transform, and load (ETL) pipelines that were needed for each individual project and slowed its data operations. They also wanted a more mature data governance model, an intuitive user interface that would enable more users to consume data, documentation assistance to better organize and monitor data, and more advanced ETL capabilities to reduce manual effort.
Time savings have been the primary advantage of MSK's adoption of data mesh. By removing cumbersome ETL operations and adding self-service features, data users can complete data projects much more quickly.
Thanks to data mesh, data consumers now find it simpler to track and share data across domains. It has also helped build trust, previously absent, between data product owners and consumers.
Data mesh is a game-changer for businesses because it provides a framework for breaking down barriers and enabling business domains to generate and curate data quickly and at an enterprise scale.
By Irfan Umer.