Reshaping The Data Landscape: The Rise of The Mesh

Organizations are complex, evolve rapidly, and continuously seek growth. The data they own and use is often messy, inaccessible, and difficult to make sense of.

Hearing the word “decentralization” usually evokes images of diminished authority and lax controls. In systems of governance especially, decentralization is embraced reluctantly even though it is considered essential for development, yet the idea of empowering smaller units by devolving resources and authority remains popular. Unlike in the world at large, the concept is much easier to implement in the domain of data architecture, through the data mesh paradigm.

But before we explain what the data mesh is, it is important to understand why it is required.  

The term data mesh was coined in 2019 by Zhamak Dehghani, a computer scientist and currently the founder and CEO of Nextdata. She explains its significance as follows:

“Data mesh is the nudge that puts us on a new trajectory in how we approach data: how we imagine data, how we capture and share it, and how we create value from it, at scale and in the fields of analytics and AI. This new trajectory moves us away from the centralization of data and its ownership toward a decentralized model. It aims to enable organizations to get value from data at scale, despite the messiness and organizational complexity.”

Source: Preface, Data Mesh: Delivering Data-Driven Value at Scale by Zhamak Dehghani

In other words, data mesh is needed because it allows organizations to scale their data management. It also eases the struggles of all data users – analysts, scientists, and engineers – by providing timely access to high-quality data while keeping them in close contact with the business.

Beyond the Data Lake

Analytical data is at the core of data-driven decision making. It is used for predictive use cases, visualizations, reports, and training the machine learning models that add intelligence to the business. Analytical data is the stimulus for organizations moving from gut-based to data-driven decisions, and it powers the technology of the future. Data mesh is a decentralized approach to sharing, accessing, and managing analytical data in large-scale organizations (Dehghani, 2021, p. 3).

To get value from analytical data, data mesh alters the way organizations manage, use, and own it across several dimensions:

  • Organization: The responsibility and accountability for analytical data moves from a central data team to the business domains that produce and use the data.
  • Architecture: Instead of collecting data in monolithic warehouses and lakes, data is distributed and accessed through standardized protocols (see the descriptor sketch after this list).
  • Technology: Data is no longer a byproduct of code; data mesh enables solutions that maintain data and code as one autonomous unit.
  • Operation: Data governance shifts from a top-down, centralized model to federated computational governance, which lets each domain maintain control over its own data while still adhering to a common set of standards and practices.
  • Principle: Rather than seeing data as an asset to be collected, it is seen as a product used to serve and delight all data users.
  • Infrastructure: Fragmented platforms for analytics, applications, and operational systems become an integrated set of platforms serving both data and operational systems.
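To make these shifts concrete, here is a minimal sketch of what a data product descriptor could look like: a business domain owns the product, data and code are versioned together as one unit, and consumers reach the data through standardized output ports. All class names, fields, and example values here are illustrative assumptions, not a standard API.

```python
# A minimal, hypothetical sketch of a data product descriptor.
# Every name and field below is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class OutputPort:
    name: str        # e.g. "orders_daily"
    protocol: str    # standardized access protocol, e.g. "sql", "s3", "kafka"
    location: str    # address consumers resolve to reach the data

@dataclass
class DataProduct:
    name: str
    domain: str      # the owning business domain, not a central data team
    owner: str       # accountable contact inside that domain
    code_repo: str   # data and code are versioned together as one unit
    output_ports: list[OutputPort] = field(default_factory=list)

# Example: a fulfillment domain publishing its orders data through a SQL port.
orders = DataProduct(
    name="orders",
    domain="fulfillment",
    owner="fulfillment-data-team@example.com",
    code_repo="git@example.com:fulfillment/orders-data-product.git",
    output_ports=[OutputPort("orders_daily", "sql",
                             "warehouse.fulfillment.orders_daily")],
)
```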

The Four Principles of Data Mesh Architecture

These principles are designed to deliver the objectives of data mesh: innovation, agility, and value from data at scale.

  1. Domain ownership

The first pillar of data mesh puts domain experts in charge of their data. According to this principle, the domain experts not only control the data, but they also leverage it. They have context and know the business priorities, so they can easily maintain data quality and make it readily available for consumption throughout the organization.

 

  2. Data as a product

This principle ensures that data is always measured by the value it brings to the people who use it. While domain experts oversee and control data, product thinking is applied to ensure the data roadmap meets the accessibility, governance, and usability needs of the organization.  
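One way to picture product thinking is a data product that publishes measurable service-level objectives (SLOs) and checks them before each release. The sketch below is hypothetical; the field names and thresholds are illustrative assumptions rather than any specific tool's API.

```python
# Hypothetical sketch: a data product verifies its own SLOs before release.
from datetime import datetime, timedelta, timezone

def meets_freshness_slo(last_updated: datetime, max_age_hours: int = 24) -> bool:
    """The product promises consumers data no older than max_age_hours."""
    # last_updated must be timezone-aware (UTC) for this comparison.
    return datetime.now(timezone.utc) - last_updated <= timedelta(hours=max_age_hours)

def meets_completeness_slo(rows: list[dict],
                           required_fields: tuple = ("order_id", "amount")) -> bool:
    """No published record may be missing the fields consumers rely on."""
    return all(row.get(f) is not None for row in rows for f in required_fields)

# A release gate would refuse to publish when either check fails.
```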

  3. Self-service data infrastructure

Self-service data infrastructures enable teams to access and manage data products without depending on centralized data teams. Access to data by all stakeholders becomes quick and seamless since a self-service infrastructure removes barriers that slow down data products’ creation and usage.  
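As one way to picture a self-serve interaction, the sketch below registers a data product with a hypothetical platform API over HTTP. The endpoint path, payload shape, and the `provision_data_product` helper are assumptions for illustration; real platforms expose their own interfaces.

```python
# Hypothetical sketch of a self-serve platform call.
import json
import urllib.request

def provision_data_product(platform_url: str, descriptor: dict, token: str) -> dict:
    """Register a data product with an assumed self-serve platform API.

    Behind this single call, the platform would provision storage, pipelines,
    and access control, so domain teams never wait on a central team.
    """
    req = urllib.request.Request(
        f"{platform_url}/v1/data-products",
        data=json.dumps(descriptor).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```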

  4. Federated computational governance (FCG)

The last principle of data mesh creates a data governance operating model with a team composed of domain representatives and experts on legal, compliance and security matters. Computational governance involves using software code to automate governance processes and ensure compliance with policies and regulations. FCG helps create a more transparent, accountable and effective decision-making process, while enabling agility, innovation, efficiency and reducing the risk of errors.
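The “computational” part can be pictured as policy-as-code: a globally agreed rule enforced automatically against every data product before publication. The sketch below is hypothetical; the policy, field names, and schema format are illustrative assumptions.

```python
# Hypothetical sketch of computational governance as policy-as-code.
PII_FIELDS = {"patient_name", "ssn", "email"}  # assumed global PII policy

def check_pii_policy(schema: dict) -> list[str]:
    """Return violations: columns holding PII must be marked encrypted."""
    violations = []
    for column, props in schema.items():
        if column in PII_FIELDS and not props.get("encrypted", False):
            violations.append(f"column '{column}' holds PII but is not encrypted")
    return violations

# This example schema fails the automated check and would be blocked.
schema = {"patient_name": {"type": "string", "encrypted": False},
          "visit_date": {"type": "date"}}
assert check_pii_policy(schema) == [
    "column 'patient_name' holds PII but is not encrypted"
]
```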

Pro tip: Talk to our experts and learn more about data mesh as an enterprise data management strategy.

The Evolution of Data Architectures

Generation 1 (Gen 1) on-premises data architectures refer to the early stages of data infrastructure, in which organizations store and manage their data within their own physical premises, such as on-site data centers or servers. In this architecture, data is typically stored in relational databases, and traditional Extract, Transform, Load (ETL) processes are used for data integration. These architectures are characterized by their reliance on hardware and infrastructure maintained and operated by the organization itself, requiring significant upfront investment and ongoing maintenance, and they often lack the flexibility and scalability of more modern cloud-based solutions.
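To ground the ETL term, here is a minimal sketch of a traditional Gen 1 batch job using only the Python standard library, with sqlite3 standing in for an on-premises database server. The table names, columns, and exchange rates are illustrative assumptions.

```python
# A minimal sketch of a traditional Gen 1 batch ETL job.
import sqlite3

def run_etl(source_db: str, warehouse_db: str) -> None:
    # Extract: pull raw rows from the operational database.
    with sqlite3.connect(source_db) as src:
        rows = src.execute("SELECT id, amount, currency FROM orders").fetchall()

    # Transform: normalize all amounts to a single reporting currency.
    rates = {"USD": 1.0, "EUR": 1.1}  # assumed fixed exchange rates
    transformed = [(oid, amount * rates.get(currency, 1.0))
                   for oid, amount, currency in rows]

    # Load: write the cleaned rows into the on-premises warehouse.
    with sqlite3.connect(warehouse_db) as wh:
        wh.execute("CREATE TABLE IF NOT EXISTS orders_usd "
                   "(id INTEGER, amount_usd REAL)")
        wh.executemany("INSERT INTO orders_usd VALUES (?, ?)", transformed)
```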

Generation 2 (Gen 2) cloud data architectures represent an evolution beyond on-premises solutions, leveraging cloud computing platforms for storing, processing, and managing data. In Gen 2 cloud data architectures, organizations utilize cloud-based storage services, computing resources, and scalable databases. This shift to the cloud provides advantages such as flexibility, scalability, and reduced infrastructure management overhead. Gen 2 architectures often incorporate distributed computing technologies, Big Data tools, and serverless computing models. This generation embraces the cloud's elasticity, enabling organizations to adapt quickly to changing data requirements and achieve cost-efficiency through pay-as-you-go models.

Generation 3 (Gen 3) hybrid data architectures build upon the advancements of cloud technologies while incorporating a hybrid approach that combines both on-premises and cloud-based components. In Gen 3, organizations seamlessly integrate their on-premises data infrastructure with cloud services to create a cohesive and flexible data environment. This architecture enables the movement of data between on-premises and cloud environments, allowing organizations to leverage the benefits of both.

Key characteristics of Gen 3 hybrid data architectures include:

1. Data Mobility (a short sketch follows this section)
2. Interoperability
3. Scalability and Flexibility
4. Security and Compliance
5. Cost Optimization

Overall, Gen 3 hybrid data architectures offer a balanced approach that combines the strengths of on-premises and cloud solutions, allowing organizations to create a dynamic and adaptable data infrastructure.
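As a way to picture the data-mobility characteristic, the sketch below mirrors an on-premises table (sqlite3 as a stand-in) to cloud object storage using boto3, the AWS SDK for Python. The bucket, table, and function names are illustrative assumptions, and boto3 must be installed and configured with credentials.

```python
# Hypothetical sketch of Gen 3 data mobility: mirror an on-premises table
# to cloud object storage so both environments can work with the data.
import csv
import io
import sqlite3

import boto3  # AWS SDK for Python (third-party dependency)

def mirror_table_to_cloud(onprem_db: str, table: str, bucket: str) -> None:
    # Read the on-premises table into an in-memory CSV.
    with sqlite3.connect(onprem_db) as db:
        cursor = db.execute(f"SELECT * FROM {table}")  # table name trusted here
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor.fetchall())

    # Land a copy in cloud storage for cloud-side analytics.
    boto3.client("s3").put_object(Bucket=bucket, Key=f"{table}.csv",
                                  Body=buffer.getvalue().encode("utf-8"))
```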

The current data mesh architecture looks like the diagram below.

Image source: https://www.datamesh-architecture.com/

Data Mesh is the Future of Data-Driven Solutions

Memorial Sloan Kettering Cancer Center (MSK) in New York adopted a data mesh approach to data management to accelerate its research efforts to find a cure for cancer.

Data projects that once took the center weeks or even months to complete now sometimes take hours or days. Making data available to data users faster speeds up the research process itself.

Before adopting data mesh as its data management approach, MSK struggled to derive timely insights from large amounts of data while being restricted by regulations designed to keep medical records private. The organization treats and conducts research on more than 400 different types of cancer. Every year, it sees 20,000 inpatient visits and 700,000 outpatient visits. In addition, it must adhere to more than 1,800 research protocols.

Every lab test, visit, and research study translates into data. Without effective management, all that data is meaningless, whether it relates to the treatment of an individual patient or to furthering research. These are not simple datasets. And there are regulatory concerns, which means the data must be handled carefully and patient privacy protected.

 

Initially, the center took a centralized approach to data governance. Its data was gathered into an on-premises data warehouse, where data engineers and other intermediaries transformed it, and then transferred to a data lake. Once the data was in the lake, data managers and software engineers had access to it and could send it to end users upon request.

However, this did not make much difference. The data was spread across categories such as clinical, genome research, radiology, and pathology. It was both structured and unstructured: manual inputs needed correction, data was incomplete, different versions of the same data existed, and all of it had to be kept private. Turning this complex data into meaningful datasets was far from easy.

Meanwhile, even once all that data had been converted and loaded into the data lake, the request submission and fulfillment procedure was glacial. Furthermore, after data was handed to end users, the center had no control over it. Pasha pointed out that users exchanged Excel files over email, which resulted in redundant datasets and exposed security flaws.

Given all these challenges, the center decided to try something different. They wanted a decentralized architecture.  

They wanted to eliminate the cumbersome extract, transform, and load (ETL) pipelines that were needed for each individual project and slowed the center's data operations. They also wanted a more mature data governance model, an intuitive user interface that would enable more users to consume data, documentation support to better organize and monitor data, and more advanced ETL capabilities to reduce the amount of manual effort.

Time savings have been the primary advantage of MSK's adoption of data mesh. By removing cumbersome ETL operations and adding self-service features, data users can complete projects much more quickly.

Thanks to data mesh, data consumers now find it simpler to track and share data across domains. It has also helped build trust between data product owners and consumers where it was previously absent.

Data mesh is a game-changer for businesses because it provides a framework for breaking down barriers and enabling business domains to generate and curate data quickly and at an enterprise scale.

By Irfan Umer.
