The Data Proliferation Paradox
In today’s data-driven world, the “Data Proliferation Paradox,” coined by Aimpoint Digital, highlights an intriguing dilemma: as data availability surges and ingestion becomes simpler, there’s a parallel rise in the demand for observability, data quality checks, and stringent governance.
More data isn’t always synonymous with better insights. Databricks provides many features to consume and query vast amounts of data easily, but beyond the basics of building data pipelines, there are many challenges that companies face in gaining actionable insights from their data, including:
- Ensuring the data feeding analytics, ML models, or GenAI use cases is accurate, and providing evidence that it can be trusted.
- Allowing data owners and data stewards to govern access to and use of their data, particularly any sensitive PII or PHI, and providing visibility into and auditing of how it is being used.
- Extending this data governance into regulatory and compliance obligations.
- Providing a mechanism for data teams to rapidly deploy changes while ensuring production systems remain stable and compliant.
- Navigating the Data Proliferation Paradox: the opposing forces of growing data needs and the mounting challenges of data governance, security, and accessibility.
Each of these issues can decrease your data team’s efficiency in delivering the data you need to make decisions. Even worse, basing decisions on poor data can have significant financial, regulatory, and reputational impacts. DataOps seeks to address each of these issues and ensure you can extract the maximum value from your data, drive better decisions, and innovate faster.
Let’s explore each of these in a little more detail and uncover some tools that Databricks provides to assist you on your path to achieving DataOps!
What is DataOps?
Wikipedia defines DataOps as:
“A set of practices, processes, and technologies that combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement in the area of data analytics.”
At Aimpoint Digital, we focus on 3 main areas when helping clients implement DataOps using Databricks:
- Data Quality and Observability
- Data Governance and Security
- Deployment Automation
This provides a practical and achievable solution that can be built upon over time. For those starting out with Databricks, we emphasize the importance of mastering these elements from the outset. Establishing these fundamentals early on will streamline the development process and avoid significant re-work to adopt them in the future.
Data Quality and Observability
The familiar adage of “Garbage In – Garbage Out” is particularly relevant to data pipelines. It’s easy to say, “I trust my data,” but it’s difficult to prove. Data quality checks are the unsung heroes of data pipelines for building and proving that trust. Beyond the checks themselves, monitoring and alerting on poor data quality, tagging validated data assets, highlighting trends in data quality over time, and identifying the root cause of data issues are all key to building confidence in the output of any data pipeline or ML model.
According to Gartner, “Every year, poor data quality costs organizations an average of $12.9 million.”
The following features enable us to check and track data quality in Databricks.
Delta Live Tables (DLT) Expectations
The data quality check functions built into Delta Live Tables (DLT) allow you to specify what to check and how to respond when a check fails, including warning, dropping the bad records, or failing the update.
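The following is a minimal sketch (the `orders_raw` source table and its columns are illustrative, not taken from a real pipeline) showing the three Python expectation decorators side by side:

```python
import dlt

@dlt.table(comment="Orders with data quality expectations applied")
@dlt.expect("non_negative_total", "order_total >= 0")          # warn: keep the row, record the violation
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop rows that fail the check
@dlt.expect_or_fail("valid_order_date", "order_date <= current_date()")  # fail the update on any violation
def orders_clean():
    # `orders_raw` is a placeholder for your own bronze/landing table.
    return dlt.read("orders_raw")
```

Each decorator records its results in the pipeline’s event log, which is what powers the summary metrics described next.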
When you define these expectations in your data pipeline, summary data quality metrics for each pipeline run are instantly available in the UI. The same metrics are also stored in the pipeline’s event log, so you can easily build your own data quality summaries and dashboards for a more holistic view across all of your pipelines.
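As a rough illustration: for pipelines configured with a storage location, the event log is itself a Delta table, and expectation results sit in the `details` payload of `flow_progress` events. The path and JSON layout below are assumptions that may vary by release, so treat this as a sketch:

```python
from pyspark.sql import functions as F

# Illustrative: replace with the storage location configured for your pipeline.
pipeline_storage = "dbfs:/pipelines/<pipeline-id>"

# The event log is stored as a Delta table under the pipeline's storage location.
events = spark.read.format("delta").load(f"{pipeline_storage}/system/events")

# Expectation metrics are nested as JSON inside flow_progress events.
dq = (
    events
    .filter(F.col("event_type") == "flow_progress")
    .select(
        "timestamp",
        F.get_json_object("details", "$.flow_progress.data_quality.expectations").alias("expectations"),
    )
    .filter(F.col("expectations").isNotNull())
)
dq.display()
```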
Data Quality Tagging
Tags are a newer feature of Databricks Unity Catalog. With this feature, you can apply a quality tag to a table, column, or schema. Tagging a gold table with `GOLD`, for example, aids discovery and trust in the associated data and adds a quality “seal of approval.” Tagging your data sets with other useful information, such as who owns the data, which business area maintains it, or who the subject-matter expert (SME) is, aids general understanding, increases collaboration, and further democratizes your data assets.
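For example, tags can be applied with a short SQL statement (run here through `spark.sql`); the catalog, schema, table, column, and tag names below are purely illustrative:

```python
# Illustrative three-level name; substitute your own catalog and schema.
table = "main.sales.orders_gold"

# Mark the table as certified "gold" and record who owns it.
spark.sql(f"ALTER TABLE {table} SET TAGS ('quality' = 'GOLD', 'owner' = 'sales-data-team')")

# Tags can also be applied to individual columns, e.g. to flag PII.
spark.sql(f"ALTER TABLE {table} ALTER COLUMN customer_email SET TAGS ('pii' = 'true')")
```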
Data Lineage
A key part of data quality is clearly identifying where your data came from. Many data points, such as identifiers, dates, and amounts, are common across data sources.
Knowing the source of each of these fields (see the example query after this list) will aid in:
- Data Integrity and Accuracy: Knowing the data is from a reliable source and used appropriately.
- Compliance and Governance: Data Governors need to know where their data is used and how.
- Troubleshooting and Root-Cause Analysis: Knowing the source is invaluable when anomalies or unexpected values appear in the data. It allows data teams to pinpoint and rectify issues quickly.
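Unity Catalog captures much of this lineage automatically. Assuming the lineage system tables are enabled on your account (an assumption here, along with the illustrative table name), downstream usage can be queried directly:

```python
# Assumes the Unity Catalog lineage system tables are enabled for your metastore.
downstream = spark.sql("""
    SELECT source_table_full_name,
           target_table_full_name,
           entity_type,
           event_time
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'main.sales.orders_gold'
    ORDER BY event_time DESC
""")
downstream.display()
```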
Lakehouse Monitoring
Lakehouse Monitoring lets you automatically monitor the statistical properties and trends in your data quality over time and track the performance of machine learning models and model-serving endpoints. By leveraging Unity Catalog and Delta-managed tables, you can create monitors that compute profile and drift metrics, along with an automatically generated dashboard to visualize the results.
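As a rough sketch only: the Lakehouse Monitoring Python client and its signatures have changed across releases, so treat the module path and argument names below as assumptions and check the current documentation before relying on them. Creating a snapshot monitor on a Unity Catalog table looks roughly like this:

```python
# Assumed API: the Lakehouse Monitoring client available on Databricks Runtime;
# the module path and signatures may differ in your release.
from databricks import lakehouse_monitoring as lm

monitor = lm.create_monitor(
    table_name="main.sales.orders_gold",   # illustrative table
    profile_type=lm.Snapshot(),            # profile the full table on each refresh
    output_schema_name="main.monitoring",  # where the metric tables are written
)
```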
Ensuring data quality is paramount, as accurate and reliable insights hinge upon the integrity of the underlying data. If you are interested in a deeper review, this article thoroughly outlines what Data Quality means within a Databricks context.
Data Governance and Security
Data Owners and Data Stewards need to know who has access to their data and where their data is being used. Ensuring PII or PHI is obfuscated, approving changes to access, and then creating and applying access models in a scalable way are all notoriously difficult to achieve. Quite often, the owners of data are not the users of the data assets being produced. Burdening data owners with the responsibility of governing and auditing data usage without equipping them with the necessary procedures and tools to oversee their data through a pipeline, ML model or analytical layer is impractical.
Creating a Role-Based Access Control (RBAC) model, putting ownership and governance in the hands of (potentially non-technical) data owners, and then building a mechanism that applies this model across your objects in an automated and scalable way is arguably the most challenging aspect of DataOps.
This is where Unity Catalog really shines, but what is it, and how does it benefit data governance? According to Databricks, “Organizations can use Unity Catalog to securely discover, access, monitor and collaborate on files, tables, ML models, notebooks and dashboards across any data platform or cloud, while also leveraging AI to boost productivity and unlock the full potential of the lakehouse environment.”
Databricks recently introduced a new capability called Databricks Asset Bundles that “standardizes and unifies the deployment strategy for all data products developed on the platform. It allows developers to describe the infrastructure and resources of their project through a YAML configuration file.” This means you can define a configuration file outlining the roles, access levels, and objects to secure, and then run scripts to deploy access changes automatically.
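A minimal `databricks.yml` sketch might look like the following; the resource, group, and path names are illustrative, and the exact keys supported depend on your CLI version, so validate against the bundle schema before relying on it:

```yaml
# databricks.yml (sketch): validate field names against your CLI's bundle schema.
bundle:
  name: dataops_demo

# Applied to every resource the bundle deploys.
permissions:
  - level: CAN_MANAGE
    group_name: data-platform-team
  - level: CAN_VIEW
    group_name: analysts

targets:
  dev:
    workspace:
      host: https://<your-workspace-url>  # placeholder

resources:
  jobs:
    nightly_orders_refresh:
      name: nightly-orders-refresh
      tasks:
        - task_key: refresh
          notebook_task:
            notebook_path: ./notebooks/refresh_orders.py
          # Cluster configuration omitted for brevity.
```

Running `databricks bundle validate` and then `databricks bundle deploy -t dev` from the CLI (assuming a recent CLI release) applies the configuration, including the declared permissions.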
You have several options to maintain access to objects in Unity Catalog in Databricks.
Here’s a look at various methods and when we see them commonly used:
- Dive in with Simple SQL Queries: Perfect for newcomers or those dabbling with Databricks for the first time.
- API & Custom Scripts for RBAC Governance: For tech-savvy teams, use APIs and scripts to convert governance spreadsheets (like CSV, YAML, or JSON) into SQL.
- Direct Access via Catalog UI: If you’ve got non-technical data governors, the Catalog User Interface offers a straightforward way to dictate access manually.
- Infrastructure-as-Code (IaC): Tools like Terraform or Azure Resource Manager shine when you’ve got a Data Platform team seasoned in IaC for infrastructure setups.
- Databricks Asset Bundles: Ideal for Data Platform teams either new to IaC or those wanting to supplement their IaC practices, Databricks Asset Bundles are a new way to express complete end-to-end projects as a collection of config files.
- Credential Passthrough from Cloud Storage: Great for those who manage credentials on Cloud platforms and don’t need hyper-specific access controls.
- Third-Party Tools like Immuta: When intricate data governance over a vast team is required, especially with row or column-level access controls, tools like Immuta are great options.
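To illustrate the custom-script approach from the list above, the sketch below turns a hypothetical CSV of grants, maintained by data owners, into Unity Catalog GRANT statements executed through `spark.sql`; the file path and column names are assumptions:

```python
import csv

# Hypothetical governance file maintained by data owners, with columns:
# principal,privilege,securable_type,securable_name
# e.g. data-analysts,SELECT,TABLE,main.sales.orders_gold
GRANTS_FILE = "/dbfs/governance/grants.csv"

with open(GRANTS_FILE, newline="") as f:
    for row in csv.DictReader(f):
        stmt = (
            f"GRANT {row['privilege']} ON {row['securable_type']} "
            f"{row['securable_name']} TO `{row['principal']}`"
        )
        print(stmt)        # log each statement for auditability
        spark.sql(stmt)    # apply the grant
```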
Deployment Automation
Every time data teams introduce new data pipelines or make code changes, they must incorporate solutions designed to address these data governance challenges. This can sometimes result in considerable operational complexity. To streamline this process, embracing agile principles, offering CI/CD tools, and establishing a straightforward means of validating changes in non-production environments are vital. These measures are essential for enabling data teams to construct and manage data pipelines efficiently, vastly reducing the lead time from idea to delivered insights.
Putting it All Together
To attain the coveted state of DataOps excellence, it’s crucial for companies to establish automated processes, define clear roles and responsibilities, and implement well-defined, actionable guidelines and procedures.
Fortunately, Databricks is continuously adding new features and evolving existing ones, particularly in Unity Catalog, to help overcome these challenges, including:
- Data anomaly detection and root cause analysis via Lakehouse Monitoring.
- Improved governance, auditing, and lineage tracking through Unity Catalog.
- Configuration-driven deployment using Databricks Asset Bundles.
- Effortless integration with Git for Continuous Integration and Continuous Deployment (CI/CD) workflows.
This is just an overview of how to start on the path to DataOps. Having all of this in place across your organization will take dedication, investment, and patience. Achieving the holy grail of DataOps, the ability to seamlessly deploy changes that turn quality, secure, and governed data into meaningful insights, is worth the effort!
Aimpoint Digital’s Tailored Solutions
Aimpoint Digital offers several tailored solutions to expedite your organization’s DataOps journey on Databricks. Whether you’re interested in our DataOps Accelerator to get started or a comprehensive DataOps strategy, we bring a wealth of expertise and industry-tested experience to the table. To get started on your DataOps journey, fill out the contact info below.