TL;DR
- Schema, null, and uniqueness checks catch the structural drift that silently corrupts dashboards before anyone notices.
- Referential integrity and freshness checks keep cross-team analytics workflows trustworthy when upstream data changes.
- Volume anomaly and business rule checks catch what code can't—semantic errors and unexpected feed disruptions.
- Governance turns checks into auditable artifacts, which regulations like the General Data Protection Regulation (GDPR) and Sarbanes-Oxley (SOX) increasingly require.
- Prophecy operationalizes these checks through agentic data preparation, visual workflows, and built-in governance on Databricks, Snowflake, and BigQuery.
Every analytics leader knows the feeling. A dashboard goes out to the C-suite, a number looks wrong, and three teams spend a week tracing the issue back to a single null value in a foreign key column. Or worse, nobody catches it, and decisions get made on bad data.
The stakes keep rising. 60% of AI projects will be abandoned through 2026 due to a lack of AI-ready data, and 38% of infrastructure and operations leaders already cite poor data quality as a leading cause of project failure. Data quality now determines whether major analytics initiatives succeed.
This guide walks through the seven data quality checks every enterprise analytics team should run on their data workflows.
The cost of getting it wrong
Poor data quality shows up as a recurring budget line item across the business. Here's what's at stake:
- Direct losses: More than 25% of organizations lose over $5M annually due to poor data quality.
- The Rule of 10: It costs ten times as much to complete a unit of work when data is flawed as when it is clean. Every uncaught error compounds into hours of rework, missed deadlines, and eroded stakeholder trust.
- The hidden engineering tax: Data workflow requests typically consume 10–30% of engineering time. For a team of 10 engineers, that's 1–3 full salaries spent on slow, ad hoc requests while the business waits on stale or untrusted data.
- A structural problem: Only a few companies meet basic data quality standards, which makes data quality a systemic gap rather than an incidental one.
The practical shift is moving from reactive cleanup to proactive prevention, where errors are prevented at source rather than cleaned up after the fact. That's where these seven checks come in.
1. Schema validation
This check catches structural drift: renamed columns, changed data types, and unexpected additions or omissions in incoming data.
Data workflow failures often start with schema drift. A type change can corrupt downstream aggregations without raising an obvious error. This is why your data workflow keeps running, and your dashboards keep refreshing, even though the numbers may be wrong.
Here’s how teams typically implement this:
- Databricks: Delta Lake schema enforcement rules act as a write-time gate, blocking data that doesn't match the target schema.
- Snowflake and BigQuery: Scheduled dbt schema tests run within transformation workflows.
For analytics leaders managing teams with varying levels of SQL expertise, schema validation is a baseline control. It catches errors that no amount of domain expertise can spot visually in a dashboard.
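To make the idea concrete, here is a minimal, platform-neutral sketch in plain Python. It is not how Delta Lake or dbt implement schema checks; the expected schema, the column names, and the `schema_drift` helper are all hypothetical and exist only to show the three kinds of drift described above.

```python
from typing import Any

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "status": str}


def schema_drift(expected: dict[str, type], row: dict[str, Any]) -> dict[str, list[str]]:
    """Compare one incoming record against the expected schema."""
    missing = sorted(set(expected) - set(row))
    unexpected = sorted(set(row) - set(expected))
    # Type-check only columns that are both expected and present; nulls are
    # left to the completeness check rather than reported as type errors.
    wrong_type = sorted(
        col
        for col in set(expected) & set(row)
        if row[col] is not None and not isinstance(row[col], expected[col])
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


# A renamed column ("status" became "state") plus a type change (amount as text).
report = schema_drift(
    EXPECTED_SCHEMA, {"order_id": 1, "amount": "19.99", "state": "paid"}
)
```

In this example the report lists `status` as missing, `state` as unexpected, and `amount` as the wrong type, which are exactly the changes a dashboard would absorb without complaint.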
2. Null and completeness checks
This check catches missing values in mandatory fields: null primary keys, empty business-critical columns, and failed calculations.
Null values propagate silently through joins and aggregations. A null primary key breaks downstream referential integrity, and a missing revenue field throws off an entire quarterly rollup.
The key architectural decision is what happens when a check fails. Production-grade implementations use a tiered approach aligned to the medallion architecture:
- Bronze (raw ingestion): Warn mode logs violations without blocking ingestion, preserving raw data for investigation.
- Silver (cleansed): Drop mode removes invalid records and routes them to quarantine for later review.
- Gold (business-ready): Fail mode treats any violation as a workflow-breaking event, ensuring nothing bad reaches analysts or executives.
The tiers tighten as data moves toward consumers, which keeps bad data away from analysts while preserving the raw records for investigation.
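The three modes can be sketched in a few lines of plain Python. This is an illustration of the warn, drop, and fail behaviors only, not any platform's API; `check_required`, `QualityError`, and the sample rows are hypothetical names chosen for the example.

```python
import logging
from typing import Any

Row = dict[str, Any]
logger = logging.getLogger("quality")


class QualityError(Exception):
    """Raised when a fail-mode check finds at least one violation."""


def check_required(
    rows: list[Row], required: list[str], mode: str
) -> tuple[list[Row], list[Row]]:
    """Apply a null check and return (kept_rows, quarantined_rows)."""

    def is_bad(row: Row) -> bool:
        return any(row.get(col) is None for col in required)

    bad = [row for row in rows if is_bad(row)]
    if mode == "warn":  # Bronze: log the violations, keep everything
        logger.warning("%d rows have nulls in %s", len(bad), required)
        return rows, []
    if mode == "drop":  # Silver: keep valid rows, quarantine the rest
        return [row for row in rows if not is_bad(row)], bad
    if mode == "fail":  # Gold: any violation stops the workflow
        if bad:
            raise QualityError(f"{len(bad)} rows have nulls in {required}")
        return rows, []
    raise ValueError(f"unknown mode: {mode}")


orders = [{"id": 1, "revenue": 10.0}, {"id": None, "revenue": 5.0}]
kept, quarantined = check_required(orders, ["id", "revenue"], mode="drop")
```

Running the same rows in "warn" mode returns both records untouched, while "fail" mode raises `QualityError` before anything reaches the business-ready layer.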
Want to try these patterns hands-on? Explore Prophecy's agentic AI features and start building governed data workflows in minutes.
3. Uniqueness and duplicate detection
This check catches duplicate primary keys, reprocessed events, and repeated composite key combinations.
Duplicates inflate metrics, double-count revenue, and corrupt machine learning (ML) training sets. Deduplication is required in streaming workflows with at-least-once delivery semantics.
Common implementations:
- dbt across Snowflake and BigQuery: A straightforward unique test on any column or composite key.
- Databricks: Delta Live Tables (DLT) expectations using SQL logic applied to composite keys.
- Google Cloud Dataplex: Defines uniqueness as a core dimension, meaning the data is distinct with no duplicates.
For analytics leaders, duplicates are hard to spot because they often look plausible in aggregate. A low duplicate rate in a transaction table might not flag in a spot check but can meaningfully skew quarterly reporting.
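As a rough illustration of what a composite-key uniqueness test does under the hood, the sketch below counts key combinations in plain Python. It is not the dbt or DLT implementation; the `duplicate_keys` helper and the event fields are assumptions made for the example.

```python
from collections import Counter
from typing import Any


def duplicate_keys(
    rows: list[dict[str, Any]], key_columns: list[str]
) -> dict[tuple, int]:
    """Return each composite key that appears more than once, with its count."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# The second event was delivered twice, as at-least-once delivery allows.
events = [
    {"order_id": 100, "event": "created", "amount": 25.0},
    {"order_id": 100, "event": "paid", "amount": 25.0},
    {"order_id": 100, "event": "paid", "amount": 25.0},
    {"order_id": 101, "event": "created", "amount": 40.0},
]
dupes = duplicate_keys(events, ["order_id", "event"])
```

Here `dupes` contains a single entry for the key `(100, "paid")` with a count of 2. Neither column is unique on its own, which is why the check has to run on the combination.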
4. Referential integrity checks
This check catches orphaned records: foreign key values that don't exist in the referenced parent table.
When parent-child relationships break, you get data loss in joins, incorrect aggregations, and downstream reporting errors that are notoriously hard to trace.
Tools handle this in similar ways: dbt's relationships test validates cross-table references declaratively across Snowflake and BigQuery, while Databricks uses EXPECT constraints with cross-table subquery logic.
This check matters especially for analytics teams serving multiple business units. When each team owns different tables, referential integrity is the glue that holds joined data together—and one of the first things to break when upstream changes go uncommunicated.
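The underlying logic is a lookup of child keys against the set of parent keys. The plain-Python sketch below shows that idea under stated assumptions: `orphaned_rows` and the table contents are hypothetical, and null foreign keys are deliberately skipped here so the completeness check can own them.

```python
from typing import Any

Row = dict[str, Any]


def orphaned_rows(children: list[Row], fk: str, parents: list[Row], pk: str) -> list[Row]:
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {parent[pk] for parent in parents}
    # Null foreign keys are ignored in this sketch (an assumption); they are
    # treated as a completeness problem rather than an integrity one.
    return [
        child
        for child in children
        if child[fk] is not None and child[fk] not in parent_keys
    ]


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # customer 3 does not exist upstream
    {"order_id": 12, "customer_id": None},
]
orphans = orphaned_rows(orders, "customer_id", customers, "customer_id")
```

Order 11 is the only orphan. In an inner join against customers it would simply vanish from the result, which is why this class of error is so hard to trace from a dashboard.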
5. Freshness and timeliness checks
This check catches stale data: records that haven't arrived within their expected time window.
A stale table leads to outdated or incorrect answers, directly affecting business outcomes. This matters for dashboards, operational reports, and AI/ML features that depend on recent data.
Modern platforms can increasingly automate this:
- Databricks Agentic Monitoring learns historical patterns and detects freshness anomalies automatically, without manual configuration.
- Snowflake ships a dedicated FRESHNESS Data Metric Function (DMF).
- BigQuery teams typically combine Dataplex auto data quality with scheduled SQL checks.
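Where automated monitoring isn't available, a scheduled check can be as simple as comparing the newest load timestamp with an agreed maximum lag. The sketch below is a hand-rolled illustration, not any vendor's freshness function; `is_fresh`, the timestamps, and the two-hour window are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(
    latest_loaded_at: datetime, max_lag: timedelta, now: Optional[datetime] = None
) -> bool:
    """Return True if the newest record arrived within the allowed lag."""
    now = now or datetime.now(timezone.utc)
    return now - latest_loaded_at <= max_lag


# Fixed clock so the example is reproducible.
checked_at = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc)

# The hypothetical agreement for this table is "no more than two hours old".
fresh = is_fresh(last_load, max_lag=timedelta(hours=2), now=checked_at)
```

The table was last loaded two and a half hours before the check, so `fresh` is False. Passing `now` explicitly keeps the function easy to test; in a scheduled job it would default to the current UTC time.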
6. Volume anomaly detection
This check catches unexpected drops or spikes in row counts, broken upstream feeds, duplicate ingestion, and runaway processes.
A sudden drop or spike in row counts versus a table's normal range signals a broken feed. The downstream impact is equivalent to a data quality failure, even though every individual record might be perfectly valid.
ML-driven approaches are becoming standard:
- Databricks Agentic Monitoring: AI agents learn historical patterns and trigger alerts on anomalies without manual rule configuration.
- AWS Glue: Applies ML anomaly detection over time to detect abnormal patterns.
- Snowflake: Shipped data quality anomaly detection last year, followed by incident notification infrastructure.
Distribution monitoring patterns extend this further by tracking whether the min, max, mean, and standard deviation of key columns remain consistent with historical baselines. Distribution shifts signal changes that rule-based checks won't catch.
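The platforms above use learned models, but the core idea can be shown with a much simpler stand-in: a z-score of today's row count against recent history. The sketch below is that simplified substitute, not what Databricks, AWS Glue, or Snowflake actually run; `is_volume_anomaly`, the threshold of 3, and the row counts are assumptions.

```python
from statistics import mean, stdev


def is_volume_anomaly(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's row count if it sits far outside the historical range."""
    baseline = mean(history)
    spread = stdev(history)  # needs at least two historical data points
    if spread == 0:
        # A perfectly flat history makes any change an anomaly.
        return today != baseline
    return abs(today - baseline) / spread > threshold


daily_row_counts = [1000, 1020, 980, 1010, 990]
broken_feed = is_volume_anomaly(daily_row_counts, today=400)
normal_day = is_volume_anomaly(daily_row_counts, today=1005)
```

A drop to 400 rows is flagged while 1,005 rows is not. The same comparison can be applied to a column's min, max, mean, or standard deviation to approximate the distribution monitoring described above.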
7. Business rule validation
This check catches semantically incorrect data that passes every structural check.
A data workflow can have no nulls, no duplicates, a correct schema, and valid types, and still deliver wrong data. Business rule validation catches what structural checks miss, such as invalid status codes, illogical date combinations, and out-of-range values.
Accepted values checks validate that a column contains only predefined allowed values, and multi-condition constraints using CASE expressions can enforce more complex rules across related fields.
This is where domain expertise becomes essential. Data platform teams can build the enforcement mechanism, but analytics leaders and their teams know what "valid" actually means for the business.
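As an illustration, the sketch below applies an accepted-values check and a few multi-field rules to a single order in plain Python. Every rule here is a made-up example of the kind of constraint a business team might define; `rule_violations`, the status list, and the field names are hypothetical.

```python
from datetime import date
from typing import Any

# Hypothetical accepted values for the status column.
ALLOWED_STATUSES = {"pending", "shipped", "delivered", "cancelled"}


def rule_violations(order: dict[str, Any]) -> list[str]:
    """Return the business rules this order breaks (empty list if none)."""
    violations = []
    if order["status"] not in ALLOWED_STATUSES:
        violations.append("status not in accepted values")
    if order["shipped_on"] is not None and order["shipped_on"] < order["ordered_on"]:
        violations.append("shipped before ordered")
    if order["status"] == "delivered" and order["shipped_on"] is None:
        violations.append("delivered without a ship date")
    if not 0 <= order["discount_pct"] <= 100:
        violations.append("discount out of range")
    return violations


# Structurally valid: no nulls in required fields, correct types, unique key.
suspicious_order = {
    "status": "delivered",
    "ordered_on": date(2024, 3, 10),
    "shipped_on": date(2024, 3, 8),
    "discount_pct": 120,
}
problems = rule_violations(suspicious_order)
```

This order passes every structural check yet breaks two rules: it shipped before it was ordered, and its discount exceeds 100%. The enforcement code is trivial; deciding which rules belong in the list is the part that needs domain knowledge.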
Why these checks need governance, not just code
Data accuracy used to be a best practice; regulations now mandate it:
- General Data Protection Regulation (GDPR)
- Sarbanes-Oxley (SOX)
- Basel Committee on Banking Supervision (BCBS) 239
This means quality checks can't exist as undocumented behaviors in data workflows. They need to be enumerable artifacts with tracked deployment status, queryable audit logs, and governance controls that withstand regulatory scrutiny. Desktop tools and ungoverned AI-generated code make this hard, while combining AI acceleration with human review, standardization, and Git retention gives you the speed of AI with the reliability of engineering.
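One way to picture "enumerable artifacts" is a registry in which every check is a named record and every run leaves an audit entry. The sketch below is a toy in-memory version of that idea, not Prophecy's or any platform's governance model; `QualityCheck`, `CheckRegistry`, and all field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class QualityCheck:
    name: str
    table: str
    owner: str
    deployed: bool  # tracked deployment status


@dataclass
class CheckRegistry:
    checks: dict[str, QualityCheck] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def register(self, check: QualityCheck) -> None:
        self.checks[check.name] = check

    def record_run(self, name: str, passed: bool, at: Optional[datetime] = None) -> None:
        """Append an audit entry; unregistered checks are rejected with KeyError."""
        check = self.checks[name]
        self.audit_log.append(
            {
                "check": name,
                "table": check.table,
                "passed": passed,
                "at": at or datetime.now(timezone.utc),
            }
        )

    def failures(self, table: str) -> list[dict]:
        """Query the audit log for failed runs against one table."""
        return [e for e in self.audit_log if e["table"] == table and not e["passed"]]


registry = CheckRegistry()
registry.register(QualityCheck("orders_id_not_null", "orders", "analytics", deployed=True))
registry.record_run("orders_id_not_null", passed=True)
registry.record_run("orders_id_not_null", passed=False)
failed_runs = registry.failures("orders")
```

Because a run can only be recorded against a registered check, there is no path for an undocumented check to appear in the log. A real implementation would persist both structures and version the definitions in Git.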
Operationalizing data quality with Prophecy
Data quality checks are only as good as the team that can build, run, and maintain them. The seven checks above aren't hard in theory; the hard part is getting them into every analytics data workflow without bottlenecking data engineering. That's the gap Prophecy closes.
Prophecy is an agentic data preparation platform that runs on your cloud data platform, enabling analytics teams to build governed data workflows (sometimes referred to as data pipelines) themselves. Data engineers continue to own ETL pipelines, ingestion, and data management; analysts use Prophecy to prepare and transform data for analytics on top of what's already governed. The result is faster delivery, fewer tickets, and quality checks built into every workflow by default.
What you get with Prophecy:
- AI agents: Multiple agents help analysts build, transform, validate, and document data workflows step-by-step, accelerating self-service without sacrificing standards.
- Visual workflows: Drag-and-drop visual workflows surface DLT expectations, per-gem unit tests, and column-level lineage directly in the build experience.
- Built-in governance: Prophecy 4.0's governance-first capabilities keep access, compute costs, and quality standards under platform-team control while analysts iterate within those guardrails.
- Deployment to cloud platforms: Workflows run natively on Databricks, Snowflake, and BigQuery, with continuous integration and continuous delivery (CI/CD) deploying checks automatically.
Ready to see what AI agents can do for your data workflows? Book a demo to see Prophecy in action.
FAQs
What is a data quality check?
A data quality check is an automated test that validates data against defined rules to confirm it is accurate, complete, unique, consistent, timely, and valid. Data quality checks run inside data workflows and flag, quarantine, or block records that fail the rules before they reach dashboards or downstream systems.
What's the difference between data quality checks in ETL pipelines and analytics data workflows?
ETL pipelines run schema, ingestion, and platform-level governance checks owned by data engineering teams. Analytics data workflows add business-rule validation, freshness, and transformation checks owned by analytics teams. Both layers are required, and Prophecy focuses on the analytics side, where analysts need self-service quality without filing tickets.
What are the six dimensions of data quality?
The six core dimensions of data quality are accuracy, completeness, consistency, uniqueness, timeliness, and validity. Accuracy and validity confirm that the data reflects reality and follows business rules. Completeness and uniqueness handle missing or duplicate values. Consistency and timeliness keep data aligned across systems and current enough to act on.
How often should data quality checks run?
Schema, null, and uniqueness checks should run on every workflow execution so issues are caught immediately. Freshness and volume anomaly checks should run on a schedule matching your data refresh cadence, typically hourly or daily. Business rule checks run with the analytics workflow that produces the final dataset for reporting.
What's the difference between data quality and data observability?
Data quality checks test specific assumptions you define, such as "this column should never be null." Data observability monitors data behavior automatically over time and flags unexpected changes you did not predict. Data quality catches known issues; data observability catches unknown ones. Mature teams use both together.
How do you measure data quality?
Data quality is measured by tracking the percentage of records that pass each defined check across the six dimensions: accuracy, completeness, consistency, uniqueness, timeliness, and validity. Teams typically score each dimension separately, then roll the scores into a composite quality metric monitored on dashboards over time.
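A minimal sketch of that scoring, assuming an equal-weighted average across dimensions (weighting is a team decision, and the counts below are invented for illustration):

```python
def pass_rate(passed: int, total: int) -> float:
    """Share of records that passed a check."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def composite_score(rates: dict[str, float]) -> float:
    """Equal-weighted average of per-dimension pass rates (an assumption)."""
    return sum(rates.values()) / len(rates)


# Hypothetical results for three of the six dimensions.
dimension_rates = {
    "completeness": pass_rate(980, 1000),
    "uniqueness": pass_rate(1000, 1000),
    "validity": pass_rate(960, 1000),
}
score = composite_score(dimension_rates)
```

With these numbers the composite comes out at roughly 0.98. Keeping the per-dimension rates alongside the composite matters, because a strong average can hide one failing dimension.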
Do I need to choose between Databricks, Snowflake, and BigQuery to use Prophecy?
No. Prophecy runs natively on Databricks, Snowflake, and BigQuery, so your existing platform investment stays intact. Compute, governance, and security remain inside your cloud data platform, and Prophecy provides the agentic data preparation layer on top of whichever platform your team already uses.
Can AI agents replace human review for data quality?
No, AI agents should not replace human review. Prophecy combines AI acceleration with human review, standardization, and Git retention. AI agents speed up building and validating data workflows, but analysts and platform teams stay in the loop, giving you the speed of AI with the reliability of engineering.
How does Prophecy fit alongside business intelligence (BI) tools?
BI tools are powerful for visualization and analysis, but they depend on well-prepared datasets. Prophecy prepares and transforms data upstream so BI tools can do what they do best. Reporting and dashboards stay in BI tools, and Prophecy ensures the data feeding them is governed, accurate, and trustworthy.

