
Leverage DBT audit to ensure accurate model generation

Updated: Nov 16

Introduction 

In the evolving landscape of data analytics and engineering, efficient and reliable data transformation tools are crucial. dbt (data build tool) has emerged as a leading solution, offering capabilities that streamline data transformation workflows, enhance data quality, and support collaborative data practices. Many companies want to start using dbt but are stuck on an older, legacy SQL-based ETL tool. In this blog we give an example of using the dbt audit_helper package as a way of checking the results when moving a legacy query to dbt.


Prerequisites 

In any move between tools, the ideal first step is to create an overview of the various pipelines and their logic. This overview will function as the backbone from which development can commence. Make sure to document the sources the data is read from, the destinations it is written to, and the logic: the SQL queries, joins, filters, and other transformations.

 

Set up the dbt environment 

We won’t go over setting up a dbt-Airflow environment in detail here, as the documentation on this subject is widely available. However, the basic steps are:

  1. Install dbt: Follow the official installation guide. 

  2. Initialize a dbt project: Run dbt init followed by the project name to create a new dbt project. 

  3. Configure profiles: Set up a profiles.yml file with the necessary database connection details. 

  4. Create the sources: Define the data sources in dbt/models/sources/source_name.yml. 

  5. Set up ingestion pipelines: Use Airflow DAGs to run custom Python ingestions or choose from the wide range of existing operators depending on the requirements. 

  6. Fill ingestion layer: Run the ingestion pipelines to populate the ingestion layer with source data. 


Once the ingestion layer contains the data defined in the sources, we can begin migrating the pipelines.


Note: The ingestion layer should contain copies of the source data. 

 

 

Translate SQL to dbt

To create the models that will replace the current transformations, first organize the models according to the layers the project has (staging, intermediate, reporting). For each transformation, create a file in the relevant layer folder under models (e.g. dbt/models/staging/transformation_name.sql). During the transition phase it can be useful to introduce prefixes in the table names, or separate schemas, so that the data generated by the legacy SQL and the data generated by dbt can coexist in the same layer for comparison.
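A minimal sketch of such a model is shown below; the file, source, and column names are made up for illustration, and the alias config gives the dbt output a prefixed name so it can live next to the legacy table:

-- models/staging/stg_pharma_sales.sql (hypothetical file, source, and column names)
{{ config(
    materialized='table',
    alias='dbt_pharma_sales'  -- prefixed name so the dbt output can coexist with the legacy table
) }}

select
    id,
    upper(product_name)     as product_name,
    cast(sale_date as date) as sale_date,
    quantity * unit_price   as revenue
from {{ source('ingestion', 'pharma_sales') }}
where quantity > 0

Keeping the same grain and column names as the legacy output makes the later audit comparison straightforward.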


Note: Use tags to group the models such that they can be run in grouped dbt builds. 
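For example (the tag name here is arbitrary), a tag can be set in the model's config block and then used as a selector when running dbt:

-- tag migrated models so they can be built as one group
{{ config(tags=['legacy_migration']) }}

select * from {{ source('ingestion', 'pharma_sales') }}

-- build only the tagged models with: dbt build --select tag:legacy_migration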


Test model accuracy with an audit model 

An important step in creating the models is testing whether the model transformations produce data that is equal to the data generated by the legacy tool's transformations. Here we suggest using an audit model to generate a comparison audit. An example of such a model is shown below. To generate this model we made use of the audit_helper package: https://hub.getdbt.com/dbt-labs/audit_helper/latest/

-- define the relation of the table generated by the dbt model
{% set a_relation = source('snowflake', 'pharmaceuticals_india') %}

-- define the relation of the legacy table
{% set b_relation = source('snowflake2', 'pharmaceuticals_india') %}

{% if execute %}
    -- step 1: compare the total row counts of both tables
    {% set audit_query = audit_helper.compare_row_counts(
        a_relation=a_relation,
        b_relation=b_relation
    ) %}

    {% set audit_results = run_query(audit_query) %}

    {% do audit_results.print_table() %}
{% endif %}

{% if execute %}
    -- step 2: compare the column names, order and data types of both tables
    {% set audit_query2 = audit_helper.compare_relation_columns(
        a_relation=a_relation,
        b_relation=b_relation
    ) %}

    {% set audit_results2 = run_query(audit_query2) %}

    {% do audit_results2.print_table() %}
{% endif %}

{% if execute %}
    -- step 3: compare the full row contents and summarize how many rows match
    {% set audit_query3 = audit_helper.compare_relations(
        a_relation=a_relation,
        b_relation=b_relation
    ) %}

    {% set audit_results3 = run_query(audit_query3) %}

    {% do audit_results3.print_table() %}
{% endif %}

{% if execute %}
    -- step 4: compare every column value per row, joined on the primary key
    {% set audit_query4 = audit_helper.compare_all_columns(
        a_relation=a_relation,
        b_relation=b_relation,
        primary_key="ID"
    ) %}

    {% set audit_results4 = run_query(audit_query4) %}

    {% do audit_results4.print_table() %}
{% endif %}

-- placeholder so the model still returns a select statement
select 0 as audit_test

 

In the first step we use audit_helper.compare_row_counts, which counts the rows in each table and returns both totals so they can be compared.


In the second step we use audit_helper.compare_relation_columns, which checks the names, order, and data types of the columns in both tables.


In the third step we use audit_helper.compare_relations. This compares the full contents of both tables and returns the number and percentage of rows found in both, only in the dbt table, or only in the legacy table.


In the final step we use audit_helper.compare_all_columns. This joins the rows of both tables on the primary key and reports, for every column, how many values match, how many are NULL on one side only, and how many are missing or conflicting compared with the other table.


The “select 0 as audit_test” at the end of the model acts as a placeholder query: it avoids returning data while satisfying dbt's requirement that every model returns a select statement.

 

Once all models that replace the legacy tool transformations have been generated, the next step should be setting up tests for each model. 
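For instance, a singular test is simply a SQL file under the tests/ folder that fails when it returns rows; the model and column names below are hypothetical:

-- tests/assert_no_missing_ids.sql (hypothetical singular test)
-- the test fails if this query returns one or more rows
select *
from {{ ref('stg_pharma_sales') }}
where id is null

Generic tests such as unique and not_null can also be declared on the model's columns in its .yml file.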


Demo 

Here we have a database called db1 containing two schemas, Schema1 and Schema1__legacy. Each schema contains a table called “pharmaceuticals_india”.
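The two relations in the audit model can be pointed at these demo tables, for example by looking them up in the warehouse. This is a minimal sketch; adapter.get_relation is part of dbt's Jinja context and assumes both tables already exist:

-- resolve the two demo tables directly from the warehouse
{% set a_relation = adapter.get_relation(
    database='db1',
    schema='Schema1',
    identifier='pharmaceuticals_india'
) %}

{% set b_relation = adapter.get_relation(
    database='db1',
    schema='Schema1__legacy',
    identifier='pharmaceuticals_india'
) %}

Below we can see the output of comparing two tables that are 100% identical.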

[Screenshot: audit output for two fully matching tables]

To show the result of a failed audit, we removed a column from the table in Schema1. The output is as follows.

[Screenshot: audit output after removing a column from Schema1]

Conclusion 

The audit model we presented provides a comprehensive method to compare data between legacy systems and newly implemented dbt models, ensuring data consistency and accuracy throughout the transformation process. This approach allows users to easily identify discrepancies between datasets, whether they arise from differences in row counts, column structures, or data content. By utilizing the audit_helper package, these comparisons can be done with minimal overhead, making it a low-impact way to validate the accuracy of new models without requiring excessive resources or introducing complexity.


One of the key advantages of this methodology is that it offers granular visibility into potential issues. From basic row count mismatches to more intricate column and data integrity checks, this process enables users to pinpoint the exact nature of discrepancies. 


Ultimately, by leveraging dbt and audit_helper for model generation, organizations can achieve a smooth, low-risk transition from legacy ETL tools to modern, cloud-based data pipelines. This ensures that transformations are correct, scalable, and maintainable in the long term. Such an approach builds confidence in the migration process, enabling teams to embrace dbt’s benefits—like modularity, version control, and collaboration—without sacrificing the accuracy of critical business data.


Mathieu Dessers, CRM Business Unit Manager at Digital Hive



Aslan Hattukai

Data Engineer


