Ultimate dbt-utils Cheat Sheet
Last Updated: September 5, 2023
The dbt-utils package enhances the dbt experience by offering a suite of utility macros. Designed to tackle common SQL modeling patterns, it streamlines complex operations, allowing users to focus on data transformation rather than the intricacies of SQL. dbt-utils is a must-have tool for dbt aficionados!
dbt-utils Cheat Sheet
The dbt-utils package is a gem in the world of data transformations. Let this cheat sheet guide you swiftly through its features, ensuring you get the most out of dbt-utils. Enjoy!
SQL Generators
The SQL generators in the dbt-utils package streamline your modeling tasks. By automating common SQL patterns, they minimize manual coding and guarantee consistent, high-quality queries. Think of it as a handy toolkit for every dbt user's SQL endeavors!
| Macro | Description |
|---|---|
| date_spine(datepart, start_date, end_date) (view source) | This macro generates the SQL needed to build a sequence of dates. The output includes the 'start_date' (if it aligns with the specified datepart) but excludes the 'end_date'. See the example after this table. |
| deduplicate(relation, partition_by, order_by) (view source) | This macro generates the SQL needed to remove duplicate rows from a model, source, or CTE. |
| haversine_distance(lat1, lon1, lat2, lon2, unit='mi') (view source) | This macro computes the haversine distance between a pair of latitude/longitude coordinates. You can also specify the 'unit' of distance ('mi' or 'km'). |
| group_by(n) (view source) | This macro builds a "group by" clause for fields 1 through n. |
| star(from, except=[], relation_alias=False, prefix='', suffix='', quote_identifiers=True) (view source) | This macro generates a comma-separated list of all fields in the 'from' relation, excluding any fields listed in the 'except' argument. The optional arguments allow further customization, such as aliasing or adding a prefix or suffix to the selected columns. |
| union_relations(relations, exclude, include, column_override, source_column_name='_dbt_source_relation', where) (view source) | This macro unions together the relations passed in the 'relations' argument using 'union all', even when columns are ordered differently or missing from some relations. Columns that exist in only some relations are filled with 'null' where they are absent. A new column, '_dbt_source_relation', records which relation each record came from. |
| generate_series(upper_bound) (view source) | This macro implements a cross-database way to generate a series of numbers, producing a SQL result set that runs from 1 up to the specified upper bound. |
| generate_surrogate_key(<field list>) (view source) | This macro implements a cross-database way to generate a hashed surrogate key from the fields you specify (see the example after this table). Note: older versions of this macro were called 'surrogate_key()' and treated nulls and blank strings identically. If you need to revert to that behavior, for example to maintain backward compatibility, set the 'surrogate_key_treat_nulls_as_empty_strings' variable in your 'dbt_project.yml'. |
| safe_add(<field list>) (view source) | This macro implements a cross-database way to sum fields that may contain null values. |
| safe_divide(numerator, denominator) (view source) | This macro divides two values but returns null if the denominator is 0. |
| safe_subtract(<field list>) (view source) | This macro implements a cross-database way to subtract fields that may contain null values. |
| pivot(column, values, alias=True, agg='sum', cmp='=', prefix='', suffix='', then_value=1, else_value=0, quote_identifiers=True) (view source) | This macro pivots data by turning row values into column headers. |
| unpivot(relation, cast_to='varchar', exclude=none, remove=none, field_name=none, value_name=none) (view source) | This macro converts a table from a wide layout to a long layout, similar to the melt function in pandas. Boolean values are replaced with the strings 'true' or 'false'. |
| width_bucket(expr, min_value, max_value, num_buckets) (view source) | This macro builds equal-width histograms by dividing the range between 'min_value' and 'max_value' into uniform intervals and returning the bucket number that the evaluated expression falls into. The output is an integer bucket number, or null if any input is null. |
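As a quick illustration, here is a minimal sketch of how two of these generators might be combined in a single dbt model; the model name dim_dates and the derived date_key column are illustrative, not part of the package:

```sql
-- models/dim_dates.sql (illustrative model name)
-- date_spine emits a complete query returning one row per day,
-- so it is wrapped in a CTE and extended with a surrogate key.
with days as (

    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2023-01-01' as date)",
        end_date="cast('2024-01-01' as date)"
    ) }}

)

select
    date_day,
    {{ dbt_utils.generate_surrogate_key(['date_day']) }} as date_key
from days
```

Note that 'start_date' and 'end_date' are passed as SQL expressions (hence the quoted casts), and the generated column is named after the chosen datepart, date_day in this case.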
Generic tests
Within the dbt-utils package lies a set of generic tests, designed to validate your data effortlessly. These tests ensure consistency and quality, checking for common issues without the need to craft custom validation rules. It's data integrity made simple for dbt users.
| Test | Description |
|---|---|
| equal_rowcount(compare_model, group_by_columns) (view source) | This test asserts that two models or tables have the same number of rows. |
| fewer_rows_than(compare_model, group_by_columns) (view source) | This test asserts that the model has fewer rows than the model or table it is compared against. |
| equality(compare_model, compare_columns, group_by_columns) (view source) | This test asserts that two models or tables are equal. You can optionally specify a subset of columns to compare. |
| expression_is_true(expression, where) (view source) | This test asserts that a given SQL expression holds true for every record, which is particularly helpful for maintaining column integrity. For instance, it can confirm the result of simple algebraic operations between columns, validate the length of a column, or check the truth value of a column. Use the 'where' argument to check only a subset of records. See the example after this table. |
| recency(datepart, field, interval, group_by_columns) (view source) | This test asserts that the timestamp column in the given model contains data at least as recent as the specified interval. |
| at_least_one(group_by_columns) (view source) | This test asserts that the specified column has at least one value. |
| not_constant(group_by_columns) (view source) | This test asserts that a column does not have the same value in every row. |
| not_empty_string(trim_whitespace=true) (view source) | This test asserts that no values in the column equal the empty string. The optional 'trim_whitespace' argument trims whitespace from the column before the check. |
| cardinality_equality(field, to) (view source) | This test asserts that a column has the same number of unique values as a column in another table. |
| not_null_proportion(at_least, at_most=1.0, group_by_columns) (view source) | This test asserts that the proportion of non-null values in a column falls within the specified range. If you don't provide an upper limit, it defaults to 1.0 (100%). |
| not_accepted_values(values) (view source) | This test asserts that there are no rows containing the specified values. |
| relationships_where(to, field, from_condition, to_condition) (view source) | This test checks referential integrity between two models, like the built-in relationships test, while letting you filter out specific rows, such as test records or very recent records that may be temporarily inconsistent because of data-processing delays. |
| mutually_exclusive_ranges(lower_bound_column, upper_bound_column, partition_by=None, gaps='allowed', zero_length_range_allowed=False) (view source) | This test asserts that the range defined by 'lower_bound_column' and 'upper_bound_column' in each row does not overlap with the range of any other row. |
| sequential_values(interval=1, datepart=None, group_by_columns) (view source) | This test asserts that a column contains sequential values, either numeric or datetime. |
| unique_combination_of_columns(combination_of_columns, quote_columns=false) (view source) | This test asserts that the combination of the listed columns is unique. For instance, while a month or a product alone might repeat, the pairing of a month and a product should appear only once. See the example after this table. |
| accepted_range(min_value, max_value, where, inclusive=true) (view source) | This test asserts that a column's values fall within a range defined by a minimum and a maximum value. You can choose whether the range includes or excludes the boundary values, and you can filter to specific records with the 'where' argument. The test works with any data type that can be compared using the > or < operators. |
| Grouping in tests | Some tests accept an optional group_by_columns argument to perform the check within groups; the argument must be passed as a list. Tests that support it include: equal_rowcount, fewer_rows_than, recency, at_least_one, not_constant, sequential_values, and not_null_proportion. |
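Because these are generic tests, they are applied in a model's YAML properties file rather than called from SQL. Here is a minimal sketch, assuming a hypothetical orders model with order_id, order_line, status, amount_paid, and amount_total columns:

```yaml
# models/schema.yml (illustrative model and column names)
version: 2

models:
  - name: orders
    tests:
      # every row must satisfy this SQL expression
      - dbt_utils.expression_is_true:
          expression: "amount_paid <= amount_total"
      # the pair (order_id, order_line) must be unique across the table
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - order_line
    columns:
      - name: status
        tests:
          # the column must contain more than one distinct value
          - dbt_utils.not_constant
```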
Introspective macros
The introspective macros within the dbt-utils package are a window into your data's metadata. They empower you to dynamically interact with and understand the underlying structure of your datasets. It's like having a magnifying glass for the intricacies of your dbt projects!
| Macro | Description |
|---|---|
| get_column_values(table, column, where=none, order_by='count(*) desc', max_records=none, default=[]) (view source) | This macro returns an array of the distinct values of a column from the specified relation (see the example after this table). Note: setting order_by='my_column' sorts the values alphabetically, while order_by='min(created_at)' sorts them by when each value was first observed. |
| get_filtered_columns_in_relation(from, except=[]) (view source) | This macro returns a Jinja list of the columns in a given relation, excluding any columns listed in the 'except' argument. It is meant for actual relations, not CTEs. |
| get_relations_by_pattern(schema_pattern, table_pattern, exclude, database=target.database) (view source) | This macro returns a list of Relations that match the given schema or table name pattern. It pairs well with the 'union_relations' macro. |
| get_relations_by_prefix(schema, prefix, exclude, database=target.database) (view source) | Soon to be deprecated: use 'get_relations_by_pattern' instead. This macro returns a list of Relations that match a given prefix, with an option to exclude certain patterns. It works especially well with 'union_relations'. |
| get_query_results_as_dict(<sql query>) (view source) | This macro returns a dictionary from a SQL query, so you don't have to interact with the Agate library to process the result. |
| get_single_value(<sql query>) (view source) | This macro returns a single value from a SQL query, so you don't have to interact with the Agate library to process the result. |
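For example, get_column_values is often paired with the pivot SQL generator: the introspective macro queries the warehouse at compile time, and its result drives the columns that get generated. A minimal sketch, assuming a hypothetical stg_payments model with order_id, payment_method, and amount columns:

```sql
-- models/payments_pivoted.sql (illustrative model and column names)
-- Fetch the distinct payment methods at compile time...
{% set payment_methods = dbt_utils.get_column_values(
    table=ref('stg_payments'),
    column='payment_method'
) %}

select
    order_id,
    -- ...then generate one summed amount column per payment method.
    {{ dbt_utils.pivot(
        'payment_method',
        payment_methods,
        agg='sum',
        then_value='amount'
    ) }}
from {{ ref('stg_payments') }}
group by order_id
```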
Please contact us with any errors or suggestions.
Author:
Mayra Peña