Can dbt connect to multiple databases?

No. dbt is designed to transform data that already exists in a data warehouse. It is explicitly not a data loader, nor can it execute transformations across databases. For this reason, dbt may only be active on one database per project.

How do I run a dbt project?

dbt projects are ran with the command “dbt run.” Before you can run a project, however, a user must first initialize dbt and configure a proper connection to a data warehouse.

Yes, dbt Core is an open source command-line tool released under the Apache License. This is not to be confused with dbt Cloud, which is a proprietary, hosted solution offered for sale by dbt.

What is a dbt package?

dbt packages are libraries of code that are modularized and installed as plugins to dbt. These packages offer additional functionality to the core product and act as standalone dbt projects, with models and macros that tackle a specific area.

Does dbt create tables?

Yes, dbt can create both tables and views when provided with source data and transformation logic. Note that dbt itself does not create source data— an external loading tool is necessary.

Does dbt work with SQL Server?

Yes, dbt has a package “dbt-sqlserver” that enables connections with Microsoft SQL Server.

No, dbt does not directly store data. Instead, dbt is used to transform existing data for analysis and computation. The result of these transformations are stored in a warehouse by dbt.

How do you use dbt with snowflake?

dbt can be configured with Snowflake by first initializing a dbt project then installing the dbt-snowflake package: https://github.com/dbt-labs/dbt-snowflake

How is dbt different from airflow?

Airflow is an all-purpose orchestration tool, used to schedule, construct, and execute jobs on an almost infinite number of platforms. dbt, by contrast, exists only within the realm of a data warehouse— it relies on an external scheduler and host to perform transformations and only acts on one source, data warehouses.

Database/Cloud

What Are dbt Sources?

Q: What are sources in dbt?

Sources make it possible to reference imported data objects in a dbt project structure. By declaring tables as sources in dbt, you can directly reference them in your models, test assumptions about source data, and calculate “freshness” (the recency of loaded data).

We dive into what dbt sources are, how they work, and how you can use them to advance your data warehouse.

Matt Palmer

Dec 16, 2022 • 6 min read

Intro

Data driven organizations are built on clean, reliable datasets and foundational analysis. These assets can provide tremendous value, but they rely on quality data and production-grade data systems.

Improving data quality starts with the very first step in the transformation process: importing data into the warehouse. These imported datasets form source data for the following transformations and, ultimately, underpin the finished data product.

Today, we’ll be discussing sources in the popular data transformation tool dbt (data build tool— yes, all lowercase!).

To understand sources, we must first understand dbt: the tool, its functionality, and implementation. After a brief overview of dbt, we’ll dive into what dbt sources are, how they work, and how you can use them to advance your data warehouse.

What is dbt?

dbt is short for data build tool, which summarizes its functionality quite nicely: dbt is a transformation workflow that helps to organize programmatic data warehouse manipulations.

Created by Fishtown Analytics, dbt has evolved into a venture-backed data giant through its paid offering dbt Cloud. The CLI (command line interface) that underlies dbt Cloud, dbt Core, is still free, open-source and used by many (though it lacks the GUI and advanced functionality of Cloud).

At its simplest, dbt is a way to manage a collection of SQL transformations that define a data warehouse. Under the hood, dbt contains much more: a code compiler, dependency management system, documentation framework, testing infrastructure, and jinja implementation mechanism.

Ultimately, dbt’s functionality serves to optimize data workflows and provide robust analysis and production-grade, version-controlled data systems.

How does dbt work?

Overview

dbt acts as an orchestration layer, which sits on top of a data warehouse and organizes transformations. If you’ve heard of the popular tool Apache Airflow, the concept of dbt should be familiar, though the tools differ in a number of ways.

dbt accepts SQL and jinja transformations, organizing them into DAGs (directed acyclic graphs) that are then executed against a data warehouse.

Example of directed acyclic graphs — *An example lineage graph (DAG) from a sample dbt project (Source*).

By defining references, dbt can track model dependencies. For example in the figure above, stg_customers contains a ref(‘jaffle_shop’, ‘customers’)— this tells dbt that stg_customers is downstream from our source.

Simple but powerful, references are key to managing complex data workflows and dependencies, since they determine the order in which models are executed.

As mentioned, dbt contains a wealth of additional functionality. But to understand sources, we need only know that dbt is an orchestration tool, programmatically executing templated SQL through the use of references and compiled queries.

dbt project structure

While a dbt project can be structured in any number of ways, dbt best practices dictate a practical approach to building projects.

First, sources are imported into a warehouse and declared. Since dbt is focused on transforming data that already exists in a warehouse, it’s up to the warehouse maintainer to pull in relevant data through extract and load processes or other means.

dbt does offer seed functionality for loading CSVs, but it is only recommended for smaller datasets. For larger data sets, tools such as Zuar's data pipeline solution Runner could be a better fit.

Once sourced, raw data is then staged. Staging provides a temporary area for minor transformations, and helps to mitigate common data pitfalls like upstream changes, inconsistent formats, and more.

Learn more about data staging:

Finally, data is cleaned and larger transformations are applied. The resulting tables are exposed to analysts and stakeholders for consumption.

Since sources are the foundation on which the warehouse is built, it’s essential to invest in architecture and testing. Properly configuring sources is essential to use dbt to its fullest extent. Now that we understand the complete dbt pipeline, we can discuss sources in-depth.

What are dbt sources?

Sources make it possible to reference data objects in a dbt project structure. By declaring tables as sources in dbt, you can directly reference them in your models, test assumptions about source data, and calculate “freshness” (the recency of loaded data).

Sources are defined in yaml files, a format common in dbt, nested under a sources key. Here’s an example source from the dbt docs:

version: 2

sources:
- name: jaffle_shop
tables:
- name: orders
- name: customers

- name: stripe
tables:
- name: payments

Again, these are for tables that already exist in a warehouse or will be seeded using dbt. Once defined, sources can be referenced from a model using the source function:

SELECT
...
FROM {{ source('jaffle_shop', 'orders') }}
LEFT JOIN {{ source('jaffle_shop', 'customers') }} USING (customer_id)

dbt will compile these jina templates to the full table name upon execution, creating proper SQL references and a dependency:

SELECT
...
FROM raw.jaffle_shop.orders
LEFT JOIN raw.jaffle_shop.customers USING (customer_id)

Example of source functionality in dbt — *The source functionality forms the basis for creating tables in dbt (Source*).

Thus, sources create the foundation for all subsequent transformations in dbt— by establishing source data, enable metadata, lineage, and dependency tracking for all downstream tables.

Why use dbt sources?

Sources bring more benefit than simply creating references and dependencies. By declaring sources in dbt, we can add metadata that is rendered to a part of dbt’s document site and create tests to ensure data consistency and freshness.

All versions of dbt ship with project documentation functionality (see dbt’s example site). Declaring sources and references allow dbt users to take full advantage of this.

Additionally, not all data loading tools report on the status of their transformations. Such tools may silently fail, resulting in stale or incorrect data. Using dbt sources allows us to declare tests and assertions on imported data: for example, checking columns for uniqueness, validating non-null values, and ensuring data freshness.

Data observability is crucial to producing consumable data and it starts at the source— hence the power of dbt sources!

Make it Easier to Work With dbt

While dbt is a powerful tool, particularly for developers who like writing SQL, it's automation capabilities are limited. That's where an automated data pipeline solution, such as Zuar's Runner platform, can fill the gap.

Runner allows developers to automate dbt scripts, potentially saving hundreds of hours of manual tasks.

If you are considering implementing dbt, or are currently utilizing this tool, we recommend using it in conjunction with Runner to create a robust, scalable data pipeline. Schedule a free data strategy assessment to see if Runner is a good fit!

Learn More About Runner