Data Science

dbt 101: What You Need on Day 1 (and 10 Best Practices for the Day After)

· 19 min read

Someone hands you a folder. Inside are 240 .sql files. One adds a column to a customer table. Another joins that table to a payments table. A third aggregates that join into a weekly KPI. Two more reference the weekly KPI. Three dashboards point at those two. You are told to add a new field, and to ship by Friday.

In This Article

  1. What dbt actually is
  2. The five things a dbt project is made of
  3. Why dbt over a folder of SQL files
  4. Core vs Cloud: pick once, on cost
  5. A real stack: how we ship dbt to a US telecom client
  6. 10 best practices that separate a project that pays for itself from one you regret
  7. What to do this week

There is no diagram. There is no execution order. There is no test. There is a Slack message from someone who left the company in March that says “run them in alphabetical order I think”.

dbt is the answer to exactly this folder. It turns the pile of SQL into a graph the warehouse already understands, in the order the data needs them, with tests attached, with auto-generated docs that show you what depends on what before you change anything.

This article is not a setup guide. It assumes you have not used dbt and you want to know what it actually is before deciding whether to put your team on it. By the end you should know the five things dbt is made of, why those five things are useful, when to pick Core vs Cloud, and ten practices that separate a dbt project that pays for itself from one that becomes the next folder you inherit.

What dbt actually is

dbt is a transformation framework. It does not store data, it does not load data, it does not orchestrate non-SQL work. It takes SQL you write, organizes it into a dependency graph, and runs it in the right order against the warehouse you already have.

That sentence is short on purpose. The most expensive mistake teams make with dbt is buying it as a replacement for the whole stack. It is not. It is one box in a stack of boxes (extract, load, transform, store, visualize), and the box it owns is the T.

The five things a dbt project is made of

A dbt project is a directory. The directory has a dbt_project.yml at its root and a handful of subfolders. The contents fall into five categories that you need to know by name to read any production project without a guide.

Project. The directory itself. Holds the SQL, the YAML configs, the tests, the seeds, the docs.

Models. SQL files that describe how tables and views in the warehouse get built. One model is one SELECT statement plus a {{ config(...) }} block at the top. dbt supports five materializations for a model:

  • Table. The model becomes a real, persistent table. Refreshed on every run.
  • View. The model becomes a view. No storage cost; query cost paid each time the view is hit.
  • Materialized view. Warehouse-native materialized view where supported (Snowflake, BigQuery, Redshift). Like a view, but the warehouse maintains a cache.
  • Ephemeral. No physical object at all. Other models reference this one as an inline CTE. Useful for short reusable subqueries that are not worth materializing.
  • Incremental. A table that updates only new rows, not the whole thing, on each run. The right choice for any large event stream where rebuilding from scratch is too expensive.

dbt also supports Python models, but this article skips them. Most production projects are still 95% SQL.

DAG. dbt reads the ref('model_name') and source('schema', 'table') calls in every model and assembles a directed graph of dependencies. You then run the graph (dbt run), parts of it (dbt run --select +my_kpi+), or individual nodes. The DAG is the reason dbt exists. Once you have it, “what depends on this table” is a one-line answer, not a Slack archaeology trip.

Macros. Reusable SQL fragments written in Jinja. A macro can be as small as a date formatter or as large as a 200-line module that generates INSERT statements from a config dict. Macros are how a dbt project stays DRY. Jinja also gives you loops, variables, and for/if, which is the thing that finally makes SQL feel like a real language for non-trivial pipelines.

Seeds and snapshots. Seeds are small reference tables loaded from CSV files in the repo — country codes, currency tables, dimension lookups. Snapshots are the dbt mechanism for slowly-changing dimensions: point dbt at a source table and a uniqueness key, and it stores a history of every row’s state with valid_from/valid_to columns. If your source system overwrites in place but you need the history, snapshots are the answer.

A typical dbt project DAG. Sources on the left, staging models in the middle, marts on the right, BI consumers at the edge. Each arrow is one ref() or source() call.
A typical dbt project DAG. Sources on the left, staging models in the middle, marts on the right, BI consumers at the edge. Each arrow is one ref() or source() call.

Why dbt over a folder of SQL files

Without dbt you have execution order in someone’s head, no tests, and documentation that lives in an out-of-date wiki page. With dbt you get three things that change the ergonomics of the work:

Automation of execution order. The DAG is the contract. You write SQL that references other SQL via ref(). dbt figures out the order, runs in parallel where possible, fails fast, and reruns only the failed branch. The scheduler your team uses (Airflow, GitHub Actions, dbt Cloud itself, Cloud Scheduler + Cloud Run) calls one command (dbt build) and the framework owns the rest.

Data quality, declared in YAML. Every model gets a schema.yml next to it where you write tests: not_null, unique, accepted_values, relationships. Run dbt test and the framework runs every assertion against the materialized data, returns rows that fail, and exits non-zero on failure. The standard four tests cover 70% of the data-quality work a team does in their first year on dbt. The remaining 30% is custom tests, which are also SQL.

Documentation, generated from the code. Run dbt docs generate, then dbt docs serve, and you get a navigable site with every model, every column, every test, every source, every downstream dependency. It is not pretty by 2026 standards. It is, however, never out of date, because it is the project itself rendered as a website.

Core vs Cloud: pick once, on cost

dbt ships in two versions that share the same engine but diverge sharply on operational ergonomics.

dbt Coredbt Cloud
CostFree, open sourcePaid (trial + tiers from $100/seat/mo)
DeploymentLocal or VM you operateManaged cloud
SchedulerBring your own (Airflow, Cloud Run, GitHub Actions)Built-in
UINone (CLI + your IDE)Web IDE + run history + lineage browser
CIDIY (dbt build in your runner)First-class
Setup complexityHigh; need someone who runs servicesLow; sign up, point at warehouse
Best forTeams with engineering capacity and customization needsTeams whose priority is shipping models, not infrastructure

The decision is almost always a budget decision. If you already operate Airflow or have someone on staff who runs services, Core costs less than Cloud at any team size above three. If you do not have that person, Cloud pays for itself by week two.

There is no middle path. Running Core badly is worse than running Cloud well; pick one and commit.

A real stack: how we ship dbt to a US telecom client

A concrete example beats an abstract one. Here is the production stack we run for a US telecom that ripped out a 12-year-old internal CRM and rebuilt analytics from scratch. The data volume forced the rebuild; the old system stalled on 30,000-row CSV exports.

  • Warehouse: BigQuery
  • Transformation: dbt Core (Core was chosen for cost; the team’s data engineer was already comfortable running services)
  • Visualization: Metabase
  • Code: GitHub, with a protected main branch and required CI checks
  • Scheduler: GCP Cloud Scheduler triggers a Cloud Run service every hour; the service runs dbt build (= run + test + snapshot in one pass)
  • CI/CD: Cloud Build Trigger on every push to main; builds a Docker image, pushes to Artifact Registry, the next Cloud Run invocation picks it up
  • Observability: Cloud Monitoring on the Cloud Run logs, with alerts on dbt test failures

The shape is “dbt with a GCP stack around it”. Nothing exotic. The total bill, including BigQuery query cost, is under $400/month for a five-person analytics team. Metabase does not store data — it queries BigQuery directly — so the analytics pipeline is effectively near-real-time: data lands in BigQuery, dbt processes it within the hour, dashboards reflect it on next refresh.

The project’s directory has two folders that matter: staging/, where most models are ephemeral or incremental, and core/, where BI-facing models are materialized as tables for query speed.

The production dbt stack at the US telecom: GCP Cloud Scheduler triggers Cloud Run hourly, dbt Core runs against BigQuery, Metabase queries BigQuery directly. Cloud Build pushes new images to Artifact Registry; Cloud Monitoring catches failures.
The production dbt stack at the US telecom: GCP Cloud Scheduler triggers Cloud Run hourly, dbt Core runs against BigQuery, Metabase queries BigQuery directly. Cloud Build pushes new images to Artifact Registry; Cloud Monitoring catches failures.

10 best practices that separate a project that pays for itself from one you regret

The first three are non-negotiable, taken straight from running dbt in production for years. The next seven are the ones that matter once your project crosses 50 models.

1. Reference everything through ref() and source(). Never hard-code a table name. The DAG is built from these two function calls. A hard-coded customer.dim_customers reference works on day one and breaks the lineage graph the moment someone renames a model or changes a schema. The cost of {{ ref('dim_customers') }} is zero; the cost of repaying a broken DAG is two days of someone tracing dependencies by grep.

2. Use incremental materializations for any table over 50 million rows. Rebuilding a 500-million-row events table from scratch on every hourly run costs real money on BigQuery and Snowflake. Incremental updates with a unique_key and a watermark column (event_timestamp > (SELECT max(event_timestamp) FROM {{ this }})) cut the cost by 95% in our telecom case. Set up the incremental config when you first see the model cross 50M rows, not after the warehouse bill arrives.

3. Use upstream and downstream selection. dbt run --select +my_kpi+ rebuilds the model and everything it depends on plus everything that depends on it. dbt run --select model+ is downstream only. Useful for debugging (“which dashboards break if I change this?”) and for cost control during development.

4. Layer the project: staging, intermediate, marts. The dbt Labs canonical layering, and it is canonical because it works. staging/ models do one thing — rename, cast, light cleanup — one staging model per source table. intermediate/ models do business logic that is reused across marts. marts/ models are the things BI consumes. Without this layering, every model is doing too much and the DAG becomes unreadable.

5. Declare every source in sources.yml with freshness checks. A model that pulls from an orders table needs to know that table came from somewhere and that it should be fresh. sources.yml gives you both: the warehouse path is one config line, and freshness: warn_after: { count: 6, period: hour } will surface stale data before a dashboard does.

6. Test the four basics on every model: not_null, unique, accepted_values, relationships. Most data-quality incidents come from one of these four being violated. Adding them in YAML is two lines per column. The day a vendor changes an upstream pipeline and starts emitting nulls in a key column, your dbt run fails and Slack lights up, which is exactly what you want.

7. Strict naming convention so the directory is readable. Use prefixes that match the layer: stg_ for staging, int_ for intermediate, dim_ and fct_ for marts. Running tree models/ should tell a new engineer the shape of the project before they open a single SQL file. Convention is free; the absence of it is what makes dbt projects feel chaotic at scale.

8. Environments and targets: never run production from a laptop. profiles.yml supports multiple targets (dev, ci, prod). Local runs hit a personal dev schema; CI runs hit a CI schema; production runs hit prod and only run from the production scheduler. The cost of someone running dbt run --target prod from their laptop and clobbering production is high enough that you set up the guardrail on day one.

9. Choose materializations on cost, not vibes. Default to view. Switch to table when a downstream model is slow because of the view. Switch to incremental when a table is too expensive to rebuild fully. Switch to ephemeral when a model is a one-time intermediate used by exactly one downstream model. There is a real cost difference between these choices at any production volume; pick on data, not preference.

10. CI with state:modified+ --defer --state path/to/prod-manifest. Run only the models a PR changes, plus their downstream consumers, against deferred prod state. A naive CI that runs every model on every PR is unusable above 100 models. With state:modified+, CI on a 500-model project routinely takes under 10 minutes per PR. This is the single highest-impact dbt practice for any team with more than two analytics engineers.

The 10 practices grouped by what they buy you: correctness, cost, velocity. Three are non-negotiable from day one; seven matter once the project crosses 50 models.
The 10 practices grouped by what they buy you: correctness, cost, velocity. Three are non-negotiable from day one; seven matter once the project crosses 50 models.

What to do this week

If your team has a folder of orphan SQL scripts and one person who knows the run order, you already have the input for a dbt project. The minimum useful first project is:

  • One source declared in sources.yml (your warehouse’s main events or orders table)
  • Three staging models that rename + cast that source
  • One mart model that joins the staging models into a thing BI uses
  • Tests on the join keys (not_null, unique, relationships)
  • A scheduler that runs dbt build once an hour

That entire project is under 200 lines of SQL and YAML, will run in under a minute, and will save the team’s first hour of “which script do I run again?” within a week.

If your team already runs dbt and is somewhere between 30 and 100 models, the highest-ROI move this month is best practice #10 above: configure CI with state:modified+ --defer so PRs run in minutes instead of hours. Everything else can wait; this one buys back review velocity for the rest of the year.

And if you are looking at dbt for the first time and have not picked Core or Cloud, default to Cloud unless you already have engineers who run services. The setup cost of Core compounds; the subscription cost of Cloud is bounded. Most teams under 20 models are better off with Cloud’s first tier.


About the author

Nick Valiotti is the founder of Valiotti Data. 15+ years building analytics infrastructure for SaaS, marketplaces, and consumer subscription. 50+ production deployments across BigQuery, Snowflake, dbt, Metabase, and modern BI stacks. Author of two books on data strategy. LinkedIn · Discovery call.

Keep reading

Enjoyed this article?

Get weekly data strategy insights delivered to your inbox.

Get in Touch

Let's Discuss Your Project

Book a 30-minute discovery call. We'll assess your data maturity and recommend the right approach — no strings attached.

Book a Discovery Call →
Need help with your data strategy? Book a Discovery Call →