Data Terminology: 100 Essential Terms for 2026

This data glossary covers 100 terms a working data team uses in 2026, organized into eight categories from warehouse fundamentals to streaming patterns. Each definition includes a real US business example and one line on when the term actually applies. Each definition is short, with a concrete US example and one line on when you’d actually reach for it. Curated by Nick Valiotti, Fractional CDO with 15+ years building data teams across e-commerce, mobility, and SaaS.

In This Article

Foundations
Data Modeling
Storage & Warehouses
Pipelines & Integration
Analytics & BI
ML/AI & MLOps
Governance & Quality
Streaming & Real-Time

01 · Category

Foundations

The vocabulary every data conversation starts with. These twelve terms come up in every kickoff, every roadmap, and every executive review.

Big Data

Datasets too large or fast-moving for a single server to handle with standard SQL. The label usually kicks in past 10TB of warehouse storage or when streaming ingestion crosses a few thousand events per second.

ExampleA Brooklyn ride-share startup logs 40 million GPS pings per day across 12,000 drivers. That feed lands in BigQuery and cannot be queried in Postgres anymore.

When to useReach for this label when your team starts arguing about partitioning, clustering, or moving off Postgres to a warehouse.

Business Intelligence

The practice of turning company data into reports, dashboards, and answers that operators actually use. BI sits between the warehouse and the people making pricing, hiring, and inventory decisions.

ExampleA Chicago coffee chain uses Looker to track $2.4M in weekly revenue, broken down by store, daypart, and menu category. The COO checks it every Monday at 8am.

When to useUse this term when a team is deciding on a reporting tool, building self-serve analytics, or replacing spreadsheets sent over email. Avoid it when the real question is about data modeling or pipeline architecture; those are different problems with different solutions.

Dashboard

A single screen that shows the handful of numbers an operator needs to make decisions today. Good dashboards answer one role’s questions, not everyone’s.

ExampleThe head of growth at an Austin Shopify merchant opens one Looker dashboard each morning: CAC by channel, $180K spend pacing, and 7-day ROAS. Five tiles, no scroll.

When to useBuild one when a specific person checks the same numbers more than twice a week and currently pulls them from three places.

Data

Recorded facts about the business: transactions, events, customer attributes, product states. Data only becomes useful once it is collected somewhere queryable and tied to a clear definition.

ExampleEach Stripe payment generates a row with charge ID, amount in cents, customer ID, and timestamp. Those rows are the raw material for revenue reporting at any US SaaS company.

When to useUse the word carefully: ‘data’ without context (which table, which grain, which source) usually starts arguments in meetings.

KPI

A Key Performance Indicator: one of the 3 to 7 numbers a team commits to moving this quarter. Every KPI has an owner, a target, and a clear definition in the metric layer.

ExampleA Series B fintech in Manhattan sets one company KPI: activated users per week, target 2,400 by Q3. The CEO reviews it every Tuesday with the growth lead.

When to useApply this label to 5 numbers or fewer per team per quarter. If a team has 20 KPIs, each one is competing with the others for attention, which means none of them is steering decisions.

Metadata

Data about data: column descriptions, table owners, freshness timestamps, source systems, and update cadence. Metadata is what lets a new analyst answer questions on day three instead of day thirty.

ExampleIn dbt, every model in a 400-table warehouse at a Boston insurtech has an owner field, a description, and SLA tags. Atlan reads those and powers search across the whole stack.

When to useInvest in metadata once your warehouse passes about 100 tables or three analysts.

Metric

A defined calculation that produces a number, with explicit grain, filters, and time window. A metric is reproducible: two analysts running it should get the same value.

Example‘Weekly active sellers’ at an Atlanta marketplace means distinct seller IDs with at least one completed order in the trailing 7 days. That definition lives in their semantic layer.

When to useWrite down the metric definition the first time two people in a meeting give different numbers for the same name.

Modern Data Stack

The cloud-native pattern that replaced on-prem data warehouses around 2018: managed ingestion (Fivetran), cloud warehouse (Snowflake, BigQuery), transformation (dbt), and a BI tool on top.

ExampleA 50-person Series A in San Francisco runs Fivetran into Snowflake, transforms with dbt Cloud, and ships dashboards in Hex. Total tooling spend: about $4,200 per month.

When to useDefault to this pattern for any US startup under 500 employees that does not have a strong on-prem legacy.

Raw Data

Records loaded into the warehouse exactly as they arrived from the source: no joins, no renames, no cleanup. Raw is the audit trail you fall back on when downstream logic breaks.

ExampleFivetran loads Stripe charges into a Snowflake schema called raw_stripe. Column names match the Stripe API exactly, including the typo in ‘recieved_at’.

When to useKeep raw layers untouched. Build your cleanup in a separate staging schema so you can replay history.

Semi-Structured Data

Records with a flexible shape, usually JSON or XML, where fields can vary row to row. Warehouses like Snowflake and BigQuery store this natively in VARIANT or JSON columns.

ExampleEvent payloads from Segment land in a single column as JSON. One event has 12 properties, the next has 30. A New York media company queries them with Snowflake’s FLATTEN function.

When to useExpect this format from product event streams, API webhooks, and most third-party SaaS exports.

Structured Data

Records that fit a fixed schema: defined columns, defined types, predictable shape. Rows in a Postgres orders table or a Salesforce opportunities export are structured by default.

ExampleA Denver e-commerce brand’s orders table has 18 columns: order_id (string), total_cents (int), placed_at (timestamp), and so on. Every row matches that contract.

When to useMost CRM, billing, and ERP data is already structured. Treat it as the easiest category to model in dbt.

Unstructured Data

Content without a tabular shape: PDFs, support tickets in free text, call recordings, product photos, chat transcripts. You need parsing or embeddings before this data shows up in a SQL query.

ExampleA Seattle telehealth provider has 2.1 million patient chat messages. They run OpenAI embeddings, store vectors in pgvector, and surface themes in a weekly clinician report.

When to useTackle this once your structured analytics is solid. Unstructured work needs ML infrastructure and clear ROI before it pays off.

02 · Category

Data Modeling

How you shape data before it hits a dashboard. These thirteen terms cover the structures and conventions that decide whether your warehouse stays readable past quarter two.

Denormalization

Denormalization is the practice of intentionally duplicating data across tables to make reads faster. You trade storage and write complexity for query speed, which is the right call in analytical warehouses like Snowflake or BigQuery.

ExampleA Brooklyn DTC brand flattens orders, customers, and products into one wide table so the marketing team gets dashboard load times under 2 seconds instead of 18.

When to useUse denormalization when read latency on BI dashboards matters more than perfect storage efficiency.

Dimension

A dimension is a table that describes the context around a measurable event: who, what, where, when. Dimensions hold attributes you filter, group, and slice by in reports.

ExampleAn Austin Shopify merchant keeps a customer dimension with 14 attributes (state, signup channel, LTV bucket) and joins it to every order fact for cohort analysis.

When to useBuild a dimension whenever an entity (customer, product, store) needs descriptive attributes shared across many fact tables.

Dimensional Modeling

Dimensional modeling is Ralph Kimball’s approach to organizing analytical data into fact tables (events, measures) surrounded by dimension tables (context). It optimizes warehouses for business-user queries rather than transactional integrity.

ExampleA Chicago coffee chain models 4M weekly transactions as one sales fact plus dimensions for store, product, time, and loyalty member, cutting analyst query writing time in half.

When to usePick dimensional modeling when self-serve BI tools (Looker, Tableau) need to expose data to non-engineers.

Fact Table

A fact table stores the measurable events of a business: orders, clicks, payments, subscriptions. Each row is one event with foreign keys to dimensions plus numeric measures like revenue or quantity.

ExampleAn Atlanta SaaS company has a subscription_events fact table with 12M rows: each row a signup, upgrade, or churn, with MRR delta in USD.

When to useCreate a fact table for every business process you want to count, sum, or trend over time.

Natural Key

A natural key is an identifier that comes from the source system or real world: a Stripe customer ID, a SKU, an email address. It carries meaning outside the warehouse.

ExampleA Portland marketplace uses the Stripe customer ID (cus_NffrFeUfNV2Hib) as the natural key when joining payment data to product analytics from Segment.

When to useKeep natural keys visible in your models so analysts can trace a warehouse row back to its source record.

Normalization

Normalization is the process of splitting data into related tables to remove duplication and enforce consistency. Third normal form (3NF) is the standard target in OLTP systems like Postgres or MySQL.

ExampleA Brooklyn pet care startup normalizes customer addresses into a separate table so a single address update propagates to all 1,200 of that customer’s past invoices.

When to useNormalize transactional schemas where data integrity, updates, and storage cost matter more than read speed.

OLAP

OLAP (Online Analytical Processing) refers to systems built for fast queries over large historical datasets: aggregations, group-bys, time-series rollups. Modern OLAP runs on columnar engines like Snowflake, BigQuery, ClickHouse, and DuckDB.

ExampleA Denver e-commerce team scans 400M order lines in BigQuery to compute year-over-year revenue by state in 6 seconds, something Postgres would take 40 minutes to return.

When to useReach for an OLAP system when analysts need to scan millions of rows without hurting your production database.

OLTP

OLTP (Online Transaction Processing) describes databases built for high-volume, low-latency writes and point reads. Postgres, MySQL, and DynamoDB are typical OLTP engines powering production apps.

ExampleA Stripe payment confirmation hits an OLTP Postgres row in 8ms, then a CDC pipeline ships that row to Snowflake for analytics within 90 seconds.

When to useUse OLTP for any system where users expect single-record reads and writes in under 100ms.

Schema

A schema is the structure of your data: tables, columns, types, and relationships. In warehouses like Snowflake or Redshift, schema also refers to a logical namespace that groups related objects.

ExampleA Miami fintech keeps separate schemas in Snowflake for raw_stripe, staging, and marts, so a junior analyst cannot accidentally query raw PII tables.

When to useDefine a schema before you load data, even messy data: ad-hoc structure becomes ad-hoc debugging two months later.

Slowly Changing Dimension (SCD)

A Slowly Changing Dimension is a pattern for tracking how dimension attributes change over time. Type 1 overwrites old values, Type 2 keeps history with effective-from and effective-to dates.

ExampleA Boston B2B SaaS uses SCD Type 2 on the accounts dimension so churn analysis correctly attributes lost revenue to the plan tier active at cancellation, not today’s tier.

When to useApply SCD Type 2 whenever historical accuracy of dimension attributes matters for cohort or attribution analysis.

Snowflake Schema

A snowflake schema is a dimensional model where dimension tables are normalized into multiple related sub-tables. It saves storage but adds joins, which slows queries and confuses business users.

ExampleA Seattle retailer normalizes the product dimension into product, category, and subcategory tables, then spends 3 sprints unwinding it back to a flat dimension after analysts complain about join logic.

When to usePick a snowflake schema only when dimension storage is genuinely expensive and you control all query writing centrally.

Star Schema

A star schema is a dimensional model with one central fact table connected to denormalized dimension tables. The shape resembles a star, hence the name. It is the workhorse layout for BI-friendly warehouses.

ExampleAn Austin food delivery startup models 80M orders as one orders fact joined to 5 flat dimensions (customer, restaurant, courier, time, promotion), giving Looker dashboards sub-second response.

When to useDefault to a star schema for new marts that business teams will query directly through a BI tool.

Surrogate Key

A surrogate key is a warehouse-generated identifier (often a hash or sequence) that uniquely identifies a row independent of the source system. It shields your model from changes in upstream IDs.

ExampleA Chicago marketplace generates an MD5 surrogate key on (stripe_customer_id, source_system) so two Stripe accounts with reused customer IDs do not collide in the unified customers dimension.

When to useUse surrogate keys in dimensions whenever you load data from more than one source system or expect source IDs to change.

03 · Category

Storage & Warehouses

The systems that hold your data. Fifteen terms covering warehouses, lakes, lakehouses, and the table formats and storage layers underneath them.

Amazon Redshift

Amazon Redshift is AWS’s managed cloud data warehouse built on a modified PostgreSQL engine. It stores data in columnar format across compute nodes and is optimized for running analytical SQL on terabytes of structured data.

ExampleA Brooklyn e-commerce brand piped order data from Shopify into Redshift, cut nightly reporting jobs from 4 hours to 18 minutes, and paid about $2,400 per month for a ra3.xlplus cluster.

When to usePick Redshift when your stack already lives in AWS and you want tight integration with S3, Glue, and IAM.

Apache Iceberg

Apache Iceberg is an open table format that adds database-like guarantees on top of files in object storage. It supports schema evolution, hidden partitioning, time travel, and ACID writes from engines like Spark, Trino, Snowflake, and DuckDB.

ExampleA fintech in Austin moved 80TB of transaction logs from raw Parquet to Iceberg on S3, then queried the same tables from both Snowflake and Spark without copying data.

When to useReach for Iceberg when multiple engines need to read and write the same large datasets safely.

ClickHouse

ClickHouse is an open-source columnar database designed for fast analytical queries on huge tables. It compresses data aggressively and can return aggregations over billions of rows in under a second on modest hardware.

ExampleA Chicago ad-tech firm stores 12 billion impression events in ClickHouse and serves a customer-facing dashboard with p95 latency of 230 milliseconds, running on three c6i.4xlarge nodes.

When to useUse ClickHouse when you need real-time analytics, event data, or low-latency user-facing dashboards.

Columnar Database

A columnar database stores data by column instead of by row, which makes scanning a few columns across many rows extremely fast. It also compresses better because values in one column tend to be similar.

ExampleAn Atlanta SaaS company switched a 600-million-row events table from Postgres to BigQuery, dropped aggregation queries from 90 seconds to 2 seconds, and cut storage by roughly 70 percent.

When to useChoose columnar storage for analytics workloads that scan large tables but touch only a handful of columns.

Data Lake

A data lake is a centralized store of raw files (CSV, JSON, Parquet, images, logs) in cheap object storage like S3 or GCS. Schema is applied at read time, so you can dump data first and structure it later.

ExampleA logistics startup in Dallas lands raw GPS pings, driver app logs, and Salesforce exports into an S3 lake, then transforms only what dbt needs into a curated warehouse layer.

When to useUse a data lake when you need cheap storage for raw and semi-structured data before deciding how to model it.

Data Lakehouse

A data lakehouse combines a data lake’s cheap file storage with warehouse-style features like ACID transactions, schema enforcement, and SQL performance. Table formats like Delta Lake, Iceberg, and Hudi sit on object storage to make this work.

ExampleA Seattle marketplace runs Databricks on top of Delta tables in S3 for both ML training and Looker dashboards, paying around $14,000 per month instead of $32,000 on a previous warehouse-plus-lake setup.

When to usePick a lakehouse when ML workloads and BI queries need to share the same underlying data without copies.

Data Warehouse

A data warehouse is a database tuned for analytics rather than transactions. It stores cleaned, modeled data from many source systems and is queried by BI tools and analysts with SQL.

ExampleA direct-to-consumer apparel brand in Los Angeles pipes Shopify, Klaviyo, and Meta Ads data into Snowflake, builds 40 dbt models, and gives the marketing team a single source of truth on $18M in annual revenue.

When to useSet up a warehouse once you have data in more than two source systems and analysts asking the same questions twice.

Delta Lake

Delta Lake is an open-source storage layer that adds ACID transactions, versioning, and schema enforcement to Parquet files on object storage. It was originally built by Databricks and is now widely used outside it.

ExampleA Denver health-tech company stores 4 years of patient event data in Delta tables on S3, uses time travel to reproduce a model from 6 months ago, and runs MERGE statements for nightly CDC from Postgres.

When to useUse Delta Lake when you need reliable upserts, history, and concurrent writes on top of cheap object storage.

Google BigQuery

BigQuery is Google Cloud’s serverless data warehouse. You pay for storage and per-query bytes scanned (or reserved slots), and you do not manage clusters, nodes, or vacuum operations.

ExampleA media company in Miami serves 8TB of article event data from BigQuery, runs about 12,000 queries per day, and keeps the monthly bill near $3,800 by using partitioning and clustering on the events table.

When to useChoose BigQuery when you want zero infrastructure overhead and unpredictable query volume from many analysts.

MySQL

MySQL is an open-source relational database used heavily for web and SaaS application backends. It is row-oriented, transactional, and tuned for many small reads and writes rather than analytical scans.

ExampleA Brooklyn coffee subscription startup runs its entire ordering and billing app on a single 4-core RDS MySQL instance handling 1,200 transactions per minute at peak.

When to usePick MySQL for operational workloads behind a product, not for cross-system reporting on years of history.

NoSQL Database

NoSQL covers databases that drop the strict relational model in favor of document, key-value, graph, or wide-column structures. They trade some join power and consistency for flexible schemas and easier horizontal scaling.

ExampleA Boston food-delivery app stores user sessions and cart state in DynamoDB, handling 60,000 reads per second during the Friday dinner rush at a fraction of relational costs.

When to useUse NoSQL when your access pattern is well-known, schemas vary, and you need to scale writes horizontally.

Object Storage

Object storage holds files as objects in flat namespaces (buckets) accessed over HTTP, with services like Amazon S3, Google Cloud Storage, and Azure Blob. It is durable, cheap per gigabyte, and the foundation for most modern data platforms.

ExampleA Houston energy analytics team stores 240TB of sensor archives in S3 Standard for $5,500 per month, plus another 1.8PB in S3 Glacier for cold backups at roughly a tenth of the price.

When to useUse object storage as the landing zone for raw data and the backing store for lakes and lakehouses.

Partitioning

Partitioning splits a large table into smaller chunks by a column like date or region. Queries that filter on that column scan only the relevant partitions, which cuts cost on usage-priced warehouses and speeds up scans everywhere else.

ExampleA Phoenix retail chain partitioned a 9-billion-row sales table in BigQuery by order_date and dropped a typical 30-day analyst query from scanning 1.2TB to scanning 38GB.

When to useAdd partitioning once a table grows past about 100 million rows and most queries filter on the same column.

PostgreSQL

PostgreSQL is an open-source relational database known for SQL compliance, strong consistency, and a rich extension system (PostGIS, pgvector, TimescaleDB). It powers application backends and also serves as a respectable small-scale analytics database.

ExampleA Nashville B2B SaaS company runs the full product on a single RDS Postgres instance with 200GB of data, plus pgvector for similarity search across 1.4 million support tickets.

When to useDefault to Postgres for application data, geospatial workloads, and any case where you need SQL plus extensibility.

Snowflake

Snowflake is a cloud data warehouse that separates compute from storage, so multiple teams can run independent virtual warehouses against the same data. It runs on AWS, Azure, and GCP, and bills per second of compute.

ExampleA San Francisco fintech keeps 22TB in Snowflake and spins up a dedicated XS warehouse per team (data eng, analytics, ML), keeping the monthly bill around $9,200 with auto-suspend on idle.

When to usePick Snowflake when you need workload isolation, easy data sharing, and predictable performance without tuning.

04 · Category

Pipelines & Integration

The moving parts that get data from source systems into a warehouse and back out again. Fourteen terms covering ingestion, transformation, and orchestration.

Airflow

Apache Airflow is an open-source orchestrator that lets you define data workflows as Python code (DAGs) and run them on a schedule. It tracks task dependencies, retries failures, and shows job status in a web UI.

ExampleA Brooklyn ad-tech startup runs 140 Airflow DAGs each night to pull Meta and Google Ads spend into Snowflake, then trigger dbt models at 4 a.m. ET before standups.

When to useReach for Airflow when you have many scheduled jobs across many systems and need engineering-grade control over retries and dependencies.

Airbyte

Airbyte is an open-source data ingestion platform with 350+ prebuilt connectors that move data from SaaS apps and databases into your warehouse. It runs as a managed cloud service or self-hosted on your own infrastructure.

ExampleA Chicago coffee chain uses Airbyte Cloud to sync Toast POS, Square payouts, and Klaviyo email events into BigQuery for $400 a month, replacing a brittle in-house Python script.

When to useChoose Airbyte when you want Fivetran-style connectors but need lower cost, self-hosting, or the ability to fork a connector yourself.

Change Data Capture (CDC)

Change Data Capture is a method of replicating only the rows that changed in a source database, by reading its transaction log instead of running full table scans. The result is near-real-time copies of operational data in the warehouse.

ExampleAn Atlanta fintech uses Debezium CDC to stream every Postgres row change from their lending platform into Snowflake within 30 seconds, so the risk team sees defaults the moment they happen.

When to useUse CDC when you need fresh production data in the warehouse and full-table snapshots are too slow or too expensive.

Dagster

Dagster is an open-source orchestrator built around data assets (tables, files, ML models) rather than just tasks. It tracks lineage, freshness, and data quality alongside the schedule, which makes failures easier to diagnose.

ExampleA Denver climate-tech team rebuilt 80 legacy cron jobs in Dagster and cut their mean time to resolve a broken pipeline from 4 hours to 25 minutes.

When to usePick Dagster when your team thinks in tables and models, not jobs, and you want lineage and data quality built into the orchestrator.

Data Contract

A data contract is a formal agreement between the team producing data (often product engineers) and the teams consuming it, specifying schema, semantics, freshness, and ownership. It is enforced in CI, not just documented in a wiki.

ExampleA Boston SaaS company added a data contract on the `users.signup_source` field after a Rails refactor silently nulled it for three weeks and broke the CAC dashboard.

When to useIntroduce data contracts when product engineers keep shipping schema changes that break downstream analytics without warning.

Data Ingestion

Data ingestion is the step of pulling data out of source systems (SaaS apps, databases, files, APIs, event streams) and landing it in a central destination such as a warehouse or lake. It is the first stage of any pipeline.

ExampleAn Austin Shopify merchant ingests Shopify orders, Stripe payouts, Recharge subscriptions, and Klaviyo events into BigQuery every 15 minutes through Fivetran.

When to useYou think about ingestion specifically when picking connectors, deciding sync frequency, and budgeting for row volume.

Data Pipeline

A data pipeline is the end-to-end flow that moves data from sources into a warehouse and transforms it into tables the business can query. A typical pipeline has three stages: ingestion, transformation, and serving (BI tools or reverse ETL).

ExampleA Seattle DTC brand runs a daily pipeline: Fivetran lands Shopify and Stripe into Snowflake at 2 a.m., dbt builds 120 models by 5 a.m., and Hightouch syncs LTV segments to Klaviyo by 6 a.m.

When to useYou design a pipeline whenever a business question needs data from more than one source on a repeatable schedule.

dbt

dbt (data build tool) is a framework for writing SQL transformations as version-controlled, tested, documented models. It compiles your SQL, runs it inside your warehouse, and produces a DAG of dependent tables and views.

ExampleA Brooklyn marketplace runs 340 dbt models against Snowflake, with 800 tests gating each deploy. A new analyst opened their first PR on day 3 instead of waiting weeks for tribal knowledge.

When to useAdopt dbt as soon as you have more than 20 hand-written SQL views and need tests, lineage, and reviewable changes.

ELT

ELT (Extract, Load, Transform) is the modern pattern where raw data is loaded into the warehouse first, then transformed inside it using SQL. It became the default once cloud warehouses made compute cheap and elastic.

ExampleA Miami insurance startup loads raw Salesforce, Stripe, and Zendesk tables into Snowflake with Fivetran, then runs 200 dbt models on top. Total stack cost: $3,200 a month.

When to useChoose ELT for almost every new analytics project, unless you have a strict reason to transform data before it touches the warehouse.

ETL

ETL (Extract, Transform, Load) is the older pattern of pulling data from sources, transforming it on a separate server, and loading the clean output into a database. It dominated the on-premise era of Informatica, SSIS, and Talend.

ExampleA Detroit auto supplier still runs nightly Informatica ETL jobs that clean SAP data on a dedicated server before loading 40 fact tables into a SQL Server warehouse used by finance.

When to useStick with ETL when you have regulatory reasons to scrub data before it lands, or when the source warehouse is on-premise and expensive to scale.

Fivetran

Fivetran is a managed data ingestion service that maintains 500+ connectors to SaaS apps, databases, and event sources, and lands the data in your warehouse. It handles schema changes, retries, and historical backfills automatically.

ExampleA Nashville healthcare startup pays Fivetran $1,800 a month to sync HubSpot, Stripe, Intercom, and Postgres into Snowflake. Zero engineering hours per month spent on connector maintenance.

When to usePick Fivetran when engineering time is more expensive than connector fees and you want zero-maintenance ingestion.

Idempotency

Idempotency is the property of an operation that produces the same result whether you run it once or a hundred times. In pipelines, it means you can safely re-run a failed job without creating duplicates or corrupting downstream tables.

ExampleA Portland ride-share rebuilt their nightly trip_facts model to use MERGE on trip_id. After a 3 a.m. outage, they re-ran the same job 4 times until it succeeded, with zero duplicate rows.

When to useDesign every pipeline step to be idempotent before you ship it to production, otherwise every failure becomes a manual cleanup.

Orchestration

Orchestration is the discipline of scheduling and coordinating the many tasks that make up a data pipeline: when ingestion runs, when transformations fire, what depends on what, and what to do on failure. Tools like Airflow, Dagster, and Prefect own this layer.

ExampleA Phoenix logistics company orchestrates 220 daily tasks in Dagster: 14 ingestion jobs trigger dbt runs, dbt triggers ML model scoring, and scoring triggers Hightouch syncs to Slack alerts for dispatchers.

When to useYou need a real orchestrator once cron + bash scripts stop being safe, usually around 20 to 30 interdependent jobs.

Reverse ETL

Reverse ETL is the pattern of pushing modeled data out of the warehouse back into operational tools (Salesforce, HubSpot, Klaviyo, Intercom, ad platforms), so marketing and ops teams act on the same numbers analysts use. Tools include Hightouch and Census.

ExampleA San Diego SaaS company syncs warehouse-derived health scores into Salesforce every 6 hours, so CSMs see churn risk in the same screen where they work, with no copy-pasting from a dashboard.

When to useAdd reverse ETL when your best segments and scores live in the warehouse but the people who need them work in CRM or ad platforms.

05 · Category

Analytics & BI

How people actually consume data. Twelve terms covering the BI tools, query patterns, and semantic-layer ideas that turn warehouse tables into decisions.

Ad-Hoc Query

An ad-hoc query is a one-off question asked of a database, written on the fly to answer a specific business request. It is not part of a scheduled report or saved dashboard, just a quick SQL pull to settle a question.

ExampleA Brooklyn coffee chain CFO asks how many oat-milk lattes sold in Chicago stores between February 1 and February 15, so an analyst writes a 12-line query against Snowflake and returns the answer in 20 minutes.

When to useReach for an ad-hoc query when the answer is needed once, fast, and does not justify building a dashboard.

Drill-Down

Drill-down is the act of moving from a summary number into the underlying detail. You click a high-level metric and the tool shows the rows, segments, or transactions that add up to it.

ExampleA Shopify merchant in Austin sees Q3 revenue of $1.4M, clicks the bar, and drills into 8,200 orders to find that 62% came from three product SKUs.

When to useUse drill-down when a top-line number looks off and you need to find which segment or record is driving the change.

Looker

Looker is a BI platform owned by Google, built around LookML, a code-based modeling layer that defines metrics once and reuses them across dashboards. It targets companies that want governed, version-controlled analytics on top of a cloud warehouse.

ExampleA Series B SaaS company in San Francisco runs Looker on BigQuery, with 40 LookML models defining ARR, churn, and CAC for 120 internal users across sales, finance, and product.

When to usePick Looker when you need one shared definition of every metric across a fast-growing team and engineers are comfortable with code.

Looker Studio

Looker Studio (formerly Data Studio) is Google’s free dashboarding tool, used mostly for marketing and web analytics reports. It connects to Google Analytics, Google Ads, Sheets, and BigQuery, with drag-and-drop charts.

ExampleA digital agency in Atlanta builds a 6-page Looker Studio report for a dental client, pulling Google Ads spend, GA4 conversions, and call tracking into a single weekly view at zero license cost.

When to useUse Looker Studio when the audience is marketing, the data lives in Google, and the budget for BI tooling is near zero.

Metabase

Metabase is an open-source BI tool that lets non-technical users ask questions of a database through a point-and-click interface, while still allowing SQL for power users. It runs self-hosted or as a cloud service.

ExampleA 30-person e-commerce startup in Denver self-hosts Metabase on a $40/month server, gives 25 employees access, and replaces a $1,800/month Tableau license without losing core reporting.

When to useReach for Metabase when budget is tight, the team is small, and most users only need basic charts on top of a Postgres or MySQL database.

Power BI

Power BI is Microsoft’s BI suite, deeply tied to Excel, Azure, and the Microsoft 365 stack. It uses DAX as its formula language and is licensed per user, usually around $14 per seat per month for the Pro tier.

ExampleA Chicago manufacturing firm with 600 Excel users rolls out Power BI to 80 plant managers, building 25 reports on factory throughput and rework rates against an Azure SQL warehouse.

When to usePick Power BI when the company already runs on Microsoft 365 and most stakeholders live in Excel.

Self-Service BI

Self-service BI is a setup where business users build their own reports and dashboards without filing a ticket with the data team. The data team provides clean tables, definitions, and a tool, then steps back from day-to-day chart-making.

ExampleAt a 200-person fintech in New York, the data team publishes 18 certified tables in dbt, and marketing, finance, and ops build 90% of their own dashboards in Looker, freeing analysts for deeper work.

When to useMove to self-service BI when the analyst queue is full of small chart requests and the underlying data is already modeled and trusted.

Semantic Layer

A semantic layer is a translation between raw database tables and business terms. It defines what “active customer” or “net revenue” means in one place, so every dashboard, query, and AI tool returns the same number.

ExampleA Boston B2B SaaS company defines MRR, churn, and net revenue retention in a Cube.js semantic layer; sales, finance, and the CEO board pack all pull the same $4.2M MRR figure.

When to useBuild a semantic layer when finance, sales, and product keep arguing about whose number is right.

SQL

SQL (Structured Query Language) is the standard language for asking questions of relational databases and warehouses. It is used to filter, join, group, and transform data in Postgres, MySQL, Snowflake, BigQuery, Redshift, and almost every other analytics store.

ExampleA data analyst at a Miami insurance company writes a 40-line SQL query against Redshift to pull claim counts by ZIP code for 2025, joining policy, claim, and customer tables.

When to useLearn SQL first if you work with data in any capacity, it remains the most portable skill in analytics.

Apache Superset

Apache Superset is an open-source BI platform originally built at Airbnb. It offers SQL Lab for query writing, a chart builder, and dashboards, with no per-user license cost when self-hosted.

ExampleA logistics startup in Dallas runs Superset on a single AWS EC2 instance, serving 50 internal users with 200 charts on top of a Postgres replica, replacing a $2,500/month commercial BI tool.

When to useChoose Superset when the team has DevOps capacity to host it and wants commercial-grade features without per-seat fees.

Tableau

Tableau is a BI platform owned by Salesforce, known for fast visual exploration and drag-and-drop chart building. It is widely used in enterprise analytics, with Creator licenses around $75 per user per month.

ExampleA 5,000-employee insurer in Hartford has 400 Tableau Viewer seats and 60 Creators, with the analytics team publishing 1,200 dashboards on claims, underwriting, and agent performance.

When to usePick Tableau when business users need to explore data visually and the company can absorb the per-seat cost.

Data Visualization

Data visualization is the practice of turning numbers into charts, maps, or other visual forms so people can spot patterns faster than reading a table. Good visualization matches the chart type to the question being asked.

ExampleA growth analyst at a Seattle DTC brand builds a cohort retention heatmap showing that customers acquired through Instagram in November retain at 38% by month 3, versus 22% from Google.

When to useVisualize data when the audience needs to see trends, comparisons, or outliers, not when they need exact numbers in a row.

06 · Category

ML/AI & MLOps

The model side of the stack. Twelve terms covering classical ML, deep learning, large language models, and the operational discipline (MLOps) that keeps them in production.

Deep Learning

Deep learning is a branch of machine learning that uses neural networks with many layers to find patterns in images, audio, and text. The model learns features on its own instead of relying on hand-coded rules.

ExampleA Brooklyn dermatology startup trained a deep learning model on 80,000 skin photos to flag suspicious moles before a teledermatology visit, lifting recall from 71% to 88%.

When to useReach for deep learning when the input is unstructured (images, audio, free text) and you have at least tens of thousands of labeled examples.

Embedding

An embedding is a list of numbers that represents a piece of text, an image, or a product so that similar items land close together in vector space. It turns messy content into something math can compare.

ExampleA Shopify merchant in Austin embedded 12,000 product titles with OpenAI text-embedding-3 and powered a ‘find similar’ widget that lifted add-to-cart by 9%.

When to useUse embeddings when you need semantic search, recommendations, clustering, or any task where two items should match by meaning rather than exact keywords.

Feature Engineering

Feature engineering is the work of turning raw columns into signals a model can actually learn from: ratios, time-since-last-event, rolling averages, encoded categories. It is where domain knowledge enters the pipeline.

ExampleFor a Chicago coffee chain churn model, the analyst built ‘days since last visit’, ‘avg ticket last 30 days’, and ‘mobile orders share’, which raised AUC from 0.68 to 0.81.

When to useInvest in feature engineering whenever a baseline model underperforms or when the business needs to understand which signals drive the prediction.

Feature Store

A feature store is a central system that defines, computes, stores, and serves model features for both training and live prediction. It guarantees the offline batch values match what the production API sees.

ExampleA fintech in Atlanta moved 140 fraud features into Tecton, cutting training-serving skew incidents from 6 per month to zero and dropping new-model launch time from 4 weeks to 5 days.

When to useStand up a feature store once you have two or more ML models in production sharing features across teams.

Large Language Model (LLM)

A large language model is a neural network trained on trillions of tokens of text to predict the next token, which lets it write, summarize, classify, and reason. GPT-4, Claude, and Llama are common examples.

ExampleA New York law firm uses Claude to draft first-pass NDAs from a client questionnaire, turning a 45-minute paralegal task into a 4-minute review.

When to usePick an LLM when the task is open-ended language work (drafting, extraction, classification of fuzzy categories) and exact rules are hard to write.

Machine Learning

Machine learning is the practice of training a program to make predictions from historical data instead of writing explicit if-then rules. The model finds patterns during training and applies them to new records.

ExampleA Denver e-commerce brand trained an XGBoost model on 18 months of orders to predict 90-day repeat purchase, then sent a $10 coupon only to the top 20% of likely buyers, saving $140K a year.

When to useUse machine learning when you have labeled history, the problem repeats often, and hand-written rules would be too brittle or too long to maintain.

MLOps

MLOps is the set of practices that takes machine learning models from a notebook into reliable production: versioning data and code, automated retraining, monitoring drift, rollback. Think DevOps with the extra problem that the model can quietly get worse.

ExampleA Seattle logistics company rebuilt its ETA model pipeline with MLflow, GitHub Actions, and Evidently, cutting time-to-redeploy from 11 days to 6 hours after a data drift alert.

When to useAdopt MLOps the moment a model starts driving real revenue or customer decisions, not before the first prototype.

Model Registry

A model registry is the system of record for trained models: each version, the data it was trained on, its metrics, and which environment (staging, production) currently serves it. It answers ‘what is actually running right now’.

ExampleA Miami insurance carrier uses MLflow Registry to track 47 pricing models, with promotion to production gated by a reviewer sign-off and a backtest AUC above 0.78.

When to useStand up a registry as soon as more than one person trains models or you need an audit trail for regulators.

Neural Network

A neural network is a model built from layers of simple units called neurons that pass weighted signals forward. Training adjusts those weights so the output matches the labels in the data.

ExampleA Boston ad-tech firm uses a 3-layer neural network to predict click-through on 200 million daily impressions, scoring each request in under 8 milliseconds.

When to useChoose a neural network when the relationship between inputs and outputs is non-linear and tabular methods like gradient boosting hit a ceiling.

Retrieval-Augmented Generation (RAG)

RAG is a pattern where an LLM first retrieves relevant documents from your own data, then writes an answer grounded in those documents. It reduces hallucination and lets the model cite sources.

ExampleA Philadelphia SaaS company built a RAG chatbot over 4,200 help-center articles in Pinecone, cutting Tier-1 support tickets by 31% in the first quarter.

When to useUse RAG when users ask questions about your private content (docs, contracts, product catalog) and you need answers with citations.

Training Data

Training data is the labeled historical examples a model learns from. The quality, balance, and recency of this dataset usually matter more than the choice of algorithm.

ExampleA San Diego solar installer labeled 9,500 past leads as ‘closed’ or ‘lost’ across 22 features, which let a logistic regression beat the sales team’s gut feel by 18 points of precision.

When to useAudit your training data first whenever a model underperforms, before reaching for a fancier algorithm.

Vector Database

A vector database stores embeddings and finds the closest matches to a query vector in milliseconds, even across hundreds of millions of items. Pinecone, Weaviate, and pgvector are common choices.

ExampleA Dallas job board indexed 6 million resumes in Pinecone with 1536-dim embeddings, returning the top 50 matches for any job description in 90ms.

When to useSpin up a vector database once your semantic search or RAG prototype outgrows an in-memory FAISS index, usually past a few hundred thousand items.

07 · Category

Governance & Quality

The accountability layer. Twelve terms covering ownership, lineage, catalogs, and the controls that keep auditors, regulators, and analysts on the same page.

Data Catalog

A data catalog is a searchable inventory of every table, dashboard, and metric in your stack, with owners, descriptions, and usage stats attached. Think of it as a Yelp for your data assets, so analysts stop asking Slack where revenue lives.

ExampleA Shopify merchant in Austin rolls out Atlan, tags 340 dbt models with owners, and cuts new-analyst onboarding from 6 weeks to 9 days.

When to useUse it once you have more than 50 tables and analysts repeatedly ask the same discovery questions in Slack.

Data Governance

Data governance is the set of policies, roles, and controls that decide who can touch what data, how it gets defined, and who answers when it breaks. It is the operating model behind trustworthy numbers, not a software product.

ExampleA Chicago bank assigns 14 domain owners, locks PII columns behind row-level access, and runs a monthly council to approve metric changes before they hit the exec dashboard.

When to useStand it up when finance, marketing, and ops are quoting three different revenue numbers in the same meeting.

Data Lineage

Data lineage is the map showing where a number came from: which source tables fed which transformations, which dashboards consumed the output. It lets you trace a broken KPI back to its origin in minutes instead of half a day of detective work.

ExampleWhen a Brooklyn DTC brand sees subscription revenue drop 18% overnight, dbt lineage shows a Stripe webhook field rename two layers upstream as the culprit.

When to useReach for lineage tools when one upstream change keeps breaking dashboards no one mapped to that pipeline.

Data Mesh

Data mesh is an org model where each business domain (marketing, finance, supply chain) owns its own data products and ships them with documentation, SLAs, and quality tests. Central data team builds the platform, domains build the data.

ExampleA 900-person SaaS company in Atlanta splits its central data team into 5 embedded squads, each owning a domain mart with on-call rotations and 99.5% freshness SLAs.

When to useConsider mesh when your central team is a 30-ticket bottleneck and domains have enough analyst horsepower to ship their own data.

Data Observability

Data observability is monitoring for pipelines: freshness, volume, schema, distribution, and lineage signals that alert you when a table looks off before a stakeholder spots it. Same idea as Datadog, applied to warehouse tables and dbt models.

ExampleA New York fintech runs Monte Carlo on 1,200 Snowflake tables and catches a 40% row-count drop in payments_fact two hours before the CFO opens her Monday dashboard.

When to useInstall it once executives start finding data issues before your team does.

Data Quality

Data quality is the degree to which your data is accurate, complete, fresh, consistent, and unique enough for the decisions riding on it. Quality is always relative to the use case: good enough for marketing attribution may be useless for revenue recognition.

ExampleAn Austin healthtech sets six dbt tests per critical model and lifts patient_id match rate from 91% to 99.6% across Salesforce and the EHR within one quarter.

When to useInvest in quality testing the moment a wrong number reaches a customer invoice, a regulator filing, or the board deck.

Data Steward

A data steward is the named human accountable for a specific dataset or business domain: definitions, access requests, quality issues, documentation. Usually a senior analyst or domain SME, not the data engineer who built the pipe.

ExampleAt a Chicago coffee chain, the FP&A manager serves as steward for the revenue domain, signs off on every metric change, and reviews access requests inside a Notion intake form.

When to useAssign stewards as soon as your governance council needs a name (not a team) to call when a definition is disputed.

Encryption

Encryption converts readable data into ciphertext using a key, so anyone without that key sees gibberish. Two flavors matter in data work: at rest (data sitting in S3 or Snowflake) and in transit (data moving over TLS between systems).

ExampleA Boston health insurer encrypts PHI columns in Snowflake with customer-managed KMS keys and rotates them every 90 days to satisfy HIPAA auditors.

When to useRequired by default for any pipeline touching PII, PHI, or payment data under HIPAA, CCPA, or PCI DSS.

Master Data Management (MDM)

MDM is the discipline of creating one trusted record per customer, product, vendor, or location across systems that all hold partial, conflicting versions. Output: a golden record other tools reference instead of fighting over.

ExampleA Dallas retailer reconciles customer records across Shopify, Klaviyo, and Zendesk into a single golden ID, shrinking duplicate accounts from 22% to 1.8% of the base.

When to useStand up MDM when the same customer shows up four different ways across your CRM, billing, and support tools.

PII (Personally Identifiable Information)

PII is any data point that can identify a specific person on its own or combined with other fields: name, email, phone, SSN, IP address, device ID. Under CCPA the definition is wide, so treat anything that pins a record to a real human as PII.

ExampleA California Shopify merchant tags 17 columns as PII in their dbt project, masks them in dev environments, and tracks every access in an audit log for CCPA requests.

When to useClassify and mask PII before granting analyst access to any production table touching customer records.

Schema Drift

Schema drift is when an upstream source quietly changes column names, types, or structure, and your pipeline either breaks or keeps running with wrong values. Most painful with SaaS APIs that push silent updates on Tuesday afternoons.

ExampleStripe renames a webhook field from amount_captured to captured_amount, and a Brooklyn agency loses 3 days of revenue reporting before lineage alerts surface the diff.

When to useAdd schema tests to any Fivetran or Airbyte connector pulling from a vendor you cannot version-control.

Data Silo

A data silo is a dataset locked inside one team or tool, with no connection to the rest of the business: marketing in HubSpot, finance in NetSuite, product in Mixpanel, none of them joined. Silos make cross-functional questions take weeks to answer.

ExampleA 200-person SaaS in Denver runs 9 SaaS tools with no warehouse, so calculating CAC by channel takes the analyst 4 days of CSV exports each month.

When to useBreak silos by piping every source into a warehouse the moment leadership starts asking cross-functional KPI questions.

08 · Category

Streaming & Real-Time

When batch is too slow. Ten terms covering event streams, message queues, and the architectures behind sub-second analytics.

Apache Flink

Apache Flink is an open-source engine for stateful stream processing with exactly-once guarantees. It treats data as an unbounded stream by default and handles event-time windowing, which makes it strong for low-latency analytics on continuous events.

ExampleA Brooklyn-based food delivery startup uses Flink to compute 30-second sliding averages of driver wait times across 4,000 active couriers, triggering surge bonuses when the metric crosses 90 seconds.

When to useReach for Flink when you need sub-second analytics on event streams with strict correctness and complex windowing logic.

Apache Kafka

Apache Kafka is a distributed log that stores ordered streams of events and lets many producers and consumers read them independently. It acts as the backbone for moving high-volume event data between services with durability and replay.

ExampleA Shopify merchant in Austin pushes every cart update, page view, and Stripe payment into Kafka topics. The analytics team and the fraud team each consume the same topics at their own pace.

When to useUse Kafka when multiple downstream systems need the same event stream and you want to decouple producers from consumers.

Apache Spark

Apache Spark is a distributed compute engine that handles batch jobs and micro-batch streaming through the same API. Structured Streaming processes data in small intervals (typically 1 to 10 seconds) on top of the same DataFrame logic used for historical work.

ExampleA Chicago coffee chain runs Spark on Databricks to score 12 million loyalty events per day for next-best-offer, sharing 80% of the code with the nightly batch model that retrains on 18 months of receipts.

When to usePick Spark when your team already runs batch ETL on it and your streaming SLA allows for a few seconds of delay.

Event Sourcing

Event sourcing is a pattern where the system stores every state change as an immutable event, and current state is derived by replaying those events. The event log becomes the source of truth, not the database row.

ExampleAn Atlanta fintech logs every account event (deposit, withdrawal, fee, dispute) as a separate record. To rebuild a customer’s balance on May 1, they replay all 4,200 of her events from account opening.

When to useApply event sourcing when audit history, time-travel debugging, or regulatory replay are core product requirements.

Kappa Architecture

Kappa architecture is a design where one streaming pipeline handles both real-time and historical processing. Reprocessing is done by replaying the event log through the same code, so there is no separate batch path.

ExampleA Denver e-commerce platform runs one Flink job over Kafka for live dashboards. When they ship a new attribution model, they rewind Kafka 90 days and let the same job recompute history.

When to useUse kappa when your event log is durable enough to replay and you want to eliminate the duplicated logic of dual pipelines.

Lambda Architecture

Lambda architecture combines a fast streaming layer for fresh approximate results with a slower batch layer that recomputes the same metrics for accuracy. A serving layer merges both before exposing the answer to users.

ExampleA New York ad-tech firm shows live impression counts from Spark Streaming, then overwrites them every 6 hours with a corrected batch number from a Snowflake job that joins billing-grade logs.

When to useReach for lambda when streaming gives speed but batch is the only path to numbers your finance team will sign off on.

Latency

Latency is the time between an event happening and the system being ready to act on it. In data pipelines it covers ingest, processing, and serving, and it is usually reported as a percentile (p50, p95, p99) rather than an average.

ExampleA San Francisco rideshare app needs driver-location updates visible in the rider app within 800 ms at p95. Anything over 2 seconds at p99 causes visible map jumps and refund tickets.

When to useTrack latency as a product metric whenever a user, a model, or a downstream system reacts to fresh data.

Message Queue

A message queue is a service that holds messages between a sender and one receiver, delivering each message once and removing it after acknowledgment. Common examples include Amazon SQS and RabbitMQ.

ExampleA Boston subscription box company drops every new order into an SQS queue. One worker pool pulls jobs at a steady 200 per minute to print shipping labels, which prevents the label vendor from being overwhelmed at peak.

When to usePick a queue when you need work distribution to consumers with backpressure, not a replayable log read by many parties.

Real-Time Data

Real-time data is information made available for use within seconds (or milliseconds) of the event that produced it. The exact threshold depends on the decision: fraud blocking needs under 200 ms, while a live ops dashboard can tolerate 5 to 30 seconds.

ExampleA Miami payment processor scores each Stripe charge for fraud in under 150 ms. Their executive dashboard, which refreshes every 10 seconds, also counts as real-time for the ops team.

When to useUse real-time data when the business decision loses value if it waits for the next hourly or nightly batch.

Stream Processing

Stream processing means computing over data as it arrives, one event or small window at a time, instead of waiting for a full batch. Typical operations include filtering, joining streams, aggregating over time windows, and writing to a sink.

ExampleA Seattle gaming studio uses stream processing to count concurrent players per region every 5 seconds across 1.2 million sessions, feeding a live capacity dashboard that triggers auto-scaling on AWS.

When to useChoose stream processing when the metric or alert must reflect events from the last few seconds, not the last batch run.

↑

About the author

Nick Valiotti is the founder of Valiotti Data. 15+ years building analytics infrastructure for SaaS, marketplaces, and consumer subscription. 50+ production deployments across BigQuery, Snowflake, dbt, Metabase, and modern BI stacks. Author of two books on data strategy. LinkedIn · Discovery call.