This data glossary covers 100 terms a working data team uses in 2026, organized into eight categories from warehouse fundamentals to streaming patterns. Each definition includes a real US business example and one line on when the term actually applies. Each definition is short, with a concrete US example and one line on when you’d actually reach for it. Curated by Nick Valiotti, Fractional CDO with 15+ years building data teams across e-commerce, mobility, and SaaS.
In This Article
01 · Category
Foundations
The vocabulary every data conversation starts with. These twelve terms come up in every kickoff, every roadmap, and every executive review.
Big Data
Datasets too large or fast-moving for a single server to handle with standard SQL. The label usually kicks in past 10TB of warehouse storage or when streaming ingestion crosses a few thousand events per second.
Business Intelligence
The practice of turning company data into reports, dashboards, and answers that operators actually use. BI sits between the warehouse and the people making pricing, hiring, and inventory decisions.
Dashboard
A single screen that shows the handful of numbers an operator needs to make decisions today. Good dashboards answer one role’s questions, not everyone’s.
Data
Recorded facts about the business: transactions, events, customer attributes, product states. Data only becomes useful once it is collected somewhere queryable and tied to a clear definition.
KPI
A Key Performance Indicator: one of the 3 to 7 numbers a team commits to moving this quarter. Every KPI has an owner, a target, and a clear definition in the metric layer.
Metadata
Data about data: column descriptions, table owners, freshness timestamps, source systems, and update cadence. Metadata is what lets a new analyst answer questions on day three instead of day thirty.
Metric
A defined calculation that produces a number, with explicit grain, filters, and time window. A metric is reproducible: two analysts running it should get the same value.
Modern Data Stack
The cloud-native pattern that replaced on-prem data warehouses around 2018: managed ingestion (Fivetran), cloud warehouse (Snowflake, BigQuery), transformation (dbt), and a BI tool on top.
Raw Data
Records loaded into the warehouse exactly as they arrived from the source: no joins, no renames, no cleanup. Raw is the audit trail you fall back on when downstream logic breaks.
Semi-Structured Data
Records with a flexible shape, usually JSON or XML, where fields can vary row to row. Warehouses like Snowflake and BigQuery store this natively in VARIANT or JSON columns.
Structured Data
Records that fit a fixed schema: defined columns, defined types, predictable shape. Rows in a Postgres orders table or a Salesforce opportunities export are structured by default.
Unstructured Data
Content without a tabular shape: PDFs, support tickets in free text, call recordings, product photos, chat transcripts. You need parsing or embeddings before this data shows up in a SQL query.
02 · Category
Data Modeling
How you shape data before it hits a dashboard. These thirteen terms cover the structures and conventions that decide whether your warehouse stays readable past quarter two.
Denormalization
Denormalization is the practice of intentionally duplicating data across tables to make reads faster. You trade storage and write complexity for query speed, which is the right call in analytical warehouses like Snowflake or BigQuery.
Dimension
A dimension is a table that describes the context around a measurable event: who, what, where, when. Dimensions hold attributes you filter, group, and slice by in reports.
Dimensional Modeling
Dimensional modeling is Ralph Kimball’s approach to organizing analytical data into fact tables (events, measures) surrounded by dimension tables (context). It optimizes warehouses for business-user queries rather than transactional integrity.
Fact Table
A fact table stores the measurable events of a business: orders, clicks, payments, subscriptions. Each row is one event with foreign keys to dimensions plus numeric measures like revenue or quantity.
Natural Key
A natural key is an identifier that comes from the source system or real world: a Stripe customer ID, a SKU, an email address. It carries meaning outside the warehouse.
Normalization
Normalization is the process of splitting data into related tables to remove duplication and enforce consistency. Third normal form (3NF) is the standard target in OLTP systems like Postgres or MySQL.
OLAP
OLAP (Online Analytical Processing) refers to systems built for fast queries over large historical datasets: aggregations, group-bys, time-series rollups. Modern OLAP runs on columnar engines like Snowflake, BigQuery, ClickHouse, and DuckDB.
OLTP
OLTP (Online Transaction Processing) describes databases built for high-volume, low-latency writes and point reads. Postgres, MySQL, and DynamoDB are typical OLTP engines powering production apps.
Schema
A schema is the structure of your data: tables, columns, types, and relationships. In warehouses like Snowflake or Redshift, schema also refers to a logical namespace that groups related objects.
Slowly Changing Dimension (SCD)
A Slowly Changing Dimension is a pattern for tracking how dimension attributes change over time. Type 1 overwrites old values, Type 2 keeps history with effective-from and effective-to dates.
Snowflake Schema
A snowflake schema is a dimensional model where dimension tables are normalized into multiple related sub-tables. It saves storage but adds joins, which slows queries and confuses business users.
Star Schema
A star schema is a dimensional model with one central fact table connected to denormalized dimension tables. The shape resembles a star, hence the name. It is the workhorse layout for BI-friendly warehouses.
Surrogate Key
A surrogate key is a warehouse-generated identifier (often a hash or sequence) that uniquely identifies a row independent of the source system. It shields your model from changes in upstream IDs.
03 · Category
Storage & Warehouses
The systems that hold your data. Fifteen terms covering warehouses, lakes, lakehouses, and the table formats and storage layers underneath them.
Amazon Redshift
Amazon Redshift is AWS’s managed cloud data warehouse built on a modified PostgreSQL engine. It stores data in columnar format across compute nodes and is optimized for running analytical SQL on terabytes of structured data.
Apache Iceberg
Apache Iceberg is an open table format that adds database-like guarantees on top of files in object storage. It supports schema evolution, hidden partitioning, time travel, and ACID writes from engines like Spark, Trino, Snowflake, and DuckDB.
ClickHouse
ClickHouse is an open-source columnar database designed for fast analytical queries on huge tables. It compresses data aggressively and can return aggregations over billions of rows in under a second on modest hardware.
Columnar Database
A columnar database stores data by column instead of by row, which makes scanning a few columns across many rows extremely fast. It also compresses better because values in one column tend to be similar.
Data Lake
A data lake is a centralized store of raw files (CSV, JSON, Parquet, images, logs) in cheap object storage like S3 or GCS. Schema is applied at read time, so you can dump data first and structure it later.
Data Lakehouse
A data lakehouse combines a data lake’s cheap file storage with warehouse-style features like ACID transactions, schema enforcement, and SQL performance. Table formats like Delta Lake, Iceberg, and Hudi sit on object storage to make this work.
Data Warehouse
A data warehouse is a database tuned for analytics rather than transactions. It stores cleaned, modeled data from many source systems and is queried by BI tools and analysts with SQL.
Delta Lake
Delta Lake is an open-source storage layer that adds ACID transactions, versioning, and schema enforcement to Parquet files on object storage. It was originally built by Databricks and is now widely used outside it.
Google BigQuery
BigQuery is Google Cloud’s serverless data warehouse. You pay for storage and per-query bytes scanned (or reserved slots), and you do not manage clusters, nodes, or vacuum operations.
MySQL
MySQL is an open-source relational database used heavily for web and SaaS application backends. It is row-oriented, transactional, and tuned for many small reads and writes rather than analytical scans.
NoSQL Database
NoSQL covers databases that drop the strict relational model in favor of document, key-value, graph, or wide-column structures. They trade some join power and consistency for flexible schemas and easier horizontal scaling.
Object Storage
Object storage holds files as objects in flat namespaces (buckets) accessed over HTTP, with services like Amazon S3, Google Cloud Storage, and Azure Blob. It is durable, cheap per gigabyte, and the foundation for most modern data platforms.
Partitioning
Partitioning splits a large table into smaller chunks by a column like date or region. Queries that filter on that column scan only the relevant partitions, which cuts cost on usage-priced warehouses and speeds up scans everywhere else.
PostgreSQL
PostgreSQL is an open-source relational database known for SQL compliance, strong consistency, and a rich extension system (PostGIS, pgvector, TimescaleDB). It powers application backends and also serves as a respectable small-scale analytics database.
Snowflake
Snowflake is a cloud data warehouse that separates compute from storage, so multiple teams can run independent virtual warehouses against the same data. It runs on AWS, Azure, and GCP, and bills per second of compute.
04 · Category
Pipelines & Integration
The moving parts that get data from source systems into a warehouse and back out again. Fourteen terms covering ingestion, transformation, and orchestration.
Airflow
Apache Airflow is an open-source orchestrator that lets you define data workflows as Python code (DAGs) and run them on a schedule. It tracks task dependencies, retries failures, and shows job status in a web UI.
Airbyte
Airbyte is an open-source data ingestion platform with 350+ prebuilt connectors that move data from SaaS apps and databases into your warehouse. It runs as a managed cloud service or self-hosted on your own infrastructure.
Change Data Capture (CDC)
Change Data Capture is a method of replicating only the rows that changed in a source database, by reading its transaction log instead of running full table scans. The result is near-real-time copies of operational data in the warehouse.
Dagster
Dagster is an open-source orchestrator built around data assets (tables, files, ML models) rather than just tasks. It tracks lineage, freshness, and data quality alongside the schedule, which makes failures easier to diagnose.
Data Contract
A data contract is a formal agreement between the team producing data (often product engineers) and the teams consuming it, specifying schema, semantics, freshness, and ownership. It is enforced in CI, not just documented in a wiki.
Data Ingestion
Data ingestion is the step of pulling data out of source systems (SaaS apps, databases, files, APIs, event streams) and landing it in a central destination such as a warehouse or lake. It is the first stage of any pipeline.
Data Pipeline
A data pipeline is the end-to-end flow that moves data from sources into a warehouse and transforms it into tables the business can query. A typical pipeline has three stages: ingestion, transformation, and serving (BI tools or reverse ETL).
dbt
dbt (data build tool) is a framework for writing SQL transformations as version-controlled, tested, documented models. It compiles your SQL, runs it inside your warehouse, and produces a DAG of dependent tables and views.
ELT
ELT (Extract, Load, Transform) is the modern pattern where raw data is loaded into the warehouse first, then transformed inside it using SQL. It became the default once cloud warehouses made compute cheap and elastic.
ETL
ETL (Extract, Transform, Load) is the older pattern of pulling data from sources, transforming it on a separate server, and loading the clean output into a database. It dominated the on-premise era of Informatica, SSIS, and Talend.
Fivetran
Fivetran is a managed data ingestion service that maintains 500+ connectors to SaaS apps, databases, and event sources, and lands the data in your warehouse. It handles schema changes, retries, and historical backfills automatically.
Idempotency
Idempotency is the property of an operation that produces the same result whether you run it once or a hundred times. In pipelines, it means you can safely re-run a failed job without creating duplicates or corrupting downstream tables.
Orchestration
Orchestration is the discipline of scheduling and coordinating the many tasks that make up a data pipeline: when ingestion runs, when transformations fire, what depends on what, and what to do on failure. Tools like Airflow, Dagster, and Prefect own this layer.
Reverse ETL
Reverse ETL is the pattern of pushing modeled data out of the warehouse back into operational tools (Salesforce, HubSpot, Klaviyo, Intercom, ad platforms), so marketing and ops teams act on the same numbers analysts use. Tools include Hightouch and Census.
05 · Category
Analytics & BI
How people actually consume data. Twelve terms covering the BI tools, query patterns, and semantic-layer ideas that turn warehouse tables into decisions.
Ad-Hoc Query
An ad-hoc query is a one-off question asked of a database, written on the fly to answer a specific business request. It is not part of a scheduled report or saved dashboard, just a quick SQL pull to settle a question.
Drill-Down
Drill-down is the act of moving from a summary number into the underlying detail. You click a high-level metric and the tool shows the rows, segments, or transactions that add up to it.
Looker
Looker is a BI platform owned by Google, built around LookML, a code-based modeling layer that defines metrics once and reuses them across dashboards. It targets companies that want governed, version-controlled analytics on top of a cloud warehouse.
Looker Studio
Looker Studio (formerly Data Studio) is Google’s free dashboarding tool, used mostly for marketing and web analytics reports. It connects to Google Analytics, Google Ads, Sheets, and BigQuery, with drag-and-drop charts.
Metabase
Metabase is an open-source BI tool that lets non-technical users ask questions of a database through a point-and-click interface, while still allowing SQL for power users. It runs self-hosted or as a cloud service.
Power BI
Power BI is Microsoft’s BI suite, deeply tied to Excel, Azure, and the Microsoft 365 stack. It uses DAX as its formula language and is licensed per user, usually around $14 per seat per month for the Pro tier.
Self-Service BI
Self-service BI is a setup where business users build their own reports and dashboards without filing a ticket with the data team. The data team provides clean tables, definitions, and a tool, then steps back from day-to-day chart-making.
Semantic Layer
A semantic layer is a translation between raw database tables and business terms. It defines what “active customer” or “net revenue” means in one place, so every dashboard, query, and AI tool returns the same number.
SQL
SQL (Structured Query Language) is the standard language for asking questions of relational databases and warehouses. It is used to filter, join, group, and transform data in Postgres, MySQL, Snowflake, BigQuery, Redshift, and almost every other analytics store.
Apache Superset
Apache Superset is an open-source BI platform originally built at Airbnb. It offers SQL Lab for query writing, a chart builder, and dashboards, with no per-user license cost when self-hosted.
Tableau
Tableau is a BI platform owned by Salesforce, known for fast visual exploration and drag-and-drop chart building. It is widely used in enterprise analytics, with Creator licenses around $75 per user per month.
Data Visualization
Data visualization is the practice of turning numbers into charts, maps, or other visual forms so people can spot patterns faster than reading a table. Good visualization matches the chart type to the question being asked.
06 · Category
ML/AI & MLOps
The model side of the stack. Twelve terms covering classical ML, deep learning, large language models, and the operational discipline (MLOps) that keeps them in production.
Deep Learning
Deep learning is a branch of machine learning that uses neural networks with many layers to find patterns in images, audio, and text. The model learns features on its own instead of relying on hand-coded rules.
Embedding
An embedding is a list of numbers that represents a piece of text, an image, or a product so that similar items land close together in vector space. It turns messy content into something math can compare.
Feature Engineering
Feature engineering is the work of turning raw columns into signals a model can actually learn from: ratios, time-since-last-event, rolling averages, encoded categories. It is where domain knowledge enters the pipeline.
Feature Store
A feature store is a central system that defines, computes, stores, and serves model features for both training and live prediction. It guarantees the offline batch values match what the production API sees.
Large Language Model (LLM)
A large language model is a neural network trained on trillions of tokens of text to predict the next token, which lets it write, summarize, classify, and reason. GPT-4, Claude, and Llama are common examples.
Machine Learning
Machine learning is the practice of training a program to make predictions from historical data instead of writing explicit if-then rules. The model finds patterns during training and applies them to new records.
MLOps
MLOps is the set of practices that takes machine learning models from a notebook into reliable production: versioning data and code, automated retraining, monitoring drift, rollback. Think DevOps with the extra problem that the model can quietly get worse.
Model Registry
A model registry is the system of record for trained models: each version, the data it was trained on, its metrics, and which environment (staging, production) currently serves it. It answers ‘what is actually running right now’.
Neural Network
A neural network is a model built from layers of simple units called neurons that pass weighted signals forward. Training adjusts those weights so the output matches the labels in the data.
Retrieval-Augmented Generation (RAG)
RAG is a pattern where an LLM first retrieves relevant documents from your own data, then writes an answer grounded in those documents. It reduces hallucination and lets the model cite sources.
Training Data
Training data is the labeled historical examples a model learns from. The quality, balance, and recency of this dataset usually matter more than the choice of algorithm.
Vector Database
A vector database stores embeddings and finds the closest matches to a query vector in milliseconds, even across hundreds of millions of items. Pinecone, Weaviate, and pgvector are common choices.
07 · Category
Governance & Quality
The accountability layer. Twelve terms covering ownership, lineage, catalogs, and the controls that keep auditors, regulators, and analysts on the same page.
Data Catalog
A data catalog is a searchable inventory of every table, dashboard, and metric in your stack, with owners, descriptions, and usage stats attached. Think of it as a Yelp for your data assets, so analysts stop asking Slack where revenue lives.
Data Governance
Data governance is the set of policies, roles, and controls that decide who can touch what data, how it gets defined, and who answers when it breaks. It is the operating model behind trustworthy numbers, not a software product.
Data Lineage
Data lineage is the map showing where a number came from: which source tables fed which transformations, which dashboards consumed the output. It lets you trace a broken KPI back to its origin in minutes instead of half a day of detective work.
Data Mesh
Data mesh is an org model where each business domain (marketing, finance, supply chain) owns its own data products and ships them with documentation, SLAs, and quality tests. Central data team builds the platform, domains build the data.
Data Observability
Data observability is monitoring for pipelines: freshness, volume, schema, distribution, and lineage signals that alert you when a table looks off before a stakeholder spots it. Same idea as Datadog, applied to warehouse tables and dbt models.
Data Quality
Data quality is the degree to which your data is accurate, complete, fresh, consistent, and unique enough for the decisions riding on it. Quality is always relative to the use case: good enough for marketing attribution may be useless for revenue recognition.
Data Steward
A data steward is the named human accountable for a specific dataset or business domain: definitions, access requests, quality issues, documentation. Usually a senior analyst or domain SME, not the data engineer who built the pipe.
Encryption
Encryption converts readable data into ciphertext using a key, so anyone without that key sees gibberish. Two flavors matter in data work: at rest (data sitting in S3 or Snowflake) and in transit (data moving over TLS between systems).
Master Data Management (MDM)
MDM is the discipline of creating one trusted record per customer, product, vendor, or location across systems that all hold partial, conflicting versions. Output: a golden record other tools reference instead of fighting over.
PII (Personally Identifiable Information)
PII is any data point that can identify a specific person on its own or combined with other fields: name, email, phone, SSN, IP address, device ID. Under CCPA the definition is wide, so treat anything that pins a record to a real human as PII.
Schema Drift
Schema drift is when an upstream source quietly changes column names, types, or structure, and your pipeline either breaks or keeps running with wrong values. Most painful with SaaS APIs that push silent updates on Tuesday afternoons.
Data Silo
A data silo is a dataset locked inside one team or tool, with no connection to the rest of the business: marketing in HubSpot, finance in NetSuite, product in Mixpanel, none of them joined. Silos make cross-functional questions take weeks to answer.
08 · Category
Streaming & Real-Time
When batch is too slow. Ten terms covering event streams, message queues, and the architectures behind sub-second analytics.
Apache Flink
Apache Flink is an open-source engine for stateful stream processing with exactly-once guarantees. It treats data as an unbounded stream by default and handles event-time windowing, which makes it strong for low-latency analytics on continuous events.
Apache Kafka
Apache Kafka is a distributed log that stores ordered streams of events and lets many producers and consumers read them independently. It acts as the backbone for moving high-volume event data between services with durability and replay.
Apache Spark
Apache Spark is a distributed compute engine that handles batch jobs and micro-batch streaming through the same API. Structured Streaming processes data in small intervals (typically 1 to 10 seconds) on top of the same DataFrame logic used for historical work.
Event Sourcing
Event sourcing is a pattern where the system stores every state change as an immutable event, and current state is derived by replaying those events. The event log becomes the source of truth, not the database row.
Kappa Architecture
Kappa architecture is a design where one streaming pipeline handles both real-time and historical processing. Reprocessing is done by replaying the event log through the same code, so there is no separate batch path.
Lambda Architecture
Lambda architecture combines a fast streaming layer for fresh approximate results with a slower batch layer that recomputes the same metrics for accuracy. A serving layer merges both before exposing the answer to users.
Latency
Latency is the time between an event happening and the system being ready to act on it. In data pipelines it covers ingest, processing, and serving, and it is usually reported as a percentile (p50, p95, p99) rather than an average.
Message Queue
A message queue is a service that holds messages between a sender and one receiver, delivering each message once and removing it after acknowledgment. Common examples include Amazon SQS and RabbitMQ.
Real-Time Data
Real-time data is information made available for use within seconds (or milliseconds) of the event that produced it. The exact threshold depends on the decision: fraud blocking needs under 200 ms, while a live ops dashboard can tolerate 5 to 30 seconds.
Stream Processing
Stream processing means computing over data as it arrives, one event or small window at a time, instead of waiting for a full batch. Typical operations include filtering, joining streams, aggregating over time windows, and writing to a sink.
About the author
Nick Valiotti is the founder of Valiotti Data. 15+ years building analytics infrastructure for SaaS, marketplaces, and consumer subscription. 50+ production deployments across BigQuery, Snowflake, dbt, Metabase, and modern BI stacks. Author of two books on data strategy. LinkedIn · Discovery call.