Database normalization is the discipline of structuring tables so each fact lives in exactly one place. Skip it and you ship duplicate rows, broken UPDATE statements, and reports whose answer depends on which copy of a row the query reads. Get it right and a class of data bugs stops being possible.
In This Article
- What “normal form” actually means
- 1NF: every cell holds a single value
- 2NF: every non-key column depends on the whole key
- 3NF: no transitive dependencies
- BCNF: when 3NF leaves an anomaly
- When to denormalize on purpose
- A worked migration: denormalized log to 3NF
- Common mistakes
- When the rules are wrong
- Takeaway
This is the reference we hand a data engineer on day one: 1NF, 2NF, 3NF, BCNF, each with a worked SQL example showing the bad table, the fix, and what breaks when you skip the step. At the end: when to denormalize on purpose, and how to do it without losing the guarantees normalization gave you.
What “normal form” actually means
A normal form is a property of a table’s schema. A table is in 3NF if and only if it satisfies the 3NF rule. The forms stack: 3NF implies 2NF implies 1NF. The work of normalization is moving each table up the stack until further moves would cost more than they save.
The four forms that matter in production:
- 1NF: every column holds a single, atomic value. No comma-separated lists. No JSON arrays where a junction table belongs.
- 2NF: every non-key column depends on the whole primary key, not just part of it. Only relevant when the primary key is composite.
- 3NF: every non-key column depends on the key directly, not through another non-key column.
- BCNF: every functional dependency in the table is on a superkey. The rare cases where 3NF leaves an anomaly.
4NF and 5NF exist and they have legitimate use cases. The frequency with which an analyst encounters them is low enough that we cover them last and briefly.
1NF: every cell holds a single value
The 1NF rule: every column contains an atomic value, and every row is uniquely addressable by its primary key.
The most common 1NF violation in practice is a column that holds a list, separated by commas or stored as JSON, where the rest of the application wants to query individual elements.
CREATE TABLE customer_contacts (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
emails VARCHAR(500) -- 'a@x.com,b@y.com,c@z.com'
);
What breaks: any query that wants “all customers with a .gov email” has to use LIKE '%.gov%', which silently misclassifies not-actually-gov@x.com. Sending a campaign to one of a customer’s three emails requires string-splitting in SQL, which behaves differently on every dialect. Adding a fourth email means an UPDATE ... SET emails = CONCAT(emails, ',d@w.com'), which races with concurrent updates and loses email addresses.
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE customer_emails (
customer_id INT,
email VARCHAR(254),
is_primary BOOLEAN DEFAULT FALSE,
PRIMARY KEY (customer_id, email),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
The migration query, assuming Postgres:
INSERT INTO customers (customer_id, name)
SELECT customer_id, name FROM customer_contacts;
INSERT INTO customer_emails (customer_id, email)
SELECT customer_id, trim(unnest(string_to_array(emails, ',')))
FROM customer_contacts;
The same pattern in BigQuery uses UNNEST(SPLIT(emails, ',')); in Snowflake, LATERAL FLATTEN(input => SPLIT(emails, ',')). In every dialect, the destination schema is the same.
A note on JSON columns. Modern Postgres, BigQuery, and Snowflake all support a JSON type, and there is a legitimate use for storing schema-on-read payloads (webhook bodies, raw event blobs) in JSON. Storing a list of emails in JSON is not that use case. If the application wants to query individual emails, the list is structured data and belongs in its own table, not a JSON column.
2NF: every non-key column depends on the whole key
2NF only applies when the primary key is composite, that is, made of more than one column. The rule: every non-key column must depend on the entire composite key, not on a subset of it.
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
product_name VARCHAR(200), -- depends on product_id only
product_price NUMERIC(10, 2), -- depends on product_id only
PRIMARY KEY (order_id, product_id)
);
The primary key is (order_id, product_id). quantity depends on both: the same product can have different quantities in different orders. But product_name and product_price depend on product_id alone. The product name and price will be duplicated across every order line for that product.
What breaks: a price change has to update every historical order_items row, or the rows disagree. A typo in product_name on one row is invisible until a join surfaces the inconsistency. DELETE-ing the last order line for a product also deletes the only place the product’s name was stored.
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(200) NOT NULL,
product_price NUMERIC(10, 2) NOT NULL
);
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT NOT NULL,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (product_id) REFERENCES products(product_id)
);
A common objection: “but I need the price at the time of the order, not the current price.” That objection is correct and the answer is not to denormalize. Add price_at_order to order_items as an event-time snapshot, and keep the current price on products. Those are two different facts and they belong in two different columns.
3NF: no transitive dependencies
The 3NF rule: every non-key column depends on the primary key directly, not through another non-key column.
CREATE TABLE customer_addresses (
address_id INT PRIMARY KEY,
customer_id INT,
street VARCHAR(200),
zip VARCHAR(10),
city VARCHAR(100), -- depends on zip, not address_id
state VARCHAR(2) -- depends on zip, not address_id
);
The transitive chain is address_id → zip → (city, state). city and state are determined by zip, not by address_id. The same ZIP code will appear thousands of times across addresses, and so will the city and state for that ZIP.
What breaks: a ZIP code reassignment (the US Postal Service does this) requires updating every address with that ZIP. Forget one and the table self-contradicts. A typo on city for one address means the same ZIP now maps to two cities, which silently breaks any report that groups by city.
CREATE TABLE zip_codes (
zip VARCHAR(10) PRIMARY KEY,
city VARCHAR(100) NOT NULL,
state VARCHAR(2) NOT NULL
);
CREATE TABLE addresses (
address_id INT PRIMARY KEY,
customer_id INT,
street VARCHAR(200),
zip VARCHAR(10),
FOREIGN KEY (zip) REFERENCES zip_codes(zip)
);
A query that needs the full address joins to zip_codes. The cost is one extra join. The benefit is that city/state cannot disagree across rows for the same ZIP, because there is only one row per ZIP.
3NF is the form most production OLTP schemas target. It eliminates almost all update anomalies and leaves the schema readable. Going past 3NF (to BCNF) is sometimes necessary, but most engagements never need to.
BCNF: when 3NF leaves an anomaly
BCNF is the strengthening of 3NF that handles a specific edge case: a table can satisfy 3NF and still have a functional dependency that causes update anomalies.
A canonical example: a table tracking which instructor teaches which course on which day. Suppose business rules say each instructor teaches exactly one course on each weekday, and each course has exactly one instructor.
CREATE TABLE schedule (
day VARCHAR(10),
instructor VARCHAR(100),
course VARCHAR(50),
PRIMARY KEY (day, instructor)
);
This is in 3NF (no non-key column depends transitively on the key). But course functionally depends on instructor alone, and instructor is not a superkey. If the same instructor teaches the same course on Monday and Wednesday, and the course code changes, the update has to hit two rows. Miss one and the schema lies.
CREATE TABLE instructor_courses (
instructor VARCHAR(100) PRIMARY KEY,
course VARCHAR(50) NOT NULL
);
CREATE TABLE schedule (
day VARCHAR(10),
instructor VARCHAR(100),
PRIMARY KEY (day, instructor),
FOREIGN KEY (instructor) REFERENCES instructor_courses(instructor)
);
Most practitioners encounter BCNF rarely because composite keys with overlapping functional dependencies are uncommon outside of academic examples and certain scheduling, assignment, and reservation domains.
When to denormalize on purpose
Normalization optimizes for write correctness: each fact in one place, no update anomalies. Analytics optimizes for read speed and query simplicity: fewer joins, fewer hops, predictable execution plans. These are in tension.
Three patterns where intentional denormalization is the right call:
- Star schemas in the warehouse. Dimensional models flatten 3NF source tables into fact + dimension tables.
dim_customerincludes city, state, ZIP as columns even though they violate 3NF. The motivation is query latency: BI tools generate queries assuming dimension columns are one join away, not three. The cost (denormalized dim tables) is paid in the ETL pipeline, where it is automatable, rather than at every report run. - Reporting columns on event tables. An
order_eventstable that captures the customer’s segment, channel, and country at the moment of the order. Denormalizing here is correct because it captures historical state, which is exactly what reporting needs and what a normalized schema loses. - Materialized views over normalized base tables. The base tables remain in 3NF (the source of truth). The materialized view caches a denormalized read interface, refreshed on a schedule or on commit. Postgres, Snowflake, BigQuery, and ClickHouse all support some flavor of this. We cover the ClickHouse pattern in detail in our materialized views in ClickHouse writeup.
The deciding question is not “is normalization good or bad?” but “where does this table sit in the write/read tradeoff?” OLTP source tables: normalize aggressively. Warehouse dimensions: denormalize on purpose, with the ETL pipeline as the single source of truth for the duplication.
A worked migration: denormalized log to 3NF
A real example. The starting table is a denormalized log of orders, the kind that lands when a junior engineer captures every relevant field on the order row:
CREATE TABLE order_log (
order_id INT PRIMARY KEY,
customer_email VARCHAR(254),
customer_name VARCHAR(100),
customer_zip VARCHAR(10),
customer_city VARCHAR(100),
customer_state VARCHAR(2),
product_codes VARCHAR(500), -- 'SKU-1,SKU-2,SKU-3'
product_names VARCHAR(2000), -- 'Widget,Gadget,Sprocket'
order_total NUMERIC(10, 2),
ordered_at TIMESTAMP
);
The 1NF violation is the product_codes / product_names pair. The 2NF violation surfaces once we split products onto their own row: product_name depends on product_code, not on (order_id, product_code). The 3NF violation is the (zip, city, state) chain.
Target schema after 1NF, 2NF, and 3NF:
CREATE TABLE zip_codes (
zip VARCHAR(10) PRIMARY KEY,
city VARCHAR(100) NOT NULL,
state VARCHAR(2) NOT NULL
);
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
email VARCHAR(254) UNIQUE NOT NULL,
name VARCHAR(100),
zip VARCHAR(10),
FOREIGN KEY (zip) REFERENCES zip_codes(zip)
);
CREATE TABLE products (
product_code VARCHAR(50) PRIMARY KEY,
product_name VARCHAR(200) NOT NULL
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
order_total NUMERIC(10, 2),
ordered_at TIMESTAMP NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
CREATE TABLE order_items (
order_id INT,
product_code VARCHAR(50),
PRIMARY KEY (order_id, product_code),
FOREIGN KEY (order_id) REFERENCES orders(order_id),
FOREIGN KEY (product_code) REFERENCES products(product_code)
);
The migration runs in this order: zip_codes first (deduplicated from order_log), then customers (deduplicated by email), then products (deduplicated by code), then orders, then order_items. The deduplicating dedupe queries are straightforward:
INSERT INTO zip_codes (zip, city, state)
SELECT DISTINCT customer_zip, customer_city, customer_state
FROM order_log
WHERE customer_zip IS NOT NULL;
INSERT INTO customers (customer_id, email, name, zip)
SELECT
ROW_NUMBER() OVER (ORDER BY MIN(order_id)) AS customer_id,
customer_email,
MIN(customer_name),
MIN(customer_zip)
FROM order_log
GROUP BY customer_email;
-- Products from the flattened lists
INSERT INTO products (product_code, product_name)
SELECT DISTINCT
trim(unnest(string_to_array(product_codes, ','))) AS product_code,
trim(unnest(string_to_array(product_names, ','))) AS product_name
FROM order_log;
A practical caveat: in the real migration, you’ll discover one customer email mapped to two slightly different names, and one ZIP mapped to two slightly different cities. The data is dirty, and normalization surfaces the dirt that denormalized storage hides. That surfacing is the point.
Common mistakes
Five normalization mistakes worth naming, because we see them every quarter:
- Reaching for JSON to dodge 1NF. A list of tags stored as
tags JSONBis still a 1NF violation if the application queries individual tags. JSON columns are useful for genuinely schema-on-read data; they are not a synonym for “I didn’t want to make a junction table today.” - Conflating 3NF with “no redundancy ever.” 3NF eliminates redundancy that depends on functional dependencies in the schema. It does not say you cannot have a denormalized reporting copy elsewhere. Star schemas and 3NF source tables coexist on every well-run data stack.
- Skipping 2NF on tables with composite keys. 2NF is often the silent failure mode: developers normalize to “3NF” without checking that every non-key column depends on the full composite key. Audit composite-key tables explicitly.
- Updating denormalized analytics tables in place. If a star schema is rebuilt by the ETL pipeline, the schema is correct. If a star schema is hand-patched with
UPDATEstatements when source data changes, the schema is now lying. - Treating views as a normalization substitute. A view that joins three normalized tables presents a denormalized read interface, but it does not give the storage characteristics of a denormalized table. Materialized views do; ordinary views do not.
When the rules are wrong
Three contexts where normalization is the wrong objective:
- Analytics warehouses optimized for read speed. Snowflake and BigQuery thrive on wide, denormalized tables. The clustering and partitioning work assumes you are not joining seven dimension tables on every query.
- Append-only event logs. Capturing the customer’s email or address at the time of an event is correct behavior, even though it duplicates data that lives in a normalized customer table. The event row is a snapshot, not a foreign key.
- Schemas where the “correct” foreign key would block the write path. If a downstream service writes an order before the customer record exists (an eventual-consistency scenario), enforcing the foreign key constraint causes more outages than it prevents. The right move there is to log the customer reference and reconcile asynchronously, not to denormalize the customer record into the order row.
Takeaway
Normalize source tables to 3NF as a default. BCNF when 3NF leaves a real anomaly, which is rare. Denormalize on purpose for analytics, and let the ETL pipeline own the duplication so that the duplication has a single source of truth. The bugs that normalization prevents (update anomalies, deletion anomalies, insert anomalies) are not theoretical: they show up the first time the schema disagrees with itself and a report ships the wrong number.