Data Engineering

Database Normalization: 1NF, 2NF, 3NF in SQL

· 13 min read

Database normalization is the discipline of structuring tables so each fact lives in exactly one place. Skip it and you ship duplicate rows, broken UPDATE statements, and reports whose answer depends on which copy of a row the query reads. Get it right and a class of data bugs stops being possible.

In This Article

  1. What “normal form” actually means
  2. 1NF: every cell holds a single value
  3. 2NF: every non-key column depends on the whole key
  4. 3NF: no transitive dependencies
  5. BCNF: when 3NF leaves an anomaly
  6. When to denormalize on purpose
  7. A worked migration: denormalized log to 3NF
  8. Common mistakes
  9. When the rules are wrong
  10. Takeaway

This is the reference we hand a data engineer on day one: 1NF, 2NF, 3NF, BCNF, each with a worked SQL example showing the bad table, the fix, and what breaks when you skip the step. At the end: when to denormalize on purpose, and how to do it without losing the guarantees normalization gave you.

What “normal form” actually means

A normal form is a property of a table’s schema. A table is in 3NF if and only if it satisfies the 3NF rule. The forms stack: 3NF implies 2NF implies 1NF. The work of normalization is moving each table up the stack until further moves would cost more than they save.

The four forms that matter in production:

  • 1NF: every column holds a single, atomic value. No comma-separated lists. No JSON arrays where a junction table belongs.
  • 2NF: every non-key column depends on the whole primary key, not just part of it. Only relevant when the primary key is composite.
  • 3NF: every non-key column depends on the key directly, not through another non-key column.
  • BCNF: every functional dependency in the table is on a superkey. The rare cases where 3NF leaves an anomaly.

4NF and 5NF exist and they have legitimate use cases. The frequency with which an analyst encounters them is low enough that we cover them last and briefly.

1NF: every cell holds a single value

The 1NF rule: every column contains an atomic value, and every row is uniquely addressable by its primary key.

The most common 1NF violation in practice is a column that holds a list, separated by commas or stored as JSON, where the rest of the application wants to query individual elements.

Bad: a 1NF violation
CREATE TABLE customer_contacts (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100),
    emails      VARCHAR(500)  -- 'a@x.com,b@y.com,c@z.com'
);

What breaks: any query that wants “all customers with a .gov email” has to use LIKE '%.gov%', which silently misclassifies not-actually-gov@x.com. Sending a campaign to one of a customer’s three emails requires string-splitting in SQL, which behaves differently on every dialect. Adding a fourth email means an UPDATE ... SET emails = CONCAT(emails, ',d@w.com'), which races with concurrent updates and loses email addresses.

Good: 1NF compliant
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100)
);

CREATE TABLE customer_emails (
    customer_id INT,
    email       VARCHAR(254),
    is_primary  BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (customer_id, email),
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

The migration query, assuming Postgres:

INSERT INTO customers (customer_id, name)
SELECT customer_id, name FROM customer_contacts;

INSERT INTO customer_emails (customer_id, email)
SELECT customer_id, trim(unnest(string_to_array(emails, ',')))
FROM customer_contacts;

The same pattern in BigQuery uses UNNEST(SPLIT(emails, ',')); in Snowflake, LATERAL FLATTEN(input => SPLIT(emails, ',')). In every dialect, the destination schema is the same.

A note on JSON columns. Modern Postgres, BigQuery, and Snowflake all support a JSON type, and there is a legitimate use for storing schema-on-read payloads (webhook bodies, raw event blobs) in JSON. Storing a list of emails in JSON is not that use case. If the application wants to query individual emails, the list is structured data and belongs in its own table, not a JSON column.

2NF: every non-key column depends on the whole key

2NF only applies when the primary key is composite, that is, made of more than one column. The rule: every non-key column must depend on the entire composite key, not on a subset of it.

Bad: a 2NF violation
CREATE TABLE order_items (
    order_id      INT,
    product_id    INT,
    quantity      INT,
    product_name  VARCHAR(200),    -- depends on product_id only
    product_price NUMERIC(10, 2),  -- depends on product_id only
    PRIMARY KEY (order_id, product_id)
);

The primary key is (order_id, product_id). quantity depends on both: the same product can have different quantities in different orders. But product_name and product_price depend on product_id alone. The product name and price will be duplicated across every order line for that product.

What breaks: a price change has to update every historical order_items row, or the rows disagree. A typo in product_name on one row is invisible until a join surfaces the inconsistency. DELETE-ing the last order line for a product also deletes the only place the product’s name was stored.

Good: 2NF compliant
CREATE TABLE products (
    product_id    INT PRIMARY KEY,
    product_name  VARCHAR(200) NOT NULL,
    product_price NUMERIC(10, 2) NOT NULL
);

CREATE TABLE order_items (
    order_id   INT,
    product_id INT,
    quantity   INT NOT NULL,
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

A common objection: “but I need the price at the time of the order, not the current price.” That objection is correct and the answer is not to denormalize. Add price_at_order to order_items as an event-time snapshot, and keep the current price on products. Those are two different facts and they belong in two different columns.

3NF: no transitive dependencies

The 3NF rule: every non-key column depends on the primary key directly, not through another non-key column.

Bad: a 3NF violation
CREATE TABLE customer_addresses (
    address_id  INT PRIMARY KEY,
    customer_id INT,
    street      VARCHAR(200),
    zip         VARCHAR(10),
    city        VARCHAR(100),  -- depends on zip, not address_id
    state       VARCHAR(2)     -- depends on zip, not address_id
);

The transitive chain is address_id → zip → (city, state). city and state are determined by zip, not by address_id. The same ZIP code will appear thousands of times across addresses, and so will the city and state for that ZIP.

What breaks: a ZIP code reassignment (the US Postal Service does this) requires updating every address with that ZIP. Forget one and the table self-contradicts. A typo on city for one address means the same ZIP now maps to two cities, which silently breaks any report that groups by city.

Good: 3NF compliant
CREATE TABLE zip_codes (
    zip   VARCHAR(10) PRIMARY KEY,
    city  VARCHAR(100) NOT NULL,
    state VARCHAR(2)   NOT NULL
);

CREATE TABLE addresses (
    address_id  INT PRIMARY KEY,
    customer_id INT,
    street      VARCHAR(200),
    zip         VARCHAR(10),
    FOREIGN KEY (zip) REFERENCES zip_codes(zip)
);

A query that needs the full address joins to zip_codes. The cost is one extra join. The benefit is that city/state cannot disagree across rows for the same ZIP, because there is only one row per ZIP.

3NF is the form most production OLTP schemas target. It eliminates almost all update anomalies and leaves the schema readable. Going past 3NF (to BCNF) is sometimes necessary, but most engagements never need to.

BCNF: when 3NF leaves an anomaly

BCNF is the strengthening of 3NF that handles a specific edge case: a table can satisfy 3NF and still have a functional dependency that causes update anomalies.

A canonical example: a table tracking which instructor teaches which course on which day. Suppose business rules say each instructor teaches exactly one course on each weekday, and each course has exactly one instructor.

Bad: 3NF but not BCNF
CREATE TABLE schedule (
    day        VARCHAR(10),
    instructor VARCHAR(100),
    course     VARCHAR(50),
    PRIMARY KEY (day, instructor)
);

This is in 3NF (no non-key column depends transitively on the key). But course functionally depends on instructor alone, and instructor is not a superkey. If the same instructor teaches the same course on Monday and Wednesday, and the course code changes, the update has to hit two rows. Miss one and the schema lies.

Good: BCNF compliant
CREATE TABLE instructor_courses (
    instructor VARCHAR(100) PRIMARY KEY,
    course     VARCHAR(50)  NOT NULL
);

CREATE TABLE schedule (
    day        VARCHAR(10),
    instructor VARCHAR(100),
    PRIMARY KEY (day, instructor),
    FOREIGN KEY (instructor) REFERENCES instructor_courses(instructor)
);

Most practitioners encounter BCNF rarely because composite keys with overlapping functional dependencies are uncommon outside of academic examples and certain scheduling, assignment, and reservation domains.

When to denormalize on purpose

Normalization optimizes for write correctness: each fact in one place, no update anomalies. Analytics optimizes for read speed and query simplicity: fewer joins, fewer hops, predictable execution plans. These are in tension.

Three patterns where intentional denormalization is the right call:

  1. Star schemas in the warehouse. Dimensional models flatten 3NF source tables into fact + dimension tables. dim_customer includes city, state, ZIP as columns even though they violate 3NF. The motivation is query latency: BI tools generate queries assuming dimension columns are one join away, not three. The cost (denormalized dim tables) is paid in the ETL pipeline, where it is automatable, rather than at every report run.
  2. Reporting columns on event tables. An order_events table that captures the customer’s segment, channel, and country at the moment of the order. Denormalizing here is correct because it captures historical state, which is exactly what reporting needs and what a normalized schema loses.
  3. Materialized views over normalized base tables. The base tables remain in 3NF (the source of truth). The materialized view caches a denormalized read interface, refreshed on a schedule or on commit. Postgres, Snowflake, BigQuery, and ClickHouse all support some flavor of this. We cover the ClickHouse pattern in detail in our materialized views in ClickHouse writeup.

The deciding question is not “is normalization good or bad?” but “where does this table sit in the write/read tradeoff?” OLTP source tables: normalize aggressively. Warehouse dimensions: denormalize on purpose, with the ETL pipeline as the single source of truth for the duplication.

A worked migration: denormalized log to 3NF

A real example. The starting table is a denormalized log of orders, the kind that lands when a junior engineer captures every relevant field on the order row:

CREATE TABLE order_log (
    order_id        INT PRIMARY KEY,
    customer_email  VARCHAR(254),
    customer_name   VARCHAR(100),
    customer_zip    VARCHAR(10),
    customer_city   VARCHAR(100),
    customer_state  VARCHAR(2),
    product_codes   VARCHAR(500),       -- 'SKU-1,SKU-2,SKU-3'
    product_names   VARCHAR(2000),      -- 'Widget,Gadget,Sprocket'
    order_total     NUMERIC(10, 2),
    ordered_at      TIMESTAMP
);

The 1NF violation is the product_codes / product_names pair. The 2NF violation surfaces once we split products onto their own row: product_name depends on product_code, not on (order_id, product_code). The 3NF violation is the (zip, city, state) chain.

Target schema after 1NF, 2NF, and 3NF:

CREATE TABLE zip_codes (
    zip   VARCHAR(10) PRIMARY KEY,
    city  VARCHAR(100) NOT NULL,
    state VARCHAR(2)   NOT NULL
);

CREATE TABLE customers (
    customer_id    INT PRIMARY KEY,
    email          VARCHAR(254) UNIQUE NOT NULL,
    name           VARCHAR(100),
    zip            VARCHAR(10),
    FOREIGN KEY (zip) REFERENCES zip_codes(zip)
);

CREATE TABLE products (
    product_code VARCHAR(50) PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_total NUMERIC(10, 2),
    ordered_at  TIMESTAMP NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

CREATE TABLE order_items (
    order_id     INT,
    product_code VARCHAR(50),
    PRIMARY KEY (order_id, product_code),
    FOREIGN KEY (order_id) REFERENCES orders(order_id),
    FOREIGN KEY (product_code) REFERENCES products(product_code)
);

The migration runs in this order: zip_codes first (deduplicated from order_log), then customers (deduplicated by email), then products (deduplicated by code), then orders, then order_items. The deduplicating dedupe queries are straightforward:

INSERT INTO zip_codes (zip, city, state)
SELECT DISTINCT customer_zip, customer_city, customer_state
FROM order_log
WHERE customer_zip IS NOT NULL;

INSERT INTO customers (customer_id, email, name, zip)
SELECT
    ROW_NUMBER() OVER (ORDER BY MIN(order_id)) AS customer_id,
    customer_email,
    MIN(customer_name),
    MIN(customer_zip)
FROM order_log
GROUP BY customer_email;

-- Products from the flattened lists
INSERT INTO products (product_code, product_name)
SELECT DISTINCT
    trim(unnest(string_to_array(product_codes, ','))) AS product_code,
    trim(unnest(string_to_array(product_names, ','))) AS product_name
FROM order_log;

A practical caveat: in the real migration, you’ll discover one customer email mapped to two slightly different names, and one ZIP mapped to two slightly different cities. The data is dirty, and normalization surfaces the dirt that denormalized storage hides. That surfacing is the point.

Common mistakes

Five normalization mistakes worth naming, because we see them every quarter:

  1. Reaching for JSON to dodge 1NF. A list of tags stored as tags JSONB is still a 1NF violation if the application queries individual tags. JSON columns are useful for genuinely schema-on-read data; they are not a synonym for “I didn’t want to make a junction table today.”
  2. Conflating 3NF with “no redundancy ever.” 3NF eliminates redundancy that depends on functional dependencies in the schema. It does not say you cannot have a denormalized reporting copy elsewhere. Star schemas and 3NF source tables coexist on every well-run data stack.
  3. Skipping 2NF on tables with composite keys. 2NF is often the silent failure mode: developers normalize to “3NF” without checking that every non-key column depends on the full composite key. Audit composite-key tables explicitly.
  4. Updating denormalized analytics tables in place. If a star schema is rebuilt by the ETL pipeline, the schema is correct. If a star schema is hand-patched with UPDATE statements when source data changes, the schema is now lying.
  5. Treating views as a normalization substitute. A view that joins three normalized tables presents a denormalized read interface, but it does not give the storage characteristics of a denormalized table. Materialized views do; ordinary views do not.

When the rules are wrong

Three contexts where normalization is the wrong objective:

  • Analytics warehouses optimized for read speed. Snowflake and BigQuery thrive on wide, denormalized tables. The clustering and partitioning work assumes you are not joining seven dimension tables on every query.
  • Append-only event logs. Capturing the customer’s email or address at the time of an event is correct behavior, even though it duplicates data that lives in a normalized customer table. The event row is a snapshot, not a foreign key.
  • Schemas where the “correct” foreign key would block the write path. If a downstream service writes an order before the customer record exists (an eventual-consistency scenario), enforcing the foreign key constraint causes more outages than it prevents. The right move there is to log the customer reference and reconcile asynchronously, not to denormalize the customer record into the order row.

Takeaway

Normalize source tables to 3NF as a default. BCNF when 3NF leaves a real anomaly, which is rare. Denormalize on purpose for analytics, and let the ETL pipeline own the duplication so that the duplication has a single source of truth. The bugs that normalization prevents (update anomalies, deletion anomalies, insert anomalies) are not theoretical: they show up the first time the schema disagrees with itself and a report ships the wrong number.

Keep reading

Enjoyed this article?

Get weekly data strategy insights delivered to your inbox.

Get in Touch

Let's Discuss Your Project

Book a 30-minute discovery call. We'll assess your data maturity and recommend the right approach — no strings attached.

Book a Discovery Call →
Need help with your data strategy? Book a Discovery Call →