Data Engineering

Materialized Views in ClickHouse: The Pattern We Use Most

· 12 min read

Quick answer: A materialized view in ClickHouse is a stored, pre-computed query result that updates automatically as new data arrives. Used for real-time aggregations, ETL-like transformations on insert, and dashboard acceleration. Defined with CREATE MATERIALIZED VIEW. Three engine types matter: AggregatingMergeTree for sums and counts, ReplacingMergeTree for upserts, SummingMergeTree for additive metrics.

In This Article

  1. The problem
  2. The schema
  3. The update workaround
  4. Verifying it works
  5. Where this pattern breaks
  6. When to use, when not to
  7. Takeaway

A materialized view in ClickHouse is a table whose contents are defined by a SELECT and refreshed automatically as new rows land in the source. Used well, they collapse expensive aggregations into a single index lookup. Used carelessly, they multiply storage cost and turn into a debugging nightmare.

This article is the one materialized-view pattern we reach for most on consulting engagements: pre-aggregating ad-platform metrics from a fact stream into a daily summary table. It’s specific enough to be concrete and general enough that swapping Meta Ads for Google Ads or TikTok Ads is a column rename, not a redesign.

The problem

A client running paid acquisition pulls Meta Ads insights into ClickHouse every hour. The source table is row-per-(campaign, day) granularity. Their dashboard queries always group by date_start to show daily spend, impressions, and clicks across all active campaigns. Two months in, the dashboard query is scanning 40M rows on every refresh and the marketing team is complaining about the 8-second lag.

The fix isn’t bigger hardware. It’s a materialized view that pre-aggregates the daily totals as new rows arrive.

The schema

The source table uses CollapsingMergeTree so we can handle updates — Meta’s API re-sends metrics for the last 28 days as attribution windows close, so the same (campaign_id, date_start) shows up with corrected numbers. The sign column tracks whether a row is a current value (+1) or a tombstone for an old value (-1):

CREATE DATABASE ads;

CREATE TABLE ads.meta_insights (
    campaign_id  UInt64,
    date_start   Date,
    date_stop    Date,
    spend        Float32,
    impressions  UInt32,
    clicks       UInt32,
    sign         Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (date_start, date_stop, campaign_id);

The materialized view sits on top and aggregates into SummingMergeTree, which collapses rows with the same ORDER BY key by summing their numeric columns:

CREATE MATERIALIZED VIEW ads.daily_campaign_metrics
ENGINE = SummingMergeTree()
ORDER BY (date_start, campaign_id)
AS
SELECT
    date_start,
    campaign_id,
    sum(spend * sign)       AS spend,
    sum(impressions * sign) AS impressions,
    sum(clicks * sign)      AS clicks
FROM ads.meta_insights
GROUP BY date_start, campaign_id;

The * sign trick makes the negation arithmetic. When a sign = -1 row lands, its values subtract from the running total in the MV. The +1 row that replaces it adds the corrected values back in.

Meta Ads insights → CollapsingMergeTree (sign tracks updates) → SummingMergeTree materialized view (daily totals per campaign). New rows trigger automatic MV refresh.
Meta Ads insights → CollapsingMergeTree (sign tracks updates) → SummingMergeTree materialized view (daily totals per campaign). New rows trigger automatic MV refresh.

The update workaround

ClickHouse doesn’t have a row-level UPDATE in the OLTP sense. The supported pattern for “this row changed” is: insert a -1 row that’s identical to the old one, then insert a +1 row with the new values. The Collapsing engine reconciles on the next merge. The MV picks up both inserts and the running totals stay correct.

Here’s the Python that does this. It runs once an hour, pulls the latest insights from Meta’s Graph API, and writes the diff into ClickHouse:

from clickhouse_driver import Client
from datetime import date, timedelta
import requests

ch = Client(host='clickhouse.internal', database='ads')

def fetch_meta_insights(since: date) -> list[dict]:
    """Pull the last 28 days of campaign-level insights from Meta's API."""
    resp = requests.get(
        f"https://graph.facebook.com/v19.0/act_{AD_ACCOUNT}/insights",
        params={
            'level': 'campaign',
            'fields': 'campaign_id,date_start,date_stop,spend,impressions,clicks',
            'time_range': f'{{"since":"{since}","until":"{date.today()}"}}',
            'access_token': META_TOKEN,
        },
    )
    return resp.json()['data']

def sync():
    since = date.today() - timedelta(days=28)
    fresh = fetch_meta_insights(since)

    # Pull whatever ClickHouse currently thinks the last 28 days look like
    current = ch.execute("""
        SELECT campaign_id, date_start, date_stop, spend, impressions, clicks
        FROM ads.meta_insights FINAL
        WHERE date_start >= %(since)s
    """, {'since': since})

    # Build tombstone rows for every existing record
    tombstones = [(*row, -1) for row in current]

    # Build fresh rows tagged with sign=+1
    inserts = [
        (
            int(r['campaign_id']),
            r['date_start'],
            r['date_stop'],
            float(r['spend']),
            int(r['impressions']),
            int(r['clicks']),
            1,
        )
        for r in fresh
    ]

    ch.execute(
        "INSERT INTO ads.meta_insights "
        "(campaign_id, date_start, date_stop, spend, impressions, clicks, sign) VALUES",
        tombstones + inserts,
    )
    ch.execute("OPTIMIZE TABLE ads.meta_insights")
    ch.execute("OPTIMIZE TABLE ads.daily_campaign_metrics")

if __name__ == '__main__':
    sync()

A few details that matter:

  • SELECT ... FINAL forces collapsing on read so we get the current (deduplicated) state. Without FINAL you’d see both the +1 and -1 rows from the previous run.
  • OPTIMIZE TABLE triggers a synchronous merge. ClickHouse merges in the background anyway, but the dashboard query immediately after the sync wants to see clean data. In production we call this only after sync, not on every dashboard refresh.
  • The MV refreshes on insert, not on schedule. The moment the INSERT commits to meta_insights, the matching aggregated rows appear in daily_campaign_metrics. No separate cron.

Verifying it works

After the sync runs, the dashboard query against the materialized view is one row scan per date in the date range, not 40M:

SELECT
    date_start,
    sum(spend)       AS spend,
    sum(impressions) AS impressions,
    sum(clicks)      AS clicks
FROM ads.daily_campaign_metrics FINAL
WHERE date_start BETWEEN today() - 30 AND today()
GROUP BY date_start
ORDER BY date_start;

On the client’s hardware this dropped from ~8 seconds to under 200 milliseconds. The marketing dashboard refreshes in real time and the source-table scan is reserved for the analytics team’s ad-hoc drill-downs.

Where this pattern breaks

A few cases where we’ve watched this go wrong on engagements:

  • High-cardinality GROUP BY in the view. Aggregating by (date_start, campaign_id, ad_set_id, creative_id, country) blows up the row count in the MV. The view becomes nearly as big as the source. Aggregate only as deep as your dashboard actually queries.
  • *Forgetting sign on a numeric column*. If you add reach later and write sum(reach) instead of sum(reach sign), the MV double-counts every update. We’ve seen this in code review more than once.
  • OPTIMIZE TABLE on a multi-TB table during business hours. The synchronous merge can lock the table for minutes. Run it off-peak, or rely on background merges and trust FINAL.
  • Backfills. If you need to reprocess the last six months because Meta changed an attribution model, you have to insert tombstones for the entire range first, then the fresh rows. The MV will handle it correctly but the source table doubles in size during the operation. Plan capacity.

When to use, when not to

Reach for this pattern when:

  • The aggregation grain is fixed (daily, hourly, per-customer)
  • The dashboard always groups the same way
  • Source data has predictable updates with a stable primary key
  • You can tolerate eventual consistency between source and view (seconds, not minutes)

Don’t reach for it when:

  • The aggregation grain changes per query (use the source table directly)
  • Sources are append-only with no updates (just use a regular SummingMergeTree on the source — no sign needed)
  • The aggregation is one part of a larger join (materialize the join result, not just the aggregation)

Takeaway

Materialized views with CollapsingMergeTree + SummingMergeTree + the sign trick are the cleanest way we know to keep daily-totals dashboards fast on ClickHouse when the underlying data updates. The pattern fits Meta Ads, Google Ads, TikTok Ads, any ad-platform feed. Swap the column names, keep the engines, keep the * sign arithmetic, and the same skeleton ships.


About the author

Nick Valiotti is the founder of Valiotti Data. 15+ years building analytics infrastructure for SaaS, marketplaces, and consumer subscription. 50+ production deployments across BigQuery, Snowflake, dbt, Metabase, and modern BI stacks. Author of two books on data strategy. LinkedIn · Discovery call.

Keep reading

Enjoyed this article?

Get weekly data strategy insights delivered to your inbox.

Get in Touch

Let's Discuss Your Project

Book a 30-minute discovery call. We'll assess your data maturity and recommend the right approach — no strings attached.

Book a Discovery Call →
Need help with your data strategy? Book a Discovery Call →