10M-Row Cumulative Sum Benchmark

Companion notebook to Cumulative Sum in SQL: Why OVER Beats Self-Joins by 60×. Reproduce all three approaches on your hardware. Synthetic 10M-row dataset, generated in the notebook, no downloads beyond this file.

~60×
Window vs self-join, 10M rows

3
Engines · Postgres · DuckDB · ClickHouse

19
Cells · runnable end-to-end

Free notebook
cumulative-sum-benchmark.ipynb

Zip · 4 KB · Python 3.11 · Apache 2.0

Download notebook →

No signup. No email gate. Plain link.

Inside the file
  • 01Dataset generator (10M and 100M rows, Parquet)
  • 02DuckDB 1.0 — single-process, no Docker needed
  • 03Postgres 16 — window function path (SUM OVER)
  • 04Postgres 16 — self-join path (with O(n²) safeguard)
  • 05ClickHouse 24 — SUM OVER
  • 06ClickHouse 24 — groupArrayMovingSum + ARRAY JOIN
  • 07Results table + log-scale bar chart

How to run it

  1. Unzip and open in Jupyter Lab, VS Code, or upload to Google Colab.
  2. Install dependencies in the first code cell: pip install duckdb psycopg[binary] clickhouse-connect pandas numpy matplotlib.
  3. Start Postgres and ClickHouse via the docker run commands in cells 3 and 4. Skip those cells to run only the DuckDB benchmark — it works standalone.
  4. Run all cells. Generation + DuckDB + Postgres window: ~2 minutes on a laptop. Self-join is auto-scaled from a 100K-row sample to avoid an O(n²) wait.

Reference numbers

From the article — AWS c6i.4xlarge, 16 vCPU, 32 GB RAM. Your numbers will differ but the ratios should hold.

Approach 10M rows 100M rows
Window function (SUM OVER) 2.1s 24s
ClickHouse groupArrayMovingSum 1.8s 19s
Self-join (INNER JOIN) 128s did not finish
Get in Touch

Let's Discuss Your Project

Book a 30-minute discovery call. We'll assess your data maturity and recommend the right approach — no strings attached.

Book a Discovery Call →