Companion notebook to Cumulative Sum in SQL: Why OVER Beats Self-Joins by 60×. Reproduce all three approaches on your hardware. Synthetic 10M-row dataset, generated in the notebook, no downloads beyond this file.
~60×
Window vs self-join, 10M rows
3
Engines · Postgres · DuckDB · ClickHouse
19
Cells · runnable end-to-end
Free notebook
cumulative-sum-benchmark.ipynb
No signup. No email gate. Plain link.
Inside the file
- 01Dataset generator (10M and 100M rows, Parquet)
- 02DuckDB 1.0 — single-process, no Docker needed
- 03Postgres 16 — window function path (
SUM OVER) - 04Postgres 16 — self-join path (with O(n²) safeguard)
- 05ClickHouse 24 —
SUM OVER - 06ClickHouse 24 —
groupArrayMovingSum+ARRAY JOIN - 07Results table + log-scale bar chart
How to run it
- Unzip and open in Jupyter Lab, VS Code, or upload to Google Colab.
- Install dependencies in the first code cell:
pip install duckdb psycopg[binary] clickhouse-connect pandas numpy matplotlib. - Start Postgres and ClickHouse via the
docker runcommands in cells 3 and 4. Skip those cells to run only the DuckDB benchmark — it works standalone. - Run all cells. Generation + DuckDB + Postgres window: ~2 minutes on a laptop. Self-join is auto-scaled from a 100K-row sample to avoid an O(n²) wait.
Reference numbers
From the article — AWS c6i.4xlarge, 16 vCPU, 32 GB RAM. Your numbers will differ but the ratios should hold.
| Approach | 10M rows | 100M rows |
|---|---|---|
| Window function (SUM OVER) | 2.1s | 24s |
ClickHouse groupArrayMovingSum |
1.8s | 19s |
| Self-join (INNER JOIN) | 128s | did not finish |