16.09.2022 | Nikolay Valiotti

Clickhouse: Overview and Applications

Clickhouse

Data professionals are faced with huge amounts of information. Even enterprises with fully staffed analytical departments can struggle with handling data-related tasks—monitoring market dynamics, analyzing competitor activity, or analyzing customer behavior on online platforms. Such laborious tasks require not only the right specialists but also the right tools. Clickhouse is one of the tools that can be of help.

Let’s talk about Clickhouse, discuss its features and main downsides, cover common applications, and finish it up with a Clickhouse comparison with other systems. There are even a few words on how to be a ninja with Cilckhouse and a glossary of important terms.

What Is Clickhouse? General Info, Catalog Definition, etc.

Cilckhouse is a widely adopted, open-source column-oriented OLAP DBMS. It can also be described as a tool for structuring analytical data using real-time SQL queries. While it doesn’t offer any dedicated scaling features or mechanisms, it is designed to provide linear scalability of queries, which means the cluster potential is sky-high.

Before we go on, let’s fire up a quick glossary:

  • OLAP – Online Analytical Processing
  • DBMS – Database Management Systems
  • SQL query – A request for data or information from a database table or combination of tables
  • Catalog – Contains metadata of each table that the server knows about
  • MergeTree – Table engine
  • RAM – Random Access Memory
  • API – Application Programming Interface

If you’re unclear about other terms— or you want to learn more about specific functions like Clickhouse’s “create database,” machine learning, or predictive analytics—view the official documentation. For example, there are tabs explaining how to CREATE DATABASE and CREATE TABLE, as well as tutorials for connecting to the Clickhouse database server from Power BI dashboards.

The Clickhouse Architecture

NYPD Complaint Data

NYPD Complaint Data

Here are the key aspects of the Clickhouse architecture:

  • Compressed, column-oriented storage – A query that needs to read a few columns much faster; there is also potential for greater compressibility.
  • MergeTree – Nearly all table engines derive from MergeTree, which determines the type of table and the features that will be available for data processing.
  • Indexes – Clickhouse relies on the primary index and a secondary data skipping index to limit the amount of data analyzed.
  • Vector computation engine – Clickhouse implemented a series of vector algorithms to handle large arrays of data on a column-by-column basis.

Key Properties of OLAP Scenario

Here is why Clickhouse is the sharp weapon of the OLAP scene:

  • Most requests are for read access.
  • Data updates are made in batches of over 1,000 rows or not updated at all.
  • A large number of rows and only a small subset of columns are extracted from the database (for reads).
  • Tables are made of a large number of columns.
  • Latencies of around 50 ms are allowed for simple queries.
  • Column values are fairly small.
  • Requires high throughput when processing a single query.
  • All but one table are small.
  • A query result fits in a single server’s RAM.

Row-Oriented DBMS vs. Column-Oriented DBMS

In a “traditional” row-oriented DBMS, data is stored like this (taken from clickhouse.com):

RowWatchIDJavaEnableTitleGoodEventEventTime
#0893543506621Investor Relations12016-05-18 05:19:20
#1903295099580Contact us12016-05-18 08:10:20
#2899537060541Mission12016-05-18 07:38:00
#N

As you can see, all the values related to a row are stored next to each other. MySQL, Postgres, and MS SQL Server are examples of row-oriented DBMS.

A column-oriented DBMS stores data like this (also taken from clickhouse.com):

Row:#0#1#2#N
WatchID:893543506629032950995889953706054
JavaEnable:101
Title:Investor RelationsContact usMission
GoodEvent:111
EventTime:2016-05-18 05:19:202016-05-18 08:10:202016-05-18 07:38:00

Here, the values from different columns are stored separately, and data from the same column are stored together.

Examples of a column-oriented DBMS are Vertica, Paraccel, Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB, LucidDB, SAP HANA, Google Dremel, Google PowerDrill, Druid, and kdb+.

Clusters

Clickhouse is compatible with another open-source project, Apache Zookeeper, and relies on it to handle replication and distributed DDL commands. Apache Zookeeper is a server for distributed coordination of cloud applications, and, in this context, it ensures the high performance of distributed clusters (a group of nodes).

Any Clickhouse cluster consists of shards, which, in turn, consist of replicas storing the same data. Here is an example configuration of a cluster of Clickhouse nodes.

3 (shards) x 2 (replicas) = 6 Clickhouse cluster nodes

3 (shards) x 2 (replicas) = 6 Clickhouse cluster nodes

Shards

Sharding is a cluster horizontal scaling strategy that allows you to store parts of the data on different shards. With data spread out evenly across nodes in a cluster, sharding supports load distribution.

In order to distribute data across shards, you need to create a table on the Distributed engine and specify information about the location of shards and replicas on each node in the configuration file. You can configure any number of replicas within a shard.

There might be issues with broken nodes or the rest of the cluster getting locked up. So, when deploying new nodes, it’s recommended to work (which will also save you time).

Replication

Replication works within a single shard at the table level. In other words, replication does not depend on sharding because each shard has its own independent replication.

Database replication is needed for fault tolerance. This is especially important if your company deals with sensitive information, and data loss is unacceptable.

Once again, everything you’ll need is in the documentation. The only real issue is that there is a lot of documentation, so you’ll be able to extract valuable information from it if you have relevant experience and expertise. Less proficient users will need more digestible instructions.

Although, replication is made much easier by the ON CLUSTER mechanism, which largely abstracts the user from understanding the cluster topology. For example, when creating tables, a user doesn’t need to execute the query themselves on all nodes of the cluster.

Differentiating Features of Clickhouse

Differentiating Features of Clickhouse

Clickhouse stands out in the following ways:

Open-Source

As mentioned earlier, Clickhouse is open-source, which means it’s free to download, change, and deploy.

Fast Performance

Clickhouse is known for its incredibly fast performance. It will be the fastest option available in both the open source and commercial markets if you’re dealing with large volumes of structured relational data.

If you’re interested in comparing Clickhouse performance benchmark results against other systems, many are available on GitHub. We’ll also offer some insight on other data systems in the “Quick Clickhouse Comparison” section below.

Build System

Clickhouse uses CMake (a meta-build system that can generate build tasks) and Ninja (a smaller speed-focused build system that executes those tasks) for building.

SQL Native

In contrast to other competing tools in this space (e.g., Druid or Elastic), which are primarily interacted with via a JSON HTTP API, Clickhouse is fully based on ANSI SQL. This makes interactions through APIs and reporting tools easier and more familiar.

Ease of Deployment

Clickhouse can be easily started out of the box. Users shouldn’t have any issues with configuring or running it because it’s delivered as a single binary with minimum dependencies.

Main Downsides to Clickhouse

High performance does not come without its downsides:

  • Limited support for transactions – No concept of transactions, so that data can be inconsistent
  • No real delete/update support – Not unique to Clickhouse; also, you can now mark data to be deleted
  • No window functions – Rows don’t get grouped into a single output row
  • Management overhead – Not self-installed or self-managed

Common Applications

UK Property Price Paid

UK Property Price Paid

Clickhouse excels at business tasks that require consistent low-latency responses across tables. The system dominates in use cases like:

  • Web analytics
  • Real-time network management
  • Service log analysis
  • Real-time ad bidding
  • Asset valuation in financial markets
  • Security threat identification

In addition to the appropriate uses of Clickhouse, we should also highlight a couple of operations that this DBMS is not suitable for:

  • Storing unstructured data (for example, media and entertainment data, surveillance data, geospatial data, audio, weather data, sensor data, ticker data, etc.)
  • Frequently updated data

Example Application: Analytics Dashboards for a Development Studio

Suppose a development studio releases a new software product and wants to check whether the users like it.

Tracking the number of downloads is a good start but not enough. You also want to know whether they keep using it or never run it again. For this, you want to visualize the user churn to see how many people still use the software, let’s say after a few weeks since downloading.

Here is how to solve it creatively: the software will send a ping with a user identifier to a tracking endpoint every time they open it. Then, you build an extraction pipeline to load those pings and analyze them. Fortunately, Clickhouse will easily handle such a large amount of data. After all, it was designed to power the second largest web analytics platform in the world.

With the help of a business intelligence tool, e.g., Tableau dashboards, you can visualize the user churn and charts of daily, weekly, and monthly active users (DAU, WAU, MAU).

Quick Clickhouse Comparison

There is a lot of choice in today’s database market.

  • MySQL vs. Clickhouse – Clickhouse is 267 times faster, but they work best together.
  • MySQL vs. Microsoft SQL Server – Microsoft SQL Server has broader approval.
  • Druid/Pinot vs. Clickhouse – Clickhouse is simpler and has fewer moving parts and services but still requires a lot of attention from professional SREs.
  • Tableau vs. Clickhouse – They work together best because ClickHouse gives Tableau more strength.
  • SingleStore vs. Clickhouse – SingleStore outperforms Clickhouse.
  • Snowflake vs. Clickhouse – Snowflake struggles to deliver low-latency query performance at scale.

Choosing the best technology can make all the difference down the road. We recommend taking a step back and analyzing your stack, your team’s skills, your business needs now, and your goals for the future.

Summary

As you evaluate a database management system, look for administrative simplicity, high performance, efficient visualization and reporting, and high scalability. Look explicitly at the current and planned machine learning capabilities since much of the business value of data warehouses comes from obtaining and applying predictive analytics.

Clickhouse is less versatile than similar DBMS but works great for business analytics tasks. Thanks to the column-oriented structure, the system is able to execute several queries at once while running a minimal amount of RAM. One of the most common uses is the processing of analytical requests for monitoring user activity on websites and applications.

If you’ve already arrived at a decision to use Clickhouse and are struggling to deploy your first cluster, you can search for managed Clickhouse as a service. Or perhaps you’re planning to scale your cluster beyond 1 or 2 nodes, having a production problem, or hitting a bug. Whatever the reason might be, consider getting outside help to make your analytic development as productive as possible without hiring new people.

If you want to use Clickhouse to empower your business processes and drive innovation, we’re always here to help. Our engineers will get your databases up and running in no time.