Data professionals are faced with huge amounts of information. Even enterprises with fully staffed analytical departments can struggle with handling data-related tasks—monitoring market dynamics, analyzing competitor activity, or analyzing customer behavior on online platforms. Such laborious tasks require not only the right specialists but also the right tools. Clickhouse is one of the tools that can be of help.
Let’s talk about Clickhouse, discuss its features and main downsides, cover common applications, and finish it up with a Clickhouse comparison with other systems. There are even a few words on how to be a ninja with Cilckhouse and a glossary of important terms.
Cilckhouse is a widely adopted, open-source column-oriented OLAP DBMS. It can also be described as a tool for structuring analytical data using real-time SQL queries. While it doesn’t offer any dedicated scaling features or mechanisms, it is designed to provide linear scalability of queries, which means the cluster potential is sky-high.
Before we go on, let’s fire up a quick glossary:
If you’re unclear about other terms— or you want to learn more about specific functions like Clickhouse’s “create database,” machine learning, or predictive analytics—view the official documentation. For example, there are tabs explaining how to CREATE DATABASE and CREATE TABLE, as well as tutorials for connecting to the Clickhouse database server from Power BI dashboards.
NYPD Complaint Data
Here are the key aspects of the Clickhouse architecture:
Here is why Clickhouse is the sharp weapon of the OLAP scene:
In a “traditional” row-oriented DBMS, data is stored like this (taken from clickhouse.com):
Row | WatchID | JavaEnable | Title | GoodEvent | EventTime |
#0 | 89354350662 | 1 | Investor Relations | 1 | 2016-05-18 05:19:20 |
#1 | 90329509958 | 0 | Contact us | 1 | 2016-05-18 08:10:20 |
#2 | 89953706054 | 1 | Mission | 1 | 2016-05-18 07:38:00 |
#N | … | … | … | … | … |
As you can see, all the values related to a row are stored next to each other. MySQL, Postgres, and MS SQL Server are examples of row-oriented DBMS.
A column-oriented DBMS stores data like this (also taken from clickhouse.com):
Row: | #0 | #1 | #2 | #N |
WatchID: | 89354350662 | 90329509958 | 89953706054 | … |
JavaEnable: | 1 | 0 | 1 | … |
Title: | Investor Relations | Contact us | Mission | … |
GoodEvent: | 1 | 1 | 1 | … |
EventTime: | 2016-05-18 05:19:20 | 2016-05-18 08:10:20 | 2016-05-18 07:38:00 | … |
Here, the values from different columns are stored separately, and data from the same column are stored together.
Examples of a column-oriented DBMS are Vertica, Paraccel, Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB, LucidDB, SAP HANA, Google Dremel, Google PowerDrill, Druid, and kdb+.
Clickhouse is compatible with another open-source project, Apache Zookeeper, and relies on it to handle replication and distributed DDL commands. Apache Zookeeper is a server for distributed coordination of cloud applications, and, in this context, it ensures the high performance of distributed clusters (a group of nodes).
Any Clickhouse cluster consists of shards, which, in turn, consist of replicas storing the same data. Here is an example configuration of a cluster of Clickhouse nodes.
3 (shards) x 2 (replicas) = 6 Clickhouse cluster nodes
Sharding is a cluster horizontal scaling strategy that allows you to store parts of the data on different shards. With data spread out evenly across nodes in a cluster, sharding supports load distribution.
In order to distribute data across shards, you need to create a table on the Distributed engine and specify information about the location of shards and replicas on each node in the configuration file. You can configure any number of replicas within a shard.
There might be issues with broken nodes or the rest of the cluster getting locked up. So, when deploying new nodes, it’s recommended to work (which will also save you time).
Replication works within a single shard at the table level. In other words, replication does not depend on sharding because each shard has its own independent replication.
Database replication is needed for fault tolerance. This is especially important if your company deals with sensitive information, and data loss is unacceptable.
Once again, everything you’ll need is in the documentation. The only real issue is that there is a lot of documentation, so you’ll be able to extract valuable information from it if you have relevant experience and expertise. Less proficient users will need more digestible instructions.
Although, replication is made much easier by the ON CLUSTER mechanism, which largely abstracts the user from understanding the cluster topology. For example, when creating tables, a user doesn’t need to execute the query themselves on all nodes of the cluster.
Clickhouse stands out in the following ways:
As mentioned earlier, Clickhouse is open-source, which means it’s free to download, change, and deploy.
Clickhouse is known for its incredibly fast performance. It will be the fastest option available in both the open source and commercial markets if you’re dealing with large volumes of structured relational data.
If you’re interested in comparing Clickhouse performance benchmark results against other systems, many are available on GitHub. We’ll also offer some insight on other data systems in the “Quick Clickhouse Comparison” section below.
Clickhouse uses CMake (a meta-build system that can generate build tasks) and Ninja (a smaller speed-focused build system that executes those tasks) for building.
In contrast to other competing tools in this space (e.g., Druid or Elastic), which are primarily interacted with via a JSON HTTP API, Clickhouse is fully based on ANSI SQL. This makes interactions through APIs and reporting tools easier and more familiar.
Clickhouse can be easily started out of the box. Users shouldn’t have any issues with configuring or running it because it’s delivered as a single binary with minimum dependencies.
High performance does not come without its downsides:
UK Property Price Paid
Clickhouse excels at business tasks that require consistent low-latency responses across tables. The system dominates in use cases like:
In addition to the appropriate uses of Clickhouse, we should also highlight a couple of operations that this DBMS is not suitable for:
Suppose a development studio releases a new software product and wants to check whether the users like it.
Tracking the number of downloads is a good start but not enough. You also want to know whether they keep using it or never run it again. For this, you want to visualize the user churn to see how many people still use the software, let’s say after a few weeks since downloading.
Here is how to solve it creatively: the software will send a ping with a user identifier to a tracking endpoint every time they open it. Then, you build an extraction pipeline to load those pings and analyze them. Fortunately, Clickhouse will easily handle such a large amount of data. After all, it was designed to power the second largest web analytics platform in the world.
With the help of a business intelligence tool, e.g., Tableau dashboards, you can visualize the user churn and charts of daily, weekly, and monthly active users (DAU, WAU, MAU).
There is a lot of choice in today’s database market.
Choosing the best technology can make all the difference down the road. We recommend taking a step back and analyzing your stack, your team’s skills, your business needs now, and your goals for the future.
As you evaluate a database management system, look for administrative simplicity, high performance, efficient visualization and reporting, and high scalability. Look explicitly at the current and planned machine learning capabilities since much of the business value of data warehouses comes from obtaining and applying predictive analytics.
Clickhouse is less versatile than similar DBMS but works great for business analytics tasks. Thanks to the column-oriented structure, the system is able to execute several queries at once while running a minimal amount of RAM. One of the most common uses is the processing of analytical requests for monitoring user activity on websites and applications.
If you’ve already arrived at a decision to use Clickhouse and are struggling to deploy your first cluster, you can search for managed Clickhouse as a service. Or perhaps you’re planning to scale your cluster beyond 1 or 2 nodes, having a production problem, or hitting a bug. Whatever the reason might be, consider getting outside help to make your analytic development as productive as possible without hiring new people.
If you want to use Clickhouse to empower your business processes and drive innovation, we’re always here to help. Our engineers will get your databases up and running in no time.
You need this guide, because it includes:
Turn it on to get exclusive guide on modern data stack
Emails suck. This newsletter doesn’t
Subscribe to the newsletter and get the most useful guide on modern data stack
The newsletter
You will also receive other useful materials on data analysis hacks with case examples from our company.