A Flexible and Scalable DWH System Rebuilt from Scratch with Improved Data Processing Time and Quality

Goal

betPawa is a sports betting company that manages large volumes of OLTP data for bet and game processing. They used a DWH system with ETL processes that operated around the clock, yet the system lacked stability and reliability. Here are some of the problems that undermined data management:

  • Too time-consuming to support
  • Some reports lacking OLTP data
  • Not flexible enough to be adjusted to new corporate business processes and data structures
  • Sometimes, the results took too long to be delivered. For instance, an ETL process that had been running for a whole night failed to deliver the results by morning.
  • Reports were based on yesterday’s DWH data, which wasn’t always sufficient for in-depth analysis.

Since the Client lacked both the expertise and human resources to overcome the challenge, they turned to Valiotti Analytics as a reliable partner with extensive experience in BI reporting, DWH, and ETL design.

Results

  1. A flexible and scalable DWH system

After analyzing the existing DWH system, the source data structures, and the actual business processes, we started designing a new schema. Here are some key achievements:

  • For different fact tables, we provided several ways to extend the DWH with new data sources and business processes.
  • All the data flows were moved to Kafka: some without a MySQL intermediary, and some (available only as MySQL tables) through the Maxwell’s Daemon tool. This made it possible to capture real-time data updates, standardize ETL processing, and simplify future development.
  • Wide fact tables were built by the ETL processor to suit Clickhouse report queries and to improve processing speed (see the sketch after this list).
  • Fact table rows are validated during ETL, and the ETL logs help find data inconsistencies across Redash reports, which improves data quality.
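
For illustration, here is a minimal sketch of what such a wide, denormalized fact table could look like in Clickhouse. The table and column names (fact_bets, bet_id, and so on) are hypothetical and not taken from the actual betPawa schema:

    -- assumes the target database already exists
    CREATE DATABASE IF NOT EXISTS dwh;

    -- wide fact table: dimension attributes are denormalized into the fact rows,
    -- so typical report queries run without joins
    CREATE TABLE dwh.fact_bets
    (
        bet_id     UInt64,
        placed_at  DateTime,
        user_id    UInt64,
        country    LowCardinality(String),
        sport      LowCardinality(String),
        market     LowCardinality(String),
        stake      Decimal(18, 2),
        payout     Decimal(18, 2),
        status     LowCardinality(String)
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(placed_at)
    ORDER BY (placed_at, bet_id);

Because the dimension attributes travel with the facts, report queries avoid joins, which is what makes them fast in Clickhouse.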

The Schema was based on Kimball’s dimensional approach and equipped with the following:

  • Proposed table structures, engines, and indexes suitable for fast report queries
  • Schema-less Stage layer (based on the new Clickhouse JSON data type; see the sketch after this list)

  • Replicated tables (cluster)
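
As a minimal sketch of the last two points, assuming hypothetical table names and the standard macro-based replication paths (the JSON column type was still experimental at the time, hence the setting):

    -- the JSON type was experimental and has to be enabled explicitly
    SET allow_experimental_object_type = 1;

    CREATE DATABASE IF NOT EXISTS stage;

    -- schema-less Stage table: the raw event is kept in a single JSON column,
    -- and the ReplicatedMergeTree engine keeps copies in sync across the cluster
    CREATE TABLE stage.events_raw
    (
        received_at  DateTime DEFAULT now(),
        payload      JSON
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/stage_events_raw', '{replica}')
    ORDER BY received_at;

Keeping the Stage layer schema-less means new fields in source events do not require DDL changes; they are extracted later, when the data is transformed into fact and dimension tables.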

We needed to set up data transfer from OLTP systems and tested four methods:

  1. MaterializedMySQL. This experimental Clickhouse database engine replicates MySQL tables into Clickhouse tables. Although it worked for some sets of source tables, it was incompatible with the Percona tool used to apply MySQL schema changes, so the solution was declined.
  2. Full and incremental load by accessing MySQL data directly from Clickhouse SQL code. The MySQL database and table engines, as well as the mysql table function, allow data to be read from MySQL servers. The method is used to source small tables (full reload daily or hourly) or as incremental loads based on checking the destination table data. Incremental loads deliver rapid results for large tables only if there is a helpful index on the MySQL side. The solution is used for a limited set of MySQL source tables.
  3. Kafka topics. Some corporate data is pushed to the Kafka message broker in Google’s Protocol Buffers format. The messages are extracted by Clickhouse’s Kafka Engine and processed internally through a Materialized Views pipeline (see the sketch after this list).
  4. Converting MySQL transactions to Kafka events. The majority of corporate OLTP data is still available as MySQL tables. To capture inserted and changed rows, the MySQL transaction log (binlog) is read, and events are pushed to the Kafka broker into topics named after the source database. We considered two alternatives – Debezium (with Altinity’s Clickhouse-sink-connector) and Maxwell’s Daemon. The latter is an easy-to-maintain tool well suited to a schema-free solution, which is why Debezium and the sink connector were rejected.
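
To make the Kafka-based method more concrete, here is a minimal sketch of a Kafka Engine table plus a Materialized View pipeline in Clickhouse. The broker addresses, topic, Protobuf schema file, and column names are assumptions for illustration, not the actual betPawa configuration:

    CREATE DATABASE IF NOT EXISTS kafka;
    CREATE DATABASE IF NOT EXISTS dwh;

    -- 1. Kafka consumer table: reads Protobuf messages from the topic
    CREATE TABLE kafka.bets_queue
    (
        bet_id     UInt64,
        user_id    UInt64,
        placed_at  DateTime,
        stake      Decimal(18, 2)
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka1:9092,kafka2:9092',
             kafka_topic_list  = 'bets',
             kafka_group_name  = 'clickhouse_bets',
             kafka_format      = 'Protobuf',
             kafka_schema      = 'bets.proto:Bet';

    -- 2. Target table that actually stores the data
    CREATE TABLE dwh.bets
    (
        bet_id     UInt64,
        user_id    UInt64,
        placed_at  DateTime,
        stake      Decimal(18, 2)
    )
    ENGINE = MergeTree
    ORDER BY (placed_at, bet_id);

    -- 3. Materialized view: acts as the consumer loop, moving every
    --    batch read from the Kafka table into the target table
    CREATE MATERIALIZED VIEW dwh.bets_mv TO dwh.bets AS
    SELECT bet_id, user_id, placed_at, stake
    FROM kafka.bets_queue;

The Kafka table itself stores nothing; the materialized view is what continuously pulls messages and writes them into the MergeTree target.
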
  2. Improved data accuracy and real-time insights

Part of the corporate data was processed directly in Kafka without a MySQL intermediary. The other part, available only as MySQL tables, was sent to Kafka topics by the Maxwell’s Daemon tool. This allowed not only for standardized data processing but also for real-time updates.
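
Maxwell’s Daemon wraps each MySQL row change in a JSON envelope with database, table, type, ts, and data fields. A minimal sketch of consuming those events in Clickhouse, with illustrative (not actual) topic and table names, could look like this:

    CREATE DATABASE IF NOT EXISTS kafka;
    CREATE DATABASE IF NOT EXISTS stage;

    -- raw Maxwell events arrive as one JSON document per Kafka message
    CREATE TABLE kafka.maxwell_queue
    (
        message String
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka1:9092',
             kafka_topic_list  = 'maxwell',
             kafka_group_name  = 'clickhouse_maxwell',
             kafka_format      = 'JSONAsString';

    -- landing table for the unpacked change events
    CREATE TABLE stage.mysql_changes
    (
        source_db     LowCardinality(String),
        source_table  LowCardinality(String),
        change_type   LowCardinality(String),   -- insert / update / delete
        changed_at    DateTime,
        row_data      String                    -- the changed row as raw JSON
    )
    ENGINE = MergeTree
    ORDER BY (source_table, changed_at);

    -- unpack the Maxwell envelope and route events into the stage table
    CREATE MATERIALIZED VIEW stage.maxwell_mv TO stage.mysql_changes AS
    SELECT
        JSONExtractString(message, 'database')     AS source_db,
        JSONExtractString(message, 'table')        AS source_table,
        JSONExtractString(message, 'type')         AS change_type,
        toDateTime(JSONExtractUInt(message, 'ts')) AS changed_at,
        JSONExtractRaw(message, 'data')            AS row_data
    FROM kafka.maxwell_queue;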

To avoid data duplication, we enhanced fact and dimension table processing. Data can be duplicated when Kafka offsets are rewound or another failure occurs at the processing stage, and Stage tables cannot prevent this on their own: Clickhouse consumes Kafka topics and partitions in parallel, and because Stage tables are used as queues with generated IDs, the same data can be inserted several times. To deduplicate, we either check the inserted data at the Insert Transform stage or shape the transformation inserts so that the Clickhouse table engine detects duplicates by checksum.
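
As an illustration of the second option, here is a minimal sketch using ReplacingMergeTree, which collapses rows that share the same sorting key; the names are hypothetical. Replicated*MergeTree engines additionally skip re-inserted blocks whose checksums they have already seen, which covers retries of identical batches.

    CREATE DATABASE IF NOT EXISTS dwh;

    -- duplicates with the same bet_id are collapsed during background merges;
    -- the row with the highest updated_at value wins
    CREATE TABLE dwh.fact_bets_dedup
    (
        bet_id      UInt64,
        placed_at   DateTime,
        stake       Decimal(18, 2),
        updated_at  DateTime
    )
    ENGINE = ReplacingMergeTree(updated_at)
    ORDER BY bet_id;

    -- FINAL forces de-duplicated results at query time,
    -- before background merges have caught up
    SELECT bet_id, stake
    FROM dwh.fact_bets_dedup FINAL
    WHERE placed_at >= today();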

  3. A new BI tool – Redash

Before deploying Redash, we ran several rounds of testing on a test server. After training the team and gathering how-tos, we handed them over to DevOps to build the production version. When redesigning the reporting process and creating new, relevant reports, we first built them in Tableau and then recreated them in Redash. Once the testing and bug-fixing stage was done, we gave the Client access.

Tips

  • Share accurate information about your business processes and existing system limitations. This helps data engineers design a system that matches how the business actually works.
  • Think through your future objectives to make the system as scalable as possible.
  • If you lack experience in specific tools, set up regular maintenance checks to ensure proper system performance over time.

