A Flexible and Scalable DWH System Rebuilt from Scratch with Improved Data Processing Time and Quality
Goal
betPawa is a sports betting company that manages large volumes of OLTP data for bet/game processing. They used a DWH system with ETL processes that operated around the clock. Yet, the system lacked stability and reliability. Here are some of the problems that undermined data management:
- Too time-consuming to support
- Some reports lacked OLTP data
- Not flexible enough to be adjusted to new corporate business processes and data structures
- Sometimes, the results took too long to be delivered. For instance, an ETL process that had been running for a whole night failed to deliver the results by morning.
- Reports were based on the previous day’s DWH data, which wasn’t always recent enough for in-depth analysis.
Since the Client lacked both the expertise and human resources to overcome the challenge, they turned to Valiotti Analytics as a reliable partner with extensive experience in BI reporting, DWH, and ETL design.
Results
- A flexible and scalable DWH system
After analyzing the existing DWH system, source data structure, and actual business processes, we started to build a new Schema design. Here are some key achievements:
- Different fact table designs provide ready-made options for extending the DWH with new data sources and business processes.
- All the data flows were moved to Kafka: some without a MySQL intermediary, some (available as MySQL tables only) through MaxWell’s Daemon tool, to catch real-time data updates, standardize ETL processing, and simplify future development.
- Wide Fact tables were built by the ETL processor to be compatible with Clickhouse report queries and to allow for improved processing speeds.
- Fact table rows were validated during ETL, and the ETL logs can surface data inconsistencies in Redash reports, which improves data quality (illustrated in the sketch after this list).
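As a rough illustration of the validation idea above, here is a minimal sketch of how per-step row counts could be logged during ETL and checked from Redash. The `etl_log` table, its columns, and the check query are hypothetical and simplified, not the project’s actual schema.

```sql
-- Hypothetical ETL log table: one row per load step with source and inserted counts.
CREATE TABLE etl_log
(
    run_id        UUID,
    step          String,
    source_rows   UInt64,
    inserted_rows UInt64,
    run_started   DateTime,
    run_finished  DateTime
)
ENGINE = MergeTree
ORDER BY (step, run_started);

-- A Redash-style check query: flag runs where the counts diverge.
SELECT
    step,
    run_started,
    source_rows,
    inserted_rows,
    toInt64(source_rows) - toInt64(inserted_rows) AS row_diff
FROM etl_log
WHERE source_rows != inserted_rows
ORDER BY run_started DESC
LIMIT 100;
```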
The Schema was based on Kimball’s dimensional approach and equipped with the following:
- Table structures, engines, and indexes suited to fast report queries
- A schema-less Stage layer (based on the new Clickhouse JSON data type)
- Replicated tables across the cluster (illustrated in the sketch below)
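To make this layering concrete, below is a minimal DDL sketch under the assumptions above: a schema-less Stage table built on the JSON data type and a replicated wide fact table whose sorting key matches typical report filters. All database, table, column, and cluster names are invented for illustration, and the JSON type may require an experimental setting depending on the Clickhouse version.

```sql
-- Stage layer: schema-less landing table; raw events are kept as JSON.
-- (The JSON type may need an experimental setting enabled, depending on the version.)
CREATE TABLE stage.bet_events
(
    received_at DateTime DEFAULT now(),
    payload     JSON
)
ENGINE = MergeTree
ORDER BY received_at;

-- DWH layer: wide fact table replicated across the cluster;
-- the sorting key is chosen to match common report filters.
CREATE TABLE dwh.fact_bets ON CLUSTER main
(
    bet_id     UInt64,
    user_id    UInt64,
    event_date Date,
    placed_at  DateTime,
    stake      Decimal(18, 2),
    payout     Decimal(18, 2),
    sport      LowCardinality(String),
    country    LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/fact_bets', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, sport, country, bet_id);
```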
We needed to set up data transfer from OLTP systems and tested four methods:
- MaterializedMySQL. This experimental Clickhouse database engine replicates MySQL tables into Clickhouse tables. Although it worked for some sets of source tables, it was incompatible with the Percona Update tool used to make MySQL schema changes, so the option was declined.
- Full and incremental load by accessing MySQL data directly from Clickhouse SQL code. The MySQL database/table engines and the mysql() table function allow data to be read from MySQL servers. The method is used to source small tables (with a full reload daily or hourly) or for incremental loads that check the destination table’s data. Incremental loads deliver rapid results for large tables only if there is a helpful index on the MySQL side. The solution is used for a limited set of MySQL source tables (see the first sketch after this list).
- Kafka topics. Some corporate data is pushed to the Kafka message broker in Google’s Protocol Buffers format. The messages are extracted by Clickhouse’s Kafka Engine and processed internally through a Materialized Views pipeline (see the second sketch after this list).
- Converting MySQL transactions to Kafka events. The majority of corporate OLTP data is still available only as MySQL tables. To capture inserted and changed data, the MySQL transaction log (binlog) is read, and events are pushed to the Kafka broker with topics named after the source database. We considered two alternative options, Debezium and MaxWell’s Daemon: the latter is an easy-to-maintain tool suitable for building a schema-free solution, which is why Altinity’s Clickhouse-sink-connector and Debezium were rejected.
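For the direct-access method, an incremental load through Clickhouse’s mysql() table function can be sketched roughly as below. Host, credentials, database, and table names are placeholders; as noted above, this is only fast for large tables when the cursor column (`id` here) is indexed on the MySQL side.

```sql
-- Incremental load: pull only rows newer than what the destination table already holds.
INSERT INTO dwh.payments
SELECT id, user_id, amount, created_at
FROM mysql('mysql-host:3306', 'billing', 'payments', 'reader', '***')
WHERE id > (SELECT max(id) FROM dwh.payments);
```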
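For the two Kafka-based methods, the in-Clickhouse part of the pipeline looks roughly like the sketch below: a Kafka Engine table consumes the topic, and a Materialized View continuously parses events into a Stage table. Broker, topic, and column names, as well as the JSON envelope (modelled on Maxwell’s message format), are illustrative; the Protocol Buffers topics follow the same pattern with a Protobuf kafka_format and a message schema.

```sql
-- Kafka Engine table: each Maxwell-style event arrives as one raw JSON string.
CREATE TABLE kafka.payments_queue
(
    message String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'billing.payments',
         kafka_group_name  = 'clickhouse-dwh',
         kafka_format      = 'JSONAsString';

-- Materialized View: continuously parses the envelope and pushes rows
-- into the Stage table (an ordinary MergeTree table assumed to exist).
CREATE MATERIALIZED VIEW kafka.payments_mv TO stage.payments AS
SELECT
    JSONExtractString(message, 'type')          AS change_type,
    JSONExtractUInt(message, 'data', 'id')      AS id,
    JSONExtractUInt(message, 'data', 'user_id') AS user_id,
    JSONExtractFloat(message, 'data', 'amount') AS amount,
    parseDateTimeBestEffort(
        JSONExtractString(message, 'data', 'created_at')) AS created_at
FROM kafka.payments_queue;
```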
- Improved data accuracy and real-time insights
Part of the corporate data was processed directly through Kafka without a MySQL intermediary. The other part, available only as MySQL tables, was sent to Kafka topics by the MaxWell’s Daemon tool. This allowed not only for standardized data processing but also for real-time updates.
To avoid data duplication, we enhanced fact and dimensional table processing. Data can be duplicated when Kafka offsets are rewound or another failure occurs at the processing stage, and Stage tables can’t catch this: Clickhouse consumes Kafka topics/partitions in parallel, and because Stage tables act as queues with generated IDs, the same data can be inserted several times. To deduplicate, we either check the inserted data at the Insert Transform stage or shape the transformation inserts so that the Clickhouse table engine spots duplicates by checksum (a rough sketch follows).
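A rough sketch of the checksum-based option, reusing the hypothetical fact table from the earlier sketch: if a transformation insert is re-run with an identical block of rows, or carries an explicit deduplication token, the replicated engine recognizes a block it has already accepted and drops it. The token scheme, filter, and table names are illustrative only.

```sql
-- Replicated*MergeTree keeps checksums of recently inserted blocks and silently
-- drops a block it has already seen; an explicit token makes the retry-safe intent clear.
INSERT INTO dwh.fact_bets
SETTINGS insert_deduplicate = 1,
         insert_deduplication_token = 'fact_bets-2024-05-01-batch-42'  -- hypothetical batch key
SELECT bet_id, user_id, event_date, placed_at, stake, payout, sport, country
FROM stage.bet_events_parsed  -- hypothetical parsed Stage table
WHERE event_date = '2024-05-01';
```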
- A new BI tool – Redash
Before deploying Redash, we ran several trial rounds on a test server. Once the team was trained and the how-tos were collected, they were handed over to DevOps to build the production version. When redesigning the process and coming up with new, relevant reports, we first created them in Tableau and then recreated them in Redash. Once the testing and bug-fixing stage was done, we gave the Client access.
Tips
- Share up-to-date information on business processes and existing system limitations. It allows data engineers to design a system that reflects how the business actually works.
- Think through your future objectives to make the system as scalable as possible.
- If you lack experience in specific tools, set up regular maintenance checks to ensure proper system performance over time.
Learn How Data Insights Can Benefit Your Business
Wondering what value data insights can bring your business? Get in touch, and we'll answer your questions!