At Inigo, we were seeking a database solution that could handle a high volume of raw data for analytics. After exploring various alternatives, including SQLite, Snowflake, and PostgreSQL, we decided to experiment with Materialized Views in ClickHouse on a relatively slow machine. In this post, we share our findings after loading billions of rows of data and conducting measurements.
Before arriving at ClickHouse, we tried several other databases, including SQLite, Snowflake, and PostgreSQL.
ClickHouse is a columnar database specifically designed to handle high volumes of data while providing fast query execution. Materialized views in ClickHouse serve as pre-aggregated datasets that can significantly improve the performance of analytical queries. Furthermore, ClickHouse's ability to automatically manage and update materialized views as new data is ingested simplifies the maintenance process, making it an attractive choice for those seeking an efficient and powerful analytical database solution.
We ultimately chose ClickHouse for its high performance, its ability to handle large scale, and the option to run it locally for testing.
Our current ClickHouse setup is a single database where nearly every Inigo service has its own table, with some tables containing billions of rows of data. It runs on 4 vCPUs, 32GB RAM for the development environment and 16 vCPUs, 128GB RAM for the production environment. Setting up ClickHouse was challenging, as understanding the nuances of each index, sharding, and materialized views required considerable effort. However, the end result has been a system capable of loading and querying vast amounts of analytics data in a scalable manner, both in terms of hardware cost and engineering development time.
Given that we have billions of rows in some of our raw tables, we strive to minimize joins whenever possible. Instead, we have consolidated the data into several materialized views, each grouping the data into distinct time window buckets. As a result, most of the time we are querying thousands of aggregated rows instead of millions of raw data rows, which makes real-time filtering much faster.
Our raw tables all contain a timestamp column observed_at. We then aggregate the data rows into specified intervals for the materialized tables. For example, if the interval is 1 minute, we use the toStartOfInterval(observed_at, INTERVAL 1 minute) function to aggregate data into one minute intervals. Each row in the materialized views contains several raw data values, all corresponding to a particular minute based on the observed_at column. By creating materialized views with different intervals, such as 6 minutes, 40 minutes, and 3 hours, we can further enhance query performance and enable more efficient data analysis across various timeframes.
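As a sketch of this setup, a one-minute materialized view might look like the following. Only the observed_at column and the toStartOfInterval call come from our actual schema; the table and other column names here are hypothetical.

```sql
-- Hypothetical raw table; only observed_at reflects our real schema.
CREATE TABLE requests_raw (
    observed_at DateTime,
    path        String,
    status      UInt16
) ENGINE = MergeTree
ORDER BY observed_at;

-- Aggregate raw rows into one-minute buckets as data is inserted.
CREATE MATERIALIZED VIEW requests_1m
ENGINE = SummingMergeTree
ORDER BY (bucket, path, status)
AS SELECT
    toStartOfInterval(observed_at, INTERVAL 1 minute) AS bucket,
    path,
    status,
    count(1) AS calls
FROM requests_raw
GROUP BY bucket, path, status;
```

Views for coarser windows (6 minutes, 40 minutes, 3 hours) follow the same pattern with a different INTERVAL.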
We exclude all columns containing unique data, such as trace_id, from materialized views. When we need to filter by trace_id, we query the raw data. It is worth noting that raw data queries are reasonably quick in ClickHouse, as ClickHouse can efficiently sift through a large number of rows in real-time using the appropriate index for each column.
In our analytics, we primarily use grouped queries to obtain data. Standard queries and materialized view queries use the same SQL files, but there is a slight modification in the way they handle counting. Instead of using count(1) to count the rows, materialized view queries use sum(calls) with a derived column named calls. The column calls is created by using count(1) as calls in the materialized view definition. This change allows us to aggregate the count of rows from the materialized view, resulting in more efficient querying and better performance.
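The difference between the two query shapes can be sketched like this (requests_raw, requests_1m, path, and bucket are hypothetical names for illustration):

```sql
-- Against the raw table: count rows directly.
SELECT path, count(1) AS calls
FROM requests_raw
WHERE observed_at >= now() - INTERVAL 1 DAY
GROUP BY path;

-- Against a materialized view whose definition includes
-- `count(1) AS calls`: sum the pre-aggregated per-bucket counts.
SELECT path, sum(calls) AS calls
FROM requests_1m
WHERE bucket >= now() - INTERVAL 1 DAY
GROUP BY path;
```

Because the two forms differ only in the counting expression, both can be generated from the same SQL files.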
In ClickHouse, the State and Merge suffixes are used with aggregation functions to handle intermediate aggregation states and merge them efficiently. These suffixes are particularly useful when working with distributed databases or when we need to combine aggregated results from multiple sources.
For example, if we want to calculate the 95th (0.95) and 99th (0.99) percentiles of a column called server_process_time, we can use quantilesState(0.95, 0.99)(server_process_time). This function returns the intermediate state of the quantiles calculation, which we can store in a separate table or combine with other intermediate states.
For instance, to obtain the 99th percentile of the aggregated server_process_time, we would use quantileMerge(0.99)(quantiles_server_process_time), where quantiles_server_process_time is the intermediate state resulting from the quantilesState function.
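Putting the two halves together, the pattern looks roughly like this. The quantilesState/quantileMerge usage and the quantiles_server_process_time and server_process_time names follow the description above; the table names and the AggregatingMergeTree engine choice are assumptions for the sketch.

```sql
-- Store intermediate quantile states, bucketed by minute.
CREATE MATERIALIZED VIEW latency_1m
ENGINE = AggregatingMergeTree
ORDER BY bucket
AS SELECT
    toStartOfInterval(observed_at, INTERVAL 1 minute) AS bucket,
    quantilesState(0.95, 0.99)(server_process_time)
        AS quantiles_server_process_time
FROM requests_raw
GROUP BY bucket;

-- At query time, merge the states and extract the 99th percentile.
SELECT
    bucket,
    quantileMerge(0.99)(quantiles_server_process_time) AS p99
FROM latency_1m
GROUP BY bucket;
```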
By leveraging Materialized Views in ClickHouse, we achieved fast, real-time analytics queries over billions of rows on modest hardware.
The scalability of this solution depends on various factors, including the scaling setup. ClickHouse officially offers a cloud service with 720 GiB of RAM, 180 vCPUs, and 10 TB of data, indicating its potential for handling significant workloads. For our part, we have easily scaled past one billion rows, and our data continues to grow.
While ClickHouse does have limitations, such as the absence of transactions and potential inaccuracies in Materialized Views, these drawbacks are minor for our analytics use case. Overall, ClickHouse, when paired with Materialized Views, has proven to be an excellent solution for handling large-scale analytics data with impressive performance and scalability.