ClickHouse and ColumnStore in the Star Schema Benchmark
There were two new releases in the OpenSource Analytical Databases space, which made me want to evaluate how they perform in the Star Schema Benchmark.
I covered Star Schema Benchmarks a few times before:
- Star Schema Bechmark: InfoBright, InfiniDB and LucidDB
- ClickHouse in a General Analytical Workload (Based on a Star Schema Benchmark)
What are the new releases:
MariaDB 10.5 comes with built-in ColumnStore and an interesting capability to replicate data from InnoDB to ColumnStore in real-time within the same instance, and they named it “hybrid transactional and analytical processing”.
For ClickHouse, there was not a single release, but continuous improvements for the last three years to support JOIN queries, support of updating data, and improved integration with MySQL.
For the ClickHouse testing, I will use the 20.4.6 version.
For the benchmark I will Star Schema Benchmark with scale factor 2500, which in sizes is:
- 15 bln rows in lineorder table
- 1.7T raw data for lineorder table
Star Schema Benchmark prescribes to execute 13 queries and I will measure execution time for each query.
Example of table schemas and queries you can find in my results repo:
Side note: ClickHouse documentation provides examples for Star Schema queries, however, they took a denormalization approach, which is valid, but does not allow to test how original queries with 3-way JOINs would perform, which is my goal in this experiment.
Load Data Results
The first result I want to see is how long it would take to load 15bln rows into the database and what the final table size will be.
time clickhouse-client –query “INSERT INTO lineorder FORMAT CSV” < lineorder.tbl
time cpimport -m1 -s”,” -E ‘”‘ sbtest lineorder lineorder.tbl
root@nodesm:/data/sdd/clickhouse/data# du -sh default
root@nodesm:/data/sdd/cs# du -sh data1
Now I want to compare query execution times for ColumnStore and ClickHouse. I will use queries as close to the original as possible, but for ClickHouse I still need to adjust queries in order to be able to execute them.
We can see that ClickHouse outperforms, sometimes by a huge margin, in all queries but one – Query 3.1. Also, query 4.1 is particularly hard for ClickHouse.
Note on Queries in ClickHouse
I mentioned that to run queries in ClickHouse they still need modifications, so let’s take a look.
If three years ago (see the blog I mentioned earlier, ClickHouse in a General Analytical Workload – Based on a Star Schema Benchmark), a 3-way JOIN query like:
SELECT sum(LO_REVENUE),P_MFGR, toYear(LO_ORDERDATE) yod FROM lineorderfull ,customerfull,partfull WHERE C_REGION = 'ASIA' and LO_CUSTKEY=C_CUSTKEY and P_PARTKEY=LO_PARTKEY GROUP BY P_MFGR,yod ORDER BY P_MFGR,yod;
Had to be written as:
SELECT sum(LO_REVENUE), P_MFGR, toYear(LO_ORDERDATE) AS yod FROM ( SELECT LO_PARTKEY, LO_ORDERDATE, LO_REVENUE FROM lineorderfull ALL INNER JOIN ( SELECT C_REGION, C_CUSTKEY AS LO_CUSTKEY FROM customerfull ) USING (LO_CUSTKEY) WHERE C_REGION = 'ASIA' ) ALL INNER JOIN ( SELECT P_MFGR, P_PARTKEY AS LO_PARTKEY FROM partfull ) USING (LO_PARTKEY) GROUP BY P_MFGR, yod ORDER BY P_MFGR ASC, yod ASC
Now, the multi-table queries can be written as:
SELECT toYear(LO_ORDERDATE) AS year, C_NATION, SUM(LO_REVENUE - LO_SUPPLYCOST) AS profit FROM lineorder INNER JOIN customer ON C_CUSTKEY = LO_CUSTKEY INNER JOIN supplier ON LO_SUPPKEY = S_SUPPKEY INNER JOIN part ON P_PARTKEY = LO_PARTKEY WHERE (C_REGION = 'AMERICA') AND (S_REGION = 'AMERICA') AND ((P_MFGR = 'MFGR#1') OR (P_MFGR = 'MFGR#2')) GROUP BY year, C_NATION ORDER BY year ASC, C_NATION ASC
Pretty much a query with standard JOIN syntax.
Word of Caution:
Even though queries in ClickHouse look like normal JOIN queries, ClickHouse DOES NOT HAVE query optimizer. That means that tables will be joined in the order you have written them, and ClickHouse will not try to change table order for the optimal execution. This task is still on developers.
With human intervention, it is still possible to improve Query 3.1 for ClickHouse, if we write it in the following form (Thanks to Alexander Zaytsev from Altinity for the help!):
SELECT C_NATION, S_NATION, toYear(LO_ORDERDATE) AS year, SUM(LO_REVENUE) AS revenue FROM lineorder INNER JOIN customer ON C_CUSTKEY = LO_CUSTKEY INNER JOIN supplier ON LO_SUPPKEY = S_SUPPKEY WHERE LO_CUSTKEY in (select distinct C_CUSTKEY from customer where C_REGION = 'ASIA') AND LO_SUPPKEY in (select distinct S_SUPPKEY from supplier where S_REGION = 'ASIA') AND AND year >= 1992 AND year <= 1997 GROUP BY C_NATION, S_NATION, year ORDER BY year ASC, revenue DESC;
Execution time for query 3.1 is reduced from 7 min 33 sec to 5 min 44 sec.
It is possible that other queries also can be improved, but I want to keep them in the original form.
Update queries are not part of Star Schema Benchmark, but this question is quite regular.
Can we update data AND how well UPDATE queries are performing, because it can be a challenge for analytical databases?
For this, ClickHouse introduced a special UPDATE syntax, which looks like ALTER TABLE … UPDATE to highlight this is not a typical UPDATE statement.
Also it is worth noting that ALTER TABLE UPDATE in ClickHouse is executed in an asynchronous way, meaning that until it is finished you may see inconsistencies during the period while changes are applied.
So let’s execute the query:
ALTER TABLE lineorder UPDATE LO_QUANTITY = LO_QUANTITY-1 WHERE 1=1;
Please note this query will update 15 bln rows, which is not a trivial task!
We can see how long it will take to apply the changes. For this, I will run the following query
clickhouse-client -m default -q " select toYear(LO_ORDERDATE) year, sum(LO_QUANTITY) from lineorder GROUP BY year;"
in the loop to see when sum(LO_QUANTITY) is adjusted for new values.
The result: It took about 2 mins after ALTER TABLE query for SELECT query to reflect new values.
For the ColumnStore I will use the normal UPDATE syntax:
UPDATE lineorder SET LO_QUANTITY = LO_QUANTITY-1;
The query is performed in a synchronous way and it is transactional.
I waited 6 hours for the query to finish, and at that point, I had to kill it, as I could not wait any longer.
Again, please note the difference in executing updates:
ClickHouse – the update is run asynchronous and is NOT transactional, which likely will result in read queries returning inconsistent results while UPDATE is running.
ColumnStore – the update is synchronous and transactional. Likely this is why it takes that long to update 15bln rows.
So obviously, if it comes to preference, I would choose ClickHouse given query execution times, data load times, and new improved JOIN syntax.
The benefit of ColumnStore is that it comes in a single MySQL-like package (MariaDB-flavor) and there is an easy way to integrate data flow from InnoDB to ColumnStore.
ClickHouse makes improvements for integrations with MySQL (see ClickHouse and MySQL – Better Together, but there is still room for improvements!
For reference – the hardware I used is:
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
NUMA node(s): 2
CPU family: 6
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
and SATA SSD storage and 128GB of RAM
You can find queries and the results on our GitHub repository:
by Vadim Tkachenko via Percona Database Performance Blog