ClickHouse and ColumnStore in the Star Schema Benchmark

ClickHouse and ColumnStore There were two new releases in the OpenSource Analytical Databases space, which made me want to evaluate how they perform in the Star Schema Benchmark.

I covered Star Schema Benchmarks a few times before:

What are the new releases:

MariaDB 10.5 comes with built-in ColumnStore and an interesting capability to replicate data from InnoDB to ColumnStore in real-time within the same instance, and they named it “hybrid transactional and analytical processing”.

For ClickHouse, there was not a single release, but continuous improvements for the last three years to support JOIN queries, support of updating data, and improved integration with MySQL.

For the ClickHouse testing, I will use the 20.4.6 version.

Benchmark

For the benchmark I will Star Schema Benchmark with scale factor 2500, which in sizes is:

15 bln rows in lineorder table
1.7T raw data for lineorder table

Star Schema Benchmark prescribes to execute 13 queries and I will measure execution time for each query.

Example of table schemas and queries you can find in my results repo:

Side note: ClickHouse documentation provides examples for Star Schema queries, however, they took a denormalization approach, which is valid, but does not allow to test how original queries with 3-way JOINs would perform, which is my goal in this experiment.

Load Data Results

The first result I want to see is how long it would take to load 15bln rows into the database and what the final table size will be.

ClickHouse:

time clickhouse-client –query “INSERT INTO lineorder FORMAT CSV” < lineorder.tbl

real 283m44.217s
user 557m13.079s
sys 17m28.207s

ColumnStore:

time cpimport -m1 -s”,” -E ‘”‘ sbtest lineorder lineorder.tbl

real 800m11.175s
user 767m50.866s
sys 22m22.703s

Datasize:

ClickHouse:

root@nodesm:/data/sdd/clickhouse/data# du -sh default
355G default

ColumnStore:

root@nodesm:/data/sdd/cs# du -sh data1
625G data1

ClickHouse and ColumnStore benchmark

Queries

Now I want to compare query execution times for ColumnStore and ClickHouse. I will use queries as close to the original as possible, but for ClickHouse I still need to adjust queries in order to be able to execute them.

ClickHouse and ColumnStore

We can see that ClickHouse outperforms, sometimes by a huge margin, in all queries but one – Query 3.1. Also, query 4.1 is particularly hard for ClickHouse.

Note on Queries in ClickHouse

I mentioned that to run queries in ClickHouse they still need modifications, so let’s take a look.

If three years ago (see the blog I mentioned earlier, ClickHouse in a General Analytical Workload – Based on a Star Schema Benchmark), a 3-way JOIN query like:

SELECT sum(LO_REVENUE),P_MFGR, toYear(LO_ORDERDATE) yod FROM lineorderfull ,customerfull,partfull WHERE C_REGION = 'ASIA' and
LO_CUSTKEY=C_CUSTKEY and P_PARTKEY=LO_PARTKEY GROUP BY P_MFGR,yod ORDER BY P_MFGR,yod;

Had to be written as:

SELECT 
    sum(LO_REVENUE), 
    P_MFGR, 
    toYear(LO_ORDERDATE) AS yod
FROM 
(
    SELECT 
        LO_PARTKEY, 
        LO_ORDERDATE, 
        LO_REVENUE
    FROM lineorderfull 
    ALL INNER JOIN 
    (
        SELECT 
            C_REGION, 
            C_CUSTKEY AS LO_CUSTKEY
        FROM customerfull 
    ) USING (LO_CUSTKEY)
    WHERE C_REGION = 'ASIA'
) 
ALL INNER JOIN 
(
    SELECT 
        P_MFGR, 
        P_PARTKEY AS LO_PARTKEY
    FROM partfull 
) USING (LO_PARTKEY)
GROUP BY 
    P_MFGR, 
    yod
ORDER BY 
    P_MFGR ASC, 
    yod ASC

Now, the multi-table queries can be written as:

SELECT 
    toYear(LO_ORDERDATE) AS year, 
    C_NATION, 
    SUM(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM lineorder
INNER JOIN customer ON C_CUSTKEY = LO_CUSTKEY
INNER JOIN supplier ON LO_SUPPKEY = S_SUPPKEY
INNER JOIN part ON P_PARTKEY = LO_PARTKEY
WHERE (C_REGION = 'AMERICA') AND (S_REGION = 'AMERICA') AND ((P_MFGR = 'MFGR#1') OR (P_MFGR = 'MFGR#2'))
GROUP BY 
    year, 
    C_NATION
ORDER BY 
    year ASC, 
    C_NATION ASC

Pretty much a query with standard JOIN syntax.

Word of Caution:

Even though queries in ClickHouse look like normal JOIN queries, ClickHouse DOES NOT HAVE query optimizer. That means that tables will be joined in the order you have written them, and ClickHouse will not try to change table order for the optimal execution. This task is still on developers.

With human intervention, it is still possible to improve Query 3.1 for ClickHouse, if we write it in the following form (Thanks to Alexander Zaytsev from Altinity for the help!):

SELECT
  C_NATION,
  S_NATION,
 toYear(LO_ORDERDATE) AS year,
   SUM(LO_REVENUE) AS revenue
FROM lineorder
INNER JOIN customer ON C_CUSTKEY = LO_CUSTKEY
INNER JOIN supplier ON LO_SUPPKEY = S_SUPPKEY
WHERE 
LO_CUSTKEY in (select distinct C_CUSTKEY from customer where C_REGION = 'ASIA') AND
LO_SUPPKEY in (select distinct S_SUPPKEY from supplier where S_REGION = 'ASIA') AND
  AND year  >= 1992
  AND year  <= 1997
GROUP BY
  C_NATION,
  S_NATION,
  year
ORDER BY
  year ASC,
  revenue DESC;

Execution time for query 3.1 is reduced from 7 min 33 sec to 5 min 44 sec.

It is possible that other queries also can be improved, but I want to keep them in the original form.

Update Queries

Update queries are not part of Star Schema Benchmark, but this question is quite regular.

Can we update data AND how well UPDATE queries are performing, because it can be a challenge for analytical databases?

For this, ClickHouse introduced a special UPDATE syntax, which looks like ALTER TABLE … UPDATE to highlight this is not a typical UPDATE statement.

Also it is worth noting that ALTER TABLE UPDATE in ClickHouse is executed in an asynchronous way, meaning that until it is finished you may see inconsistencies during the period while changes are applied.

So let’s execute the query:

ALTER TABLE lineorder UPDATE LO_QUANTITY = LO_QUANTITY-1 WHERE 1=1;

ClickHouse:

Please note this query will update 15 bln rows, which is not a trivial task!

We can see how long it will take to apply the changes. For this, I will run the following query

clickhouse-client -m default -q " select toYear(LO_ORDERDATE) year, sum(LO_QUANTITY) from lineorder GROUP BY year;"

in the loop to see when sum(LO_QUANTITY) is adjusted for new values.

The result: It took about 2 mins after ALTER TABLE query for SELECT query to reflect new values.

ColumnStore:

For the ColumnStore I will use the normal UPDATE syntax:

UPDATE lineorder SET LO_QUANTITY = LO_QUANTITY-1;

The query is performed in a synchronous way and it is transactional.

I waited 6 hours for the query to finish, and at that point, I had to kill it, as I could not wait any longer.

Again, please note the difference in executing updates:

ClickHouse – the update is run asynchronous and is NOT transactional, which likely will result in read queries returning inconsistent results while UPDATE is running.

ColumnStore – the update is synchronous and transactional. Likely this is why it takes that long to update 15bln rows.

Final words

So obviously, if it comes to preference, I would choose ClickHouse given query execution times, data load times, and new improved JOIN syntax.

The benefit of ColumnStore is that it comes in a single MySQL-like package (MariaDB-flavor) and there is an easy way to integrate data flow from InnoDB to ColumnStore.

ClickHouse makes improvements for integrations with MySQL (see ClickHouse and MySQL – Better Together, but there is still room for improvements!

Hardware

For reference – the hardware I used is:

Architecture: x86_64
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
and SATA SSD storage and 128GB of RAM

You can find queries and the results on our GitHub repository:

https://github.com/Percona-Lab-results/Clickhouse-ColumnStore-SSB-Jul2020

by Vadim Tkachenko via Percona Database Performance Blog

Search This Blog

Small Business News