Skip to content

trivial view on top of distributed table can significantly impact the performance of some queries #51645

@filimonov

Description

@filimonov
create table x engine = MergeTree ORDER BY tuple() AS SELECT intDiv(number,100000) as a, number as b FROM numbers(1000000000);
SET prefer_localhost_replica = 0;
CREATE TABLE x_dist AS x ENGINE = Distributed(test_cluster_two_shards_localhost, currentDatabase(), x);
CREATE VIEW x_dist_v AS SELECT * FROM x_dist;

-- direct query from distributed table
select a, sum(b) from x_dist group by a ORDER BY sum(b) DESC limit 1;
--Stats: 1 row in set. Elapsed: 5.816 sec. Processed 2.00 billion rows, 32.00 GB (343.89 million rows/s., 5.50 GB/s.)


-- query from view
select a, sum(b) from x_dist_v group by a ORDER BY sum(b) DESC limit 1;
-- Stats: 1 row in set. Elapsed: 17.093 sec. Processed 2.00 billion rows, 32.00 GB (117.01 million rows/s., 1.87 GB/s.)

The differernce is caused by different local queries.

In case of the trivial view it pushes down the innermost subquery (i.e. view):

... executeQuery: ... SELECT `x`.`a`, `x`.`b` FROM `default`.`x` (stage: Complete)

In case of direct query from the distributed table - it constructs the subquery from the main one:

...  executeQuery: ...  SELECT `a`, sum(`b`) FROM `default`.`x` GROUP BY `a` ORDER BY sum(`b`) DESC LIMIT 1 (stage: WithMergeableState)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions