perf(query): fuse no-WHERE multi-key count-only group-by#218

Open
ser-vasilich wants to merge 2 commits into perf/clickbench-improvements from serhii/improve1

@ser-vasilich ser-vasilich commented May 29, 2026

Relax the fused group-by planner gate so a no-WHERE multi-key
count-only shape routes onto exec_filtered_group_multi instead of
the unfused exec_group radix path. ray_filtered_group already
accepts a NULL predicate (the worker then runs with a const-true mask); the
only blocker was the where_expr && term in the gate.
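
A minimal sketch of the "NULL predicate means const-true mask" convention described above. The names row_passes and count_passing are hypothetical and are not taken from ray_filtered_group; the byte-per-row mask layout is also an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the worker's row filter: a NULL mask means
 * "no WHERE clause", so every row passes (const-true). */
static inline int row_passes(const uint8_t *mask, size_t row) {
    return mask == NULL || mask[row] != 0;
}

/* Counts the rows that pass the (possibly absent) predicate. */
size_t count_passing(const uint8_t *mask, size_t n_rows) {
    size_t n = 0;
    for (size_t i = 0; i < n_rows; i++)
        n += (size_t)row_passes(mask, i);
    return n;
}
```

With this convention the same loop serves both the filtered and the unfiltered case, which is why only the planner gate needed to change.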

The gate now fires for no-WHERE queries only when n_keys >= 2 && has_only_count.
Single-key no-WHERE and multi-agg over near-unique composites stay on
exec_group: at very high cardinality, the radix path's
per-(worker, partition) scatter beats a single linear-probe shard.
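
The relaxed gate can be sketched roughly as follows. This is an illustration only: use_fused_group and its parameters are hypothetical names, and any further conditions the real planner gate checks (for example the 16-byte composite budget) are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of the planner gate to the three inputs the
 * description mentions. The real gate may test more than this. */
bool use_fused_group(bool has_where, size_t n_keys, bool has_only_count) {
    if (has_where)
        return true;  /* assumed unchanged: WHERE queries were already eligible */
    /* New: no-WHERE is eligible only for multi-key, count-only shapes. */
    return n_keys >= 2 && has_only_count;
}
```

Under these assumptions, a no-WHERE query grouping by two keys with only count aggregates returns true, while a single-key or multi-aggregate no-WHERE query returns false and stays on exec_group.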

Follow-up commit: narrow I64 results of known-small temporal extracts
(minute / hh / ss / dd / dow / mm / doy / yyyy) to I16 before adding
to the table. This brings q18's composite key under the 16-byte mk_compile
budget, so it fuses too.

ClickBench 10M:

  • q16 744 → 154 ms
  • q18 1748 → 449 ms
  • total 8.0 → 5.2 s

ser-vasilich and others added 2 commits May 30, 2026 15:12
The fused multi-key path already accepts a NULL predicate; only the
planner gate required where_expr.  Allow no-WHERE when n_keys >= 2 AND
count-only.  Single-key no-WHERE and multi-agg over near-unique
composites stay on exec_group's radix — fusing them regresses at very
high cardinality.

ClickBench 10M:
  q16  744 → 154 ms
  total 8.0 → 7.3 s

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mk_compile packs the composite by-key into a 16-byte slot.  An I64
column for minute() (values 0..59) blows the budget on q18's
{UserID, minute, SearchPhrase} composite (~20 bytes) and the query
drops to exec_group.
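
The budget arithmetic can be sketched as below. The per-key widths used in the usage note (8 bytes for UserID, 4 bytes for SearchPhrase) are assumptions chosen to match the "~20 bytes" figure; the actual mk_compile layout, including any padding, is not shown in this PR. composite_fits and MK_SLOT_BYTES are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

enum { MK_SLOT_BYTES = 16 };  /* the 16-byte slot named in the text */

/* Sums the per-key byte widths and reports whether they fit the slot.
 * Ignores alignment and padding, which the real packer may add. */
bool composite_fits(const size_t *key_bytes, size_t n_keys) {
    size_t total = 0;
    for (size_t i = 0; i < n_keys; i++)
        total += key_bytes[i];
    return total <= MK_SLOT_BYTES;
}
```

With the assumed widths, {8, 8, 4} sums to 20 bytes and does not fit, while {8, 2, 4} (minute narrowed to I16) sums to 14 bytes and does.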

After eval'ing a computed by-val whose AST head is minute / hh / ss /
dd / dow / mm / doy / yyyy, downcast the I64 result to I16 before
adding it to the table.  I16 is the smallest type that holds every
output range (year up to 32767, doy up to 366) and still prints as
decimal (U8 prints hex, unreadable for a minute value).

Skipped when the source column has nulls.
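
A rough sketch of the narrowing step, assuming plain arrays rather than the engine's column type. narrow_i64_to_i16 is a hypothetical name, and the explicit range check is a defensive extra added here; the commit itself relies on the known output ranges of the listed extracts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Copies src[0..n) into dst as int16_t. Returns false (caller keeps the
 * I64 column) when the source has nulls or a value does not fit in I16.
 * On a false return, dst may have been partially written. */
bool narrow_i64_to_i16(const int64_t *src, size_t n, bool has_nulls,
                       int16_t *dst) {
    if (has_nulls)
        return false;  /* mirrors "Skipped when the source column has nulls" */
    for (size_t i = 0; i < n; i++) {
        if (src[i] < INT16_MIN || src[i] > INT16_MAX)
            return false;  /* defensive check, not part of the described change */
        dst[i] = (int16_t)src[i];
    }
    return true;
}
```

For minute values in 0..59 or doy values up to 366 the conversion always succeeds, and each key shrinks from 8 bytes to 2.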

ClickBench 10M:
  q18  1748 → 449 ms
  total 6.6 → 5.2 s

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>