feat(mcp): add trace waterfall and breakdown tools by brandon-pereira · Pull Request #2334 · hyperdxio/hyperdx

brandon-pereira · 2026-05-22T22:31:41Z

What

Add two new MCP tools for trace investigation:

`hyperdx_trace_waterfall`

Fetch all spans in a single trace as a parent/child waterfall tree, pre-ordered for human-readable display.

Specific trace mode: pass traceId to fetch a known trace
Auto-pick mode: pass pickFilter + pickBy (slowest / first_error / most_recent) to find one matching trace
Includes correlated log rows when the trace source has a linked logSourceId
Respects maxSpans / maxLogs caps with truncation notes

`hyperdx_trace_top_time_consuming_operations`

Aggregate breakdown of child operations consuming the most cumulative time across traces matching a parent-span filter. Same algorithm as the in-app "Top Most Time Consuming Operations" chart.

Two-stage SQL: find distinct TraceIds matching the parent filter, then aggregate child spans
Supports minParentDurationMs to focus on slow parents
Returns ranked operations with totalTimeMs, calls, inParents, p50Ms, p99Ms, and shareOfTotalTime

Why

MCP agents investigating latency issues need a way to:

Inspect a single concrete trace tree without writing self-JOINs (waterfall)
Ask "where does the time go" across many slow traces of an operation (breakdown)

The existing builder tools (table/timeseries/search) can't express the TraceId subselect pattern needed for the breakdown.

Testing

Integration tests in packages/api/src/mcp/__tests__/trace.test.ts cover:

Schema serialization (expected input properties)
Error paths: non-existent source, non-trace source, invalid time range, bad SQL
Seeded data: waterfall tree structure, auto-pick modes, truncation, correlated logs, breakdown ranking, minParentDurationMs filtering, topN

Run with: make dev-int FILE=trace

Add two new MCP tools for trace investigation: - hyperdx_trace_waterfall: Fetch all spans in a single trace as a parent/child waterfall tree. Supports auto-pick by slowest, first error, or most recent trace. Includes correlated logs when the trace source has a linked logSourceId. - hyperdx_trace_top_time_consuming_operations: Aggregate breakdown of child operations consuming the most cumulative time across traces matching a parent-span filter. Same algorithm as the in-app 'Top Most Time Consuming Operations' chart. Both tools use source-configured expressions for attribute extraction, set readonly:1 on ClickHouse queries for defense-in-depth, and output JSON. Includes integration tests covering schema serialization, error paths, seeded data scenarios (waterfall tree structure, auto-pick modes, truncation, correlated logs, breakdown ranking, minParentDurationMs filtering, topN).

changeset-bot · 2026-05-22T22:31:45Z

🦋 Changeset detected

Latest commit: 9005569

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

Name	Type
@hyperdx/api	Patch
@hyperdx/app	Patch
@hyperdx/otel-collector	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

vercel · 2026-05-22T22:31:46Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ready	Preview, Comment	May 26, 2026 3:40pm

github-actions · 2026-05-22T22:32:02Z

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

Large diff: 1007 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats

Production files changed: 4
Production lines changed: 1007 (+ 877 in test files, excluded from tier calculation)
Branch: brandon/mcp-trace-tools
Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

github-actions · 2026-05-22T22:33:51Z

E2E Test Results

✅ All tests passed • 189 passed • 3 skipped • 1219s

Status	Count
✅ Passed	189
❌ Failed	0
⚠️ Flaky	3
⏭️ Skipped	3

Tests ran across 4 shards in parallel.

View full report →

…llisions otel_traces is not truncated between tests (commented out in clearClickhouseTables), so trace data accumulates across all test suites. Use unique trc-test-/wf-test- prefixed names to prevent collisions with other tests' data.

github-actions · 2026-05-22T22:50:38Z

Deep Review

Scope: PR #2334 — adds two MCP trace tools (hyperdx_trace_waterfall, hyperdx_trace_top_time_consuming_operations) plus tests, against base 633eda61.

Mode: report-only multi-agent review (11 reviewers dispatched in parallel).

🔴 P0/P1 -- must fix

packages/api/src/mcp/tools/trace/breakdown.ts:137-138 -- Tool description documents response fields as total_time_ms, in_parents, p50_ms, p99_ms, but the handler emits camelCase (totalTimeMs, inParents, p50Ms, p99Ms) at lines 333-337, so any agent that reads the documented field names gets undefined for every metric.
- Fix: Align the description's field names with the camelCase keys actually returned by the handler (or alias the SQL output so both match).
- _{api-contract, agent-native}
packages/api/src/mcp/tools/trace/breakdown.ts:302-309 -- clickhouse_settings is built as { readonly: '1', ...Object.fromEntries(querySettings...) }, so a source whose querySettings contains a readonly entry silently overrides the only safeguard preventing DDL/DML via the raw-interpolated parentFilter; the same pattern exists in waterfall.ts:431-439 and :554-564.
- Fix: Spread querySettings first and apply readonly: '1' last ({ ...settings, readonly: '1' }) so the safeguard cannot be overridden by source config.
- _{security, reliability}
packages/api/src/mcp/tools/trace/breakdown.ts:239,255 -- input.parentFilter is interpolated verbatim into the CTE and outer NOT (...) clauses with no AST validation, length cap is 4096 chars, and readonly=1 still permits arbitrary UNION SELECT against tables the connection user can read (other team databases on the same cluster, system.* introspection, etc.); error messages echo back the constructed SQL fragments to aid exploitation.
- Fix: Validate parentFilter (allowlist of operators/columns or AST parse) and stop echoing raw ClickHouse error text — return a generic "invalid parentFilter SQL" with the detail logged server-side only.
- _{security, adversarial, agent-native, reliability}
packages/api/src/mcp/tools/trace/breakdown.ts:278-282 -- new ClickhouseClient({ host, username, password }) is constructed without a queryTimeout, so processClickhouseSettings never injects max_execution_time; a deliberately broad parentFilter combined with maxParentTraces: 1_000_000 can pin a ClickHouse executor for the client's 1-hour requestTimeout default. Same omission at waterfall.ts:264-268 and :522-526.
- Fix: Pass a default queryTimeout (e.g. 30–60 s) when constructing the client, or set max_execution_time directly in each call's clickhouse_settings.
- _{reliability, performance, security, adversarial}

🟡 P2 -- recommended

packages/api/src/mcp/tools/trace/breakdown.ts:255 -- The outer query excludes parent rows via AND NOT (${input.parentFilter}), but when parentFilter is service-only (e.g. ServiceName='svc-a') every same-service child span is also excluded, so the breakdown silently drops in-process work and over-weights cross-service children; the description warns against service-only filters but still accepts them.
- Fix: Switch the exclusion to a (TraceId, SpanId) pair check — collect parent span IDs in the CTE and exclude rows by SpanId NOT IN (...) instead of re-applying the filter expression.
- _{adversarial, correctness}
packages/api/src/mcp/tools/trace/waterfall.ts:159-163 and breakdown.ts:97-99 -- durationDivisor = Math.pow(10, Math.max(0, precision - 3)) returns 1 for durationPrecision ∈ {0,1,2}, so sources with sub-millisecond precision report durations 10ⁿ× too small and minParentDurationMs * divisor (breakdown.ts:275) computes the wrong stored threshold.
- Fix: Drop the Math.max(0, ...) clamp and use Math.pow(10, precision - 3) so negative exponents convert correctly, or reject durationPrecision < 3 at source construction.
- _{correctness, adversarial, testing}
packages/api/src/mcp/tools/trace/waterfall.ts:296-302 -- When pickBy: 'first_error' is combined with a Lucene pickFilter, the code appends the literal string StatusCode:STATUS_CODE_ERROR instead of using the source's statusCodeExpression; sources whose status column is named anything else silently match zero traces, and the SQL branch at lines 293-295 already handles this correctly.
- Fix: Convert the Lucene filter to SQL and AND it with the proper ${statusCodeExpr} = 'STATUS_CODE_ERROR' SQL clause, mirroring the SQL branch.
- _{correctness, adversarial, agent-native, testing}
packages/api/src/mcp/tools/trace/breakdown.ts:236,251 -- breakdown.ts interpolates from.databaseName / from.tableName as raw backtick-wrapped strings (```${dbName}`.`${tableName}```), while waterfall.ts:412-423 uses ClickHouse-native `{db:Identifier}.{tbl:Identifier}` parameter binding; the inconsistency means a tableName containing a backtick is escaped safely in one tool and breaks out of the identifier in the other.
- Fix: Replace the backtick interpolation in breakdown.ts with the same {db:Identifier} / {tbl:Identifier} binding used in waterfall.ts.
- _{security, adversarial}
packages/api/src/mcp/tools/trace/waterfall.ts:479 -- totalDurationMs = Math.max(...spans.map(s => s.durationMs)) is labelled as the trace total but returns the longest single span, which diverges from the root duration whenever the root is truncated out of the result or async spans outlast their parent.
- Fix: Rename to maxSpanDurationMs to match the computation, or change the value to root?.durationMs and surface both fields.
- _{correctness, api-contract, agent-native}
packages/api/src/mcp/tools/trace/breakdown.ts:139,348-363 -- The description states the summary returns a "matched-parent count" but the emitted summary has no such field (only operationsReturned, topN, grandTotalTimeMs, hint), leaving agents without the denominator needed to assess result confidence.
- Fix: Add the distinct-parent-trace count to the SQL output (e.g. via a second cheap query against the CTE or uniqExact in the outer aggregate) and include it in summary as matchedParentTraces.
- _{agent-native, api-contract}
packages/api/src/mcp/tools/trace/breakdown.ts:339-345,360 -- shareOfTotalTime divides each row by the sum of returned top-N rows, not by the cumulative child time across all matching traces, so the percentages inflate when topN is small; the hint describes it as the latter.
- Fix: Either compute grandTotalMs from a separate aggregate over the full set (no LIMIT) or correct the hint text to "share of returned top-N rows".
- _api-contract
packages/api/src/mcp/tools/trace/waterfall.ts:454-456,478-479 -- After slice(0, maxSpans) the orphaned-root logic in buildPreOrderTree can produce many depth-0 spans when truncation drops the real root (e.g. clock-skewed child with earlier timestamp), and the returned rootSpan is then an arbitrary orphan with no indication in the response.
- Fix: When truncated, surface rootSpanCount and a warning when the tree has more than one depth-0 span, or re-fetch the root span explicitly when missing.
- _{correctness, adversarial}
packages/api/src/mcp/tools/trace/waterfall.ts:1 & breakdown.ts:1 -- Both files exceed the 300-line ceiling stated in AGENTS.md "Key Principles §4" (waterfall.ts 620 lines, breakdown.ts 373 lines) and duplicate durationDivisor plus the querySettings→clickhouse_settings spread three times across the two files.
- Fix: Extract durationDivisor, the clickhouse_settings helper, and shared row types into packages/api/src/mcp/tools/trace/helpers.ts, mirroring the existing query/helpers.ts pattern.
- _{project-standards, maintainability, kieran-typescript}
packages/api/src/mcp/__tests__/trace.test.ts -- New behavior with no test coverage: durationPrecision ≠ 9 (would expose the divisor bug), log truncation (logsTruncated/logsNote branches), cross-connection log source (waterfall.ts:511-527), first_error + pickFilterLanguage:'sql' composition, NOT (parentFilter) exclusion is asserted only implicitly, and multi-level (depth > 2) span trees in buildPreOrderTree.
- Fix: Add targeted tests for each branch — particularly precision=3 / precision=6, a >maxLogs log fixture, a second-connection log source, and a deep-tree fixture asserting depth assignment and orphan-as-root behavior.
- _{testing, correctness, adversarial, agent-native, maintainability}

🔵 P3 nitpicks (10)

packages/api/src/mcp/tools/trace/breakdown.ts:267-268 -- The ±60_000ms child-window widening is a magic literal with no named constant.
- Fix: Extract to a module-level CHILD_WINDOW_SLACK_MS = 60_000 constant referenced from both bounds.
packages/api/src/mcp/tools/trace/waterfall.ts:382-396 -- The picker extracts the TraceId column by eliminating known meta keys; any future chart-config renderer change that adds another meta column would silently pick the wrong key.
- Fix: Alias the groupBy in the pick config (e.g. alias: 'picked_trace_id') and look up that exact key.
packages/api/src/mcp/tools/trace/waterfall.ts:482 and breakdown.ts:284 -- LogRow and Row are declared inside the async handler bodies, inconsistent with module-scope SpanRow / TreeSpan.
- Fix: Move both types to module scope; rename Row to RawBreakdownRow to signal it is the pre-normalization wire shape.
packages/api/src/mcp/tools/trace/waterfall.ts:341-345,442,567-568 and breakdown.ts:311-313 -- Repeated casts like (result as { json: () => Promise<T> }).json() discard the typed return shape of BaseResultSet.json<T>() and ResponseJSON.
- Fix: Call result.json<SpanRow[]>() directly and let TypeScript bind the generic; remove the structural cast.
packages/api/src/mcp/tools/trace/breakdown.ts:249-250 -- quantile(0.5)(...) and quantile(0.99)(...) are computed as two independent aggregates with separate sketch state.
- Fix: Use a single quantiles(0.5, 0.99)(${durationExpr}) and unpack the tuple in result processing.
packages/api/src/mcp/tools/trace/breakdown.ts:248 -- uniqExact(${traceIdExpr}) for in_parents is more expensive than an approximate count.
- Fix: Switch to uniq(...) (HyperLogLog) and document the ~2.5% approximation in the description.
packages/api/src/mcp/tools/trace/breakdown.ts:317-321 and waterfall.ts:352,449,575 -- Catch blocks return the raw ${e.message} to the caller, leaking SQL fragments, column types, and schema names.
- Fix: Strip the SQL-fragment tail (after "while processing query:") before returning, or return a generic message and log the detail.
packages/api/src/mcp/__tests__/trace.test.ts:445,684 -- Inline span ID literals ('root_span_0001' = 14 chars, 'child_db_0001' = 13 chars, 'err_root_span01' = 15 chars) violate the OpenTelemetry SpanId 16-hex-char constraint observed elsewhere in the file; tests pass because parent linking compares strings, but any future schema validation breaks them silently.
- Fix: Pad the literals to 16 hex chars (e.g. '1111111111111111' style already used in the waterfall suite).
packages/api/src/mcp/tools/trace/breakdown.ts:142-144 -- The "NEXT STEP" blurb references the tool by the name hyperdx_event_deltas, which is unverified against the actual registered tool name.
- Fix: Confirm the registered tool name in packages/api/src/mcp/tools/query/eventDeltas.ts and update the description if it differs.
packages/api/src/mcp/__tests__/trace.test.ts:305,319-321,430-432 -- Auto-pick tests assert toBeGreaterThan(0) for spanCount and totalDurationMs where exact values are knowable from the seeded fixtures.
- Fix: Assert the exact expected counts and a bounded duration range so an incorrect divisor or truncation regression fails the test.

Reviewers (11): correctness, security, testing, performance, maintainability, project-standards, agent-native, reliability, adversarial, api-contract, kieran-typescript, learnings.

Testing gaps:

No coverage for durationPrecision other than 9 (ns) — masks the divisor bug for low-precision sources.
No coverage for log-truncation (logsTruncated/logsNote over-limit branch).
No coverage for a log source on a different connection than the trace source.
No coverage for first_error + pickFilterLanguage:'sql' composition or first_error + lucene on a source with a non-default statusCodeExpression.
No explicit assertion that root parent spans are excluded from breakdown.operations, nor that the ±60s widening behaves at the boundary.

- Fix first_error pickBy description: 'root span' → 'a span' (matches actual any-error-in-trace semantics) - Add .max(4096) to parentFilter schema to cap input length - Use uniqExact() instead of count(DISTINCT) for in_parents aggregate - Add MCP.md entries for both new trace tools - Add tests: first_error pick mode, sql pickFilterLanguage, logSource edge cases (missing, wrong kind, non-existent)

github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 22, 2026

vercel Bot had a problem deploying to Preview May 22, 2026 22:32 Failure

chore: add changeset for trace tools

0ffa7a3

fix: use string type for readonly clickhouse_settings

e6562cd

vercel Bot deployed to Preview May 22, 2026 22:38 View deployment

vercel Bot deployed to Preview May 22, 2026 22:47 View deployment

vercel Bot deployed to Preview May 22, 2026 22:56 View deployment

brandon-pereira requested review from a team, bot-hyperdx and dhable and removed request for a team and bot-hyperdx May 22, 2026 23:07

karl-power approved these changes May 26, 2026

View reviewed changes

brandon-pereira added the automerge label May 26, 2026

Merge branch 'main' into brandon/mcp-trace-tools

9005569

vercel Bot deployed to Preview May 26, 2026 15:40 View deployment

kodiakhq Bot merged commit 07911fd into main May 26, 2026
18 checks passed

kodiakhq Bot deleted the brandon/mcp-trace-tools branch May 26, 2026 15:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mcp): add trace waterfall and breakdown tools#2334

feat(mcp): add trace waterfall and breakdown tools#2334
kodiakhq[bot] merged 6 commits into
mainfrom
brandon/mcp-trace-tools

brandon-pereira commented May 22, 2026

Uh oh!

changeset-bot Bot commented May 22, 2026 •

edited

Loading

Uh oh!

vercel Bot commented May 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

brandon-pereira commented May 22, 2026

What

hyperdx_trace_waterfall

hyperdx_trace_top_time_consuming_operations

Why

Testing

Uh oh!

changeset-bot Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

vercel Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔴 Tier 4 — Critical

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Test Results

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review

🔴 P0/P1 -- must fix

🟡 P2 -- recommended

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`hyperdx_trace_waterfall`

`hyperdx_trace_top_time_consuming_operations`

changeset-bot Bot commented May 22, 2026 •

edited

Loading

vercel Bot commented May 22, 2026 •

edited

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading