Skip to content

feat(mcp): add trace waterfall and breakdown tools#2334

Merged
kodiakhq[bot] merged 6 commits into
mainfrom
brandon/mcp-trace-tools
May 26, 2026
Merged

feat(mcp): add trace waterfall and breakdown tools#2334
kodiakhq[bot] merged 6 commits into
mainfrom
brandon/mcp-trace-tools

Conversation

@brandon-pereira
Copy link
Copy Markdown
Member

What

Add two new MCP tools for trace investigation:

hyperdx_trace_waterfall

Fetch all spans in a single trace as a parent/child waterfall tree, pre-ordered for human-readable display.

  • Specific trace mode: pass traceId to fetch a known trace
  • Auto-pick mode: pass pickFilter + pickBy (slowest / first_error / most_recent) to find one matching trace
  • Includes correlated log rows when the trace source has a linked logSourceId
  • Respects maxSpans / maxLogs caps with truncation notes

hyperdx_trace_top_time_consuming_operations

Aggregate breakdown of child operations consuming the most cumulative time across traces matching a parent-span filter. Same algorithm as the in-app "Top Most Time Consuming Operations" chart.

  • Two-stage SQL: find distinct TraceIds matching the parent filter, then aggregate child spans
  • Supports minParentDurationMs to focus on slow parents
  • Returns ranked operations with totalTimeMs, calls, inParents, p50Ms, p99Ms, and shareOfTotalTime

Why

MCP agents investigating latency issues need a way to:

  1. Inspect a single concrete trace tree without writing self-JOINs (waterfall)
  2. Ask "where does the time go" across many slow traces of an operation (breakdown)

The existing builder tools (table/timeseries/search) can't express the TraceId subselect pattern needed for the breakdown.

Testing

Integration tests in packages/api/src/mcp/__tests__/trace.test.ts cover:

  • Schema serialization (expected input properties)
  • Error paths: non-existent source, non-trace source, invalid time range, bad SQL
  • Seeded data: waterfall tree structure, auto-pick modes, truncation, correlated logs, breakdown ranking, minParentDurationMs filtering, topN

Run with: make dev-int FILE=trace

Add two new MCP tools for trace investigation:

- hyperdx_trace_waterfall: Fetch all spans in a single trace as a
  parent/child waterfall tree. Supports auto-pick by slowest, first
  error, or most recent trace. Includes correlated logs when the trace
  source has a linked logSourceId.

- hyperdx_trace_top_time_consuming_operations: Aggregate breakdown of
  child operations consuming the most cumulative time across traces
  matching a parent-span filter. Same algorithm as the in-app
  'Top Most Time Consuming Operations' chart.

Both tools use source-configured expressions for attribute extraction,
set readonly:1 on ClickHouse queries for defense-in-depth, and output
JSON.

Includes integration tests covering schema serialization, error paths,
seeded data scenarios (waterfall tree structure, auto-pick modes,
truncation, correlated logs, breakdown ranking, minParentDurationMs
filtering, topN).
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 22, 2026

🦋 Changeset detected

Latest commit: 9005569

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 26, 2026 3:40pm

Request Review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 1007 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 4
  • Production lines changed: 1007 (+ 877 in test files, excluded from tier calculation)
  • Branch: brandon/mcp-trace-tools
  • Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 22, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

E2E Test Results

All tests passed • 189 passed • 3 skipped • 1219s

Status Count
✅ Passed 189
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

…llisions

otel_traces is not truncated between tests (commented out in
clearClickhouseTables), so trace data accumulates across all test
suites. Use unique trc-test-/wf-test- prefixed names to prevent
collisions with other tests' data.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

Deep Review

Scope: PR #2334 — adds two MCP trace tools (hyperdx_trace_waterfall, hyperdx_trace_top_time_consuming_operations) plus tests, against base 633eda61.

Mode: report-only multi-agent review (11 reviewers dispatched in parallel).

🔴 P0/P1 -- must fix

  • packages/api/src/mcp/tools/trace/breakdown.ts:137-138 -- Tool description documents response fields as total_time_ms, in_parents, p50_ms, p99_ms, but the handler emits camelCase (totalTimeMs, inParents, p50Ms, p99Ms) at lines 333-337, so any agent that reads the documented field names gets undefined for every metric.

    • Fix: Align the description's field names with the camelCase keys actually returned by the handler (or alias the SQL output so both match).
    • api-contract, agent-native
  • packages/api/src/mcp/tools/trace/breakdown.ts:302-309 -- clickhouse_settings is built as { readonly: '1', ...Object.fromEntries(querySettings...) }, so a source whose querySettings contains a readonly entry silently overrides the only safeguard preventing DDL/DML via the raw-interpolated parentFilter; the same pattern exists in waterfall.ts:431-439 and :554-564.

    • Fix: Spread querySettings first and apply readonly: '1' last ({ ...settings, readonly: '1' }) so the safeguard cannot be overridden by source config.
    • security, reliability
  • packages/api/src/mcp/tools/trace/breakdown.ts:239,255 -- input.parentFilter is interpolated verbatim into the CTE and outer NOT (...) clauses with no AST validation, length cap is 4096 chars, and readonly=1 still permits arbitrary UNION SELECT against tables the connection user can read (other team databases on the same cluster, system.* introspection, etc.); error messages echo back the constructed SQL fragments to aid exploitation.

    • Fix: Validate parentFilter (allowlist of operators/columns or AST parse) and stop echoing raw ClickHouse error text — return a generic "invalid parentFilter SQL" with the detail logged server-side only.
    • security, adversarial, agent-native, reliability
  • packages/api/src/mcp/tools/trace/breakdown.ts:278-282 -- new ClickhouseClient({ host, username, password }) is constructed without a queryTimeout, so processClickhouseSettings never injects max_execution_time; a deliberately broad parentFilter combined with maxParentTraces: 1_000_000 can pin a ClickHouse executor for the client's 1-hour requestTimeout default. Same omission at waterfall.ts:264-268 and :522-526.

    • Fix: Pass a default queryTimeout (e.g. 30–60 s) when constructing the client, or set max_execution_time directly in each call's clickhouse_settings.
    • reliability, performance, security, adversarial

🟡 P2 -- recommended

  • packages/api/src/mcp/tools/trace/breakdown.ts:255 -- The outer query excludes parent rows via AND NOT (${input.parentFilter}), but when parentFilter is service-only (e.g. ServiceName='svc-a') every same-service child span is also excluded, so the breakdown silently drops in-process work and over-weights cross-service children; the description warns against service-only filters but still accepts them.

    • Fix: Switch the exclusion to a (TraceId, SpanId) pair check — collect parent span IDs in the CTE and exclude rows by SpanId NOT IN (...) instead of re-applying the filter expression.
    • adversarial, correctness
  • packages/api/src/mcp/tools/trace/waterfall.ts:159-163 and breakdown.ts:97-99 -- durationDivisor = Math.pow(10, Math.max(0, precision - 3)) returns 1 for durationPrecision ∈ {0,1,2}, so sources with sub-millisecond precision report durations 10ⁿ× too small and minParentDurationMs * divisor (breakdown.ts:275) computes the wrong stored threshold.

    • Fix: Drop the Math.max(0, ...) clamp and use Math.pow(10, precision - 3) so negative exponents convert correctly, or reject durationPrecision < 3 at source construction.
    • correctness, adversarial, testing
  • packages/api/src/mcp/tools/trace/waterfall.ts:296-302 -- When pickBy: 'first_error' is combined with a Lucene pickFilter, the code appends the literal string StatusCode:STATUS_CODE_ERROR instead of using the source's statusCodeExpression; sources whose status column is named anything else silently match zero traces, and the SQL branch at lines 293-295 already handles this correctly.

    • Fix: Convert the Lucene filter to SQL and AND it with the proper ${statusCodeExpr} = 'STATUS_CODE_ERROR' SQL clause, mirroring the SQL branch.
    • correctness, adversarial, agent-native, testing
  • packages/api/src/mcp/tools/trace/breakdown.ts:236,251 -- breakdown.ts interpolates from.databaseName / from.tableName as raw backtick-wrapped strings (```${dbName}`.`${tableName}```), while waterfall.ts:412-423 uses ClickHouse-native `{db:Identifier}.{tbl:Identifier}` parameter binding; the inconsistency means a tableName containing a backtick is escaped safely in one tool and breaks out of the identifier in the other.

    • Fix: Replace the backtick interpolation in breakdown.ts with the same {db:Identifier} / {tbl:Identifier} binding used in waterfall.ts.
    • security, adversarial
  • packages/api/src/mcp/tools/trace/waterfall.ts:479 -- totalDurationMs = Math.max(...spans.map(s => s.durationMs)) is labelled as the trace total but returns the longest single span, which diverges from the root duration whenever the root is truncated out of the result or async spans outlast their parent.

    • Fix: Rename to maxSpanDurationMs to match the computation, or change the value to root?.durationMs and surface both fields.
    • correctness, api-contract, agent-native
  • packages/api/src/mcp/tools/trace/breakdown.ts:139,348-363 -- The description states the summary returns a "matched-parent count" but the emitted summary has no such field (only operationsReturned, topN, grandTotalTimeMs, hint), leaving agents without the denominator needed to assess result confidence.

    • Fix: Add the distinct-parent-trace count to the SQL output (e.g. via a second cheap query against the CTE or uniqExact in the outer aggregate) and include it in summary as matchedParentTraces.
    • agent-native, api-contract
  • packages/api/src/mcp/tools/trace/breakdown.ts:339-345,360 -- shareOfTotalTime divides each row by the sum of returned top-N rows, not by the cumulative child time across all matching traces, so the percentages inflate when topN is small; the hint describes it as the latter.

    • Fix: Either compute grandTotalMs from a separate aggregate over the full set (no LIMIT) or correct the hint text to "share of returned top-N rows".
    • api-contract
  • packages/api/src/mcp/tools/trace/waterfall.ts:454-456,478-479 -- After slice(0, maxSpans) the orphaned-root logic in buildPreOrderTree can produce many depth-0 spans when truncation drops the real root (e.g. clock-skewed child with earlier timestamp), and the returned rootSpan is then an arbitrary orphan with no indication in the response.

    • Fix: When truncated, surface rootSpanCount and a warning when the tree has more than one depth-0 span, or re-fetch the root span explicitly when missing.
    • correctness, adversarial
  • packages/api/src/mcp/tools/trace/waterfall.ts:1 & breakdown.ts:1 -- Both files exceed the 300-line ceiling stated in AGENTS.md "Key Principles §4" (waterfall.ts 620 lines, breakdown.ts 373 lines) and duplicate durationDivisor plus the querySettingsclickhouse_settings spread three times across the two files.

    • Fix: Extract durationDivisor, the clickhouse_settings helper, and shared row types into packages/api/src/mcp/tools/trace/helpers.ts, mirroring the existing query/helpers.ts pattern.
    • project-standards, maintainability, kieran-typescript
  • packages/api/src/mcp/__tests__/trace.test.ts -- New behavior with no test coverage: durationPrecision ≠ 9 (would expose the divisor bug), log truncation (logsTruncated/logsNote branches), cross-connection log source (waterfall.ts:511-527), first_error + pickFilterLanguage:'sql' composition, NOT (parentFilter) exclusion is asserted only implicitly, and multi-level (depth > 2) span trees in buildPreOrderTree.

    • Fix: Add targeted tests for each branch — particularly precision=3 / precision=6, a >maxLogs log fixture, a second-connection log source, and a deep-tree fixture asserting depth assignment and orphan-as-root behavior.
    • testing, correctness, adversarial, agent-native, maintainability
🔵 P3 nitpicks (10)
  • packages/api/src/mcp/tools/trace/breakdown.ts:267-268 -- The ±60_000ms child-window widening is a magic literal with no named constant.

    • Fix: Extract to a module-level CHILD_WINDOW_SLACK_MS = 60_000 constant referenced from both bounds.
  • packages/api/src/mcp/tools/trace/waterfall.ts:382-396 -- The picker extracts the TraceId column by eliminating known meta keys; any future chart-config renderer change that adds another meta column would silently pick the wrong key.

    • Fix: Alias the groupBy in the pick config (e.g. alias: 'picked_trace_id') and look up that exact key.
  • packages/api/src/mcp/tools/trace/waterfall.ts:482 and breakdown.ts:284 -- LogRow and Row are declared inside the async handler bodies, inconsistent with module-scope SpanRow / TreeSpan.

    • Fix: Move both types to module scope; rename Row to RawBreakdownRow to signal it is the pre-normalization wire shape.
  • packages/api/src/mcp/tools/trace/waterfall.ts:341-345,442,567-568 and breakdown.ts:311-313 -- Repeated casts like (result as { json: () => Promise<T> }).json() discard the typed return shape of BaseResultSet.json<T>() and ResponseJSON.

    • Fix: Call result.json<SpanRow[]>() directly and let TypeScript bind the generic; remove the structural cast.
  • packages/api/src/mcp/tools/trace/breakdown.ts:249-250 -- quantile(0.5)(...) and quantile(0.99)(...) are computed as two independent aggregates with separate sketch state.

    • Fix: Use a single quantiles(0.5, 0.99)(${durationExpr}) and unpack the tuple in result processing.
  • packages/api/src/mcp/tools/trace/breakdown.ts:248 -- uniqExact(${traceIdExpr}) for in_parents is more expensive than an approximate count.

    • Fix: Switch to uniq(...) (HyperLogLog) and document the ~2.5% approximation in the description.
  • packages/api/src/mcp/tools/trace/breakdown.ts:317-321 and waterfall.ts:352,449,575 -- Catch blocks return the raw ${e.message} to the caller, leaking SQL fragments, column types, and schema names.

    • Fix: Strip the SQL-fragment tail (after "while processing query:") before returning, or return a generic message and log the detail.
  • packages/api/src/mcp/__tests__/trace.test.ts:445,684 -- Inline span ID literals ('root_span_0001' = 14 chars, 'child_db_0001' = 13 chars, 'err_root_span01' = 15 chars) violate the OpenTelemetry SpanId 16-hex-char constraint observed elsewhere in the file; tests pass because parent linking compares strings, but any future schema validation breaks them silently.

    • Fix: Pad the literals to 16 hex chars (e.g. '1111111111111111' style already used in the waterfall suite).
  • packages/api/src/mcp/tools/trace/breakdown.ts:142-144 -- The "NEXT STEP" blurb references the tool by the name hyperdx_event_deltas, which is unverified against the actual registered tool name.

    • Fix: Confirm the registered tool name in packages/api/src/mcp/tools/query/eventDeltas.ts and update the description if it differs.
  • packages/api/src/mcp/__tests__/trace.test.ts:305,319-321,430-432 -- Auto-pick tests assert toBeGreaterThan(0) for spanCount and totalDurationMs where exact values are knowable from the seeded fixtures.

    • Fix: Assert the exact expected counts and a bounded duration range so an incorrect divisor or truncation regression fails the test.

Reviewers (11): correctness, security, testing, performance, maintainability, project-standards, agent-native, reliability, adversarial, api-contract, kieran-typescript, learnings.

Testing gaps:

  • No coverage for durationPrecision other than 9 (ns) — masks the divisor bug for low-precision sources.
  • No coverage for log-truncation (logsTruncated/logsNote over-limit branch).
  • No coverage for a log source on a different connection than the trace source.
  • No coverage for first_error + pickFilterLanguage:'sql' composition or first_error + lucene on a source with a non-default statusCodeExpression.
  • No explicit assertion that root parent spans are excluded from breakdown.operations, nor that the ±60s widening behaves at the boundary.

- Fix first_error pickBy description: 'root span' → 'a span' (matches
  actual any-error-in-trace semantics)
- Add .max(4096) to parentFilter schema to cap input length
- Use uniqExact() instead of count(DISTINCT) for in_parents aggregate
- Add MCP.md entries for both new trace tools
- Add tests: first_error pick mode, sql pickFilterLanguage, logSource
  edge cases (missing, wrong kind, non-existent)
@brandon-pereira brandon-pereira requested review from a team, bot-hyperdx and dhable and removed request for a team and bot-hyperdx May 22, 2026 23:07
@kodiakhq kodiakhq Bot merged commit 07911fd into main May 26, 2026
18 checks passed
@kodiakhq kodiakhq Bot deleted the brandon/mcp-trace-tools branch May 26, 2026 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants