bug: Metrics emitted after waitpoint resume are unattributed — TaskContextMetricExporter strips RUN_ID because taskContext.disable() is never reversed #3672

@tentatshu

Problem

On managed Trigger Cloud, all process.* and nodejs.* auto-metrics emitted by a task after it resumes from a waitpoint are recorded with empty RUN_ID, TASK_SLUG, and ATTEMPT_NUMBER attributes. They are still ingested into the metrics ClickHouse table and visible by machine_id, but they're invisible to the obvious per-run query:

SELECT * FROM metrics WHERE run_id = 'run_…'

For long-running tasks that hit a waitpoint (e.g. wait.for(), wait.until(), human-approval / token waitpoints), this means most of a multi-hour run's memory/CPU/heap history is unattributable to the run that produced it, which makes per-run capacity planning and post-mortem debugging much harder.

Reproduction

Run a task that executes for several minutes, hits a waitpoint, then resumes and executes for several minutes more. Then group the metrics rows for that machine_id by the attribution columns:

SELECT run_id, task_identifier, attempt_number, count() AS rows,
       min(bucket_start) AS first, max(bucket_start) AS last
FROM metrics
WHERE machine_id = 'machine_xxxxxxxxxxxxxxxxxxxxx'
GROUP BY run_id, task_identifier, attempt_number
ORDER BY first ASC

run_id                        task_identifier  attempt  rows   first    last
run_xxxxxxxxxxxxxxxxxxxxxxxx  my-task          1        910    T+00:00  T+15:00 (waitpoint hit)
(empty)                       (empty)          0        6,230  T+15:00  T+1h59m (run completes)

Continuous emission throughout — same machine_id, no process restart visible at the metric layer. The transition happens at the exact moment the waitpoint is reached.

Reproduces on every run of this task that uses a waitpoint, regardless of whether the run completes successfully or fails.

Root cause

Source pinned to v4.4.5.

  1. At a waitpoint, the supervisor sends the worker an IPC FLUSH { disableContext: true }. The handler calls taskContext.disable():

    https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/cli-v3/src/entryPoints/managed-run-worker.ts#L633-L638

  2. taskContext.disable() sets _runDisabled = true:

    https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/index.ts#L106-L108

  3. Meanwhile the OTel PeriodicExportingMetricReader keeps firing every 10 s, and each export goes through TaskContextMetricExporter.export(), which branches on isRunDisabled:

    https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/otelProcessors.ts#L138-L199

    In the isRunDisabled branch (lines 149-157), only the env/project/org/machine attrs are kept; RUN_ID, TASK_SLUG, and ATTEMPT_NUMBER are stripped. Comment: "Between runs: keep environment/project/org/machine attrs, strip run-specific ones".

  4. The only call site that flips _runDisabled back to false is setGlobalTaskContext():

    https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/index.ts#L110-L113

    …which is invoked only from the worker's EXECUTE_TASK_RUN IPC (managed-run-worker.ts:421). The RESOLVE_WAITPOINT IPC handler (managed-run-worker.ts:639-641) just resolves the runtime's waitpoint promise — it doesn't touch taskContext.

So under process keep-alive, when a run resumes from a waitpoint in the same Node process, _runDisabled stays true for the remainder of that run, and every metric the periodic reader emits is tagged with empty run context. The user task code resumes and runs to completion, but its telemetry is invisible per-run.
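
The four steps above can be condensed into a small self-contained model. This is a sketch of the state flow as described in this issue, not the trigger.dev source: TaskContextModel, attributesForExport, handleMessage, simulateWaitpointRun, and the attribute keys are all illustrative names, and it assumes disable() leaves the stored run context in place and only sets the flag.

```typescript
// Simplified model of the state flow described in this issue.
// NOT the trigger.dev source: class, function, and attribute names are illustrative.

type Attributes = Record<string, string | number>;

interface RunContext {
  runId: string;
  taskSlug: string;
  attemptNumber: number;
  machineId: string;
}

type WorkerMessage =
  | { type: "EXECUTE_TASK_RUN"; ctx: RunContext }
  | { type: "FLUSH"; disableContext: boolean }
  | { type: "RESOLVE_WAITPOINT" };

class TaskContextModel {
  private ctx: RunContext | undefined;
  private runDisabled = false;

  // Models setGlobalTaskContext(): the only place the flag is cleared.
  setGlobalTaskContext(ctx: RunContext): void {
    this.ctx = ctx;
    this.runDisabled = false;
  }

  // Models taskContext.disable(). Assumption: the stored context is kept.
  disable(): void {
    this.runDisabled = true;
  }

  get isRunDisabled(): boolean {
    return this.runDisabled;
  }

  get current(): RunContext | undefined {
    return this.ctx;
  }
}

// Models the isRunDisabled branch in the metric exporter.
function attributesForExport(tc: TaskContextModel): Attributes {
  const ctx = tc.current;
  if (ctx === undefined) return {};
  const base: Attributes = { machine_id: ctx.machineId };
  if (tc.isRunDisabled) return base; // run-specific attributes stripped
  return {
    ...base,
    run_id: ctx.runId,
    task_slug: ctx.taskSlug,
    attempt_number: ctx.attemptNumber,
  };
}

// Models the worker's IPC handlers as described in steps 1 and 4.
function handleMessage(tc: TaskContextModel, msg: WorkerMessage): void {
  switch (msg.type) {
    case "EXECUTE_TASK_RUN":
      tc.setGlobalTaskContext(msg.ctx);
      break;
    case "FLUSH":
      if (msg.disableContext) tc.disable();
      break;
    case "RESOLVE_WAITPOINT":
      // Resolves the waitpoint promise only; the task context is untouched.
      break;
  }
}

// Takes one export snapshot at each phase of a run that hits a waitpoint.
function simulateWaitpointRun() {
  const tc = new TaskContextModel();
  const ctx: RunContext = {
    runId: "run_123",
    taskSlug: "my-task",
    attemptNumber: 1,
    machineId: "machine_abc",
  };

  handleMessage(tc, { type: "EXECUTE_TASK_RUN", ctx });
  const beforeWaitpoint = attributesForExport(tc);

  handleMessage(tc, { type: "FLUSH", disableContext: true });
  const whileWaiting = attributesForExport(tc);

  handleMessage(tc, { type: "RESOLVE_WAITPOINT" });
  const afterResume = attributesForExport(tc);

  return { beforeWaitpoint, whileWaiting, afterResume };
}
```

In this model only the beforeWaitpoint snapshot carries the run attributes; afterResume is identical to whileWaiting because nothing between FLUSH and the end of the run clears the flag.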

Environment

  • Managed Trigger Cloud (us-east-1)
  • @trigger.dev/sdk 4.4.3 (problem applies to v4.4.5 source per inspection above)
  • Long-running task that hits a waitpoint partway through
  • processKeepAliveEnabled not explicitly set — using Cloud defaults

Related

#3556 — different surface symptom (lost telemetry on shutdown via Promise.all in TracingSDK.flush()), but touches the same general TaskContextMetricExporter / BufferingMetricExporter pipeline and may share root causes.

Possible fixes

Happy to send a PR, but I wanted to check direction first in case there is a preference. Some ideas:

  1. Re-enable on waitpoint resume. Add a taskContext.enable() method (just flips _runDisabled = false) and call it when a waitpoint resolves. Smallest diff. The right hook may be runtime-side at the point where await wait.for(...) returns rather than the RESOLVE_WAITPOINT IPC, since the IPC fires before the user code is "back". Open question: was there a billing/usage reason disable() was chosen at waitpoints in the first place?

  2. Don't disable() at the waitpoint at all. Reserve disable() for true between-runs state. Cleaner conceptually but I don't have full context on why disable-at-waitpoint was added.

  3. Stop stripping RUN_ID in TaskContextMetricExporter. Keep the run attrs even when isRunDisabled is true. Easiest diff but contradicts the comment in the exporter — likely violates intent.
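
For option 1, a minimal sketch of the shape the change might take. Everything here is hypothetical: enable() does not exist in v4.4.5, TaskContextWithEnable is a stand-in for the real taskContext API, and waitWithContextRestore is an invented helper illustrating a runtime-side hook that restores the flag at the point where the wait returns to user code.

```typescript
// Hypothetical sketch of option 1. `enable()` is a proposed method, and this
// class is a stand-in for the real taskContext, not the trigger.dev source.
class TaskContextWithEnable {
  private runDisabled = false;

  // Existing behaviour per the issue: set the flag at a waitpoint.
  disable(): void {
    this.runDisabled = true;
  }

  // Proposed addition: reverse disable() without replacing the run context.
  enable(): void {
    this.runDisabled = false;
  }

  get isRunDisabled(): boolean {
    return this.runDisabled;
  }
}

// Hypothetical runtime-side hook: restore the flag when the wait settles,
// just before control returns to user code. `finally` also covers a rejected
// waitpoint, since user code continues in that case too.
async function waitWithContextRestore<T>(
  tc: TaskContextWithEnable,
  waitpoint: Promise<T>
): Promise<T> {
  try {
    return await waitpoint;
  } finally {
    tc.enable();
  }
}

// Demonstration: the flag is set while waiting and cleared once the wait returns.
async function demoResume(): Promise<{ whileWaiting: boolean; afterResume: boolean }> {
  const tc = new TaskContextWithEnable();

  let resolveWaitpoint!: (value: string) => void;
  const waitpoint = new Promise<string>((resolve) => {
    resolveWaitpoint = resolve;
  });

  const pending = waitWithContextRestore(tc, waitpoint); // user code starts waiting
  tc.disable(); // stands in for FLUSH { disableContext: true }
  const whileWaiting = tc.isRunDisabled;

  resolveWaitpoint("approved"); // stands in for RESOLVE_WAITPOINT
  await pending;

  return { whileWaiting, afterResume: tc.isRunDisabled };
}
```

Placing the restore in the wait wrapper rather than in the IPC handler keeps the re-enable tied to the moment user code continues, which matches the concern about the RESOLVE_WAITPOINT IPC firing early. Whether this interacts with billing or usage accounting remains the open question noted in option 1.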
