Problem
On managed Trigger Cloud, all process.* and nodejs.* auto-metrics emitted by a task after it resumes from a waitpoint are recorded with empty RUN_ID, TASK_SLUG, and ATTEMPT_NUMBER attributes. They are still ingested into the metrics ClickHouse table and visible by machine_id, but they're invisible to the obvious per-run query:
SELECT * FROM metrics WHERE run_id = 'run_…'
For long-running tasks that hit a waitpoint (e.g. wait.for(), wait.until(), human-approval / token waitpoints), this means most of a multi-hour run's memory/CPU/heap history is unattributable to the run that produced it, which makes per-run capacity planning and post-mortem debugging much harder.
Reproduction
A task that runs for several minutes, hits a waitpoint, then resumes and runs for several minutes more. Grouped by attribution columns on the same machine_id:
SELECT run_id, task_identifier, attempt_number, count() AS rows,
min(bucket_start) AS first, max(bucket_start) AS last
FROM metrics
WHERE machine_id = 'machine_xxxxxxxxxxxxxxxxxxxxx'
GROUP BY run_id, task_identifier, attempt_number
ORDER BY first ASC
| run_id | task_identifier | attempt | rows | first | last |
| --- | --- | --- | --- | --- | --- |
| run_xxxxxxxxxxxxxxxxxxxxxxxx | my-task | 1 | 910 | T+00:00 | T+15:00 (waitpoint hit) |
| (empty) | (empty) | 0 | 6,230 | T+15:00 | T+1h59m (run completes) |
Continuous emission throughout — same machine_id, no process restart visible at the metric layer. The transition happens at the exact moment the waitpoint is reached.
Reproduces on every run of this task that uses a waitpoint, regardless of whether the run completes successfully or fails.
Root cause
Source pinned to v4.4.5.
- At a waitpoint, the supervisor sends the worker an IPC FLUSH { disableContext: true }. The handler calls taskContext.disable():
  https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/cli-v3/src/entryPoints/managed-run-worker.ts#L633-L638
- taskContext.disable() sets _runDisabled = true:
  https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/index.ts#L106-L108
- Meanwhile the OTel PeriodicExportingMetricReader keeps firing every 10 s, and each export goes through TaskContextMetricExporter.export(), which branches on isRunDisabled:
  https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/otelProcessors.ts#L138-L199
  In the isRunDisabled branch (lines 149-157), only env/project/org/machine attrs are kept; RUN_ID, TASK_SLUG, and ATTEMPT_NUMBER are stripped. Comment: "Between runs: keep environment/project/org/machine attrs, strip run-specific ones".
- The only call site that flips _runDisabled back to false is setGlobalTaskContext():
  https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/index.ts#L110-L113
  …which is invoked only from the worker's EXECUTE_TASK_RUN IPC (managed-run-worker.ts:421). The RESOLVE_WAITPOINT IPC handler (managed-run-worker.ts:639-641) just resolves the runtime's waitpoint promise; it doesn't touch taskContext.
So under process keep-alive, when a run resumes from a waitpoint in the same Node process, _runDisabled stays true for the remainder of that run, and every metric the periodic reader emits is tagged with empty run context. The user task code resumes and runs to completion, but its telemetry is invisible per-run.
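To make the sequence above concrete, here is a minimal, self-contained model of the flag and the exporter branch. It is a sketch only: TaskContextModel, exportAttributes, simulateRunWithWaitpoint, and the example attribute values are hypothetical names made up for illustration, not the actual trigger.dev source, and the IPC messages are represented by plain method calls.

```typescript
// Simplified, hypothetical model of the behaviour described above.
// None of these names come from the trigger.dev codebase.
type Attrs = Record<string, string | number>;

class TaskContextModel {
  private runDisabled = false;
  private runAttrs: Attrs = {};

  // Plays the role of setGlobalTaskContext(): the only path that clears the flag.
  setGlobalTaskContext(runAttrs: Attrs): void {
    this.runAttrs = runAttrs;
    this.runDisabled = false;
  }

  // Plays the role of taskContext.disable(), reached via FLUSH { disableContext: true }.
  disable(): void {
    this.runDisabled = true;
  }

  get isRunDisabled(): boolean {
    return this.runDisabled;
  }

  get attributes(): Attrs {
    return this.runAttrs;
  }
}

// Assumed machine-level attributes that survive in both branches.
const MACHINE_ATTRS: Attrs = { machine_id: "machine_example" };

// Plays the role of the exporter branch: run attrs are dropped while disabled.
function exportAttributes(ctx: TaskContextModel): Attrs {
  return ctx.isRunDisabled
    ? { ...MACHINE_ATTRS }
    : { ...MACHINE_ATTRS, ...ctx.attributes };
}

interface SimulatedExports {
  beforeWaitpoint: Attrs;
  whileWaiting: Attrs;
  afterResume: Attrs;
}

function simulateRunWithWaitpoint(): SimulatedExports {
  const ctx = new TaskContextModel();

  // EXECUTE_TASK_RUN: run context is set and enabled.
  ctx.setGlobalTaskContext({
    run_id: "run_example",
    task_identifier: "my-task",
    attempt_number: 1,
  });
  const beforeWaitpoint = exportAttributes(ctx);

  // Waitpoint reached: FLUSH { disableContext: true }.
  ctx.disable();
  const whileWaiting = exportAttributes(ctx);

  // RESOLVE_WAITPOINT: only the waitpoint promise is resolved,
  // so nothing here touches ctx and the flag stays set.
  const afterResume = exportAttributes(ctx);

  return { beforeWaitpoint, whileWaiting, afterResume };
}
```

In this model the export taken after resume carries only machine_id, which matches the second row of the reproduction table.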
Environment
- Managed Trigger Cloud (us-east-1)
- @trigger.dev/sdk 4.4.3 (problem applies to v4.4.5 source per inspection above)
- Long-running task that hits a waitpoint partway through
- processKeepAliveEnabled not explicitly set (using Cloud defaults)
Related
#3556 — different surface symptom (lost telemetry on shutdown via Promise.all in TracingSDK.flush()), but touches the same general TaskContextMetricExporter / BufferingMetricExporter pipeline and may share root causes.
Possible fixes
Happy to send a PR, but I wanted to check direction first in case there is a preference. Some ideas:
- Re-enable on waitpoint resume. Add a taskContext.enable() method (just flips _runDisabled = false) and call it when a waitpoint resolves. Smallest diff. The right hook may be runtime-side at the point where await wait.for(...) returns rather than the RESOLVE_WAITPOINT IPC, since the IPC fires before the user code is "back". Open question: was there a billing/usage reason disable() was chosen at waitpoints in the first place?
- Don't disable() at the waitpoint at all. Reserve disable() for true between-runs state. Cleaner conceptually but I don't have full context on why disable-at-waitpoint was added.
- Stop stripping RUN_ID in TaskContextMetricExporter. Keep the run attrs even when isRunDisabled is true. Easiest diff but contradicts the comment in the exporter, so it likely violates intent.
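As a rough illustration of the first option, the sketch below adds an enable() method to a stand-in context class and calls it from a runtime-side wrapper once the waitpoint promise settles. Everything here is an assumption for discussion: TaskContextSketch, metricAttributes, and waitWithReattribution are hypothetical names, enable() does not exist in v4.4.5, and the real hook point would depend on how the runtime surfaces waitpoint resolution.

```typescript
// Hypothetical sketch of the "re-enable on waitpoint resume" option.
// None of these names exist in trigger.dev v4.4.5.
type RunAttrs = Record<string, string | number>;

class TaskContextSketch {
  private runDisabled = false;
  private readonly runAttrs: RunAttrs;

  constructor(runAttrs: RunAttrs) {
    this.runAttrs = runAttrs;
  }

  // Existing behaviour: set when the waitpoint is reached.
  disable(): void {
    this.runDisabled = true;
  }

  // Proposed addition: clear the flag without replacing the run context.
  enable(): void {
    this.runDisabled = false;
  }

  // Stand-in for what the metric exporter would attach to each data point.
  metricAttributes(): RunAttrs {
    return this.runDisabled ? {} : { ...this.runAttrs };
  }
}

// Assumed runtime-side hook: re-enable after the waitpoint promise settles,
// so the next periodic export carries run attributes again.
async function waitWithReattribution<T>(
  ctx: TaskContextSketch,
  waitpoint: Promise<T>
): Promise<T> {
  try {
    return await waitpoint;
  } finally {
    // Runs on both resolve and reject, since user code continues either way.
    ctx.enable();
  }
}
```

Putting enable() in a finally block is a design choice in this sketch: a rejected waitpoint still hands control back to user code, so its metrics should arguably be attributed to the run as well.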