bug: Metrics emitted after waitpoint resume are unattributed — TaskContextMetricExporter strips RUN_ID because taskContext.disable() is never reversed

## Problem

On managed Trigger Cloud, all `process.*` and `nodejs.*` auto-metrics emitted by a task **after** it resumes from a waitpoint are recorded with empty `RUN_ID`, `TASK_SLUG`, and `ATTEMPT_NUMBER` attributes. They are still ingested into the `metrics` ClickHouse table and visible by `machine_id`, but they're invisible to the obvious per-run query:

```sql
SELECT * FROM metrics WHERE run_id = 'run_…'
```

For long-running tasks that hit a waitpoint (e.g. `wait.for()`, `wait.until()`, human-approval / token waitpoints), this means **most of a multi-hour run's memory/CPU/heap history is unattributable to the run that produced it**, which makes per-run capacity planning and post-mortem debugging much harder.

## Reproduction

Run a task that executes for several minutes, hits a waitpoint, then resumes and keeps running. Grouping that task's metrics by the attribution columns on a single `machine_id` shows the split:

```sql
SELECT run_id, task_identifier, attempt_number, count() AS rows,
       min(bucket_start) AS first, max(bucket_start) AS last
FROM metrics
WHERE machine_id = 'machine_xxxxxxxxxxxxxxxxxxxxx'
GROUP BY run_id, task_identifier, attempt_number
ORDER BY first ASC
```

| run_id | task_identifier | attempt | rows | first | last |
|---|---|---|---|---|---|
| `run_xxxxxxxxxxxxxxxxxxxxxxxx` | `my-task` | 1 | 910 | `T+00:00` | `T+15:00` (waitpoint hit) |
| *(empty)* | *(empty)* | 0 | **6,230** | **`T+15:00`** | **`T+1h59m`** (run completes) |

Emission is continuous throughout: the `machine_id` is the same and no process restart is visible at the metric layer. The transition happens at the exact moment the waitpoint is reached.

Reproduces on every run of this task that uses a waitpoint, regardless of whether the run completes successfully or fails.

## Root cause

Source pinned to `v4.4.5`.

1. At a waitpoint, the supervisor sends the worker an IPC `FLUSH { disableContext: true }`. The handler calls `taskContext.disable()`:

   https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/cli-v3/src/entryPoints/managed-run-worker.ts#L633-L638

2. `taskContext.disable()` sets `_runDisabled = true`:

   https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/index.ts#L106-L108

3. Meanwhile the OTel `PeriodicExportingMetricReader` keeps firing every 10 s, and each export goes through `TaskContextMetricExporter.export()`, which branches on `isRunDisabled`:

   https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/otelProcessors.ts#L138-L199

   In the `isRunDisabled` branch (lines 149-157), only env/project/org/machine attrs are kept; `RUN_ID`, `TASK_SLUG`, and `ATTEMPT_NUMBER` are stripped. Comment: *"Between runs: keep environment/project/org/machine attrs, strip run-specific ones"*.

4. The only call site that flips `_runDisabled` back to `false` is `setGlobalTaskContext()`:

   https://github.com/triggerdotdev/trigger.dev/blob/v4.4.5/packages/core/src/v3/taskContext/index.ts#L110-L113

   …which is invoked only from the worker's `EXECUTE_TASK_RUN` IPC (`managed-run-worker.ts:421`). The `RESOLVE_WAITPOINT` IPC handler (`managed-run-worker.ts:639-641`) just resolves the runtime's waitpoint promise — it doesn't touch `taskContext`.

So under process keep-alive, when a run resumes from a waitpoint in the same Node process, `_runDisabled` stays `true` for the remainder of that run, and every metric the periodic reader emits is tagged with empty run context. The user task code resumes and runs to completion, but its telemetry is invisible per-run.
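The sequence above can be condensed into a small stand-alone model. This is only a sketch of the behaviour as I read it from the `v4.4.5` source: `TaskContextModel`, `attrsForExport`, and `simulateWaitpointRun` are hypothetical names, the attribute keys are illustrative, and none of this is the real trigger.dev code.

```typescript
// Simplified, hypothetical model of the behaviour described in steps 1-4.
// None of these names are the real trigger.dev API surface.

type RunContext = { runId: string; taskSlug: string; attemptNumber: number };
type MetricAttrs = Record<string, string | number>;

class TaskContextModel {
  private ctx: RunContext | undefined;
  private runDisabled = false;

  // Mirrors step 4: the only place the flag is reset to false.
  setGlobalTaskContext(ctx: RunContext): void {
    this.ctx = ctx;
    this.runDisabled = false;
  }

  // Mirrors step 2: called from the FLUSH { disableContext: true } handler.
  disable(): void {
    this.runDisabled = true;
  }

  get isRunDisabled(): boolean {
    return this.runDisabled;
  }

  get current(): RunContext | undefined {
    return this.ctx;
  }
}

// Mirrors step 3: run-specific attributes are dropped while disabled.
function attrsForExport(tc: TaskContextModel, machineId: string): MetricAttrs {
  const base: MetricAttrs = { machine_id: machineId };
  const ctx = tc.current;
  if (tc.isRunDisabled || ctx === undefined) {
    return base;
  }
  return {
    ...base,
    run_id: ctx.runId,
    task_slug: ctx.taskSlug,
    attempt_number: ctx.attemptNumber,
  };
}

function simulateWaitpointRun(): MetricAttrs[] {
  const tc = new TaskContextModel();
  const exported: MetricAttrs[] = [];

  // EXECUTE_TASK_RUN: context is set, exports are attributed.
  tc.setGlobalTaskContext({ runId: "run_demo", taskSlug: "my-task", attemptNumber: 1 });
  exported.push(attrsForExport(tc, "machine_demo"));

  // FLUSH { disableContext: true } at the waitpoint: exports lose run attrs.
  tc.disable();
  exported.push(attrsForExport(tc, "machine_demo"));

  // RESOLVE_WAITPOINT: only the waitpoint promise resolves; nothing
  // re-enables the context, so exports stay unattributed after resume.
  exported.push(attrsForExport(tc, "machine_demo"));

  return exported;
}
```

In this model the first export carries `run_id`, while the second and third carry only `machine_id`, which matches the two rows in the reproduction table.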

## Environment

- Managed Trigger Cloud (us-east-1)
- `@trigger.dev/sdk` `4.4.3` (problem applies to `v4.4.5` source per inspection above)
- Long-running task that hits a waitpoint partway through
- `processKeepAliveEnabled` not explicitly set — using Cloud defaults

## Related

#3556 — different surface symptom (lost telemetry on shutdown via `Promise.all` in `TracingSDK.flush()`), but touches the same general `TaskContextMetricExporter` / `BufferingMetricExporter` pipeline and may share root causes.

## Possible fixes

Happy to send a PR, but I wanted to check direction first in case there is a preference. Some ideas:

1. **Re-enable on waitpoint resume.** Add a `taskContext.enable()` method (just flips `_runDisabled = false`) and call it when a waitpoint resolves. Smallest diff. The right hook may be runtime-side at the point where `await wait.for(...)` returns rather than the `RESOLVE_WAITPOINT` IPC, since the IPC fires before the user code is "back". Open question: was there a billing/usage reason `disable()` was chosen at waitpoints in the first place?

2. **Don't `disable()` at the waitpoint at all.** Reserve `disable()` for true between-runs state. Cleaner conceptually but I don't have full context on why disable-at-waitpoint was added.

3. **Stop stripping RUN_ID in `TaskContextMetricExporter`.** Keep the run attrs even when `isRunDisabled` is true. Easiest diff but contradicts the comment in the exporter — likely violates intent.
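To make option 1 concrete, here is a rough sketch of what I have in mind. It assumes a new `enable()` method, which does not exist in `v4.4.5`, and a hypothetical runtime-side helper `resumeFromWaitpoint` standing in for wherever `await wait.for(...)` actually returns; `TaskContextSketch` is a stand-in class, not the real `taskContext`.

```typescript
// Hypothetical sketch of fix 1. `enable()` is a proposed method, and
// `resumeFromWaitpoint` is an illustrative helper, not a real SDK function.

class TaskContextSketch {
  private runDisabled = false;

  get isRunDisabled(): boolean {
    return this.runDisabled;
  }

  // Existing behaviour: called when the worker handles FLUSH { disableContext: true }.
  disable(): void {
    this.runDisabled = true;
  }

  // Proposed addition: flips the flag back without replacing the context.
  enable(): void {
    this.runDisabled = false;
  }
}

// Re-enable the context at the point where user code regains control,
// whether the waitpoint resolved or rejected.
async function resumeFromWaitpoint<T>(
  tc: TaskContextSketch,
  waitpoint: Promise<T>
): Promise<T> {
  try {
    return await waitpoint;
  } finally {
    tc.enable();
  }
}
```

Putting `enable()` in a `finally` block is a deliberate choice in this sketch: if the waitpoint rejects, user code is still running again, so its metrics should be attributed too.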

