Skip to content

feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140

Open
delthas wants to merge 8 commits into
development/9.4from
improvement/CLDSRV-884/otel-instrumentation
Open

feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140
delthas wants to merge 8 commits into
development/9.4from
improvement/CLDSRV-884/otel-instrumentation

Conversation

@delthas
Copy link
Copy Markdown
Contributor

@delthas delthas commented Apr 2, 2026

Summary

Add OpenTelemetry tracing instrumentation to cloudserver, gated behind ENABLE_OTEL=true. When the flag is unset, no @opentelemetry/* package is loaded — zero overhead off the OTEL path.

The four atomic commits:

  1. feat: add OpenTelemetry tracing with trust boundaries and probe filtering

    • NodeSDK with OTLP/HTTP trace exporter, default 1% sampling
    • ParentBasedSampler({ root: TraceIdRatioBasedSampler(ratio) }) so inbound traceparent sampled=1 from NGINX/Beyla is honored end-to-end (was being re-sampled by a plain ratio sampler before)
    • Three explicit instrumentation packages, declared as direct deps — no auto-instrumentations-node bundle (which pulled ~36 unused instrumentations: pg, mysql, kafkajs, cassandra, oracle, etc.):
      • instrumentation-http with ignoreIncomingRequestHook (drops k8s probes + OPTIONS) and requestHook (strips traceparent/tracestate on outbound requests to hosts not in the trusted allowlist; client span is preserved and tagged scality.trace.suppressed=true)
      • instrumentation-mongodb with low-cardinality settings (no collection names in span names, no captured query payloads)
      • instrumentation-ioredis with requireParentSpan: true (no orphan spans from background stats/rate-limit jobs)
    • buildTrustedHosts(config) derives the allowlist from cloudserver's own Config (vaultd, dataClient, metadataClient, pfsClient, cdmi, bucketd, KMIP, KMS, scuba, utapi, mongodb replica set, backbeat, managementAgent + hdclient/sproxyd bootstrap entries from locationConstraints for direct Scality storage connectors), plus PUSH_ENDPOINT/MANAGEMENT_ENDPOINT env vars and loopback aliases. Unit test asserts every config host shape is included so the derivation stays honest.
    • Probe/scrape filter (lib/tracing/healthPaths.js): /live, /ready, /_/healthcheck, /_/healthcheck/deep, /metrics produce no spans
    • diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.WARN) so SDK errors (export failures, malformed traceparent, etc.) surface instead of disappearing
    • Span limits: attributeValueLengthLimit: 4096, attributeCountLimit: 128, eventCountLimit: 128, linkCountLimit: 128 to bound memory under pathological cases
    • logRecordProcessors: [], metricReaders: [] — explicit traces-only; without these, NodeSDK silently spins up OTLP log + metric exporters
  2. feat: flush OTEL on shutdown

    • shutdownOtel() wired into S3Server.cleanUp()'s shutdown chain (between server.close() and process.exit(0)), capped at 5 s via Promise.race so an unreachable collector can't block past Kubernetes' default 30 s terminationGracePeriodSeconds
    • caughtExceptionShutdown() also attempts a best-effort flush before process.exit(1) on the non-cluster path
    • .finally() guard so any unexpected promise rejection in the shutdown chain still exits the process
    • Inbound trace-context extraction is intentionally NOT done manually here: @opentelemetry/instrumentation-http already extracts traceparent and creates the server span as a child of the remote parent. A manual extract on top would replace the active context with the (non-recording) remote parent and demote downstream api.<handler> spans to siblings instead of children of the HTTP server span. Removed from the original draft.
  3. feat: instrument all S3 API handlers with OTEL spans

    • lib/instrumentation/simple.js exports a single helper instrumentApiMethod(handler, methodName) that wraps an S3 handler in an api.<methodName> span owning the entire handler execution (auth, body parsing, metadata I/O, data path, finalizers). Auto-instrumentation spans (HTTP, MongoDB, ioredis) nest beneath. ~70 distinct span names total — well within trace-backend limits.
    • lib/api/api.js applies the wrapper via Object.entries(api) loop with a NON_HANDLER_KEYS skip set (callApiMethod, checkAuthResults, handleAuthorizationResults). New S3 handlers added to the literal are auto-instrumented; no per-handler boilerplate.
    • The wrapper handles callback / Promise / sync return shapes, has a single-end-once guard, sets cloudserver.error_code on the error path
  4. chore: bump arsenal to ARSN-572 PoC branch for e2e trace context testing

    • Temporary: pins arsenal at the ARSN-572 PoC branch (scality/Arsenal#2611), which adds traceContext plumbing to MongoDB metadata writes (object oplog entries carry the trace context that caused them, so async consumers — Backbeat, Sorbet, lifecycle — can continue the trace across the oplog/queue boundary)
    • To be reverted to a real 8.3.x release tag once ARSN-572 ships

Origin

Extracted and cleaned up from William Lardier's user/test/wlardier/servicemesh-2 branch (based on development/9.0, July–Sep 2025), which mixed OTEL with performance optimizations and debug artifacts. This PR contains only the OTEL instrumentation, rebased onto development/9.3. Dropped:

  • lib/otelContextPropagation.js (dead code; never imported)
  • MOCK_DOAUTH / lib/api/apiUtils/mock/backendMocks.js (production-dangerous auth-result caching, unrelated to tracing)
  • ~15 console.log() debug artifacts, commented-out span code
  • All performance optimizations (manual GC, monitorLatency, releaseRequestContexts, arsenal perf pin) — out of scope; will land separately if needed

Tests

29 unit tests (22 in tests/unit/lib/otel.spec.js, 7 in tests/unit/lib/instrumentationSimple.spec.js) covering extractHost, buildTrustedHosts (full config snapshot + locationConstraints connector enumeration), makeRequestHook (outbound trust check, inbound IncomingMessage skip), isHealthPath, and the full lifecycle of instrumentApiMethod (callback success/error, double-end guard, async resolve/reject, sync throw, OTEL-disabled bypass).

Context

  • Parent investigation: OS-1072
  • Companion arsenal change: scality/Arsenal#2611 (ARSN-572) — to be merged before this PR can lose its temporary branch pin
  • Scality ADR mandates OpenTelemetry across all products. The storage layer (hdcontroller 1.12+ / hyperiod) is already OTEL-instrumented; this PR makes cloudserver complete the chain via @opentelemetry/instrumentation-http's automatic outbound traceparent propagation

Issue: CLDSRV-884

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 2, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 2, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 89.15663% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.26%. Comparing base (3f7ce59) to head (0610440).
⚠️ Report is 2 commits behind head on development/9.4.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
lib/api/api.js 89.47% 10 Missing ⚠️
lib/server.js 64.00% 9 Missing ⚠️
lib/tracing/index.js 86.36% 6 Missing ⚠️
lib/instrumentation/simple.js 96.29% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
lib/tracing/healthPaths.js 100.00% <100.00%> (ø)
lib/tracing/trustedHosts.js 100.00% <100.00%> (ø)
lib/instrumentation/simple.js 96.29% <96.29%> (ø)
lib/tracing/index.js 86.36% <86.36%> (ø)
lib/server.js 78.09% <64.00%> (-1.52%) ⬇️
lib/api/api.js 90.87% <89.47%> (+0.19%) ⬆️

... and 1 file with indirect coverage changes

@@                 Coverage Diff                 @@
##           development/9.4    #6140      +/-   ##
===================================================
+ Coverage            85.21%   85.26%   +0.05%     
===================================================
  Files                  207      211       +4     
  Lines                13832    13971     +139     
===================================================
+ Hits                 11787    11913     +126     
- Misses                2045     2058      +13     
Flag Coverage Δ
file-ft-tests 68.69% <48.19%> (-0.56%) ⬇️
kmip-ft-tests 28.17% <41.36%> (-0.10%) ⬇️
mongo-v0-ft-tests 69.88% <48.19%> (-0.53%) ⬇️
mongo-v1-ft-tests 69.90% <48.19%> (-0.58%) ⬇️
multiple-backend 36.61% <42.97%> (-0.18%) ⬇️
sur-tests 35.47% <42.97%> (-0.17%) ⬇️
sur-tests-inflights 37.35% <42.97%> (-0.22%) ⬇️
unit 72.03% <83.93%> (+0.18%) ⬆️
utapi-v2-tests 34.60% <43.37%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 9c93a1d to 97e63fc Compare April 2, 2026 16:32
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/otel.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 97e63fc to f978280 Compare April 2, 2026 16:35
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from f978280 to 06eea4e Compare April 2, 2026 16:49
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 06eea4e to 75c3afb Compare April 2, 2026 17:02
Comment thread lib/instrumentation/simple.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json Outdated
Comment thread lib/otel.js Outdated
Comment thread lib/instrumentation/simple.js Outdated
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 75c3afb to 207b785 Compare April 13, 2026 13:05
Comment thread lib/instrumentation/simple.js Outdated
Comment thread package.json
Comment thread tests/unit/lib/instrumentationSimple.spec.js
Comment thread tests/unit/lib/tracing/healthPaths.spec.js Outdated
Comment thread tests/unit/lib/tracing/init.spec.js
Comment thread tests/unit/lib/tracing/trustedHosts.spec.js Outdated
Comment thread package.json
Comment thread tests/unit/lib/instrumentationSimple.spec.js Outdated
Comment thread tests/unit/lib/tracing/healthPaths.spec.js Outdated
Comment thread tests/unit/lib/tracing/init.spec.js Outdated
Comment thread tests/unit/lib/tracing/trustedHosts.spec.js Outdated
@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented May 21, 2026

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following reviewers are expecting changes from the author, or must review again:

Comment thread package.json
Pre-existing prettier debt on lib/api/api.js and lib/server.js — the
9.4 base did not run prettier on these. Extracted as a separate
commit so the subsequent CLDSRV-884 OTEL commits show only logic
changes, not formatting noise.

Issue: CLDSRV-884
Comment thread package.json
Comment thread package.json
Brings in the OTEL Node SDK, OTLP/HTTP exporter, and the three
instrumentation packages we explicitly opt into (http, mongodb,
ioredis). Split out of the main CLDSRV-884 work so reviewers can
verify the dependency surface independently of the code that
consumes it.

Issue: CLDSRV-884
Comment thread package.json
Comment thread lib/server.js Outdated
Comment thread package.json
Comment thread package.json
Comment thread lib/tracing/trustedHosts.js
delthas added 6 commits May 21, 2026 15:28
Pure helper: parses OTEL_TRUSTED_HOSTS into a Set, exposes a
requestHook that strips traceparent/tracestate from outbound
HTTP requests to hosts outside the set. Loopback aliases are
always trusted so dev / local tooling work without extra config.

Used in the next commit's bootstrap (HttpInstrumentation requestHook).

Issue: CLDSRV-884
Pure helper: matches request URLs against a set of known
probe/scrape paths (k8s liveness/readiness, metrics scrape).
Wired into HttpInstrumentation.ignoreIncomingRequestHook in
the next commit so probe traffic doesn't dominate the trace
volume.

Issue: CLDSRV-884
The two pieces that wire OTEL into the process:

  - lib/tracing/index.js: init() / close() / isEnabled() facade.
    Boots the Node SDK with an explicit allowlist of three
    instrumentations (http, mongodb, ioredis), a ParentBased
    TraceIdRatio sampler, the trust-boundary requestHook, and
    the health-probe ignore hook. Cluster primary is skipped
    (workers re-exec the entry and init themselves). close()
    flushes the BatchSpanProcessor with a bounded deadline so
    SIGTERM can't hang past the kubelet's grace period.

  - lib/instrumentation/simple.js: instrumentApiMethod() — wraps
    a callback-style or async S3 handler in a span that ends
    when the underlying handler settles. The next commits use
    this to instrument all 80+ S3 API methods.

  - index.js: calls tracing.init() before requiring server.js
    so the require-in-the-middle patches land before any
    HTTP/Mongo/ioredis module is loaded.

Issue: CLDSRV-884
Comment thread package.json
@claude
Copy link
Copy Markdown

claude Bot commented May 21, 2026

  • Arsenal is pinned to a branch instead of a release tag (version downgrade from 8.4.2 to 8.3.11). Pin to the exact commit hash or wait for the ARSN-572 tag to ship.

    The OTEL instrumentation itself is well-structured: gating behind ENABLE_OTEL=true with zero overhead when off, explicit traces-only config, proper shutdown flushing with a 5s deadline, race-safe close(), idempotent span ending, trust-boundary enforcement on outbound traceparent, and good test coverage (29 unit tests). No issues found in the tracing, instrumentation, or server lifecycle changes beyond the dependency pin.

    Review by Claude Code

@delthas
Copy link
Copy Markdown
Contributor Author

delthas commented May 21, 2026

Updates since the last review pass:

  • Extracted a first chore: prettier-format files touched by CLDSRV-884 commit, so the subsequent OTEL commits show only logic changes (no formatting noise).
  • Split the original big OTEL commit into 4 smaller ones (deps, trust-boundary host filter, health-probe filter, bootstrap + manual span helper), each reviewable in isolation.
  • Trust list is now an OTEL_TRUSTED_HOSTS env var supplied by the operator (comma-joined bare hostnames). Cloudserver no longer self-derives from Config. Deploying the env var is tracked in ZKOP-551 (PR).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants