feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140
feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140delthas wants to merge 8 commits into
Conversation
Hello delthas,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
Codecov Report❌ Patch coverage is
Additional details and impacted files
... and 1 file with indirect coverage changes @@ Coverage Diff @@
## development/9.4 #6140 +/- ##
===================================================
+ Coverage 85.21% 85.26% +0.05%
===================================================
Files 207 211 +4
Lines 13832 13971 +139
===================================================
+ Hits 11787 11913 +126
- Misses 2045 2058 +13
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
9c93a1d to
97e63fc
Compare
97e63fc to
f978280
Compare
f978280 to
06eea4e
Compare
06eea4e to
75c3afb
Compare
75c3afb to
207b785
Compare
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
The following reviewers are expecting changes from the author, or must review again: |
Pre-existing prettier debt on lib/api/api.js and lib/server.js — the 9.4 base did not run prettier on these. Extracted as a separate commit so the subsequent CLDSRV-884 OTEL commits show only logic changes, not formatting noise. Issue: CLDSRV-884
Brings in the OTEL Node SDK, OTLP/HTTP exporter, and the three instrumentation packages we explicitly opt into (http, mongodb, ioredis). Split out of the main CLDSRV-884 work so reviewers can verify the dependency surface independently of the code that consumes it. Issue: CLDSRV-884
Pure helper: parses OTEL_TRUSTED_HOSTS into a Set, exposes a requestHook that strips traceparent/tracestate from outbound HTTP requests to hosts outside the set. Loopback aliases are always trusted so dev / local tooling work without extra config. Used in the next commit's bootstrap (HttpInstrumentation requestHook). Issue: CLDSRV-884
Pure helper: matches request URLs against a set of known probe/scrape paths (k8s liveness/readiness, metrics scrape). Wired into HttpInstrumentation.ignoreIncomingRequestHook in the next commit so probe traffic doesn't dominate the trace volume. Issue: CLDSRV-884
The two pieces that wire OTEL into the process:
- lib/tracing/index.js: init() / close() / isEnabled() facade.
Boots the Node SDK with an explicit allowlist of three
instrumentations (http, mongodb, ioredis), a ParentBased
TraceIdRatio sampler, the trust-boundary requestHook, and
the health-probe ignore hook. Cluster primary is skipped
(workers re-exec the entry and init themselves). close()
flushes the BatchSpanProcessor with a bounded deadline so
SIGTERM can't hang past the kubelet's grace period.
- lib/instrumentation/simple.js: instrumentApiMethod() — wraps
a callback-style or async S3 handler in a span that ends
when the underlying handler settles. The next commits use
this to instrument all 80+ S3 API methods.
- index.js: calls tracing.init() before requiring server.js
so the require-in-the-middle patches land before any
HTTP/Mongo/ioredis module is loaded.
Issue: CLDSRV-884
|
|
Updates since the last review pass:
|
Summary
Add OpenTelemetry tracing instrumentation to cloudserver, gated behind
ENABLE_OTEL=true. When the flag is unset, no@opentelemetry/*package is loaded — zero overhead off the OTEL path.The four atomic commits:
feat: add OpenTelemetry tracing with trust boundaries and probe filteringParentBasedSampler({ root: TraceIdRatioBasedSampler(ratio) })so inboundtraceparentsampled=1from NGINX/Beyla is honored end-to-end (was being re-sampled by a plain ratio sampler before)auto-instrumentations-nodebundle (which pulled ~36 unused instrumentations: pg, mysql, kafkajs, cassandra, oracle, etc.):instrumentation-httpwithignoreIncomingRequestHook(drops k8s probes + OPTIONS) andrequestHook(stripstraceparent/tracestateon outbound requests to hosts not in the trusted allowlist; client span is preserved and taggedscality.trace.suppressed=true)instrumentation-mongodbwith low-cardinality settings (no collection names in span names, no captured query payloads)instrumentation-iorediswithrequireParentSpan: true(no orphan spans from background stats/rate-limit jobs)buildTrustedHosts(config)derives the allowlist from cloudserver's own Config (vaultd, dataClient, metadataClient, pfsClient, cdmi, bucketd, KMIP, KMS, scuba, utapi, mongodb replica set, backbeat, managementAgent +hdclient/sproxydbootstrap entries fromlocationConstraintsfor direct Scality storage connectors), plusPUSH_ENDPOINT/MANAGEMENT_ENDPOINTenv vars and loopback aliases. Unit test asserts every config host shape is included so the derivation stays honest.lib/tracing/healthPaths.js):/live,/ready,/_/healthcheck,/_/healthcheck/deep,/metricsproduce no spansdiag.setLogger(new DiagConsoleLogger(), DiagLogLevel.WARN)so SDK errors (export failures, malformedtraceparent, etc.) surface instead of disappearingattributeValueLengthLimit: 4096,attributeCountLimit: 128,eventCountLimit: 128,linkCountLimit: 128to bound memory under pathological caseslogRecordProcessors: [],metricReaders: []— explicit traces-only; without these, NodeSDK silently spins up OTLP log + metric exportersfeat: flush OTEL on shutdownshutdownOtel()wired intoS3Server.cleanUp()'s shutdown chain (betweenserver.close()andprocess.exit(0)), capped at 5 s viaPromise.raceso an unreachable collector can't block past Kubernetes' default 30 sterminationGracePeriodSecondscaughtExceptionShutdown()also attempts a best-effort flush beforeprocess.exit(1)on the non-cluster path.finally()guard so any unexpected promise rejection in the shutdown chain still exits the process@opentelemetry/instrumentation-httpalready extractstraceparentand creates the server span as a child of the remote parent. A manual extract on top would replace the active context with the (non-recording) remote parent and demote downstreamapi.<handler>spans to siblings instead of children of the HTTP server span. Removed from the original draft.feat: instrument all S3 API handlers with OTEL spanslib/instrumentation/simple.jsexports a single helperinstrumentApiMethod(handler, methodName)that wraps an S3 handler in anapi.<methodName>span owning the entire handler execution (auth, body parsing, metadata I/O, data path, finalizers). Auto-instrumentation spans (HTTP, MongoDB, ioredis) nest beneath. ~70 distinct span names total — well within trace-backend limits.lib/api/api.jsapplies the wrapper viaObject.entries(api)loop with aNON_HANDLER_KEYSskip set (callApiMethod,checkAuthResults,handleAuthorizationResults). New S3 handlers added to the literal are auto-instrumented; no per-handler boilerplate.cloudserver.error_codeon the error pathchore: bump arsenal to ARSN-572 PoC branch for e2e trace context testingtraceContextplumbing to MongoDB metadata writes (object oplog entries carry the trace context that caused them, so async consumers — Backbeat, Sorbet, lifecycle — can continue the trace across the oplog/queue boundary)8.3.xrelease tag once ARSN-572 shipsOrigin
Extracted and cleaned up from William Lardier's
user/test/wlardier/servicemesh-2branch (based ondevelopment/9.0, July–Sep 2025), which mixed OTEL with performance optimizations and debug artifacts. This PR contains only the OTEL instrumentation, rebased ontodevelopment/9.3. Dropped:lib/otelContextPropagation.js(dead code; never imported)MOCK_DOAUTH/lib/api/apiUtils/mock/backendMocks.js(production-dangerous auth-result caching, unrelated to tracing)console.log()debug artifacts, commented-out span codemonitorLatency,releaseRequestContexts, arsenal perf pin) — out of scope; will land separately if neededTests
29 unit tests (22 in
tests/unit/lib/otel.spec.js, 7 intests/unit/lib/instrumentationSimple.spec.js) coveringextractHost,buildTrustedHosts(full config snapshot + locationConstraints connector enumeration),makeRequestHook(outbound trust check, inbound IncomingMessage skip),isHealthPath, and the full lifecycle ofinstrumentApiMethod(callback success/error, double-end guard, async resolve/reject, sync throw, OTEL-disabled bypass).Context
@opentelemetry/instrumentation-http's automatic outboundtraceparentpropagationIssue: CLDSRV-884