feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper)#9976
Conversation
End-to-end verification (2026-05-24)Ran the full distributed stack via
Header value ( Anthropic + Ollama shims worked without any handler-side change: the wrapper reads model name from |
Per-request attribution e2e verification (2026-05-25)Followup to the earlier e2e comment. The previous LookupNodeID-from-store design had a real race under concurrent multi-replica routing: the per-modelID store entry gets overwritten on each routing decision, so two simultaneous requests for the same model would both end up reading the most-recently-stamped node ID. Refactored to per-request Verified with 2-worker docker-compose stack,
Exact match. Pre-fix this would have shown either ~20/0 or 0/~20 because the store-based read returned whichever replica won the most recent overwrite. Also added a Ginkgo regression test ( |
Extended endpoint coverage (2026-05-25)Wired the middleware to audio transcriptions / audio speech / image generations / image inpainting / rerank / VAD. While doing so, found and fixed a real underlying bug: all five backend helpers were silently using the long-lived app context instead of the request context, which would have made the header a silent no-op on those endpoints even with the middleware wired. Fix is regression-pinned in Bonus: image generation now actually honors client cancellation (was using E2E recipe (post-merge)With # Audio transcriptions
curl -i -X POST $ENDPOINT/v1/audio/transcriptions \
-H "Authorization: Bearer $KEY" \
-F model=whisper-base.en -F file=@samples/jfk.wav | grep -i x-localai-node
# TTS
curl -i -X POST $ENDPOINT/v1/audio/speech \
-H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
-d '{"model":"tts-1","input":"Hello node","voice":"alloy"}' --output /tmp/out.wav -D - | grep -i x-localai-node
# Image generation
curl -i -X POST $ENDPOINT/v1/images/generations \
-H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
-d '{"model":"sd-1.5","prompt":"a tiny red dot","size":"64x64","n":1}' | grep -i x-localai-node
# Rerank
curl -i -X POST $ENDPOINT/v1/rerank \
-H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
-d '{"model":"bge-reranker-v2-m3","query":"q","documents":["a","b","c"]}' | grep -i x-localai-node
# VAD
curl -i -X POST $ENDPOINT/v1/vad \
-H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
-d '{"model":"silero-vad","audio":[0.0,0.1,0.2]}' | grep -i x-localai-nodeExpect |
Introduce pkg/distributedhdr, a leaf package carrying a per-request *atomic.Value holder for the picked worker node ID from the SmartRouter (core/services/nodes) up to the HTTP response writer wrapper (core/http/middleware). Avoids the import cycle that a shared key in either consumer would create. Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The holder is atomic.Value so cross-goroutine publish from the router to the response writer wrapper is race-clean. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…wrapper New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID reveals internal topology and is opt-in). The middleware creates a per-request *atomic.Value holder, attaches it to c.Request().Context() via distributedhdr.WithHolder, and wraps c.Response().Writer with a custom http.ResponseWriter that sets the X-LocalAI-Node header on first Write / WriteHeader / Flush by reading the holder. Implements http.Flusher, http.Hijacker, Unwrap so it composes cleanly with Echo and http.NewResponseController. request.go propagates the holder onto derived contexts via distributedhdr.Inherit so the holder survives the correlation-ID context replacement. Unit + race-clean concurrency + integration specs. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…erence routes ModelRouterAdapter.Route stamps the picked node ID into the per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right after replica selection. Wire ExposeNodeHeader middleware to: - OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting - Anthropic /v1/messages - Ollama /api/chat, /api/generate, /api/embed, /api/embeddings - Jina /v1/rerank - LocalAI /v1/vad The middleware's wrapper reads the holder on first byte and sets the X-LocalAI-Node response header before delegating to the underlying writer. Per-request scope means no race under concurrent multi-replica routing. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… ctx propagation Five non-OpenAI backend helpers were silently using app.Context instead of the request context for the gRPC backend call: transcription, TTS, image generation, rerank, VAD. Effect: distributedhdr.Stamp in the router callback was a silent no-op for these paths, AND client cancellation didn't propagate to in-flight inference. Thread c.Request().Context() (or the equivalent input.Context after the request middleware has installed the correlation-ID derived context) through each helper and into ModelOptions via model.WithContext(ctx). ImageGeneration's signature gains a leading ctx parameter; in-tree callers (openai image, openai inpainting, openai inpainting_test) are updated to match. ModelEmbedding gains a leading ctx parameter for the same reason; the openai and ollama embedding handlers pass the request context through. chat_stream_workers.go defers the initial role=assistant chunk emission until the first token callback so the wrapper's lazy X-LocalAI-Node lookup against the loader runs AFTER ml.Load has stamped the per-modelID node ID; semantically identical for clients (role still arrives before any text). Regression test core/backend/ctx_propagation_test.go pins ctx propagation for all five helpers. Docs updated to enumerate the full endpoint coverage of the --expose-node-header flag. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
36e6625 to
017d89b
Compare
Summary
Opt-in
X-LocalAI-Noderesponse header carrying the ID of the distributed-mode worker that served the request. Gated behind--expose-node-header/LOCALAI_EXPOSE_NODE_HEADER(default off because the node ID reveals internal cluster topology).Implemented as an Echo middleware + custom ResponseWriter wrapper with per-request node ID rendezvous via request context. The wrapper sets the header on first
Write/WriteHeader/Flushfrom a per-request*atomic.Valueholder that the SmartRouter stamps after picking a replica. No shared mutable state, no store lookup on the hot path, correct under concurrent multi-replica routing.Implements Bug-0 from the distributed-mode bug-hunt spec.
Architecture
pkg/distributedhdris a tiny leaf package containing the context key +Stamp(ctx, nodeID)/Load(holder)/WithHolder/Inherithelpers.*atomic.Valueper request, attaches it toc.Request().Context(), closes over it for the wrapper'sresolve().ModelRouterAdapter.Routereads the holder viadistributedhdr.Stamp(ctx, result.Node.ID)right after picking. No-op if the flag is off.Endpoint coverage
/v1/chat/completions(OpenAI, buffered + streaming)/v1/completions(OpenAI, buffered + streaming)/v1/embeddings(OpenAI)/v1/messages(Anthropic)/api/chat,/api/generate,/api/embed,/api/embeddings(Ollama)/v1/audio/transcriptions(Whisper / Diarize)/v1/audio/speech(TTS)/v1/images/generations+/v1/images/inpainting(SD / Flux)/v1/rerank(rerank)/v1/vad(VAD)Underlying fix bundled with extended coverage
While wiring audio/image/TTS/rerank/VAD, found that all five backend helpers were silently dropping the request context:
ModelOptions(cfg, app)always seedsmodel.WithContext(app.Context)(the long-lived app context). Without an explicitmodel.WithContext(c.Request().Context())appended after,distributedhdr.Stamp(ctx, ...)inside the router callback was a silent no-op for those endpoints. Fixed all five (core/backend/{transcript,tts,image,rerank,vad}.go). Regression-pinned bycore/backend/ctx_propagation_test.go.Bonus behavioural change:
backend.ImageGenerationnow takes a leadingctx context.Contextargument and threads it through to the gRPCGenerateImagecall (previously usedappConfig.Context). Client disconnects will now actually cancel in-flight diffusion runs. The two in-tree callers and one test stub are updated.Architecture wins
core/http/middleware/node_header.go- no per-handler hooks.Write/Flushtriggers the header set BEFORE delegating to the underlying writer.RequestExtractor- no per-endpoint header logic.pkg/modelshrinks by 69 lines - removed deadnodeIDfield,SetNodeID,NodeID,LookupNodeID,storeMu,getStore.Caveats
/v1/realtime) is not wired. Backend ctx propagation is now in place, but the WebSocket lifecycle would need careful handling.Test plan
go test -race ./core/http/middleware/ ./pkg/model/ ./core/services/nodes/ ./pkg/distributedhdr/ ./core/backend/ -count=1passescore/http/middleware/node_header_concurrency_test.go(N=64 concurrent requests, each must observe its own stamp)core/backend/ctx_propagation_test.gopins ctx propagation for transcription / TTS / image / rerank / VAD (reverting any backend fix makes the corresponding spec fail)next(c)without wrapping the writerqwen3-0.6bloaded on both replicas. 20 concurrent chat requests: 10/10 split in both the harness headers AND the frontend routing log.Assisted-by: Claude:claude-opus-4-7[1m]