feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) by localai-bot · Pull Request #9976 · mudler/LocalAI

localai-bot · 2026-05-24T21:03:47Z

Summary

Opt-in X-LocalAI-Node response header carrying the ID of the distributed-mode worker that served the request. Gated behind --expose-node-header / LOCALAI_EXPOSE_NODE_HEADER (default off because the node ID reveals internal cluster topology).

Implemented as an Echo middleware + custom ResponseWriter wrapper with per-request node ID rendezvous via request context. The wrapper sets the header on first Write / WriteHeader / Flush from a per-request *atomic.Value holder that the SmartRouter stamps after picking a replica. No shared mutable state, no store lookup on the hot path, correct under concurrent multi-replica routing.

Implements Bug-0 from the distributed-mode bug-hunt spec.

Architecture

                  pkg/distributedhdr
                         |
       +-----------------+------------------+
       |                                    |
core/http/middleware           core/services/nodes
  (creates holder,             (writes holder via
   wraps writer,                ctx after picking
   reads holder on              replica)
   first byte)

pkg/distributedhdr is a tiny leaf package containing the context key + Stamp(ctx, nodeID) / Load(holder) / WithHolder / Inherit helpers.
Middleware creates *atomic.Value per request, attaches it to c.Request().Context(), closes over it for the wrapper's resolve().
ModelRouterAdapter.Route reads the holder via distributedhdr.Stamp(ctx, result.Node.ID) right after picking. No-op if the flag is off.
Wrapper reads the holder via direct closure access (no context lookup), sets header on first byte. Race-clean (atomic.Value), zero hot-path I/O.

Endpoint coverage

Endpoint	Routing	Header
`/v1/chat/completions` (OpenAI, buffered + streaming)	SmartRouter	✅
`/v1/completions` (OpenAI, buffered + streaming)	SmartRouter	✅
`/v1/embeddings` (OpenAI)	SmartRouter	✅
`/v1/messages` (Anthropic)	SmartRouter	✅
`/api/chat`, `/api/generate`, `/api/embed`, `/api/embeddings` (Ollama)	SmartRouter	✅
`/v1/audio/transcriptions` (Whisper / Diarize)	SmartRouter	✅
`/v1/audio/speech` (TTS)	SmartRouter	✅
`/v1/images/generations` + `/v1/images/inpainting` (SD / Flux)	SmartRouter	✅
`/v1/rerank` (rerank)	SmartRouter	✅
`/v1/vad` (VAD)	SmartRouter	✅

Underlying fix bundled with extended coverage

While wiring audio/image/TTS/rerank/VAD, found that all five backend helpers were silently dropping the request context: ModelOptions(cfg, app) always seeds model.WithContext(app.Context) (the long-lived app context). Without an explicit model.WithContext(c.Request().Context()) appended after, distributedhdr.Stamp(ctx, ...) inside the router callback was a silent no-op for those endpoints. Fixed all five (core/backend/{transcript,tts,image,rerank,vad}.go). Regression-pinned by core/backend/ctx_propagation_test.go.

Bonus behavioural change: backend.ImageGeneration now takes a leading ctx context.Context argument and threads it through to the gRPC GenerateImage call (previously used appConfig.Context). Client disconnects will now actually cancel in-flight diffusion runs. The two in-tree callers and one test stub are updated.

Architecture wins

Single point of policy enforcement in core/http/middleware/node_header.go - no per-handler hooks.
Per-request attribution under concurrent multi-replica routing - 20-request burst across 2 replicas: 10/10 headers exactly match the frontend's routing decisions (verified e2e).
Streaming correctness - the wrapper's Write / Flush triggers the header set BEFORE delegating to the underlying writer.
Zero hot-path I/O - per-request atomic.Value read, no model loader lookup.
Anthropic / Ollama / audio / image / TTS / rerank / VAD covered via the shared RequestExtractor - no per-endpoint header logic.
pkg/model shrinks by 69 lines - removed dead nodeID field, SetNodeID, NodeID, LookupNodeID, storeMu, getStore.

Caveats

Realtime / WebSocket (/v1/realtime) is not wired. Backend ctx propagation is now in place, but the WebSocket lifecycle would need careful handling.

Test plan

go test -race ./core/http/middleware/ ./pkg/model/ ./core/services/nodes/ ./pkg/distributedhdr/ ./core/backend/ -count=1 passes
Multi-replica concurrent attribution test in core/http/middleware/node_header_concurrency_test.go (N=64 concurrent requests, each must observe its own stamp)
Route-level integration spec asserts header set BEFORE first SSE byte
core/backend/ctx_propagation_test.go pins ctx propagation for transcription / TTS / image / rerank / VAD (reverting any backend fix makes the corresponding spec fail)
With the flag off (default), middleware fast-paths to next(c) without wrapping the writer
E2E verified against docker-compose distributed stack with 2 workers + qwen3-0.6b loaded on both replicas. 20 concurrent chat requests: 10/10 split in both the harness headers AND the frontend routing log.
E2E for audio / image / TTS / rerank / VAD pending: requires loading the respective model types on a test stack. Curl recipes in the verification comment below.

Assisted-by: Claude:claude-opus-4-7[1m]

localai-bot · 2026-05-24T22:09:04Z

End-to-end verification (2026-05-24)

Ran the full distributed stack via docker-compose.distributed.yaml against this branch's image with LOCALAI_EXPOSE_NODE_HEADER=true and qwen3-0.6b loaded on worker-1.

Endpoint	Mode	Flag ON	Flag OFF
`/v1/chat/completions`	buffered	header set to worker-1 node ID	header absent
`/v1/chat/completions`	streaming SSE	header set in response head before first SSE byte	header absent
`/api/generate` (Ollama)	buffered	header set	(not tested)
`/api/chat` (Ollama)	buffered	header set	(not tested)
`/v1/messages` (Anthropic)	buffered	header set	(not tested)

Header value (fe221165-b6c4-4439-b993-e192a1f0b86b) matches worker-1's ID as reported by GET /api/nodes. The flag-off path only emits X-Correlation-Id, no node header.

Anthropic + Ollama shims worked without any handler-side change: the wrapper reads model name from CONTEXT_LOCALS_KEY_MODEL_NAME populated by the shared RequestExtractor middleware that all three route stacks already use.

localai-bot · 2026-05-25T06:45:14Z

Per-request attribution e2e verification (2026-05-25)

Followup to the earlier e2e comment. The previous LookupNodeID-from-store design had a real race under concurrent multi-replica routing: the per-modelID store entry gets overwritten on each routing decision, so two simultaneous requests for the same model would both end up reading the most-recently-stamped node ID. Refactored to per-request *atomic.Value holder on c.Request().Context().

Verified with 2-worker docker-compose stack, qwen3-0.6b loaded on both worker-1 and worker-2 (via min_replicas=2 scheduling), 20 concurrent requests through the bug-hunt harness:

	worker-1	worker-2
Headers reported by harness (`X-LocalAI-Node`)	10	10
Routing decisions logged by frontend (`Model routed to remote node`)	10	10

Exact match. Pre-fix this would have shown either ~20/0 or 0/~20 because the store-based read returned whichever replica won the most recent overwrite.

Also added a Ginkgo regression test (core/http/middleware/node_header_concurrency_test.go) that drives 64 parallel requests through one middleware instance with each request stamping a unique node ID, asserts every response's header equals THAT request's stamp - so the bug cannot regress silently.

localai-bot · 2026-05-25T07:35:47Z

Extended endpoint coverage (2026-05-25)

Wired the middleware to audio transcriptions / audio speech / image generations / image inpainting / rerank / VAD. While doing so, found and fixed a real underlying bug: all five backend helpers were silently using the long-lived app context instead of the request context, which would have made the header a silent no-op on those endpoints even with the middleware wired. Fix is regression-pinned in core/backend/ctx_propagation_test.go.

Bonus: image generation now actually honors client cancellation (was using appConfig.Context for the gRPC call).

E2E recipe (post-merge)

With LOCALAI_EXPOSE_NODE_HEADER=true and the relevant model loaded on at least one worker:

# Audio transcriptions
curl -i -X POST $ENDPOINT/v1/audio/transcriptions \
  -H "Authorization: Bearer $KEY" \
  -F model=whisper-base.en -F file=@samples/jfk.wav | grep -i x-localai-node

# TTS
curl -i -X POST $ENDPOINT/v1/audio/speech \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello node","voice":"alloy"}' --output /tmp/out.wav -D - | grep -i x-localai-node

# Image generation
curl -i -X POST $ENDPOINT/v1/images/generations \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"model":"sd-1.5","prompt":"a tiny red dot","size":"64x64","n":1}' | grep -i x-localai-node

# Rerank
curl -i -X POST $ENDPOINT/v1/rerank \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"model":"bge-reranker-v2-m3","query":"q","documents":["a","b","c"]}' | grep -i x-localai-node

# VAD
curl -i -X POST $ENDPOINT/v1/vad \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"model":"silero-vad","audio":[0.0,0.1,0.2]}' | grep -i x-localai-node

Expect X-LocalAI-Node: <worker-id> on every response; alternating across requests proves per-request routing.

Introduce pkg/distributedhdr, a leaf package carrying a per-request *atomic.Value holder for the picked worker node ID from the SmartRouter (core/services/nodes) up to the HTTP response writer wrapper (core/http/middleware). Avoids the import cycle that a shared key in either consumer would create. Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The holder is atomic.Value so cross-goroutine publish from the router to the response writer wrapper is race-clean. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…wrapper New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID reveals internal topology and is opt-in). The middleware creates a per-request *atomic.Value holder, attaches it to c.Request().Context() via distributedhdr.WithHolder, and wraps c.Response().Writer with a custom http.ResponseWriter that sets the X-LocalAI-Node header on first Write / WriteHeader / Flush by reading the holder. Implements http.Flusher, http.Hijacker, Unwrap so it composes cleanly with Echo and http.NewResponseController. request.go propagates the holder onto derived contexts via distributedhdr.Inherit so the holder survives the correlation-ID context replacement. Unit + race-clean concurrency + integration specs. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…erence routes ModelRouterAdapter.Route stamps the picked node ID into the per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right after replica selection. Wire ExposeNodeHeader middleware to: - OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting - Anthropic /v1/messages - Ollama /api/chat, /api/generate, /api/embed, /api/embeddings - Jina /v1/rerank - LocalAI /v1/vad The middleware's wrapper reads the holder on first byte and sets the X-LocalAI-Node response header before delegating to the underlying writer. Per-request scope means no race under concurrent multi-replica routing. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… ctx propagation Five non-OpenAI backend helpers were silently using app.Context instead of the request context for the gRPC backend call: transcription, TTS, image generation, rerank, VAD. Effect: distributedhdr.Stamp in the router callback was a silent no-op for these paths, AND client cancellation didn't propagate to in-flight inference. Thread c.Request().Context() (or the equivalent input.Context after the request middleware has installed the correlation-ID derived context) through each helper and into ModelOptions via model.WithContext(ctx). ImageGeneration's signature gains a leading ctx parameter; in-tree callers (openai image, openai inpainting, openai inpainting_test) are updated to match. ModelEmbedding gains a leading ctx parameter for the same reason; the openai and ollama embedding handlers pass the request context through. chat_stream_workers.go defers the initial role=assistant chunk emission until the first token callback so the wrapper's lazy X-LocalAI-Node lookup against the loader runs AFTER ml.Load has stamped the per-modelID node ID; semantically identical for clients (role still arrives before any text). Regression test core/backend/ctx_propagation_test.go pins ctx propagation for all five helpers. Docs updated to enumerate the full endpoint coverage of the --expose-node-header flag. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

localai-bot changed the title ~~feat(distributed): gated X-LocalAI-Node response header~~ feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) May 24, 2026

mudler added 4 commits May 25, 2026 08:00

mudler force-pushed the feat/expose-node-header branch from 36e6625 to 017d89b Compare May 25, 2026 08:33

Repository owner deleted a comment from localai-bot May 25, 2026

mudler merged commit 06e777b into master May 25, 2026
57 checks passed

mudler deleted the feat/expose-node-header branch May 25, 2026 08:51

localai-bot mentioned this pull request May 25, 2026

fix(distributed): persist per-model load info so reconciler survives frontend restart #9981

Merged

5 tasks

BrewTestBot mentioned this pull request May 27, 2026

localai 4.3.2 Homebrew/homebrew-core#285003

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper)#9976

feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper)#9976
mudler merged 4 commits into
masterfrom
feat/expose-node-header

localai-bot commented May 24, 2026 •

edited

Loading

Uh oh!

localai-bot commented May 24, 2026

Uh oh!

localai-bot commented May 25, 2026

Uh oh!

localai-bot commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

localai-bot commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Endpoint coverage

Underlying fix bundled with extended coverage

Architecture wins

Caveats

Test plan

Uh oh!

localai-bot commented May 24, 2026

End-to-end verification (2026-05-24)

Uh oh!

localai-bot commented May 25, 2026

Per-request attribution e2e verification (2026-05-25)

Uh oh!

localai-bot commented May 25, 2026

Extended endpoint coverage (2026-05-25)

E2E recipe (post-merge)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

localai-bot commented May 24, 2026 •

edited

Loading