Skip to content

feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper)#9976

Merged
mudler merged 4 commits into
masterfrom
feat/expose-node-header
May 25, 2026
Merged

feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper)#9976
mudler merged 4 commits into
masterfrom
feat/expose-node-header

Conversation

@localai-bot
Copy link
Copy Markdown
Collaborator

@localai-bot localai-bot commented May 24, 2026

Summary

Opt-in X-LocalAI-Node response header carrying the ID of the distributed-mode worker that served the request. Gated behind --expose-node-header / LOCALAI_EXPOSE_NODE_HEADER (default off because the node ID reveals internal cluster topology).

Implemented as an Echo middleware + custom ResponseWriter wrapper with per-request node ID rendezvous via request context. The wrapper sets the header on first Write / WriteHeader / Flush from a per-request *atomic.Value holder that the SmartRouter stamps after picking a replica. No shared mutable state, no store lookup on the hot path, correct under concurrent multi-replica routing.

Implements Bug-0 from the distributed-mode bug-hunt spec.

Architecture

                  pkg/distributedhdr
                         |
       +-----------------+------------------+
       |                                    |
core/http/middleware           core/services/nodes
  (creates holder,             (writes holder via
   wraps writer,                ctx after picking
   reads holder on              replica)
   first byte)
  • pkg/distributedhdr is a tiny leaf package containing the context key + Stamp(ctx, nodeID) / Load(holder) / WithHolder / Inherit helpers.
  • Middleware creates *atomic.Value per request, attaches it to c.Request().Context(), closes over it for the wrapper's resolve().
  • ModelRouterAdapter.Route reads the holder via distributedhdr.Stamp(ctx, result.Node.ID) right after picking. No-op if the flag is off.
  • Wrapper reads the holder via direct closure access (no context lookup), sets header on first byte. Race-clean (atomic.Value), zero hot-path I/O.

Endpoint coverage

Endpoint Routing Header
/v1/chat/completions (OpenAI, buffered + streaming) SmartRouter
/v1/completions (OpenAI, buffered + streaming) SmartRouter
/v1/embeddings (OpenAI) SmartRouter
/v1/messages (Anthropic) SmartRouter
/api/chat, /api/generate, /api/embed, /api/embeddings (Ollama) SmartRouter
/v1/audio/transcriptions (Whisper / Diarize) SmartRouter
/v1/audio/speech (TTS) SmartRouter
/v1/images/generations + /v1/images/inpainting (SD / Flux) SmartRouter
/v1/rerank (rerank) SmartRouter
/v1/vad (VAD) SmartRouter

Underlying fix bundled with extended coverage

While wiring audio/image/TTS/rerank/VAD, found that all five backend helpers were silently dropping the request context: ModelOptions(cfg, app) always seeds model.WithContext(app.Context) (the long-lived app context). Without an explicit model.WithContext(c.Request().Context()) appended after, distributedhdr.Stamp(ctx, ...) inside the router callback was a silent no-op for those endpoints. Fixed all five (core/backend/{transcript,tts,image,rerank,vad}.go). Regression-pinned by core/backend/ctx_propagation_test.go.

Bonus behavioural change: backend.ImageGeneration now takes a leading ctx context.Context argument and threads it through to the gRPC GenerateImage call (previously used appConfig.Context). Client disconnects will now actually cancel in-flight diffusion runs. The two in-tree callers and one test stub are updated.

Architecture wins

  • Single point of policy enforcement in core/http/middleware/node_header.go - no per-handler hooks.
  • Per-request attribution under concurrent multi-replica routing - 20-request burst across 2 replicas: 10/10 headers exactly match the frontend's routing decisions (verified e2e).
  • Streaming correctness - the wrapper's Write / Flush triggers the header set BEFORE delegating to the underlying writer.
  • Zero hot-path I/O - per-request atomic.Value read, no model loader lookup.
  • Anthropic / Ollama / audio / image / TTS / rerank / VAD covered via the shared RequestExtractor - no per-endpoint header logic.
  • pkg/model shrinks by 69 lines - removed dead nodeID field, SetNodeID, NodeID, LookupNodeID, storeMu, getStore.

Caveats

  • Realtime / WebSocket (/v1/realtime) is not wired. Backend ctx propagation is now in place, but the WebSocket lifecycle would need careful handling.

Test plan

  • go test -race ./core/http/middleware/ ./pkg/model/ ./core/services/nodes/ ./pkg/distributedhdr/ ./core/backend/ -count=1 passes
  • Multi-replica concurrent attribution test in core/http/middleware/node_header_concurrency_test.go (N=64 concurrent requests, each must observe its own stamp)
  • Route-level integration spec asserts header set BEFORE first SSE byte
  • core/backend/ctx_propagation_test.go pins ctx propagation for transcription / TTS / image / rerank / VAD (reverting any backend fix makes the corresponding spec fail)
  • With the flag off (default), middleware fast-paths to next(c) without wrapping the writer
  • E2E verified against docker-compose distributed stack with 2 workers + qwen3-0.6b loaded on both replicas. 20 concurrent chat requests: 10/10 split in both the harness headers AND the frontend routing log.
  • E2E for audio / image / TTS / rerank / VAD pending: requires loading the respective model types on a test stack. Curl recipes in the verification comment below.

Assisted-by: Claude:claude-opus-4-7[1m]

@localai-bot localai-bot changed the title feat(distributed): gated X-LocalAI-Node response header feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) May 24, 2026
@localai-bot
Copy link
Copy Markdown
Collaborator Author

End-to-end verification (2026-05-24)

Ran the full distributed stack via docker-compose.distributed.yaml against this branch's image with LOCALAI_EXPOSE_NODE_HEADER=true and qwen3-0.6b loaded on worker-1.

Endpoint Mode Flag ON Flag OFF
/v1/chat/completions buffered header set to worker-1 node ID header absent
/v1/chat/completions streaming SSE header set in response head before first SSE byte header absent
/api/generate (Ollama) buffered header set (not tested)
/api/chat (Ollama) buffered header set (not tested)
/v1/messages (Anthropic) buffered header set (not tested)

Header value (fe221165-b6c4-4439-b993-e192a1f0b86b) matches worker-1's ID as reported by GET /api/nodes. The flag-off path only emits X-Correlation-Id, no node header.

Anthropic + Ollama shims worked without any handler-side change: the wrapper reads model name from CONTEXT_LOCALS_KEY_MODEL_NAME populated by the shared RequestExtractor middleware that all three route stacks already use.

@localai-bot
Copy link
Copy Markdown
Collaborator Author

Per-request attribution e2e verification (2026-05-25)

Followup to the earlier e2e comment. The previous LookupNodeID-from-store design had a real race under concurrent multi-replica routing: the per-modelID store entry gets overwritten on each routing decision, so two simultaneous requests for the same model would both end up reading the most-recently-stamped node ID. Refactored to per-request *atomic.Value holder on c.Request().Context().

Verified with 2-worker docker-compose stack, qwen3-0.6b loaded on both worker-1 and worker-2 (via min_replicas=2 scheduling), 20 concurrent requests through the bug-hunt harness:

worker-1 worker-2
Headers reported by harness (X-LocalAI-Node) 10 10
Routing decisions logged by frontend (Model routed to remote node) 10 10

Exact match. Pre-fix this would have shown either ~20/0 or 0/~20 because the store-based read returned whichever replica won the most recent overwrite.

Also added a Ginkgo regression test (core/http/middleware/node_header_concurrency_test.go) that drives 64 parallel requests through one middleware instance with each request stamping a unique node ID, asserts every response's header equals THAT request's stamp - so the bug cannot regress silently.

@localai-bot
Copy link
Copy Markdown
Collaborator Author

Extended endpoint coverage (2026-05-25)

Wired the middleware to audio transcriptions / audio speech / image generations / image inpainting / rerank / VAD. While doing so, found and fixed a real underlying bug: all five backend helpers were silently using the long-lived app context instead of the request context, which would have made the header a silent no-op on those endpoints even with the middleware wired. Fix is regression-pinned in core/backend/ctx_propagation_test.go.

Bonus: image generation now actually honors client cancellation (was using appConfig.Context for the gRPC call).

E2E recipe (post-merge)

With LOCALAI_EXPOSE_NODE_HEADER=true and the relevant model loaded on at least one worker:

# Audio transcriptions
curl -i -X POST $ENDPOINT/v1/audio/transcriptions \
  -H "Authorization: Bearer $KEY" \
  -F model=whisper-base.en -F file=@samples/jfk.wav | grep -i x-localai-node

# TTS
curl -i -X POST $ENDPOINT/v1/audio/speech \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello node","voice":"alloy"}' --output /tmp/out.wav -D - | grep -i x-localai-node

# Image generation
curl -i -X POST $ENDPOINT/v1/images/generations \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"model":"sd-1.5","prompt":"a tiny red dot","size":"64x64","n":1}' | grep -i x-localai-node

# Rerank
curl -i -X POST $ENDPOINT/v1/rerank \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"model":"bge-reranker-v2-m3","query":"q","documents":["a","b","c"]}' | grep -i x-localai-node

# VAD
curl -i -X POST $ENDPOINT/v1/vad \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"model":"silero-vad","audio":[0.0,0.1,0.2]}' | grep -i x-localai-node

Expect X-LocalAI-Node: <worker-id> on every response; alternating across requests proves per-request routing.

mudler added 4 commits May 25, 2026 08:00
Introduce pkg/distributedhdr, a leaf package carrying a per-request
*atomic.Value holder for the picked worker node ID from the
SmartRouter (core/services/nodes) up to the HTTP response writer
wrapper (core/http/middleware). Avoids the import cycle that a shared
key in either consumer would create.

Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The
holder is atomic.Value so cross-goroutine publish from the router to
the response writer wrapper is race-clean.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…wrapper

New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI
flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID
reveals internal topology and is opt-in).

The middleware creates a per-request *atomic.Value holder, attaches it
to c.Request().Context() via distributedhdr.WithHolder, and wraps
c.Response().Writer with a custom http.ResponseWriter that sets the
X-LocalAI-Node header on first Write / WriteHeader / Flush by reading
the holder. Implements http.Flusher, http.Hijacker, Unwrap so it
composes cleanly with Echo and http.NewResponseController.

request.go propagates the holder onto derived contexts via
distributedhdr.Inherit so the holder survives the correlation-ID
context replacement.

Unit + race-clean concurrency + integration specs.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…erence routes

ModelRouterAdapter.Route stamps the picked node ID into the
per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right
after replica selection.

Wire ExposeNodeHeader middleware to:
- OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting
- Anthropic /v1/messages
- Ollama /api/chat, /api/generate, /api/embed, /api/embeddings
- Jina /v1/rerank
- LocalAI /v1/vad

The middleware's wrapper reads the holder on first byte and sets the
X-LocalAI-Node response header before delegating to the underlying
writer. Per-request scope means no race under concurrent multi-replica
routing.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… ctx propagation

Five non-OpenAI backend helpers were silently using app.Context instead
of the request context for the gRPC backend call: transcription, TTS,
image generation, rerank, VAD. Effect: distributedhdr.Stamp in the
router callback was a silent no-op for these paths, AND client
cancellation didn't propagate to in-flight inference.

Thread c.Request().Context() (or the equivalent input.Context after
the request middleware has installed the correlation-ID derived
context) through each helper and into ModelOptions via
model.WithContext(ctx). ImageGeneration's signature gains a leading
ctx parameter; in-tree callers (openai image, openai inpainting,
openai inpainting_test) are updated to match.

ModelEmbedding gains a leading ctx parameter for the same reason; the
openai and ollama embedding handlers pass the request context through.

chat_stream_workers.go defers the initial role=assistant chunk
emission until the first token callback so the wrapper's lazy
X-LocalAI-Node lookup against the loader runs AFTER ml.Load has
stamped the per-modelID node ID; semantically identical for clients
(role still arrives before any text).

Regression test core/backend/ctx_propagation_test.go pins ctx
propagation for all five helpers.

Docs updated to enumerate the full endpoint coverage of the
--expose-node-header flag.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler force-pushed the feat/expose-node-header branch from 36e6625 to 017d89b Compare May 25, 2026 08:33
Repository owner deleted a comment from localai-bot May 25, 2026
@mudler mudler merged commit 06e777b into master May 25, 2026
57 checks passed
@mudler mudler deleted the feat/expose-node-header branch May 25, 2026 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants