Add Entra OIDC auth, chat/search UX overhaul, and supporting fixes #45
Merged
Auth:
- Wire Microsoft/Entra OIDC directly into the OSS app (no ee/ dependency): OIDC router in main.py, AuthType.OIDC allowlisted, public auth routes registered, OPENID_CONFIG_URL + DEFAULT_ADMIN_EMAILS env vars.
- Auto-verify OIDC users in oauth_callback; env-driven admin allowlist.
- Pin bcrypt==4.0.1 (passlib 1.7.4 is incompatible with bcrypt 4.1+).

Chat/search UX:
- Persona document_sets act as an outer fence (intersect with user filters).
- Search-mode framing on the default persona; assistant scope chip; starter prompts; sidebar timestamps; Cmd+K new chat; distinct assistant message styling; 3-step onboarding cards; larger chat input.
- Searchable/scrollable knowledge-set picker; removed tag filters.
- Hide /search from nav (still reachable by URL); remove redundant top-left assistant selector; highlight applied filters.
- next.config.js: drop 308 stream redirects that stripped the session cookie.

Docs: AGENTS.md + CONTRIBUTING.md updated for OIDC setup and the new footguns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
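The "outer fence" rule for persona document sets can be illustrated with a small sketch. This is not the code from the PR: the function name `effective_document_sets` and the treatment of empty inputs (an empty persona list means "unfenced", an empty user filter means "no extra narrowing") are assumptions made for illustration.

```python
from typing import Optional


def effective_document_sets(
    persona_sets: list[str], user_sets: Optional[list[str]]
) -> Optional[list[str]]:
    """Return the document sets a search may touch (hypothetical helper).

    Assumptions: an empty persona list means the persona is unrestricted,
    and an empty or missing user filter means "no extra narrowing".
    """
    if not persona_sets:
        # Unfenced persona: the user's filter (if any) applies unchanged.
        return list(user_sets) if user_sets else None
    if not user_sets:
        # No user filter: the persona fence is the whole scope.
        return list(persona_sets)
    allowed = set(persona_sets)
    # Keep the user's ordering, drop anything outside the fence.
    return [name for name in user_sets if name in allowed]
```

In this sketch an empty list result means the user asked only for sets outside the fence, so a caller would need to treat `[]` as "match nothing" rather than "no filter".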
- DocumentDisplay: fall back to blurb when match_highlights is a non-empty array of only falsy/whitespace strings. Previously sections stayed empty and sections[0][2] threw "Cannot read properties of undefined (reading '2')", crashing the chat doc sidebar (and search page) when retrieved docs had empty highlights, which is more likely with large/many-doc contexts.
- Slack blocks: strip the language token off opening code fences (```bash -> ```) in build_qa_response_blocks. Slack mrkdwn has no fenced-code info string, so the language rendered as a literal first line of the block. (Slack still cannot syntax-highlight; that's a platform limit.)
- SelectedFilterDisplay/ChatInputBar: remove the locked persona "Scope" chips from the chat input bar. Cosmetic only; the assistant still scopes search to its document sets server-side in search/preprocessing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
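The Slack fence fix described above amounts to a small text transform. The following is a minimal sketch of that idea, not the actual change to build_qa_response_blocks; the helper name `strip_fence_languages` and the set of characters accepted in a language token are assumptions.

```python
import re

# An opening fence that carries an info string, alone on its line (e.g. "```bash").
# A bare closing fence has no token after the backticks, so it never matches.
_FENCE_WITH_LANG = re.compile(r"^```[ \t]*[\w+#.-]+[ \t]*$", re.MULTILINE)


def strip_fence_languages(text: str) -> str:
    """Drop the language token from opening code fences (hypothetical helper)."""
    return _FENCE_WITH_LANG.sub("```", text)
```

Only fences that start a line are rewritten, so backticks that appear mid-sentence are left alone.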
- env-configmap: AUTH_TYPE=oidc, OPENID_CONFIG_URL (Entra discovery), DEFAULT_ADMIN_EMAILS, and set WEB_DOMAIN/DOMAIN to the external https origin (required for a correct OIDC redirect_uri and Secure session cookie).
- api_server deployment: inject OAUTH_CLIENT_ID/OAUTH_CLIENT_SECRET/USER_AUTH_SECRET from the danswer-secrets secret via secretKeyRef.
- secrets.yaml: replace stub values with documented placeholders and a "do not commit real secrets" header; real values applied out-of-band.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switching the assistant silently created and navigated to a new chat session with no feedback. Show an auto-expiring toast explaining each chat is bound to a single assistant (and to re-upload any attached files). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Settings.default_page now defaults to CHAT instead of SEARCH. Only affects deployments with no stored settings yet; existing deployments keep their persisted value (change via Admin -> Settings). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
get_persona_by_id grants non-admins access to ownerless personas (user_id IS NULL), which includes the shared default assistants. Guard mark_persona_as_deleted so a basic user gets 403 for default/ownerless personas, mirroring the frontend's !default_persona rule. Closes a gap where a basic user could soft-delete a default assistant for everyone via a direct API call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
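A rough sketch of the guard described above, under stated assumptions: the `Persona` and `User` dataclasses are simplified stand-ins for the real models, `can_delete_persona` is a hypothetical name, and the caller is assumed to turn a `False` result into an HTTP 403. Ownership checks for personas that do have an owner are assumed to happen in the lookup and are not shown.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class Persona:
    id: int
    user_id: Optional[UUID]  # None means ownerless (shared with everyone)
    default_persona: bool = False


@dataclass
class User:
    id: UUID
    is_admin: bool = False


def can_delete_persona(persona: Persona, user: User) -> bool:
    """Return False when a non-admin targets a default or ownerless persona."""
    if user.is_admin:
        return True
    # Mirrors the frontend's !default_persona rule and also covers
    # ownerless rows, which the lookup makes visible to every user.
    return not (persona.default_persona or persona.user_id is None)
```

A route handler built on this would look up the persona first, then respond with 403 whenever `can_delete_persona` returns `False`, before any soft-delete is written.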
Track .mcp.json (shared Playwright MCP server) so the browser-driven debugging setup is reproducible. Gitignore .playwright-mcp/ and the ad-hoc screenshot, which are local session output, not source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Out of scope for the OIDC / UX work; revisit separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
swati354
approved these changes
May 28, 2026
Sarath1018
approved these changes
May 28, 2026
rajivml
added a commit
that referenced
this pull request
May 29, 2026
rajiv/add-claude was merged to feature/darwin upstream, so the doc's "on top of rajiv/add-claude (PR #45)" reference is stale. The branch is now rebased onto origin/feature/darwin directly — same diff, just a fresher base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rajivml
added a commit
that referenced
this pull request
Jun 3, 2026
rajiv/add-claude was merged to feature/darwin upstream, so the doc's "on top of rajiv/add-claude (PR #45)" reference is stale. The branch is now rebased onto origin/feature/darwin directly — same diff, just a fresher base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rajivml
added a commit
that referenced
this pull request
Jun 3, 2026
…, retention + connector reliability (#46)

* docs: add Redis caching & scaling plan

Plan for exposing chat to a few hundred users: P0 connection-pool/session fix, P1 Redis foundation + DynamicConfigStore read-through cache, P2 per-user request rate limiting, P3 per-chat-turn config caches. Plan only, no implementation yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* P1: Redis foundation + read-through KV cache

Foundation for caching/rate-limiting work (see REDIS_CACHING_PLAN.md). This commit only ships the cache piece; there is no behavioural change unless REDIS_KV_CACHE_ENABLED=true is set.

* requirements: pin redis==5.0.8.
* configs/app_configs.py: REDIS_HOST/PORT/PASSWORD/DB_NUMBER/SSL, REDIS_POOL_MAX_CONNECTIONS, REDIS_HEALTH_CHECK_INTERVAL, REDIS_SOCKET_TIMEOUT_SECONDS; cache toggle REDIS_KV_CACHE_ENABLED (default OFF) and REDIS_KV_CACHE_TTL_SECONDS (1 day).
* danswer/redis/redis_pool.py: lazy ConnectionPool singleton + get_redis_client() helper. Single-tenant: DANSWER_REDIS_KEY_PREFIX is the only namespace; upstream's TenantRedisClient is intentionally not ported.
* dynamic_configs/store.py: RedisCachedDynamicConfigStore wraps any inner DynamicConfigStore with read-through / write-through caching. The inner store stays the source of truth (writes go to the inner store first), encrypted values are NEVER cached as plaintext (just invalidated), and every Redis call is fail-open, so an outage degrades latency, not availability.
* dynamic_configs/factory.py: when REDIS_KV_CACHE_ENABLED is set, transparently wraps the existing PostgresBackedDynamicConfigStore; call sites are unchanged.
* Deployment: redis service in docker-compose.dev.yml (cache-only: no AOF, no RDB snapshots, allkeys-lru @ 256mb so a runaway producer can't OOM the node). darwin-kubernetes/redis-statefulset.yaml mirrors that posture. REDIS_HOST etc. in env-configmap; REDIS_PASSWORD wired via optional secretKeyRef so the deployments still boot when Redis is unauth'd. NOT the Celery broker; that stays on Postgres by design.
* backend/.gitignore: ignore stray pywikibot apicache/throttle.ctrl artifacts dropped by the existing mediawiki test.

Tests (unittest, no real Redis required; mocks at the get_redis_client boundary):
- tests/.../redis_layer/test_redis_pool.py: pool singleton, prefix constant, reset_pool_for_tests.
- tests/.../dynamic_configs/test_redis_cached_store.py: read-through, write-through invalidation, TTL on SET, cached-None vs miss, not-found NOT cached, encrypted values not mirrored, corrupt entry treated as miss, fail-open on Redis errors. 13 cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* P2: per-user request rate limiter on chat/query endpoints

Layered on top of P1's Redis client. Complements the existing token-budget limiter (token_limit.py): that's a DB-backed COST cap, this is a Redis-backed REQUEST-COUNT cap that's correct across api_server replicas. Both run; this one runs first so a 429'd caller never even touches the DB-backed usage query.

Default OFF. Enable per-environment via:

    REQUEST_RATE_LIMIT_ENABLED=true
    REQUEST_RATE_LIMIT_PER_MINUTE=<N>   # 0 = disable that window
    REQUEST_RATE_LIMIT_PER_HOUR=<N>

* server/middleware/request_rate_limit.py: fixed-window buckets keyed by floor(time/window). Atomic INCR + EXPIRE NX so the bucket boundary is fixed on first increment (without NX, every request would push expiry forward and the bucket would never reset; that bug is covered by an explicit test). Authenticated users are keyed by uuid; anonymous callers are keyed by the first X-Forwarded-For hop, falling back to the socket peer; if neither yields an IP we skip (better than bucketing every anonymous request under "").
* Fail-OPEN on any Redis error: a Redis blip lets requests through with a warning, never wedges the chat path.
* 429 response carries a Retry-After header with seconds-until-bucket-rollover so well-behaved clients back off precisely.
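The fixed-window scheme described above (bucket keyed by floor(time/window), INCR plus EXPIRE NX, fail-open, Retry-After until rollover) can be sketched roughly as follows. This is an illustration, not the contents of request_rate_limit.py: the function names and the `danswer:ratelimit:` key layout are assumptions, and `client` is assumed to be a redis-py client whose `expire` accepts `nx=True` (the NX option needs a Redis 7.0+ server).

```python
import math
import time
from typing import Optional


def bucket_key(identity: str, window_seconds: int, now: float) -> str:
    """Key for the fixed window containing `now` (hypothetical key layout)."""
    bucket = int(now // window_seconds)
    return f"danswer:ratelimit:{window_seconds}:{identity}:{bucket}"


def seconds_until_rollover(window_seconds: int, now: float) -> int:
    """Whole seconds until the current bucket ends, for Retry-After."""
    return max(1, math.ceil(window_seconds - (now % window_seconds)))


def is_allowed(
    client, identity: str, limit: int, window_seconds: int, now: Optional[float] = None
) -> tuple[bool, int]:
    """Return (allowed, retry_after_seconds). A limit of 0 disables the window."""
    if limit <= 0:
        return True, 0
    now = time.time() if now is None else now
    key = bucket_key(identity, window_seconds, now)
    try:
        pipe = client.pipeline()  # redis-py pipelines run as MULTI/EXEC by default
        pipe.incr(key)
        # NX: set the TTL only if none exists, so later requests in the same
        # bucket do not push the expiry forward.
        pipe.expire(key, window_seconds, nx=True)
        count, _ = pipe.execute()
    except Exception:
        # Fail open: a Redis problem must not block the chat path.
        return True, 0
    if count > limit:
        return False, seconds_until_rollover(window_seconds, now)
    return True, 0
```

A caller would map a `False` result to a 429 response and copy the second value into the Retry-After header.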
* Wired as a FastAPI Depends on:

    POST /chat/send-message
    POST /direct-qa/stream-answer-with-quote

  Both endpoints also keep the existing check_token_rate_limits.

Tests (unittest, mocked Redis pipeline; no real Redis required):
- default-OFF short-circuits before any Redis call (covers both REQUEST_RATE_LIMIT_ENABLED=false AND both windows = 0).
- within-limit: N requests under cap all allowed.
- over-limit raises 429 with Retry-After header.
- per-user isolation: distinct users have independent counters.
- bucket rollover resets count (time-mocked).
- EXPIRE NX semantics: locks down the no-sliding-TTL invariant.
- anonymous keyed by XFF first hop; no-IP skips silently.
- fail-open: Redis error doesn't propagate.

9 cases total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Persona list cache with explicit write-through invalidation

GET /persona (Manage Assistants → "View available assistants") fires get_personas(user_id, ...), a multi-OR permission-filtered query joining Persona, Persona__User, Persona__UserGroup, User__UserGroup. At hundreds of concurrent users opening the chat UI around the same time, the burst puts unnecessary pressure on the DB connection pool (which is the actual scaling ceiling for streaming chat; see REDIS_CACHING_PLAN.md).

Design: global cache + per-user filter (not per-user response cache), so the multi-user-burst pattern collapses 200 queries into ~1:

    danswer:personas:all:not_deleted    global, all PersonaSnapshot including is_public / users / groups (PersonaSnapshot already carries the permission inputs; no separate payload shape needed)
    danswer:personas:groups:{user_id}   per-user, list[int] of group ids

At request time the cached list is filtered in Python mirroring the SQL OR-block exactly:

    is_public
    OR user.id IN persona.users.id
    OR (user_group_ids ∩ persona.groups)

The parity vs SQL is locked down by tests (one per branch + negative).
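The three-branch filter above translates fairly directly into Python. The sketch below is illustrative only: `PersonaView` is a simplified stand-in for PersonaSnapshot, and `visible_personas` is a hypothetical name. Treating `user_id=None` as the unfiltered admin path follows the routing test mentioned in the commit message.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PersonaView:
    """Simplified stand-in for PersonaSnapshot (only the permission inputs)."""

    id: int
    is_public: bool
    user_ids: set[str] = field(default_factory=set)
    group_ids: set[int] = field(default_factory=set)


def visible_personas(
    personas: list[PersonaView], user_id: Optional[str], user_group_ids: set[int]
) -> list[PersonaView]:
    """Filter a cached persona list the way the SQL OR-block would."""
    if user_id is None:
        # Admin path: no per-user filtering and no group lookup.
        return list(personas)
    return [
        p
        for p in personas
        if p.is_public                       # branch 1: public persona
        or user_id in p.user_ids             # branch 2: shared directly with the user
        or bool(user_group_ids & p.group_ids)  # branch 3: shared via a group
    ]
```

Because the filter runs on the shared cached list, only the small per-user group list differs between users.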
Invalidation is explicit + write-through: - 9 mutation paths in db/persona.py call invalidate_personas_all() AFTER db_session.commit() (after-commit ordering avoids stale-fill race during open transactions). - 3 paths in ee/danswer/db/user_group.py (insert/update/prepare-delete) call invalidate_user_groups(uid) for each affected user. - 24h TTL is ONLY a safety net for missed busts; primary mechanism is explicit so persona/group edits are visible immediately. - Default OFF (PERSONA_CACHE_ENABLED=false); enable per environment. - Fail-OPEN on every Redis op: a Redis outage degrades latency, not availability, and a failed bust doesn't roll back the DB write. - include_deleted=True falls through to direct DB (uncommon shape; we deliberately don't cache it). Encrypted values: N/A — PersonaSnapshot has no encryption-at-rest guarantee to bypass (unlike the KV store layer from P1). Tests (17, mocked Redis + db boundary, no real services): - 6 filter-parity cases (one per SQL branch + mixed + zero-groups edge) - 2 user-group cache cases (miss/hit, TTL propagation) - 3 routing cases (disabled fallthrough, include_deleted bypass, admin user_id=None path skips group lookup) - 4 invalidation cases (right key for each side, disabled short-circuit, Redis-error-during-bust swallowed) - 2 fail-open-on-read cases (GET error → miss, corrupt entry → miss) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Manage Assistants page UX overhaul Replaces the prior "move up / move down inside a 3-dot popover" flow on /assistants/mine with eight coordinated changes. Backend unchanged — the existing PATCH /api/user/assistant-list endpoint already accepts the full chosen_assistants array, so every interaction lands as one optimistic local update + one PATCH + (on failure) a rollback. 1. Drag-and-drop reorder via @dnd-kit (already in package.json) with a grab handle on each visible row. 
Pointer activation distance of 6px so clicks on the handle don't accidentally start a drag. Keyboard reordering comes for free via dnd-kit's default activator focus. 2. Explicit "set as default" — pin/star icon on each visible row; filled when the row is the user's default (position 0 of chosen_assistants), with an accent border + "Default" chip on that row. Ordering and default are now orthogonal — reorder freely without accidentally changing your default. 3. Visibility as a row-level switch instead of a buried "Hide / Remove" popover item. One unified list with a "Hidden (N)" divider; hidden rows render at reduced opacity and have no drag handle (no position to drag to). The prior separate "Active Assistants" / "Your Hidden Assistants" sections collapse into this single list. Refuses to hide the last visible row (can't ship the user a broken picker). 4. Client-side search filter — matches name, description, or tool name. Applies to both visible and hidden sections so search-then-toggle for "where did I put X" is one motion. 5. Information density rebalanced. Description is now the primary signal (was the smallest text). Tools/sources collapse into compact "{n} tools" / "{n} sources" chips so the row scans for "should I pick this?" not "what are its internals?". Full tool list reveals on hover via title attribute. 6. Bulk select column + sticky action bar. Checkbox appears on hover or focus and stays visible when selected. Action bar shows Show / Hide / Remove + Clear when anything is selected. Refuses bulk-hide that would empty the visible list. 7. Header cleanup: title + 1-line subtitle + Create button top-right, "Browse all available" as a text link. The prior two giant nav tiles + paragraph of explanatory copy are gone — recovers vertical space on a page whose real content is the list. 8. Undo on every state-mutating toast (reorder / set-default / hide / show / bulk ops). 
PopupSpec gains an optional `undo: { onClick }` field; the popup stays on screen 6s instead of 4s when undoable so the user has time to react. Undo restores the prior chosen_assistants array via another PATCH — symmetric round-trip, no special endpoint. New helpers in lib/assistants/updateAssistantPreferences.ts: reorderAssistantList(newOrder) — full-array PATCH (drag, undo) setDefaultAssistant(id, list) — move id to position 0 bulkRemoveFromList(ids, list) — set difference bulkAddToList(ids, list) — set union, appended at end The pre-existing moveAssistantUp/Down helpers are kept (other callers may still import them) but no longer used on this page. Verified: npx tsc --noEmit clean across web/ (0 errors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Assistant Gallery page UX overhaul Sister rework to the Manage Assistants page. With 50+ accessible assistants and growing, the old flat 2-column grid had no hierarchy and no status signal — every card looked identical regardless of whether it was yours, shared, public, or already in your picker. Same conceptual fix as Manage: give the page structure so scanning answers "what's mine?", "what's new?", "what does this one do?". Backend unchanged — every interaction PATCHes chosen_assistants via the existing /api/user/assistant-list endpoint (same path the Manage page uses). All mutations are optimistic + undoable. Changes (numbers map to the design proposal): 1. Per-card "In your picker" badge + muted card style when added. Eye now finds the un-added ones in a glance. 2. Three implicit sections: Yours / Shared with you / Featured & Built-in. Empty sections hide; section headers carry counts. 3. Filter chip rows: availability (All / Available to add / Already added) with live counts, plus auto-generated per-tool chips for tools that appear in ≥2 assistants (avoids chip-bloat as the dataset grows). Tool filters use OR semantics. 4. 
Owner display: best-effort name from the email local-part (split on '@', dots/underscores→spaces) with a "Built-in" badge for default_persona assistants. Kills the fork-specific "Author: Darwin" magic string. 6. Responsive grid: 1 / 2 / 3 / 4 cols by breakpoint. 7. Header matches the Manage rebuild — title + subtitle + Create button top-right, "Back to my assistants" as a text link. Cut the giant centered nav button and the explanatory paragraph. 8. Sort dropdown: Featured (API order, respects admin display_priority) / A → Z / Newly added (id desc proxy for recency). 9. Search broadened to name + description + tool names + document-set names. Empty-result state with a real "Clear all filters" button. 10. Compact "{n} tools" / "{n} sources" chips with hover-reveal of the full tool list. Flat Add/Remove buttons replace Tremor's color="green/red" which was visually shoutier than the action. 11. Design tokens fixed — border-border / focus-ring-accent in place of hardcoded gray-300 / blue-500. Consistent with the rest of the app. Skipped (per proposal): - #5 detail drawer / modal — revisit after observing how users use the new grid; bigger feature. - Bulk select — adding 5 assistants at once isn't a real use case here (bulk hide on Manage was). The pre-existing addAssistantToList / removeAssistantFromList helpers are kept and used at the call sites. The shared reorderAssistantList helper added in the prior commit is reused for the undo paths. Verified: npx tsc --noEmit clean across web/ (0 errors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add backend/scripts/seed_assistants.py for local UX testing Dev-only seed script that creates N varied personas in the local DB for exercising the redesigned gallery / manage pages. Refuses to run when POSTGRES_HOST looks like a managed/prod database (azure.com, amazonaws.com, .cloud., "prod", etc.) — guard against pointing this at the wrong env by accident. 
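The host guard mentioned for the seed script could be as simple as a substring check. This is a sketch under assumptions, not the script's actual code: the marker tuple contains only the patterns named above (the original list ends in "etc."), and the function names are hypothetical.

```python
import os
from typing import Optional

# Only the markers named in the commit message; the real list may be longer.
PROD_HOST_MARKERS = ("azure.com", "amazonaws.com", ".cloud.", "prod")


def looks_like_managed_db(host: str) -> bool:
    """True when the host name resembles a managed or production database."""
    lowered = host.strip().lower()
    return any(marker in lowered for marker in PROD_HOST_MARKERS)


def assert_safe_to_seed(host: Optional[str] = None) -> None:
    """Abort before touching the database if the target looks like prod."""
    if host is None:
        host = os.environ.get("POSTGRES_HOST", "localhost")
    if looks_like_managed_db(host):
        raise SystemExit(f"Refusing to seed: POSTGRES_HOST={host!r} looks like prod")
```

A substring match is deliberately conservative: it will also refuse a local host that merely contains "prod" in its name, which errs on the safe side for a dev-only tool.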
Mix is designed to populate each section of the new gallery: ~30% "Yours" — owned by target user, private ~20% "Shared with you" — owned by another user, target user in users[] ~50% "Featured" — public, no specific owner Per row randomly attaches 0–3 tools and 0–2 document sets so the {n} tools / {n} sources chips render with variety. Half of "Yours" auto- land in chosen_assistants (and all "Shared with you" do), so the "Already added" vs "Available to add" filter chips have content on both sides without manual setup. 60 distinct names + 30 description templates so 50 rows feel populated and varied. Uses a fixed RNG seed by default (deterministic across runs). Name prefix "[seed] " makes rows easy to spot and to wipe via --clear. Usage: cd backend && source ../.venv/bin/activate python -m scripts.seed_assistants --email you@example.com python -m scripts.seed_assistants --clear # wipe and re-seed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Assistants UX polish: toggle highlight + gallery declutter Two follow-ups on the assistants UX rework, both from user feedback. Manage (/assistants/mine): * Toggle now accepts a `highlight` prop that draws a transient ring + slight scale-up on the switch. Used by hidden rows so a click anywhere on the (faded) row body flashes the toggle for ~1.2s, pointing the eye at the action that brings the assistant back. Doesn't auto-enable on body click — surprising a user mid-read into enabling would be a worse outcome than the discoverability gap. * Restructured opacity: only the content column (icon + name + description + chips) fades when a row is hidden. The action zone (checkbox, drag-slot, pin, toggle, share, edit) stays at full opacity so the toggle is the bright, clickable target on a dim row. Previously the parent opacity-50 cascaded to every child, making the toggle the dimmest thing on the dimmest row. 
* stopPropagation on the action zone so clicks on buttons inside it don't trigger the row-body flash handler. Gallery (/assistants/gallery): * Removed all tool-related UI per user request — the page is for browsing assistants, and tool filter chips + per-card "{n} tools" pulled focus from the assistant itself. Gone: the auto-generated tool filter chip row, the per-card tools chip, the toolDisplayName / toolIcon helpers, the FiTool / FiImage / FiCheck imports, and the toolFilters state + commonTools memo. Search hay is now name + description + document-set names (no tool names). * Dropped the absolute top-right "In your picker" badge. The muted card style (border + opacity-75) plus the "Remove" button in the footer already signal "added"; the badge ate horizontal space (pr-24 on the header reserved 96px) and crowded the title at narrower widths. Removed the pr-24 reservation now that nothing overlays the header. * Grid capped at `1 / 2 / 3` cols — 1 on mobile, 2 on most laptops and standard desktops, 3 only at `2xl` (≥1536px). Previously 1/2/3/4 with the 4-col breakpoint making cards cramped and hard to read once descriptions hit their 3-line clamp. * Bumped card padding p-4 → p-5 and description line-height to leading-relaxed for breathing room. * Updated clearAllFilters / hasAnyFilter to drop the toolFilters references (now dead). Verified: npx tsc --noEmit clean across web/ (0 errors), zero stray references to the removed helpers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Remove tools chip from Manage Assistants page Mirrors the gallery treatment from the previous commit. The user reported tool execution isn't reliable yet, and surfacing "{n} tools" on assistant rows misleads users into picking an assistant for a capability that may not work in practice. Dropped: the {n} tools Bubble in the row's chip block, the toolCount derivation, and the FiTool import. 
The {n} sources chip stays — it's about the assistant's knowledge scope, which works fine. Verified: npx tsc --noEmit clean across web/ (0 errors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Parameterize gallery grid column count (default 3) AssistantsGallery now accepts an optional `columns` prop (default 3, supported 1-5). Responsive scaling below the widest breakpoint is fixed per row of GRID_CLASSES — each row is a complete static Tailwind class string so the purge step actually emits the classes (dynamic `md:grid-cols-${n}` would silently disappear at build). Unsupported values fall back to the default rather than rendering broken — a bad prop here shouldn't break the page. The single existing caller (page.tsx) doesn't pass columns, so it gets the default 3 — same layout as before. To switch to 4 columns on a wide-monitor deployment: `<AssistantsGallery columns={4} ... />`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Show document-set names on assistant cards (was: count only) A "{n} sources" chip told users how MUCH knowledge an assistant had access to but not WHICH knowledge — defeating the point of the chip for someone deciding "which assistant should I pick for this task?". Both the Manage page row and the Gallery card now render one Bubble per document-set name, capped at MAX_VISIBLE_DOC_SETS (3). When an assistant points at more than that, a "+N more" pill collects the overflow with the rest of the names exposed via the title tooltip, so we don't blow the card width or row layout at narrower column counts. Each name chip caps at a max-width with truncate + a hover title, so a single absurdly long document-set name can't push the actions off the row. Verified: npx tsc --noEmit clean across web/ (0 errors). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Gallery: user-controllable column count (segmented control, persists)

Adds a small "Columns 2 | 3 | 4" segmented control at the right end of the filter row (next to Sort). The pick persists to localStorage under "danswer:assistants-gallery:columns" so it survives reloads on the same device.

State precedence:

    user choice (localStorage)   wins
      ↓
    prop `columns` from caller   (default for new users / new device)
      ↓
    DEFAULT_COLUMNS = 3          (final fallback)

The localStorage read happens in a useEffect so SSR + first paint use the prop value, which avoids a hydration mismatch whenever the stored value disagrees with the prop. localStorage writes are wrapped in try/catch because some sandboxed contexts (private modes, restrictive iframes) throw on access; the control still works for the session, it just doesn't persist there.

Picker is hidden below md (768px) because the layout falls back to 1 column at that width regardless of the chosen value. Exposed options are 2/3/4: 1 is mobile-only via responsive, 5 is too cramped for typical screens (GRID_CLASSES still supports 5 if a deployment wants to set it via prop).

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Gallery: column picker as dropdown to match Sort

Reverts the segmented "2 | 3 | 4" button group to a single <select> that mirrors the existing Sort dropdown for visual consistency on the "view controls" cluster at the right end of the filter row. Behavior unchanged: pure client-side state + localStorage persistence, no fetch and no router.refresh() in the column path; the user's column choice never triggers a backend call.

Verified: npx tsc --noEmit clean across web/ (0 errors).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Scale indexing via remote Dask scheduler topology Replace the monolithic supervisord background pod with separate deployments for celery-worker, celery-beat, indexer-scheduler, dask-scheduler, and dask-worker. The indexer-scheduler now reads DASK_SCHEDULER_ADDRESS to dispatch run_indexing_entrypoint to a remote Dask cluster instead of an in-process LocalCluster, so indexing throughput scales horizontally with dask-worker replicas instead of being capped by one pod's RAM. Local dev keeps the LocalCluster path (no env var); a new scripts/dev_run_dask_distributed.py and docker-compose overlay reproduce the prod-shape topology without K8s. scripts/test_dask_distributed_e2e.py exercises the topology (parallelism, worker death, scheduler death) end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add MIGRATION.md covering Redis / bg-scaling / UX Single migration doc covering all three slices on this branch: 1. Background indexing scaling (Dask topology) 2. Redis caching + rate limiting 3. Assistants UX rework Organized for an operator: TL;DR up top ("everything default OFF"), new deps/env/secrets summarized, deployment order, verification checklist BEFORE flipping any flags, per-feature enable steps, and the known footguns (k8s manifests missing REDIS_PASSWORD env wire-up in the bg-scaling path, seed script bypassing persona cache, CLAUDE.md update.py gate). Plus the recommended manual test list and the branch's commit map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * darwin-kubernetes: port split-background manifests + lock convention in AGENTS.md The bg-scaling commit (03d1649f) added 5 new k8s manifests under `deployment/kubernetes/` that split the combined background pod into beat / celery / indexer-scheduler / dask-scheduler / dask-worker. 
But Darwin doesn't apply from `deployment/kubernetes/` — its prod manifests live under `darwin-kubernetes/`, and the two trees aren't kept in sync. Porting all five into `darwin-kubernetes/` with Darwin conventions: - Image registry sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend - configMapRef env-configmap, secretKeyRef danswer-secrets - POSTGRES_USER / POSTGRES_PASSWORD wired everywhere that talks to PG - REDIS_PASSWORD wired as optional secretKeyRef (the latent footgun flagged in MIGRATION.md §10a is now closed for the Darwin path) - indexcpu nodeAffinity + darwin/indexing toleration on every indexing-side pod (celery, indexer-scheduler, dask-scheduler, dask-worker); beat stays on the default pool (lightweight) - dynamic-pvc + file-connector-pvc volume mounts where any task may stage files The existing `darwin-kubernetes/background-deployment.yaml` (combined beat+celery+indexer via supervisord) is intentionally LEFT IN PLACE — the split is an opt-in rollout, not a forced cutover. To switch: apply the new five, verify the new pods are healthy, scale the combined deployment to 0. Also lock the convention in AGENTS.md so this doesn't recur: - New divergence-table row noting darwin-kubernetes/ is source of truth for prod. - New "Critical facts that bite" §9 documenting the two-tree split, when to touch which, and the per-pod adaptation checklist (image registry, configmap, secrets, REDIS_PASSWORD, affinity, PVCs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(MIGRATION.md): reflect the darwin-kubernetes port §5b — Dask topology section now points at the actual ported darwin-kubernetes/*.yaml manifests with a concrete cutover script, not just "you'll need to port these later" boilerplate. §10a — Footgun is RESOLVED for the Darwin path (the 5 new Darwin manifests all wire REDIS_PASSWORD via optional secretKeyRef). Marks the entry as such rather than removing it, so the history of "why was this previously a concern" stays readable. 
§12 — Commit count, file count, and totals updated for the two new commits (MIGRATION.md itself + the darwin-kubernetes port). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(MIGRATION.md): update base reference after add-claude merge rajiv/add-claude was merged to feature/darwin upstream, so the doc's "on top of rajiv/add-claude (PR #45)" reference is stale. The branch is now rebased onto origin/feature/darwin directly — same diff, just a fresher base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Introduce k8s/ — kustomize-based manifests replacing darwin-kubernetes/ Single source of truth for both production (Darwin AKS) and local dev, with image tags + env values + secrets externalised so a deploy is "edit one file, kubectl apply -k". No Helm. Replaces the flat darwin-kubernetes/ tree (which the operator will delete once they've verified the new structure against the live cluster). Layout: k8s/base/ Cleaned env-neutral manifests (one file per logical service; deployment+service merged where natural). Image refs use logical names (e.g. `danswer-backend`) which overlays rewrite to concrete registry+tag. k8s/overlays/prod/ Darwin AKS production: kustomization.yaml → images, replicas env.properties → non-secret config secrets.env.example → template (committed) secrets.env → real values (gitignored) k8s/overlays/local/ Same shape, local-dev defaults (host.docker.internal, latest tags, AUTH_TYPE=disabled, smaller replicas). k8s/optional/ Opt-in deployments not part of base: redis.yaml background-{beat,celery,indexer-scheduler}.yaml dask-{scheduler,worker}.yaml Apply with `kubectl apply -f <file>` when rolling out the corresponding feature. k8s/README.md Layout explanation + common workflows (image bump, env change, secret rotation, Redis rollout, migrating off darwin-kubernetes/). Built from the live-cluster dump in darwin-kubernetes/temp/ (gitignored, never committed). 
The cleaner script (intentionally not committed) strips status, uid, resourceVersion, generation, creationTimestamp, managedFields, last-applied-configuration annotation, restartedAt, progressDeadlineSeconds, revisionHistoryLimit, and the auto-assigned clusterIP/ipFamilies/sessionAffinity on Services. Image references in base/ are normalised to logical names so kustomize can rewrite them. SECURITY: the live env-configmap was discovered to contain real plaintext secrets — Slack tokens, GEN_AI client secret, Jira token, Opsgenie key. The new structure moves all of those to k8s/overlays/*/secrets.env (gitignored) which renders into a kustomize-generated Secret. api-server and background deployments gain `envFrom: secretRef: danswer-secrets` so the moved values continue to reach the app as env vars. Rotation of the leaked credentials is a separate operator task — every "REPLACE_ME" in secrets.env.example marked LEAKED is one of them. Validation: kubectl kustomize k8s/overlays/prod → 26 resources, clean render kubectl kustomize k8s/overlays/local → 26 resources, clean render Image substitution verified in both. .gitignore additions: darwin-kubernetes/temp/ Live cluster dumps k8s/overlays/*/secrets.env Real secret values per environment k8s/overlays/*/*.secrets.env Defensive (any *.secrets.env variant) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: opt-in Redis via kustomize component (local includes, prod doesn't) In-cluster Redis is now an opt-in kustomize Component at k8s/optional/redis/, included by the local overlay via `components: [../../optional/redis]` and NOT by prod (which uses managed Redis instead). Why Component instead of `resources: ../../optional/redis.yaml`: kustomize's load restrictor rejects file-resource refs that escape the overlay's directory tree. Components are explicitly designed for opt-in cross-tree refs and pass the security check; they also let us add patches later that only apply when the component is opted in. 
Layout change:
  before: k8s/optional/redis.yaml
  after:  k8s/optional/redis/
            kustomization.yaml (kind: Component)
            redis.yaml

The plain `kubectl apply -f k8s/optional/redis/redis.yaml` or `kubectl apply -k k8s/optional/redis` workflows still work; the file just moved one level deeper.

env.properties updates:
  local: REDIS_HOST=redis (the in-cluster Service name, matching the component's deployment)
  prod:  REDIS_HOST=<your-managed-redis>.redis.cache.windows.net (placeholder for Azure Cache for Redis; rename + drop the access key into secrets.env as `redis_password` when you adopt managed Redis)

Validated:
  kubectl kustomize k8s/overlays/prod  → 26 resources (no Redis)
  kubectl kustomize k8s/overlays/local → 28 resources (+Service +StatefulSet)

README updated with the components pattern and how to add more opt-in features the same way (split-background, Dask, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: prod now uses in-cluster Redis (was: managed Redis); bump memory to 512MB

Reversal of the earlier "prod will use managed Redis" decision. Prod overlay now opts into the same in-cluster Redis component as local:
  k8s/overlays/prod/kustomization.yaml: adds
    components:
      - ../../optional/redis
  k8s/overlays/prod/env.properties: REDIS_HOST back to `redis` (the in-cluster Service name)

Redis StatefulSet bumped from 256MB to 512MB:
  --maxmemory                256mb → 512mb
  resources.requests.memory  128Mi → 256Mi
  resources.limits.memory    384Mi → 1Gi

Limit set to ~2x maxmemory rather than 1.5x because the single-replica StatefulSet has no failover: OOM = cache outage. Redis uses extra RSS beyond --maxmemory for client output buffers, COW pages during BGSAVE (if we ever turn on persistence), and fragmentation; safer to over-provision the cgroup limit and let `maxmemory-policy: allkeys-lru` do its job inside Redis's own accounting.
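The headroom reasoning above can be checked with a small arithmetic sketch. The helper names (`mib`, `headroom_ratio`) are hypothetical and not part of the repo; the quantities are the ones quoted in the commit message, and Redis's `mb` suffix is treated as MiB.

```python
def mib(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '384Mi' or '1Gi' to MiB."""
    units = {"Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity}")


def headroom_ratio(limit: str, maxmemory_mb: int) -> float:
    """Ratio of the container memory limit to Redis --maxmemory."""
    return mib(limit) / maxmemory_mb


# Old sizing: 384Mi limit over 256mb maxmemory.
old_ratio = headroom_ratio("384Mi", 256)
# New sizing: 1Gi limit over 512mb maxmemory.
new_ratio = headroom_ratio("1Gi", 512)
print(old_ratio, new_ratio)
```

The old pair works out to 1.5x and the new pair to 2.0x, matching the figures in the message.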
Validated: kubectl kustomize k8s/overlays/prod → 28 resources (now includes Redis) kubectl kustomize k8s/overlays/local → 28 resources (unchanged) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: move redis from optional component to base Both prod and local overlays opted into the in-cluster Redis component, so it's no longer optional — promoted to base/redis.yaml and added to base/kustomization.yaml. Removed the now-redundant `components:` blocks from both overlays and the optional/redis/ component dir. Net effect is identical (prod + local still render 28 resources each, both including Redis) — just less indirection now that Redis is universal rather than opt-in. README updated: optional/ table drops the redis row with a note that it moved to base; the components: "flag" explanation now points at the split-background deployments as the example opt-in. Validated: kubectl kustomize k8s/overlays/prod → 28 resources (redis in base) kubectl kustomize k8s/overlays/local → 28 resources (redis in base) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(k8s/README): drop migration narrative + darwin-kubernetes references The README is the doc for the k8s/ layout as it stands, not a record of how it came to be. 
Removed: - "Replaces the older darwin-kubernetes/ tree" subtitle - the whole "Migration plan (deleting darwin-kubernetes/)" section - the "darwin-kubernetes/ is being retired" + temp/ convention bullets Also fixed two bits left stale by moving Redis into base: - structure diagram listed Redis under optional/ → now correctly omits it (it's in base) - "Roll out Redis" workflow told you to `kubectl apply -f k8s/optional/redis.yaml` → rewritten as "Redis ships in base; flip the env flags to enable the cache/rate-limiter features" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(k8s/README): add instructions for applying optional/ manifests optional/ holds plain manifests (no kustomization), so they need `kubectl apply -f` and aren't picked up by `apply -k overlays/...`. Added a workflow covering: - single-file and whole-folder apply - the dependency on the overlay being applied first (optional pods reference the overlay-generated env-configmap / danswer-secrets) - the full split-background + Dask cutover in dependency order (scheduler/workers → split bg pods → scale down combined), plus rollback and the dual-beat warning Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: make optional manifests parameterized like base (component + logical images) The optional/ manifests hardcoded the image tag (sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-138) while base uses the logical name `danswer-backend` that the overlay's images: block rewrites. That inconsistency meant a tag bump had to be made in two places and the optional pods could drift from the rest of the cluster. Fix: grouped the five split-background + Dask manifests into a single kustomize Component at k8s/optional/background-scaling/, and changed their image refs to the logical `danswer-backend`. 
When an overlay opts in via `components: [../../optional/background-scaling]`, the overlay's existing `images:` entry for danswer-backend parameterizes them — same tag as api-server / background, set in one place. Verified: temporarily opting the component into the prod overlay renders all five bg-scaling pods with sfbrdevhelmweacr.azurecr.io/danswer/ danswer-backend:vha-138 (34 resources total), then reverted. Neither overlay opts in by default (prod/local still 28 resources each). Layout change: before: k8s/optional/{background-beat,background-celery, background-indexer-scheduler,dask-scheduler,dask-worker}.yaml (plain manifests, hardcoded image, applied via kubectl apply -f) after: k8s/optional/background-scaling/ kustomization.yaml (kind: Component) <same five manifests, logical image name> (opted into via the overlay's components: block) README updated: optional/ is now described as opt-in components with logical-image parameterization; the apply workflow switched from `kubectl apply -f` to the components: + replicas:0 overlay edits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: fully parameterize background-scaling component (replicas, env, env-neutral) Follow-up to the component conversion — three remaining inconsistencies vs base that were pointed out: 1. Replicas were hardcoded per-manifest. Removed them from the manifests and moved the counts into the component's kustomization.yaml `replicas:` block (one place; mirrors how the overlay parameterizes base replicas). dask-worker=3 is the indexing-throughput knob. 2. Secret/config loading differed: the component had an extra explicit REDIS_PASSWORD secretKeyRef that base doesn't. Dropped it so every pod's env block is byte-identical to base/background.yaml — explicit POSTGRES_USER/POSTGRES_PASSWORD via secretKeyRef + envFrom [configMapRef env-configmap, secretRef danswer-secrets]. 
(redis_password still reaches the app via the envFrom secretRef like every other secrets.env key; the explicit entry was redundant and base-divergent.) 3. Manifests carried Darwin-specific node affinity + darwin/indexing tolerations, which base does NOT (base is env-neutral; the live cluster runs without pool affinity). Stripped them so the component is environment-neutral and won't fail to schedule on a local cluster that lacks the indexcpu pool. The prod overlay re-adds indexcpu affinity + toleration via a patch when it opts in — documented in the README opt-in steps with a ready-to-paste patch block. Verified end-to-end: opting the component into prod renders 34 resources, all five bg-scaling pods get sfbrdevhelmweacr.azurecr.io/danswer/ danswer-backend:vha-138, replicas come from the component kustomization (beat=1, celery=2, worker=3), background-deployment scaled to 0. Default (not opted in): prod/local both 28 resources. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(k8s/README): explicit apply/preview/verify commands for background-scaling The apply command was present but buried under the overlay-edit YAML block and read as the generic overlay apply. Made the deploy commands explicit and labeled: preview the rendered bg/dask pods, kubectl diff vs live, apply, and rollout-status watches. Also stated plainly why there's no standalone `kubectl apply -f` for the component (logical image name only resolves through the overlay's images: block). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: right-size background-scaling for a few-hundred-user deployment dask-worker 3 → 2, background-celery 2 → 1. The originals were inherited defaults from the cherry-picked feature/backgroundscaling commit, not sized to Darwin's load. - dask-worker=2: each pod runs one connector at a time (--nworkers=1 --nthreads=1), so this caps concurrent indexing at 2. Enough unless many connectors backlog in NOT_STARTED; raise then. 
Cuts the worst-case indexcpu footprint by a third (was 3×4Gi, now 2×4Gi).
- background-celery=1: Celery here only runs maintenance tasks (prune, sync, deletion, analytics rollup), NOT indexing. One pod already autoscales 3-10 threads (--autoscale=3,10), which easily covers the bursty maintenance queue at this scale. The 2nd replica was redundancy we don't need.

Added inline comments noting which counts are singletons that must stay at 1 (beat, indexer-scheduler, dask-scheduler) vs the throughput knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: add slack-listener to background-scaling component

The combined `background` supervisord pod ran 5 programs; the split component covered 4 (indexer-scheduler, celery, beat; via dask too) but NOT slack_bot_listener. Migrating to the split topology + scaling the combined pod to 0 would have killed the Slack bot, which Darwin uses.

Adds slack-listener-deployment running `python danswer/danswerbot/slack/listener.py`, modeled on the celery manifest: logical danswer-backend image, env-neutral, env-configmap + danswer-secrets (the DANSWER_BOT_SLACK_* tokens arrive via the envFrom secretRef). SINGLETON (count: 1 in the component kustomization): the listener holds a Slack Socket Mode websocket; a second replica would double-process every event.

Added to the prod affinity patch's labelSelector in the README so it lands on the indexcpu pool with the other app pods.

Validated: opting the component into prod now renders 35 resources (was 34), slack-listener gets the prod image tag + replicas=1; default (not opted in) unchanged at 28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: consolidate background-scaling 6 deployments → 4

The split topology had three separate deployments for low-traffic singletons (background-celery, background-beat, slack-listener) that don't scale with indexing load; only dask-worker does.
Collapsed them into one `background-lite` deployment running the three as separate containers in a single pod. This trims pod count + per-deployment resource reservations while keeping each container independently restartable.

Now four deployments (was six):
- background-lite: celery-worker + celery-beat + slack-listener (3 containers, 1 pod, replicas:1; contains beat + the Slack websocket, both singletons)
- background-indexer-scheduler: update.py polling loop (singleton)
- dask-scheduler: Dask scheduler Service + Deployment
- dask-worker: indexing executors (the actual scaling knob)

Chose a multi-container pod over a supervisord-with-custom-conf approach: no ConfigMap to mount, no risk of the custom conf drifting from the image's baked-in supervisord.conf, and each container runs the exact command its former standalone deployment used. strategy: Recreate on the pod so celery-beat never overlaps during a rollout (dup beats double-fire).

Validated: opting into prod renders 33 resources (was 35), background-lite shows containers [celery-worker, celery-beat, slack-listener] at replicas 1, dask-worker at 2, base background scaled to 0. Default (not opted in) unchanged at 28.

README updated: component table, affinity-patch labelSelector (background-lite replaces the three), rollout-status command, and the dual-beat warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: disable MULTILINGUAL_QUERY_EXPANSION in prod overlay

The query-expansion path (secondary_llm_flows/query_expansion.py) builds its fast_llm with a HARDCODED 5s timeout. With no distinct fast model (FAST_GEN_AI_MODEL_VERSION empty), it routes to full gpt-4o via the UiPath gateway, which routinely takes >5s, causing repeated ReadTimeouts on Slack queries (observed in prod logs).

The committed darwin-kubernetes/env-configmap.yaml already had this empty; the LIVE cluster had drifted to "English,Japanese" (set out-of-band, never committed), which is what triggered the timeouts.
Setting it empty in the new prod overlay keeps the go-forward source of truth correct. App reads `os.environ.get(...) or None`, so empty = feature off. To re-enable later: wire FAST_GEN_AI_MODEL_VERSION to a genuinely fast model (gpt-4o-mini / gpt-4.1-mini) so the 5s budget is realistic, or make that timeout env-configurable first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: pin Vespa to 8.600.35 (was :latest → caused prod outage) INCIDENT: applying the kustomize overlay rendered vespaengine/vespa:latest against a cluster running 8.600.35. The bare→:latest image change rolled all Vespa StatefulSets onto 8.696.20 — a >30-release jump. Vespa's config server refuses an auto-upgrade that large (incompatible-upgrade guard, VersionState.verifyVersionIntervalForUpgrade) and crash-looped on ConfigServerBootstrap, taking the whole cluster down (config tier → no quorum → cluster-wide connection-refused 503s on every search + the api-server's ensure_indices_exist). FIX: pin both overlays to 8.600.35 — the version the content nodes' on-disk index is written in, so there is no upgrade and the version check passes. Recovery performed on the live cluster: set all 5 vespa StatefulSets back to 8.600.35, cleared the (now-irrelevant) wedged config-server ZK state, restarted. Content data on the 100Gi content PVCs was never touched. NEVER use :latest for Vespa. Upgrades must be STEPWISE (≤30 releases per hop) and done as a deliberate, ordered operation — not a bare tag bump. busybox also pinned (1.36.1) for the same drift hygiene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: add readiness probes to Service-backed Vespa nodes Vespa nodes had NO probes, so a k8s Service added a pod to its endpoints the instant the container started — before Vespa was actually serving. That's why every Vespa restart re-opened a window of "upstream connect error / connection refused" 503s (the incident). 
Adds readinessProbe (httpGet /state/v1/health) to the three nodes that sit behind a query/deploy Service: - vespa-configserver (:19071) — gates the deploy + inter-node config path - vespa-query (:8080) — gates the search path the app hits - vespa-feed (:8080) — gates the feed/index path Deliberately NOT added to vespa-content / vespa-admin: they aren't behind a query-serving k8s Service (content is reached internally via Vespa's own distributor, admin runs cluster control), and a mis-tuned probe there could pull a healthy node from rotation and make things worse. Readiness ONLY — no liveness probe anywhere. An aggressive liveness check could kill a slow-but-healthy Vespa node mid-bootstrap and cause a restart loop, which is the failure mode we just spent the incident fighting. Generous initialDelay (45s configserver / 30s others) + failureThreshold 6 so normal slow startup doesn't flap nodes out of rotation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(AGENTS.md): add Critical fact §10 — never :latest for Vespa, pin the version Captures the prod-outage learning in the repo's shared operating notes (the section CLAUDE.md routes every agent to read first). Covers: why a big Vespa version jump takes the cluster down (config-server refuses >30-release auto-upgrade), the rule (pin to the running version, 8.600.35; upgrades stepwise; don't force SKIP_UPGRADE_CHECK on prod), and the recovery runbook (re-pin image, restart config servers, redeploy schema; ZK clear only if genuinely corrupt; readiness-only probes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: add guarded-apply.sh — block unsafe Vespa version jumps at apply time Turns the "never jump Vespa >30 minor releases" rule into an enforced guardrail instead of a thing to remember. 
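The decision rule the guard enforces can be sketched as a pure function. This is a Python illustration only: guarded-apply.sh itself is a bash script and its contents are not shown here. The function names and verdict strings are hypothetical, and treating a downgrade of more than 30 minors as "large" is an assumption (the commit message does not give a threshold).

```python
import re
from typing import Optional, Tuple

MAX_MINOR_JUMP = 30  # Vespa's auto-upgrade limit, per the commit message


def parse_tag(image: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch) from an image ref, or None for a floating tag."""
    match = re.search(r":(\d+)\.(\d+)\.(\d+)$", image)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def check_vespa_change(live_image: str, target_image: str, force: bool = False) -> str:
    """Classify a Vespa image change as ok, refused, or needing FORCE."""
    live = parse_tag(live_image)
    target = parse_tag(target_image)
    if live is None or target is None:
        verdict = "refuse: floating or unparseable tag"
    elif target[0] != live[0]:
        verdict = "refuse: major version change"
    elif target[1] - live[1] > MAX_MINOR_JUMP:
        verdict = "refuse: upgrade spans more than 30 minors"
    elif live[1] - target[1] > MAX_MINOR_JUMP:
        # Assumption: "large downgrade" uses the same 30-minor threshold.
        verdict = "needs-force: large downgrade"
    else:
        verdict = "ok"
    if verdict != "ok" and force:
        return "forced: " + verdict
    return verdict


# The incident's jump (8.600.35 to 8.696.20) is 96 minors, so it is refused.
print(check_vespa_change("vespaengine/vespa:8.600.35", "vespaengine/vespa:8.696.20"))
```

Keeping the check a pure function of the two image strings makes it easy to test separately from the kubectl calls that supply them.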
guarded-apply.sh wraps kubectl apply -k and, before applying:
- reads the LIVE running Vespa version (kubectl) + the version the overlay would deploy (kubectl kustomize)
- REFUSES a >30-minor upgrade (Vespa's auto-upgrade limit, the thing that caused the outage)
- REFUSES a major-version change (needs dedicated migration)
- REFUSES a floating/unparseable tag (:latest)
- WARNS + requires FORCE=1 on a large downgrade (legit only when recovering to the on-disk version, which is why downgrade isn't a hard block; our recovery was a 96-minor downgrade)
- otherwise runs kubectl diff, then apply

Checks against LIVE, not the repo's previous pin: config drifts out of git (we saw it), so the running version is the only truth that matters at apply time. FORCE=1 overrides with an explicit "I accept the risk".

Wired into k8s/README.md (Quick start now uses guarded-apply.sh; new "Bump the Vespa version" + "Vespa version guard" sections) and referenced from AGENTS.md §10.

Verified: parses live=8.600.35 vs overlay=8.600.35 (gap 0 → OK); bash -n clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* k8s: add KEDA indexing-autoscale component (opt-in) for dask-worker

Autoscales dask-worker-deployment on real indexing demand instead of a fixed replica count, via a KEDA PostgreSQL scaler.

Metric (validated against the live DB): the number of index attempts that can run CONCURRENTLY right now, respecting INDEXING_PER_SOURCE_CAP:
    SUM over source of LEAST(cap, pending_count_for_source)
not a raw pending count (10 same-source attempts still run 1 at a time under cap=1, so we must not spin up 10 workers). Counting IN_PROGRESS in the metric keeps replicas >= running jobs, so KEDA never scales a busy worker away; scale-to-0 only when there's genuinely no work.
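The metric can be sketched in Python to show how the per-source cap changes the result. This mirrors the stated formula only and is not the SQL the ScaledObject runs; the function name and the `(source, status)` tuple shape are hypothetical stand-ins for rows of index_attempt joined to their connector's source.

```python
from collections import Counter
from typing import Iterable, Tuple

# Statuses that count toward demand: queued work plus work already running.
ACTIVE_STATUSES = {"NOT_STARTED", "IN_PROGRESS"}


def desired_workers(attempts: Iterable[Tuple[str, str]], per_source_cap: int = 1) -> int:
    """SUM over source of LEAST(cap, active attempts for that source)."""
    per_source = Counter(
        source for source, status in attempts if status in ACTIVE_STATUSES
    )
    return sum(min(per_source_cap, count) for count in per_source.values())


# Ten queued attempts on one source still need only one worker under cap=1.
same_source = [("confluence", "NOT_STARTED")] * 10
print(desired_workers(same_source))
```

A raw pending count would return 10 for the same input, which is the over-scaling the metric is designed to avoid.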
Grounded in code + live DB, not guesses: - remote-Dask mode does NOT cap dispatch by NUM_INDEXING_WORKERS, so adding worker pods truly adds parallelism (the in-process LocalCluster path is the only one bounded by that env) - PER_SOURCE_CAP (default 1) is the real concurrency ceiling - index_attempt links directly to connector.connector_id (this fork), not connector_credential_pair_id - status is stored UPPERCASE (Enum native_enum=False) — confirmed live: NOT_STARTED / IN_PROGRESS / SUCCESS / FAILED Shipped as opt-in component k8s/optional/keda-indexing-autoscale/ (ScaledObject + TriggerAuthentication; password from danswer-secrets, no duplication). minReplica 0 (scale to zero when idle), maxReplica 4, cooldown 300s. README documents prerequisites (KEDA operator install; opt in after background-scaling; REMOVE the static dask-worker replicas entry so it doesn't fight the HPA), the scale-down safety reasoning, and a recommended dask-worker graceful-shutdown companion change. Validated: component renders standalone (2 resources) and opted into prod (namespaced to darwin). Not enabled by default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: add KEDA operator install manifest (optional/keda, own namespace) Companion to the keda-indexing-autoscale component: the KEDA operator itself. No Helm — a standalone kustomization referencing KEDA's official release bundle PINNED to v2.14.0 (GitHub release assets are immutable, so the URL is a content pin; never a moving ref — same lesson as Vespa :latest, AGENTS.md §10). It's cluster-scoped infra (CRDs + RBAC + operator), installed ONCE per cluster independent of the danswer overlays — hence kind: Kustomization (apply on its own), NOT a Component layered into prod/local. KEDA's bundle creates and installs into its own `keda` namespace internally, so no `namespace:` override here (that would wrongly re-namespace the cluster-scoped CRDs). 
Install: kubectl apply --server-side -k k8s/optional/keda (--server-side required — KEDA CRDs exceed the client-side last-applied-configuration annotation limit) README updated: optional/ table now distinguishes Components (opt into an overlay) from Standalone installs (apply on their own), and the KEDA autoscale prereq points at `kubectl apply -k k8s/optional/keda`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(k8s/README): document mandatory pod restarts after a config change Fixes a misleading instruction and adds the missing operational step the user hit: after `kubectl apply -k`, ConfigMap changes do NOT auto-roll pods, because disableNameSuffixHash:true keeps the name stable (the hash-suffix is exactly what would trigger a rollout). envFrom reads env only at pod start, so running pods keep stale values until restarted. Changes: - "Add a new env var": corrected step that implied auto-pickup; now states you must restart consumers. - "Enable the Redis cache + per-user rate limiter": adds the explicit `kubectl rollout restart deploy/api-server-deployment deploy/background-deployment` after apply, clarifies the rate limit is PER-USER, and includes PERSONA_CACHE_ENABLED. - New "Which workloads to restart after a config change" table mapping changed vars → workloads (Redis flags → api-server + background; model vars → model servers; etc.), plus the split-background variant (background-lite / indexer-scheduler / dask-worker, no background-deployment). - disableNameSuffixHash footgun now spells out the manual-restart consequence. Also commits the prod env.properties with the Redis features enabled (REDIS_KV_CACHE_ENABLED / REQUEST_RATE_LIMIT_ENABLED 20-per-min,300-per-hr / PERSONA_CACHE_ENABLED) — the user turned these on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * k8s: add startup + readiness probes to api-server (own /health, not deps) api-server had no probes. 
Adds:
- startupProbe on /health:8080: the container runs `alembic upgrade heads` before uvicorn, so /health isn't up until migrations finish; this allows ~5min (30×10s) for migrations+boot before readiness/liveness count. Transitively gates on Postgres (no migrations → never Ready).
- readinessProbe on /health:8080: gates the api-server Service so it doesn't route to a still-booting pod.

Deliberately:
- checks the app's OWN /health, NOT Vespa/Redis. Those are partial/optional deps (Vespa retried-then-proceeds, Redis fail-open); coupling API availability to them would turn a partial outage into a total one. The Vespa incident is the proof (api-server stayed up serving auth/settings while search was down).
- NO liveness probe: an aggressive liveness on a slow-migrating api-server could kill it mid-migration (same lesson as the Vespa probes).

/health is in the auth_check public-endpoint allowlist, so the probe isn't 401'd. Postgres remains gated by the alembic step in the start command; Redis/Vespa intentionally NOT gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(k8s/README): add "Verify Redis caching is working" runbook

After enabling the Redis flags, document how to confirm the cache is actually populated + hit (vs silently failing open to Postgres):
- key-namespace table (kv / personas / groups / ratelimit)
- --scan for presence (the "is it on" check)
- INFO stats keyspace hit/miss ratio
- TTL/STRLEN on a specific entry
- MONITOR to watch a live request hit the cache
- DEL + reload to prove the read-through refill
- rename-an-assistant to prove write-through invalidation (TTL -> -2)

Plus the gotcha: the cache is silent on success (only logs on Redis error), so api-server logs won't show hits. Redis-side inspection is the only way to observe it; a "Redis GET/SET failed" warning means it's failing open to Postgres.
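For the hit/miss ratio step, a minimal sketch of turning `INFO stats` output into a single number. It assumes the text was captured separately (for example from redis-cli) and relies only on the standard `keyspace_hits` and `keyspace_misses` fields; the function name is hypothetical and this is not part of the README runbook.

```python
from typing import Optional


def keyspace_hit_ratio(info_stats: str) -> Optional[float]:
    """Compute hits / (hits + misses) from the text of `INFO stats`."""
    fields = {}
    for line in info_stats.splitlines():
        # Skip section headers such as '# Stats' and blank lines.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    if total == 0:
        return None  # no lookups yet, so the ratio is undefined
    return hits / total


sample = "# Stats\r\nkeyspace_hits:75\r\nkeyspace_misses:25\r\n"
print(keyspace_hit_ratio(sample))
```

A ratio that stays at None or near zero after real traffic would suggest reads are not reaching Redis at all, which is the fail-open case the runbook warns about.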
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf: fix N+1 in basic indexing-status (chat page load + folder create) get_basic_connector_indexing_status reads cc_pair.connector.source for every cc-pair, but get_connector_credential_pairs didn't eager-load the connector relationship → one lazy query per cc-pair. At ~404 cc-pairs against a remote Azure Postgres that's 404 sequential round-trips, which dominated chat page load — and re-ran on every folder create (the chat page's router.refresh() re-fetches the whole fetchChatData bundle, and this is the slowest endpoint in it). Fix: add eager_load_connector to get_connector_credential_pairs (opt-in, joinedload on ConnectorCredentialPair.connector) and use it in the basic indexing-status endpoint. 405 queries -> 2. No API/contract change, no frontend change; speeds every chat page load, not just folder create. Verified the doc-count GROUP BY itself was already fast (7ms over 689k rows on the live DB) — the cost was the N+1, not the aggregation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(chat): optimistic folder create — no full refetch Creating a folder called router.refresh(), which re-runs the entire fetchChatData server bundle (chat sessions + doc sets + assistants + tags + llm providers + indexing-status + folders, uncached via noStore) just to show one new empty folder in the sidebar. The create POST itself is a single fast INSERT. Now: mirror the server `folders` prop into local state (re-synced when the prop changes) and, on create, append the returned folder to that state instead of refreshing. The folder appears as soon as the INSERT returns — no fan-out, no SSR re-render. Paired with the indexing-status N+1 fix, this removes both the trigger (the refetch) and the worst cost within it. 
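The N+1 cost referred to here can be reproduced in miniature. The real fix uses SQLAlchemy's joinedload on ConnectorCredentialPair.connector; this sketch swaps in the standard-library sqlite3 module and an explicit JOIN so it runs anywhere, and the two-table schema is a simplified, hypothetical stand-in for the real models.

```python
import sqlite3


def build_db(n_pairs):
    """In-memory stand-in: one connector row per cc-pair row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE connector (id INTEGER PRIMARY KEY, source TEXT)")
    conn.execute("CREATE TABLE cc_pair (id INTEGER PRIMARY KEY, connector_id INTEGER)")
    for i in range(n_pairs):
        conn.execute("INSERT INTO connector VALUES (?, ?)", (i, f"source-{i % 3}"))
        conn.execute("INSERT INTO cc_pair VALUES (?, ?)", (i, i))
    return conn


def count_selects(conn, fn):
    """Run fn(conn) and return (result, number of SELECT statements issued)."""
    seen = []

    def on_statement(sql):
        if sql.lstrip().upper().startswith("SELECT"):
            seen.append(sql)

    conn.set_trace_callback(on_statement)
    try:
        result = fn(conn)
    finally:
        conn.set_trace_callback(None)
    return result, len(seen)


def sources_lazy(conn):
    """N+1 shape: one query for the pairs, then one per pair for its connector."""
    pairs = conn.execute("SELECT id, connector_id FROM cc_pair ORDER BY id").fetchall()
    return [
        conn.execute("SELECT source FROM connector WHERE id = ?", (cid,)).fetchone()[0]
        for _, cid in pairs
    ]


def sources_joined(conn):
    """Eager shape: a single JOIN returns every pair with its connector source."""
    rows = conn.execute(
        "SELECT c.source FROM cc_pair p JOIN connector c ON c.id = p.connector_id ORDER BY p.id"
    ).fetchall()
    return [row[0] for row in rows]


conn = build_db(404)
print(count_selects(conn, sources_lazy)[1], count_selects(conn, sources_joined)[1])
```

With 404 rows the lazy version issues 405 SELECTs and the joined version issues 1. The sketch models only the one relationship, so its joined count is not meant to match the endpoint's reported total of 2.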
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(chat): Redis-cache the connector indexing-status read The chat page calls GET /manage/indexing-status on every load to derive available source types. Its cost is a per-cc-pair document-count aggregation (~300ms on the live DB at a few hundred cc-pairs) — the dominant fan-out cost after the #1/#2 fixes. The result is identical for all users and changes only when a connector is added/removed or an indexing run completes, so front it with a short-TTL global Redis cache. - Split the DB build into `_build_basic_cc_pair_info` and wrap the endpoint with an inline fail-open cache (global key `danswer:cc_pair_basic_info`, default 60s TTL). - Pure TTL, no explicit invalidation: staleness is bounded by the TTL and harmless. Any Redis error falls straight through to a direct DB build — never an outage. - Default OFF via CC_PAIR_INFO_CACHE_ENABLED; prod overlay enables it, local leaves it opt-in. Documented in k8s/README.md key table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * k8s(vespa): group into base/vespa/ + ordered, health-gated upgrade script Vespa is a version-stateful subsystem with a lifecycle unlike the rest of base (pinned version, the :latest outage history, multi-StatefulSet ordered upgrades). Group it and give the upgrade ordering a real home — which is a script, not the manifests, since kustomize is declarative and cannot sequence a health-gated multi-StatefulSet rollout. Structure: - Move the 7 vespa-*.yaml into k8s/base/vespa/ with its own kustomization.yaml (referenced from base as `- vespa/`). Rendered output is unchanged. - Split the single `vespa` logical image into per-role names (vespa-configserver/-admin/-content/-feed/-query); both overlays map all five to vespaengine/vespa:8.600.35. This lets the upgrade script move one role's version at a time. 
Safety prereqs (these change the content/admin pod template, so the next apply rolls those StatefulSets — safe, one pod at a time, data on retained PVCs): - Add readiness probes to content + admin on :19092 /state/v1/health (verified serving 200 live; node-agnostic, unlike the containers' 8080). - Set publishNotReadyAddresses: true on vespa-internal so peer discovery is never gated by readiness (a slow/booting node must stay resolvable). Upgrade tooling: - k8s/scripts/vespa-upgrade.sh <target> [ns]: ordered (configserver → admin → content one-ordinal-at-a-time via partition stepping → feed → query), health-gated between each (kubectl exec → localhost, Istio-aware), single hop, refuses >30-minor/major/downgrade (FORCE to override), DRY_RUN/YES flags. bash-3.2 compatible. Dry-run verified against live. Docs: README "Upgrade Vespa" rewritten around the script; base/ section describes the folder; guarded-apply clarified as the everyday-apply net, not the upgrade tool; AGENTS.md §10 updated with the script + per-role structure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(chat): createFolder returns the new folder id (was undefined) POST /folder returns the new id as a bare integer, but createFolder read `data.folder_id` — always undefined. This was harmless while the create handler just called router.refresh() and ignored the return, but the optimistic-folder insertion (a84600b3) uses the id, so new folders rendered with folder_id=undefined and rename/delete PATCHed /api/folder/undefined → no-op. Parse the bare integer instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(chat): tidy chat-row drag (compact ghost + no browser split-view) Two native-DnD annoyances when dragging a chat row to a folder: - The row is a <Link> (<a href>), so the browser auto-attaches the URL to the drag and some browsers (Arc/Edge/Safari) offer "open in split view" when dragging toward the edge. 
Clear the auto-added link payload and set effectAllowed=move so only the folder DnD remains. - The default drag image is a translucent clone of the full-width row that trails awkwardly across the sidebar. Replace it via setDragImage with a compact chip (chat name, ellipsized) built off-screen and removed on the next tick. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(chat): pending spinner on "Manage Assistants" navigation Navigating to /assistants/mine awaits the heavy fetchChatData bundle, with no feedback — it felt frozen. (A route-level loading.tsx was wrong here: the app renders the sidebar inside each page, not a shared layout, so the fallback blanked the whole shell and read as a full reload.) Instead drive the navigation with useTransition + router.push: the current page (and sidebar) stays mounted and visible until the new page's server fetch completes, and isPending swaps the button's brain icon for a spinner + "Loading…" so the click clearly registers. Feels like an in-app transition, not a reload. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * perf(db): Celery broker on Redis + env-driven Postgres pool sizing Reduce and contain Postgres connection pressure (the real ceiling for chat at scale — sessions are held through the LLM stream): - Celery broker + result backend optionally on Redis via CELERY_BROKER_REDIS_ENABLED (default off; prod on). Uses a separate logical DB (CELERY_REDIS_DB_NUMBER, default 1) so Celery keys never collide with the cache/rate-limit DB 0. Removes Celery's queue polling/writes from Postgres. Task status is unaffected — this fork tracks it in its own task_queue_jobs table, not the Celery backend. Indexing stays on Dask. Falls back to the Postgres broker when off, so local dev without Redis still boots. Note: the broker (unlike the fail-open cache) is a hard dependency when enabled. 
  - Postgres pool size/overflow are now env-driven (POSTGRES_POOL_SIZE / POSTGRES_POOL_OVERFLOW, defaults preserve the previous 40+10) so each deployment can size its pool to its replica count and stay under Azure Postgres max_connections. Applied to both the sync and async engines.

  Overlays: prod enables the Redis broker and sets explicit pool values (documented to lower as api-server replicas grow); local leaves both opt-in/empty. README gains a "Celery on Redis + pool sizing" section, a verify command, and restart-matrix rows; AGENTS.md divergence table notes the broker is now configurable.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style(config): black-format app_configs.py (line-length 130)

  Collapse the multi-line env-read statements added during the Redis/cache work to single lines, per the repo's pinned black==23.3.0 + pyproject line-length=130. Cosmetic only; no values or logic change.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* style: black-format Redis persona cache + rate-limit middleware

  Collapse wrapped signatures/calls to single lines per the repo's black==23.3.0 + pyproject line-length=130. Cosmetic only; no logic change. Completes black compliance for the Redis-feature files on this branch (the remaining black-flagged files are unrelated connector/llm code, left for a separate cleanup).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(chat): per-user Redis cache for the document-set list

  /document-set is on the chat-page bundle (fired every load); the read is a multi-join plus, in EE, a per-user permission query. Cache it.

  Per-user (not global+filter like the persona cache) on purpose: the doc-set permission filter is edition-dependent (EE filters by is_public/users/groups, MIT base returns all), so memoizing the exact versioned result per user avoids replicating that branchy logic, where a parity bug would leak doc-set visibility.
  Trade-off: a cold burst of N distinct users still costs N first-loads, but a user's repeat loads (new-chat / nav / router.refresh) collapse to one DB build per TTL.
  - New db/d…
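The per-user caching trade-off in the last commit above can be illustrated with a minimal read-through cache. This is a sketch under stated assumptions, not the code from this branch: `InMemoryTTLStore` is a hypothetical stand-in for a Redis client (it only mimics `get` and `setex`), and the function name `get_document_sets_cached`, the key format, the 60-second TTL, and the `version` prefix are all invented for illustration.

```python
import json
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


class InMemoryTTLStore:
    """Hypothetical stand-in for a Redis client; supports get/setex only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss.
            del self._data[key]
            return None
        return value

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)


def get_document_sets_cached(
    store: Any,
    user_id: str,
    build: Callable[[str], List[dict]],
    ttl_seconds: int = 60,  # assumed TTL, not taken from the PR
    version: str = "v1",    # assumed key-version prefix
) -> List[dict]:
    """Return the user's document sets, building them at most once per TTL."""
    # One key per user: the permission-filtered result is stored as-is,
    # so no edition-specific filter logic is re-implemented in the cache.
    key = f"docsets:{version}:user:{user_id}"
    cached = store.get(key)
    if cached is not None:
        return json.loads(cached)
    result = build(user_id)  # the expensive multi-join DB read
    store.setex(key, ttl_seconds, json.dumps(result))
    return result
```

With this shape, two different users each trigger one build (the cold-burst cost), while a second call for the same user inside the TTL is served from the store without calling `build` again.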
Brings the `rajiv/add-claude` branch up to `feature/darwin`: Microsoft/Entra OIDC auth wired into the OSS code path (no `ee/` dependency), a chat & search UX overhaul, a chat-page crash fix, a fix for a persona-deletion permission gap, per-channel Slack-bot model config, and Playwright MCP tooling. 9 commits, 50 files, +909 / −582.

Auth: Microsoft/Entra OIDC
Wired directly into the OSS app (no `ee/` dependency):
- OIDC router in `backend/danswer/main.py`; `AuthType.OIDC` allowlisted in `verify_auth_setting`; public auth routes registered.
- Env vars: `OPENID_CONFIG_URL`, `DEFAULT_ADMIN_EMAILS`.
- OIDC users are auto-verified in `oauth_callback`; the env-driven admin allowlist promotes addresses in `DEFAULT_ADMIN_EMAILS`.
- `backend/requirements/default.txt`: pin `bcrypt==4.0.1` (passlib 1.7.4 is incompatible with bcrypt 4.1+; do not bump without also fixing passlib).

Kubernetes (`darwin-kubernetes/`)

- `env-configmap.yaml`: `AUTH_TYPE=oidc`, `OPENID_CONFIG_URL` (Entra discovery), `DEFAULT_ADMIN_EMAILS`; `WEB_DOMAIN` / `DOMAIN` set to the external `https://` origin (required for a correct OIDC `redirect_uri` and a `Secure` session cookie).
- `api_server-service-deployment.yaml`: inject `OAUTH_CLIENT_ID`, `OAUTH_CLIENT_SECRET`, `USER_AUTH_SECRET` from the `danswer-secrets` secret via `secretKeyRef`.
- `secrets.yaml`: stub values replaced with documented placeholders + a "do not commit real secrets" header; real values are applied out-of-band.

Chat & search UX
- Persona `document_sets` intersect with user-applied filters server-side in `search/preprocessing/preprocessing.py` (the input-bar "Scope" chips are gone but the scoping itself is unchanged).
- `FiltersTab.tsx` and `ChatFilters.tsx` were significantly rewritten (−188 / −176 lines) around the new searchable, scrollable knowledge-set picker. Tag filters were removed.
- `/search` hidden from nav (still reachable by URL); removed the redundant top-left assistant selector; highlight applied filters.
- `web/next.config.js`: dropped the 308 stream redirects that were stripping the session cookie.
- New default: `Settings.default_page = CHAT`. Existing deployments keep their persisted value; change it via Admin → Settings.

Bug fixes
- Chat-page crash (`web/src/components/search/DocumentDisplay.tsx`): fall back to `blurb` when `match_highlights` is a non-empty array of only falsy/whitespace strings. Previously `sections[0][2]` threw `Cannot read properties of undefined (reading '2')`, crashing the chat doc sidebar and the search page when retrieved docs had empty highlights (more likely with large or many-doc contexts).
- Slack blocks: strip the language token off opening code fences in `build_qa_response_blocks`. Slack mrkdwn has no fenced-code info string, so the language was rendering as a literal first line of the block. (Slack still cannot syntax-highlight; that's a platform limit.)
- Removed the locked persona "Scope" chips from the chat input bar (`SelectedFilterDisplay`, `ChatInputBar`); cosmetic only, server-side scoping unchanged.

Security / permissions
- `mark_persona_as_deleted` (`backend/danswer/db/persona.py`) now returns 403 for non-admins on `default_persona` or ownerless (`user_id IS NULL`) personas, mirroring the frontend's `!default_persona` rule. Closes a gap where a basic user could soft-delete a shared default assistant for everyone via a direct API call (`get_persona_by_id` grants non-admins access to ownerless personas).

Slack bot
- Per-channel model config (`SlackBotConfigCreationForm.tsx`, `server/manage/slack_bot.py`, `server/manage/models.py`).

Tooling
- `.mcp.json` (shared Playwright MCP server) so the browser-driven debugging setup is reproducible across the team.
- `.gitignore`: ignore `.playwright-mcp/` session output and the ad-hoc `model-picker-open.png` screenshot (they are local artifacts, not source).

Configuration required for reviewers / operators
Before this can be deployed, the following must be set:
Footguns
Commits
🤖 Generated with Claude Code
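The deletion guard described under Security / permissions can be sketched as below. This is an illustrative model only, not the code in `backend/danswer/db/persona.py`: the `Persona` dataclass, the `PermissionDenied` exception (standing in for an HTTP 403), and the function name `soft_delete_persona` are hypothetical, and the ownership check for personas owned by another user is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


class PermissionDenied(Exception):
    """Hypothetical stand-in for an HTTP 403 response."""

    status_code = 403


@dataclass
class Persona:
    id: int
    default_persona: bool
    user_id: Optional[str]  # None models an ownerless persona (user_id IS NULL)
    deleted: bool = False


def soft_delete_persona(persona: Persona, *, is_admin: bool) -> None:
    """Soft-delete a persona, refusing shared ones for non-admins.

    Non-admins are rejected when the persona is a default persona or has
    no owner, since either kind is shared by every user.
    """
    if not is_admin and (persona.default_persona or persona.user_id is None):
        raise PermissionDenied("only admins may delete default or ownerless personas")
    persona.deleted = True
```

Putting the check in the server-side delete path, rather than relying on the frontend hiding the button, is what stops the direct API call the PR mentions.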