k8s kustomize, Redis caching, background/indexing scaling, Azure Blob, retention + connector reliability by rajivml · Pull Request #46 · UiPath/danswer

rajivml · 2026-06-02T20:44:39Z

Summary

This branch brings the Darwin deployment up to a scalable, observable, and reliable footing: a kustomize-based k8s layout, Redis caching + rate limiting, horizontally-scaled background/indexing, Azure Blob file storage, DB retention, and a series of indexing/connector performance & reliability fixes (several found and verified against prod).

Functionalities implemented

Kubernetes / deployment

Replaced the flat darwin-kubernetes/ manifests and the Helm chart with a kustomize base/ + overlays/{prod,local} layout (single source of truth for image tags, replicas, env, secrets).
In-cluster Redis (StatefulSet) in base; secrets via gitignored secrets.env.
Vespa pinned to an exact version (never :latest after the prior outage), decoupled from the app overlays into overlays/{prod,local}-vespa, with a guarded-apply.sh version-gate and an ordered, health-gated vespa-upgrade.sh.
Startup/readiness probes on the api-server; removed vestigial PVC mounts.

Redis caching & rate limiting

Per-user Redis caches (document-set list, connector indexing-status) and a per-user chat rate limiter.
Celery broker moved onto Redis; env-driven Postgres pool sizing.

Background / indexing scaling

Split the monolithic background pod into background-lite (celery worker + beat + slack listener), background-indexer-scheduler, and a remote Dask scheduler + horizontally-scalable workers; optional KEDA autoscale for dask-workers.
Right-sized memory; fixed the celery-worker --autoscale+threads-pool crashloop (→ --concurrency).

Azure Blob file store

Optional AzureBlobFileStore backend (bytes off Postgres), direct-to-Blob SAS upload from the browser with a progress bar, and chat-upload size/token limits.

DB retention

Pluggable retention sweep (kombu_message, task_queue, index_attempt, chat, usage_reports, permission_sync) under an advisory lock, run daily; opt-in index_attempt pruning.

Indexing performance & reliability

Content-hash skip: re-index a document only when its indexed content actually changed (not just its timestamp) — eliminates needless Vespa rewrites for churny sources (e.g. Salesforce LastModifiedDate). Backward-compatible, all connectors.
get_last_attempt LIMIT 1 + an audit of other over-materializing queries (prune id-only fetch, doc-set sync, cc-pair attempts) — fixed the indexer-scheduler OOM.
Parallelized per-document Vespa chunk deletes.
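
The content-hash skip described above can be sketched as follows. This is a minimal illustration, not the branch's implementation: the `Doc` shape, `content_hash`, and `docs_to_reindex` are hypothetical names, and the sketch assumes the hash covers only the text that ends up in the index.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical document shape, for illustration only.
    id: str
    text: str
    updated_at: str  # source timestamp; deliberately excluded from the hash


def content_hash(doc: Doc) -> str:
    """Hash only the content that would be written to the index."""
    return hashlib.sha256(doc.text.encode("utf-8")).hexdigest()


def docs_to_reindex(docs: list[Doc], stored_hashes: dict[str, str]) -> list[Doc]:
    """Return the docs whose content hash is missing or differs from the stored one."""
    changed = []
    for doc in docs:
        # A missing stored hash (e.g. a document indexed before hashing existed)
        # counts as changed, so older rows are simply re-indexed once.
        if stored_hashes.get(doc.id) != content_hash(doc):
            changed.append(doc)
    return changed
```

A timestamp-only change (the Salesforce LastModifiedDate case) produces the same hash, so the document is skipped and no Vespa rewrite happens.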

Connectors

Web: dropped the per-page connectivity GET; added domcontentloaded navigation, per-page retries, a User-Agent, and a max-pages cap; a single page error no longer tears down the whole browser. Fixed the ~89% failure rate (verified live).
Jira: fixed the ORDER BY JQL bug (HTTP 400 on every poll) and a broken load_from_state; added per-issue error tolerance, IdConnector ID-only retrieval (enables pruning of deleted issues), and richer metadata.

UX / analytics

cc-pair index-attempt history moved to server-side pagination (FE + BE).
Durable per-user daily analytics (survives chat retention), adoption curves, most-used assistants; chat folder/assistant UX polish.

Verification

pre-commit (black / reorder-imports / autoflake / ruff / prettier) passes on the changed files; backend py_compile clean; web tsc --noEmit clean.
No secrets committed (the only key-like string is the public Azurite emulator default in a test docstring; real secrets live in gitignored secrets.env).
Prod-verified: web + Jira connectors now succeed; indexer-scheduler memory flat; celery-worker stable.

🤖 Generated with Claude Code

Plan for exposing chat to a few hundred users: P0 connection-pool/session fix, P1 Redis foundation + DynamicConfigStore read-through cache, P2 per-user request rate limiting, P3 per-chat-turn config caches. Plan only, no implementation yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Foundation for caching/rate-limiting work (see REDIS_CACHING_PLAN.md). This commit only ships the cache piece: no behavioural change unless REDIS_KV_CACHE_ENABLED=true is set.

* requirements: pin redis==5.0.8.
* configs/app_configs.py: REDIS_HOST/PORT/PASSWORD/DB_NUMBER/SSL, REDIS_POOL_MAX_CONNECTIONS, REDIS_HEALTH_CHECK_INTERVAL, REDIS_SOCKET_TIMEOUT_SECONDS; cache toggle REDIS_KV_CACHE_ENABLED (default OFF) and REDIS_KV_CACHE_TTL_SECONDS (1 day).
* danswer/redis/redis_pool.py: lazy ConnectionPool singleton + get_redis_client() helper. Single-tenant: DANSWER_REDIS_KEY_PREFIX is the only namespace; upstream's TenantRedisClient is intentionally not ported.
* dynamic_configs/store.py: RedisCachedDynamicConfigStore wraps any inner DynamicConfigStore with read-through / write-through caching. The inner store stays the source of truth (writes go to the inner store first), encrypted values are NEVER cached as plaintext (just invalidated), and every Redis call is fail-open so an outage degrades latency, not availability.
* dynamic_configs/factory.py: when REDIS_KV_CACHE_ENABLED, transparently wraps the existing PostgresBackedDynamicConfigStore; call sites unchanged.
* Deployment: redis service in docker-compose.dev.yml (cache-only: no AOF, no RDB snapshots, allkeys-lru @ 256mb so a runaway producer can't OOM the node). darwin-kubernetes/redis-statefulset.yaml mirrors that posture. REDIS_HOST etc. in env-configmap; REDIS_PASSWORD wired via optional secretKeyRef so the deployments still boot when Redis is unauth'd. NOT the Celery broker: that stays on Postgres by design.
* backend/.gitignore: ignore stray pywikibot apicache/throttle.ctrl artifacts dropped by the existing mediawiki test.

Tests (unittest, no real Redis required; mocks at the get_redis_client boundary):
- tests/.../redis_layer/test_redis_pool.py: pool singleton, prefix constant, reset_pool_for_tests.
- tests/.../dynamic_configs/test_redis_cached_store.py: read-through, write-through invalidation, TTL on SET, cached-None vs miss, not-found NOT cached, encrypted values not mirrored, corrupt entry treated as miss, fail-open on Redis errors. 13 cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
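
The read-through / write-through behaviour described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the actual RedisCachedDynamicConfigStore: `DictStore`, `FakeCache`, `CachedStore`, the `is_encrypted` helper, and the `danswer:kv:` prefix are all hypothetical, and the fake cache only mimics the `get` / `set(..., ex=...)` / `delete` calls a Redis client offers.

```python
import json
from typing import Any


class DictStore:
    """Stand-in for the Postgres-backed inner store (hypothetical)."""

    def __init__(self) -> None:
        self.rows: dict[str, tuple[Any, bool]] = {}
        self.loads = 0

    def load(self, key: str) -> Any:
        self.loads += 1
        return self.rows[key][0]  # KeyError plays the role of "not found"

    def store(self, key: str, value: Any, encrypt: bool = False) -> None:
        self.rows[key] = (value, encrypt)

    def is_encrypted(self, key: str) -> bool:
        return self.rows[key][1]


class FakeCache:
    """In-memory stand-in for a Redis client; set broken=True to simulate an outage."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        self.broken = False

    def _check(self) -> None:
        if self.broken:
            raise ConnectionError("cache unavailable")

    def get(self, key: str):
        self._check()
        return self.data.get(key)

    def set(self, key: str, value: str, ex: int = 0) -> None:
        self._check()
        self.data[key] = value  # the TTL is accepted but not simulated here

    def delete(self, key: str) -> None:
        self._check()
        self.data.pop(key, None)


class CachedStore:
    def __init__(self, inner, cache, ttl_seconds: int = 86400, prefix: str = "danswer:kv:"):
        self.inner, self.cache, self.ttl, self.prefix = inner, cache, ttl_seconds, prefix

    def load(self, key: str) -> Any:
        try:
            raw = self.cache.get(self.prefix + key)
            if raw is not None:  # a cached JSON null is a hit, not a miss
                return json.loads(raw)
        except Exception:
            pass  # fail-open: cache errors and corrupt entries count as a miss
        value = self.inner.load(key)  # not-found raises here and is never cached
        if not self.inner.is_encrypted(key):
            self._try_set(key, value)
        return value

    def store(self, key: str, value: Any, encrypt: bool = False) -> None:
        self.inner.store(key, value, encrypt)  # source of truth is written first
        if encrypt:
            try:
                self.cache.delete(self.prefix + key)  # invalidate, never mirror
            except Exception:
                pass
        else:
            self._try_set(key, value)

    def _try_set(self, key: str, value: Any) -> None:
        try:
            self.cache.set(self.prefix + key, json.dumps(value), ex=self.ttl)
        except Exception:
            pass  # a failed cache write must not fail the caller
```

Because every cache call is wrapped, flipping `cache.broken = True` only sends reads back to the inner store; nothing propagates to the caller.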

Layered on top of P1's Redis client. Complements the existing token-budget limiter (token_limit.py): that's a DB-backed COST cap, this is a Redis-backed REQUEST-COUNT cap that's correct across api_server replicas. Both run; this one runs first so a 429'd caller never even touches the DB-backed usage query.

Default OFF. Enable per-environment via:
  REQUEST_RATE_LIMIT_ENABLED=true
  REQUEST_RATE_LIMIT_PER_MINUTE=<N>   # 0 = disable that window
  REQUEST_RATE_LIMIT_PER_HOUR=<N>

* server/middleware/request_rate_limit.py: fixed-window buckets keyed by floor(time/window). Atomic INCR + EXPIRE NX so the bucket boundary is fixed on first increment (without NX, every request would push expiry forward and the bucket would never reset; that bug is covered by an explicit test). Authenticated users keyed by uuid; anonymous keyed by the first X-Forwarded-For hop, falling back to the socket peer; if neither yields an IP we skip (better than bucketing every anonymous request under "").
* Fail-OPEN on any Redis error: a Redis blip lets requests through with a warning, never wedges the chat path.
* 429 response carries a Retry-After header with seconds-until-bucket-rollover so well-behaved clients back off precisely.
* Wired as a FastAPI Depends on:
    POST /chat/send-message
    POST /direct-qa/stream-answer-with-quote
  Both endpoints also keep the existing check_token_rate_limits.

Tests (unittest, mocked Redis pipeline; no real Redis required):
- default-OFF short-circuits before any Redis call (covers both REQUEST_RATE_LIMIT_ENABLED=false AND both windows = 0).
- within-limit: N requests under cap all allowed.
- over-limit raises 429 with Retry-After header.
- per-user isolation: distinct users have independent counters.
- bucket rollover resets count (time-mocked).
- EXPIRE NX semantics: locks down the no-sliding-TTL invariant.
- anonymous keyed by XFF first hop; no-IP skips silently.
- fail-open: Redis error doesn't propagate.
9 cases total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
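
A minimal sketch of the fixed-window scheme described in this commit, assuming a client that exposes `incr(key)` and `expire(key, seconds, nx=True)` (redis-py offers both; the NX option needs Redis 7.0 or later on the server). The key layout, `FakeRedis`, `RateLimited`, and `check_rate_limit` are hypothetical, and the sketch issues the two commands one after the other where the commit describes a pipeline.

```python
import math
import time
from typing import Optional


class FakeRedis:
    """In-memory stand-in that mimics only INCR and EXPIRE ... NX."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}
        self.expiry: dict[str, int] = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key: str, seconds: int, nx: bool = False) -> bool:
        if nx and key in self.expiry:
            return False  # NX: keep the expiry set by the first increment
        self.expiry[key] = seconds
        return True


class RateLimited(Exception):
    """Carries the value a handler would place in the Retry-After header."""

    def __init__(self, retry_after: int) -> None:
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def check_rate_limit(client, subject: str, limit: int, window_seconds: int,
                     now: Optional[float] = None) -> None:
    if limit <= 0:
        return  # 0 disables this window
    now = time.time() if now is None else now
    bucket = int(now // window_seconds)  # floor(time / window)
    key = f"ratelimit:{subject}:{window_seconds}:{bucket}"  # hypothetical key layout
    try:
        count = client.incr(key)
        # NX fixes the bucket's lifetime at the first increment; without it every
        # request would push the expiry forward.
        client.expire(key, window_seconds, nx=True)
    except Exception:
        return  # fail-open: a cache error lets the request through
    if count > limit:
        retry_after = math.ceil((bucket + 1) * window_seconds - now)
        raise RateLimited(retry_after)
```

A request handler would translate `RateLimited` into a 429 response with `Retry-After: <retry_after>`.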

GET /persona (Manage Assistants → "View available assistants") fires get_personas(user_id, ...), a multi-OR permission-filtered query joining Persona, Persona__User, Persona__UserGroup, User__UserGroup. At hundreds of concurrent users opening the chat UI around the same time, the burst puts unnecessary pressure on the DB connection pool (which is the actual scaling ceiling for streaming chat; see REDIS_CACHING_PLAN.md).

Design: global cache + per-user filter (not per-user response cache), so the multi-user-burst pattern collapses 200 queries into ~1:
  danswer:personas:all:not_deleted     global, all PersonaSnapshot including is_public / users / groups (PersonaSnapshot already carries the permission inputs; no separate payload shape needed)
  danswer:personas:groups:{user_id}    per-user, list[int] of group ids

At request time the cached list is filtered in Python, mirroring the SQL OR-block exactly:
  is_public OR user.id IN persona.users.id OR (user_group_ids ∩ persona.groups)
The parity vs SQL is locked down by tests (one per branch + negative).

Invalidation is explicit + write-through:
- 9 mutation paths in db/persona.py call invalidate_personas_all() AFTER db_session.commit() (after-commit ordering avoids a stale-fill race during open transactions).
- 3 paths in ee/danswer/db/user_group.py (insert/update/prepare-delete) call invalidate_user_groups(uid) for each affected user.
- 24h TTL is ONLY a safety net for missed busts; the primary mechanism is explicit so persona/group edits are visible immediately.
- Default OFF (PERSONA_CACHE_ENABLED=false); enable per environment.
- Fail-OPEN on every Redis op: a Redis outage degrades latency, not availability, and a failed bust doesn't roll back the DB write.
- include_deleted=True falls through to direct DB (uncommon shape; we deliberately don't cache it).

Encrypted values: N/A. PersonaSnapshot has no encryption-at-rest guarantee to bypass (unlike the KV store layer from P1).

Tests (17, mocked Redis + db boundary, no real services):
- 6 filter-parity cases (one per SQL branch + mixed + zero-groups edge)
- 2 user-group cache cases (miss/hit, TTL propagation)
- 3 routing cases (disabled fallthrough, include_deleted bypass, admin user_id=None path skips group lookup)
- 4 invalidation cases (right key for each side, disabled short-circuit, Redis-error-during-bust swallowed)
- 2 fail-open-on-read cases (GET error → miss, corrupt entry → miss)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
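
The per-user filter over the globally cached list might look roughly like this. `PersonaLite` and `visible_personas` are hypothetical names that stand in for PersonaSnapshot and the real filter, and returning every persona when `user_id` is None is an assumption based on the "admin user_id=None path" mentioned in the test list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PersonaLite:
    # Hypothetical reduction of PersonaSnapshot to its permission inputs.
    id: int
    is_public: bool
    user_ids: set[str] = field(default_factory=set)
    group_ids: set[int] = field(default_factory=set)


def visible_personas(
    personas: list[PersonaLite],
    user_id: Optional[str],
    user_group_ids: set[int],
) -> list[PersonaLite]:
    """Filter the cached list with the same three OR-branches as the SQL query."""
    if user_id is None:
        # Assumed admin path: no permission filter, no group lookup.
        return list(personas)
    return [
        p
        for p in personas
        if p.is_public                       # branch 1: public persona
        or user_id in p.user_ids             # branch 2: shared with the user directly
        or bool(user_group_ids & p.group_ids)  # branch 3: shared via a common group
    ]
```

Keeping the filter as one expression with one clause per SQL branch makes a parity test per branch straightforward to write.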

Replaces the prior "move up / move down inside a 3-dot popover" flow on /assistants/mine with eight coordinated changes. Backend unchanged — the existing PATCH /api/user/assistant-list endpoint already accepts the full chosen_assistants array, so every interaction lands as one optimistic local update + one PATCH + (on failure) a rollback. 1. Drag-and-drop reorder via @dnd-kit (already in package.json) with a grab handle on each visible row. Pointer activation distance of 6px so clicks on the handle don't accidentally start a drag. Keyboard reordering comes for free via dnd-kit's default activator focus. 2. Explicit "set as default" — pin/star icon on each visible row; filled when the row is the user's default (position 0 of chosen_assistants), with an accent border + "Default" chip on that row. Ordering and default are now orthogonal — reorder freely without accidentally changing your default. 3. Visibility as a row-level switch instead of a buried "Hide / Remove" popover item. One unified list with a "Hidden (N)" divider; hidden rows render at reduced opacity and have no drag handle (no position to drag to). The prior separate "Active Assistants" / "Your Hidden Assistants" sections collapse into this single list. Refuses to hide the last visible row (can't ship the user a broken picker). 4. Client-side search filter — matches name, description, or tool name. Applies to both visible and hidden sections so search-then-toggle for "where did I put X" is one motion. 5. Information density rebalanced. Description is now the primary signal (was the smallest text). Tools/sources collapse into compact "{n} tools" / "{n} sources" chips so the row scans for "should I pick this?" not "what are its internals?". Full tool list reveals on hover via title attribute. 6. Bulk select column + sticky action bar. Checkbox appears on hover or focus and stays visible when selected. Action bar shows Show / Hide / Remove + Clear when anything is selected. 
Refuses bulk-hide that would empty the visible list. 7. Header cleanup: title + 1-line subtitle + Create button top-right, "Browse all available" as a text link. The prior two giant nav tiles + paragraph of explanatory copy are gone — recovers vertical space on a page whose real content is the list. 8. Undo on every state-mutating toast (reorder / set-default / hide / show / bulk ops). PopupSpec gains an optional `undo: { onClick }` field; the popup stays on screen 6s instead of 4s when undoable so the user has time to react. Undo restores the prior chosen_assistants array via another PATCH — symmetric round-trip, no special endpoint. New helpers in lib/assistants/updateAssistantPreferences.ts: reorderAssistantList(newOrder) — full-array PATCH (drag, undo) setDefaultAssistant(id, list) — move id to position 0 bulkRemoveFromList(ids, list) — set difference bulkAddToList(ids, list) — set union, appended at end The pre-existing moveAssistantUp/Down helpers are kept (other callers may still import them) but no longer used on this page. Verified: npx tsc --noEmit clean across web/ (0 errors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sister rework to the Manage Assistants page. With 50+ accessible assistants and growing, the old flat 2-column grid had no hierarchy and no status signal — every card looked identical regardless of whether it was yours, shared, public, or already in your picker. Same conceptual fix as Manage: give the page structure so scanning answers "what's mine?", "what's new?", "what does this one do?". Backend unchanged — every interaction PATCHes chosen_assistants via the existing /api/user/assistant-list endpoint (same path the Manage page uses). All mutations are optimistic + undoable. Changes (numbers map to the design proposal): 1. Per-card "In your picker" badge + muted card style when added. Eye now finds the un-added ones in a glance. 2. Three implicit sections: Yours / Shared with you / Featured & Built-in. Empty sections hide; section headers carry counts. 3. Filter chip rows: availability (All / Available to add / Already added) with live counts, plus auto-generated per-tool chips for tools that appear in ≥2 assistants (avoids chip-bloat as the dataset grows). Tool filters use OR semantics. 4. Owner display: best-effort name from the email local-part (split on '@', dots/underscores→spaces) with a "Built-in" badge for default_persona assistants. Kills the fork-specific "Author: Darwin" magic string. 6. Responsive grid: 1 / 2 / 3 / 4 cols by breakpoint. 7. Header matches the Manage rebuild — title + subtitle + Create button top-right, "Back to my assistants" as a text link. Cut the giant centered nav button and the explanatory paragraph. 8. Sort dropdown: Featured (API order, respects admin display_priority) / A → Z / Newly added (id desc proxy for recency). 9. Search broadened to name + description + tool names + document-set names. Empty-result state with a real "Clear all filters" button. 10. Compact "{n} tools" / "{n} sources" chips with hover-reveal of the full tool list. 
Flat Add/Remove buttons replace Tremor's color="green/red" which was visually shoutier than the action. 11. Design tokens fixed — border-border / focus-ring-accent in place of hardcoded gray-300 / blue-500. Consistent with the rest of the app. Skipped (per proposal): - #5 detail drawer / modal — revisit after observing how users use the new grid; bigger feature. - Bulk select — adding 5 assistants at once isn't a real use case here (bulk hide on Manage was). The pre-existing addAssistantToList / removeAssistantFromList helpers are kept and used at the call sites. The shared reorderAssistantList helper added in the prior commit is reused for the undo paths. Verified: npx tsc --noEmit clean across web/ (0 errors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Dev-only seed script that creates N varied personas in the local DB for exercising the redesigned gallery / manage pages. Refuses to run when POSTGRES_HOST looks like a managed/prod database (azure.com, amazonaws.com, .cloud., "prod", etc.) as a guard against pointing this at the wrong env by accident.

Mix is designed to populate each section of the new gallery:
  ~30% "Yours": owned by target user, private
  ~20% "Shared with you": owned by another user, target user in users[]
  ~50% "Featured": public, no specific owner

Per row randomly attaches 0–3 tools and 0–2 document sets so the {n} tools / {n} sources chips render with variety. Half of "Yours" auto-land in chosen_assistants (and all "Shared with you" do), so the "Already added" vs "Available to add" filter chips have content on both sides without manual setup.

60 distinct names + 30 description templates so 50 rows feel populated and varied. Uses a fixed RNG seed by default (deterministic across runs). Name prefix "[seed] " makes rows easy to spot and to wipe via --clear.

Usage:
  cd backend && source ../.venv/bin/activate
  python -m scripts.seed_assistants --email you@example.com
  python -m scripts.seed_assistants --clear   # wipe and re-seed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
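
The POSTGRES_HOST guard can be sketched as a simple substring check. The marker tuple below contains only the four examples named in the commit message (the real list ends in "etc." and may be longer), and the function names and the `localhost` fallback are hypothetical.

```python
import os
from typing import Optional

# Only the markers named in the commit message; the script's real list may differ.
PROD_MARKERS = ("azure.com", "amazonaws.com", ".cloud.", "prod")


def looks_like_managed_db(host: str) -> bool:
    """True when the host name contains any marker of a managed/prod database."""
    lowered = host.lower()
    return any(marker in lowered for marker in PROD_MARKERS)


def assert_safe_to_seed(host: Optional[str] = None) -> None:
    """Abort before touching the DB if the target looks like a managed/prod host."""
    if host is None:
        host = os.environ.get("POSTGRES_HOST", "localhost")
    if looks_like_managed_db(host):
        raise SystemExit(
            f"Refusing to seed: POSTGRES_HOST={host!r} looks like a managed/prod database"
        )
```

A substring match errs on the side of refusing: a local host that happens to contain "prod" is blocked too, which seems the safer failure mode for a destructive dev script.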

Two follow-ups on the assistants UX rework, both from user feedback. Manage (/assistants/mine): * Toggle now accepts a `highlight` prop that draws a transient ring + slight scale-up on the switch. Used by hidden rows so a click anywhere on the (faded) row body flashes the toggle for ~1.2s, pointing the eye at the action that brings the assistant back. Doesn't auto-enable on body click — surprising a user mid-read into enabling would be a worse outcome than the discoverability gap. * Restructured opacity: only the content column (icon + name + description + chips) fades when a row is hidden. The action zone (checkbox, drag-slot, pin, toggle, share, edit) stays at full opacity so the toggle is the bright, clickable target on a dim row. Previously the parent opacity-50 cascaded to every child, making the toggle the dimmest thing on the dimmest row. * stopPropagation on the action zone so clicks on buttons inside it don't trigger the row-body flash handler. Gallery (/assistants/gallery): * Removed all tool-related UI per user request — the page is for browsing assistants, and tool filter chips + per-card "{n} tools" pulled focus from the assistant itself. Gone: the auto-generated tool filter chip row, the per-card tools chip, the toolDisplayName / toolIcon helpers, the FiTool / FiImage / FiCheck imports, and the toolFilters state + commonTools memo. Search hay is now name + description + document-set names (no tool names). * Dropped the absolute top-right "In your picker" badge. The muted card style (border + opacity-75) plus the "Remove" button in the footer already signal "added"; the badge ate horizontal space (pr-24 on the header reserved 96px) and crowded the title at narrower widths. Removed the pr-24 reservation now that nothing overlays the header. * Grid capped at `1 / 2 / 3` cols — 1 on mobile, 2 on most laptops and standard desktops, 3 only at `2xl` (≥1536px). 
Previously 1/2/3/4 with the 4-col breakpoint making cards cramped and hard to read once descriptions hit their 3-line clamp. * Bumped card padding p-4 → p-5 and description line-height to leading-relaxed for breathing room. * Updated clearAllFilters / hasAnyFilter to drop the toolFilters references (now dead). Verified: npx tsc --noEmit clean across web/ (0 errors), zero stray references to the removed helpers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the gallery treatment from the previous commit. The user reported tool execution isn't reliable yet, and surfacing "{n} tools" on assistant rows misleads users into picking an assistant for a capability that may not work in practice. Dropped: the {n} tools Bubble in the row's chip block, the toolCount derivation, and the FiTool import. The {n} sources chip stays — it's about the assistant's knowledge scope, which works fine. Verified: npx tsc --noEmit clean across web/ (0 errors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AssistantsGallery now accepts an optional `columns` prop (default 3, supported 1-5). Responsive scaling below the widest breakpoint is fixed per row of GRID_CLASSES — each row is a complete static Tailwind class string so the purge step actually emits the classes (dynamic `md:grid-cols-${n}` would silently disappear at build). Unsupported values fall back to the default rather than rendering broken — a bad prop here shouldn't break the page. The single existing caller (page.tsx) doesn't pass columns, so it gets the default 3 — same layout as before. To switch to 4 columns on a wide-monitor deployment: `<AssistantsGallery columns={4} ... />`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A "{n} sources" chip told users how MUCH knowledge an assistant had access to but not WHICH knowledge — defeating the point of the chip for someone deciding "which assistant should I pick for this task?". Both the Manage page row and the Gallery card now render one Bubble per document-set name, capped at MAX_VISIBLE_DOC_SETS (3). When an assistant points at more than that, a "+N more" pill collects the overflow with the rest of the names exposed via the title tooltip, so we don't blow the card width or row layout at narrower column counts. Each name chip caps at a max-width with truncate + a hover title, so a single absurdly long document-set name can't push the actions off the row. Verified: npx tsc --noEmit clean across web/ (0 errors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
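
The capping logic is language-neutral, so here is a small Python sketch of it (the page itself is React/TypeScript). `doc_set_chips` and the comma-joined tooltip text are assumptions; only MAX_VISIBLE_DOC_SETS = 3 and the "+N more" label come from the commit message.

```python
from typing import Optional

MAX_VISIBLE_DOC_SETS = 3


def doc_set_chips(names: list[str]) -> tuple[list[str], Optional[dict[str, str]]]:
    """Split document-set names into visible chips and an optional overflow pill."""
    visible = list(names[:MAX_VISIBLE_DOC_SETS])
    overflow = list(names[MAX_VISIBLE_DOC_SETS:])
    if not overflow:
        return visible, None
    # The pill shows the count; the hidden names go into its title tooltip.
    pill = {"label": f"+{len(overflow)} more", "title": ", ".join(overflow)}
    return visible, pill
```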

Adds a small "Columns 2 | 3 | 4" segmented control at the right end of the filter row (next to Sort). The pick persists to localStorage under "danswer:assistants-gallery:columns" so it survives reloads on the same device.

State precedence:
  user choice (localStorage) wins
    ↓
  prop `columns` from caller (default for new users / new device)
    ↓
  DEFAULT_COLUMNS = 3 (final fallback)

The localStorage read happens in a useEffect so SSR + first paint use the prop value, which avoids a hydration mismatch when the stored value disagrees with the prop. localStorage writes are wrapped in try/catch because some sandboxed contexts (private modes, restrictive iframes) throw on access; the control still works for the session, it just doesn't persist there.

Picker is hidden below md (768px) because the layout falls back to 1 column at that width regardless of the chosen value. Exposed options are 2/3/4: 1 is mobile-only via responsive, and 5 is too cramped for typical screens (GRID_CLASSES still supports 5 if a deployment wants to set it via prop).

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reverts the segmented "2 | 3 | 4" button group to a single <select> that mirrors the existing Sort dropdown for visual consistency on the "view controls" cluster at the right end of the filter row. Behavior unchanged: pure client-side state + localStorage persistence, no fetch and no router.refresh() in the column path — the user's column choice never triggers a backend call. Verified: npx tsc --noEmit clean across web/ (0 errors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the monolithic supervisord background pod with separate deployments for celery-worker, celery-beat, indexer-scheduler, dask-scheduler, and dask-worker. The indexer-scheduler now reads DASK_SCHEDULER_ADDRESS to dispatch run_indexing_entrypoint to a remote Dask cluster instead of an in-process LocalCluster, so indexing throughput scales horizontally with dask-worker replicas instead of being capped by one pod's RAM. Local dev keeps the LocalCluster path (no env var); a new scripts/dev_run_dask_distributed.py and docker-compose overlay reproduce the prod-shape topology without K8s. scripts/test_dask_distributed_e2e.py exercises the topology (parallelism, worker death, scheduler death) end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single migration doc covering all three slices on this branch: 1. Background indexing scaling (Dask topology) 2. Redis caching + rate limiting 3. Assistants UX rework Organized for an operator: TL;DR up top ("everything default OFF"), new deps/env/secrets summarized, deployment order, verification checklist BEFORE flipping any flags, per-feature enable steps, and the known footguns (k8s manifests missing REDIS_PASSWORD env wire-up in the bg-scaling path, seed script bypassing persona cache, CLAUDE.md update.py gate). Plus the recommended manual test list and the branch's commit map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…in AGENTS.md The bg-scaling commit (03d1649) added 5 new k8s manifests under `deployment/kubernetes/` that split the combined background pod into beat / celery / indexer-scheduler / dask-scheduler / dask-worker. But Darwin doesn't apply from `deployment/kubernetes/` — its prod manifests live under `darwin-kubernetes/`, and the two trees aren't kept in sync. Porting all five into `darwin-kubernetes/` with Darwin conventions: - Image registry sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend - configMapRef env-configmap, secretKeyRef danswer-secrets - POSTGRES_USER / POSTGRES_PASSWORD wired everywhere that talks to PG - REDIS_PASSWORD wired as optional secretKeyRef (the latent footgun flagged in MIGRATION.md §10a is now closed for the Darwin path) - indexcpu nodeAffinity + darwin/indexing toleration on every indexing-side pod (celery, indexer-scheduler, dask-scheduler, dask-worker); beat stays on the default pool (lightweight) - dynamic-pvc + file-connector-pvc volume mounts where any task may stage files The existing `darwin-kubernetes/background-deployment.yaml` (combined beat+celery+indexer via supervisord) is intentionally LEFT IN PLACE — the split is an opt-in rollout, not a forced cutover. To switch: apply the new five, verify the new pods are healthy, scale the combined deployment to 0. Also lock the convention in AGENTS.md so this doesn't recur: - New divergence-table row noting darwin-kubernetes/ is source of truth for prod. - New "Critical facts that bite" §9 documenting the two-tree split, when to touch which, and the per-pod adaptation checklist (image registry, configmap, secrets, REDIS_PASSWORD, affinity, PVCs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

§5b — Dask topology section now points at the actual ported darwin-kubernetes/*.yaml manifests with a concrete cutover script, not just "you'll need to port these later" boilerplate. §10a — Footgun is RESOLVED for the Darwin path (the 5 new Darwin manifests all wire REDIS_PASSWORD via optional secretKeyRef). Marks the entry as such rather than removing it, so the history of "why was this previously a concern" stays readable. §12 — Commit count, file count, and totals updated for the two new commits (MIGRATION.md itself + the darwin-kubernetes port). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rajiv/add-claude was merged to feature/darwin upstream, so the doc's "on top of rajiv/add-claude (PR #45)" reference is stale. The branch is now rebased onto origin/feature/darwin directly — same diff, just a fresher base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single source of truth for both production (Darwin AKS) and local dev, with image tags + env values + secrets externalised so a deploy is "edit one file, kubectl apply -k". No Helm. Replaces the flat darwin-kubernetes/ tree (which the operator will delete once they've verified the new structure against the live cluster).

Layout:
  k8s/base/
    Cleaned env-neutral manifests (one file per logical service; deployment+service merged where natural). Image refs use logical names (e.g. `danswer-backend`) which overlays rewrite to concrete registry+tag.
  k8s/overlays/prod/
    Darwin AKS production:
      kustomization.yaml → images, replicas
      env.properties → non-secret config
      secrets.env.example → template (committed)
      secrets.env → real values (gitignored)
  k8s/overlays/local/
    Same shape, local-dev defaults (host.docker.internal, latest tags, AUTH_TYPE=disabled, smaller replicas).
  k8s/optional/
    Opt-in deployments not part of base:
      redis.yaml
      background-{beat,celery,indexer-scheduler}.yaml
      dask-{scheduler,worker}.yaml
    Apply with `kubectl apply -f <file>` when rolling out the corresponding feature.
  k8s/README.md
    Layout explanation + common workflows (image bump, env change, secret rotation, Redis rollout, migrating off darwin-kubernetes/).

Built from the live-cluster dump in darwin-kubernetes/temp/ (gitignored, never committed). The cleaner script (intentionally not committed) strips status, uid, resourceVersion, generation, creationTimestamp, managedFields, last-applied-configuration annotation, restartedAt, progressDeadlineSeconds, revisionHistoryLimit, and the auto-assigned clusterIP/ipFamilies/sessionAffinity on Services. Image references in base/ are normalised to logical names so kustomize can rewrite them.

SECURITY: the live env-configmap was discovered to contain real plaintext secrets: Slack tokens, GEN_AI client secret, Jira token, Opsgenie key. The new structure moves all of those to k8s/overlays/*/secrets.env (gitignored), which renders into a kustomize-generated Secret. api-server and background deployments gain `envFrom: secretRef: danswer-secrets` so the moved values continue to reach the app as env vars. Rotation of the leaked credentials is a separate operator task; every "REPLACE_ME" in secrets.env.example marked LEAKED is one of them.

Validation:
  kubectl kustomize k8s/overlays/prod → 26 resources, clean render
  kubectl kustomize k8s/overlays/local → 26 resources, clean render
  Image substitution verified in both.

.gitignore additions:
  darwin-kubernetes/temp/          Live cluster dumps
  k8s/overlays/*/secrets.env       Real secret values per environment
  k8s/overlays/*/*.secrets.env     Defensive (any *.secrets.env variant)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In-cluster Redis is now an opt-in kustomize Component at k8s/optional/redis/, included by the local overlay via `components: [../../optional/redis]` and NOT by prod (which uses managed Redis instead). Why Component instead of `resources: ../../optional/redis.yaml`: kustomize's load restrictor rejects file-resource refs that escape the overlay's directory tree. Components are explicitly designed for opt-in cross-tree refs and pass the security check; they also let us add patches later that only apply when the component is opted in. Layout change: before: k8s/optional/redis.yaml after: k8s/optional/redis/ kustomization.yaml (kind: Component) redis.yaml The plain `kubectl apply -f k8s/optional/redis/redis.yaml` or `kubectl apply -k k8s/optional/redis` workflows still work — the file just moved one level deeper. env.properties updates: local: REDIS_HOST=redis (the in-cluster Service name, matching the component's deployment) prod: REDIS_HOST=<your-managed-redis>.redis.cache.windows.net (placeholder for Azure Cache for Redis; rename + drop the access key into secrets.env as `redis_password` when you adopt managed Redis) Validated: kubectl kustomize k8s/overlays/prod → 26 resources (no Redis) kubectl kustomize k8s/overlays/local → 28 resources (+Service +StatefulSet) README updated with the components pattern and how to add more opt-in features the same way (split-background, Dask, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… to 512MB

Reversal of the earlier "prod will use managed Redis" decision. Prod overlay now opts into the same in-cluster Redis component as local:
  k8s/overlays/prod/kustomization.yaml: adds
    components:
      - ../../optional/redis
  k8s/overlays/prod/env.properties: REDIS_HOST back to `redis` (the in-cluster Service name)

Redis StatefulSet bumped from 256MB to 512MB:
  --maxmemory                 256mb → 512mb
  resources.requests.memory   128Mi → 256Mi
  resources.limits.memory     384Mi → 1Gi

Limit set to ~2x maxmemory rather than 1.5x because the single-replica StatefulSet has no failover, so an OOM kill means a cache outage. Redis uses extra RSS beyond --maxmemory for client output buffers, COW pages during BGSAVE (if we ever turn on persistence), and fragmentation; safer to over-provision the cgroup limit and let `maxmemory-policy: allkeys-lru` do its job inside Redis's own accounting.

Validated:
  kubectl kustomize k8s/overlays/prod   → 28 resources (now includes Redis)
  kubectl kustomize k8s/overlays/local  → 28 resources (unchanged)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both prod and local overlays opted into the in-cluster Redis component, so it's no longer optional: promoted to base/redis.yaml and added to base/kustomization.yaml. Removed the now-redundant `components:` blocks from both overlays and the optional/redis/ component dir.

Net effect is identical (prod + local still render 28 resources each, both including Redis), just with less indirection now that Redis is universal rather than opt-in.

README updated: optional/ table drops the redis row with a note that it moved to base; the components: "flag" explanation now points at the split-background deployments as the example opt-in.

Validated:
  kubectl kustomize k8s/overlays/prod   → 28 resources (redis in base)
  kubectl kustomize k8s/overlays/local  → 28 resources (redis in base)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nces

The README is the doc for the k8s/ layout as it stands, not a record of how it came to be. Removed:
- "Replaces the older darwin-kubernetes/ tree" subtitle
- the whole "Migration plan (deleting darwin-kubernetes/)" section
- the "darwin-kubernetes/ is being retired" + temp/ convention bullets

Also fixed two bits left stale by moving Redis into base:
- structure diagram listed Redis under optional/ → now correctly omits it (it's in base)
- "Roll out Redis" workflow told you to `kubectl apply -f k8s/optional/redis.yaml` → rewritten as "Redis ships in base; flip the env flags to enable the cache/rate-limiter features"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

optional/ holds plain manifests (no kustomization), so they need `kubectl apply -f` and aren't picked up by `apply -k overlays/...`. Added a workflow covering:
- single-file and whole-folder apply
- the dependency on the overlay being applied first (optional pods reference the overlay-generated env-configmap / danswer-secrets)
- the full split-background + Dask cutover in dependency order (scheduler/workers → split bg pods → scale down combined), plus rollback and the dual-beat warning

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ical images)

The optional/ manifests hardcoded the image tag (sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-138) while base uses the logical name `danswer-backend` that the overlay's images: block rewrites. That inconsistency meant a tag bump had to be made in two places and the optional pods could drift from the rest of the cluster.

Fix: grouped the five split-background + Dask manifests into a single kustomize Component at k8s/optional/background-scaling/, and changed their image refs to the logical `danswer-backend`. When an overlay opts in via `components: [../../optional/background-scaling]`, the overlay's existing `images:` entry for danswer-backend parameterizes them: same tag as api-server / background, set in one place.

Verified: temporarily opting the component into the prod overlay renders all five bg-scaling pods with sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-138 (34 resources total), then reverted. Neither overlay opts in by default (prod/local still 28 resources each).

Layout change:
  before: k8s/optional/{background-beat,background-celery,background-indexer-scheduler,dask-scheduler,dask-worker}.yaml
          (plain manifests, hardcoded image, applied via kubectl apply -f)
  after:  k8s/optional/background-scaling/
            kustomization.yaml   (kind: Component)
            <same five manifests, logical image name>
          (opted into via the overlay's components: block)

README updated: optional/ is now described as opt-in components with logical-image parameterization; the apply workflow switched from `kubectl apply -f` to the components: + replicas:0 overlay edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…env-neutral)

Follow-up to the component conversion: three remaining inconsistencies vs base that were pointed out:

1. Replicas were hardcoded per-manifest. Removed them from the manifests and moved the counts into the component's kustomization.yaml `replicas:` block (one place; mirrors how the overlay parameterizes base replicas). dask-worker=3 is the indexing-throughput knob.
2. Secret/config loading differed: the component had an extra explicit REDIS_PASSWORD secretKeyRef that base doesn't. Dropped it so every pod's env block is byte-identical to base/background.yaml: explicit POSTGRES_USER/POSTGRES_PASSWORD via secretKeyRef + envFrom [configMapRef env-configmap, secretRef danswer-secrets]. (redis_password still reaches the app via the envFrom secretRef like every other secrets.env key; the explicit entry was redundant and base-divergent.)
3. Manifests carried Darwin-specific node affinity + darwin/indexing tolerations, which base does NOT (base is env-neutral; the live cluster runs without pool affinity). Stripped them so the component is environment-neutral and won't fail to schedule on a local cluster that lacks the indexcpu pool. The prod overlay re-adds indexcpu affinity + toleration via a patch when it opts in, documented in the README opt-in steps with a ready-to-paste patch block.

Verified end-to-end: opting the component into prod renders 34 resources, all five bg-scaling pods get sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-138, replicas come from the component kustomization (beat=1, celery=2, worker=3), background-deployment scaled to 0. Default (not opted in): prod/local both 28 resources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nd-scaling The apply command was present but buried under the overlay-edit YAML block and read as the generic overlay apply. Made the deploy commands explicit and labeled: preview the rendered bg/dask pods, kubectl diff vs live, apply, and rollout-status watches. Also stated plainly why there's no standalone `kubectl apply -f` for the component (logical image name only resolves through the overlay's images: block). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dask-worker 3 → 2, background-celery 2 → 1. The originals were inherited defaults from the cherry-picked feature/backgroundscaling commit, not sized to Darwin's load.

- dask-worker=2: each pod runs one connector at a time (--nworkers=1 --nthreads=1), so this caps concurrent indexing at 2. Enough unless many connectors backlog in NOT_STARTED; raise then. Cuts the worst-case indexcpu footprint by a third (was 3×4Gi, now 2×4Gi).
- background-celery=1: Celery here only runs maintenance tasks (prune, sync, deletion, analytics rollup), NOT indexing. One pod already autoscales 3-10 threads (--autoscale=3,10), which easily covers the bursty maintenance queue at this scale. The 2nd replica was redundancy we don't need.

Added inline comments noting which counts are singletons that must stay at 1 (beat, indexer-scheduler, dask-scheduler) vs the throughput knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Splitting the 10Gi monolithic background pod under-sized two containers, which OOMKilled (exit 137) in prod:
- celery-beat: 256Mi limit → OOM on app import (celery beat still imports the full langchain/llama-index/tokenizer stack). Now 512Mi req / 2Gi limit.
- indexer-scheduler: the update.py loop spikes from ~300Mi to ~5Gi per cycle (warms embedding model + Dask client + connector/index state). OOMKilled at 1Gi, 2Gi AND 4Gi. Now 4Gi req / 8Gi limit, verified stable in prod across multiple update cycles with ~3Gi headroom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove vespa/ from base so `kubectl apply -k overlays/{prod,local}` no longer touches the Vespa StatefulSets (a drifted manifest could roll them, which is the kind of blast radius behind the prior outage). Vespa now has its own overlays carrying the pinned images + namespace; apply it deliberately.

Also fix guarded-apply.sh: with Vespa gone from the app overlay, its `has_vespa=$(... | grep ...)` tripped `set -euo pipefail` on no-match and aborted before applying. Guarded with `|| true`; the guard now auto-skips for overlays without Vespa and runs for the *-vespa overlays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…istory

Root-caused the indexer-scheduler's ~7.5Gi/cycle memory spikes (which we'd been masking by raising the pod limit to 8/12Gi). py-spy on the live pod showed every spike in get_last_attempt -> SQLAlchemy fetchall.

get_last_attempt ran `select(IndexAttempt).where(cc-pair).order_by(time DESC)` then `.scalars().first()`. Result.first() does NOT add LIMIT to the SQL, so psycopg2 buffered the cc-pair's ENTIRE attempt history and the ORM materialized every row before discarding all but one. create_indexing_jobs calls this once per cc-pair every loop; with 518k index_attempt rows (12.7k for the busiest cc-pair) the per-cycle session climbed to 7.5Gi, then freed when it closed. Verified live: the worst cc-pair's query allocates +308MB without LIMIT vs +0MB with LIMIT 1.

Fix is LIMIT 1 in the query (db helper only, NOT the Dask submission path). Steady-state scheduler memory drops to ~700Mi. After the fixed image is deployed, the indexer-scheduler limit can drop from 12Gi back to ~2Gi (temp 12Gi note left in the manifest until then). Also removes the SYS_PTRACE diagnostic capability added for profiling.

Follow-up (separate): index_attempt has 518k rows, so a retention/prune pass would speed these queries further.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
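A minimal sketch of the failure shape described above. It uses the stdlib sqlite3 module as a stand-in for Postgres/psycopg2 and a simplified, hypothetical index_attempt schema, so it only illustrates the difference in how many rows reach the client, not the real helper or its memory numbers.

```python
import sqlite3

# Stand-in schema (assumption): the real index_attempt table has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE index_attempt ("
    "id INTEGER PRIMARY KEY, cc_pair_id INTEGER, time_created INTEGER)"
)
conn.executemany(
    "INSERT INTO index_attempt (cc_pair_id, time_created) VALUES (?, ?)",
    [(1, t) for t in range(1000)],
)

BASE_QUERY = (
    "SELECT id, time_created FROM index_attempt "
    "WHERE cc_pair_id = ? ORDER BY time_created DESC"
)


def last_attempt_unbounded(cc_pair_id):
    # Buggy shape: the whole history is fetched, then all but one row is discarded.
    rows = conn.execute(BASE_QUERY, (cc_pair_id,)).fetchall()
    return (rows[0] if rows else None), len(rows)


def last_attempt_limited(cc_pair_id):
    # Fixed shape: the database is asked for exactly one row.
    rows = conn.execute(BASE_QUERY + " LIMIT 1", (cc_pair_id,)).fetchall()
    return (rows[0] if rows else None), len(rows)
```

Both helpers return the same newest row; only the number of rows handed to the client differs. In SQLAlchemy terms the equivalent change is adding `.limit(1)` to the select before calling `.first()`.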

The index_attempt table had grown to ~518k rows; the retention policy for it already exists in db/retention.py but was disabled by default (RETENTION_DAYS_INDEX_ATTEMPT=0). Enable it at 30d in both overlays — the daily 08:00 UTC sweep now prunes terminal attempts older than 30d while always keeping the last 20 per (connector, credential, embedding model). Dry-run on prod: ~499k of 518k rows eligible. Pairs with the get_last_attempt LIMIT-1 fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
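The retention rule above (prune terminal attempts older than the window, but always keep the newest 20 per key) can be sketched in plain Python. The `Attempt` class, the status names, and `prunable_ids` are hypothetical illustrations; the real sweep lives in db/retention.py and presumably runs as SQL.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

TERMINAL_STATUSES = {"success", "failed"}  # assumed names for terminal states


@dataclass(frozen=True)
class Attempt:
    id: int
    key: tuple  # (connector_id, credential_id, embedding_model_id)
    status: str
    created: datetime


def prunable_ids(attempts, now, retention_days=30, keep_last=20):
    """Return ids that are terminal, older than the cutoff, and not among
    the newest `keep_last` attempts for their key."""
    cutoff = now - timedelta(days=retention_days)
    by_key = defaultdict(list)
    for attempt in attempts:
        by_key[attempt.key].append(attempt)

    doomed = []
    for group in by_key.values():
        group.sort(key=lambda a: a.created, reverse=True)
        # The newest `keep_last` rows are kept regardless of age or status.
        for attempt in group[keep_last:]:
            if attempt.status in TERMINAL_STATUSES and attempt.created < cutoff:
                doomed.append(attempt.id)
    return sorted(doomed)
```

Running this kind of selection in dry-run mode first, as the commit did on prod, shows how many rows are eligible before anything is deleted.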

Workers that booted before the dask-scheduler accepted connections failed to register their Nanny and exited 1 (CrashLoopBackOff until ordering worked out; seen as 2-3 restarts per worker at every rollout). Wrap the worker command in a plain TCP retry loop against dask-scheduler-service:8786, then exec the worker. Portable by design: it's a bare socket connect with no mesh/platform dependency, so it works the same with or without istio. Kept in the main container rather than an initContainer precisely so it stays portable — under istio an initContainer runs before the sidecar and its mesh traffic is blackholed until envoy is up. Verified live: both workers now roll with 0 restarts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
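The wait-then-exec idea above can be sketched as a bare TCP retry loop. This is an illustration in Python with hypothetical names and defaults (`wait_for_tcp`, 60 attempts, 2s delay); the actual manifest wraps the worker command in its own loop against dask-scheduler-service:8786.

```python
import socket
import time


def wait_for_tcp(host, port, attempts=60, delay_s=2.0, timeout_s=2.0):
    """Return True once a plain TCP connect succeeds, False if every attempt fails."""
    for _ in range(attempts):
        try:
            # A bare socket connect: no HTTP, no mesh or platform dependency.
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            time.sleep(delay_s)
    return False
```

A wrapper would call `wait_for_tcp("dask-scheduler-service", 8786)` and only then replace itself with the worker process (for example via `os.execvp`), so the worker never starts before the scheduler accepts connections.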

With the get_last_attempt LIMIT-1 fix deployed (image vha-140), the indexer-scheduler no longer spikes: RSS sits flat at ~430Mi across update cycles (verified live). Drop from the temporary 4Gi req / 12Gi limit back to 512Mi req / 2Gi limit. This is the hardware reclaim the limit bump was masking: ~8x request, 6x limit reduction on this pod.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Roll out the images carrying the get_last_attempt LIMIT-1 indexing-memory fix (backend) and the latest web build. Deployed + verified live: indexer-scheduler RSS flat at ~430Mi, no spike. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ior change)

Follow-up to the get_last_attempt LIMIT-1 fix: swept the codebase for the same class of "load more rows/columns than needed into memory" bugs. Each change here is behavior-preserving (same result set; only fewer columns fetched, or an additive optional bound):
- prune task (celery_app): was materializing a connector's ENTIRE document corpus as full DbDocument ORM rows just to collect .id. New get_document_ids_for_connector_credential_pair selects only the id column (same WHERE + DISTINCT, identical id set). Biggest win, on large connectors.
- document-set sync (fetch_documents_for_document_set_paginated): selected full Document rows per batch but the only caller uses just .id and the keyset cursor is the last id. Now selects Document.id; caller updated.
- get_index_attempts_for_cc_pair: added optional `limit` (default None = unchanged). The "any unfinished attempt?" existence check now passes limit=1 instead of materializing the cc-pair's full attempt history (rows carry large error_msg / full_exception_trace Text columns).
- get_current/secondary_db_embedding_model: added .limit(1), since Result.first() doesn't add LIMIT (tiny table, but the same smell as the original bug).
- delete_orphaned_search_docs: replaced fetch-all-ORM-rows-then-loop-delete with a single bulk DELETE over the same orphan set (matches retention.py).
- get_tags_by_value_prefix_for_source_types + /valid-tags: added optional `limit` (default None = unchanged) so the unbounded all-tags path can be bounded by a caller.
- db/connector.py: comment-only. Flags that the per-connector legacy Query.first() over index_attempt is safe ONLY because Query.first() emits LIMIT 1; a migration to execute().scalars().first() must add .limit(1).

NOT changed (deliberately):
- the cc-pair detail endpoint still returns the full attempt list: its frontend paginates client-side off that list, so a server cap would regress the UI. Proper fix = server-side pagination (FE+BE), tracked separately. Retention (RETENTION_DAYS_INDEX_ATTEMPT) already bounds that table in practice.
- connector_deletion's full-row load is left as-is (already batch-bounded, not a bloat bug).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… BE)

The cc-pair detail endpoint embedded the cc-pair's ENTIRE index_attempt history (a busy connector had 3,255 rows, each carrying error_msg + full_exception_trace Text) in CCPairFullInfo on every page view, and the frontend paginated it client-side. That's a request-path memory/latency risk.

BE:
- CCPairFullInfo drops the full `index_attempts` list; carries only `latest_index_attempt` (LIMIT 1) + `num_index_attempts` (count), which is all the detail page actually needs (latest status + "is this the only attempt").
- New endpoint GET /admin/cc-pair/{id}/index-attempts?page=&page_size= returns one page (LIMIT/OFFSET, newest first) + total_pages/total_count. page_size clamped to [1,100].
- db helpers: count_index_attempts_for_cc_pair + get_paginated_index_attempts_for_cc_pair (shared base stmt so count and page always agree on scope; only_current=PRESENT model, matching prior).

FE:
- IndexingAttemptsTable fetches the paginated endpoint via SWR keyed on the page, renders the server page, uses server total_pages for the selector, and finds the trace-popup attempt within the current page. Priority bump now revalidates the current page + the detail.
- page.tsx uses latest_index_attempt / num_index_attempts instead of index_attempts[0] / .length.

Verified: backend py_compile clean; full FE `tsc --noEmit` passes (0 errors). CCPairFullInfo has no other consumers. Ships with the next FE+BE image build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
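The page arithmetic behind such an endpoint can be sketched as follows. `page_window` is a hypothetical helper, and two details are assumptions not stated in the commit: pages are 1-based, and an empty history still reports one (empty) page.

```python
import math

MAX_PAGE_SIZE = 100  # upper clamp from the commit: page_size in [1, 100]


def page_window(page, page_size, total_count):
    """Return (limit, offset, total_pages) for a 1-based page number."""
    size = max(1, min(MAX_PAGE_SIZE, page_size))
    # Assumption: report at least one page even when there are no rows.
    total_pages = max(1, math.ceil(total_count / size))
    page = max(1, page)
    return size, (page - 1) * size, total_pages
```

For the 3,255-row connector mentioned above, a request for page_size=500 is clamped to 100 rows per page, giving 33 pages instead of one 3,255-row payload.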

Deploys the SQL over-materialization audit fixes (get_document_ids prune path, get_index_attempts_for_cc_pair limit, embedding_model LIMIT 1, bulk orphan delete, tags limit knob). Verified live on vha-141. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rallel chunk deletes

Root cause of the slow SF-Account connector (diagnosed live): it isn't the source pull or embedding, it's the Vespa clear-and-rewrite. The connector filters by LastModifiedDate, but a Salesforce automation bumps that timestamp on ~all accounts, so a single poll pulled 209,759 of ~216k records, each forcing a full per-doc Vespa delete+rewrite (~2.7s/doc → ~6.5 days, never finishes, so last_successful never advances and it re-pulls forever).

Fix 1 (the big one): content-hash skip, in the pipeline so ALL connectors benefit, default-on, backward-compatible:
- New nullable document.indexed_content_hash (sha256 of the INDEXED content: sections/title/semantic_id/metadata/owners, NOT doc_updated_at). Migration e5f6a7b8c9d0.
- Document.get_content_hash() computes it (connectors/models.py).
- get_doc_ids_to_update() skips a doc when the stored hash matches the current content hash, regardless of how doc_updated_at moved. Rows with no stored hash fall back to the original updated_at behavior (NULL → unchanged semantics), so existing data behaves exactly as before; hashes populate lazily on each doc's next successful index. The "re-index from beginning" path (ignore_time_skip) still bypasses both skips for a forced full reindex.
- The hash is written only AFTER a confirmed Vespa write (alongside doc_updated_at in update_docs_updated_at), so it always reflects what's actually in the index.

Fix 2: cheaper re-index for docs that DO change. _delete_vespa_doc_chunks deletes a document's chunks concurrently (bounded local pool) instead of one sequential HTTP round-trip at a time. @Retry semantics preserved.

Adds a unit test for get_doc_ids_to_update (hash skip + updated_at fallback). backend py_compile clean.

Requires `alembic upgrade head` + bounce of api-server/background. First post-deploy poll still does one full pass to populate hashes; subsequent polls skip unchanged docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
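A rough sketch of the content-hash skip from Fix 1, under stated assumptions: documents are plain dicts, the hash is sha256 over a canonical JSON dump of the indexed fields, and the legacy fallback re-indexes when doc_updated_at is newer than the stored value. The function names `content_hash` and `needs_reindex` are hypothetical; the real logic lives in Document.get_content_hash() and get_doc_ids_to_update().

```python
import hashlib
import json


def content_hash(doc):
    """sha256 over the indexed fields only; doc_updated_at is deliberately excluded."""
    payload = {
        "sections": doc["sections"],
        "title": doc.get("title"),
        "semantic_id": doc["semantic_id"],
        "metadata": doc.get("metadata", {}),
        "owners": doc.get("owners", []),
    }
    # Canonical serialization so equal content always hashes identically.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def needs_reindex(doc, stored_hash, stored_updated_at, ignore_time_skip=False):
    if ignore_time_skip:
        return True  # forced full reindex bypasses both skips
    if stored_hash is not None:
        # Hash present: timestamp churn alone never triggers a rewrite.
        return content_hash(doc) != stored_hash
    # No stored hash yet: fall back to timestamp comparison (assumed legacy rule).
    if stored_updated_at is None or doc.get("doc_updated_at") is None:
        return True
    return doc["doc_updated_at"] > stored_updated_at
```

With this shape, a record whose LastModifiedDate was bumped by an automation but whose indexed fields are unchanged hashes to the same value and is skipped.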

Surfaces how many docs in each batch skip the embed + Vespa clear-and-rewrite because their content hash / timestamp is unchanged. Logged at INFO only when >0, in the pipeline so it covers all connectors. Lets us confirm in prod logs (post-deploy) that Salesforce stops re-indexing its LastModifiedDate-churned records. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… pool) background-lite's celery-worker ran `--pool=threads --autoscale=3,10`, which CrashLoopBackOff'd in prod: autoscale calls pool.grow()/shrink(), which the threads TaskPool doesn't implement, so the first task burst killed the worker with `AttributeError: 'TaskPool' object has no attribute 'grow'` (Unrecoverable error), the unacked task was redelivered, and it crashed again. Autoscale is prefork-only. Replaced with a fixed `--concurrency=10` (these maintenance tasks — prune / doc-set sync / deletion / analytics / retention sweep — are I/O-bound, so a fixed thread pool is the right fit). Verified live: worker now drains a burst of 15+ queued tasks with 0 restarts; pod 4/4 Running. Manifest-only, applied to prod (no image rebuild). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Prod data: WEB connectors were ~89% FAILED (78/88 over 30d), almost all "Stopped mid run, likely due to the background process being killed": slow, unbounded crawls that ran long enough to get killed, then retried and failed again.

Aligned our connector with upstream's approach:
- Drop the PER-PAGE check_internet_connection (was a full extra GET for every URL, doubling network work, and 403'ing on bot-protected sites Playwright loads fine, with failures tearing down the whole browser). Now one check on the base URL.
- page.goto(timeout=30s, wait_until="domcontentloaded") instead of the default unbounded "load" (which waits for every image/font). Big per-page speedup, bounded stalls.
- Per-page retries (WEB_CONNECTOR_MAX_RETRIES=3, exp backoff + jitter) instead of skip-on-first-error. A single page error retries with a fresh page on the SAME browser; only an actual browser crash (_is_browser_dead) restarts Playwright, so no more whole-browser teardown per flaky page.
- Set a real User-Agent on the Playwright context (avoid WAF/403 blocks).
- WEB_CONNECTOR_MAX_PAGES cap (default 5000) bounds recursive crawls so they finish before the freeze-timeout that was killing them.
- Timeout on the PDF requests.get (was unbounded → could hang the attempt).

All self-contained to the web connector; 4 modes / redirects / recursion / batching / error semantics preserved. New knobs in app_configs. Ships with the next backend image build (connector runs on dask-workers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
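The per-page retry policy (exponential backoff plus jitter) can be sketched generically. `fetch_with_retries` and its parameters are hypothetical, and treating the limit of 3 as total attempts rather than retries after the first try is an assumption; the real connector also swaps in a fresh Playwright page between attempts, which is omitted here.

```python
import random
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff + jitter."""
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_exc = exc
            if attempt < max_attempts - 1:
                # Delay doubles each attempt; jitter spreads out simultaneous retries.
                delay = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s)
                sleep(delay)
    raise last_exc
```

The `sleep` parameter is injectable so the policy can be exercised in tests without real waiting.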

poll_source appended `AND updated >= ... AND updated <= ...` to the end of the user's jira_filter. If the filter ended in `ORDER BY ...` (JQL requires WHERE conditions before ORDER BY), the result was invalid JQL and Jira returned HTTP 400 "Expecting ',' but got 'AND'" on every poll — connector 386 (SRE) failed daily for weeks. New _add_time_window_to_jql() splits on a trailing ORDER BY and injects the window in front of it; appends normally otherwise. Verified live: the previously-400'ing connector now returns SUCCESS. (Separate issue surfaced once the 400 cleared: that credential's API token is expired -> /myself 401 -> anonymous -> 0 issues; needs a token refresh, not a code change.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
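One way the ORDER BY split could look, as a sketch only: the real _add_time_window_to_jql may differ, and the quoted date format, the regex, and the function name below are assumptions. The regex does not account for the text "order by" appearing inside a quoted JQL string.

```python
import re

# Matches the first ORDER BY clause through the end of the filter.
_ORDER_BY = re.compile(r"(?:^|\s)ORDER\s+BY\s.*$", re.IGNORECASE | re.DOTALL)


def add_time_window_to_jql(jql, start, end):
    """Insert the updated-time window before any ORDER BY clause."""
    window = f"updated >= '{start}' AND updated <= '{end}'"
    jql = jql.strip()
    match = _ORDER_BY.search(jql)
    if match:
        where = jql[: match.start()].strip()
        order = jql[match.start():].strip()
    else:
        where, order = jql, ""
    combined = f"{where} AND {window}" if where else window
    return f"{combined} {order}".strip()
```

A filter without ORDER BY gets the window appended as before; one with ORDER BY keeps its ordering clause at the end, which is where JQL requires it.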

Imported the worthwhile, fork-compatible improvements from upstream's Jira connector (skipping checkpoint/hierarchy/EE-perm-sync, which don't apply here):
- Per-issue error tolerance: each issue is processed in its own try/except; a single malformed ticket is logged and skipped instead of aborting the whole attempt (previously one bad issue failed the entire sync).
- Implement IdConnector.retrieve_all_source_ids(): lists doc ids (<base>/browse/<KEY>) fetching ONLY the `key` field, so the prune task can detect deleted issues cheaply rather than loading every full document.
- Richer metadata: issuetype, reporter, project (in addition to priority / status / resolution / labels); reporter also added to primary_owners.
- Fix load_from_state(): it referenced self.quoted_jira_project, which is never set -> AttributeError on any call (the prune fallback hit this). Now uses the configured jira_filter (full, unbounded load).

Pairs with the earlier ORDER-BY JQL fix. Connector 386 (SRE) verified live: SUCCESS, 50 issues indexed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
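The per-issue tolerance pattern is small enough to sketch directly. `process_issues` and the dict-shaped issues are hypothetical stand-ins for the connector's real issue objects and conversion function.

```python
import logging

logger = logging.getLogger(__name__)


def process_issues(issues, to_document):
    """Convert each issue independently; log and skip the ones that fail."""
    docs, skipped = [], []
    for issue in issues:
        try:
            docs.append(to_document(issue))
        except Exception:
            # One malformed ticket must not abort the whole attempt.
            logger.exception("Skipping malformed issue %s", issue.get("key", "<unknown>"))
            skipped.append(issue.get("key"))
    return docs, skipped
```

Returning the skipped keys alongside the documents keeps the failures visible to the caller instead of hiding them in logs only.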

Rolls out the indexing content-hash skip, web-connector resilience, and Jira connector fixes (ORDER-BY JQL, per-issue tolerance, ID-based pruning, richer metadata). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ed files Formatting-only; no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…prettier) Formatting-only; ruff clean. Makes the quality-checks CI job pass — earlier branch commits had not been pre-commit-formatted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The prune task instantiates connectors with InputType.PRUNE. Single-class sources return their class for any input_type, but SLACK is dict-mapped (LOAD_STATE/POLL only), so prune failed every run with "Connector not found for source=SLACK" -> deleted Slack messages were never pruned from the index. Map PRUNE to SlackPollConnector (the API connector, same config as the live POLL connectors). NOT SlackLoadConnector, which requires an export_path_str (reads a Slack export file) and would TypeError on an API connector's config. extract_ids_from_runnable_connector then enumerates message ids via poll_source(epoch, now). Verified SlackPollConnector constructs with a live Slack connector's config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
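The lookup bug can be illustrated with stub classes. Everything below is a simplified, hypothetical model of the connector registry (the enum values, `CONNECTOR_MAP`, and `identify_connector_class` are illustrative), showing why a dict-mapped source needs an explicit PRUNE entry while a single-class source does not.

```python
from enum import Enum


class InputType(Enum):
    LOAD_STATE = "load_state"
    POLL = "poll"
    PRUNE = "prune"


# Stub classes standing in for the real connectors.
class SlackLoadConnector: ...
class SlackPollConnector: ...
class JiraConnector: ...


CONNECTOR_MAP = {
    # Single-class source: the same class serves every input type.
    "jira": JiraConnector,
    # Dict-mapped source: every supported input type must be listed explicitly.
    "slack": {
        InputType.LOAD_STATE: SlackLoadConnector,
        InputType.POLL: SlackPollConnector,
        InputType.PRUNE: SlackPollConnector,  # the added mapping
    },
}


def identify_connector_class(source, input_type):
    entry = CONNECTOR_MAP[source]
    if isinstance(entry, dict):
        cls = entry.get(input_type)
        if cls is None:
            raise LookupError(f"Connector not found for source={source}")
        return cls
    return entry
```

Without the PRUNE entry in the Slack dict, the lookup raises for every prune run, which matches the failure the commit describes.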

vha-146 carries the Slack InputType.PRUNE fix (verified live). Manifest now matches the images actually deployed in the darwin cluster. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Built + pushed from the Mac (Apple Virtualization + Rosetta backend) via k8s/scripts/build-deploy.sh; deployed to darwin and verified live (api-server rolled out, pods healthy). No code/Dockerfile changes — image content matches the prior tags. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cumulative stages (build < push < deploy) plus a standalone verify, against the prod overlay:
- reads + auto-increments the vha-N image tags from kustomization.yaml
- disk pre-req before build (graduated docker prune if >=80% full)
- registry login from ACR_USERNAME/ACR_PASSWORD env (or ~/.zshrc); exits if unset
- deploy edits the manifest, kubectl apply -k, waits on rollout
- verify compares live cluster tags vs manifest + pod health
- deploy refuses unless kubectl context is the prod cluster (FORCE=1 to override)
- DRY_RUN=1 prints commands without mutating anything

bash 3.2 compatible (stock macOS). No secrets in the script.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
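The tag auto-increment step can be sketched in Python, although the actual script is bash. This assumes the overlay uses the standard kustomize `images:` entries with `newTag: vha-N` lines and that every such tag is bumped together; `bump_vha_tags` and the `danswer-web` image name in the usage below are hypothetical.

```python
import re

# Matches a whole "newTag: vha-<N>" line, preserving indentation.
_TAG_LINE = re.compile(r"^([ \t]*newTag:[ \t]*)vha-(\d+)[ \t]*$", re.MULTILINE)


def bump_vha_tags(kustomization_text):
    """Return the text with every vha-N newTag incremented by one."""
    return _TAG_LINE.sub(
        lambda m: f"{m.group(1)}vha-{int(m.group(2)) + 1}",
        kustomization_text,
    )
```

For example, a file containing `newTag: vha-146` and `newTag: vha-76` comes back with `vha-147` and `vha-77`, and all other lines are left untouched.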

Sarath1018 approved these changes Jun 2, 2026

vnaren23 approved these changes Jun 3, 2026

rajivml and others added 28 commits June 3, 2026 14:12

rajivml and others added 22 commits June 3, 2026 14:12

style: apply pre-commit (black / reorder-imports / prettier) to chang…

f201995

style: apply pre-commit across branch files (black/reorder/autoflake/…

b0f383d

k8s(prod): bump backend vha-144->vha-146, web vha-75->vha-76

84caacf

rajivml force-pushed the feature/redis-assistants-ux-background-scaling branch from f5c3d5f to 4112145 June 3, 2026 08:43

ninja-shreyash approved these changes Jun 3, 2026

rajivml closed this Jun 3, 2026

rajivml reopened this Jun 3, 2026

rajivml force-pushed the feature/redis-assistants-ux-background-scaling branch from 4112145 to eba29bd June 3, 2026 08:53

rajivml merged commit fecda3d into feature/darwin Jun 3, 2026
6 checks passed

rajivml deleted the feature/redis-assistants-ux-background-scaling branch June 3, 2026 08:58

rajivml merged 102 commits into feature/darwin from feature/redis-assistants-ux-background-scaling

4 participants
