
k8s kustomize, Redis caching, background/indexing scaling, Azure Blob, retention + connector reliability #46

Merged
rajivml merged 102 commits into feature/darwin from feature/redis-assistants-ux-background-scaling
Jun 3, 2026

Conversation

Collaborator

@rajivml rajivml commented Jun 2, 2026

Summary

This branch brings the Darwin deployment up to a scalable, observable, and reliable footing: a kustomize-based k8s layout, Redis caching + rate limiting, horizontally-scaled background/indexing, Azure Blob file storage, DB retention, and a series of indexing/connector performance & reliability fixes (several found and verified against prod).

Functionalities implemented

Kubernetes / deployment

  • Replaced the flat darwin-kubernetes/ manifests and the Helm chart with a kustomize base/ + overlays/{prod,local} layout (single source of truth for image tags, replicas, env, secrets).
  • In-cluster Redis (StatefulSet) in base; secrets via gitignored secrets.env.
  • Vespa pinned to an exact version (never :latest after the prior outage), decoupled from the app overlays into overlays/{prod,local}-vespa, with a guarded-apply.sh version-gate and an ordered, health-gated vespa-upgrade.sh.
  • Startup/readiness probes on the api-server; removed vestigial PVC mounts.

Redis caching & rate limiting

  • Per-user Redis caches (document-set list, connector indexing-status) and a per-user chat rate limiter.
  • Celery broker moved onto Redis; env-driven Postgres pool sizing.

Background / indexing scaling

  • Split the monolithic background pod into background-lite (celery worker + beat + slack listener), background-indexer-scheduler, and a remote Dask scheduler + horizontally-scalable workers; optional KEDA autoscale for dask-workers.
  • Right-sized memory; fixed the celery-worker --autoscale+threads-pool crashloop (→ --concurrency).

Azure Blob file store

  • Optional AzureBlobFileStore backend (bytes off Postgres), direct-to-Blob SAS upload from the browser with a progress bar, and chat-upload size/token limits.

DB retention

  • Pluggable retention sweep (kombu_message, task_queue, index_attempt, chat, usage_reports, permission_sync) under an advisory lock, run daily; opt-in index_attempt pruning.

Indexing performance & reliability

  • Content-hash skip: re-index a document only when its indexed content actually changed (not just its timestamp) — eliminates needless Vespa rewrites for churny sources (e.g. Salesforce LastModifiedDate). Backward-compatible, all connectors.
  • get_last_attempt LIMIT 1 + an audit of other over-materializing queries (prune id-only fetch, doc-set sync, cc-pair attempts) — fixed the indexer-scheduler OOM.
  • Parallelized per-document Vespa chunk deletes.
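
A minimal sketch of the content-hash skip described above, assuming the hash covers only the indexed text sections and that a missing stored hash (a row indexed before this change) forces a re-index. The names `content_hash` and `needs_reindex` are illustrative and are not taken from the branch.

```python
import hashlib
from typing import Optional


def content_hash(sections: list[str]) -> str:
    """Hash only the indexed text, ignoring timestamps and other volatile metadata."""
    digest = hashlib.sha256()
    for text in sections:
        encoded = text.encode("utf-8")
        # Length-prefix each section so ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()


def needs_reindex(stored_hash: Optional[str], sections: list[str]) -> bool:
    """Re-index when no hash was stored yet (assumed for older rows) or the content changed."""
    return stored_hash is None or stored_hash != content_hash(sections)
```

With a check like this, a source that only bumps its modification timestamp produces the same hash and the Vespa write is skipped.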

Connectors

  • Web: dropped the per-page connectivity GET; added domcontentloaded navigation, per-page retries, a User-Agent, and a max-pages cap; a single page error no longer tears down the whole browser. This fixed the ~89% failure rate (verified live).
  • Jira: fixed the ORDER BY JQL bug (HTTP 400 on every poll) and a broken load_from_state; added per-issue error tolerance, IdConnector ID-only retrieval (enables pruning of deleted issues), and richer metadata.

UX / analytics

  • cc-pair index-attempt history moved to server-side pagination (FE + BE).
  • Durable per-user daily analytics (survives chat retention), adoption curves, most-used assistants; chat folder/assistant UX polish.

Verification

  • pre-commit (black / reorder-imports / autoflake / ruff / prettier) passes on the changed files; backend py_compile clean; web tsc --noEmit clean.
  • No secrets committed (the only key-like string is the public Azurite emulator default in a test docstring; real secrets live in gitignored secrets.env).
  • Prod-verified: web + Jira connectors now succeed; indexer-scheduler memory flat; celery-worker stable.

🤖 Generated with Claude Code

rajivml and others added 28 commits June 3, 2026 14:12
Plan for exposing chat to a few hundred users: P0 connection-pool/session
fix, P1 Redis foundation + DynamicConfigStore read-through cache, P2
per-user request rate limiting, P3 per-chat-turn config caches. Plan only,
no implementation yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation for caching/rate-limiting work (see REDIS_CACHING_PLAN.md).
This commit only ships the cache piece — no behavioural change unless
REDIS_KV_CACHE_ENABLED=true is set.

* requirements: pin redis==5.0.8.
* configs/app_configs.py: REDIS_HOST/PORT/PASSWORD/DB_NUMBER/SSL,
  REDIS_POOL_MAX_CONNECTIONS, REDIS_HEALTH_CHECK_INTERVAL,
  REDIS_SOCKET_TIMEOUT_SECONDS; cache toggle REDIS_KV_CACHE_ENABLED
  (default OFF) and REDIS_KV_CACHE_TTL_SECONDS (1 day).
* danswer/redis/redis_pool.py: lazy ConnectionPool singleton +
  get_redis_client() helper. Single-tenant — DANSWER_REDIS_KEY_PREFIX
  is the only namespace; upstream's TenantRedisClient is intentionally
  not ported.
* dynamic_configs/store.py: RedisCachedDynamicConfigStore wraps any
  inner DynamicConfigStore with read-through / write-through caching.
  Inner store stays the source of truth (writes inner first), encrypted
  values are NEVER cached plaintext (just invalidated), every Redis
  call is fail-open so an outage degrades latency, not availability.
* dynamic_configs/factory.py: when REDIS_KV_CACHE_ENABLED, transparently
  wraps the existing PostgresBackedDynamicConfigStore — call sites
  unchanged.
* Deployment: redis service in docker-compose.dev.yml (cache-only:
  no AOF, no RDB snapshots, allkeys-lru @ 256mb so a runaway producer
  can't OOM the node). darwin-kubernetes/redis-statefulset.yaml mirrors
  that posture. REDIS_HOST etc. in env-configmap; REDIS_PASSWORD wired
  via optional secretKeyRef so the deployments still boot when Redis
  is unauth'd. NOT the Celery broker — that stays on Postgres by design.
* backend/.gitignore: ignore stray pywikibot apicache/throttle.ctrl
  artifacts dropped by the existing mediawiki test.
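
The read-through / write-through behaviour of the cached store can be sketched roughly as follows. This is an illustration only: `DictStore` and `FakeCache` are hypothetical in-memory stand-ins for the Postgres-backed store and the Redis client, `CachedStore` is not the real `RedisCachedDynamicConfigStore`, and the encrypted-value path is left out.

```python
import json
from typing import Any, Optional


class DictStore:
    """Stand-in for the inner (Postgres-backed) store; the source of truth."""

    def __init__(self) -> None:
        self.data: dict[str, Any] = {}

    def store(self, key: str, value: Any) -> None:
        self.data[key] = value

    def load(self, key: str) -> Any:
        return self.data[key]  # raises KeyError when the key is absent


class FakeCache:
    """In-memory stand-in exposing a get/set subset of a Redis client."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        self.down = False  # flip to True to simulate an outage

    def get(self, key: str) -> Optional[str]:
        if self.down:
            raise ConnectionError("cache unavailable")
        return self.data.get(key)

    def set(self, key: str, value: str, ex: int) -> None:
        if self.down:
            raise ConnectionError("cache unavailable")
        self.data[key] = value  # a real client would also apply the TTL `ex`


class CachedStore:
    def __init__(self, inner: DictStore, cache: FakeCache, ttl_seconds: int = 86400) -> None:
        self.inner, self.cache, self.ttl = inner, cache, ttl_seconds

    def _fill(self, key: str, value: Any) -> None:
        try:
            # Wrap the value so a cached None is distinguishable from a miss.
            self.cache.set(key, json.dumps({"v": value}), ex=self.ttl)
        except Exception:
            pass  # fail-open: a cache outage must not fail the caller

    def store(self, key: str, value: Any) -> None:
        self.inner.store(key, value)  # inner first: it stays authoritative
        self._fill(key, value)

    def load(self, key: str) -> Any:
        try:
            raw = self.cache.get(key)
            if raw is not None:
                return json.loads(raw)["v"]
        except Exception:
            pass  # cache error or corrupt entry: treat as a miss
        value = self.inner.load(key)  # KeyError propagates, so not-found is never cached
        self._fill(key, value)
        return value
```

The wrapper object around the value is one way to get the "cached-None vs miss" distinction the tests below mention; the real implementation may encode it differently.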

Tests (unittest, no real Redis required — mocks at the get_redis_client
boundary):
  - tests/.../redis_layer/test_redis_pool.py: pool singleton, prefix
    constant, reset_pool_for_tests.
  - tests/.../dynamic_configs/test_redis_cached_store.py: read-through,
    write-through invalidation, TTL on SET, cached-None vs miss,
    not-found NOT cached, encrypted values not mirrored, corrupt entry
    treated as miss, fail-open on Redis errors. 13 cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layered on top of P1's Redis client. Complements the existing token-
budget limiter (token_limit.py) — that's a DB-backed COST cap, this is
a Redis-backed REQUEST-COUNT cap that's correct across api_server
replicas. Both run; this one runs first so a 429'd caller never even
touches the DB-backed usage query.

Default OFF. Enable per-environment via:
  REQUEST_RATE_LIMIT_ENABLED=true
  REQUEST_RATE_LIMIT_PER_MINUTE=<N>   # 0 = disable that window
  REQUEST_RATE_LIMIT_PER_HOUR=<N>

* server/middleware/request_rate_limit.py: fixed-window buckets keyed
  by floor(time/window). Atomic INCR + EXPIRE NX so the bucket
  boundary is fixed on first increment (without NX, every request
  would push expiry forward and the bucket would never reset — that
  bug is covered by an explicit test). Authenticated users keyed by
  uuid; anonymous keyed by the first X-Forwarded-For hop, falling back
  to the socket peer; if neither yields an IP we skip (better than
  bucketing every anonymous request under "").
* Fail-OPEN on any Redis error: a Redis blip lets requests through with
  a warning, never wedges the chat path.
* 429 response carries a Retry-After header with seconds-until-bucket-
  rollover so well-behaved clients back off precisely.
* Wired as a FastAPI Depends on:
    POST /chat/send-message
    POST /direct-qa/stream-answer-with-quote
  Both endpoints also keep the existing check_token_rate_limits.
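
A rough sketch of the fixed-window scheme described above, assuming a client that exposes `incr` and `expire(..., nx=True)`. `FakeRedis` is an in-memory stand-in so the example runs without Redis, the key layout and function names are hypothetical, and the two commands are issued one after the other here rather than atomically as in the commit.

```python
import math
import time
from typing import Optional


class FakeRedis:
    """In-memory stand-in for the two Redis commands the limiter needs."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}
        self.ttls: dict[str, int] = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key: str, seconds: int, nx: bool = False) -> bool:
        if nx and key in self.ttls:
            return False  # NX: keep the expiry set by the first increment
        self.ttls[key] = seconds
        return True


def rate_limit_subject(
    user_id: Optional[str], forwarded_for: Optional[str], peer_ip: Optional[str]
) -> Optional[str]:
    """Authenticated users key by id; anonymous callers by first XFF hop, then socket peer."""
    if user_id:
        return f"user:{user_id}"
    first_hop = (forwarded_for or "").split(",")[0].strip()
    ip = first_hop or (peer_ip or "")
    return f"ip:{ip}" if ip else None  # None means: skip limiting for this request


def check_rate_limit(
    client, subject: str, limit: int, window_seconds: int, now: Optional[float] = None
) -> Optional[int]:
    """Return None when the request is allowed, else a Retry-After value in seconds."""
    if limit <= 0:
        return None  # 0 disables this window
    now = time.time() if now is None else now
    bucket = int(now // window_seconds)
    key = f"ratelimit:{subject}:{window_seconds}:{bucket}"  # hypothetical key layout
    try:
        count = client.incr(key)
        client.expire(key, window_seconds, nx=True)  # never slides the TTL forward
    except Exception:
        return None  # fail-open: a Redis error lets the request through
    if count <= limit:
        return None
    seconds_left = (bucket + 1) * window_seconds - now
    return max(1, math.ceil(seconds_left))
```

A caller would turn a non-None return value into a 429 response with that number in the Retry-After header.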

Tests (unittest, mocked Redis pipeline — no real Redis required):
  - default-OFF short-circuits before any Redis call (covers both
    REQUEST_RATE_LIMIT_ENABLED=false AND both windows = 0).
  - within-limit: N requests under cap all allowed.
  - over-limit raises 429 with Retry-After header.
  - per-user isolation: distinct users have independent counters.
  - bucket rollover resets count (time-mocked).
  - EXPIRE NX semantics — locks down the no-sliding-TTL invariant.
  - anonymous keyed by XFF first hop; no-IP skips silently.
  - fail-open: Redis error doesn't propagate. 9 cases total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /persona (Manage Assistants → "View available assistants") fires
get_personas(user_id, ...) — a multi-OR permission-filtered query joining
Persona, Persona__User, Persona__UserGroup, User__UserGroup. At hundreds
of concurrent users opening the chat UI around the same time, the burst
puts unnecessary pressure on the DB connection pool (which is the actual
scaling ceiling for streaming chat — see REDIS_CACHING_PLAN.md).

Design: global cache + per-user filter (not per-user response cache),
so the multi-user-burst pattern collapses 200 queries into ~1:

  danswer:personas:all:not_deleted   global, all PersonaSnapshot
                                     including is_public / users / groups
                                     (PersonaSnapshot already carries the
                                     permission inputs — no separate
                                     payload shape needed)
  danswer:personas:groups:{user_id}  per-user, list[int] of group ids

At request time the cached list is filtered in Python mirroring the SQL
OR-block exactly:
  is_public
  OR user.id IN persona.users.id
  OR (user_group_ids ∩ persona.groups)
The parity vs SQL is locked down by tests (one per branch + negative).
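
The in-Python filter could look roughly like this. `CachedPersona` is a hypothetical slice of PersonaSnapshot carrying only the permission inputs, and `visible_personas` is an illustrative name. The admin path (user_id=None) is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class CachedPersona:
    """Hypothetical slice of PersonaSnapshot: only the permission inputs."""

    id: int
    is_public: bool
    user_ids: set[str] = field(default_factory=set)
    group_ids: set[int] = field(default_factory=set)


def visible_personas(
    personas: list[CachedPersona], user_id: str, user_group_ids: list[int]
) -> list[CachedPersona]:
    """Mirror the SQL OR-block: public, shared directly, or shared via a group."""
    groups = set(user_group_ids)
    return [
        p
        for p in personas
        if p.is_public or user_id in p.user_ids or bool(groups & p.group_ids)
    ]
```

Because the global list is shared, only the small per-user group list differs between callers, which is what lets a burst of users reuse one cached payload.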

Invalidation is explicit + write-through:
  - 9 mutation paths in db/persona.py call invalidate_personas_all()
    AFTER db_session.commit() (after-commit ordering avoids stale-fill
    race during open transactions).
  - 3 paths in ee/danswer/db/user_group.py (insert/update/prepare-delete)
    call invalidate_user_groups(uid) for each affected user.
  - 24h TTL is ONLY a safety net for missed busts; primary mechanism is
    explicit so persona/group edits are visible immediately.
  - Default OFF (PERSONA_CACHE_ENABLED=false); enable per environment.
  - Fail-OPEN on every Redis op: a Redis outage degrades latency, not
    availability, and a failed bust doesn't roll back the DB write.
  - include_deleted=True falls through to direct DB (uncommon shape;
    we deliberately don't cache it).

Encrypted values: N/A — PersonaSnapshot has no encryption-at-rest
guarantee to bypass (unlike the KV store layer from P1).

Tests (17, mocked Redis + db boundary, no real services):
  - 6 filter-parity cases (one per SQL branch + mixed + zero-groups edge)
  - 2 user-group cache cases (miss/hit, TTL propagation)
  - 3 routing cases (disabled fallthrough, include_deleted bypass, admin
    user_id=None path skips group lookup)
  - 4 invalidation cases (right key for each side, disabled short-circuit,
    Redis-error-during-bust swallowed)
  - 2 fail-open-on-read cases (GET error → miss, corrupt entry → miss)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the prior "move up / move down inside a 3-dot popover" flow on
/assistants/mine with eight coordinated changes. Backend unchanged — the
existing PATCH /api/user/assistant-list endpoint already accepts the full
chosen_assistants array, so every interaction lands as one optimistic
local update + one PATCH + (on failure) a rollback.

  1. Drag-and-drop reorder via @dnd-kit (already in package.json) with a
     grab handle on each visible row. Pointer activation distance of 6px
     so clicks on the handle don't accidentally start a drag. Keyboard
     reordering comes for free via dnd-kit's default activator focus.

  2. Explicit "set as default" — pin/star icon on each visible row;
     filled when the row is the user's default (position 0 of
     chosen_assistants), with an accent border + "Default" chip on that
     row. Ordering and default are now orthogonal — reorder freely
     without accidentally changing your default.

  3. Visibility as a row-level switch instead of a buried "Hide / Remove"
     popover item. One unified list with a "Hidden (N)" divider; hidden
     rows render at reduced opacity and have no drag handle (no position
     to drag to). The prior separate "Active Assistants" / "Your Hidden
     Assistants" sections collapse into this single list. Refuses to
     hide the last visible row (can't ship the user a broken picker).

  4. Client-side search filter — matches name, description, or tool name.
     Applies to both visible and hidden sections so search-then-toggle
     for "where did I put X" is one motion.

  5. Information density rebalanced. Description is now the primary
     signal (was the smallest text). Tools/sources collapse into compact
     "{n} tools" / "{n} sources" chips so the row scans for "should I
     pick this?" not "what are its internals?". Full tool list reveals
     on hover via title attribute.

  6. Bulk select column + sticky action bar. Checkbox appears on hover
     or focus and stays visible when selected. Action bar shows
     Show / Hide / Remove + Clear when anything is selected. Refuses
     bulk-hide that would empty the visible list.

  7. Header cleanup: title + 1-line subtitle + Create button top-right,
     "Browse all available" as a text link. The prior two giant nav
     tiles + paragraph of explanatory copy are gone — recovers vertical
     space on a page whose real content is the list.

  8. Undo on every state-mutating toast (reorder / set-default / hide /
     show / bulk ops). PopupSpec gains an optional `undo: { onClick }`
     field; the popup stays on screen 6s instead of 4s when undoable so
     the user has time to react. Undo restores the prior chosen_assistants
     array via another PATCH — symmetric round-trip, no special endpoint.

New helpers in lib/assistants/updateAssistantPreferences.ts:
  reorderAssistantList(newOrder)      — full-array PATCH (drag, undo)
  setDefaultAssistant(id, list)       — move id to position 0
  bulkRemoveFromList(ids, list)       — set difference
  bulkAddToList(ids, list)            — set union, appended at end

The pre-existing moveAssistantUp/Down helpers are kept (other callers
may still import them) but no longer used on this page.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sister rework to the Manage Assistants page. With 50+ accessible
assistants and growing, the old flat 2-column grid had no hierarchy and
no status signal — every card looked identical regardless of whether it
was yours, shared, public, or already in your picker. Same conceptual
fix as Manage: give the page structure so scanning answers "what's
mine?", "what's new?", "what does this one do?".

Backend unchanged — every interaction PATCHes chosen_assistants via the
existing /api/user/assistant-list endpoint (same path the Manage page
uses). All mutations are optimistic + undoable.

Changes (numbers map to the design proposal):
  1. Per-card "In your picker" badge + muted card style when added.
     The eye now finds the un-added ones at a glance.
  2. Three implicit sections: Yours / Shared with you / Featured & Built-in.
     Empty sections hide; section headers carry counts.
  3. Filter chip rows: availability (All / Available to add / Already
     added) with live counts, plus auto-generated per-tool chips for
     tools that appear in ≥2 assistants (avoids chip-bloat as the
     dataset grows). Tool filters use OR semantics.
  4. Owner display: best-effort name from the email local-part
     (split on '@', dots/underscores→spaces) with a "Built-in" badge
     for default_persona assistants. Kills the fork-specific
     "Author: Darwin" magic string.
  6. Responsive grid: 1 / 2 / 3 / 4 cols by breakpoint.
  7. Header matches the Manage rebuild — title + subtitle + Create
     button top-right, "Back to my assistants" as a text link. Cut the
     giant centered nav button and the explanatory paragraph.
  8. Sort dropdown: Featured (API order, respects admin display_priority)
     / A → Z / Newly added (id desc proxy for recency).
  9. Search broadened to name + description + tool names + document-set
     names. Empty-result state with a real "Clear all filters" button.
  10. Compact "{n} tools" / "{n} sources" chips with hover-reveal of
      the full tool list. Flat Add/Remove buttons replace Tremor's
      color="green/red" which was visually shoutier than the action.
  11. Design tokens fixed — border-border / focus-ring-accent in place
      of hardcoded gray-300 / blue-500. Consistent with the rest of the
      app.

Skipped (per proposal):
  - #5 detail drawer / modal — revisit after observing how users use
    the new grid; bigger feature.
  - Bulk select — adding 5 assistants at once isn't a real use case
    here (bulk hide on Manage was).

The pre-existing addAssistantToList / removeAssistantFromList helpers
are kept and used at the call sites. The shared reorderAssistantList
helper added in the prior commit is reused for the undo paths.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Dev-only seed script that creates N varied personas in the local DB
for exercising the redesigned gallery / manage pages. Refuses to run
when POSTGRES_HOST looks like a managed/prod database (azure.com,
amazonaws.com, .cloud., "prod", etc.) — guard against pointing this
at the wrong env by accident.
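
A sketch of what such a guard might look like, assuming a simple substring match on the host name. The marker tuple contains only the examples named above (the real list is longer, per the "etc."), and the function names are illustrative.

```python
# Only the markers named in the commit message; the script's real list may differ.
PROD_HOST_MARKERS = ("azure.com", "amazonaws.com", ".cloud.", "prod")


def looks_like_managed_db(host: str) -> bool:
    """True when the host name contains any marker of a managed/prod database."""
    lowered = host.lower()
    return any(marker in lowered for marker in PROD_HOST_MARKERS)


def assert_safe_to_seed(host: str) -> None:
    """Abort before touching the DB if the host looks like anything but local dev."""
    if looks_like_managed_db(host):
        raise SystemExit(
            f"Refusing to seed: POSTGRES_HOST={host!r} looks like a managed/prod database"
        )
```

A substring match errs on the side of refusing: a local host that happens to contain "prod" is also rejected, which is the safer failure mode for a seeding script.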

Mix is designed to populate each section of the new gallery:
  ~30% "Yours"           — owned by target user, private
  ~20% "Shared with you" — owned by another user, target user in users[]
  ~50% "Featured"        — public, no specific owner

Per row randomly attaches 0–3 tools and 0–2 document sets so the {n}
tools / {n} sources chips render with variety. Half of "Yours" auto-
land in chosen_assistants (and all "Shared with you" do), so the
"Already added" vs "Available to add" filter chips have content on
both sides without manual setup.

60 distinct names + 30 description templates so 50 rows feel populated
and varied. Uses a fixed RNG seed by default (deterministic across runs).
Name prefix "[seed] " makes rows easy to spot and to wipe via --clear.

Usage:
  cd backend && source ../.venv/bin/activate
  python -m scripts.seed_assistants --email you@example.com
  python -m scripts.seed_assistants --clear     # wipe and re-seed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups on the assistants UX rework, both from user feedback.

Manage (/assistants/mine):
  * Toggle now accepts a `highlight` prop that draws a transient ring +
    slight scale-up on the switch. Used by hidden rows so a click
    anywhere on the (faded) row body flashes the toggle for ~1.2s,
    pointing the eye at the action that brings the assistant back.
    Doesn't auto-enable on body click — surprising a user mid-read into
    enabling would be a worse outcome than the discoverability gap.
  * Restructured opacity: only the content column (icon + name +
    description + chips) fades when a row is hidden. The action zone
    (checkbox, drag-slot, pin, toggle, share, edit) stays at full
    opacity so the toggle is the bright, clickable target on a dim row.
    Previously the parent opacity-50 cascaded to every child, making
    the toggle the dimmest thing on the dimmest row.
  * stopPropagation on the action zone so clicks on buttons inside it
    don't trigger the row-body flash handler.

Gallery (/assistants/gallery):
  * Removed all tool-related UI per user request — the page is for
    browsing assistants, and tool filter chips + per-card "{n} tools"
    pulled focus from the assistant itself. Gone: the auto-generated
    tool filter chip row, the per-card tools chip, the toolDisplayName
    / toolIcon helpers, the FiTool / FiImage / FiCheck imports, and
    the toolFilters state + commonTools memo. The search haystack is now
    name + description + document-set names (no tool names).
  * Dropped the absolute top-right "In your picker" badge. The muted
    card style (border + opacity-75) plus the "Remove" button in the
    footer already signal "added"; the badge ate horizontal space
    (pr-24 on the header reserved 96px) and crowded the title at
    narrower widths. Removed the pr-24 reservation now that nothing
    overlays the header.
  * Grid capped at `1 / 2 / 3` cols — 1 on mobile, 2 on most laptops
    and standard desktops, 3 only at `2xl` (≥1536px). Previously
    1/2/3/4 with the 4-col breakpoint making cards cramped and hard
    to read once descriptions hit their 3-line clamp.
  * Bumped card padding p-4 → p-5 and description line-height to
    leading-relaxed for breathing room.
  * Updated clearAllFilters / hasAnyFilter to drop the toolFilters
    references (now dead).

Verified: npx tsc --noEmit clean across web/ (0 errors), zero stray
references to the removed helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the gallery treatment from the previous commit. The user
reported tool execution isn't reliable yet, and surfacing "{n} tools"
on assistant rows misleads users into picking an assistant for a
capability that may not work in practice.

Dropped: the {n} tools Bubble in the row's chip block, the toolCount
derivation, and the FiTool import. The {n} sources chip stays — it's
about the assistant's knowledge scope, which works fine.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AssistantsGallery now accepts an optional `columns` prop (default 3,
supported 1-5). Responsive scaling below the widest breakpoint is
fixed per row of GRID_CLASSES — each row is a complete static
Tailwind class string so the purge step actually emits the classes
(dynamic `md:grid-cols-${n}` would silently disappear at build).

Unsupported values fall back to the default rather than rendering
broken — a bad prop here shouldn't break the page.

The single existing caller (page.tsx) doesn't pass columns, so it
gets the default 3 — same layout as before. To switch to 4 columns
on a wide-monitor deployment: `<AssistantsGallery columns={4} ... />`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A "{n} sources" chip told users how MUCH knowledge an assistant had
access to but not WHICH knowledge — defeating the point of the chip
for someone deciding "which assistant should I pick for this task?".

Both the Manage page row and the Gallery card now render one Bubble
per document-set name, capped at MAX_VISIBLE_DOC_SETS (3). When an
assistant points at more than that, a "+N more" pill collects the
overflow with the rest of the names exposed via the title tooltip,
so we don't blow the card width or row layout at narrower column
counts.

Each name chip caps at a max-width with truncate + a hover title,
so a single absurdly long document-set name can't push the actions
off the row.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a small "Columns 2 | 3 | 4" segmented control at the right end of
the filter row (next to Sort). The pick persists to localStorage
under "danswer:assistants-gallery:columns" so it survives reloads on
the same device.

State precedence:
  user choice (localStorage)   wins
  ↓
  prop `columns` from caller   (default for new users / new device)
  ↓
  DEFAULT_COLUMNS = 3          (final fallback)

The localStorage read happens in a useEffect so SSR + first paint use
the prop value, avoiding a hydration mismatch whenever the stored value
disagrees with the prop. localStorage writes are wrapped in try/catch
because some sandboxed contexts (private modes, restrictive iframes)
throw on access — the control still works for the session, just
doesn't persist there.

Picker is hidden below md (768px) because the layout falls back to
1 column at that width regardless of the chosen value. Exposed
options are 2/3/4 — 1 is mobile-only via responsive, 5 is too cramped
for typical screens (GRID_CLASSES still supports 5 if a deployment
wants to set it via prop).

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the segmented "2 | 3 | 4" button group to a single <select>
that mirrors the existing Sort dropdown for visual consistency on the
"view controls" cluster at the right end of the filter row.

Behavior unchanged: pure client-side state + localStorage persistence,
no fetch and no router.refresh() in the column path — the user's
column choice never triggers a backend call.

Verified: npx tsc --noEmit clean across web/ (0 errors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the monolithic supervisord background pod with separate
deployments for celery-worker, celery-beat, indexer-scheduler,
dask-scheduler, and dask-worker. The indexer-scheduler now reads
DASK_SCHEDULER_ADDRESS to dispatch run_indexing_entrypoint to a
remote Dask cluster instead of an in-process LocalCluster, so
indexing throughput scales horizontally with dask-worker replicas
instead of being capped by one pod's RAM.

Local dev keeps the LocalCluster path (no env var); a new
scripts/dev_run_dask_distributed.py and docker-compose overlay
reproduce the prod-shape topology without K8s.
scripts/test_dask_distributed_e2e.py exercises the topology
(parallelism, worker death, scheduler death) end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single migration doc covering all three slices on this branch:
  1. Background indexing scaling (Dask topology)
  2. Redis caching + rate limiting
  3. Assistants UX rework

Organized for an operator: TL;DR up top ("everything default OFF"),
new deps/env/secrets summarized, deployment order, verification
checklist BEFORE flipping any flags, per-feature enable steps, and
the known footguns (k8s manifests missing REDIS_PASSWORD env wire-up
in the bg-scaling path, seed script bypassing persona cache, CLAUDE.md
update.py gate). Plus the recommended manual test list and the
branch's commit map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…in AGENTS.md

The bg-scaling commit (03d1649) added 5 new k8s manifests under
`deployment/kubernetes/` that split the combined background pod into
beat / celery / indexer-scheduler / dask-scheduler / dask-worker.
But Darwin doesn't apply from `deployment/kubernetes/` — its prod
manifests live under `darwin-kubernetes/`, and the two trees aren't
kept in sync.

Porting all five into `darwin-kubernetes/` with Darwin conventions:
  - Image registry sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend
  - configMapRef env-configmap, secretKeyRef danswer-secrets
  - POSTGRES_USER / POSTGRES_PASSWORD wired everywhere that talks to PG
  - REDIS_PASSWORD wired as optional secretKeyRef (the latent footgun
    flagged in MIGRATION.md §10a is now closed for the Darwin path)
  - indexcpu nodeAffinity + darwin/indexing toleration on every
    indexing-side pod (celery, indexer-scheduler, dask-scheduler,
    dask-worker); beat stays on the default pool (lightweight)
  - dynamic-pvc + file-connector-pvc volume mounts where any task may
    stage files

The existing `darwin-kubernetes/background-deployment.yaml` (combined
beat+celery+indexer via supervisord) is intentionally LEFT IN PLACE —
the split is an opt-in rollout, not a forced cutover. To switch:
apply the new five, verify the new pods are healthy, scale the
combined deployment to 0.

Also lock the convention in AGENTS.md so this doesn't recur:
  - New divergence-table row noting darwin-kubernetes/ is source of
    truth for prod.
  - New "Critical facts that bite" §9 documenting the two-tree split,
    when to touch which, and the per-pod adaptation checklist (image
    registry, configmap, secrets, REDIS_PASSWORD, affinity, PVCs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§5b — Dask topology section now points at the actual ported
darwin-kubernetes/*.yaml manifests with a concrete cutover script,
not just "you'll need to port these later" boilerplate.

§10a — Footgun is RESOLVED for the Darwin path (the 5 new Darwin
manifests all wire REDIS_PASSWORD via optional secretKeyRef).
Marks the entry as such rather than removing it, so the history of
"why was this previously a concern" stays readable.

§12 — Commit count, file count, and totals updated for the two new
commits (MIGRATION.md itself + the darwin-kubernetes port).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rajiv/add-claude was merged to feature/darwin upstream, so the doc's
"on top of rajiv/add-claude (PR #45)" reference is stale. The branch
is now rebased onto origin/feature/darwin directly — same diff, just
a fresher base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single source of truth for both production (Darwin AKS) and local dev,
with image tags + env values + secrets externalised so a deploy is
"edit one file, kubectl apply -k". No Helm. Replaces the flat
darwin-kubernetes/ tree (which the operator will delete once they've
verified the new structure against the live cluster).

Layout:
  k8s/base/                  Cleaned env-neutral manifests (one file per
                             logical service; deployment+service merged
                             where natural). Image refs use logical names
                             (e.g. `danswer-backend`) which overlays
                             rewrite to concrete registry+tag.
  k8s/overlays/prod/         Darwin AKS production:
                               kustomization.yaml  → images, replicas
                               env.properties      → non-secret config
                               secrets.env.example → template (committed)
                               secrets.env         → real values (gitignored)
  k8s/overlays/local/        Same shape, local-dev defaults
                             (host.docker.internal, latest tags,
                              AUTH_TYPE=disabled, smaller replicas).
  k8s/optional/              Opt-in deployments not part of base:
                               redis.yaml
                               background-{beat,celery,indexer-scheduler}.yaml
                               dask-{scheduler,worker}.yaml
                             Apply with `kubectl apply -f <file>` when
                             rolling out the corresponding feature.
  k8s/README.md              Layout explanation + common workflows
                             (image bump, env change, secret rotation,
                             Redis rollout, migrating off darwin-kubernetes/).

Built from the live-cluster dump in darwin-kubernetes/temp/ (gitignored,
never committed). The cleaner script (intentionally not committed)
strips status, uid, resourceVersion, generation, creationTimestamp,
managedFields, last-applied-configuration annotation, restartedAt,
progressDeadlineSeconds, revisionHistoryLimit, and the auto-assigned
clusterIP/ipFamilies/sessionAffinity on Services. Image references in
base/ are normalised to logical names so kustomize can rewrite them.

SECURITY: the live env-configmap was discovered to contain real plaintext
secrets — Slack tokens, GEN_AI client secret, Jira token, Opsgenie key.
The new structure moves all of those to k8s/overlays/*/secrets.env
(gitignored) which renders into a kustomize-generated Secret. api-server
and background deployments gain `envFrom: secretRef: danswer-secrets` so
the moved values continue to reach the app as env vars. Rotation of the
leaked credentials is a separate operator task — every "REPLACE_ME" in
secrets.env.example marked LEAKED is one of them.

Validation:
  kubectl kustomize k8s/overlays/prod   → 26 resources, clean render
  kubectl kustomize k8s/overlays/local  → 26 resources, clean render
  Image substitution verified in both.

.gitignore additions:
  darwin-kubernetes/temp/          Live cluster dumps
  k8s/overlays/*/secrets.env       Real secret values per environment
  k8s/overlays/*/*.secrets.env     Defensive (any *.secrets.env variant)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In-cluster Redis is now an opt-in kustomize Component at
k8s/optional/redis/, included by the local overlay via
`components: [../../optional/redis]` and NOT by prod (which uses
managed Redis instead).

Why Component instead of `resources: ../../optional/redis.yaml`:
kustomize's load restrictor rejects file-resource refs that escape
the overlay's directory tree. Components are explicitly designed for
opt-in cross-tree refs and pass the security check; they also let us
add patches later that only apply when the component is opted in.

Layout change:
  before:  k8s/optional/redis.yaml
  after:   k8s/optional/redis/
             kustomization.yaml    (kind: Component)
             redis.yaml

The plain `kubectl apply -f k8s/optional/redis/redis.yaml` or
`kubectl apply -k k8s/optional/redis` workflows still work — the file
just moved one level deeper.

env.properties updates:
  local:  REDIS_HOST=redis  (the in-cluster Service name, matching the
                              component's deployment)
  prod:   REDIS_HOST=<your-managed-redis>.redis.cache.windows.net
          (placeholder for Azure Cache for Redis; rename + drop the
           access key into secrets.env as `redis_password` when you
           adopt managed Redis)

Validated:
  kubectl kustomize k8s/overlays/prod   → 26 resources (no Redis)
  kubectl kustomize k8s/overlays/local  → 28 resources (+Service +StatefulSet)

README updated with the components pattern and how to add more opt-in
features the same way (split-background, Dask, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… to 512MB

Reversal of the earlier "prod will use managed Redis" decision. Prod
overlay now opts into the same in-cluster Redis component as local:

  k8s/overlays/prod/kustomization.yaml — adds:
    components:
      - ../../optional/redis
  k8s/overlays/prod/env.properties     — REDIS_HOST back to `redis`
                                         (the in-cluster Service name)

Redis StatefulSet bumped from 256MB to 512MB:
  --maxmemory               256mb  →  512mb
  resources.requests.memory 128Mi  →  256Mi
  resources.limits.memory   384Mi  →  1Gi

Limit set to ~2x maxmemory rather than 1.5x because the single-replica
StatefulSet has no failover — OOM = cache outage. Redis uses extra RSS
beyond --maxmemory for client output buffers, COW pages during BGSAVE
(if we ever turn on persistence), and fragmentation; safer to over-
provision the cgroup limit and let `maxmemory-policy: allkeys-lru` do
its job inside Redis's own accounting.

Validated:
  kubectl kustomize k8s/overlays/prod   → 28 resources (now includes Redis)
  kubectl kustomize k8s/overlays/local  → 28 resources (unchanged)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both prod and local overlays opted into the in-cluster Redis component,
so it's no longer optional — promoted to base/redis.yaml and added to
base/kustomization.yaml. Removed the now-redundant `components:` blocks
from both overlays and the optional/redis/ component dir.

Net effect is identical (prod + local still render 28 resources each,
both including Redis) — just less indirection now that Redis is
universal rather than opt-in.

README updated: optional/ table drops the redis row with a note that it
moved to base; the components: "flag" explanation now points at the
split-background deployments as the example opt-in.

Validated:
  kubectl kustomize k8s/overlays/prod   → 28 resources (redis in base)
  kubectl kustomize k8s/overlays/local  → 28 resources (redis in base)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nces

The README is the doc for the k8s/ layout as it stands, not a record of
how it came to be. Removed:
  - "Replaces the older darwin-kubernetes/ tree" subtitle
  - the whole "Migration plan (deleting darwin-kubernetes/)" section
  - the "darwin-kubernetes/ is being retired" + temp/ convention bullets

Also fixed two bits left stale by moving Redis into base:
  - structure diagram listed Redis under optional/ → now correctly
    omits it (it's in base)
  - "Roll out Redis" workflow told you to `kubectl apply -f
    k8s/optional/redis.yaml` → rewritten as "Redis ships in base; flip
    the env flags to enable the cache/rate-limiter features"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
optional/ holds plain manifests (no kustomization), so they need
`kubectl apply -f` and aren't picked up by `apply -k overlays/...`.
Added a workflow covering:
  - single-file and whole-folder apply
  - the dependency on the overlay being applied first (optional pods
    reference the overlay-generated env-configmap / danswer-secrets)
  - the full split-background + Dask cutover in dependency order
    (scheduler/workers → split bg pods → scale down combined), plus
    rollback and the dual-beat warning

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ical images)

The optional/ manifests hardcoded the image tag
(sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:vha-138) while base
uses the logical name `danswer-backend` that the overlay's images: block
rewrites. That inconsistency meant a tag bump had to be made in two
places and the optional pods could drift from the rest of the cluster.

Fix: grouped the five split-background + Dask manifests into a single
kustomize Component at k8s/optional/background-scaling/, and changed
their image refs to the logical `danswer-backend`. When an overlay opts
in via `components: [../../optional/background-scaling]`, the overlay's
existing `images:` entry for danswer-backend parameterizes them — same
tag as api-server / background, set in one place.

Verified: temporarily opting the component into the prod overlay renders
all five bg-scaling pods with sfbrdevhelmweacr.azurecr.io/danswer/
danswer-backend:vha-138 (34 resources total), then reverted. Neither
overlay opts in by default (prod/local still 28 resources each).

Layout change:
  before:  k8s/optional/{background-beat,background-celery,
           background-indexer-scheduler,dask-scheduler,dask-worker}.yaml
           (plain manifests, hardcoded image, applied via kubectl apply -f)
  after:   k8s/optional/background-scaling/
             kustomization.yaml   (kind: Component)
             <same five manifests, logical image name>
           (opted into via the overlay's components: block)

README updated: optional/ is now described as opt-in components with
logical-image parameterization; the apply workflow switched from
`kubectl apply -f` to the components: + replicas:0 overlay edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…env-neutral)

Follow-up to the component conversion — three remaining inconsistencies
vs base that were pointed out:

1. Replicas were hardcoded per-manifest. Removed them from the manifests
   and moved the counts into the component's kustomization.yaml
   `replicas:` block (one place; mirrors how the overlay parameterizes
   base replicas). dask-worker=3 is the indexing-throughput knob.

2. Secret/config loading differed: the component had an extra explicit
   REDIS_PASSWORD secretKeyRef that base doesn't. Dropped it so every
   pod's env block is byte-identical to base/background.yaml — explicit
   POSTGRES_USER/POSTGRES_PASSWORD via secretKeyRef + envFrom
   [configMapRef env-configmap, secretRef danswer-secrets]. (redis_password
   still reaches the app via the envFrom secretRef like every other
   secrets.env key; the explicit entry was redundant and base-divergent.)

3. Manifests carried Darwin-specific node affinity + darwin/indexing
   tolerations, which base does NOT (base is env-neutral; the live
   cluster runs without pool affinity). Stripped them so the component
   is environment-neutral and won't fail to schedule on a local cluster
   that lacks the indexcpu pool. The prod overlay re-adds indexcpu
   affinity + toleration via a patch when it opts in — documented in the
   README opt-in steps with a ready-to-paste patch block.

Verified end-to-end: opting the component into prod renders 34 resources,
all five bg-scaling pods get sfbrdevhelmweacr.azurecr.io/danswer/
danswer-backend:vha-138, replicas come from the component kustomization
(beat=1, celery=2, worker=3), background-deployment scaled to 0. Default
(not opted in): prod/local both 28 resources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nd-scaling

The apply command was present but buried under the overlay-edit YAML
block and read as the generic overlay apply. Made the deploy commands
explicit and labeled: preview the rendered bg/dask pods, kubectl diff
vs live, apply, and rollout-status watches. Also stated plainly why
there's no standalone `kubectl apply -f` for the component (logical
image name only resolves through the overlay's images: block).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dask-worker 3 → 2, background-celery 2 → 1. The originals were inherited
defaults from the cherry-picked feature/backgroundscaling commit, not
sized to Darwin's load.

- dask-worker=2: each pod runs one connector at a time (--nworkers=1
  --nthreads=1), so this caps concurrent indexing at 2. Enough unless
  many connectors backlog in NOT_STARTED; raise it then. Cuts the
  worst-case indexcpu footprint by a third (was 3×4Gi, now 2×4Gi).
- background-celery=1: Celery here only runs maintenance tasks (prune,
  sync, deletion, analytics rollup) — NOT indexing. One pod already
  autoscales 3-10 threads (--autoscale=3,10), which easily covers the
  bursty maintenance queue at this scale. The 2nd replica was redundancy
  we don't need.

Added inline comments noting which counts are singletons that must stay
at 1 (beat, indexer-scheduler, dask-scheduler) vs the throughput knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rajivml and others added 22 commits June 3, 2026 14:12
Splitting the 10Gi monolithic background pod under-sized two containers,
which OOMKilled (exit 137) in prod:

- celery-beat: 256Mi limit → OOM on app import (celery beat still imports
  the full langchain/llama-index/tokenizer stack). Now 512Mi req / 2Gi limit.
- indexer-scheduler: the update.py loop spikes from ~300Mi to ~5Gi per
  cycle (warms embedding model + Dask client + connector/index state).
  OOMKilled at 1Gi, 2Gi AND 4Gi. Now 4Gi req / 8Gi limit — verified stable
  in prod across multiple update cycles with ~3Gi headroom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove vespa/ from base so `kubectl apply -k overlays/{prod,local}` no
longer touches the Vespa StatefulSets (a drifted manifest could roll them
— the kind of blast radius behind the prior outage). Vespa now has its own
overlays carrying the pinned images + namespace; apply it deliberately.

Also fix guarded-apply.sh: with Vespa gone from the app overlay, its
`has_vespa=$(... | grep ...)` tripped `set -euo pipefail` on no-match and
aborted before applying. Guarded `|| true`; the guard now auto-skips for
overlays without Vespa and runs for the *-vespa overlays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…istory

Root-caused the indexer-scheduler's ~7.5Gi/cycle memory spikes (which we'd
been masking by raising the pod limit to 8/12Gi). py-spy on the live pod
showed every spike in get_last_attempt -> SQLAlchemy fetchall.

get_last_attempt ran `select(IndexAttempt).where(cc-pair).order_by(time DESC)`
then `.scalars().first()`. Result.first() does NOT add LIMIT to the SQL, so
psycopg2 buffered the cc-pair's ENTIRE attempt history and the ORM
materialized every row before discarding all but one. create_indexing_jobs
calls this once per cc-pair every loop; with 518k index_attempt rows (12.7k
for the busiest cc-pair) the per-cycle session climbed to 7.5Gi, then freed
when it closed.

Verified live: the worst cc-pair's query allocates +308MB without LIMIT vs
+0MB with LIMIT 1.

Fix is LIMIT 1 in the query (db helper only — NOT the Dask submission path).
Steady-state scheduler memory drops to ~700Mi. After the fixed image is
deployed, the indexer-scheduler limit can drop from 12Gi back to ~2Gi (temp
12Gi note left in the manifest until then). Also removes the SYS_PTRACE
diagnostic capability added for profiling.
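
The difference between "fetch everything, keep the first row" and LIMIT 1 can be illustrated at the driver level. The sketch below uses the stdlib sqlite3 module as a stand-in for psycopg2/SQLAlchemy, with a simplified, hypothetical index_attempt schema; it is not the project's get_last_attempt. Both helpers return the same row, but the unbounded one transfers the whole history first.

```python
import sqlite3

BASE_QUERY = (
    "SELECT id, time_created FROM index_attempt "
    "WHERE cc_pair_id = ? ORDER BY time_created DESC"
)


def make_db(n_attempts: int) -> sqlite3.Connection:
    """Build an in-memory table with n_attempts rows for cc-pair 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE index_attempt ("
        "id INTEGER PRIMARY KEY, cc_pair_id INTEGER, time_created INTEGER)"
    )
    conn.executemany(
        "INSERT INTO index_attempt (cc_pair_id, time_created) VALUES (?, ?)",
        [(1, t) for t in range(n_attempts)],
    )
    return conn


def last_attempt_unbounded(conn: sqlite3.Connection, cc_pair_id: int):
    """Mimics the bug: every matching row is fetched, then all but one dropped."""
    rows = conn.execute(BASE_QUERY, (cc_pair_id,)).fetchall()
    return (rows[0] if rows else None), len(rows)


def last_attempt_limited(conn: sqlite3.Connection, cc_pair_id: int):
    """Mimics the fix: the database returns at most one row."""
    rows = conn.execute(BASE_QUERY + " LIMIT 1", (cc_pair_id,)).fetchall()
    return (rows[0] if rows else None), len(rows)
```

Each helper also returns the number of rows fetched, which is the quantity that drove the memory spike.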

Follow-up (separate): index_attempt has 518k rows — a retention/prune pass
would speed these queries further.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The index_attempt table had grown to ~518k rows; the retention policy for it
already exists in db/retention.py but was disabled by default
(RETENTION_DAYS_INDEX_ATTEMPT=0). Enable it at 30d in both overlays — the
daily 08:00 UTC sweep now prunes terminal attempts older than 30d while always
keeping the last 20 per (connector, credential, embedding model). Dry-run on
prod: ~499k of 518k rows eligible. Pairs with the get_last_attempt LIMIT-1 fix.
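
The retention rule described above (terminal, older than the window, outside the most recent N per group) can be sketched in plain Python. This is an illustration only, not the code in db/retention.py: the dict keys (`id`, `group`, `time`, `terminal`) are hypothetical, and it assumes "last 20" counts all attempts in the group rather than only terminal ones.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def prunable_attempt_ids(attempts, now: datetime,
                         retention_days: int = 30, keep_last: int = 20):
    """Return ids of attempts that the sweep would be allowed to delete.

    `attempts` is an iterable of dicts with hypothetical keys:
    id, group (connector/credential/model tuple), time, terminal.
    """
    cutoff = now - timedelta(days=retention_days)
    by_group = defaultdict(list)
    for attempt in attempts:
        by_group[attempt["group"]].append(attempt)

    prunable = []
    for rows in by_group.values():
        rows.sort(key=lambda a: a["time"], reverse=True)  # newest first
        # The newest keep_last rows are always kept, whatever their age.
        for attempt in rows[keep_last:]:
            if attempt["terminal"] and attempt["time"] < cutoff:
                prunable.append(attempt["id"])
    return prunable
```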

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Workers that booted before the dask-scheduler accepted connections failed to
register their Nanny and exited 1 (CrashLoopBackOff until ordering worked out;
seen as 2-3 restarts per worker at every rollout). Wrap the worker command in
a plain TCP retry loop against dask-scheduler-service:8786, then exec the
worker.

Portable by design: it's a bare socket connect with no mesh/platform
dependency, so it works the same with or without istio. Kept in the main
container rather than an initContainer precisely so it stays portable — under
istio an initContainer runs before the sidecar and its mesh traffic is
blackholed until envoy is up. Verified live: both workers now roll with 0
restarts.
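
The wait-then-exec idea is a bare socket connect in a retry loop. A minimal Python sketch follows; the commit does not say which language the manifest's wrapper is written in, and the function name and default attempt/delay values here are assumptions.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, attempts: int = 60,
                 delay: float = 2.0, timeout: float = 2.0) -> bool:
    """Return True once a TCP connect to host:port succeeds, else False.

    Uses only a plain socket connect, so it has no mesh or platform dependency.
    """
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Refused, unreachable, DNS not ready, or timed out: wait and retry.
            time.sleep(delay)
    return False
```

In the worker container this would gate on `wait_for_tcp("dask-scheduler-service", 8786)` and only then replace the process with the worker command.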

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With the get_last_attempt LIMIT-1 fix deployed (image vha-140), the indexer-
scheduler no longer spikes — RSS sits flat at ~430Mi across update cycles
(verified live). Drop from the temporary 4Gi req / 12Gi limit back to
512Mi req / 2Gi limit. This is the hardware reclaim the limit bump was
masking: ~8x request, 6x limit reduction on this pod.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Roll out the images carrying the get_last_attempt LIMIT-1 indexing-memory fix
(backend) and the latest web build. Deployed + verified live: indexer-scheduler
RSS flat at ~430Mi, no spike.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ior change)

Follow-up to the get_last_attempt LIMIT-1 fix — swept the codebase for the
same class of "load more rows/columns than needed into memory" bugs. Each
change here is behavior-preserving (same result set; only fewer columns
fetched, or an additive optional bound):

- prune task (celery_app): was materializing a connector's ENTIRE document
  corpus as full DbDocument ORM rows just to collect .id. New
  get_document_ids_for_connector_credential_pair selects only the id column
  (same WHERE + DISTINCT, identical id set). Biggest win — large connectors.
- document-set sync (fetch_documents_for_document_set_paginated): selected
  full Document rows per batch but the only caller uses just .id and the
  keyset cursor is the last id. Now selects Document.id; caller updated.
- get_index_attempts_for_cc_pair: added optional `limit` (default None =
  unchanged). The "any unfinished attempt?" existence check now passes
  limit=1 instead of materializing the cc-pair's full attempt history
  (rows carry large error_msg / full_exception_trace Text columns).
- get_current/secondary_db_embedding_model: added .limit(1) — Result.first()
  doesn't add LIMIT (tiny table, but the same smell as the original bug).
- delete_orphaned_search_docs: replaced fetch-all-ORM-rows-then-loop-delete
  with a single bulk DELETE over the same orphan set (matches retention.py).
- get_tags_by_value_prefix_for_source_types + /valid-tags: added optional
  `limit` (default None = unchanged) so the unbounded all-tags path can be
  bounded by a caller.
- db/connector.py: comment-only — flag that the per-connector legacy
  Query.first() over index_attempt is safe ONLY because Query.first() emits
  LIMIT 1; a migration to execute().scalars().first() must add .limit(1).

NOT changed (deliberately): the cc-pair detail endpoint still returns the
full attempt list — its frontend paginates client-side off that list, so a
server cap would regress the UI. Proper fix = server-side pagination (FE+BE),
tracked separately. Retention (RETENTION_DAYS_INDEX_ATTEMPT) already bounds
that table in practice. connector_deletion's full-row load is left as-is
(already batch-bounded, not a bloat bug).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… BE)

The cc-pair detail endpoint embedded the cc-pair's ENTIRE index_attempt
history (a busy connector had 3,255 rows, each carrying error_msg +
full_exception_trace Text) in CCPairFullInfo on every page view, and the
frontend paginated it client-side. That's a request-path memory/latency risk.

BE:
- CCPairFullInfo drops the full `index_attempts` list; carries only
  `latest_index_attempt` (LIMIT 1) + `num_index_attempts` (count) — all the
  detail page actually needs (latest status + "is this the only attempt").
- New endpoint GET /admin/cc-pair/{id}/index-attempts?page=&page_size=
  returns one page (LIMIT/OFFSET, newest first) + total_pages/total_count.
  page_size clamped to [1,100].
- db helpers: count_index_attempts_for_cc_pair +
  get_paginated_index_attempts_for_cc_pair (shared base stmt so count and
  page always agree on scope; only_current=PRESENT model, matching prior).
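
The page-size clamp and LIMIT/OFFSET arithmetic behind the new endpoint can be sketched as below. The helper name is hypothetical, and the sketch assumes `page` is 1-indexed, which the commit message does not state.

```python
import math


def page_bounds(page: int, page_size: int, total_count: int,
                max_page_size: int = 100) -> dict:
    """Compute LIMIT/OFFSET and total_pages for one page of attempts."""
    size = max(1, min(page_size, max_page_size))  # clamp to [1, 100]
    total_pages = math.ceil(total_count / size)
    offset = (max(page, 1) - 1) * size            # assumes 1-indexed pages
    return {
        "limit": size,
        "offset": offset,
        "total_pages": total_pages,
        "total_count": total_count,
    }
```

For the 3,255-attempt connector mentioned above, a request for `page_size=500` is clamped to 100 rows per page, giving 33 pages.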

FE:
- IndexingAttemptsTable fetches the paginated endpoint via SWR keyed on the
  page, renders the server page, uses server total_pages for the selector,
  and finds the trace-popup attempt within the current page. Priority bump
  now revalidates the current page + the detail.
- page.tsx uses latest_index_attempt / num_index_attempts instead of
  index_attempts[0] / .length.

Verified: backend py_compile clean; full FE `tsc --noEmit` passes (0 errors).
CCPairFullInfo has no other consumers. Ships with the next FE+BE image build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Deploys the SQL over-materialization audit fixes (get_document_ids prune
path, get_index_attempts_for_cc_pair limit, embedding_model LIMIT 1, bulk
orphan delete, tags limit knob). Verified live on vha-141.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rallel chunk deletes

Root cause of the slow SF-Account connector (diagnosed live): it isn't the
source pull or embedding — it's the Vespa clear-and-rewrite. The connector
filters by LastModifiedDate, but a Salesforce automation bumps that timestamp
on ~all accounts, so a single poll pulled 209,759 of ~216k records, each
forcing a full per-doc Vespa delete+rewrite (~2.7s/doc → ~6.5 days, never
finishes, so last_successful never advances and it re-pulls forever).

Fix 1 (the big one) — content-hash skip, in the pipeline so ALL connectors
benefit, default-on, backward-compatible:
- New nullable document.indexed_content_hash (sha256 of the INDEXED content:
  sections/title/semantic_id/metadata/owners — NOT doc_updated_at). Migration
  e5f6a7b8c9d0.
- Document.get_content_hash() computes it (connectors/models.py).
- get_doc_ids_to_update() skips a doc when the stored hash matches the current
  content hash, regardless of how doc_updated_at moved. Rows with no stored
  hash fall back to the original updated_at behavior (NULL → unchanged
  semantics), so existing data behaves exactly as before; hashes populate
  lazily on each doc's next successful index. The "re-index from beginning"
  path (ignore_time_skip) still bypasses both skips for a forced full reindex.
- The hash is written only AFTER a confirmed Vespa write (alongside
  doc_updated_at in update_docs_updated_at), so it always reflects what's
  actually in the index.
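
The skip decision above can be sketched as follows. This is an illustration under assumptions, not Document.get_content_hash() or get_doc_ids_to_update(): documents are plain dicts, the serialization (sorted-key JSON) is a guess, and the "original updated_at behavior" is assumed to mean "re-index only when the timestamp moved forward".

```python
import hashlib
import json

HASHED_FIELDS = ("sections", "title", "semantic_id", "metadata", "owners")


def content_hash(doc: dict) -> str:
    """sha256 over the indexed fields only; doc_updated_at is excluded."""
    payload = {field: doc.get(field) for field in HASHED_FIELDS}
    blob = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def needs_reindex(doc: dict, stored_hash, stored_updated_at,
                  ignore_time_skip: bool = False) -> bool:
    """Decide whether a fetched doc must be embedded and rewritten."""
    if ignore_time_skip:
        return True  # forced full reindex bypasses both skips
    if stored_hash is not None:
        # Hash present: content decides, however doc_updated_at moved.
        return stored_hash != content_hash(doc)
    # No stored hash yet (pre-migration row): fall back to the timestamp check.
    new_ts = doc.get("doc_updated_at")
    if stored_updated_at is None or new_ts is None:
        return True
    return new_ts > stored_updated_at
```

A record whose LastModifiedDate is bumped without any content change hashes to the same value, so it is skipped once its hash has been stored.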

Fix 2 — cheaper re-index for docs that DO change: _delete_vespa_doc_chunks
deletes a document's chunks concurrently (bounded local pool) instead of one
sequential HTTP round-trip at a time. @retry semantics preserved.
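
A bounded local pool for the per-chunk deletes might look like the sketch below. The names are hypothetical and `delete_one` stands in for the real per-chunk HTTP delete; the pool size of 8 is an assumption, not the value used in _delete_vespa_doc_chunks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def delete_chunks_concurrently(chunk_ids: Iterable[str],
                               delete_one: Callable[[str], None],
                               max_workers: int = 8) -> None:
    """Run delete_one for every chunk id using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any exception from a delete is re-raised
        # here instead of being silently dropped.
        list(pool.map(delete_one, chunk_ids))
```

Because each call still goes through `delete_one`, any retry decorator on that function keeps applying per chunk.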

Adds a unit test for get_doc_ids_to_update (hash skip + updated_at fallback).
backend py_compile clean. Requires `alembic upgrade head` + bounce of
api-server/background. First post-deploy poll still does one full pass to
populate hashes; subsequent polls skip unchanged docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surfaces how many docs in each batch skip the embed + Vespa clear-and-rewrite
because their content hash / timestamp is unchanged. Logged at INFO only when
>0, in the pipeline so it covers all connectors. Lets us confirm in prod logs
(post-deploy) that Salesforce stops re-indexing its LastModifiedDate-churned
records.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… pool)

background-lite's celery-worker ran `--pool=threads --autoscale=3,10`, which
CrashLoopBackOff'd in prod: autoscale calls pool.grow()/shrink(), which the
threads TaskPool doesn't implement, so the first task burst killed the worker
with `AttributeError: 'TaskPool' object has no attribute 'grow'` (Unrecoverable
error), the unacked task was redelivered, and it crashed again. Autoscale is
prefork-only.

Replaced with a fixed `--concurrency=10` (these maintenance tasks — prune /
doc-set sync / deletion / analytics / retention sweep — are I/O-bound, so a
fixed thread pool is the right fit). Verified live: worker now drains a burst
of 15+ queued tasks with 0 restarts; pod 4/4 Running. Manifest-only, applied to
prod (no image rebuild).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prod data: WEB connectors were ~89% FAILED (78/88 over 30d), almost all
"Stopped mid run, likely due to the background process being killed" — slow,
unbounded crawls that ran long enough to get killed, then retried and failed
again. Aligned our connector with upstream's approach:

- Drop the PER-PAGE check_internet_connection (was a full extra GET for every
  URL, doubling network work, and 403'ing on bot-protected sites Playwright
  loads fine — with failures tearing down the whole browser). Now one check on
  the base URL.
- page.goto(timeout=30s, wait_until="domcontentloaded") instead of the default
  unbounded "load" (which waits for every image/font). Big per-page speedup,
  bounded stalls.
- Per-page retries (WEB_CONNECTOR_MAX_RETRIES=3, exp backoff + jitter) instead
  of skip-on-first-error. A single page error retries with a fresh page on the
  SAME browser; only an actual browser crash (_is_browser_dead) restarts
  Playwright — no more whole-browser teardown per flaky page.
- Set a real User-Agent on the Playwright context (avoid WAF/403 blocks).
- WEB_CONNECTOR_MAX_PAGES cap (default 5000) bounds recursive crawls so they
  finish before the freeze-timeout that was killing them.
- Timeout on the PDF requests.get (was unbounded → could hang the attempt).
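
The per-page retry with exponential backoff and jitter can be sketched generically. This is not the connector's code: `with_retries` is a hypothetical helper, it treats the configured value as the total number of attempts (an assumption), and `fn` stands in for "open a fresh page and load the URL".

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or max_attempts is reached.

    The delay doubles after each failure and gets up to base_delay of jitter.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the delays can be observed or skipped in tests.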

All self-contained to the web connector; 4 modes / redirects / recursion /
batching / error semantics preserved. New knobs in app_configs. Ships with the
next backend image build (connector runs on dask-workers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
poll_source appended `AND updated >= ... AND updated <= ...` to the end of the
user's jira_filter. If the filter ended in `ORDER BY ...` (JQL requires WHERE
conditions before ORDER BY), the result was invalid JQL and Jira returned
HTTP 400 "Expecting ',' but got 'AND'" on every poll — connector 386 (SRE)
failed daily for weeks. New _add_time_window_to_jql() splits on a trailing
ORDER BY and injects the window in front of it; appends normally otherwise.
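
The split-and-inject step can be sketched as below. This is an illustration, not the committed _add_time_window_to_jql(): the signature and date format are assumptions, the existing conditions are wrapped in parentheses as a choice of this sketch (so an OR in the filter cannot capture the window), and an `ORDER BY` appearing inside a quoted string is not handled.

```python
import re

_ORDER_BY = re.compile(r"\bORDER\s+BY\b", re.IGNORECASE)


def add_time_window_to_jql(jql: str, start: str, end: str) -> str:
    """Insert an `updated` window before any ORDER BY clause in a JQL filter."""
    window = f"updated >= '{start}' AND updated <= '{end}'"
    match = _ORDER_BY.search(jql)
    where = (jql[: match.start()] if match else jql).strip()
    order = jql[match.start():].strip() if match else ""
    combined = f"({where}) AND {window}" if where else window
    return f"{combined} {order}".strip()
```

For a filter such as `project = SRE ORDER BY created DESC`, the window lands between the condition and the `ORDER BY`, which is the ordering JQL requires.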

Verified live: the previously-400'ing connector now returns SUCCESS. (Separate
issue surfaced once the 400 cleared: that credential's API token is
expired -> /myself 401 -> anonymous -> 0 issues; needs a token refresh, not a
code change.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Imported the worthwhile, fork-compatible improvements from upstream's Jira
connector (skipping checkpoint/hierarchy/EE-perm-sync, which don't apply here):

- Per-issue error tolerance: each issue is processed in its own try/except;
  a single malformed ticket is logged and skipped instead of aborting the
  whole attempt (previously one bad issue failed the entire sync).
- Implement IdConnector.retrieve_all_source_ids(): lists doc ids
  (<base>/browse/<KEY>) fetching ONLY the `key` field, so the prune task can
  detect deleted issues cheaply rather than loading every full document.
- Richer metadata: issuetype, reporter, project (in addition to priority /
  status / resolution / labels); reporter also added to primary_owners.
- Fix load_from_state(): it referenced self.quoted_jira_project, which is
  never set -> AttributeError on any call (the prune fallback hit this). Now
  uses the configured jira_filter (full, unbounded load).

Pairs with the earlier ORDER-BY JQL fix. Connector 386 (SRE) verified live:
SUCCESS, 50 issues indexed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rolls out the indexing content-hash skip, web-connector resilience, and Jira
connector fixes (ORDER-BY JQL, per-issue tolerance, ID-based pruning, richer
metadata).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed files

Formatting-only; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…prettier)

Formatting-only; ruff clean. Makes the quality-checks CI job pass — earlier
branch commits had not been pre-commit-formatted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The prune task instantiates connectors with InputType.PRUNE. Single-class
sources return their class for any input_type, but SLACK is dict-mapped
(LOAD_STATE/POLL only), so prune failed every run with "Connector not found
for source=SLACK" -> deleted Slack messages were never pruned from the index.

Map PRUNE to SlackPollConnector (the API connector, same config as the live
POLL connectors). NOT SlackLoadConnector, which requires an export_path_str
(reads a Slack export file) and would TypeError on an API connector's config.
extract_ids_from_runnable_connector then enumerates message ids via
poll_source(epoch, now). Verified SlackPollConnector constructs with a live
Slack connector's config.
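
The lookup failure and its fix can be sketched with stand-in classes. Everything below is hypothetical scaffolding (the map, the function name, and the empty classes are not the project's factory code); it only shows why a dict-mapped source needs an explicit PRUNE entry while a single-class source does not.

```python
from enum import Enum


class InputType(Enum):
    LOAD_STATE = "load_state"
    POLL = "poll"
    PRUNE = "prune"


class SlackLoadConnector: ...   # stand-in: reads a Slack export file
class SlackPollConnector: ...   # stand-in: talks to the Slack API
class WebConnector: ...         # stand-in: a single-class source


CONNECTOR_MAP = {
    "slack": {
        InputType.LOAD_STATE: SlackLoadConnector,
        InputType.POLL: SlackPollConnector,
        InputType.PRUNE: SlackPollConnector,  # the added mapping
    },
    "web": WebConnector,
}


def identify_connector_class(source: str, input_type: InputType) -> type:
    """Resolve the connector class for a source and input type."""
    entry = CONNECTOR_MAP[source]
    if isinstance(entry, dict):
        if input_type not in entry:
            raise LookupError(f"Connector not found for source={source}")
        return entry[input_type]
    return entry  # single-class sources serve every input type
```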

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vha-146 carries the Slack InputType.PRUNE fix (verified live). Manifest now
matches the images actually deployed in the darwin cluster.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Built + pushed from the Mac (Apple Virtualization + Rosetta backend) via
k8s/scripts/build-deploy.sh; deployed to darwin and verified live (api-server
rolled out, pods healthy). No code/Dockerfile changes — image content matches
the prior tags.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rajivml rajivml force-pushed the feature/redis-assistants-ux-background-scaling branch from f5c3d5f to 4112145 Compare June 3, 2026 08:43
@rajivml rajivml closed this Jun 3, 2026
@rajivml rajivml reopened this Jun 3, 2026
Cumulative stages (build < push < deploy) plus a standalone verify, against the
prod overlay:
  - reads + auto-increments the vha-N image tags from kustomization.yaml
  - disk pre-req before build (graduated docker prune if >=80% full)
  - registry login from ACR_USERNAME/ACR_PASSWORD env (or ~/.zshrc); exits if unset
  - deploy edits the manifest, kubectl apply -k, waits on rollout
  - verify compares live cluster tags vs manifest + pod health
  - deploy refuses unless kubectl context is the prod cluster (FORCE=1 to override)
  - DRY_RUN=1 prints commands without mutating anything
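
The tag auto-increment is simple string arithmetic. The script itself is bash; the Python sketch below only illustrates the rule, with a hypothetical function name and the assumption that tags always have the form `vha-<integer>`.

```python
import re


def next_tag(tag: str) -> str:
    """Return the tag with its numeric suffix incremented by one."""
    match = re.fullmatch(r"(vha-)(\d+)", tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    prefix, number = match.groups()
    return f"{prefix}{int(number) + 1}"
```

Rejecting anything that does not match the pattern keeps a malformed or `latest` tag from being silently rewritten.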

bash 3.2 compatible (stock macOS). No secrets in the script.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rajivml rajivml force-pushed the feature/redis-assistants-ux-background-scaling branch from 4112145 to eba29bd Compare June 3, 2026 08:53
@rajivml rajivml merged commit fecda3d into feature/darwin Jun 3, 2026
6 checks passed
@rajivml rajivml deleted the feature/redis-assistants-ux-background-scaling branch June 3, 2026 08:58