Merged
102 commits
bb1faab
docs: add Redis caching & scaling plan
rajivml May 27, 2026
da37d15
P1: Redis foundation + read-through KV cache
rajivml May 28, 2026
d124a07
P2: per-user request rate limiter on chat/query endpoints
rajivml May 28, 2026
9896c87
Persona list cache with explicit write-through invalidation
rajivml May 28, 2026
c8378eb
Manage Assistants page UX overhaul
rajivml May 28, 2026
d72dd43
Assistant Gallery page UX overhaul
rajivml May 28, 2026
6536a95
Add backend/scripts/seed_assistants.py for local UX testing
rajivml May 28, 2026
41c2df3
Assistants UX polish: toggle highlight + gallery declutter
rajivml May 28, 2026
c7c1a5b
Remove tools chip from Manage Assistants page
rajivml May 28, 2026
24f2e4d
Parameterize gallery grid column count (default 3)
rajivml May 28, 2026
749d85e
Show document-set names on assistant cards (was: count only)
rajivml May 28, 2026
82a8395
Gallery: user-controllable column count (segmented control, persists)
rajivml May 28, 2026
3af40e5
Gallery: column picker as dropdown to match Sort
rajivml May 28, 2026
85ba241
Scale indexing via remote Dask scheduler topology
rajivml May 6, 2026
5ce3f49
docs: add MIGRATION.md covering Redis / bg-scaling / UX
rajivml May 28, 2026
017f127
darwin-kubernetes: port split-background manifests + lock convention …
rajivml May 28, 2026
cd366bc
docs(MIGRATION.md): reflect the darwin-kubernetes port
rajivml May 28, 2026
65406ab
docs(MIGRATION.md): update base reference after add-claude merge
rajivml May 29, 2026
5791ac5
Introduce k8s/ — kustomize-based manifests replacing darwin-kubernetes/
rajivml May 29, 2026
32b6f93
k8s: opt-in Redis via kustomize component (local includes, prod doesn't)
rajivml May 29, 2026
f48ee7b
k8s: prod now uses in-cluster Redis (was: managed Redis); bump memory…
rajivml May 30, 2026
4d63b72
k8s: move redis from optional component to base
rajivml May 30, 2026
84c0e42
docs(k8s/README): drop migration narrative + darwin-kubernetes refere…
rajivml May 30, 2026
0a7e7dd
docs(k8s/README): add instructions for applying optional/ manifests
rajivml May 30, 2026
7035265
k8s: make optional manifests parameterized like base (component + log…
rajivml May 30, 2026
99d336e
k8s: fully parameterize background-scaling component (replicas, env, …
rajivml May 30, 2026
ed473e8
docs(k8s/README): explicit apply/preview/verify commands for backgrou…
rajivml May 30, 2026
20f33ab
k8s: right-size background-scaling for a few-hundred-user deployment
rajivml May 30, 2026
c39442e
k8s: add slack-listener to background-scaling component
rajivml May 30, 2026
7eaffcd
k8s: consolidate background-scaling 6 deployments → 4
rajivml May 30, 2026
9a3064b
k8s: disable MULTILINGUAL_QUERY_EXPANSION in prod overlay
rajivml May 30, 2026
aff60f1
k8s: pin Vespa to 8.600.35 (was :latest → caused prod outage)
rajivml May 30, 2026
c81bd47
k8s: add readiness probes to Service-backed Vespa nodes
rajivml May 30, 2026
f82eedf
docs(AGENTS.md): add Critical fact §10 — never :latest for Vespa, pin…
rajivml May 30, 2026
ace2787
k8s: add guarded-apply.sh — block unsafe Vespa version jumps at apply…
rajivml May 30, 2026
8c36451
k8s: add KEDA indexing-autoscale component (opt-in) for dask-worker
rajivml May 30, 2026
b0643a0
k8s: add KEDA operator install manifest (optional/keda, own namespace)
rajivml May 30, 2026
64afc0d
docs(k8s/README): document mandatory pod restarts after a config change
rajivml May 30, 2026
a0411b4
k8s: add startup + readiness probes to api-server (own /health, not d…
rajivml May 30, 2026
0514cab
docs(k8s/README): add "Verify Redis caching is working" runbook
rajivml May 30, 2026
df6a574
perf: fix N+1 in basic indexing-status (chat page load + folder create)
rajivml May 30, 2026
7cdc0c4
perf(chat): optimistic folder create — no full refetch
rajivml May 30, 2026
9d7cf15
perf(chat): Redis-cache the connector indexing-status read
rajivml May 30, 2026
8fbde6a
k8s(vespa): group into base/vespa/ + ordered, health-gated upgrade sc…
rajivml May 30, 2026
792cd86
fix(chat): createFolder returns the new folder id (was undefined)
rajivml May 30, 2026
96ab1c6
polish(chat): tidy chat-row drag (compact ghost + no browser split-view)
rajivml May 30, 2026
9fbc685
feat(chat): pending spinner on "Manage Assistants" navigation
rajivml May 30, 2026
ea835ff
perf(db): Celery broker on Redis + env-driven Postgres pool sizing
rajivml May 31, 2026
20f5ea5
style(config): black-format app_configs.py (line-length 130)
rajivml May 31, 2026
fba5572
style: black-format Redis persona cache + rate-limit middleware
rajivml May 31, 2026
7012b41
perf(chat): per-user Redis cache for the document-set list
rajivml May 31, 2026
5a8b16d
refactor(chat): make doc-set cache global + MIT-scoped (no EE depende…
rajivml May 31, 2026
a4ab70a
feat(analytics): chat adoption curve + per-user breakdown; chat reten…
rajivml May 31, 2026
3d8ff62
feat(analytics): durable per-user daily stats (leaderboard survives r…
rajivml May 31, 2026
5dcc35b
feat(analytics): most-used assistants + approximate datasets-in-use
rajivml May 31, 2026
9ce884f
refactor(analytics): split page into Overview / User Activity tabs
rajivml May 31, 2026
e04e3ef
k8s(background-scaling): drop vestigial PVCs from split deployments +…
rajivml Jun 1, 2026
de01992
fix(indexing): stream large connector files instead of loading into RAM
rajivml Jun 1, 2026
8bbcbfc
feat(file-store): Azure Blob backend (bytes off Postgres), opt-in via…
rajivml Jun 1, 2026
19c9eb0
feat(file-store): lobj→Blob migration script + k8s wiring + README
rajivml Jun 1, 2026
82a883e
test(file-store): big-file round-trip integration test for Azure Blob
rajivml Jun 1, 2026
a405c82
fix(file-store): clear error when azure-storage-blob is missing
rajivml Jun 1, 2026
aed860e
fix(chat): block send while a file is still uploading
rajivml Jun 1, 2026
5841386
feat(chat): direct-to-Blob upload backend (SAS mint + confirm)
rajivml Jun 1, 2026
e897ddd
feat(chat): direct-to-Blob upload from the browser + progress bar
rajivml Jun 1, 2026
f831281
docs(file-store): direct-to-Blob chat upload flow + Storage CORS
rajivml Jun 1, 2026
7f2d28e
feat(chat): enforce chat-upload size + token limits
rajivml Jun 1, 2026
45a8e3a
feat(chat): FE size pre-check reads the configured limit (no hardcode…
rajivml Jun 1, 2026
84c3e86
k8s: unmount vestigial PVCs from api-server + background (keep the PVCs)
rajivml Jun 1, 2026
04bb5f5
fix(file-store): robust account name/key resolution for upload SAS
rajivml Jun 1, 2026
e745c2b
fix(file-store): support SAS-based connection strings for direct upload
rajivml Jun 1, 2026
3e801b0
refactor(file-store): direct upload requires account-key (drop broad-…
rajivml Jun 1, 2026
96a4d4a
fix(file-store): clear diagnostic when the connection string is malfo…
rajivml Jun 1, 2026
ba33c58
k8s: roll out split-background scaling to prod
rajivml Jun 2, 2026
55e3571
docs(k8s): direct-to-Blob upload, CORS, chat limits, account-key conn…
rajivml Jun 2, 2026
f738c41
k8s: cut prod file store over to Azure Blob
rajivml Jun 2, 2026
0457955
k8s: bump dask-worker resources (req 1cpu/4Gi, limit 8Gi)
rajivml Jun 2, 2026
279bd6d
k8s: remove legacy deployment/helm + deployment/kubernetes manifests
rajivml Jun 2, 2026
75443c8
k8s: remove legacy darwin-kubernetes manifests
rajivml Jun 2, 2026
16dca52
fix(k8s): right-size background-scaling memory (OOMKilled after split)
rajivml Jun 2, 2026
cf858e8
k8s: decouple Vespa from app overlays (separate prod-vespa/local-vespa)
rajivml Jun 2, 2026
364b88b
fix(indexing): LIMIT 1 in get_last_attempt — was materializing full h…
rajivml Jun 2, 2026
19a9eb0
k8s: enable index_attempt retention (30d, keep last 20 per cc-pair)
rajivml Jun 2, 2026
fb271d7
k8s: dask-worker waits for scheduler before starting (env-agnostic)
rajivml Jun 2, 2026
3fdb15e
k8s: right-size indexer-scheduler to 512Mi/2Gi (LIMIT-1 fix verified)
rajivml Jun 2, 2026
5dae052
k8s(prod): bump backend vha-139->vha-140, web vha-71->vha-72
rajivml Jun 2, 2026
2d6823f
perf(db): eliminate SQL over-materialization found in audit (no behav…
rajivml Jun 2, 2026
2bd1a6a
feat(cc-pair): server-side pagination for index-attempt history (FE +…
rajivml Jun 2, 2026
90c6068
k8s(prod): bump backend vha-140->vha-141, web vha-72->vha-73
rajivml Jun 2, 2026
36cd220
perf(indexing): skip re-indexing unchanged docs via content hash + pa…
rajivml Jun 2, 2026
0f02acd
obs(indexing): log per-batch count of docs skipped as unchanged
rajivml Jun 2, 2026
1c369ca
fix(k8s): celery-worker --concurrency instead of --autoscale (threads…
rajivml Jun 2, 2026
e3f1e4e
perf(web-connector): faster + far more resilient crawling
rajivml Jun 2, 2026
320492c
fix(jira): place poll time-window before ORDER BY in JQL
rajivml Jun 2, 2026
ac7d479
feat(jira): per-issue error tolerance, ID-based pruning, richer metadata
rajivml Jun 2, 2026
aa291c7
k8s(prod): bump backend vha-141->vha-144, web vha-73->vha-75
rajivml Jun 2, 2026
f201995
style: apply pre-commit (black / reorder-imports / prettier) to chang…
rajivml Jun 2, 2026
b0f383d
style: apply pre-commit across branch files (black/reorder/autoflake/…
rajivml Jun 2, 2026
81140fb
fix(slack): register InputType.PRUNE so Slack connectors can be pruned
rajivml Jun 3, 2026
84caacf
k8s(prod): bump backend vha-144->vha-146, web vha-75->vha-76
rajivml Jun 3, 2026
6445a0f
k8s(prod): bump backend vha-146->vha-147, web vha-76->vha-77
rajivml Jun 3, 2026
eba29bd
k8s: add build-deploy.sh (build/push/deploy/verify for backend+web)
rajivml Jun 3, 2026
10 changes: 10 additions & 0 deletions .gitignore
@@ -29,3 +29,13 @@ requestdata.json
# screenshots) written during local UI debugging. Not source.
.playwright-mcp/
model-picker-open.png

# Live cluster dumps from `kubectl get -o yaml > …`. NEVER commit:
# Darwin's ConfigMap currently contains real secrets in plaintext (Slack
# tokens, GEN_AI client secret, Jira token, Opsgenie key, etc.) — those
# values would be committed verbatim if temp/ ever got tracked. Real
# secret values for the new k8s/ layout live in gitignored *.env files
# under overlays/.
darwin-kubernetes/temp/
k8s/overlays/*/secrets.env
k8s/overlays/*/*.secrets.env
98 changes: 95 additions & 3 deletions AGENTS.md
@@ -64,7 +64,7 @@ moved on substantially. This table is the explicit map.
| Indexing runtime | Celery `docfetching` + `docprocessing` workers | **Dask `LocalCluster`** in `update.py` (Celery only does maintenance) |
| Number of Celery workers | Eight specialized workers (primary, light, heavy, kg_processing, monitoring, beat, etc.) | One worker + beat, spawned by `dev_run_background_jobs.py` |
| Celery task definition | `@shared_task` under `background/celery/tasks/` | `@celery_app.task` in `background/celery/celery_app.py` |
| Celery broker | Redis | SQLAlchemy/Postgres (`sqla+postgresql+psycopg2://…`) |
| Celery broker | Redis | SQLAlchemy/Postgres by default; **optionally Redis** via `CELERY_BROKER_REDIS_ENABLED=true` (logical DB `CELERY_REDIS_DB_NUMBER`, default 1). Prod enables it to keep Celery's queue traffic off Postgres. Indexing is still Dask either way. |
| Error handling | `raise OnyxError(OnyxErrorCode.X, …)` everywhere; no `HTTPException` | Plain `HTTPException(status_code=…, detail=…)` is the norm here. `OnyxError` doesn't exist. |
| FastAPI return types | "Don't use `response_model=`, just type the function" | Both styles exist in this fork (the typed-return-annotation form is the majority — `response_model=` only appears once in `connector.py:560`). New endpoints should use the typed-return form. Don't strip the existing `response_model=` without checking serialization behavior. |
| LLM call instrumentation | Every call must open a `LLMFlow`-tagged span via `traced_llm_call(...)` | No tracing system. `LLMFlow` doesn't exist. |
@@ -75,6 +75,7 @@ moved on substantially. This table is the explicit map.
| Test buckets | `backend/tests/{unit,external_dependency_unit,integration}` + Playwright e2e | No comparable structure here. Most code lacks tests; add tests with the change if practical, otherwise note in PR. |
| Plan template | The "Creating a Plan" section in their `CLAUDE.md` (Issues / Notes / Strategy / Tests) | Useful template; can be borrowed for non-trivial changes here too. |
| Frontend stack | Next.js 15+, React 18+ | Next.js 14.2.x (App Router), React 18 |
| K8s manifest path | `deployment/kubernetes/*` is what upstream documents | **`darwin-kubernetes/*` is the source of truth for the Darwin prod cluster.** `deployment/kubernetes/*` is upstream legacy / scratch — Darwin doesn't apply from there. New manifests for Darwin go in `darwin-kubernetes/`. See critical fact §9. |

**Rule of thumb when reading upstream code or upstream guidance:** assume
it doesn't apply unless you can verify the same construct exists here.
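
The Celery broker row in the table above describes an env-driven switch between two brokers. A minimal sketch of what such a selection could look like follows. Only `CELERY_BROKER_REDIS_ENABLED`, `CELERY_REDIS_DB_NUMBER` (default 1), and the `sqla+postgresql+psycopg2://` scheme come from the table. The function name `build_celery_broker_url`, the `REDIS_HOST` / `REDIS_PORT` / `POSTGRES_HOST` / `POSTGRES_PORT` / `POSTGRES_DB` variables, and all default values are assumptions made for illustration, not this fork's actual config code.

```python
from typing import Mapping
from urllib.parse import quote


def build_celery_broker_url(env: Mapping[str, str]) -> str:
    """Pick a Celery broker URL from environment-style settings.

    Hypothetical helper: apart from CELERY_BROKER_REDIS_ENABLED,
    CELERY_REDIS_DB_NUMBER and the sqla+postgresql+psycopg2 scheme,
    every name and default below is an assumption.
    """
    redis_enabled = env.get("CELERY_BROKER_REDIS_ENABLED", "").lower() == "true"
    if redis_enabled:
        host = env.get("REDIS_HOST", "localhost")
        port = env.get("REDIS_PORT", "6379")
        # Logical DB 1 by default, as stated in the table.
        db = env.get("CELERY_REDIS_DB_NUMBER", "1")
        password = env.get("REDIS_PASSWORD", "")
        # Percent-encode the password so characters like '@' stay URL-safe.
        auth = f":{quote(password, safe='')}@" if password else ""
        return f"redis://{auth}{host}:{port}/{db}"

    # Default path: Celery brokers on Postgres through the SQLAlchemy transport.
    user = quote(env.get("POSTGRES_USER", "postgres"), safe="")
    password = quote(env.get("POSTGRES_PASSWORD", "password"), safe="")
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    db = env.get("POSTGRES_DB", "postgres")
    return f"sqla+postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}"
```

Passing `os.environ` as `env` would give the runtime behaviour; taking a mapping keeps the sketch testable without touching the process environment.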
@@ -164,8 +165,9 @@ web/src/
deployment/docker_compose/
docker-compose.dev.yml ← local stack (relational_db + index/Vespa +
api_server + web_server + model_server +
background + nginx). Note: no Redis
here — Celery uses Postgres as its broker.
background + nginx). Celery brokers on
Postgres by default, or Redis when
CELERY_BROKER_REDIS_ENABLED=true.
```

---
@@ -340,6 +342,96 @@ auto-parse entirely with a raw `requests.get` against the
`/drives/{drive_id}/items/{item_id}/content` endpoint using the bearer
token. Don't reintroduce the lossy re-serialization.

### 9. `darwin-kubernetes/` is the source of truth for the Darwin cluster

The repo has two parallel k8s manifest trees and they are **not** kept
in sync:

| Path | What it is | When to touch |
|---|---|---|
| `darwin-kubernetes/*.yaml` | **The actual manifests applied to Darwin's AKS cluster (the `darwin` kube context).** Image registry is `sfbrdevhelmweacr.azurecr.io/...`, configmap is `env-configmap`, secrets is `danswer-secrets`, indexing pods have `indexcpu`-pool affinity + `darwin/indexing` toleration, env vars come from the Darwin configmap. | **Edit here for any prod-affecting change**, including new deployments. |
| `deployment/kubernetes/*.yaml` | Upstream-style manifests inherited from Onyx / authored to match the OSS docker-compose. Generic image (`danswer/danswer-backend:latest`), no Azure-specific affinity / tolerations, no Darwin-specific configmap wiring. | Reference only — not deployed to Darwin. Useful for seeing the "upstream shape" of a new component before adapting it to `darwin-kubernetes/`. |

When upstream (or a branch like `feature/backgroundscaling`) adds a
new manifest in `deployment/kubernetes/`, the corresponding
`darwin-kubernetes/` version must be hand-ported with:

- Image: `sfbrdevhelmweacr.azurecr.io/danswer/danswer-backend:<tag>`
- `envFrom: configMapRef name: env-configmap`
- POSTGRES_USER / POSTGRES_PASSWORD via `secretKeyRef name: danswer-secrets`
- REDIS_PASSWORD via `secretKeyRef name: danswer-secrets, optional: true`
(so unauth'd in-cluster Redis still works)
- For indexing-related pods: `nodeAffinity` on `agentpool=indexcpu` +
`tolerations` for `darwin/indexing/NoSchedule` + `dynamic-pvc` /
`file-connector-pvc` volume mounts.

A drop-in port that misses any of these will boot in Darwin but
mis-route, miss secrets, or end up on the wrong node pool. The
existing `darwin-kubernetes/background-deployment.yaml` and
`api_server-service-deployment.yaml` are the canonical templates for
the conventions.
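
The hand-port checklist above lends itself to a mechanical check. The sketch below validates the first four conventions (registry, configmap, Postgres secrets, optional Redis secret) against a Deployment manifest that has already been parsed into a plain dict. The function names `check_darwin_port` and `_secret_ref` are hypothetical, the secret `key` names used in any test data are placeholders, and the indexing-only rules (node affinity, tolerations, PVC mounts) are deliberately left out. Treat it as an illustration of the conventions, not as tooling that exists in the repo.

```python
from typing import Any, Dict, List, Optional

REGISTRY = "sfbrdevhelmweacr.azurecr.io/"
CONFIGMAP = "env-configmap"
SECRET = "danswer-secrets"


def _secret_ref(container: Dict[str, Any], var: str) -> Optional[Dict[str, Any]]:
    """Return the secretKeyRef backing env var `var`, or None if absent."""
    for entry in container.get("env", []):
        if entry.get("name") == var:
            return entry.get("valueFrom", {}).get("secretKeyRef")
    return None


def check_darwin_port(deployment: Dict[str, Any]) -> List[str]:
    """List convention violations for every container in a Deployment dict."""
    problems: List[str] = []
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for c in containers:
        name = c.get("name", "<unnamed>")

        # Image must come from the Darwin registry.
        if not c.get("image", "").startswith(REGISTRY):
            problems.append(f"{name}: image is not from {REGISTRY}")

        # Env vars must be sourced from the Darwin configmap.
        refs = [e.get("configMapRef", {}).get("name") for e in c.get("envFrom", [])]
        if CONFIGMAP not in refs:
            problems.append(f"{name}: envFrom does not reference {CONFIGMAP}")

        # Postgres credentials must come from the shared secret.
        for var in ("POSTGRES_USER", "POSTGRES_PASSWORD"):
            ref = _secret_ref(c, var)
            if not ref or ref.get("name") != SECRET:
                problems.append(f"{name}: {var} is not read from {SECRET}")

        # Redis password must be optional so unauthenticated Redis still works.
        redis = _secret_ref(c, "REDIS_PASSWORD")
        if not redis or redis.get("name") != SECRET or redis.get("optional") is not True:
            problems.append(f"{name}: REDIS_PASSWORD must be an optional ref to {SECRET}")
    return problems
```

An empty list means the four checked conventions hold; a non-empty list names each miss per container.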

### 10. NEVER use `:latest` (or a floating tag) for Vespa — pin the exact version

**This caused a full prod outage.** Vespa's config server refuses an
auto-upgrade spanning more than ~30 releases
(`VersionState.verifyVersionIntervalForUpgrade` →
`Cannot upgrade from X to Y ... interval too large`). If a manifest
change bumps the Vespa image to a much newer version, **every Vespa
StatefulSet rolls and the config server crash-loops on bootstrap**,
taking the whole cluster down (config tier → no quorum → cluster-wide
`upstream connect error / connection refused` 503s on search AND the
api-server's `ensure_indices_exist`).

What triggered it: an image spec of bare `vespaengine/vespa` (which
resolves to `:latest` at pull time) was changed to an explicit
`vespaengine/vespa:latest`, and by the next `kubectl apply` `:latest`
had moved 30+ releases ahead of the running version.

Rules:
- **Pin Vespa to the exact version the cluster runs.** As of this
writing that is **`8.600.35`**: the content nodes' index (1.6M+ docs,
100Gi PVCs) is written in that version's on-disk format. See the pinned
`images:` entry + comment in `k8s/overlays/{prod,local}/kustomization.yaml`.
- **Upgrades are STEPWISE and deliberate** — at most ~30 releases per
hop, applied as an ordered operation, never a bare tag bump. Do NOT
set `VESPA_SKIP_UPGRADE_CHECK=true` to force a big jump on prod; it
risks the index format.
- **Upgrade with `k8s/scripts/vespa-upgrade.sh <target> [ns]`, NOT a
kustomize apply.** Ordering across the 5 StatefulSets (configserver →
admin → content one-ordinal-at-a-time → feed → query, health-gated
between each) is impossible to express declaratively — a `kubectl
apply` rolls them all at once. The manifests support the script via
**per-role logical image names** (`vespa-configserver`, `vespa-admin`,
`vespa-content`, `vespa-feed`, `vespa-query` in `k8s/base/vespa/`) so
versions move independently, plus readiness probes on content/admin
with `publishNotReadyAddresses: true` on `vespa-internal` (peer
discovery must not be readiness-gated). Run `DRY_RUN=1` first. After a
successful upgrade, sync the per-role `newTag`s in the overlays.
- **`k8s/scripts/guarded-apply.sh <overlay>` is the everyday-apply safety
net, not the upgrade tool.** The guard reads the live running Vespa
version, compares it to what the overlay would deploy, and refuses a
>30-minor upgrade / major change / floating tag (and warns on big
downgrades) before it can reach the cluster. It checks against *live*
(not the repo's previous pin) because config drifts out of git. But it
still rolls all roles at once — for an actual version change use
`vespa-upgrade.sh`.
- This applies to any version-stateful StatefulSet, but Vespa is the
one that bites.
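
The guard rules above (refuse a floating tag, a major change, or a >30-minor upgrade, and warn on big downgrades) can be sketched as a small version comparison. This is not the contents of `guarded-apply.sh`, which is a shell script and is not shown here. The names `parse_version` and `check_vespa_change` are hypothetical, and the downgrade warning threshold is assumed to mirror the upgrade limit because the text does not give one.

```python
import re
from typing import Tuple

# "~30 releases" from the rules above, interpreted here as minor versions.
MAX_MINOR_JUMP = 30

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse an exact 'major.minor.patch' tag; reject anything floating."""
    match = _VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"floating or malformed Vespa tag: {tag!r}")
    return int(match.group(1)), int(match.group(2)), int(match.group(3))


def check_vespa_change(live: str, target: str) -> str:
    """Return 'ok' or 'warn'; raise ValueError when the change must be refused.

    `live` is the version currently running in the cluster, `target` is
    the version the overlay would deploy.
    """
    live_v = parse_version(live)
    target_v = parse_version(target)  # rejects "latest", "", "8", ...

    if target_v[0] != live_v[0]:
        raise ValueError(f"major version change {live} -> {target} refused")

    jump = target_v[1] - live_v[1]
    if jump > MAX_MINOR_JUMP:
        raise ValueError(
            f"upgrade {live} -> {target} spans {jump} minors (max {MAX_MINOR_JUMP})"
        )
    if jump < -MAX_MINOR_JUMP:
        # Assumed threshold: the text only says "warns on big downgrades".
        return "warn"
    return "ok"
```

Comparing against the live version rather than the previous pin in git matches the reasoning in the bullet above: the repo can lag what is actually running.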

**Recovery if it happens again** (data is safe: it lives on the
content PVCs, untouched):

1. Set all 5 Vespa StatefulSets' image back to the running version
   (`kubectl set image statefulset/vespa-* ...`).
2. Delete the config-server pods so they are recreated on the correct
   version.
3. Wait for `:19071/state/v1/health` → 200.
4. Restart the api-server so `ensure_indices_exist` redeploys the
   schema.

Clearing the config-server ZooKeeper state via
`vespa-configserver-remove-state` is only needed if the ZK state is
genuinely corrupt; the version mismatch alone does NOT require it.

Vespa nodes also have **no liveness probes by design** (an aggressive
one kills slow-but-healthy nodes); readiness probes on the
Service-backed nodes (configserver/query/feed) gate traffic during the
slow bootstrap.
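
The "wait for `:19071/state/v1/health` → 200" step of the recovery can be sketched as a bounded polling loop. Only the port and path come from the text. The function names, the retry budget, the delay, and the idea of reaching the config server over a port-forward are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for `url`, or 0 if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses (for example 503 during bootstrap) land here.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return 0


def wait_for_health(
    url: str,
    probe: Callable[[str], int] = http_status,
    attempts: int = 60,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or the attempt budget is exhausted."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

With a port-forward to a config-server pod in place, a call such as `wait_for_health("http://localhost:19071/state/v1/health")` would block until the config server reports healthy, and a `False` result signals that the api-server restart should not proceed yet.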

---

## Common workflows