feat(code-index): add Semble as a local on-the-fly embedding provider#399
feat(code-index): add Semble as a local on-the-fly embedding provider#399navedmerchant wants to merge 12 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughIntegrates Semble as a local CLI-backed code index provider: adds types/schemas, SembleCLI wrapper, downloader with checksum and extraction, SembleProvider implementation, CodeIndexManager/service-factory routing, UI/platform detection and i18n, embedding model profile, and extensive Vitest tests. ChangesSemble local indexing support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Not ready for review yet, the code is being reviewed and evolved |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/services/code-index/interfaces/manager.ts (1)
78-87: ⚡ Quick winConsider consolidating the duplicated EmbedderProvider type definition.
EmbedderProvideris defined both here and inpackages/types/src/embedding.ts. Different parts of the codebase import from different locations (e.g.,config-manager.tsimports from this file, whileservice-factory.tsimports from@roo-code/types). While currently in sync, maintaining two sources of truth creates risk of drift and inconsistency.♻️ Suggested consolidation approach
Consider removing the local type definition and importing from the shared package instead:
+import { EmbedderProvider } from "`@roo-code/types`" import { VectorStoreSearchResult } from "./vector-store" import * as vscode from "vscode" /** * Interface for the code index manager */ export interface ICodeIndexManager { // ... rest of interface -export type IndexingState = "Standby" | "Indexing" | "Indexed" | "Error" | "Stopping" -export type EmbedderProvider = - | "openai" - | "ollama" - | "openai-compatible" - | "gemini" - | "mistral" - | "vercel-ai-gateway" - | "bedrock" - | "openrouter" - | "semble"This ensures a single source of truth and prevents accidental divergence.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/code-index/interfaces/manager.ts` around lines 78 - 87, Duplicate EmbedderProvider type exists locally and in the shared package; remove the local type definition (EmbedderProvider) from this file and replace usages/imports (e.g., in config-manager.ts) to import the type from the canonical package (`@roo-code/types` or packages/types/src/embedding.ts), update any export or re-export as needed so service-factory.ts and other consumers use the single source of truth, run TypeScript build/typecheck to ensure no unresolved imports remain, and delete the now-unused local declaration to avoid drift.webview-ui/src/components/chat/CodeIndexPopover.tsx (1)
774-776: ⚡ Quick winConsider adding test coverage for semble provider UI.
Per coding guidelines for
webview-ui/src/**/*.{ts,tsx}, prefer local webview-ui tests for React/webview behavior. Consider adding tests underwebview-ui/src/**/__tests__to cover:
- Selecting semble from the provider dropdown
- Qdrant URL/API key fields hidden when semble is selected
- Validation passes with only
codebaseIndexEnabledfor semble- Semble path input field renders and updates state correctly
- Installation instructions panel renders
As per coding guidelines: "Prefer local
webview-uitests for React/webview behavior such as component rendering, local state, hooks, form dirty-state, validation, or prop wiring."Also applies to: 1446-1482, 1484-1537
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@webview-ui/src/components/chat/CodeIndexPopover.tsx` around lines 774 - 776, Add local webview-ui tests for the CodeIndexPopover component to cover the semble provider UI: create tests under webview-ui/src/**/__tests__ that render CodeIndexPopover, select the SelectItem with value "semble", assert that Qdrant URL/API key fields are hidden, confirm validation passes when only codebaseIndexEnabled is set for semble, verify the semble path input field appears and updates component state on change, and assert the installation instructions panel is rendered; use the existing test utilities (render, fireEvent/userEvent, assertions) consistent with other webview-ui tests to locate and interact with the provider dropdown and semble-specific inputs.webview-ui/src/i18n/locales/en/settings.json (1)
204-204: 💤 Low valueConsider adding a Windows path example.
The
semblePathDescriptionprovides a helpful Unix path example ("/usr/local/bin/semble"), but users on Windows might benefit from seeing a Windows path example as well, such as"C:\\Python310\\Scripts\\semble.exe".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@webview-ui/src/i18n/locales/en/settings.json` at line 204, The translation entry semblePathDescription currently shows only a Unix example; update its text to include a Windows path example (e.g., "C:\\Python310\\Scripts\\semble.exe") so Windows users have guidance—locate the semblePathDescription key in the locales/en/settings.json and add a short Windows example alongside the existing Unix example, keeping escaping consistent for backslashes and quotes.src/services/code-index/semble/semble-cli.ts (1)
226-256: ⚡ Quick winConsider adding runtime validation for CLI output structure.
Lines 242 and 247 use type assertions (
as SembleSearchResult[]) without runtime validation. If the semble CLI output format changes or contains unexpected data, this could cause runtime errors downstream when code expects the SembleSearchResult shape.Consider adding a simple validation helper or at least checking for required fields before casting.
🛡️ Example validation approach
// Handle successful response: {query, results: [{chunk, score}]} if (parsed.results && Array.isArray(parsed.results)) { - return parsed.results as SembleSearchResult[] + // Basic validation: check that results have expected shape + return parsed.results.filter((r: any) => + r?.chunk?.file_path && + r?.chunk?.content && + typeof r?.score === 'number' + ) as SembleSearchResult[] } // Fallback: if it's a flat array (older format) if (Array.isArray(parsed)) { - return parsed as SembleSearchResult[] + return parsed.filter((r: any) => + r?.chunk?.file_path && + r?.chunk?.content && + typeof r?.score === 'number' + ) as SembleSearchResult[] }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/code-index/semble/semble-cli.ts` around lines 226 - 256, The _parseOutput method casts parsed.results and parsed arrays to SembleSearchResult[] without runtime checks; add a lightweight validator (e.g., validateSembleResult) and use it to filter/transform parsed.results and parsed before returning to ensure each item has required properties (e.g., chunk is string, score is number, and any other SembleSearchResult fields); replace the direct casts in _parseOutput with Array.isArray checks plus items.filter(validateSembleResult) (or map-to-safe-shape) so only well-formed SembleSearchResult objects are returned and malformed entries are dropped or logged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/services/code-index/semble/semble-cli.ts`:
- Around line 150-157: The spawn call in _spawn uses an unsupported maxBuffer
option and silences the TS error with an as any cast; remove the maxBuffer
property from the options passed to spawn(this.semblePath, args, { shell: false,
timeout: options.timeout, stdio: [...] }) and drop the as any cast, and if you
need to limit output implement manual buffering and size checks inside
child.stdout?.on("data", ...) and child.stderr?.on("data", ...) (e.g.,
accumulate into stdout/stderr strings, enforce a MAX_BUFFER threshold, kill the
child and reject if exceeded) while keeping the rest of _spawn (child handling,
resolve/reject) intact.
---
Nitpick comments:
In `@src/services/code-index/interfaces/manager.ts`:
- Around line 78-87: Duplicate EmbedderProvider type exists locally and in the
shared package; remove the local type definition (EmbedderProvider) from this
file and replace usages/imports (e.g., in config-manager.ts) to import the type
from the canonical package (`@roo-code/types` or packages/types/src/embedding.ts),
update any export or re-export as needed so service-factory.ts and other
consumers use the single source of truth, run TypeScript build/typecheck to
ensure no unresolved imports remain, and delete the now-unused local declaration
to avoid drift.
In `@src/services/code-index/semble/semble-cli.ts`:
- Around line 226-256: The _parseOutput method casts parsed.results and parsed
arrays to SembleSearchResult[] without runtime checks; add a lightweight
validator (e.g., validateSembleResult) and use it to filter/transform
parsed.results and parsed before returning to ensure each item has required
properties (e.g., chunk is string, score is number, and any other
SembleSearchResult fields); replace the direct casts in _parseOutput with
Array.isArray checks plus items.filter(validateSembleResult) (or
map-to-safe-shape) so only well-formed SembleSearchResult objects are returned
and malformed entries are dropped or logged.
In `@webview-ui/src/components/chat/CodeIndexPopover.tsx`:
- Around line 774-776: Add local webview-ui tests for the CodeIndexPopover
component to cover the semble provider UI: create tests under
webview-ui/src/**/__tests__ that render CodeIndexPopover, select the SelectItem
with value "semble", assert that Qdrant URL/API key fields are hidden, confirm
validation passes when only codebaseIndexEnabled is set for semble, verify the
semble path input field appears and updates component state on change, and
assert the installation instructions panel is rendered; use the existing test
utilities (render, fireEvent/userEvent, assertions) consistent with other
webview-ui tests to locate and interact with the provider dropdown and
semble-specific inputs.
In `@webview-ui/src/i18n/locales/en/settings.json`:
- Line 204: The translation entry semblePathDescription currently shows only a
Unix example; update its text to include a Windows path example (e.g.,
"C:\\Python310\\Scripts\\semble.exe") so Windows users have guidance—locate the
semblePathDescription key in the locales/en/settings.json and add a short
Windows example alongside the existing Unix example, keeping escaping consistent
for backslashes and quotes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 332e2e21-894e-4a2d-9e0f-464ce90a3d05
📒 Files selected for processing (19)
packages/types/src/codebase-index.tspackages/types/src/embedding.tspackages/types/src/vscode-extension-host.tssrc/core/webview/ClineProvider.tssrc/core/webview/webviewMessageHandler.tssrc/services/code-index/config-manager.tssrc/services/code-index/interfaces/config.tssrc/services/code-index/interfaces/manager.tssrc/services/code-index/manager.tssrc/services/code-index/semble/__tests__/provider.spec.tssrc/services/code-index/semble/__tests__/semble-cli.spec.tssrc/services/code-index/semble/index.tssrc/services/code-index/semble/provider.tssrc/services/code-index/semble/semble-cli.tssrc/services/code-index/semble/types.tssrc/services/code-index/service-factory.tssrc/shared/embeddingModels.tswebview-ui/src/components/chat/CodeIndexPopover.tsxwebview-ui/src/i18n/locales/en/settings.json
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
So users have to install python, hope to their diety that they did it right and pip works, and then install semble like that. So that still "a lot" of local fiddling. It's still a fragile solution. The only thing this removes, is the requirement for a text embedding model, and a Qdrant instance. It doesn't remove fragility. It's not specifically in #200, but iyam it would be in the spirit of the requested feature to make code indexing a simple checkbox. Just tick it, and done. No local installs of anything. Ticking the checkbox does all the black magic that needs to happen. |
I had the same vision when trying to implement this (and even had a version implemented) but decided against it for two reasons
In my testing this is a pretty simple implementation which makes it easier for users to enable codebase indexing. It works quite well and adds value without adding too much complexity inside zoo. Setting up semble with pip is a lot simpler than setting up qdrant and gives pretty good indexing performance But you have a valid point, we should make codebase indexing as seamless as possible, and it should be “built in” we will revisit this in the future. |
|
It would be cool if we could add a companion PR to the read me in the semble repo instructions to install zoo code https://github.com/MinishLab/semble |
|
@thany I took your advice and now Zoo Code will automatically download a prebuilt semble from https://github.com/navedmerchant/sembleexec for the current platform, making it seamless! all you need to do is tick the box and it should "just work" @taltas that would not be necessary anymore, it should work out of the box for all users. |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/services/code-index/semble/provider.ts (1)
137-145:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSearch should initialize Semble on first use.
searchIndex()currently returns[]untilinitialize()/startIndexing()has already run. That contradicts the on-the-fly contract and can make the first codebase search silently fail. Initialize here and only bail if startup leaves the provider in"Error".♻️ Minimal fix
async searchIndex(query: string, directoryPrefix?: string): Promise<VectorStoreSearchResult[]> { if (!this._isInitialized) { - console.warn("[SembleProvider] searchIndex called before initialization") - return [] + await this.initialize() } - if (this._state === "Error") { + if (!this._isInitialized || this._state === "Error") { return [] }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/code-index/semble/provider.ts` around lines 137 - 145, searchIndex currently bails out when !_isInitialized causing first-use searches to silently fail; modify searchIndex in SembleProvider to call and await this.initialize() (or this.startIndexing() if that is the intended boot path) when !_isInitialized, then proceed with the search flow and only return [] if after initialization this._state === "Error". Locate the searchIndex method and replace the early return on !_isInitialized with an awaited initialization attempt, preserve the existing check for this._state === "Error" after initialization, and ensure any initialization errors are handled the same way the rest of the provider handles startup failures.
🧹 Nitpick comments (2)
webview-ui/src/components/chat/CodeIndexPopover.tsx (2)
179-183: ⚡ Quick winAdd local webview-ui tests for Semble-specific behavior.
This change adds provider-gated rendering and provider-specific validation rules; please add/update Vitest coverage in
webview-ui/src/**/__tests__for: supported vs unsupported platform option visibility, Semble validation path, and Qdrant fields hidden for Semble.As per coding guidelines, "webview-ui/src/**/*.{ts,tsx}: Prefer local
webview-uitests for React/webview behavior ... Add or update Vitest coverage underwebview-ui/src/**/__tests__instead of reaching forapps/vscode-e2e."Also applies to: 206-209, 774-778, 1449-1450
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@webview-ui/src/components/chat/CodeIndexPopover.tsx` around lines 179 - 183, The PR added Semble-specific UI and validation in the CodeIndexPopover component but lacks local Vitest coverage; add/update tests under webview-ui/src/**/__tests__ that mount CodeIndexPopover (or the relevant exported component) and assert: (1) provider-gated rendering shows supported platform options and hides unsupported ones, (2) when provider is "semble" the form uses the Semble validation path (e.g., emits/validates only codebaseIndexEnabled) and rejects missing required fields for other providers, and (3) Qdrant-related fields/inputs are not rendered when provider === "semble" (cover the other provider case where Qdrant fields are visible). Ensure tests use React testing library utilities already used in the repo and target the component names/functions from CodeIndexPopover to keep tests local to webview-ui.
206-209: ⚡ Quick winAvoid duplicating Semble platform support matrices in UI.
SEMBLE_SUPPORTED_PLATFORMSis hardcoded here, which can drift from the backend support source and hide/show the provider incorrectly over time. Prefer consuming a single source of truth from extension state (or a shared constant) instead of duplicating the list in the component.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@webview-ui/src/components/chat/CodeIndexPopover.tsx` around lines 206 - 209, The hardcoded SEMBLE_SUPPORTED_PLATFORMS array in CodeIndexPopover.tsx (and the derived isSembleSupported check) duplicates backend/extension state and may drift; replace the literal array with a single source of truth by importing the supported-platforms constant or reading the Semble support flag from the extension state/feature-flags exposed to the UI (e.g., use the shared exported constant or a selector from the extension state used elsewhere), then compute isSembleSupported using that imported value along with platform and arch so the UI always reflects the canonical support matrix.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/services/code-index/semble/__tests__/semble-downloader.spec.ts`:
- Around line 27-41: The test currently reuses shared EventEmitter instances
(mockRequest and mockResponse) across test cases which can leak listeners;
update the https.get mock implementation used in the test (the vi.mock block) so
it creates new EventEmitter instances inside the get callback each invocation
(i.e., instantiate mockRequest and mockResponse within the vi.fn for "get") or
alternatively add a beforeEach that calls mockRequest.removeAllListeners() and
mockResponse.removeAllListeners() and reinitializes any spy functions; target
the mock definitions referenced as mockRequest, mockResponse and the mocked
"https".get to ensure fresh emitters per test.
In `@src/services/code-index/semble/semble-downloader.ts`:
- Around line 226-234: The Windows branch that spawns PowerShell embeds
archivePath and destDir inside single-quoted literals for Expand-Archive (in the
spawn call where process.platform === "win32"), which breaks when paths contain
apostrophes; to fix, escape any single quotes in archivePath and destDir
(replace ' with '' per PowerShell rules) before building the command string or
avoid single-quoted literals by using & { } with properly escaped arguments, and
update the spawn invocation that calls "powershell" with the corrected,
safely-escaped path variables so Expand-Archive receives valid paths.
---
Outside diff comments:
In `@src/services/code-index/semble/provider.ts`:
- Around line 137-145: searchIndex currently bails out when !_isInitialized
causing first-use searches to silently fail; modify searchIndex in
SembleProvider to call and await this.initialize() (or this.startIndexing() if
that is the intended boot path) when !_isInitialized, then proceed with the
search flow and only return [] if after initialization this._state === "Error".
Locate the searchIndex method and replace the early return on !_isInitialized
with an awaited initialization attempt, preserve the existing check for
this._state === "Error" after initialization, and ensure any initialization
errors are handled the same way the rest of the provider handles startup
failures.
---
Nitpick comments:
In `@webview-ui/src/components/chat/CodeIndexPopover.tsx`:
- Around line 179-183: The PR added Semble-specific UI and validation in the
CodeIndexPopover component but lacks local Vitest coverage; add/update tests
under webview-ui/src/**/__tests__ that mount CodeIndexPopover (or the relevant
exported component) and assert: (1) provider-gated rendering shows supported
platform options and hides unsupported ones, (2) when provider is "semble" the
form uses the Semble validation path (e.g., emits/validates only
codebaseIndexEnabled) and rejects missing required fields for other providers,
and (3) Qdrant-related fields/inputs are not rendered when provider === "semble"
(cover the other provider case where Qdrant fields are visible). Ensure tests
use React testing library utilities already used in the repo and target the
component names/functions from CodeIndexPopover to keep tests local to
webview-ui.
- Around line 206-209: The hardcoded SEMBLE_SUPPORTED_PLATFORMS array in
CodeIndexPopover.tsx (and the derived isSembleSupported check) duplicates
backend/extension state and may drift; replace the literal array with a single
source of truth by importing the supported-platforms constant or reading the
Semble support flag from the extension state/feature-flags exposed to the UI
(e.g., use the shared exported constant or a selector from the extension state
used elsewhere), then compute isSembleSupported using that imported value along
with platform and arch so the UI always reflects the canonical support matrix.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 22c81bb0-3664-464b-a867-3c12e1e8caf7
📒 Files selected for processing (34)
packages/types/src/codebase-index.tspackages/types/src/vscode-extension-host.tssrc/core/webview/ClineProvider.tssrc/services/code-index/__tests__/config-manager.spec.tssrc/services/code-index/__tests__/service-factory.spec.tssrc/services/code-index/config-manager.tssrc/services/code-index/manager.tssrc/services/code-index/semble/__tests__/provider.spec.tssrc/services/code-index/semble/__tests__/semble-cli.spec.tssrc/services/code-index/semble/__tests__/semble-downloader.spec.tssrc/services/code-index/semble/index.tssrc/services/code-index/semble/provider.tssrc/services/code-index/semble/semble-cli.tssrc/services/code-index/semble/semble-downloader.tssrc/services/code-index/semble/types.tswebview-ui/src/components/chat/CodeIndexPopover.tsxwebview-ui/src/i18n/locales/ca/settings.jsonwebview-ui/src/i18n/locales/de/settings.jsonwebview-ui/src/i18n/locales/en/settings.jsonwebview-ui/src/i18n/locales/es/settings.jsonwebview-ui/src/i18n/locales/fr/settings.jsonwebview-ui/src/i18n/locales/hi/settings.jsonwebview-ui/src/i18n/locales/id/settings.jsonwebview-ui/src/i18n/locales/it/settings.jsonwebview-ui/src/i18n/locales/ja/settings.jsonwebview-ui/src/i18n/locales/ko/settings.jsonwebview-ui/src/i18n/locales/nl/settings.jsonwebview-ui/src/i18n/locales/pl/settings.jsonwebview-ui/src/i18n/locales/pt-BR/settings.jsonwebview-ui/src/i18n/locales/ru/settings.jsonwebview-ui/src/i18n/locales/tr/settings.jsonwebview-ui/src/i18n/locales/vi/settings.jsonwebview-ui/src/i18n/locales/zh-CN/settings.jsonwebview-ui/src/i18n/locales/zh-TW/settings.json
💤 Files with no reviewable changes (2)
- packages/types/src/codebase-index.ts
- src/services/code-index/config-manager.ts
✅ Files skipped from review due to trivial changes (9)
- webview-ui/src/i18n/locales/pl/settings.json
- webview-ui/src/i18n/locales/en/settings.json
- webview-ui/src/i18n/locales/nl/settings.json
- webview-ui/src/i18n/locales/ja/settings.json
- webview-ui/src/i18n/locales/it/settings.json
- webview-ui/src/i18n/locales/de/settings.json
- webview-ui/src/i18n/locales/tr/settings.json
- webview-ui/src/i18n/locales/zh-TW/settings.json
- webview-ui/src/i18n/locales/id/settings.json
🚧 Files skipped from review as they are similar to previous changes (2)
- src/services/code-index/semble/index.ts
- src/services/code-index/manager.ts
|
Verified this PR work on MacOS arm, windows x64 and linux x64. In my testing it makes zoo find relevant context and answer questions about 20-30percent faster |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/services/code-index/semble/semble-downloader.ts`:
- Around line 167-171: The code currently skips checksum verification when
SEMBLE_SHA256 lacks an entry for the current platform, allowing extraction to
proceed without integrity checks; update the block that computes
platformKey/expectedChecksum so that if expectedChecksum is missing (falsy) you
immediately abort with a hard error (throw an Error or call process.exit(1)) and
include identifying info (platformKey and SEMBLE_VERSION) in the message; keep
the subsequent await verifyChecksum(archivePath, expectedChecksum) unchanged so
verified paths still run as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 549c7ede-f2e7-4f8b-ab0e-e9d0ebdb1efd
📒 Files selected for processing (2)
src/services/code-index/semble/__tests__/semble-downloader.spec.tssrc/services/code-index/semble/semble-downloader.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/services/code-index/semble/tests/semble-downloader.spec.ts
| const platformKey = `${process.platform}-${process.arch}` | ||
| const expectedChecksum = SEMBLE_SHA256[platformKey] | ||
| if (expectedChecksum) { | ||
| await verifyChecksum(archivePath, expectedChecksum) | ||
| } |
There was a problem hiding this comment.
Abort when no checksum is configured.
This currently fails open: if a future SEMBLE_VERSION or supported platform is added without updating SEMBLE_SHA256, the archive is still extracted and later executed without integrity verification. Missing checksum data should be a hard error here.
🔒 Minimal fix
// Verify archive integrity before extraction
const platformKey = `${process.platform}-${process.arch}`
const expectedChecksum = SEMBLE_SHA256[platformKey]
- if (expectedChecksum) {
- await verifyChecksum(archivePath, expectedChecksum)
- }
+ if (!expectedChecksum) {
+ throw new Error(`No checksum configured for ${platformKey} at ${SEMBLE_VERSION}`)
+ }
+ await verifyChecksum(archivePath, expectedChecksum)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const platformKey = `${process.platform}-${process.arch}` | |
| const expectedChecksum = SEMBLE_SHA256[platformKey] | |
| if (expectedChecksum) { | |
| await verifyChecksum(archivePath, expectedChecksum) | |
| } | |
| const platformKey = `${process.platform}-${process.arch}` | |
| const expectedChecksum = SEMBLE_SHA256[platformKey] | |
| if (!expectedChecksum) { | |
| throw new Error(`No checksum configured for ${platformKey} at ${SEMBLE_VERSION}`) | |
| } | |
| await verifyChecksum(archivePath, expectedChecksum) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/services/code-index/semble/semble-downloader.ts` around lines 167 - 171,
The code currently skips checksum verification when SEMBLE_SHA256 lacks an entry
for the current platform, allowing extraction to proceed without integrity
checks; update the block that computes platformKey/expectedChecksum so that if
expectedChecksum is missing (falsy) you immediately abort with a hard error
(throw an Error or call process.exit(1)) and include identifying info
(platformKey and SEMBLE_VERSION) in the message; keep the subsequent await
verifyChecksum(archivePath, expectedChecksum) unchanged so verified paths still
run as before.
There was a problem hiding this comment.
this should not be problem as we will not show semble on an unsupported platform
There was a problem hiding this comment.
@navedmerchant — That's fair for the current state: all four entries in SEMBLE_ARCHIVES are also present in SEMBLE_SHA256, so the UI gate already ensures the checksum lookup never misses today.
The risk is purely forward-looking: the two maps are maintained independently, so a future contributor who adds a new platform to SEMBLE_ARCHIVES (making it UI-visible and reachable) might forget to add its entry to SEMBLE_SHA256, causing silent skip of verification. A cheap way to close that gap without changing runtime behavior is a compile-time constraint that forces the two maps to stay in sync:
// Ensures every SEMBLE_ARCHIVES key has a corresponding checksum entry.
// TypeScript will error at compile time if a platform is added to one map but not the other.
type _SembleChecksumParity = {
[K in keyof typeof SEMBLE_ARCHIVES]: (typeof SEMBLE_SHA256)[K]
}Alternatively, the hard-error approach from the original suggestion makes the mismatch a visible runtime failure rather than a silent bypass, which is a reasonable safety net. Either way is fine — happy to defer to your preference.
There was a problem hiding this comment.
I agree with @coderabbitai on this - lets just hard-error, rather then conditionally verify.
There was a problem hiding this comment.
Agreed, @edelauna. The hard-error path is the safer default — a silent skip is easy to miss during review and leaves the door open for an unverified binary to be executed.
@navedmerchant — given the reviewer alignment here, the minimal change would be:
- if (expectedChecksum) {
- await verifyChecksum(archivePath, expectedChecksum)
- }
+ if (!expectedChecksum) {
+ throw new Error(`No checksum configured for platform ${platformKey} at ${SEMBLE_VERSION}`)
+ }
+ await verifyChecksum(archivePath, expectedChecksum)This keeps the happy path identical (all four current platforms have checksums) while making any future omission an immediate, visible failure rather than a silent bypass.
|
This PR will set the precedent for allowing the extension to download, configure, install, and run third party executables. I know we haven’t set any governance principles but my preference would be for users to input a PATH location for where to find the CLI if they want to configure semble. |
I guess this isn’t automatic, the user still needs to interact with zoo code to get it installed. I’ll review a bit more. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/services/code-index/semble/semble-downloader.ts (1)
149-157:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftUpgrade deletes old binary before verifying new one.
If the download or extraction fails after deleting the old installation, the user loses their working Semble binary. Consider extracting to a temporary location first, then atomically replacing the old installation only after the new one is verified.
🔧 Suggested approach
// Version mismatch — remove old installation before downloading new one if (installedVersion && installedVersion !== SEMBLE_VERSION) { console.log(`[SembleDownloader] Version changed from ${installedVersion} to ${SEMBLE_VERSION}, updating...`) - try { - await fs.rm(extractDir, { recursive: true, force: true }) - } catch { - // ignore cleanup errors - } + // Keep old installation until new one is verified - see extractDir usage below }Then modify the extraction flow to use a staging directory (e.g.,
semble-new) and only rename tosembleafter successful verification, removing the old directory at that point.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/code-index/semble/semble-downloader.ts` around lines 149 - 157, The current upgrade flow deletes the existing installation (using installedVersion, SEMBLE_VERSION and fs.rm on extractDir) before downloading/extracting the new binary; change this to extract into a staging directory (e.g., extractDir + "-new" or "semble-new"), run verification there, and only after successful validation atomically replace the old install (rename/move staging to extractDir) and then remove the old directory; update the logic around the version check in the code that uses installedVersion, SEMBLE_VERSION and extractDir so that fs.rm is only used to remove the old installation after the new staged install is verified (or to cleanup failed staging), and ensure error paths leave the original extractDir intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/services/code-index/semble/semble-downloader.ts`:
- Around line 149-157: The current upgrade flow deletes the existing
installation (using installedVersion, SEMBLE_VERSION and fs.rm on extractDir)
before downloading/extracting the new binary; change this to extract into a
staging directory (e.g., extractDir + "-new" or "semble-new"), run verification
there, and only after successful validation atomically replace the old install
(rename/move staging to extractDir) and then remove the old directory; update
the logic around the version check in the code that uses installedVersion,
SEMBLE_VERSION and extractDir so that fs.rm is only used to remove the old
installation after the new staged install is verified (or to cleanup failed
staging), and ensure error paths leave the original extractDir intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 1a4fad23-6a88-432c-a30e-65a760f90d61
📒 Files selected for processing (1)
src/services/code-index/semble/semble-downloader.ts
edelauna
left a comment
There was a problem hiding this comment.
Looks good - has some suggestions, and requesting changes to switch from conditional checksum verification to hard fail on unsupported architectures.
| if (provider === "semble") { | ||
| throw new Error( | ||
| "Semble provider handles its own embedding. Do not call createEmbedder() for semble — use SembleProvider instead.", |
There was a problem hiding this comment.
Is semble really an EmbedderProvider? It doesn't generate embedding vectors — it's a self-contained search binary. Forcing it into this type means the factory has to throw at runtime to prevent misuse, which feels like the type system telling us the abstraction is wrong. Would a discriminated union at the manager level (branching before the factory is involved) be cleaner here?
| * Initializes the provider: downloads semble, then validates it works. | ||
| */ | ||
| async initialize(): Promise<void> { | ||
| if (this._isInitialized) { |
There was a problem hiding this comment.
What happens if initialize() is called concurrently — e.g., manager reconfigures while a prior init is still downloading? Both callers pass this guard simultaneously, and two concurrent downloads race to the same archive path. An _initPromise that subsequent callers await (or a CancellationTokenSource to cancel the in-flight init) would be the idiomatic VS Code pattern here.
| let stdout = "" | ||
| let stderr = "" | ||
|
|
||
| child.stdout?.on("data", (data: Buffer) => { | ||
| stdout += data.toString() |
There was a problem hiding this comment.
This accumulates stdout without any limit. If the binary outputs large verbose output (or is the wrong version), this could exhaust the extension host heap well before the 120s timeout fires — and since the extension host is shared across all extensions, it'd take everything down. Worth capping at some max bytes?
| if (directoryPrefix) { | ||
| const normalizedPrefix = path.join(this.workspacePath, directoryPrefix).replace(/\\/g, "/") | ||
| converted = converted.filter((r) => { | ||
| const filePath = (r.payload?.filePath ?? "").replace(/\\/g, "/") | ||
| return filePath.startsWith(normalizedPrefix + "/") || filePath === normalizedPrefix | ||
| }) | ||
| console.log( | ||
| `[SembleProvider] Filtered to "${directoryPrefix}": ${converted.length} of ${results.length} results`, | ||
| ) |
There was a problem hiding this comment.
For every other provider, directoryPrefix scopes the DB query before topK results are fetched, so topK counts within the prefix. Here it's a post-hoc filter on results already fetched from the full workspace — so on a large monorepo, a scoped query could return 0 results because semble spent all topK slots on other directories. Is this a known tradeoff, or should we bump the initial topK and then trim?
| response.headers.location | ||
| ) { | ||
| response.destroy() | ||
| downloadFile(response.headers.location, destPath, maxRedirects - 1) |
There was a problem hiding this comment.
Should we validate that the redirect target stays on a trusted domain (github.com / objects.githubusercontent.com) before following it? A compromised CDN redirecting to a different HTTPS host would succeed silently, and the if (expectedChecksum) guard at line 169 means a future platform without a SHA256 entry would skip integrity verification entirely.
| */ | ||
| function extractTarGz(archivePath: string, destDir: string): Promise<void> { | ||
| return new Promise((resolve, reject) => { | ||
| const child = spawn("tar", ["-xzf", archivePath, "-C", destDir], { |
There was a problem hiding this comment.
Had we considered adding --no-absolute-filenames (or the equivalent) to the tar invocation? Without it, a crafted archive could write files outside destDir via absolute path entries. The checksum verification is the primary guard, but defense-in-depth on the extraction step seems worthwhile given the binary gets executed.
| // Semble creates a separate cache directory per path (SHA-256 of the | ||
| // resolved absolute path), so passing subdirectories would create | ||
| // redundant indexes and waste disk space. | ||
| console.log(`[SembleProvider] Searching for "${query}" in ${this.workspacePath}`) |
There was a problem hiding this comment.
This logs the raw user query on every search call. In enterprise deployments, extension host logs can be forwarded to telemetry — is there a debug-gated OutputChannel we should be using here instead of console.log, consistent with how the other providers log?
| "win32-x64": { archive: "semble-windows-x64-fast.zip", binary: "semble.exe" }, | ||
| } | ||
|
|
||
| const SEMBLE_VERSION = "v0.3.1" |
There was a problem hiding this comment.
re: keeping SEMBLE_VERSION and SEMBLE_SHA256 in sync going forward, we could wire up Renovate to handle this automatically.
Renovate already has custom regex managers in this repo, so the version bump part is straightforward. The checksums are trickier since there are 4 platform-specific artifacts, but we can handle it with a postUpgradeTasks script that recomputes them after the version bump.
Rough shape:
-
Add a helper script (
scripts/update-semble-checksums.sh) that readsSEMBLE_VERSIONfromsemble-downloader.ts, downloads each archive, computesshasum -a 256, and patches the file in-place. -
Add a custom manager to renovate.json:json
{
"customType": "regex",
"description": "Update semble binary version from GitHub releases.",
"managerFilePatterns": ["src/services/code-index/semble/semble-downloader.ts"],
"matchStrings": ["const SEMBLE_VERSION = \"(?<currentValue>v[^\"]+)\""],
"depNameTemplate": "Zoo-Code-Org/sembleexec",
"datasourceTemplate": "github-releases",
"postUpgradeTasks": {
"commands": ["scripts/update-semble-checksums.sh"],
"fileFilters": ["src/services/code-index/semble/semble-downloader.ts"]
}
}This way when a new semble release drops, Renovate opens a PR with both the version bump and updated checksums — no manual step, no risk of the two maps drifting out of sync.
Doesn't have to be in this PR — just flagging it as a follow-up so we don't end up doing manual bumps forever.
| const platformKey = `${process.platform}-${process.arch}` | ||
| const expectedChecksum = SEMBLE_SHA256[platformKey] | ||
| if (expectedChecksum) { | ||
| await verifyChecksum(archivePath, expectedChecksum) | ||
| } |
There was a problem hiding this comment.
I agree with @coderabbitai on this - lets just hard-error, rather then conditionally verify.
Related GitHub Issue
Closes: #200
Description
Semble is a local code search provider that requires no API keys or external services. When a user selects "Semble - Local" as their code index provider, the extension handles everything automatically:
Searches are performed on-the-fly — semble indexes as it searches, so there's no separate indexing step or background process.
Supported platforms:
linux-x64,linux-arm64,darwin-arm64,win32-x64. The semble option only appears in the UI on these platforms.Architecture:
semble-downloader.ts— Downloads archive from GitHub releases, extracts via systemtar/powershell, caches inglobalStorageUri/semble/semble-cli.ts— Thin wrapper that spawns the binary with array args (no shell) for safe argument passingprovider.ts— Orchestrates download → verify → search. Implements the samesearchIndexinterface as other code index providersThe binary is downloaded once and reused across sessions. No user configuration needed.
Test Procedure
semble-downloader.spec.ts— Platform detection, download/skip logic, redirect handling, extraction, cleanup on failuresemble-cli.spec.ts— CLI spawn behavior, argument construction, JSON output parsingprovider.spec.ts— Download flow, platform check, error states, search result conversionpnpm --filter zoo-code exec vitest run services/code-index/semble/Pre-Submission Checklist
Screenshots / Videos
Documentation Updates
Additional Notes
darwin-x64) is not supported as the tree sitter dependency does not support it.Get in Touch
navedmerchant
Summary by CodeRabbit
New Features
Improvements
Tests