Problem Statement
With CAS optimistic concurrency merged (PR #1292), the persistence layer prevents lost updates when multiple writers mutate the same object. However, the gateway's reconciler loop — which drives sandbox lifecycle state transitions — still runs on every replica. In an HA deployment, this produces duplicate work, N-way CAS contention on every reconcile sweep, and wasted compute driver RPCs. A single reconciler lease ensures only one replica runs background coordination at a time.
Supervisor session ownership and inter-replica session forwarding are out of scope for this issue and will be addressed separately.
Technical Context
The gateway reconciler operates through two concurrent loops spawned at startup: a watch loop that consumes real-time events from the compute driver's WatchSandboxes stream, and a reconcile loop that runs a full store-vs-backend sweep every 60 seconds. Both loops acquire a process-local sync_lock Mutex before mutating sandbox state — a guard that is explicitly documented as not HA-safe (references issue #1255).
All sandbox store mutations go through update_message_cas with expected_version=0 (server-driven CAS), which means the database resolves concurrent writes correctly. But without lease-based ownership, every replica does redundant work: re-fetching sandbox state from the driver, computing phase transitions, and attempting CAS writes that only one replica can win.
Why a single reconciler lease is sufficient
The reconciler is a background consistency-repair mechanism, not the hot path. It covers the 60-second periodic sweep and the watch event processing loop. All replicas still serve gRPC requests (create, delete, update sandboxes), and supervisor sessions still land on whichever replica the TCP connection reaches.
A single reconciler lease breaks down when:
- Sweep duration exceeds the interval. Each reconciled sandbox costs roughly one GetSandbox driver RPC (~5-10ms). At 60s intervals, you'd need ~6,000+ concurrent sandboxes before the sweep can't finish in time, well beyond initial HA deployments.
- Reconciler needs session locality. If reconciliation ever requires talking to the supervisor (not just the compute driver and store), it would benefit from running on the session-owning replica. Today it doesn't.
- Failover gap. If the lease holder dies, reconciliation pauses for the TTL duration (~30s). gRPC-initiated mutations continue working on all replicas via CAS. The reconciler catches up stale state — a 30s gap is acceptable.
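The sweep-duration bound in the first bullet is plain arithmetic. The sketch below makes it explicit, under the assumption (not stated in the investigation) that the sweep issues GetSandbox RPCs one at a time with a fixed latency. The function name is hypothetical.

```rust
/// Upper bound on sandboxes a single sweep can reconcile before the next
/// sweep is due. Assumes sequential RPCs with a fixed per-call latency,
/// which is a simplification for illustration only.
/// Returns None when the latency is zero (the bound is undefined).
fn max_sandboxes_per_sweep(interval_ms: u64, rpc_latency_ms: u64) -> Option<u64> {
    interval_ms.checked_div(rpc_latency_ms)
}
```

With a 60,000 ms interval, the 10 ms end of the latency range gives 6,000 sandboxes and the 5 ms end gives 12,000, so the ~6,000 figure is the conservative end of the estimate.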
Per-sandbox or shard-based leases are a future optimization if sandbox counts grow into the thousands. The single-lease model avoids O(N) lease records, lease rebalancing, and unnecessary complexity.
This approach is consistent with RFC 0001's intent. The RFC rejects a "singleton controller" where one replica handles all control-plane responsibilities (reconciliation, session ownership, relay coordination, and client requests). A single reconciler lease is narrower: it only scopes background sweeps, while gRPC serving and session handling remain distributed across all replicas.
Affected Components
| Component | Key Files | Role |
| --- | --- | --- |
| Compute runtime | crates/openshell-server/src/compute/mod.rs | Reconciler loops, sandbox state machine, driver interaction |
| Persistence layer | crates/openshell-server/src/persistence/mod.rs, postgres.rs, sqlite.rs | CAS primitives, object storage |
| Server startup | crates/openshell-server/src/lib.rs | Gateway initialization, replica identity |
| Proto definitions | proto/datamodel.proto | ObjectMeta (if lease needs proto representation) |
Technical Investigation
Architecture Overview
The compute subsystem (ComputeRuntime) is the gateway's sandbox lifecycle engine. It owns:
- Watch loop (compute/mod.rs:706-742): Opens a streaming WatchSandboxes RPC to the compute driver. Events include sandbox status updates, deletions, and platform events. Each event triggers a CAS read-modify-write on the store record.
- Reconcile loop (compute/mod.rs:744-751): Runs every 60 seconds (RECONCILE_INTERVAL). Lists all sandboxes from both the driver (ListSandboxes) and the store, then reconciles discrepancies. Records not updated since the sweep started are refreshed via GetSandbox. Orphaned store records (no backend resource) are pruned after a 300-second grace period (ORPHAN_GRACE_PERIOD).
The sync_lock Mutex (compute/mod.rs:231,281-285) serializes all sandbox mutations within a single gateway process. Its comment explicitly notes this is insufficient for HA and references issue #1255. The CAS branch (#1292) added database-level concurrency control as the foundation for removing this process-local guard.
Code References
| Location | Description |
| --- | --- |
| compute/mod.rs:220-233 | ComputeRuntime struct: holds driver, store, session registry, sync_lock |
| compute/mod.rs:231,281-285 | sync_lock Mutex: documented as not HA-safe |
| compute/mod.rs:549-558 | spawn_watchers(): launches both background loops |
| compute/mod.rs:706-742 | watch_loop(): driver event stream consumer |
| compute/mod.rs:744-751 | reconcile_loop(): 60s periodic sweep |
| compute/mod.rs:753-798 | reconcile_store_with_backend(): core reconcile logic |
| compute/mod.rs:839-980 | apply_sandbox_update_locked(): read-modify-write with CAS |
| compute/mod.rs:1115-1143 | reconcile_snapshot_sandbox(): per-sandbox reconcile with staleness guard |
| compute/mod.rs:1145-1187 | prune_missing_sandbox(): orphan cleanup |
| persistence/mod.rs:90-98 | WriteCondition: MustCreate / MatchResourceVersion / Unconditional |
| persistence/mod.rs:176-197 | Store::put_if(): CAS write |
| persistence/mod.rs:199-228 | Store::delete_if(): CAS delete |
| persistence/mod.rs:406-475 | Store::update_message_cas(): read-modify-write helper |
Current Behavior
Reconcile flow:
- reconcile_store_with_backend() calls ListSandboxes on the driver to get all backend sandbox IDs
- For each backend sandbox: acquire sync_lock, read store record, skip if recently updated, re-fetch from driver via GetSandbox, apply state merge via apply_sandbox_update_locked
- For each store record with no backend match: wait for 300s grace period, double-check via GetSandbox, prune if confirmed missing
- State merge (apply_sandbox_update_locked) derives phase from driver conditions, checks supervisor session presence (in-memory registry), and writes via update_message_cas with expected_version=0
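The per-record decisions in the flow above can be sketched as a pure planning function. This is a simplified illustration, not the code in compute/mod.rs: the `Step` enum and `plan_sweep` are hypothetical names, the sketch only walks store records, and it assumes the grace period is measured from the record's last update, which the investigation does not specify.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Mirrors the 300-second ORPHAN_GRACE_PERIOD, expressed in milliseconds.
const ORPHAN_GRACE_MS: u64 = 300_000;

/// Hypothetical per-record decision for one sweep.
#[derive(Debug, PartialEq)]
enum Step {
    Skip,            // updated since the sweep started
    Refresh,         // re-fetch via GetSandbox and merge
    WaitForGrace,    // no backend match, still inside the grace period
    VerifyThenPrune, // no backend match, grace elapsed: double-check, then prune
}

/// Classifies each store record (id -> last update time in ms) against the
/// set of backend sandbox IDs returned by the driver.
fn plan_sweep(
    backend_ids: &BTreeSet<&str>,
    store_updated_at_ms: &BTreeMap<&str, u64>,
    sweep_started_ms: u64,
) -> Vec<(String, Step)> {
    store_updated_at_ms
        .iter()
        .map(|(id, &updated)| {
            let step = if backend_ids.contains(id) {
                if updated >= sweep_started_ms { Step::Skip } else { Step::Refresh }
            } else if sweep_started_ms.saturating_sub(updated) < ORPHAN_GRACE_MS {
                Step::WaitForGrace
            } else {
                Step::VerifyThenPrune
            };
            ((*id).to_string(), step)
        })
        .collect()
}
```

Keeping the classification separate from the driver and store calls makes it unit-testable without a TestDriver.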
Phase transitions driven by the reconciler:
| Trigger | From | To | Path |
| --- | --- | --- | --- |
| Driver reports Ready=True | Provisioning | Ready | watch/reconcile loop |
| Driver reports terminal failure | Provisioning | Error | watch/reconcile loop |
| Driver reports deleting=true | Any | Deleting | watch/reconcile loop |
| Backend resource gone (after grace) | Any | Deleted | reconcile loop |
What Would Need to Change
1. Lease record type and primitives. A single global lease record stored as object_type = "reconciler_lease" in the existing objects table. One record, not per-sandbox. The CAS primitives are already available:
- Acquire: put_if("reconciler_lease", "singleton", ..., MustCreate). Atomic insert; fails if another replica holds the lease.
- Renew: put_if(..., MatchResourceVersion(v)). CAS update with TTL bump in payload.
- Release: delete_if(..., expected_version). CAS delete on graceful shutdown.
- Steal expired: read the lease, check TTL against updated_at_ms, then put_if with MatchResourceVersion. Conditional takeover after TTL expiry.
2. Gate reconciler loops on lease. spawn_watchers() starts a lease acquisition loop. Only the lease holder runs watch_loop() and reconcile_loop(). Non-holders run a standby loop that periodically attempts to acquire the lease. On lease loss (renewal failure), the holder stops its loops and re-enters standby.
3. Relaxing sync_lock. The process-local Mutex can be narrowed once the reconciler is gated by the lease. In single-replica mode (SQLite), it remains as-is. In HA mode (Postgres), CAS is the concurrency control and the Mutex is defense-in-depth within a single process.
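To make the four lease operations from item 1 concrete, here is a minimal in-memory sketch. It is not the persistence API: `MemStore`, `Lease`, `LeaseError`, and the method names are hypothetical stand-ins, and a HashMap with a version counter plays the role of the objects table and its resource version.

```rust
use std::collections::HashMap;

/// Hypothetical lease payload; field names are illustrative only.
#[derive(Clone, Debug, PartialEq)]
struct Lease {
    holder: String,
    updated_at_ms: u64,
    ttl_ms: u64,
}

#[derive(Debug, PartialEq)]
enum LeaseError {
    AlreadyHeld,     // create-only insert found an existing record
    VersionConflict, // expected version did not match
    NotExpired,      // takeover attempted on a live lease
    Missing,         // no lease record exists
}

const KEY: &str = "reconciler_lease/singleton";

/// In-memory stand-in for the objects table: key -> (version, lease).
#[derive(Default)]
struct MemStore {
    rows: HashMap<String, (u64, Lease)>,
}

impl MemStore {
    /// Acquire: create-only insert (the MustCreate case).
    fn acquire(&mut self, holder: &str, now_ms: u64, ttl_ms: u64) -> Result<u64, LeaseError> {
        if self.rows.contains_key(KEY) {
            return Err(LeaseError::AlreadyHeld);
        }
        let lease = Lease { holder: holder.to_string(), updated_at_ms: now_ms, ttl_ms };
        self.rows.insert(KEY.to_string(), (1, lease));
        Ok(1)
    }

    /// Renew: version-matched update that bumps the timestamp.
    fn renew(&mut self, version: u64, now_ms: u64) -> Result<u64, LeaseError> {
        let (v, lease) = self.rows.get_mut(KEY).ok_or(LeaseError::Missing)?;
        if *v != version {
            return Err(LeaseError::VersionConflict);
        }
        lease.updated_at_ms = now_ms;
        *v += 1;
        Ok(*v)
    }

    /// Release: version-matched delete.
    fn release(&mut self, version: u64) -> Result<(), LeaseError> {
        let current = self.rows.get(KEY).map(|(v, _)| *v).ok_or(LeaseError::Missing)?;
        if current != version {
            return Err(LeaseError::VersionConflict);
        }
        self.rows.remove(KEY);
        Ok(())
    }

    /// Steal expired: check the TTL, then overwrite and bump the version.
    fn steal_expired(&mut self, holder: &str, now_ms: u64) -> Result<u64, LeaseError> {
        let (v, lease) = self.rows.get_mut(KEY).ok_or(LeaseError::Missing)?;
        if now_ms < lease.updated_at_ms.saturating_add(lease.ttl_ms) {
            return Err(LeaseError::NotExpired);
        }
        lease.holder = holder.to_string();
        lease.updated_at_ms = now_ms;
        *v += 1;
        Ok(*v)
    }
}
```

In the real store, the read and the conditional write in the steal path are separate steps, so a takeover can still lose a race and must treat a version conflict as "someone else took it".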
Alternative Approaches Considered
Per-sandbox leases. One lease record per sandbox, O(N) records. Adds lease rebalancing, O(N) heartbeat writes, and complexity that isn't justified until sandbox counts reach the thousands. Can be introduced later as an evolution of the single-lease model without rework to the lease primitives.
Shard-based leases. Hash-partition sandboxes into K shards, one lease per shard. Reduces contention vs. per-sandbox but adds rebalancing complexity and partial-sweep logic. Overkill for initial HA.
DB-backed leases vs. Kubernetes leases. DB leases work everywhere (Postgres required for HA anyway). K8s leases would couple the gateway to in-cluster deployments. RFC 0001 and the architecture doc position SQLite as single-node and Postgres as multi-replica — DB-backed leases align with this.
Patterns to Follow
- CAS pattern: update_message_cas with expected_version=0 for server-driven internal operations is the established mutation pattern. Lease operations should use the same approach.
- ObjectType trait: The lease record should implement ObjectType with fn object_type() -> &'static str { "reconciler_lease" }.
- Background task pattern: spawn_session_reaper and spawn_relay_reaper show the established pattern for periodic background tasks (tokio::spawn + tokio::time::sleep loop). Lease renewal and standby acquisition should follow this.
- CAS concurrency tests: persistence/tests.rs spawns concurrent CAS updates and asserts exactly 1 succeeds and N-1 conflict. Lease acquisition tests should follow this pattern.
- Test harness: TestDriver / NoopTestDriver in compute/mod.rs and the test_runtime() helper provide an in-memory test environment for reconciler tests.
Proposed Approach
Introduce a single global reconciler lease stored as one record in the existing objects table, using the CAS primitives from PR #1292. Only the lease holder runs the watch and reconcile loops; other replicas run standby and attempt acquisition on a heartbeat cadence. Single-replica deployments (SQLite) skip the lease entirely and run the reconciler unconditionally as they do today. On graceful shutdown, the holder releases the lease explicitly to minimize failover gaps during rolling deployments.
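The holder/standby gating described here can be modeled as a small state machine. The sketch below is one possible shape, with hypothetical type and function names; it leaves out the actual task spawning and cancellation and only captures which lease outcomes start or stop the loops.

```rust
/// Hypothetical role of a replica with respect to the reconciler lease.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Role {
    Standby,
    Holder,
}

/// Outcomes of lease operations that drive role changes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LeaseEvent {
    Acquired,
    AcquireFailed,
    Renewed,
    RenewFailed,
    Released,
}

/// What the caller should do with watch_loop() and reconcile_loop().
#[derive(Clone, Copy, Debug, PartialEq)]
enum Action {
    StartLoops,
    StopLoops,
    NoChange,
}

/// Single transition: only a successful acquisition starts the loops, and
/// only a failed renewal or an explicit release stops them.
fn step(role: Role, event: LeaseEvent) -> (Role, Action) {
    match (role, event) {
        (Role::Standby, LeaseEvent::Acquired) => (Role::Holder, Action::StartLoops),
        (Role::Holder, LeaseEvent::RenewFailed) | (Role::Holder, LeaseEvent::Released) => {
            (Role::Standby, Action::StopLoops)
        }
        (r, _) => (r, Action::NoChange),
    }
}
```

A single-replica (SQLite) deployment would bypass this machine entirely and start the loops unconditionally.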
Scope Assessment
- Complexity: Low-Medium
- Confidence: High — uses existing CAS primitives, no new inter-replica communication needed
- Estimated files to change: 3-5
- Issue type: feat
Risks & Open Questions
- Lease TTL and failover speed. Too short = lease churn on transient partitions. Too long = reconciliation paused during replica failure. Recommend 30s TTL with 10s renewal (3 missed heartbeats = expire). Needs validation under realistic failure scenarios.
- Watch loop placement. Should only the lease holder consume the WatchSandboxes stream, or should all replicas consume it for in-memory index warming? If only the holder watches, non-holder replicas have stale indexes until the next gRPC read-through. If all replicas watch but only the holder writes mutations, non-holders get fresh indexes at no write cost.
- Replica identity. Each replica needs a stable identity for lease ownership. Options: hostname, pod name, random UUID generated at startup. UUID is simplest and avoids assumptions about the deployment environment.
- SQLite compatibility. HA requires Postgres (per architecture doc and RFC 0001). In single-replica mode (SQLite), the lease should be skipped entirely — the reconciler runs unconditionally as it does today.
- Graceful lease handoff. On replica shutdown, the lease should be released explicitly (delete_if) rather than waiting for TTL expiry. This minimizes the reconciliation gap during rolling deployments.
- Future evolution. The lease primitives (acquire, renew, release, steal-expired) are the same regardless of granularity. If per-sandbox or shard leases are needed later, the evolution path is: change the lease ID from "singleton" to "sandbox:{id}" or "shard:{n}" and scope the reconcile loop accordingly. No rework to the underlying CAS machinery.
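The TTL recommendation above (30s TTL, 10s renewal) can be checked with two small helpers. Both function names are hypothetical, and the expiry rule shown (elapsed time since updated_at_ms reaches the TTL) is one reasonable reading of the proposal, not a confirmed design.

```rust
use std::time::Duration;

/// Number of whole renewal intervals that fit inside the TTL, i.e. how many
/// consecutive heartbeats can be missed before the lease expires.
/// Returns None if the renewal interval is zero.
fn missed_heartbeats_before_expiry(ttl: Duration, renew_every: Duration) -> Option<u128> {
    ttl.as_millis().checked_div(renew_every.as_millis())
}

/// A lease counts as expired once at least `ttl` has elapsed since its last
/// update. saturating_sub tolerates a clock reading older than the update.
fn is_expired(updated_at_ms: u64, ttl: Duration, now_ms: u64) -> bool {
    u128::from(now_ms.saturating_sub(updated_at_ms)) >= ttl.as_millis()
}
```

Because every replica compares its own clock against a timestamp written by another, clock skew between replicas eats into the TTL margin. That is worth covering in the failure-scenario validation.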
Test Considerations
- Lease acquisition concurrency: Spawn N tasks attempting to acquire the singleton lease simultaneously. Assert exactly 1 succeeds (MustCreate) and N-1 get UniqueViolation. Follow the pattern in persistence/tests.rs CAS concurrency tests.
- Lease renewal and expiry: Test that renewal extends TTL, that expired leases can be stolen, and that active leases cannot be stolen.
- Gated reconciler: Test that a replica with the lease runs reconcile/watch loops and a replica without the lease does not mutate sandbox state.
- Failover simulation: Test lease expiry -> standby acquisition -> reconciler resumes on new holder.
- Graceful shutdown: Test that lease release on shutdown allows immediate takeover by standby.
- Single-replica mode: Test that SQLite deployments skip lease acquisition and run the reconciler unconditionally.
- Test levels: Unit tests for lease primitives, integration tests for gated reconciler.
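The acquisition-concurrency test can be prototyped without a database. The sketch below uses std threads and a Mutex-guarded map in place of tokio tasks and the real store, purely to keep it dependency-free. `try_acquire` and `race` are hypothetical helpers, and the create-only insert stands in for put_if with MustCreate.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Create-only insert into an in-memory map. Returns true for the single
/// caller that creates the record and false for everyone else.
fn try_acquire(store: &Mutex<HashMap<String, String>>, replica: &str) -> bool {
    let mut rows = store.lock().unwrap();
    if rows.contains_key("singleton") {
        return false;
    }
    rows.insert("singleton".to_string(), replica.to_string());
    true
}

/// Spawns `n` contenders against a fresh store and returns how many won.
fn race(n: usize) -> usize {
    let store: Arc<Mutex<HashMap<String, String>>> = Arc::new(Mutex::new(HashMap::new()));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let store = Arc::clone(&store);
            thread::spawn(move || try_acquire(&store, &format!("replica-{i}")))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|won| *won)
        .count()
}
```

The real test would swap the map for the Postgres-backed store and assert that the N-1 losers surface the store's conflict error rather than a boolean.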
Deferred Work
The following concerns are out of scope and will be tracked separately:
- Supervisor session ownership persistence — recording which replica owns a supervisor's gRPC stream so other replicas can discover it.
- Inter-replica session forwarding — forwarding exec, relay, and log streaming requests to the session-owning replica.
- Per-sandbox or shard-based lease evolution — if the single reconciler lease becomes a bottleneck at scale.
Created by spike investigation. Builds on PR #1292 (CAS optimistic concurrency). Use build-from-issue to plan and implement.