ClickBench Playground #904

Open

alexey-milovidov wants to merge 195 commits into main from playground-wip

Conversation

@alexey-milovidov
Member

No description provided.

alexey-milovidov and others added 30 commits May 12, 2026 19:59
WIP checkpoint. Lets visitors run SQL against any of the 80+ ClickBench
systems via a single-page UI, each isolated in a per-system Firecracker
microVM.

  - server/  aiohttp API: /api/systems, /api/state, /api/query,
             /api/admin/provision. Owns the per-system VM lifecycle,
             a 1-Hz CPU/disk/host-pressure watchdog, and a batched
             ClickHouse-Cloud logging sink (JSONL fallback).
  - agent/   stdlib HTTP agent that runs inside each VM and wraps the
             system's install/start/load/query scripts.
  - images/  scripts to build the base Ubuntu 22.04 rootfs + per-system
             rootfs/system-disk pair (200 GB sparse + 16/88 GB sized
             for the system's data format).
  - web/     vanilla JS SPA — system picker, query box, X-Query-Time /
             X-Output-Truncated rendering.

Smoke-tested: base rootfs boots under Firecracker, agent comes up in
~2 s, /health and /stats respond. Agent self-test on the host (no VM)
covers all 4 endpoints including 10 KB output truncation. ClickHouse
provisioning is in flight; see playground/docs/build-progress.md for
the running checkpoint.
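
For orientation, the server's API surface above might be wired up roughly like this sketch (aiohttp route wiring only; the handler bodies and the `vm_manager` object are illustrative assumptions, not the actual implementation):

  from aiohttp import web

  async def systems(request: web.Request) -> web.Response:
      # /api/systems: the catalog the UI renders in the system picker
      return web.json_response(sorted(request.app["registry"]))

  async def query(request: web.Request) -> web.Response:
      body = await request.json()
      # run_query is a placeholder; the real path also sets
      # X-Query-Time / X-Output-Truncated from the agent's response
      result = await request.app["vm_manager"].run_query(
          body["system"], body["sql"])
      return web.json_response(result)

  app = web.Application()
  app.add_routes([web.get("/api/systems", systems),
                  web.post("/api/query", query)])
  web.run_app(app)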

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A later `umount -lR` on the chroot's /dev was propagating through the
shared mount group and tearing down the host's /dev/pts, breaking sshd's
PTY allocation. `--make-rslave` keeps mount events flowing *into* the
chroot but blocks unmounts from leaking back to the host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A 16 GB guest snapshot.bin compresses to ~2 GB once we
  1) stop+start the system daemon (sheds INSERT-time heap arenas,
     buffers, fresh allocator pages),
  2) echo 3 > drop_caches (turns 3-5 GB of page cache into zero
     pages),
  3) zstd -T0 -3 --long=27 (parallel, big match window — most of
     the savings come from those zero pages).

Restart is skipped for in-process engines where stop/start is a
no-op AND the data lives in the process; wiping it would defeat
the whole point.

The host now keeps snapshot.bin.zst as the canonical artifact and
decompresses on demand right before /snapshot/load. snapshot.bin
itself is deleted after a successful restore + teardown.
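
Roughly, on the host side (helper names are illustrative, not the real vm_manager code):

  import subprocess

  def compress_snapshot(path: str) -> str:
      # zstd -T0 -3 --long=27: parallel, big match window; keeps the input,
      # so snapshot.bin can be removed later, after a verified restore
      subprocess.run(["zstd", "-T0", "-3", "--long=27", "-f", path],
                     check=True)
      return path + ".zst"

  def decompress_snapshot(zst_path: str) -> str:
      # run on demand right before /snapshot/load
      subprocess.run(["zstd", "-d", "--long=27", "-f", zst_path],
                     check=True)
      return zst_path[:-len(".zst")]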

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version threw away stdout/stderr from the pre-snapshot
stop/start cycle, so a silent failure (`sudo clickhouse start` failing
because the data dir was still locked by the dying daemon, etc.) left
us with a snapshot of a dead clickhouse-server — restored VMs then
returned "Connection refused (localhost:9000)" on every query and the
only way to recover was to manually delete the snapshot.

Capture stdout+stderr into the provision log so the failure mode is
visible via GET /provision-log, and refuse to mark PROVISION_DONE if
./check doesn't recover within the timeout. The host then sees /provision
return 500 and skips the snapshot step entirely.
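
The agent-side shape of this, sketched (helper names and the timeout value are illustrative):

  import subprocess, time

  def run_logged(cmd, log_file):
      p = subprocess.run(cmd, capture_output=True, text=True)
      log_file.write(f"$ {' '.join(cmd)}\n{p.stdout}{p.stderr}")
      log_file.flush()                  # visible via GET /provision-log
      return p.returncode

  def wait_for_check(log_file, timeout_s=300):
      deadline = time.monotonic() + timeout_s
      while time.monotonic() < deadline:
          if run_logged(["./check"], log_file) == 0:
              return True               # safe to mark PROVISION_DONE
          time.sleep(2)
      return False                      # /provision returns 500; host skips snapshot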

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PROVISION_DONE lives on the rootfs disk (/var/lib/clickbench-agent/),
which persists across VM cold-boots. So on the second provision after
the host deleted the snapshot files, the agent saw PROVISION_DONE
already set and returned "already provisioned" — but the daemon
itself wasn't running (cold boot, no clickhouse-server in systemd),
so the host snapshotted an empty VM and every restored query came back
with "Connection refused (localhost:9000)".

Two fixes:
  1. Agent: on every startup, if PROVISION_DONE is set, kick ./start
     in a background thread. start is idempotent for the systems that
     have a daemon, so it costs nothing when the daemon is already up
     (post-restore) and brings it up when the rootfs is being re-used
     across a cold reboot (sketched after this list).
  2. Host: when (re-)provisioning a system with no snapshot, drop the
     existing rootfs.ext4 so install/start/load run fresh. The
     system.ext4 (which holds ~14 GB of pre-staged dataset) is preserved.
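
The startup kick from fix 1, sketched (marker path follows the message; the script directory and thread wrapper are assumptions):

  import os, subprocess, threading

  MARKER = "/var/lib/clickbench-agent/PROVISION_DONE"
  SYSTEM_DIR = "/opt/clickbench/system"   # assumed location of the per-system scripts

  def _kick_daemon_if_provisioned():
      if not os.path.exists(MARKER):
          return
      # ./start is idempotent for daemon systems: a no-op when the daemon
      # is already up, a real start after a cold reboot of the rootfs
      threading.Thread(
          target=subprocess.run, args=(["./start"],),
          kwargs={"cwd": SYSTEM_DIR}, daemon=True,
      ).start()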

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cloud image ships hostname=ubuntu but /etc/hosts only maps
'localhost' to 127.0.0.1. Every sudo invocation inside the VM then
tries to reverse-resolve 'ubuntu' against the network — which has no
DNS after the snapshot drops internet — and pays the ~2 s resolver
timeout. With several sudos per ./query, that's a multi-second floor
on every query, visible in the firecracker log as repeated
'sudo: unable to resolve host ubuntu: Name or service not known'.

Mapping ubuntu to 127.0.0.1 short-circuits the lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mid-snapshot checksum-mismatch I attributed to "stopping the
daemon mid-merge" was actually FS corruption: KVM pauses the vcpus
the moment we call /vm Paused, and any ext4 writeback that was in
flight at that instant gets captured by the snapshot as half-flushed.
On restore the page cache references on-disk blocks that never landed,
and the next read sees a torn write.

Fix:
  1. Drop the pre-snapshot stop/start. Killing ClickHouse at any
     point never corrupts on-disk MergeTree data — only an unflushed
     FS can.
  2. Add a /sync endpoint to the agent and call it from the host
     right before /vm Paused, so all dirty pages have hit virtio-blk
     before KVM freezes the vcpus.
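
On the agent side the /sync handler can be as small as this sketch (the real agent is a stdlib HTTP server; whether the endpoint is a POST is an assumption):

  import os
  from http.server import BaseHTTPRequestHandler

  class AgentHandler(BaseHTTPRequestHandler):
      def do_POST(self):
          if self.path == "/sync":
              os.sync()          # kernel flushes all dirty pages to virtio-blk
              self.send_response(200)
              self.end_headers()
              self.wfile.write(b"synced\n")
          else:
              self.send_error(404)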

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that the host /syncs the FS before pausing the vcpus, the snapshot
captures consistent on-disk state regardless of when the daemon exits
(MergeTree's on-disk format is durable under arbitrary process exit;
only an unflushed *filesystem* corrupts it). So we can shut the daemon
down here to evict its private heap (merge thread arenas, query cache,
mark cache, uncompressed cache, ingest buffers) and snapshot what's
left — mostly zero-fill RAM, which zstd compresses ~300:1.

Restore path is unchanged: _kick_daemon_if_provisioned at agent
startup brings the daemon back up on every cold restore. First query
in a restored VM pays a 1-2 s daemon-start cost instead of carrying
8-12 GB of memory in every snapshot.

In-process engines (chdb, polars, …) keep all state in RAM and have
no daemon to stop; for them, has_daemon is false and we skip the
stop step, falling back to drop_caches alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes for the small-snapshot path:

  1. Pass init_on_free=1 in the guest kernel cmdline. Linux normally
     leaves freed page frames with whatever bytes were last written to
     them, so the post-`clickhouse stop` free pool was ~10 GB of stale
     daemon heap and Firecracker's snapshot dump compressed only ~3:1.
     init_on_free=1 zeros every page as it goes onto the free list, so
     the snapshot's RAM region is genuinely zero-filled and zstd hits
     ~300:1.

  2. Add `_ensure_daemon_started` at the top of the agent's /query
     handler. After a snapshot restore (taken with the daemon stopped),
     the restored memory has no daemon process and `localhost:9000`
     refuses connections. The cold-boot `_kick_daemon_if_provisioned`
     only fires on actual cold boots, not on snapshot resumes, so we
     need an explicit check at query time. Lock-protected so concurrent
     /query requests don't try to ./start the daemon twice; idempotent
     and free once the daemon is up.
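
A sketch of that query-time check (the port and script path are system-specific assumptions):

  import socket, subprocess, threading

  _start_lock = threading.Lock()

  def _ensure_daemon_started(port: int = 9000):
      # lock so two concurrent /query requests don't both run ./start
      with _start_lock:
          try:
              socket.create_connection(("127.0.0.1", port), timeout=1).close()
              return               # daemon already up: free fast path
          except OSError:
              pass
          subprocess.run(["./start"], cwd="/opt/clickbench/system", check=True)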

Also dropped the userspace _zero_free_ram hack — init_on_free does
it natively at no userspace cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end working with a 35 MB snapshot (16 GiB raw, ~470x ratio):
SELECT COUNT(*) returns 99997497 cleanly, GROUP BY URL produces the
expected top-N without any checksum errors, output truncation caps a
244 KB result at 10 KB with the right header set.

Cold path (snapshot restore + daemon start): ~10 s.
Warm path (live VM): subsecond on COUNT / MIN-MAX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four fixes:

1. Shared read-only datasets disk. Previously each per-system rootfs
   embedded its own copy of hits.parquet / hits.tsv / hits.csv (14-75 GB
   each), so the catalog needed ~1-2 TB of redundant dataset storage on
   the host. Build one shared datasets.ext4 instead, attach to every VM
   read-only at LABEL=cbdata, and have the agent copy the bytes the
   system actually needs from /opt/clickbench/datasets into the writable
   per-system disk at provision time only. The agent uses
   os.copy_file_range so the in-VM copy is kernel-side, not bounced
   through userspace.

2. Golden-disk snapshot/restore. Firecracker's snapshot.bin only saves
   memory; the disk image referenced by the in-memory state is the
   live file. If anything modifies it between snapshots (background
   merges, log writes, /tmp churn) the next /snapshot/load points at
   the new disk while replaying old memory references. We were getting
   away with this because clickhouse-server happens to be tolerant,
   but it's fragile. Now /snapshot also renames the working disks into
   `*.golden.ext4`, and /restore-snapshot clones the goldens back into
   fresh working copies via `cp --sparse=always`. Every restore starts
   from the exact disk state captured at snapshot time.

3. Bound per-system disk builds and provisions via asyncio.Semaphore
   (PLAYGROUND_BUILD_CONCURRENCY=6, PLAYGROUND_PROVISION_CONCURRENCY=32)
   so kicking off 98 systems at once doesn't thrash the host NVMe or
   rate-limit Ubuntu mirrors (see the sketch after this list).

4. Re-enabled `ursa` in the playground catalog (was incorrectly in the
   _EXTERNAL exclude list; it runs locally).
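
The bounding from item 3, sketched (the two awaited helpers are placeholders for the heavy build and provision phases):

  import asyncio, os

  _build_sem = asyncio.Semaphore(
      int(os.environ.get("PLAYGROUND_BUILD_CONCURRENCY", "6")))
  _provision_sem = asyncio.Semaphore(
      int(os.environ.get("PLAYGROUND_PROVISION_CONCURRENCY", "32")))

  async def provision_system(name: str):
      async with _build_sem:               # at most 6 disk builds hit the NVMe at once
          await build_system_disks(name)   # placeholder for build-system-rootfs.sh etc.
      async with _provision_sem:           # at most 32 in-VM install/start/load runs
          await run_install_start_load(name)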

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous design copied dataset files from the read-only cbdata
mount into the per-VM writable cbsystem disk on every provision —
14 GB for parquet systems, 75 GB for tsv/csv. That worked but was
redundant: the data is already on a read-only mount, the only reason
we copied was that ClickBench's load scripts do `sudo mv` and
`sudo chown` on the dataset files.

Use overlayfs instead:
  lowerdir = /opt/clickbench/datasets_ro   (RO, the shared image)
  upperdir = /opt/clickbench/system_upper  (RW per-VM disk with scripts)
  merged at /opt/clickbench/system

The system's load runs at cwd=/opt/clickbench/system. It sees scripts
+ dataset files in one tree. When it `mv`s or `chown`s a file from
the lower, overlayfs does a lazy copy-up: only the file's bytes get
materialised into the upper, and only when the script actually
mutates it. Most ClickBench load scripts `rm` the dataset file after
INSERT, which becomes a whiteout in the upper — a few bytes of
metadata, not a 75 GB copy.

Saves ~1-2 TB across the catalog on host disk (no per-system copies)
*and* eliminates the per-provision in-VM stage. Only cost: small
metadata to maintain the overlay (kilobytes).
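
The mount itself, sketched from the agent's point of view (the workdir path is an assumption; overlayfs requires one on the same filesystem as the upper):

  import subprocess

  LOWER = "/opt/clickbench/datasets_ro"    # shared RO datasets image
  UPPER = "/opt/clickbench/system_upper"   # per-VM writable disk with the scripts
  WORK = "/opt/clickbench/system_work"     # assumed workdir, same FS as the upper
  MERGED = "/opt/clickbench/system"

  subprocess.run(
      ["mount", "-t", "overlay", "overlay",
       "-o", f"lowerdir={LOWER},upperdir={UPPER},workdir={WORK}", MERGED],
      check=True)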

For partitioned parquet, the source files live in
datasets_ro/hits_partitioned/ but the load globs cwd/hits_*.parquet,
so the agent creates symlinks in the upper pointing at the lower —
~100 symlinks, a few hundred bytes total.

Also: make build-datasets-image.sh idempotent. The 173 GB rsync
into datasets.ext4 only needs to run when the source dir's mtime
has changed; otherwise the cached image is reused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for the parallel-provisioning-98-systems path:

1. The _build_sem and _provision_sem fields were defined but never
   acquired — `provision-all.sh` kicked all 98 provisions at once and
   they each independently spawned build-system-rootfs.sh, which
   tried to write ~8 GB of rootfs base content × 98 in parallel
   (~780 GB of writes against a single NVMe). Disk got saturated and
   nothing finished. Use `async with self._build_sem:` and `async
   with self._provision_sem:` around the heavy phases.

2. build-system-rootfs.sh now clones the base image at block level
   with `cp --sparse=always` and resizes the filesystem to 200 GB
   in place, instead of mkfs.ext4 + mount + rsync-of-base-contents.
   The block-level clone touches only the ~2 GB of non-zero blocks
   in the base, vs. the rsync approach traversing the mounted base
   and writing every file individually. Per-system rootfs build
   goes from ~30 s to ~3 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the agent created symlinks in the overlay's upper for
partitioned parquet (hits_partitioned/* -> upper/hits_*.parquet)
because the source directory was nested. That fell apart on
clickhouse's load: `mv hits_*.parquet /var/lib/clickhouse/user_files/`
moved the symlinks, and the subsequent `chown` followed them through
to the read-only datasets disk and got `Read-only file system`.

Flatten the dataset image so all 100 partitioned parquet files sit
at the root next to hits.parquet / hits.tsv / hits.csv. The overlay
then exposes them directly at /opt/clickbench/system as real files,
no symlinks involved. clickhouse's `mv` becomes a real copy-up (and
the source becomes a whiteout in upper), and the subsequent `chown`
operates on a regular file on the rootfs — works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2 GB cap on the per-VM system disk was a holdover from the
in-VM-copy era, when system.ext4 only held scripts + staged data.
Once we switched to overlay-with-RO-datasets, system.ext4 also holds
the overlay's upperdir + workdir — i.e. every byte the load script
writes lands there, including the database's own files. ClickHouse
writes ~5 GB of MergeTree parts, DuckDB ~6 GB, Hyper ~10 GB; chown
on partitioned parquet copies up another 14 GB. 2 GB was always
going to overflow.

Match the rootfs at 200 GB (apparent). The file is sparse: truncate
reserves the size but allocates no physical blocks, mkfs.ext4 writes
~50 MB of metadata, and the snapshot/restore path uses
`cp --sparse=always` so only the bytes the VM actually wrote land
on the host disk. Light systems (chdb, sqlite, ...) cost the host
near nothing; heavy ones (tidb at ~137 GB, postgres-indexed ~80 GB)
fit without hitting ENOSPC mid-load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each per-system rootfs build was running `e2fsck -fy` on its clone
before `resize2fs`. With 98 systems and ~5 s per fsck of a 200 GB
sparse file, that's ~8 minutes of pure disk thrash during catalog
build — and entirely redundant: the base ext4 is built fresh and
never mounted dirty, so the bit-for-bit clone is clean too.

Move the single fsck to the end of build-base-rootfs.sh (where it
has all the host's I/O to itself) and skip it in the per-system
loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The base ext4 used to be built at 8 GB and each per-system rootfs
clone ran resize2fs to grow to 200 GB. resize2fs on a 200 GB file is
disk-heavy (it has to write group descriptor and bitmap metadata for
every additional block group), and we did it 98 times in parallel.

Build the base directly at 200 GB sparse with
lazy_itable_init=1,lazy_journal_init=1. mkfs writes ~50 MB of
superblock + GDT material upfront and defers the rest to lazy
background init, so the image file's physical footprint is unchanged
from the previous 8 GB layout (~1.8 GB). Per-system clones then need
only `cp --sparse=always`: no resize2fs, no e2fsck, ~1 second each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`umount` already syncs the filesystem being unmounted. The
host-wide `sync` calls we were making first flush every dirty page on
*every* mount — under 98-way parallel builds, each build's sync
blocked on every other build's writeback, multiplying the wall-clock
cost. Drop them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…olden

When clickhouse's load `mv hits.parquet /var/lib/clickhouse/user_files/`
(or any cross-FS move) copies the 14-75 GB dataset into the writable
per-VM disk and then `rm`'s it after INSERT, ext4 marks those blocks
free but the underlying virtio-blk file still carries the bytes.
`cp --sparse=always` on the golden then preserves them as random
data, so the per-system snapshot for a parquet engine carried a full
extra copy of the dataset that the load already discarded.

Adding `fstrim /opt/clickbench/sysdisk` and `fstrim /` before the
host's snapshot makes the guest issue DISCARD for free blocks; the
host loop driver responds by punching holes in the sparse backing
file (linux loop devices advertise discard with PUNCH_HOLE since 4.x,
which firecracker's virtio-blk passes through). The golden then holds
only the bytes the engine actually keeps.
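
In agent terms this is roughly (mountpoints as named above; -v only logs how much was trimmed):

  import subprocess

  # Issue DISCARD for every free extent so the host loop device can punch
  # holes in the sparse backing file before the golden copy is taken.
  for mountpoint in ("/opt/clickbench/sysdisk", "/"):
      subprocess.run(["fstrim", "-v", mountpoint], check=True)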

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several systems' load scripts do `sudo mv hits_*.parquet
/var/lib/<engine>/user_files/` or `sudo cp hits.csv .../extern/`
followed by `chown` to the daemon's user. The mv/cp copies 14-75 GB
of data the daemon reads once during INSERT and we delete right
after — a complete waste of bytes on disk and time on the wire.

Replace with `ln -s` + `chown -h` where the daemon's user-files dir
is on a different filesystem from the dataset. `chown -h` chowns
the symlink itself rather than following into the (often read-only)
original; the underlying dataset is mode 644 anyway, so daemon
processes can read through the symlink as their own user.

Systems updated: clickhouse, clickhouse-tencent, pg_clickhouse,
kinetica, oxla, ursa, arc, cockroachdb.

Motivated by the ClickBench playground (Firecracker microVM service)
where the dataset is mounted read-only and shared across all VMs;
the copy step was the dominant cost on parquet/csv-format systems
and pulled 14 GB into the per-VM snapshot golden disk unnecessarily.
The change is also benign for the regular benchmark — daemons still
read the same bytes, just through a symlink.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8080 is the default HTTP admin port for cockroach, the spark UI,
trino, presto, druid, and a long tail of other JVM-based databases
in the catalog. Our in-VM agent was binding it first, so when their
./start ran the daemon failed with "bind: address already in use"
and the whole provision came down with a port conflict.

Pick 50080 — uncommon enough that no ClickBench engine in the
current catalog wants it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several systems' load scripts call ../lib/download-hits-* — e.g.
doris-parquet expects `download-hits-parquet-partitioned <doris_be_dir>`
to materialize the dataset in a specific subdirectory of the BE's
working tree. Previously we copied the lib tree into /opt/clickbench/
system/_lib, but ../lib from the system dir resolves to
/opt/clickbench/lib, not /opt/clickbench/system/_lib.

Put 4 stub scripts (one per format) at /opt/clickbench/lib in the
base rootfs. Each one symlinks from the shared RO dataset mount into
the target directory — same interface as upstream's wget-based
scripts, but instant and zero-byte-on-disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The firecracker-ci kernel is minimal: it boots fine, but Docker
fails to start because it lacks iptables/nat, br_netfilter, veth and
other modules that Docker needs to set up its bridge network. That
killed ~6 Docker-using systems (byconity, cedardb, citus, cloudberry,
greenplum) in the parallel provisioning run.

Swap in Ubuntu's `linux-image-generic` kernel (the same one Ubuntu
ships for cloud KVM guests). It has every Docker-required module
plus a much richer driver set, while still booting under Firecracker.
Trade-off: it lacks CONFIG_IP_PNP so the kernel's `ip=` boot arg is
ignored. Add a tiny clickbench-net.service that parses `ip=` from
/proc/cmdline and applies it to eth0 at boot; agent.service waits
for it. The same rootfs continues to work with the firecracker-ci
kernel (the systemd unit's `ip addr add` is idempotent — kernel-set
IPs are already there).

Verified: smoke-boot agent answered in 3 s on the new kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Ubuntu generic kernel builds overlay, veth, br_netfilter,
iptable_nat, nf_conntrack and friends as loadable modules, not
built-in. Without /lib/modules/<ver>/ in the rootfs the kernel can't
load them at runtime — the immediate symptom was `Failed to mount
/opt/clickbench/system` (overlayfs not available) and Docker still
failing to start (no br_netfilter/iptable_nat).

Drop the linux-modules-7.0.0-15-generic deb into the chroot,
`dpkg --unpack` it into the rootfs, run `depmod`, and pre-load the
critical modules via /etc/modules-load.d/clickbench.conf so they're
ready before any service starts. The image grew from 1.8 to 2.0 GB
physical (200 GB apparent) — modules add ~200 MB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dpkg --unpack` records the modules package in dpkg's status DB
without configuring it; subsequent `apt-get install` calls inside
every per-system VM see an unconfigured package with unmet
dependencies and bail with "Unmet dependencies. Try 'apt --fix-broken
install'". That broke ~10 systems in the previous parallel run.

Switch to `dpkg-deb -x` — extracts the data tarball into the rootfs
without touching dpkg's DB. apt sees a normal system with all modules
in /lib/modules/, and the kernel can load them at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot of the state after the 10th parallel run. Documents:
  - what works end-to-end (microVM lifecycle, shared RO datasets disk,
    per-restore disk hygiene, fstrim before snapshot, Ubuntu kernel
    with modules)
  - bug fixes pushed during the run (port 8080 conflict, mv→ln -s,
    download-hits stubs, build/provision semaphores, redundant fsck/
    resize2fs/sync removed, clickbench-net.service, kernel module
    preload, 200 GB system disk for heavy systems)
  - failure categories observed
  - what's left for the long tail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent failures observed in the 10th parallel run:

1. The 7 pg_* systems (pg_clickhouse, pg_duckdb*, pg_ducklake,
   pg_mooncake) all failed to spawn firecracker with
   `Firecracker panicked at main.rs:296: Invalid instance ID:
   InvalidChar('_')`. Firecracker's --id rejects underscores. Map
   `_` to `-` for the fc id (the system name itself stays intact).

2. duckdb / chdb-dataframe / duckdb-dataframe OOM-killed at 16 GB
   ("Out of memory: Killed process 578 (duckdb) anon-rss:15926176kB").
   DuckDB and chdb hold the full dataset in memory during INSERT;
   16 GB just isn't enough for the 100 M row hits set. Bump default
   VM memory to 32 GB. KVM allocates lazily, so 98×32 GB on the host
   is fine.

3. monetdb's install fails with `$USER: unbound variable`. systemd's
   default service env has no USER/LOGNAME. Stamp them as root in
   clickbench-agent.service so subprocess.run inherits them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickBench: fix elasticsearch load.py bytes/str mix

VM tweaks for the long tail of failures:
  - chdb-dataframe / duckdb-dataframe materialize the full hits dataset
    in process memory and need >32 GB. Default to 48 GB.
  - Druid / Pinot / similar JVM stacks take 5-10 min to come up
    (Zookeeper → Coordinator → Broker → Historical, in sequence). The
    agent's 300 s check-loop wasn't enough; widen to 900 s.

elasticsearch/load.py: gzip.open in mode='rt' returns str docs, but
bulk_stream yields bytes for ACTION_META_BYTES and str for the doc.
requests.adapters.send() calls sock.sendall() on the mixed iterable
and crashes with `TypeError: a bytes-like object is required, not
'str'`. Open in 'rb' so docs are bytes — matches the rest of the
generator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chdb-dataframe, duckdb-dataframe, polars-dataframe, daft-parquet,
daft-parquet-partitioned load the whole hits dataset into a single
in-process DataFrame. Observed peak RSS is 80-100 GB on the
partitioned parquet set — even though KVM allocates lazily,
sustaining that working set for shared use isn't feasible. Disable
them in the registry rather than bump RAM for everyone.

Revert the default per-VM RAM cap to 16 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
duckdb-memory's load OOM'd at 16 GB anon-rss — it's the same RAM-
resident model as duckdb-dataframe/chdb-dataframe, just packaged as
its own ClickBench entry. Add to the disabled-systems list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 30 commits May 14, 2026 15:32
mongosh routes console.error() through its own log formatter rather
than to process.stderr the way Node REPL does, so the elapsed time
the eval block was printing never reached the agent's
_extract_script_timing(stderr) parser. The UI's Time: column was
empty for every mongo query.

Wrap the mongosh invocation in shell-side date arithmetic and emit
the seconds to stderr ourselves.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous attempt set --memory=256g --memory-swap=-1 --memory-swappiness=100,
but on cgroup v2 the swappiness flag is silently discarded and any
--memory cap creates a hard cgroup ceiling that the kernel will OOM
on regardless of swap. Let Umbra run with no docker memory cgroup
and rely on the host kernel + 256 GiB swap drive.

Also raise vm.max_map_count to 1048576 — Umbra issues many small
mmaps for its memory-mapped storage and a 100M-row COPY blows past
the 65530 default well before any OOM-killer fires.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… binary

The trino:455 image ships no /usr/bin/find, so the previous
'find /usr/lib/trino -name "*.jar"' classpath collector silently
returned empty and javac failed with 'package com.amazonaws.auth
does not exist'. Use a brace-glob over the two specific HDFS-plugin
jars (aws-java-sdk-core and hadoop-apache) and match either the
legacy 'com.amazonaws_' / 'io.trino.hadoop_' name prefix used by
older Trino builds or the bare modern name.

Tested: javac produces S3AnonymousProvider.class against
  /usr/lib/trino/plugin/hive/hdfs/aws-java-sdk-core-1.12.770.jar
  /usr/lib/trino/plugin/hive/hdfs/hadoop-apache-3.3.5-3.jar

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
omnisci/core-os-cpu:v5.10.2 ships with an empty
allowed-import-paths, so the load script's
  COPY hits FROM '/tmp/hits.csv'
fails with 'File or directory path "/tmp/hits.csv" is not
whitelisted.' Drop an omnisci.conf with [/tmp/] on the allowlist
into heavyai-storage before launching the container — the
startomnisci wrapper picks it up automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tursodb has been panicking partway through .import:
  thread 'main' panicked at core/storage/sqlite3_ondisk.rs:818:5:
  assertion failed: !*syncing.borrow()
  note: run with `RUST_BACKTRACE=1` environment variable ...
The note speaks for itself. Set RUST_BACKTRACE=1 so the panic line
in the provision log (and any UI-facing panic from /query) ships
with a call stack for the upstream bug report.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Engines like Elasticsearch, Quickwit, Parseable, Druid return raw
JSON for every query, which currently lands in the output pane as
a single 200-char unwrapped line. If the body is a parseable JSON
object or array, re-emit it with 2-space indentation.

Cheap pre-filter (first non-whitespace byte must be '{' or '[')
keeps us from feeding 14 GB count(*) results through JSON.parse.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SHOW BACKENDS TSV columns are
  1 BackendId  2 IP  3 HeartbeatPort  4 BePort  5 HttpPort
  6 BrpcPort   7 LastStartTime  8 LastHeartbeat  9 Alive  ...
We were inspecting column 10 (SystemDecommissioned), which is always
"false" once the BE is registered — so the wait loop in ./start
timed out even when the backend was alive and serving.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The output cap was raised to 256 KB (CLICKBENCH_OUTPUT_LIMIT, enforced
inside the in-VM agent), but README.md and build-progress.md still
named '10 KB' and the host-side config still carried an unused
output_limit_bytes field with a 10 * 1024 default. Align the docs to
reality and remove the dead config field (plus the _env_bytes helper
that only fed it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the per-VM _query_lock. Per-system ./query scripts are
already careful with scratch state (use \$\$ / mktemp; redirect to
sockets the daemon owns) and a quick audit shows no remaining fixed
/tmp/<name> paths. Engines whose runtime client takes an exclusive
file lock (embedded DuckDB on hits.db, ...) will fail one of two
concurrent requests with their normal lock error — that's visible
to the user, and the right answer at the engine level is server-mode
or per-connection databases. /provision keeps its own lock.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The file was a snapshot of what was wired up early in the playground
bring-up. The real source of truth is the code + README; everything
else here has drifted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A point-in-time write-up of the first parallel-provision run; the
playground has moved on (snapshot/restore overhaul, per-VM swap,
btime-watcher agent, sysdisk overrides, ...) and the report is no
longer accurate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ap.sql

The SQL file used to take freshly-rotated writer/reader passwords +
the writer's IP as substitution parameters, but those statements
were moved into clickhouse_bootstrap.py (which generates the
passwords from a state file). The header comment in the SQL still
listed the three parameters; only {db:Identifier} is left.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The reader user is created in ClickHouse with sha256_hash of the
empty string, so clients authenticate with just the username and no
password. The Credentials.reader_password field was a permanent
empty string fed straight into aiohttp.BasicAuth(_, "") which is
equivalent to BasicAuth(_). Remove the field; pass only the user.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
clickhouse-web ATTACHes the hits table to a remote web disk pointed
at https://clickhouse-public-datasets.s3.amazonaws.com/web/ —
nothing is downloaded during ./load, parts stream on demand at
query time, with /dev/shm/clickhouse/ as a local cache. Drop it
from the _EXTERNAL exclusion and grant DATALAKE_FILTERED so the
SNI-restricted proxy lets the S3 calls through post-snapshot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, kinetica

databend, kinetica already have install/start/check/load/query/stop;
just drop them from the _EXTERNAL exclusion. Both run as self-hosted
binaries / docker images.

firebolt + parquet variants only had run.sh + benchmark.sh (the
monolithic format), so add per-step scripts wrapping the
ghcr.io/firebolt-db/firebolt-core:preview-rc docker image:

  install/  docker pull
  start/    docker run with memlock 8 GiB + seccomp unconfined; loop
            on SELECT 'firebolt-ready' until the engine returns the
            sentinel (firebolt-core's HTTP port answers immediately
            but returns 'Cluster not yet healthy' at HTTP 200 until
            the engine threads have warmed)
  check/    SELECT 1
  load/     drop+create database clickbench, POST create.sql
            (variant-specific: firebolt INSERTs into a managed
            table, firebolt-parquet keeps the external table,
            firebolt-parquet-partitioned uses the parquet glob)
  query/    POST query to /?database=clickbench&output_format=JSON_Compact;
            parse .statistics.elapsed for X-Query-Time
  stop/     docker container stop

Each benchmark.sh now exports BENCH_DOWNLOAD_SCRIPT so
build-system-rootfs.sh stages hits.parquet (firebolt,
firebolt-parquet) or hits_*.parquet (firebolt-parquet-partitioned)
on the system disk.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…roxy + DNS

Fixes the security advisories from the review pass:

1. aiohttp static handler: drop follow_symlinks=True. GHSA-5h86-8mv2-jq9f
   was a path-traversal in the static handler reachable only when symlinks
   were followed. The repo's web/ tree has no symlinks anyway, so this
   is pure attack-surface reduction.

2. TRUSTED_INTERNET set removed. clickhouse{,-parquet,-parquet-partitioned}
   and chdb{,-parquet,-parquet-partitioned} no longer get unrestricted
   internet at query time — they all run through the SNI-allowlist
   proxy now. A user SQL that asked clickhouse-client to fetch
   http://169.254.169.254/... can no longer reach the EC2 metadata
   service or any RFC1918 destination; only the S3 hosts in
   sni_proxy.DEFAULT_ALLOW survive.

3. SNI proxy / local DNS resolver bound to internal traffic only.
   New net.setup_host_firewall() installs INPUT rules accepting
   8443/8080/53 only from the 10.200.0.0/16 TAP CIDR and loopback,
   then DROP for anything else. Called once at server startup.
   Without these rules the proxy was an open, unauthenticated S3
   allowlist relay reachable from the public internet.

4. DNS via local resolver, UDP only. enable_filtered_internet now
   REDIRECTs the VM's UDP/53 to the host's local resolver and
   DROPs TCP/53 outright (no big-payload exfiltration channel via
   port 53). The previous ACCEPT-and-forward path is gone; the
   POSTROUTING MASQUERADE that supported it is no longer needed
   either since the SNI proxy opens its own outbound socket.

5. /api/admin/provision/{name} restricted to loopback callers.
   It re-runs install/start/load — can take hours per system — so
   anonymous internet callers triggering it would be a trivial DoS
   and lateral-movement risk. Enforced with a peer-IP check; behind a
   reverse proxy the proxy itself is the peer, which is fine (the proxy
   is part of the admin trust boundary).
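
The loopback guard from item 5, sketched for aiohttp (helper name illustrative):

  from aiohttp import web

  def _require_loopback(request: web.Request) -> None:
      peer = request.transport.get_extra_info("peername")
      host = peer[0] if peer else ""
      if host not in ("127.0.0.1", "::1"):
          raise web.HTTPForbidden(text="admin endpoints are loopback-only\n")

  async def admin_provision(request: web.Request) -> web.Response:
      _require_loopback(request)
      name = request.match_info["name"]    # /api/admin/provision/{name}
      ...                                  # kick off install/start/load for `name`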

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Config field was advertised as a "concurrent live VMs cap"
but nothing in vm_manager / monitor / main ever read it. Drop the
dataclass field, the _env_int default, and the README row.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
aiohttp:
  - Add startup assertion: aiohttp >= 3.10 (covers GHSA-5h86-8mv2-jq9f
    static-handler path traversal fixed in 3.9.2, and the request-
    smuggling fixes in 3.9.4 / 3.10.x). Already true on the running
    host (Ubuntu's python3-aiohttp ships 3.13.3), but the assertion
    catches a future install on a stale image.
  - Add playground/requirements.txt with the pin for pip-based setups.

systemd unit:
  - Drop in ProtectSystem=full, ProtectHome=read-only,
    ProtectKernelTunables/Modules/ControlGroups/Clock,
    PrivateTmp, RestrictAddressFamilies, LockPersonality,
    RestrictRealtime, RestrictNamespaces.
  - Explicit ReadWritePaths to /opt/clickbench-playground +
    ~/.cache (Python bytecode).
  - Comments explain what we DON'T set (NoNewPrivileges /
    RestrictSUIDSGID would break sudo, ProtectSystem=strict would
    break the privileged children, PrivateNetwork / PrivateDevices
    would break TAP + /dev/kvm).

Rate limiting:
  - In-memory per-source-IP sliding-window counters on /api/query
    and /api/warmup: 200 req/min and 3000 req/hour. Returns 429 with
    Retry-After when exceeded. Both endpoints are unauthenticated;
    bound the damage a single bad actor can do (snapshot-restore
    spam, heavy-query loops). X-Forwarded-For honored for the
    leftmost hop if a reverse proxy is in front.
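
A sketch of the sliding-window bookkeeping (names follow the message; the caller turns a False into a 429 with Retry-After):

  import time
  from collections import defaultdict, deque

  _rate_hits = defaultdict(deque)          # source IP -> timestamps of recent requests
  _LIMITS = ((60, 200), (3600, 3000))      # (window seconds, max requests)

  def allow(ip: str) -> bool:
      now = time.monotonic()
      hits = _rate_hits[ip]
      hits.append(now)
      while hits and now - hits[0] > 3600:  # keep only the last hour
          hits.popleft()
      return all(sum(1 for t in hits if now - t <= window) <= limit
                 for window, limit in _LIMITS)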

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Honoring XFF without an authenticated reverse proxy in front lets
any caller rotate the header value to forge a fresh IP for every
request and bypass the bucket entirely. Drop it.

If a reverse proxy is added later, that proxy is the trust boundary
and its operator should either terminate the rate-limit there or
extend this function to honor XFF only when the peer IP is the
proxy's address.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DNS / dnsmasq:
  - install-firecracker.sh installs and configures dnsmasq on
    every non-loopback host address (port 53 UDP/TCP). The host's
    systemd-resolved stays put on 127.0.0.53. iptables PREROUTING
    REDIRECT for VM UDP/53 lands on a real listener now; before
    this commit the host had no resolver bound to 10.200.x.1:53
    and every VM DNS lookup just timed out (manifested as
    'Not found address of host' from ClickHouse url() calls).
  - net.setup_host_firewall hardens further: TCP/53 in INPUT is
    loopback-only now (was internal-CIDR + loopback). VMs are
    UDP-only for DNS at every layer.

Rate limiter:
  - Add a bulk eviction sweep: when _rate_hits grows past 4096
    entries, drop IPs whose newest hit is > 1h old (or whose
    deque is empty). The previous code only checked for empty
    deques, so one-shot IPs with a single in-window timestamp
    accumulated forever. Sweep is amortized O(1) per request.

clickhouse-web:
  - ClickHouse rejects filesystem-cache paths outside
    /var/lib/clickhouse/caches/ (BAD_ARGUMENTS at CREATE TABLE).
    Move the cache from /dev/shm/clickhouse to
    /var/lib/clickhouse/caches/web. install + create.sql updated
    together so the chown lands on the right path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ClickHouse rejects any filesystem-cache path outside
/var/lib/clickhouse/caches/ at CREATE TABLE time, but we still want
the actual bytes in tmpfs — cold queries pull ~1 GB on first run
and we'd rather not touch the SSD. Hand the engine a path that
satisfies its prefix check (.../caches/web) but is itself a
symlink into /dev/shm/clickhouse. ClickHouse only validates the
configured string lexically; it doesn't canonicalise the target.
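
Sketched as the install-time step (paths as described above; the real change lands in the system's install script and create.sql together):

  import os

  os.makedirs("/dev/shm/clickhouse", exist_ok=True)
  os.makedirs("/var/lib/clickhouse/caches", exist_ok=True)
  # The configured cache path passes ClickHouse's lexical prefix check,
  # but the bytes land in tmpfs through the symlink.
  if not os.path.islink("/var/lib/clickhouse/caches/web"):
      os.symlink("/dev/shm/clickhouse", "/var/lib/clickhouse/caches/web")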

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
config.py:
  - New PLAYGROUND_TLS_CERT / PLAYGROUND_TLS_KEY / PLAYGROUND_TLS_PORT
    env vars (default port 443). Empty cert path disables TLS.

main.py:
  - When both cert+key are set, bind a second TCPSite on tls_port
    with an SSLContext loading the cert chain. Plain port stays up
    for loopback / behind-a-LB use (sketched below).

clickbench-playground.service:
  - SupplementaryGroups=ssl-cert so the unprivileged ubuntu user
    can read /etc/letsencrypt/{live,archive}/.../privkey.pem.
  - AmbientCapabilities=CAP_NET_BIND_SERVICE so the python process
    can bind 443. Bounding set deliberately left at default — sudo
    children still need the full cap set for iptables / ip tuntap.

install-firecracker.sh:
  - When PLAYGROUND_TLS_DOMAIN is set, install certbot, acquire the
    cert via --standalone (binds 80 briefly for HTTP-01), and drop
    in a deploy hook that re-applies ssl-cert group perms on every
    renewal so the privkey stays readable.

End-to-end verified:
  curl https://clickbench-playground.clickhouse.com/api/state
  -> HTTP 200, ssl_verify_result=0, CN matches, Let's Encrypt E8,
     valid through 2026-08-12.
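
The main.py dual bind, sketched (the cfg field names are assumptions tied to the env vars above):

  import ssl
  from aiohttp import web

  async def start_sites(app: web.Application, cfg) -> None:
      runner = web.AppRunner(app)
      await runner.setup()
      # plain-HTTP site stays up for loopback / behind-a-LB use
      await web.TCPSite(runner, "0.0.0.0", cfg.port).start()
      if cfg.tls_cert and cfg.tls_key:
          ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
          ctx.load_cert_chain(cfg.tls_cert, cfg.tls_key)
          await web.TCPSite(runner, "0.0.0.0", cfg.tls_port,
                            ssl_context=ctx).start()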

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scrollbar-gutter: stable keeps space for the vertical scrollbar even
when the rail's content fits without scrolling. Without it the rail
visibly shrinks as rows finish, briefly pushing the right pane wide
enough to trigger a horizontal scrollbar at the page level.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous start used `--rm` plus an anonymous volume mount,
which meant the agent's pre-snapshot `./stop` (docker container stop)
removed the container and discarded its volume. The snapshot then
captured a freshly-started, empty firebolt-core, and every query
post-restore returned
  Database 'clickbench' does not exist or not authorized.

Drop --rm, bind-mount the engine data directory to a per-system
fb-volume on the sysdisk, and make ./start re-use the existing
container if it's already present (`docker start` instead of
re-running `docker run`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
polars/server.py stores the scan_parquet LazyFrame in a module-level
`hits` variable; /query returns 409 'DataFrame not loaded' when it
is None. The agent's pre-snapshot stop+start cycle was wiping that
variable: the snapshot captured a freshly-relaunched server, and
the first query post-restore failed with the 409.

Marking .preserve-state skips the stop+start so the snapshot ships
the running server with `hits` already set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous --silent PUT discarded a non-200 response from
/api/v1/logstream/hits, then every subsequent /ingest POST 400'd
'stream not found' — the only visible evidence was 100k+ curl 400
lines that pushed everything else out of the agent's tail-only
provision log buffer. Print the response, capture HTTP_CODE, and
exit non-zero if it's not 200/201 so the actual cause surfaces.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The kinetica daemon runs inside a docker container with
./kinetica-persist bind-mounted, so a symlink pointing at
$PWD/hits.tsv.gz dangles inside the container and the LOAD
returns
  Not_Found: No such file(s) (File(s):hits.tsv.gz)
The persist dir and $PWD live on the same overlay filesystem, so
the mv is a rename — cheap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
server.py was discarding the eval()'d value and returning only
{"elapsed": ...}; the playground UI then displayed just the timing.
Stringify the result (polars DataFrame/Series/LazyFrame via __str__,
everything else via repr) and pass it back in a "result" field.
query script extracts result -> stdout, elapsed -> stderr.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vm_manager._snapshot_disks now adds a compression pass after the
reflink-clone:
  1. cp --reflink=always working/* -> golden/*   (cheap, as before)
  2. zstd -1 -T0 --sparse golden/* -> golden/*.zst
  3. unlink the uncompressed golden once .zst is written
Two-step compress (no `zstd --rm`) so an interrupted run can't lose
the only copy of the golden. Trades a 10-30 s restore-time
decompression for ~30-60% smaller goldens; on the heaviest VM
we have (duckdb-dataframe, 249 GB swap.golden.raw) zstd-1 sampled
~5.5x, so this is roughly the difference between fitting and not
fitting the catalog on a 7 TB host.

_restore_disks materializes the working disk from whichever form
of golden exists — .zst (decompress, no reflink) or .ext4 / .raw
(legacy reflink path, kept for backwards compatibility with old
snapshots).

_has_snapshot accepts either form.

Plus a one-shot scripts/compress-goldens.sh that walks the state
dir and converts existing uncompressed goldens, so operators don't
have to wait for every system to be re-provisioned before the disk
savings land.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>