Ray backend creates per-worker venvs that fill disk and crash with concurrent experiments

## Summary

When using `parallel_backend="ray"` (the default), Ray auto-packages the working directory and creates **a fresh virtual environment per worker** in a temporary directory. For projects with heavy dependencies (e.g., PyTorch ~12GB), this causes:

1. **Disk exhaustion**: Each Ray cluster creates a full venv copy (~12GB+). With 4-10 concurrent experiments, this can consume 50-120GB+ in `/tmp`, filling the root partition.
2. **Worker startup hangs**: Workers hang during `uv sync` / `pip install` in the temp venv, producing repeated `worker_pool.cc: Some workers of the worker process have not registered within the timeout` errors.
3. **GCS crashes**: When too many clusters compete for resources, Ray's GCS (Global Control Service) becomes unresponsive, causing `Failed to connect to GCS within 60 seconds` and terminating experiments.
4. **AF_UNIX socket path limit**: If the temp directory path is long (e.g., a scratch filesystem), the Unix socket path exceeds the 107-byte limit, causing `OSError: AF_UNIX path length cannot exceed 107 bytes`.

## Reproduction

1. Have a project with `pyproject.toml` that depends on PyTorch (or similar large packages)
2. Launch 3+ concurrent `Study.run()` calls with `parallel_backend="ray"` (default)
3. Observe `/tmp` filling up with `ray_*` directories, each containing a full venv
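
To put numbers on step 3, a small helper along these lines can report how much space each Ray temp directory takes. This is only a sketch: the function names, the `/tmp` root, and the `ray*` glob pattern are assumptions chosen for illustration, and the result is the apparent file size (hard-linked files are counted once per link), not exact block usage.

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum the apparent size of all files under root, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # Files can vanish while a live cluster is running; skip them.
                pass
    return total


def ray_temp_usage(tmp_root="/tmp", pattern="ray*"):
    """Map each matching directory to its size in bytes, largest first."""
    sizes = {
        str(path): dir_size_bytes(path)
        for path in Path(tmp_root).glob(pattern)
        if path.is_dir()
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    for path, size in ray_temp_usage().items():
        print(f"{size / 1024**3:8.2f} GiB  {path}")
```

Running it before and after launching the concurrent studies should show the growth per cluster.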

## Root Cause

In `agentlab/experiments/launch_exp.py:85`:
```python
ray.init(num_cpus=n_jobs)
```

This bare `ray.init()` causes Ray to auto-detect the working directory (which contains `pyproject.toml`) and package it for workers. Each worker then runs `uv sync` to create a fresh venv in the Ray temp directory, re-installing all dependencies from scratch.

Key behaviors:
- Ray creates a new temp directory per `ray.init()` call (each experiment gets its own cluster)
- Each cluster's workers build an independent venv copy
- Failed/completed experiments leave their temp directories behind (no cleanup)
- `ray.shutdown()` in `launch_exp.py:89` does not clean up the temp directory

## Impact

- Experiments silently fail with `ENOSPC` errors (Playwright can't create browser profiles when disk is full)
- Hundreds of tasks get recorded as errors that are actually disk-full failures, requiring full reruns
- The problem compounds: each relaunch creates additional temp directories
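
Until the root cause is fixed, a pre-flight check can at least turn these silent failures into an early, explicit error. The sketch below is not part of AgentLab; `require_free_space` and the 20 GB default threshold are hypothetical and should be tuned to the size of the project's dependencies.

```python
import shutil


def require_free_space(path="/tmp", min_free_gb=20.0):
    """Raise before launching if the filesystem holding path is nearly full.

    Returns the free space in GiB so callers can log it.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        raise RuntimeError(
            f"Only {free_gb:.1f} GiB free on {path}; "
            f"need at least {min_free_gb:.1f} GiB before starting a Ray cluster"
        )
    return free_gb
```

Calling `require_free_space()` immediately before each `Study.run()` would make a relaunch on a full disk fail fast instead of recording hundreds of task errors.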

## Workaround

Using `parallel_backend="joblib"` avoids Ray entirely and doesn't have this issue. However, joblib doesn't support Ray's task graph execution (dependency tracking between tasks).

Another workaround is to set `RAY_TMPDIR` to a large filesystem and create isolated Ray temp dirs there, but this hits the AF_UNIX 107-byte socket path limit if the path is too long.
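
Whether a candidate temp directory leaves enough room can be checked before starting Ray. The sketch below assumes Ray places its sockets under `<temp_dir>/session_<timestamp>_<pid>/sockets/plasma_store`; that layout is written from memory and may differ between Ray versions, so the placeholder suffix should be treated as an estimate. The helper names are hypothetical.

```python
import os

# Limit quoted in the OSError message above.
AF_UNIX_MAX_BYTES = 107

# Assumed shape of the path Ray appends to the temp directory. The timestamp
# and PID are placeholders of typical length, not real values.
SESSION_SUFFIX = "session_YYYY-MM-DD_HH-MM-SS_ffffff_1234567/sockets/plasma_store"


def socket_path_headroom(temp_dir, suffix=SESSION_SUFFIX):
    """Return how many bytes remain before the socket path hits the limit.

    A negative value means the path would be too long.
    """
    candidate = os.path.join(temp_dir, suffix)
    return AF_UNIX_MAX_BYTES - len(os.fsencode(candidate))


def fits_af_unix(temp_dir):
    """True if the estimated socket path fits within the AF_UNIX limit."""
    return socket_path_headroom(temp_dir) >= 0
```

If the large filesystem has a long mount path, a short symlink to it (for example from a directory directly under `/`) is one way to keep the socket path under the limit while still storing the data elsewhere.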

## Suggested Fix

1. **Disable Ray's auto-packaging** by setting `runtime_env={"worker_process_setup_hook": ...}` or `RAY_RUNTIME_ENV_HOOK` to prevent venv creation in workers
2. **Or** pass `runtime_env={"py_modules": [...]}` with only the necessary module instead of the full working directory
3. **Or** call `ray.init(runtime_env={"working_dir": None})` to prevent auto-packaging
4. **Add temp directory cleanup**: remove the Ray temp directory in the `finally` block after `ray.shutdown()`
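
Option 4 could look roughly like the sketch below. It is not the current `launch_exp.py` code: `ray_temp_dir` and `run_with_cleanup` are hypothetical names, and `work` stands in for whatever submits the experiment tasks. It relies on the `_temp_dir` argument of `ray.init()` to give each cluster a short, known directory that can be deleted once the cluster has shut down. It does not by itself stop the venv from being created.

```python
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def ray_temp_dir(base="/tmp", prefix="r_"):
    """Create a short-named temp directory and always remove it on exit."""
    path = tempfile.mkdtemp(prefix=prefix, dir=base)
    try:
        yield path
    finally:
        # ignore_errors: a half-written venv should not mask the real exception.
        shutil.rmtree(path, ignore_errors=True)


def run_with_cleanup(n_jobs, work):
    """Start Ray in a private temp dir, run work(), then shut down and clean up."""
    import ray  # imported lazily so the helper above is usable without Ray

    with ray_temp_dir() as tmp:
        ray.init(num_cpus=n_jobs, _temp_dir=tmp)
        try:
            return work()
        finally:
            # Runs before the context manager deletes the directory.
            ray.shutdown()
```

Keeping the prefix short also helps with the AF_UNIX path limit, since the directory name becomes part of every socket path.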

## Environment

- AgentLab v0.4.0
- Ray 2.51.1
- Python 3.12
- Ubuntu 22.04
- Dependencies include PyTorch 2.8.0 (~12GB installed)

