Ray backend creates per-worker venvs that fill disk and crash with concurrent experiments #331

@xhluca

Summary

When using parallel_backend="ray" (the default), Ray auto-packages the working directory and creates a fresh virtual environment per worker in a temporary directory. For projects with heavy dependencies (e.g., PyTorch ~12GB), this causes:

  1. Disk exhaustion: Each Ray cluster creates a full venv copy (~12GB+). With 4-10 concurrent experiments, this can consume 50-120GB+ in /tmp, filling the root partition.
  2. Worker startup hangs: Workers hang during uv sync / pip install in the temp venv, and Ray repeatedly logs the worker_pool.cc error "Some workers of the worker process have not registered within the timeout".
  3. GCS crashes: When too many clusters compete for resources, Ray's GCS (Global Control Store) becomes unresponsive; experiments then terminate with "Failed to connect to GCS within 60 seconds".
  4. AF_UNIX socket path limit: If the temp directory path is long (e.g., a scratch filesystem), the Unix socket path exceeds the 107-byte limit, causing OSError: AF_UNIX path length cannot exceed 107 bytes.

Reproduction

  1. Have a project with pyproject.toml that depends on PyTorch (or similar large packages)
  2. Launch 3+ concurrent Study.run() calls with parallel_backend="ray" (default)
  3. Observe /tmp filling up with ray_* directories, each containing a full venv
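To put numbers on step 3, a small script along these lines could report how much space each Ray temp directory occupies. This is a minimal sketch: the helper names are hypothetical, the ray_* pattern is taken from the observation above, and the result is the apparent file size, which may overcount files that are hard-linked from a package cache.

```python
import glob
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of files under path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is unreadable (workers may still be writing).
                continue
    return total


def ray_temp_usage(base: str = "/tmp", pattern: str = "ray_*") -> dict[str, int]:
    """Map each directory under base that matches pattern to its size in bytes."""
    return {
        path: dir_size_bytes(path)
        for path in sorted(glob.glob(os.path.join(base, pattern)))
        if os.path.isdir(path)
    }


if __name__ == "__main__":
    for path, size in ray_temp_usage().items():
        print(f"{size / 1e9:8.2f} GB  {path}")
```

Running it before and after launching the concurrent studies should show one new directory per cluster if the behavior described here is occurring.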

Root Cause

In agentlab/experiments/launch_exp.py:85:

ray.init(num_cpus=n_jobs)

This call passes no runtime_env, so Ray auto-detects the working directory (which contains pyproject.toml) and packages it for workers. Each worker then runs uv sync to create a fresh venv in the Ray temp directory, re-installing all dependencies from scratch.

Key behaviors:

  • Ray creates a new temp directory per ray.init() call (each experiment gets its own cluster)
  • Each cluster's workers build an independent venv copy
  • Failed/completed experiments leave their temp directories behind (no cleanup)
  • ray.shutdown() in launch_exp.py:89 does not clean up the temp directory

Impact

  • Experiments silently fail with ENOSPC errors (Playwright can't create browser profiles when disk is full)
  • Hundreds of tasks get recorded as errors that are actually disk-full failures, requiring full reruns
  • The problem compounds: each relaunch creates additional temp directories
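To separate disk-full failures from genuine task errors before deciding what to rerun, a text heuristic like the one below might help. It is only a sketch: the function name is hypothetical, and it assumes the recorded error message or traceback is available as a plain string, which this issue does not confirm for AgentLab's result format.

```python
import errno
import os

# Markers that typically accompany a full disk. The last entry is the
# platform's own message for ENOSPC, in case it differs from the literal.
_DISK_FULL_MARKERS = (
    "enospc",
    "no space left on device",
    os.strerror(errno.ENOSPC).lower(),
)


def is_disk_full_error(error_text: str) -> bool:
    """Heuristic: does this error message or traceback point to a full disk?"""
    lowered = error_text.lower()
    return any(marker in lowered for marker in _DISK_FULL_MARKERS)
```

Tasks flagged this way are candidates for a rerun once disk space has been freed; anything else deserves a closer look as a real failure.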

Workaround

Using parallel_backend="joblib" avoids Ray entirely and doesn't have this issue. However, the joblib backend doesn't provide the dependency tracking between tasks that Ray's task graph execution offers.

Another workaround is to set RAY_TMPDIR to a large filesystem and create isolated Ray temp dirs there, but this hits the AF_UNIX 107-byte socket path limit if the path is too long.
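Before pointing RAY_TMPDIR at a scratch filesystem, it may be worth checking whether the resulting socket path would fit. The sketch below uses the 107-byte limit from the error above. The suffix is an assumption about how Ray lays out its session directory and socket files, not something confirmed in this issue, so adjust it to match what you see on disk.

```python
import os

AF_UNIX_MAX_BYTES = 107  # limit quoted in the OSError reported above

# Assumed worst-case suffix appended to the temp dir (session dir + socket file).
# This layout is a guess for illustration; check a real session directory.
ASSUMED_SUFFIX = "/session_2025-01-01_00-00-00_000000_0000000/sockets/plasma_store"


def socket_path_headroom(temp_dir: str, suffix: str = ASSUMED_SUFFIX) -> int:
    """Return the bytes left under the limit; a negative value means too long."""
    full = os.path.abspath(temp_dir) + suffix
    return AF_UNIX_MAX_BYTES - len(os.fsencode(full))


if __name__ == "__main__":
    candidate = os.environ.get("RAY_TMPDIR", "/tmp")
    print(candidate, "headroom:", socket_path_headroom(candidate), "bytes")
```

If the headroom is negative, a short symlink to the scratch location (for example a few characters under /tmp) is one common way to shorten the path, assuming Ray does not resolve the link before building the socket path.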

Suggested Fix

  1. Disable Ray's auto-packaging by setting runtime_env={"worker_process_setup_hook": ...} or RAY_RUNTIME_ENV_HOOK to prevent venv creation in workers
  2. Or pass runtime_env={"py_modules": [...]} with only the necessary module instead of the full working directory
  3. Or set ray.init(runtime_env={"working_dir": None}) to prevent auto-packaging
  4. Remove the Ray temp directory in the finally block, after ray.shutdown() returns
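Item 4 could be sketched as a context manager that owns the temp directory for one cluster. The helper name is hypothetical; ray.init does accept a _temp_dir argument, but how this would fit into launch_exp.py is an assumption, so the Ray calls are shown only as a commented usage example.

```python
import contextlib
import shutil
import tempfile
from typing import Iterator


@contextlib.contextmanager
def private_ray_temp_dir(base: str = "/tmp") -> Iterator[str]:
    """Create a short-named temp dir and delete it on exit, even on failure."""
    path = tempfile.mkdtemp(prefix="ray_", dir=base)
    try:
        yield path
    finally:
        # ignore_errors: some files may already be gone or briefly locked.
        shutil.rmtree(path, ignore_errors=True)


# Hypothetical use around the existing call in launch_exp.py (requires ray):
#
# with private_ray_temp_dir() as tmp:
#     ray.init(num_cpus=n_jobs, _temp_dir=tmp)
#     try:
#         ...  # run the task graph
#     finally:
#         ray.shutdown()  # stop the cluster before the directory is removed
```

Keeping the prefix short and the base directory shallow also leaves more room under the AF_UNIX path limit mentioned in the summary.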

Environment

  • AgentLab v0.4.0
  • Ray 2.51.1
  • Python 3.12
  • Ubuntu 22.04
  • Dependencies include PyTorch 2.8.0 (~12GB installed)
