Skip to content

Latest commit

 

History

History
264 lines (190 loc) · 19.4 KB

File metadata and controls

264 lines (190 loc) · 19.4 KB

Storage pipeline

The storage stack carries guest I/O requests from a guest-visible controller to a backing store and back. It's shared between OpenVMM and OpenHCL. Every disk backend implements the DiskIo trait, and frontends hold a Disk wrapper — a cheap, cloneable handle to any backend. For the DiskIo trait surface, method contracts, and error model, see the disk_backend rustdoc.

The pipeline

Every storage I/O flows through the same layered pipeline:

  ┌──────────────────────────────────────────────────────────┐
  │  Guest I/O                                               │
  └────────────────────────┬─────────────────────────────────┘
                           │
  ┌────────────────────────┼─────────────────────────────────┐
  │  Frontend              │                                 │
  │  (NVMe · StorVSP · IDE)│                                 │
  └────┬───────────────────┼────────────────────┬────────────┘
       │                   │                    │
       │ NVMe: direct      │ SCSI / IDE         │
       │                   ▼                    │
       │       ┌────────────────────────┐       │
       │       │ SCSI adapter           │       │
       │       │ (SimpleScsiDisk /      │       │
       │       │  SimpleScsiDvd)        │       │
       │       └───────────┬────────────┘       │
       │                   │                    │
       ▼                   ▼                    ▼
  ┌──────────────────────────────────────────────────────────┐
  │  Disk  (DiskIo trait boundary)                           │
  └────────────────────────┬─────────────────────────────────┘
                           │
  ┌────────────────────────┼─────────────────────────────────┐
  │  Decorator wrappers    │  (optional: crypt · delay · PR) │
  └────────────────────────┼─────────────────────────────────┘
                           │
            ┌──────────────┴──────────────┐
            ▼                             ▼
  ┌──────────────────┐      ┌──────────────────────────────┐
  │  Backend         │      │  Layered disk                │
  │  (file · block   │      │  (optional: RAM + backing)   │
  │   device · blob  │      │    ├── Layer 0 (RAM/sqlite)  │
  │   · VHD · ...)   │      │    └── Layer 1 (backend)     │
  └──────────────────┘      └──────────────────────────────┘

Key vocabulary:

  • Frontend. Speaks a guest-visible storage protocol and translates requests into DiskIo calls.
  • SCSI adapter. For the SCSI and IDE paths, an intermediate layer (SimpleScsiDisk or SimpleScsiDvd) that parses SCSI CDB opcodes before calling DiskIo.
  • Backend. A DiskIo implementation that reads and writes to a specific backing store.
  • Decorator. A DiskIo implementation that wraps another Disk and transforms I/O in transit (encryption, delay, persistent reservations).
  • Layered disk. A DiskIo implementation composed of ordered layers with per-sector presence tracking.

Frontends

Three frontends exist. Each speaks a different guest-visible protocol but they all produce DiskIo calls on the backend side.

Frontend Protocol Transport Crate
NVMe NVMe 2.0 PCI MMIO + MSI-X nvme
StorVSP SCSI CDB over VMBus VMBus ring buffers storvsp
IDE ATA / ATAPI PCI/ISA I/O ports + DMA ide

NVMe is the simplest path. The NVMe controller's namespace directly holds a Disk. NVM opcodes (READ, WRITE, FLUSH, DSM) map nearly 1:1 to DiskIo methods. The FUA bit from the NVMe write command is forwarded directly.

StorVSP / SCSI has a two-layer design. StorVSP handles the VMBus transport — negotiation, ring buffer management, sub-channel allocation. It dispatches each SCSI request to an AsyncScsiDisk implementation. For hard drives, that's SimpleScsiDisk, which parses the SCSI CDB and translates it to DiskIo calls. For optical drives, it's SimpleScsiDvd.

IDE is the legacy path. ATA commands for hard drives call DiskIo directly. ATAPI commands for optical drives delegate to SimpleScsiDvd through an ATAPI-to-SCSI translation layer — the same DVD implementation that StorVSP uses. IDE also supports enlightened INT13 commands, a Microsoft-specific optimization that collapses the multi-exit register-programming sequence into a single VM exit.

Backends

A backend is a DiskIo implementation that reads and writes to a specific backing store. Backends are interchangeable — swap one for another without changing the frontend. The frontend holds a Disk and doesn't know what's behind it. See the storage backends page for the full catalog and platform details.

Decorators

A decorator is a DiskIo implementation that wraps another Disk and transforms I/O in transit. Features compose by stacking decorators without modifying backends:

  CryptDisk
    └── BlockDeviceDisk

Three decorators exist: CryptDisk (XTS-AES-256 encryption), DelayDisk (injected latency), and DiskWithReservations (in-memory persistent reservation emulation). All three forward metadata (sector count, sector size, disk ID, wait_resize) to the inner disk unchanged. See the storage backends page for the decorator catalog.

The layered disk model

A LayeredDisk is a DiskIo implementation composed of multiple layers, ordered from top to bottom. Each layer is a block device with per-sector presence tracking. This model powers diff disks, RAM overlays, and caching.

Reads fall through

When a read arrives, the layered disk checks layers top-to-bottom. The first layer that has the requested sectors provides the data. Sectors not present in any layer are zeroed.

Writes go to the top

Writes always go to the topmost layer. If that layer is configured with write-through, the write also propagates to the next layer.

Read caching

A layer can be configured to cache read misses: when sectors are fetched from a lower layer, they're written back to the cache layer. This uses a write_no_overwrite operation to avoid overwriting sectors that were written between the read and the cache population.

Layer implementations

Two concrete layers exist today:

  • RamDiskLayer (disklayer_ram) — ephemeral, in-memory. Data is stored in a BTreeMap keyed by sector number. Fast, but lost when the VM stops.
  • SqliteDiskLayer (disklayer_sqlite) — persistent, backed by a SQLite database (.dbhd file). Designed for dev/test scenarios — no stability guarantees on the on-disk format.

A full Disk can appear at the bottom of the stack as a fully-present layer (DiskAsLayer). This is the typical case: a RAM or sqlite layer on top of a file or block device.

Worked example: memdiff:file:disk.vhdx

  Layer 0: RamDiskLayer (empty, writable)
  Layer 1: DiskAsLayer wrapping FileDisk (fully present, read-only
           from the layered disk's perspective)
  • Guest write → sector goes to the RAM layer.
  • Guest read → check RAM; if the sector is present, return it. If absent, fall through to the file.
  • Sectors absent from both layers → zero-filled.

Changes are ephemeral — they live in the RAM layer and are lost when the VM stops. The Running OpenVMM page shows concrete memdiff: examples.

How configuration becomes a concrete stack

The resource resolver connects configuration (CLI flags, VTL2 settings) to concrete backends. A resource handle describes what backend to use; a resolver creates it.

The storage resolver chain is recursive. An NVMe controller resolves each namespace's disk, which may be a layered disk, which resolves each layer in parallel, which may itself be a disk that needs resolving.

Example: --disk memdiff:file:path/to/disk.vhdx

  1. CLI parses this into a LayeredDiskHandle with two layers:
    • Layer 0: RamDiskLayerHandle { len: None, sector_size: None } (RAM diff, inherits size and sector size from backing disk)
    • Layer 1: DiskLayerHandle(FileDiskHandle(...)) (the file)
  2. The layered disk resolver resolves both layers in parallel.
  3. The RAM layer attaches on top of the file layer, inheriting its sector size and capacity.
  4. The resulting LayeredDisk is wrapped in a Disk and handed to the NVMe namespace or SCSI controller.

For the OpenHCL settings model (StorageController, Lun, PhysicalDevice), see Storage Translation and Storage Configuration Model.

Backend catalog

Backend Crate Wraps Platform Note
FileDisk disk_file Host file Cross-platform Simplest backend
Vhd1Disk disk_vhd1 VHD1 fixed file Cross-platform Parses VHD footer
VhdmpDisk disk_vhdmp Windows vhdmp driver Windows Dynamic/differencing VHD/VHDX
BlobDisk disk_blob HTTP / Azure Blob Cross-platform Read-only, HTTP range requests
BlockDeviceDisk disk_blockdevice Linux block device Linux io_uring, resize via uevent, PR passthrough
NvmeDisk disk_nvme Physical NVMe (VFIO) Linux/Windows User-mode NVMe driver, resize via AEN
StripedDisk disk_striped Multiple Disks Cross-platform Data striping

Online disk resize

Disk resize is a cross-cutting concern that spans backends and frontends.

Backend detection

Only two backends detect capacity changes at runtime:

  • BlockDeviceDisk — listens for Linux uevent notifications on the block device. When the host resizes the device, a uevent fires, the backend re-queries the size via ioctl, and wait_resize completes.
  • NvmeDisk — the user-mode NVMe driver monitors Async Event Notifications (AEN) from the physical controller and rescans namespace capacity.

All other backends default to never signaling (wait_resize returns pending()). Decorators and layered disks delegate wait_resize to the inner backend.

`FileDisk` never signals resize. If you attach a file backend and resize the file at runtime, nothing happens — the guest won't be notified. Use `BlockDeviceDisk` or `NvmeDisk` for runtime resize.

Frontend notification

Once a backend detects a resize, the frontend notifies the guest:

Frontend Mechanism How it works
NVMe Async Event Notification Background task per namespace calls wait_resize. On change, completes a queued AER command with a changed-namespace-list log page. Guest re-identifies the namespace.
StorVSP / SCSI UNIT_ATTENTION On the next SCSI command after a resize, SimpleScsiDisk detects the capacity change and returns CHECK_CONDITION with UNIT_ATTENTION / CAPACITY_DATA_CHANGED. Guest retries and re-reads capacity.
IDE Not supported IDE has no capacity-change notification mechanism.

The resize path is the same in OpenHCL and standalone — BlockDeviceDisk detects the uevent from the host, wait_resize completes, and the frontend notifies the guest through the standard mechanism. No special paravisor-level interception.

Virtual optical / DVD

DVD and CD-ROM drives use a different model from disk devices.

SimpleScsiDvd implements AsyncScsiDisk and manages media state: a disk can be Loaded or Unloaded. Optical media always uses a 2048-byte sector size. The implementation handles optical-specific SCSI commands: GET_EVENT_STATUS_NOTIFICATION, GET_CONFIGURATION, START_STOP_UNIT (eject), and media change events.

Eject

Two eject paths exist:

  • Guest-initiated (SCSI START_STOP_UNIT with the load/eject flag): the DVD handler checks the prevent flag, replaces media with Unloaded, and calls disk.eject(). Once ejected via SCSI, the media is permanently removed for the VM lifetime.
  • Host-initiated (change_media via the resolver's background task): can insert new media or remove existing media dynamically.

Frontend support

Frontend DVD support How
StorVSP / SCSI Yes SimpleScsiDvd implements AsyncScsiDisk directly.
IDE Yes ATAPI wraps SimpleScsiDvd through the ATAPI-to-SCSI layer.
NVMe No NVMe has no removable media concept. Explicitly rejected.

CLI

  • --disk file:my.iso,dvd → SCSI optical drive.
  • --ide file:my.iso,dvd → IDE optical drive (ATAPI).

The dvd flag implicitly sets read_only = true.

mem: and memdiff: CLI mapping

Both CLI options map to the layered disk model:

  • mem:1G creates a single-layer LayeredDisk with a RamDiskLayer sized to 1 GB. No backing disk — the RAM layer is the entire disk.
  • memdiff:file:disk.vhdx creates a two-layer LayeredDisk: a RamDiskLayer (inheriting size from the backing disk) on top of the file. Writes go to the RAM layer; reads fall through to the file for sectors not yet written.

Both use RamDiskLayerHandle under the hood. The difference is len: Some(size) for mem: (standalone RAM disk with explicit size) vs. len: None for memdiff: (inherits from backing disk). The optional sector_size field (default None) lets you override the sector size; when None, it inherits from the lower layer or defaults to 512 bytes. The Running OpenVMM page shows concrete examples.

Controller identity and Azure disk classification

In Azure, which controller a disk sits on is a de facto compatibility boundary. Azure VMs present four SCSI controllers (this may change), each with a distinct instance ID. One controller carries the OS disk, resource (temporary) disk, and related infrastructure disks; a separate controller carries remote data disks. For Gen1 VMs, the IDE controllers logically replace that first SCSI controller, while data disks remain on SCSI.

Guest agents use controller identity to classify disks. The azure-vm-utils udev rules match on SCSI controller instance IDs to create stable symlinks under /dev/disk/azure/. Moving a disk from one StorVSP controller instance to another changes its classification and can break guest-side automation. For SCSI disk mapping details, see the Azure disk mapping docs.

For NVMe, the mapping uses namespace IDs: NSID 1 is the OS disk, NSID 2+ are data disks (portal LUN = NSID − 2). On newer VM sizes (v7+), disks are split across multiple NVMe controllers by caching policy. NVMe is Gen2-only. See the NVMe overview and NVMe disk identification FAQ for the full Azure perspective.

Implementation map

Component Why read it Source Rustdoc
disk_backend DiskIo trait, Disk wrapper, error model source rustdoc
disk_layered Layered disk, LayerIo trait, bitmap tracking source rustdoc
nvme NVMe controller emulator source rustdoc
storvsp VMBus SCSI controller source rustdoc
scsidisk SCSI CDB parser (SimpleScsiDisk, SimpleScsiDvd) source rustdoc
ide IDE controller emulator source rustdoc
scsi_core AsyncScsiDisk trait, Request, ScsiResult source rustdoc