Feature request: low-latency raw audio stream for agent/action use cases #4

@maxpetrusenko

Request

Please expose a supported low-latency audio stream from Bee, ideally as raw audio frames/chunks or an equivalent realtime websocket/webhook/API. This would make Bee usable as the front end for personal agents and action automation, not only retrospective notes/transcripts.

Current limitation

Today the public CLI path appears to be transcript-first:

  • bee stream --json --types new-utterance emits utterance events, but the realtime docs describe the stream as at-most-once delivery, so an event lost in transit is not redelivered.
  • bee now --json can backfill, but it is polling-oriented and can lag behind speech.
  • In practice, action workflows that wait for processed utterances can miss commands or receive them too late for voice-assistant UX.

For my setup, Bee transcribes speech, a VPS ingests it, and a local agent routes explicit wake-word commands like "Hermes ..." to approved actions. This works for slow tasks, but the connection is brittle for Jarvis-style voice control because the system cannot get audio bytes or low-latency transcript deltas directly.

Desired API shape

Any one of these would help:

  • WebSocket or webhook delivering PCM/Opus/AAC audio chunks with timestamps.
  • Realtime transcript deltas with stable utterance IDs and delivery acknowledgements.
  • A local/device stream from the Bee CLI that can be consumed by an agent process.
  • Clear latency target and ordering/deduplication semantics.
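To make the ordering/deduplication and acknowledgement points concrete, here is a minimal client-side sketch of what I have in mind. Everything in it is an assumption for illustration, not an existing Bee API: the `Delta` fields (`utterance_id`, `seq`, `text`, `captured_at_ms`), a per-utterance sequence number starting at 0, and an ack value meaning "highest contiguous sequence number received".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Delta:
    """Hypothetical transcript delta; field names are assumptions."""
    utterance_id: str    # stable ID shared by all deltas of one utterance
    seq: int             # per-utterance sequence number, starting at 0
    text: str
    captured_at_ms: int  # device capture timestamp, for latency measurement


@dataclass
class DeltaConsumer:
    """Reorders and deduplicates deltas, and reports what can be acked."""
    _next_seq: Dict[str, int] = field(default_factory=dict)
    _pending: Dict[str, Dict[int, Delta]] = field(default_factory=dict)

    def receive(self, delta: Delta) -> Tuple[List[Delta], int]:
        """Return (deltas now deliverable in order, highest contiguous seq).

        The second value is -1 while seq 0 has not arrived yet.
        """
        expected = self._next_seq.get(delta.utterance_id, 0)
        pending = self._pending.setdefault(delta.utterance_id, {})

        # Drop anything already delivered; keep the first copy of the rest.
        if delta.seq >= expected:
            pending.setdefault(delta.seq, delta)

        # Release the contiguous run starting at the expected sequence number.
        ready: List[Delta] = []
        while expected in pending:
            ready.append(pending.pop(expected))
            expected += 1

        self._next_seq[delta.utterance_id] = expected
        return ready, expected - 1
```

With semantics like these, a server could retransmit anything above the last ack after a reconnect, and the client would discard repeats instead of acting on a command twice.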

Ideal target: sub-second to a few seconds end-to-end from speech to agent callback. Raw audio access would also allow users to run their own wake-word detection, ASR, latency measurement, and fallback routing without waiting for post-processing.

Safety/use case

This is for the owner’s own Bee device and opt-in personal automation. The agent side can keep a wake-word gate plus an allowlist of actions. I am not asking for other users’ data or bypassing consent controls, just an official way to stream my own captured audio or lower-latency speech events into my own automation stack.
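For reference, the wake-word gate plus allowlist on the agent side can be as small as the sketch below. The wake word matches my setup; the command phrases and action names in `ALLOWED_ACTIONS` are made up for illustration, and a real router would need more forgiving matching.

```python
import re
from typing import Optional

WAKE_WORD = "hermes"

# Hypothetical allowlist: normalized command phrase -> approved action name.
ALLOWED_ACTIONS = {
    "lights on": "lights.on",
    "lights off": "lights.off",
}


def route(transcript: str) -> Optional[str]:
    """Return an approved action name, or None if the text must be ignored."""
    parts = transcript.strip().split(maxsplit=1)
    if not parts:
        return None

    # Gate: the first word must be the wake word (punctuation ignored).
    if parts[0].strip(",.!?:;").lower() != WAKE_WORD:
        return None
    if len(parts) < 2:
        return None

    # Normalize the rest: lowercase, drop punctuation, collapse whitespace.
    command = re.sub(r"[^a-z0-9 ]", "", parts[1].lower())
    command = " ".join(command.split())

    # Allowlist: anything not explicitly approved is dropped.
    return ALLOWED_ACTIONS.get(command)
```

Speech without the wake word never reaches the action layer, and a wake-word command that is not on the allowlist returns None rather than falling through to some default.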
