A polished desktop GUI for Kokoro TTS (hexgrad/Kokoro-82M), built with Python + pywebview. Runs fully offline after the initial model download.
- Three-column interface: text input · voice & generation settings · audio post-processing
- All Kokoro languages and voices with gender filtering
- Voice blending (up to 3 voices with weighted mixing)
- Audio post-processing: normalize, trim silence, noise gate, fade in/out
- Output formats: WAV, FLAC, MP3
- Chunked generation with progress bar — handles long texts reliably
- Plays preview audio without saving; plays last output on demand
- Outputs are DAW-ready (DaVinci Resolve, Logic Pro, etc.)
Kokoro uses espeak-ng for phoneme generation. Install it before running the app:
macOS
brew install espeak-ngLinux (Debian/Ubuntu)
sudo apt-get install espeak-ngWindows Download the MSI installer from: https://github.com/espeak-ng/espeak-ng/releases
-
Clone / download this project folder.
-
Create a virtual environment (Python 3.10+):
python3 -m venv .venv source .venv/bin/activate # macOS/Linux .venv\Scripts\activate # Windows
-
Install Python dependencies:
pip install -r requirements.txt
-
Run the app:
python main.py
On first launch, Kokoro will download model weights (~330 MB) from Hugging Face
and cache them locally (~/.cache/huggingface/hub/). Subsequent launches are
fully offline.
Set the environment variable TTS_DEBUG=1 to open the pywebview developer tools:
TTS_DEBUG=1 python main.pyGenerated audio is saved to ./outputs/ by default (configurable in the UI).
Files are named output_YYYYMMDD_HHMMSS.wav unless you set a custom filename.
WAV and FLAC outputs are lossless and import cleanly into:
- DaVinci Resolve — File → Import → Media
- Logic Pro — drag into timeline or browser
- Any other DAW that accepts PCM or FLAC
MP3 export requires pydub and ffmpeg (install ffmpeg separately or via
brew install ffmpeg).
Enable blending in the Voice panel to mix up to 3 voices by weighted average
of their style tensors. The resulting blend spec (e.g. af_heart:60,am_echo:40)
is displayed in real time. This requires the voice .pt tensor files to be
present in the Kokoro installation.
| Symptom | Fix |
|---|---|
| "espeak-ng not found" dialog | Install espeak-ng per the instructions above |
| Voice shows "(unavailable)" | That voice's tensor file wasn't found; try reinstalling kokoro |
| MP3 export falls back to WAV | Install pydub (pip install pydub) and ffmpeg |
| Black window on launch | Ensure pywebview ≥ 5.0 is installed |
Produces a self-contained installer — no Python or espeak-ng required by end users. Model weights (~330 MB) are downloaded on first launch.
brew install espeak-ng create-dmg
pip install -r requirements.txt
bash build_mac.sh
# Output: dist/KokoroTTSStudio-mac.dmgIcons: Drop
assets/icon.icns(1024×1024) before building for a polished result. The build script generates a teal placeholder if absent.
choco install espeak innosetup
pip install -r requirements.txt
build_windows.bat
:: Output: dist\KokoroTTSStudio-Setup.exeIcons: Drop
assets/icon.ico(256×256) before building.
Push to main or trigger Build Installers manually from the Actions tab.
Both .dmg and .exe are uploaded as 30-day artifacts on each run.
- Torch is large (~2 GB unpacked in the bundle) — first build takes time.
numbais excluded; librosa works without it for the operations used here.- The
espeakng_loaderpip package bundles pre-built espeak-ng binaries, so end users don't need to install it separately. - Model weights are not bundled — they download to
~/Library/Application Support/KokoroTTSStudio/models/(macOS) or%APPDATA%\KokoroTTSStudio\models\(Windows) on first launch.
See LICENSE.