Deploy Speaches | Open-Source OpenAI Whisper Alternative on Railway
Self Host Speaches — OpenAI-compatible STT/TTS API server
Deploy and Host Speaches on Railway
Deploy Speaches on Railway to run a fully self-hosted, OpenAI API-compatible speech-to-text and text-to-speech server. Speaches uses faster-whisper for transcription and Kokoro/Piper for speech synthesis — like Ollama, but for audio models. This template pre-configures Speaches with CPU-optimized int8 quantization, API key authentication, a persistent volume for model caching, and the Gradio web UI for interactive testing.
Self-host Speaches to process audio without sending data to third-party APIs. The deployment includes a single Speaches service with a HuggingFace model cache volume — no database required.

Getting Started with Speaches on Railway
After deployment completes, open your Railway-generated URL to access the Speaches Gradio web UI. The web UI lets you test speech-to-text transcription and text-to-speech generation directly in your browser. To use the API, send requests to /v1/audio/transcriptions (STT) or /v1/audio/speech (TTS) with your API key in the Authorization: Bearer header. Models download automatically on first use and are cached in the persistent volume. Check the /docs endpoint for the full OpenAPI specification. Point any OpenAI-compatible SDK at your Speaches URL to start integrating.
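As a sketch of the request shape described above (the Kokoro model ID and voice name below are illustrative examples, and your Railway URL and key will differ), a text-to-speech request can be built with the Python standard library:

```python
import json
import urllib.request

BASE_URL = "https://your-app.up.railway.app"  # your Railway-generated URL
API_KEY = "your-secret-key"                   # must match the API_KEY variable

def auth_headers(api_key: str) -> dict:
    """Bearer-token header Speaches expects when API_KEY is set."""
    return {"Authorization": f"Bearer {api_key}"}

def speech_request(text: str,
                   model: str = "speaches-ai/Kokoro-82M-v1.0-ONNX",
                   voice: str = "af_heart") -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /v1/audio/speech endpoint."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=body,
        headers={**auth_headers(API_KEY), "Content-Type": "application/json"},
        method="POST",
    )

req = speech_request("Hello from Railway")
# urllib.request.urlopen(req) against a live deployment returns audio bytes.
```

The same header construction applies to /v1/audio/transcriptions, which instead takes a multipart file upload.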
About Hosting Speaches
Speaches is an open-source (MIT) server that provides OpenAI-compatible speech-to-text and text-to-speech APIs. It dynamically loads and unloads models on demand — specify the model in your API request and Speaches downloads it from HuggingFace, runs inference, and optionally offloads it after a configurable TTL.
- STT via faster-whisper (4x faster than standard Whisper on CPU)
- TTS via Kokoro (#1 ranked on TTS Arena) and Piper (20+ languages)
- Realtime WebSocket API at /v1/realtime for two-way voice conversations
- Streaming transcription via Server-Sent Events
- Dynamic model management — models auto-download and auto-unload based on TTL settings
Why Deploy Speaches on Railway
Railway handles infrastructure so you can focus on building voice-enabled applications.
- Full data privacy — audio never leaves your server
- Zero per-minute costs — pay only for Railway infrastructure
- OpenAI API drop-in replacement — existing code works unchanged
- Persistent model cache survives redeployments
- API key authentication included out of the box
Common Use Cases for Self-Hosted Speaches
- Voice-enabled AI agents: Use as the audio layer for LLM-powered assistants with real-time WebSocket support for two-way conversations
- Meeting and call transcription: Stream audio for real-time transcription in internal tools, call centers, or accessibility captioning
- Home automation voice control: Integrate with Home Assistant via the wyoming_openai proxy for private, cloud-free voice commands
- Multilingual content production: Generate voiceovers for videos, e-learning, audiobooks, or IVR systems using Kokoro and Piper TTS models
Dependencies for Speaches on Railway
- Speaches — ghcr.io/speaches-ai/speaches:0.9.0-rc.3-cpu (CPU-only build, ~1.2 GB image)
- Volume — HuggingFace model cache at /home/ubuntu/.cache/huggingface/hub
Environment Variables Reference for Speaches
| Variable | Description | Default |
|---|---|---|
| API_KEY | API key for endpoint authentication | (none — open) |
| ENABLE_UI | Enable Gradio web interface | true |
| WHISPER__COMPUTE_TYPE | Quantization type (int8 reduces memory ~40%) | default |
| WHISPER__INFERENCE_DEVICE | Inference device (cpu/cuda/auto) | auto |
| STT_MODEL_TTL | Seconds before STT model unloads (-1 = never) | 300 |
| TTS_MODEL_TTL | Seconds before TTS model unloads (-1 = never) | 300 |
| PRELOAD_MODELS | JSON array of model IDs to download at startup | [] |
| LOG_LEVEL | Logging level (debug/info/warning/error) | debug |
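Putting the variables above together, one plausible production-leaning configuration might look like this (the values here are illustrative, not the template defaults):

```shell
API_KEY='your-secret-key'
WHISPER__COMPUTE_TYPE='int8'
WHISPER__INFERENCE_DEVICE='cpu'
STT_MODEL_TTL='-1'   # keep the STT model resident instead of unloading
PRELOAD_MODELS='["Systran/faster-whisper-small"]'
LOG_LEVEL='info'
```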
Deployment Dependencies
- Runtime: Python 3.x with Uvicorn ASGI server
- Models: Downloaded from HuggingFace Hub at runtime
- GitHub: speaches-ai/speaches
- Docs: speaches.ai
Hardware Requirements for Self-Hosting Speaches
| Resource | Minimum (tiny/base models) | Recommended (small + Kokoro) |
|---|---|---|
| CPU | 2 vCPU | 4+ vCPU |
| RAM | 1 GB | 4 GB |
| Storage | 500 MB (model cache) | 5 GB (multiple models) |
| Runtime | Docker | Docker |
For the large-v3 Whisper model, allocate 8 GB RAM. Use WHISPER__COMPUTE_TYPE=int8 to reduce memory by approximately 40% on CPU deployments.
Self-Hosting Speaches
Pull and run the CPU image with Docker:
```shell
docker run -d \
  -p 8000:8000 \
  -v speaches-cache:/home/ubuntu/.cache/huggingface/hub \
  -e API_KEY=your-secret-key \
  -e WHISPER__COMPUTE_TYPE=int8 \
  ghcr.io/speaches-ai/speaches:0.9.0-rc.3-cpu
```
Or use docker-compose for a persistent setup:
```yaml
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:0.9.0-rc.3-cpu
    ports:
      - "8000:8000"
    environment:
      - API_KEY=your-secret-key
      - WHISPER__COMPUTE_TYPE=int8
      - WHISPER__INFERENCE_DEVICE=cpu
      - STT_MODEL_TTL=-1
    volumes:
      - hf-cache:/home/ubuntu/.cache/huggingface/hub
volumes:
  hf-cache:
```
How Much Does Speaches Cost to Self-Host?
Speaches is free and open-source under the MIT license — no per-minute or per-character fees. On Railway, you pay only for compute and storage. Cloud alternatives like OpenAI Whisper API charge $0.006/minute for transcription and $15/1M characters for TTS. Self-hosting Speaches on Railway eliminates these recurring costs entirely, making it cost-effective at any volume beyond a few hundred minutes per month.
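To make the break-even point concrete, the arithmetic can be sketched as follows (the $5/month figure is an assumed Railway instance cost for illustration, not a quoted price):

```python
WHISPER_API_PER_MINUTE = 0.006  # USD per audio minute, OpenAI Whisper API rate

def break_even_minutes(railway_monthly_usd: float) -> float:
    """Audio minutes per month at which a flat-rate deployment costs
    the same as the metered transcription API."""
    return railway_monthly_usd / WHISPER_API_PER_MINUTE

# An assumed ~$5/month instance breaks even at roughly 833 minutes/month
print(round(break_even_minutes(5.0)))  # 833
```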
Speaches vs LocalAI for Self-Hosted Audio
| Feature | Speaches | LocalAI |
|---|---|---|
| Focus | STT + TTS specialist | Multi-modal (LLM, images, audio, embeddings) |
| STT Engine | faster-whisper | whisper.cpp / faster-whisper |
| TTS Engines | Kokoro, Piper | Piper, Coqui, Kokoro |
| Realtime API | WebSocket at /v1/realtime | Not available |
| Model Management | Dynamic load/unload with TTL | Static configuration |
| Resource Usage | Lightweight (audio only) | Heavier (full LLM stack) |
Speaches is the better choice when you need a dedicated, lightweight audio server. LocalAI is better when you want a single server for LLMs, images, and audio combined.
FAQ for Speaches on Railway
What is Speaches and why self-host it? Speaches is an open-source server that provides OpenAI-compatible speech-to-text and text-to-speech APIs. Self-hosting gives you full data privacy, zero per-minute costs, and the ability to run audio processing without depending on cloud APIs.
What does this Railway template deploy? This template deploys a single Speaches container with CPU-optimized configuration, API key authentication, and a persistent volume for caching HuggingFace models. No database is required.
Why does the Speaches Railway template include a volume? The volume caches downloaded AI models (Whisper for STT, Kokoro/Piper for TTS) so they persist across redeployments. Without a volume, models would re-download on every container restart, adding minutes of cold-start delay.
How do I use Speaches as an OpenAI API drop-in replacement? Point your OpenAI SDK's base_url at your Speaches Railway URL (e.g. https://your-app.up.railway.app/v1) and set the SDK's API key to the same value as your API_KEY environment variable. The /v1/audio/transcriptions, /v1/audio/speech, and /v1/models endpoints are all compatible.
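A minimal sketch of the base-URL handling, with the standard openai SDK call shape shown in comments (the Whisper model ID is an example, and the URL is a placeholder):

```python
def openai_base_url(railway_url: str) -> str:
    """Normalize a Railway URL into the base_url an OpenAI-compatible SDK expects."""
    return railway_url.rstrip("/") + "/v1"

base_url = openai_base_url("https://your-app.up.railway.app/")

# Against a live deployment, with the openai package installed:
# from openai import OpenAI
# client = OpenAI(base_url=base_url, api_key="your-secret-key")
# with open("meeting.wav", "rb") as f:
#     result = client.audio.transcriptions.create(
#         model="Systran/faster-whisper-small", file=f
#     )
print(base_url)  # https://your-app.up.railway.app/v1
```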
Can I run Speaches on Railway without a GPU? Yes. This template uses the CPU-only image with int8 quantization for reduced memory usage. The small Whisper model processes audio at approximately 4x real-time speed on CPU. For production workloads requiring faster processing, consider a GPU-enabled host.
How do I preload models in self-hosted Speaches to avoid cold starts? Set the PRELOAD_MODELS environment variable to a JSON array of HuggingFace model IDs, for example ["Systran/faster-whisper-small","speaches-ai/Kokoro-82M-v1.0-ONNX"]. Models download during container startup and are ready for immediate use.