Deploy and Host Speaches: Self-Hosted STT/TTS API on Railway
Self-host OpenAI-compatible STT & TTS. No per-minute fee. Audio stays local
Deploy and Host Speaches on Railway

Speaches is an open-source, OpenAI API-compatible speech server: like Ollama, but for
audio models. It is a drop-in replacement for OpenAI's /v1/audio/transcriptions (STT) and
/v1/audio/speech (TTS) endpoints, powered by faster-whisper (up to 4× faster than standard
Whisper on CPU), Kokoro (ranked #1 on TTS Arena), and Piper (20+ languages). Everything
runs on your Railway instance, and no audio leaves your server.
Self-host on Railway for ~$5–10/month versus OpenAI Whisper API at $0.36/hour or ElevenLabs at $5–22/month with character caps. Zero per-minute fees. Unlimited transcriptions and speech generation at flat compute cost.
What This Template Deploys
| Service | Purpose |
|---|---|
| Speaches v0.9.0 | OpenAI-compatible STT/TTS server — faster-whisper transcription, Kokoro/Piper speech synthesis, Realtime WebSocket API, and Gradio web UI on port 8000 |
| HuggingFace Model Cache (/home/ubuntu/.cache/huggingface/hub) | Persistent volume: downloaded models survive redeploys, so nothing is re-downloaded on container restart |
Single-service architecture. No database, no Redis, no external services. Models download from HuggingFace on first use and are cached permanently on the volume.
About Hosting Speaches
Running a production STT/TTS server requires managing model downloads, CPU/GPU inference configuration, API authentication, and a public HTTPS endpoint for application integration. Without a managed host, you're configuring Docker, model storage volumes, SSL, and compute resource allocation manually.
Railway pre-configures Speaches with CPU-optimized int8 quantization, API key auth, a
persistent HuggingFace cache volume, and Gradio web UI — all at deploy time.
Typical cost: ~$5–10/month on Railway's Hobby plan. OpenAI Whisper API costs $0.36/hour — 30 hours of audio costs $10.80 before any TTS usage. ElevenLabs caps at 30,000 characters on the starter plan. Speaches gives you unlimited STT and TTS at flat compute pricing.
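The comparison above is simple arithmetic, and a short sketch makes the break-even point explicit. It uses only the figures quoted in this section ($0.36/hour metered, $5–10/month flat); the function names are illustrative, and a real Railway bill depends on actual resource usage.

```python
# Break-even between a flat monthly compute bill and per-hour API pricing.
# Rates are the figures quoted in this document, not live prices.

OPENAI_WHISPER_USD_PER_HOUR = 0.36


def api_cost(audio_hours: float, usd_per_hour: float = OPENAI_WHISPER_USD_PER_HOUR) -> float:
    """Cost of transcribing `audio_hours` of audio on a metered API."""
    return audio_hours * usd_per_hour


def break_even_hours(flat_monthly_usd: float, usd_per_hour: float = OPENAI_WHISPER_USD_PER_HOUR) -> float:
    """Hours of audio per month at which the flat bill equals the metered bill."""
    return flat_monthly_usd / usd_per_hour


if __name__ == "__main__":
    for flat in (5.0, 10.0):
        hours = break_even_hours(flat)
        print(f"${flat:.2f}/month flat breaks even at {hours:.1f} h of audio")
    print(f"30 h on the metered API: ${api_cost(30):.2f}")
```

Under these assumptions the flat plan pays for itself somewhere between roughly 14 and 28 hours of transcribed audio per month; below that volume the metered API is cheaper.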
Deploy in Under 5 Minutes
- Click Deploy on Railway; Speaches builds automatically (~2–3 minutes)
- Set API_KEY to a strong random string in the Variables tab; this secures all endpoints
- Open your Railway-assigned URL; the Gradio web UI loads immediately
- Test STT at /v1/audio/transcriptions and TTS at /v1/audio/speech with your API key
- Point any OpenAI-compatible SDK at your Railway URL; existing code works unchanged
No SSH. No model management. No infrastructure configuration.
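The TTS test in the steps above could look like the following sketch, written with the Python standard library only. The JSON fields (model, input, voice, response_format) follow OpenAI's speech endpoint. The Bearer header, the placeholder domain, and the model and voice names in the usage comment are assumptions, so substitute values from your own deployment.

```python
import json
import urllib.request


def build_speech_request(base_url: str, api_key: str, text: str,
                         model: str, voice: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style /v1/audio/speech endpoint."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "mp3",  # assumption: the server honors this OpenAI field
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: API_KEY is sent as a Bearer token
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(req: urllib.request.Request, out_path: str, timeout: float = 120.0) -> int:
    """Send the request and write the returned audio bytes; returns bytes written."""
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(out_path, "wb") as out:
        return out.write(resp.read())


# Hypothetical usage (domain, model ID, and voice are placeholders):
# req = build_speech_request("https://your-app.up.railway.app", "YOUR_API_KEY",
#                            "Hello from Speaches", "YOUR_TTS_MODEL_ID", "YOUR_VOICE")
# synthesize_to_file(req, "hello.mp3")
```

The first call may take noticeably longer than later ones, since the model is downloaded on first use.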
Common Use Cases
- Self-hosted alternative to OpenAI Whisper API: run faster-whisper locally at flat compute cost instead of $0.36/hour; existing code using OpenAI's audio endpoints works unchanged once you swap the base URL to your Railway domain
- Self-hosted alternative to ElevenLabs: generate unlimited speech with Kokoro (#1 TTS Arena) and Piper (20+ languages) without a $5–22/month subscription or character caps
- Voice-enabled AI agent audio layer: use Speaches as the STT/TTS backend for LLM-powered voice assistants; the Realtime WebSocket API at /v1/realtime enables two-way voice conversations with sub-second latency
- Meeting and call transcription: stream audio for real-time transcription in internal tools, call centres, or accessibility captioning pipelines without per-minute API billing
- Home Assistant private voice control: integrate via the wyoming_openai proxy for fully local, cloud-free voice commands; audio never leaves your network
- Multilingual audio content production: generate voiceovers for video, e-learning, audiobooks, and IVR systems using Piper's 20+ language voices at no per-character cost
Configuration
| Variable | Required | Description |
|---|---|---|
| API_KEY | Recommended | Secures all API endpoints; set before exposing your Railway URL publicly |
| ENABLE_UI | Optional | Set to false to disable the Gradio web UI in production; reduces memory usage |
| WHISPER__COMPUTE_TYPE | Optional | int8 (default, ~40% memory reduction), float16, or float32 |
| WHISPER__INFERENCE_DEVICE | Optional | cpu (default), cuda (requires GPU), or auto |
| STT_MODEL_TTL | Optional | Seconds before the STT model unloads from memory; -1 keeps it loaded permanently |
| TTS_MODEL_TTL | Optional | Seconds before the TTS model unloads from memory; -1 keeps it loaded permanently |
| PRELOAD_MODELS | Optional | JSON array of HuggingFace model IDs to download at startup; avoids cold-start delays |
| LOG_LEVEL | Optional | info for production, debug for troubleshooting |
PORT is injected automatically by Railway. The default Speaches port is 8000.
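As a rough illustration of the table above, the sketch below generates values you could paste into Railway's Variables tab. The variable names come from the table. The helper name, the 300-second TTL, and the example model ID are assumptions chosen for illustration.

```python
import json
import secrets


def railway_variables(preload_models: list[str], keep_loaded: bool = True) -> dict[str, str]:
    """Return a name -> value mapping for the variables described in the table."""
    # -1 keeps models in memory permanently (per the table); 300 is an arbitrary example TTL.
    ttl = "-1" if keep_loaded else "300"
    return {
        "API_KEY": secrets.token_urlsafe(32),          # strong random string
        "PRELOAD_MODELS": json.dumps(preload_models),  # must be a JSON array
        "STT_MODEL_TTL": ttl,
        "TTS_MODEL_TTL": ttl,
        "WHISPER__COMPUTE_TYPE": "int8",
        "LOG_LEVEL": "info",
    }


if __name__ == "__main__":
    # "Systran/faster-whisper-small" is used here only as an example model ID.
    for name, value in railway_variables(["Systran/faster-whisper-small"]).items():
        print(f"{name}={value}")
```

Serializing PRELOAD_MODELS with json.dumps avoids hand-written quoting mistakes, since the variable has to parse as a JSON array.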
Speaches vs. Managed STT/TTS APIs
| | Speaches (Railway) | OpenAI Whisper API | ElevenLabs | Deepgram |
|---|---|---|---|---|
| Monthly cost | ~$5–10 flat | $0.36/hour | From $5/month | From $0.0043/min |
| Per-minute/character fees | ✅ None | ❌ $0.36/hr | ❌ Character caps | ❌ Per minute |
| Audio leaves your server | ✅ Never | ❌ OpenAI servers | ❌ ElevenLabs servers | ❌ Deepgram servers |
| OpenAI API drop-in | ✅ Yes — same endpoints | ✅ Yes | ❌ No | ❌ No |
| STT (transcription) | ✅ faster-whisper | ✅ Whisper | ❌ TTS only | ✅ Yes |
| TTS (speech synthesis) | ✅ Kokoro + Piper | ✅ TTS-1 | ✅ Yes | ❌ STT only |
| Realtime WebSocket | ✅ /v1/realtime | ✅ Yes | ❌ No | ✅ Yes |
| Self-hostable | ✅ Yes | ❌ No | ❌ No | ❌ No |
| Multilingual (20+ langs) | ✅ Piper | ✅ 99 languages | ✅ Yes | ✅ Yes |
Dependencies for Speaches Hosting
- Railway account — Hobby plan (~$5–10/month) covers the service and model cache volume
- No external API keys required — models download from HuggingFace at runtime for free
- Optional: NVIDIA GPU for accelerated inference (not available on Railway Hobby plan)
Deployment Dependencies
- Speaches GitHub Repository — source and releases
- Speaches Documentation — full API and model configuration reference
- HuggingFace — faster-whisper models — STT model options
- HuggingFace — Kokoro TTS — TTS model
- Railway Volumes Documentation — persistent storage setup
Implementation Details
This template deploys ghcr.io/speaches-ai/speaches:0.9.0-rc.3-cpu with a persistent
Railway volume at /home/ubuntu/.cache/huggingface/hub. Models are downloaded from
HuggingFace on first request and cached permanently — redeploys and version updates do not
require re-downloading models. The CPU build uses int8 quantization by default, reducing
memory usage by approximately 40% vs full precision.
The API server exposes OpenAI-compatible endpoints: /v1/audio/transcriptions for STT,
/v1/audio/speech for TTS, and /v1/realtime for WebSocket-based two-way voice. Full
OpenAPI documentation is available at /docs on your Railway domain after deploy.
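For the STT endpoint described above, a transcription call is a multipart form upload. The sketch below builds one with the standard library, using the file and model form fields from OpenAI's transcription endpoint. The Bearer header, the JSON response with a text key, and the helper names are assumptions; check /docs on your deployment for the exact schema.

```python
import json
import urllib.request
import uuid


def build_transcription_request(base_url: str, api_key: str, filename: str,
                                audio_bytes: bytes, model: str) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: Bearer-token auth
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def transcribe(req: urllib.request.Request, timeout: float = 300.0) -> str:
    """Send the request; assumes an OpenAI-style JSON body with a "text" field."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["text"]


# Hypothetical usage (domain and model ID are placeholders):
# with open("meeting.wav", "rb") as f:
#     req = build_transcription_request("https://your-app.up.railway.app",
#                                       "YOUR_API_KEY", "meeting.wav", f.read(),
#                                       "YOUR_STT_MODEL_ID")
# print(transcribe(req))
```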
Frequently Asked Questions
How much does Speaches cost on Railway vs OpenAI Whisper API?
Speaches on Railway runs at ~$5–10/month flat with unlimited transcriptions and speech generation. OpenAI Whisper API charges $0.36/hour, so 30 hours of audio costs $10.80 before any TTS usage. ElevenLabs starts at $5/month but caps characters. With Speaches, usage is bounded only by your compute, and no audio leaves your server.
Is Speaches a drop-in replacement for OpenAI's audio API?
Yes. Speaches implements the same /v1/audio/transcriptions and /v1/audio/speech endpoints
as OpenAI. Change the base URL in your OpenAI SDK to your Railway domain and set your API
key — existing code works unchanged without any other modifications.
Do my audio files leave my Railway instance?
No. All transcription and speech synthesis runs inside your Railway container using local model inference, and no audio data is sent to any external API. This suits internal voice tools and privacy-sensitive applications; for HIPAA-regulated audio, also confirm that your hosting arrangement meets your compliance requirements.
How long does the first transcription take?
The first request triggers a model download from HuggingFace: faster-whisper tiny is
about 75 MB, small about 244 MB, and large-v3 about 1.5 GB. Subsequent requests use the
cached model and skip the download. Use PRELOAD_MODELS to pre-download models at startup
and avoid the cold-start delay on the first request.
What is the difference between Kokoro and Piper for TTS?
Kokoro is ranked #1 on TTS Arena for voice quality, which makes it the better choice for English speech where naturalness matters. Piper is optimised for speed and supports 20+ languages, which makes it the better choice for multilingual or latency-sensitive deployments. Both are available in the same Speaches instance and selectable per API request.
Do I lose my downloaded models if Railway redeploys?
No. All downloaded models are stored on the Railway persistent volume at
/home/ubuntu/.cache/huggingface/hub, not inside the container. Redeploys, version updates,
and container restarts do not require re-downloading models.
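To see what is actually stored on the volume, one option is to read the cache directory directly. The sketch below relies on the HuggingFace hub cache convention of naming each model folder models--{org}--{name}. The helper name is hypothetical, and you would run it from a shell inside the container.

```python
from pathlib import Path

# Mount path of the persistent volume, as stated in this document.
DEFAULT_CACHE = "/home/ubuntu/.cache/huggingface/hub"


def cached_models(cache_dir: str = DEFAULT_CACHE) -> list[str]:
    """Return cached model IDs (e.g. "org/name"), sorted; empty if the dir is missing."""
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    prefix = "models--"
    ids = []
    for entry in sorted(root.iterdir()):
        # Skip non-model entries such as lock directories or dataset caches.
        if entry.is_dir() and entry.name.startswith(prefix):
            # Folder names encode "/" as "--"; reverse that to recover the ID.
            ids.append(entry.name[len(prefix):].replace("--", "/"))
    return ids


if __name__ == "__main__":
    for model_id in cached_models():
        print(model_id)
```

An empty result right after the first deploy is expected, because models are only fetched on first use or through PRELOAD_MODELS.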
Why Deploy and Host Speaches on Railway?
Railway is a singular platform to deploy your infrastructure stack. Railway will host your infrastructure so you don't have to deal with configuration, while allowing you to vertically and horizontally scale it.
By deploying Speaches on Railway, you get a fully OpenAI-compatible STT/TTS server — local faster-whisper transcription, Kokoro and Piper speech synthesis, Realtime WebSocket support, and persistent model caching — at ~$5–10/month flat with no per-minute fees and no audio ever leaving your infrastructure.
