Deploy Speaches | Open-Source OpenAI Whisper Alternative on Railway
Self Host Speaches — OpenAI-compatible STT/TTS API server
Deploy and Host Speaches on Railway
Deploy Speaches on Railway to run a fully self-hosted, OpenAI API-compatible speech-to-text and text-to-speech server. Speaches uses faster-whisper for transcription and Kokoro/Piper for speech synthesis — like Ollama, but for audio models. This template pre-configures Speaches with CPU-optimized int8 quantization, API key authentication, a persistent volume for model caching, and the Gradio web UI for interactive testing.
Self-host Speaches to process audio without sending data to third-party APIs. The deployment includes a single Speaches service with a HuggingFace model cache volume — no database required.

Getting Started with Speaches on Railway
After deployment completes, open your Railway-generated URL to access the Speaches Gradio web UI. The web UI lets you test speech-to-text transcription and text-to-speech generation directly in your browser. To use the API, send requests to /v1/audio/transcriptions (STT) or /v1/audio/speech (TTS) with your API key in the Authorization: Bearer header. Models download automatically on first use and are cached in the persistent volume. Check the /docs endpoint for the full OpenAPI specification. Point any OpenAI-compatible SDK at your Speaches URL to start integrating.
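As a sketch of the request shape described above (the Kokoro model ID and voice name below are illustrative examples, and your Railway URL and key will differ), a text-to-speech request can be built with the Python standard library:

```python
import json
import urllib.request

BASE_URL = "https://your-app.up.railway.app"  # your Railway-generated URL
API_KEY = "your-secret-key"                   # must match the API_KEY variable

def auth_headers(api_key: str) -> dict:
    """Bearer-token header Speaches expects when API_KEY is set."""
    return {"Authorization": f"Bearer {api_key}"}

def speech_request(text: str,
                   model: str = "speaches-ai/Kokoro-82M-v1.0-ONNX",
                   voice: str = "af_heart") -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible /v1/audio/speech endpoint."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=body,
        headers={**auth_headers(API_KEY), "Content-Type": "application/json"},
        method="POST",
    )

req = speech_request("Hello from Railway")
# urllib.request.urlopen(req) against a live deployment returns audio bytes.
```

The same header construction applies to /v1/audio/transcriptions, which instead takes a multipart file upload.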
About Hosting Speaches
Speaches is an open-source (MIT) server that provides OpenAI-compatible speech-to-text and text-to-speech APIs. It dynamically loads and unloads models on demand — specify the model in your API request and Speaches downloads it from HuggingFace, runs inference, and optionally offloads it after a configurable TTL.
- STT via faster-whisper (4x faster than standard Whisper on CPU)
- TTS via Kokoro (#1 ranked on TTS Arena) and Piper (20+ languages)
- Realtime WebSocket API at /v1/realtime for two-way voice conversations
- Streaming transcription via Server-Sent Events
- Dynamic model management — models auto-download and auto-unload based on TTL settings
Why Deploy Speaches on Railway
Railway handles infrastructure so you can focus on building voice-enabled applications.
- Full data privacy — audio never leaves your server
- Zero per-minute costs — pay only for Railway infrastructure
- OpenAI API drop-in replacement — existing code works unchanged
- Persistent model cache survives redeployments
- API key authentication included out of the box
Common Use Cases for Self-Hosted Speaches
- Voice-enabled AI agents: Use as the audio layer for LLM-powered assistants with real-time WebSocket support for two-way conversations
- Meeting and call transcription: Stream audio for real-time transcription in internal tools, call centers, or accessibility captioning
- Home automation voice control: Integrate with Home Assistant via the wyoming_openai proxy for private, cloud-free voice commands
- Multilingual content production: Generate voiceovers for videos, e-learning, audiobooks, or IVR systems using Kokoro and Piper TTS models
Dependencies for Speaches on Railway
- Speaches — ghcr.io/speaches-ai/speaches:0.9.0-rc.3-cpu (CPU-only build, ~1.2 GB image)
- Volume — HuggingFace model cache at /home/ubuntu/.cache/huggingface/hub
Environment Variables Reference for Speaches
| Variable | Description | Default |
|---|---|---|
| API_KEY | API key for endpoint authentication | (none — open) |
| ENABLE_UI | Enable Gradio web interface | true |
| WHISPER__COMPUTE_TYPE | Quantization type (int8 reduces memory ~40%) | default |
| WHISPER__INFERENCE_DEVICE | Inference device (cpu/cuda/auto) | auto |
| STT_MODEL_TTL | Seconds before STT model unloads (-1 = never) | 300 |
| TTS_MODEL_TTL | Seconds before TTS model unloads (-1 = never) | 300 |
| PRELOAD_MODELS | JSON array of model IDs to download at startup | [] |
| LOG_LEVEL | Logging level (debug/info/warning/error) | debug |
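Putting the variables above together, one plausible production-leaning configuration might look like this (the values here are illustrative, not the template defaults):

```shell
API_KEY='your-secret-key'
WHISPER__COMPUTE_TYPE='int8'
WHISPER__INFERENCE_DEVICE='cpu'
STT_MODEL_TTL='-1'   # keep the STT model resident instead of unloading
PRELOAD_MODELS='["Systran/faster-whisper-small"]'
LOG_LEVEL='info'
```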
Deployment Dependencies
- Runtime: Python 3.x with Uvicorn ASGI server
- Models: Downloaded from HuggingFace Hub at runtime
- GitHub: speaches-ai/speaches
- Docs: speaches.ai
Hardware Requirements for Self-Hosting Speaches
| Resource | Minimum (tiny/base models) | Recommended (small + Kokoro) |
|---|---|---|
| CPU | 2 vCPU | 4+ vCPU |
| RAM | 1 GB | 4 GB |
| Storage | 500 MB (model cache) | 5 GB (multiple models) |
| Runtime | Docker | Docker |
For the large-v3 Whisper model, allocate 8 GB RAM. Use WHISPER__COMPUTE_TYPE=int8 to reduce memory by approximately 40% on CPU deployments.
Self-Hosting Speaches
Pull and run the CPU image with Docker:
```shell
docker run -d \
  -p 8000:8000 \
  -v speaches-cache:/home/ubuntu/.cache/huggingface/hub \
  -e API_KEY=your-secret-key \
  -e WHISPER__COMPUTE_TYPE=int8 \
  ghcr.io/speaches-ai/speaches:0.9.0-rc.3-cpu
```
Or use docker-compose for a persistent setup:
```yaml
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:0.9.0-rc.3-cpu
    ports:
      - "8000:8000"
    environment:
      - API_KEY=your-secret-key
      - WHISPER__COMPUTE_TYPE=int8
      - WHISPER__INFERENCE_DEVICE=cpu
      - STT_MODEL_TTL=-1
    volumes:
      - hf-cache:/home/ubuntu/.cache/huggingface/hub
volumes:
  hf-cache:
```
How Much Does Speaches Cost to Self-Host?
Speaches is free and open-source under the MIT license — no per-minute or per-character fees. On Railway, you pay only for compute and storage. Cloud alternatives like OpenAI Whisper API charge $0.006/minute for transcription and $15/1M characters for TTS. Self-hosting Speaches on Railway eliminates these recurring costs entirely, making it cost-effective at any volume beyond a few hundred minutes per month.
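To make the break-even point concrete, the arithmetic can be sketched as follows (the $5/month figure is an assumed Railway instance cost for illustration, not a quoted price):

```python
WHISPER_API_PER_MINUTE = 0.006  # USD per audio minute, OpenAI Whisper API rate

def break_even_minutes(railway_monthly_usd: float) -> float:
    """Audio minutes per month at which a flat-rate deployment costs
    the same as the metered transcription API."""
    return railway_monthly_usd / WHISPER_API_PER_MINUTE

# An assumed ~$5/month instance breaks even at roughly 833 minutes/month
print(round(break_even_minutes(5.0)))  # 833
```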
Speaches vs LocalAI for Self-Hosted Audio
| Feature | Speaches | LocalAI |
|---|---|---|
| Focus | STT + TTS specialist | Multi-modal (LLM, images, audio, embeddings) |
| STT Engine | faster-whisper | whisper.cpp / faster-whisper |
| TTS Engines | Kokoro, Piper | Piper, Coqui, Kokoro |
| Realtime API | WebSocket at /v1/realtime | Not available |
| Model Management | Dynamic load/unload with TTL | Static configuration |
| Resource Usage | Lightweight (audio only) | Heavier (full LLM stack) |
Speaches is the better choice when you need a dedicated, lightweight audio server. LocalAI is better when you want a single server for LLMs, images, and audio combined.
FAQ for Speaches on Railway
What is Speaches and why self-host it? Speaches is an open-source server that provides OpenAI-compatible speech-to-text and text-to-speech APIs. Self-hosting gives you full data privacy, zero per-minute costs, and the ability to run audio processing without depending on cloud APIs.
What does this Railway template deploy? This template deploys a single Speaches container with CPU-optimized configuration, API key authentication, and a persistent volume for caching HuggingFace models. No database is required.
Why does the Speaches Railway template include a volume? The volume caches downloaded AI models (Whisper for STT, Kokoro/Piper for TTS) so they persist across redeployments. Without a volume, models would re-download on every container restart, adding minutes of cold-start delay.
How do I use Speaches as an OpenAI API drop-in replacement? Point your OpenAI SDK's base_url at your Speaches Railway URL (e.g. https://your-app.up.railway.app/v1) and set the SDK's API key to the same value as your API_KEY environment variable. The /v1/audio/transcriptions, /v1/audio/speech, and /v1/models endpoints are all compatible.
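A minimal sketch of the base-URL handling, with the standard openai SDK call shape shown in comments (the Whisper model ID is an example, and the URL is a placeholder):

```python
def openai_base_url(railway_url: str) -> str:
    """Normalize a Railway URL into the base_url an OpenAI-compatible SDK expects."""
    return railway_url.rstrip("/") + "/v1"

base_url = openai_base_url("https://your-app.up.railway.app/")

# Against a live deployment, with the openai package installed:
# from openai import OpenAI
# client = OpenAI(base_url=base_url, api_key="your-secret-key")
# with open("meeting.wav", "rb") as f:
#     result = client.audio.transcriptions.create(
#         model="Systran/faster-whisper-small", file=f
#     )
print(base_url)  # https://your-app.up.railway.app/v1
```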
Can I run Speaches on Railway without a GPU? Yes. This template uses the CPU-only image with int8 quantization for reduced memory usage. The small Whisper model processes audio at approximately 4x real-time speed on CPU. For production workloads requiring faster processing, consider a GPU-enabled host.
How do I preload models in self-hosted Speaches to avoid cold starts? Set the PRELOAD_MODELS environment variable to a JSON array of HuggingFace model IDs, for example ["Systran/faster-whisper-small","speaches-ai/Kokoro-82M-v1.0-ONNX"]. Models download during container startup and are ready for immediate use.