
Deploy Ollama — Private LLM & Embeddings API on Railway
Self-host Ollama: private LLM & embeddings API, OpenAI-compatible. No GPU.
Open WebUI
Just deployed
/app/backend/data
Ollama
Just deployed
/root/.ollama
Deploy and Host Ollama on Railway

Ollama is the simplest way to run open large language models through a clean REST API — 150k+ GitHub stars. Pull a model with one command and serve Llama 3.2, Phi-4, Gemma, Mistral, Qwen, or embedding models over an OpenAI-compatible endpoint. This template gives you a private, always-on Ollama API on Railway that your other services can call over the private network — no local machine kept running, no third-party inference bill.
Best for small models and embeddings. Railway runs CPU inference (no GPU), so this template shines as a private endpoint for embedding models and small chat models (1B–8B) — powering RAG, search, and lightweight agents at a flat ~$5–20/month instead of per-token API fees.
What This Template Deploys
| Service | Purpose |
|---|---|
| Ollama | The model server and REST API on port 11434 — OpenAI-compatible /v1 endpoint plus native /api routes for generation, chat, and embeddings |
| Persistent Volume | Mounted at /root/.ollama — pulled models are cached so they don't re-download on every redeploy |
Single-service, self-contained. Add the volume and models persist across redeploys. Connect from other Railway services over the private network — no public exposure of the model API required.
About Hosting Ollama
Running Ollama locally means keeping your own machine on 24/7 for anything that needs the API. Hosting it on Railway gives you an always-on inference endpoint your apps can hit any time — without a GPU cloud bill or per-token SaaS pricing for the many workloads that small models handle perfectly well.
Railway runs CPU-based inference (no GPU available), so this template is built for the workloads
that thrive there: embedding models (nomic-embed-text, mxbai-embed-large) for RAG and
semantic search, and small chat models (Llama 3.2 1B/3B, Phi, Gemma 2B, Qwen 0.5–7B) for
classification, extraction, and lightweight agents. Large models (30B+) will run but slowly —
match your model choice to CPU inference and this is a genuinely useful private endpoint.
Typical cost: ~$5–20/month on Railway depending on model size and RAM. Compared to OpenAI's embedding and small-model API fees at scale, a flat-rate private Ollama endpoint pays for itself fast on high-volume embedding and classification workloads.
Deploy in Under 3 Minutes
- Click Deploy on Railway — Ollama builds automatically (~1–2 minutes)
- Add a persistent volume at
/root/.ollamaso pulled models survive redeploys - Pull a model — POST to
/api/pullwith{"name": "llama3.2:3b"}, or execollama pull nomic-embed-text - Call the API:
POST /api/generate,/api/chat,/api/embeddings, or the OpenAI-compatible/v1 - From other Railway services, use
http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434privately
No GPU setup. No CUDA. No local machine kept running.
Common Use Cases
- Private embeddings API for RAG — serve
nomic-embed-textormxbai-embed-largeto your vector pipeline over the private network; no per-embedding OpenAI fees - Small-model inference endpoint — run Llama 3.2, Phi, Gemma, or Qwen for classification, extraction, summarization, and routing where a small model is enough
- OpenAI-compatible drop-in for cost control — point any OpenAI SDK at the
/v1endpoint and swapgpt-4o-mini-class calls for a flat-rate self-hosted small model - Backend for Open WebUI, n8n, or LangChain — pair with a chat UI or workflow tool on Railway; they call Ollama over the private network for local inference
- Always-on API without keeping your laptop on — move local Ollama workflows to an always-available Railway endpoint your apps can reach 24/7
- Private, no-log inference — prompts and completions stay on your Railway instance; nothing sent to a third-party model provider
Configuration
| Variable | Required | Description |
|---|---|---|
OLLAMA_HOST | ✅ Pre-set | 0.0.0.0:11434 — required so the API is reachable, not just localhost |
OLLAMA_MODELS | Pre-set | /root/.ollama/models — model cache path on the persistent volume |
OLLAMA_KEEP_ALIVE | Optional | How long a model stays loaded in memory — e.g. 5m or -1 to keep loaded |
OLLAMA_MAX_LOADED_MODELS | Optional | Cap concurrently loaded models to control RAM use |
OLLAMA_NUM_PARALLEL | Optional | Parallel request slots per model |
PORT | Auto-set | Railway injects the port; Ollama serves on 11434 |
Match RAM to your model: a 3B model needs ~4 GB, a 7–8B model ~8 GB. Set
OLLAMA_HOSTto0.0.0.0(pre-set here) or the API only listens on localhost and external calls fail — the single most common Ollama hosting mistake.
Ollama on Railway vs. Alternatives
| Ollama (Railway) | OpenAI API | GPU cloud (Ollama) | Local Ollama | |
|---|---|---|---|---|
| Pricing | ~$5–20/mo flat | Per token | GPU $/hr | Free (your hardware) |
| Always-on API | ✅ Yes | ✅ Yes | ✅ Yes | ❌ Machine must run |
| Data privacy | ✅ Your instance | ❌ OpenAI | ✅ Your instance | ✅ Local |
| Large models (30B+) | ⚠️ Slow (CPU) | ✅ Yes | ✅ Fast (GPU) | ⚠️ Needs strong GPU |
| Small models / embeddings | ✅ Great fit | ✅ Yes | ✅ Overkill | ✅ Yes |
| Private network to your apps | ✅ Railway internal | ❌ Public API | ⚠️ Varies | ❌ No |
| Open source | ✅ MIT | ❌ No | ✅ MIT | ✅ MIT |
Dependencies for Ollama Hosting
- Railway account — size RAM to your model (4 GB for 3B, 8 GB for 7–8B); ~$5–20/month
- A persistent volume at
/root/.ollamaso pulled models survive redeploys - Any Ollama or OpenAI-compatible client — no special SDK required
Deployment Dependencies
- Ollama GitHub Repository — source and releases
- Ollama Model Library — browse pullable models and sizes
- Ollama API Documentation — API reference
- Railway Volumes Documentation — model persistence
Implementation Details
This template deploys the official ollama/ollama image with OLLAMA_HOST=0.0.0.0:11434 so the
API accepts external and private-network calls, and a persistent volume at /root/.ollama so
pulled models are cached across redeploys. The API exposes native /api/generate, /api/chat,
and /api/embeddings routes plus an OpenAI-compatible /v1 surface.
Railway provides CPU inference — no GPU — so match model choice to that: embedding models and
small chat models (1B–8B) run well; 30B+ models run but slowly. Connect from other Railway
services privately via http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434, keeping the model API
off the public internet entirely.
Frequently Asked Questions
Can I run large models like Llama 70B on this? Technically yes, but Railway is CPU-only (no GPU), so large models run slowly. This template is built for embedding models and small chat models (1B–8B), which run well on CPU. For large-model GPU speed, use a GPU cloud; for embeddings, RAG, and lightweight inference, this is a great, cost-effective fit.
Which models should I use for best performance?
Embeddings: nomic-embed-text, mxbai-embed-large. Small chat/instruct: Llama 3.2 1B/3B, Phi,
Gemma 2B, Qwen 0.5–7B. These give responsive CPU inference. Size your Railway RAM to the model —
about 4 GB for a 3B model, 8 GB for a 7–8B model.
Do pulled models survive a redeploy?
Only with the persistent volume at /root/.ollama. Without it, models re-download on every
deploy. With the volume mounted, your pulled models are cached and persist across redeploys.
Can other Railway services use this Ollama privately?
Yes. Call http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434 from any service in the same project
over Railway's private network — your app, Open WebUI, n8n, or a LangChain pipeline can use
Ollama without exposing the model API publicly.
Is it OpenAI-compatible?
Yes. Ollama exposes an OpenAI-compatible /v1 endpoint. Point any OpenAI SDK at your Railway
Ollama URL and swap small-model calls to self-hosted inference with minimal code change.
Why can't external apps reach the API?
Almost always because OLLAMA_HOST isn't set to 0.0.0.0 — by default Ollama listens on
localhost only. This template pre-sets OLLAMA_HOST=0.0.0.0:11434 so the API is reachable from
the start.
Why Deploy and Host Ollama on Railway?
Railway is a singular platform to deploy your infrastructure stack. Railway will host your infrastructure so you don't have to deal with configuration, while allowing you to vertically and horizontally scale it.
By deploying Ollama on Railway, you get a private, always-on LLM and embeddings API — OpenAI- compatible, callable over the private network, with models cached on a persistent volume — at a flat ~$5–20/month, ideal for RAG, embeddings, and small-model inference without per-token fees.
Template Content
Open WebUI
open-webui/open-webuiOllama
ollama/ollama