Deploy and Host Ollama on Railway

Ollama local LLM API server

Ollama is the simplest way to run open large language models through a clean REST API — 150k+ GitHub stars. Pull a model with one command and serve Llama 3.2, Phi-4, Gemma, Mistral, Qwen, or embedding models over an OpenAI-compatible endpoint. This template gives you a private, always-on Ollama API on Railway that your other services can call over the private network — no local machine kept running, no third-party inference bill.

Best for small models and embeddings. Railway runs CPU inference (no GPU), so this template shines as a private endpoint for embedding models and small chat models (1B–8B) — powering RAG, search, and lightweight agents at a flat ~$5–20/month instead of per-token API fees.

What This Template Deploys

Service	Purpose
Ollama	The model server and REST API on port `11434` — OpenAI-compatible `/v1` endpoint plus native `/api` routes for generation, chat, and embeddings
Persistent Volume	Mounted at `/root/.ollama` — pulled models are cached so they don't re-download on every redeploy

Single-service, self-contained. Add the volume and models persist across redeploys. Connect from other Railway services over the private network — no public exposure of the model API required.

About Hosting Ollama

Running Ollama locally means keeping your own machine on 24/7 for anything that needs the API. Hosting it on Railway gives you an always-on inference endpoint your apps can hit any time — without a GPU cloud bill or per-token SaaS pricing for the many workloads that small models handle perfectly well.

Railway runs CPU-based inference (no GPU available), so this template is built for the workloads that thrive there: embedding models (nomic-embed-text, mxbai-embed-large) for RAG and semantic search, and small chat models (Llama 3.2 1B/3B, Phi, Gemma 2B, Qwen 0.5–7B) for classification, extraction, and lightweight agents. Large models (30B+) will run but slowly — match your model choice to CPU inference and this is a genuinely useful private endpoint.

Typical cost: ~$5–20/month on Railway depending on model size and RAM. Compared to OpenAI's embedding and small-model API fees at scale, a flat-rate private Ollama endpoint pays for itself fast on high-volume embedding and classification workloads.

Deploy in Under 3 Minutes

Click Deploy on Railway — Ollama builds automatically (~1–2 minutes)
Add a persistent volume at /root/.ollama so pulled models survive redeploys
Pull a model — POST to /api/pull with {"name": "llama3.2:3b"}, or exec ollama pull nomic-embed-text
Call the API: POST /api/generate, /api/chat, /api/embeddings, or the OpenAI-compatible /v1
From other Railway services, use http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434 privately

No GPU setup. No CUDA. No local machine kept running.

Common Use Cases

Private embeddings API for RAG — serve nomic-embed-text or mxbai-embed-large to your vector pipeline over the private network; no per-embedding OpenAI fees
Small-model inference endpoint — run Llama 3.2, Phi, Gemma, or Qwen for classification, extraction, summarization, and routing where a small model is enough
OpenAI-compatible drop-in for cost control — point any OpenAI SDK at the /v1 endpoint and swap gpt-4o-mini-class calls for a flat-rate self-hosted small model
Backend for Open WebUI, n8n, or LangChain — pair with a chat UI or workflow tool on Railway; they call Ollama over the private network for local inference
Always-on API without keeping your laptop on — move local Ollama workflows to an always-available Railway endpoint your apps can reach 24/7
Private, no-log inference — prompts and completions stay on your Railway instance; nothing sent to a third-party model provider

Configuration

Variable	Required	Description
`OLLAMA_HOST`	✅ Pre-set	`0.0.0.0:11434` — required so the API is reachable, not just localhost
`OLLAMA_MODELS`	Pre-set	`/root/.ollama/models` — model cache path on the persistent volume
`OLLAMA_KEEP_ALIVE`	Optional	How long a model stays loaded in memory — e.g. `5m` or `-1` to keep loaded
`OLLAMA_MAX_LOADED_MODELS`	Optional	Cap concurrently loaded models to control RAM use
`OLLAMA_NUM_PARALLEL`	Optional	Parallel request slots per model
`PORT`	Auto-set	Railway injects the port; Ollama serves on `11434`

Match RAM to your model: a 3B model needs ~4 GB, a 7–8B model ~8 GB. Set OLLAMA_HOST to 0.0.0.0 (pre-set here) or the API only listens on localhost and external calls fail — the single most common Ollama hosting mistake.

Ollama on Railway vs. Alternatives

	Ollama (Railway)	OpenAI API	GPU cloud (Ollama)	Local Ollama
Pricing	~$5–20/mo flat	Per token	GPU $/hr	Free (your hardware)
Always-on API	✅ Yes	✅ Yes	✅ Yes	❌ Machine must run
Data privacy	✅ Your instance	❌ OpenAI	✅ Your instance	✅ Local
Large models (30B+)	⚠️ Slow (CPU)	✅ Yes	✅ Fast (GPU)	⚠️ Needs strong GPU
Small models / embeddings	✅ Great fit	✅ Yes	✅ Overkill	✅ Yes
Private network to your apps	✅ Railway internal	❌ Public API	⚠️ Varies	❌ No
Open source	✅ MIT	❌ No	✅ MIT	✅ MIT

Dependencies for Ollama Hosting

Railway account — size RAM to your model (4 GB for 3B, 8 GB for 7–8B); ~$5–20/month
A persistent volume at /root/.ollama so pulled models survive redeploys
Any Ollama or OpenAI-compatible client — no special SDK required

Deployment Dependencies

Ollama GitHub Repository — source and releases
Ollama Model Library — browse pullable models and sizes
Ollama API Documentation — API reference
Railway Volumes Documentation — model persistence

Implementation Details

This template deploys the official ollama/ollama image with OLLAMA_HOST=0.0.0.0:11434 so the API accepts external and private-network calls, and a persistent volume at /root/.ollama so pulled models are cached across redeploys. The API exposes native /api/generate, /api/chat, and /api/embeddings routes plus an OpenAI-compatible /v1 surface.

Railway provides CPU inference — no GPU — so match model choice to that: embedding models and small chat models (1B–8B) run well; 30B+ models run but slowly. Connect from other Railway services privately via http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434, keeping the model API off the public internet entirely.

Frequently Asked Questions

Can I run large models like Llama 70B on this? Technically yes, but Railway is CPU-only (no GPU), so large models run slowly. This template is built for embedding models and small chat models (1B–8B), which run well on CPU. For large-model GPU speed, use a GPU cloud; for embeddings, RAG, and lightweight inference, this is a great, cost-effective fit.

Which models should I use for best performance? Embeddings: nomic-embed-text, mxbai-embed-large. Small chat/instruct: Llama 3.2 1B/3B, Phi, Gemma 2B, Qwen 0.5–7B. These give responsive CPU inference. Size your Railway RAM to the model — about 4 GB for a 3B model, 8 GB for a 7–8B model.

Do pulled models survive a redeploy? Only with the persistent volume at /root/.ollama. Without it, models re-download on every deploy. With the volume mounted, your pulled models are cached and persist across redeploys.

Can other Railway services use this Ollama privately? Yes. Call http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434 from any service in the same project over Railway's private network — your app, Open WebUI, n8n, or a LangChain pipeline can use Ollama without exposing the model API publicly.

Is it OpenAI-compatible? Yes. Ollama exposes an OpenAI-compatible /v1 endpoint. Point any OpenAI SDK at your Railway Ollama URL and swap small-model calls to self-hosted inference with minimal code change.

Why can't external apps reach the API? Almost always because OLLAMA_HOST isn't set to 0.0.0.0 — by default Ollama listens on localhost only. This template pre-sets OLLAMA_HOST=0.0.0.0:11434 so the API is reachable from the start.

Why Deploy and Host Ollama on Railway?

Railway is a singular platform to deploy your infrastructure stack. Railway will host your infrastructure so you don't have to deal with configuration, while allowing you to vertically and horizontally scale it.

By deploying Ollama on Railway, you get a private, always-on LLM and embeddings API — OpenAI- compatible, callable over the private network, with models cached on a persistent volume — at a flat ~$5–20/month, ideal for RAG, embeddings, and small-model inference without per-token fees.