Railway

Deploy Ollama — Private LLM & Embeddings API on Railway

Self-host Ollama: private LLM & embeddings API, OpenAI-compatible. No GPU.

Deploy Ollama — Private LLM & Embeddings API on Railway

Just deployed

/app/backend/data

Just deployed

/root/.ollama

Deploy and Host Ollama on Railway

Ollama local LLM API server

Ollama is the simplest way to run open large language models through a clean REST API — 150k+ GitHub stars. Pull a model with one command and serve Llama 3.2, Phi-4, Gemma, Mistral, Qwen, or embedding models over an OpenAI-compatible endpoint. This template gives you a private, always-on Ollama API on Railway that your other services can call over the private network — no local machine kept running, no third-party inference bill.

Best for small models and embeddings. Railway runs CPU inference (no GPU), so this template shines as a private endpoint for embedding models and small chat models (1B–8B) — powering RAG, search, and lightweight agents at a flat ~$5–20/month instead of per-token API fees.


What This Template Deploys

ServicePurpose
OllamaThe model server and REST API on port 11434 — OpenAI-compatible /v1 endpoint plus native /api routes for generation, chat, and embeddings
Persistent VolumeMounted at /root/.ollama — pulled models are cached so they don't re-download on every redeploy

Single-service, self-contained. Add the volume and models persist across redeploys. Connect from other Railway services over the private network — no public exposure of the model API required.


About Hosting Ollama

Running Ollama locally means keeping your own machine on 24/7 for anything that needs the API. Hosting it on Railway gives you an always-on inference endpoint your apps can hit any time — without a GPU cloud bill or per-token SaaS pricing for the many workloads that small models handle perfectly well.

Railway runs CPU-based inference (no GPU available), so this template is built for the workloads that thrive there: embedding models (nomic-embed-text, mxbai-embed-large) for RAG and semantic search, and small chat models (Llama 3.2 1B/3B, Phi, Gemma 2B, Qwen 0.5–7B) for classification, extraction, and lightweight agents. Large models (30B+) will run but slowly — match your model choice to CPU inference and this is a genuinely useful private endpoint.

Typical cost: ~$5–20/month on Railway depending on model size and RAM. Compared to OpenAI's embedding and small-model API fees at scale, a flat-rate private Ollama endpoint pays for itself fast on high-volume embedding and classification workloads.


Deploy in Under 3 Minutes

  1. Click Deploy on Railway — Ollama builds automatically (~1–2 minutes)
  2. Add a persistent volume at /root/.ollama so pulled models survive redeploys
  3. Pull a model — POST to /api/pull with {"name": "llama3.2:3b"}, or exec ollama pull nomic-embed-text
  4. Call the API: POST /api/generate, /api/chat, /api/embeddings, or the OpenAI-compatible /v1
  5. From other Railway services, use http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434 privately

No GPU setup. No CUDA. No local machine kept running.


Common Use Cases

  • Private embeddings API for RAG — serve nomic-embed-text or mxbai-embed-large to your vector pipeline over the private network; no per-embedding OpenAI fees
  • Small-model inference endpoint — run Llama 3.2, Phi, Gemma, or Qwen for classification, extraction, summarization, and routing where a small model is enough
  • OpenAI-compatible drop-in for cost control — point any OpenAI SDK at the /v1 endpoint and swap gpt-4o-mini-class calls for a flat-rate self-hosted small model
  • Backend for Open WebUI, n8n, or LangChain — pair with a chat UI or workflow tool on Railway; they call Ollama over the private network for local inference
  • Always-on API without keeping your laptop on — move local Ollama workflows to an always-available Railway endpoint your apps can reach 24/7
  • Private, no-log inference — prompts and completions stay on your Railway instance; nothing sent to a third-party model provider

Configuration

VariableRequiredDescription
OLLAMA_HOST✅ Pre-set0.0.0.0:11434 — required so the API is reachable, not just localhost
OLLAMA_MODELSPre-set/root/.ollama/models — model cache path on the persistent volume
OLLAMA_KEEP_ALIVEOptionalHow long a model stays loaded in memory — e.g. 5m or -1 to keep loaded
OLLAMA_MAX_LOADED_MODELSOptionalCap concurrently loaded models to control RAM use
OLLAMA_NUM_PARALLELOptionalParallel request slots per model
PORTAuto-setRailway injects the port; Ollama serves on 11434

Match RAM to your model: a 3B model needs ~4 GB, a 7–8B model ~8 GB. Set OLLAMA_HOST to 0.0.0.0 (pre-set here) or the API only listens on localhost and external calls fail — the single most common Ollama hosting mistake.


Ollama on Railway vs. Alternatives

Ollama (Railway)OpenAI APIGPU cloud (Ollama)Local Ollama
Pricing~$5–20/mo flatPer tokenGPU $/hrFree (your hardware)
Always-on API✅ Yes✅ Yes✅ Yes❌ Machine must run
Data privacy✅ Your instance❌ OpenAI✅ Your instance✅ Local
Large models (30B+)⚠️ Slow (CPU)✅ Yes✅ Fast (GPU)⚠️ Needs strong GPU
Small models / embeddings✅ Great fit✅ Yes✅ Overkill✅ Yes
Private network to your apps✅ Railway internal❌ Public API⚠️ Varies❌ No
Open source✅ MIT❌ No✅ MIT✅ MIT

Dependencies for Ollama Hosting

  • Railway account — size RAM to your model (4 GB for 3B, 8 GB for 7–8B); ~$5–20/month
  • A persistent volume at /root/.ollama so pulled models survive redeploys
  • Any Ollama or OpenAI-compatible client — no special SDK required

Deployment Dependencies

Implementation Details

This template deploys the official ollama/ollama image with OLLAMA_HOST=0.0.0.0:11434 so the API accepts external and private-network calls, and a persistent volume at /root/.ollama so pulled models are cached across redeploys. The API exposes native /api/generate, /api/chat, and /api/embeddings routes plus an OpenAI-compatible /v1 surface.

Railway provides CPU inference — no GPU — so match model choice to that: embedding models and small chat models (1B–8B) run well; 30B+ models run but slowly. Connect from other Railway services privately via http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434, keeping the model API off the public internet entirely.


Frequently Asked Questions

Can I run large models like Llama 70B on this? Technically yes, but Railway is CPU-only (no GPU), so large models run slowly. This template is built for embedding models and small chat models (1B–8B), which run well on CPU. For large-model GPU speed, use a GPU cloud; for embeddings, RAG, and lightweight inference, this is a great, cost-effective fit.

Which models should I use for best performance? Embeddings: nomic-embed-text, mxbai-embed-large. Small chat/instruct: Llama 3.2 1B/3B, Phi, Gemma 2B, Qwen 0.5–7B. These give responsive CPU inference. Size your Railway RAM to the model — about 4 GB for a 3B model, 8 GB for a 7–8B model.

Do pulled models survive a redeploy? Only with the persistent volume at /root/.ollama. Without it, models re-download on every deploy. With the volume mounted, your pulled models are cached and persist across redeploys.

Can other Railway services use this Ollama privately? Yes. Call http://${{Ollama.RAILWAY_PRIVATE_DOMAIN}}:11434 from any service in the same project over Railway's private network — your app, Open WebUI, n8n, or a LangChain pipeline can use Ollama without exposing the model API publicly.

Is it OpenAI-compatible? Yes. Ollama exposes an OpenAI-compatible /v1 endpoint. Point any OpenAI SDK at your Railway Ollama URL and swap small-model calls to self-hosted inference with minimal code change.

Why can't external apps reach the API? Almost always because OLLAMA_HOST isn't set to 0.0.0.0 — by default Ollama listens on localhost only. This template pre-sets OLLAMA_HOST=0.0.0.0:11434 so the API is reachable from the start.


Why Deploy and Host Ollama on Railway?

Railway is a singular platform to deploy your infrastructure stack. Railway will host your infrastructure so you don't have to deal with configuration, while allowing you to vertically and horizontally scale it.

By deploying Ollama on Railway, you get a private, always-on LLM and embeddings API — OpenAI- compatible, callable over the private network, with models cached on a persistent volume — at a flat ~$5–20/month, ideal for RAG, embeddings, and small-model inference without per-token fees.


Template Content

More templates in this category

View Template
Chat Chat
Chat Chat, your own unified chat and search to AI platform.

okisdev
112
View Template
Hermes Agent | OpenClaw Alternative with Dashboard
[Jun'26] Self-improving AI agent with memory, skills, and web dashboard 🤖

codestorm
46
View Template
EchoDeck
Generate a mp4 from powerpoint with TTS

Fixed Scope
6