Deploy vLLM | High-Throughput LLM Serving
Self-host vLLM on Railway with an OpenAI-compatible API
Deploy and Host vLLM on Railway
vLLM is a high-throughput inference and serving engine for large language models that exposes an OpenAI-compatible HTTP API. Self-host vLLM on Railway when you want a private, OpenAI-drop-in endpoint for any Hugging Face causal-LM — no rate limits, no per-token billing, no third-party telemetry on your prompts.
This Railway template deploys the official vllm/vllm-openai-cpu Docker image as a single service with a persistent volume for the Hugging Face model cache, Bearer-token API key authentication, and a public HTTPS URL. The default model (Qwen/Qwen2.5-0.5B-Instruct) is deliberately small and ungated so the deploy boots without configuration; swap VLLM_MODEL to any HF model id that fits in 8 GB RAM.
Getting Started with vLLM on Railway
After deploy, the public URL exposes an OpenAI-compatible API at /v1/*. Grab the auto-generated VLLM_API_KEY from the Railway service's Variables tab — that's your Bearer token. Test the endpoint with curl https://<your-domain>/v1/models -H "Authorization: Bearer $VLLM_API_KEY"; it should list the loaded model. Point any OpenAI SDK at https://<your-domain>/v1 with that key as the api_key and you're done. The first request after a cold boot can take 30–60 seconds while the model loads; subsequent requests are fast because the model stays resident in RAM. To swap models, change VLLM_MODEL to a different Hugging Face id (e.g. microsoft/Phi-3-mini-4k-instruct) and redeploy — the volume keeps weights cached, so you only pay the download once per model.
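For example, here is a minimal check with the Python OpenAI SDK; the domain and key below are placeholders for your own Railway URL and VLLM_API_KEY:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-service.up.railway.app/v1",  # placeholder: your Railway domain
    api_key="YOUR_VLLM_API_KEY",                         # placeholder: value from the Variables tab
)

# List the loaded model, then send a small chat completion
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)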
About Hosting vLLM
vLLM was built at UC Berkeley's Sky Computing Lab and pioneered PagedAttention — a memory-management technique that turns the KV cache into pageable virtual memory, so concurrent requests share GPU/CPU memory like processes share RAM. Hugging Face put TGI into maintenance mode in late 2025 and now recommends vLLM for new deployments, making vLLM the de-facto open-source standard for serving open-weight LLMs.
Key features:
- OpenAI-compatible API — /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models work with any OpenAI SDK
- Bearer-token auth on /v1/* via VLLM_API_KEY (rotatable at any time)
- Continuous batching + PagedAttention for higher throughput under concurrency
- Huge model coverage — any Hugging Face causal-LM, including quantized (AWQ, GPTQ, INT4) and gated (Llama, Mistral) checkpoints with HF_TOKEN
- Streaming responses via Server-Sent Events
- Persistent model cache on the Railway volume so redeploys don't re-download weights
Why Deploy vLLM on Railway
A managed Railway deployment removes the rough edges of self-hosting:
- Public HTTPS URL with auto-generated TLS certificates
- Persistent volume keeps the Hugging Face cache warm across redeploys
- One-click memory bump if you load a larger model
- No GPU driver setup — the CPU image works on Railway out of the box
- Variables, logs, metrics, and rollbacks in one dashboard
- Pay only for the container hours you use
Common Use Cases
- Drop-in OpenAI replacement for a side project, agent, or RAG pipeline you don't want to send to OpenAI
- Private inference for sensitive prompts — internal tooling, regulated industries, or research where prompt logging is unacceptable
- Model evaluation harness — quickly stand up the same API surface across a series of HF checkpoints to A/B them (see the sketch after this list)
- Lightweight production endpoint for small models (≤1B params) where CPU latency is acceptable
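A rough sketch of the evaluation use case: the endpoint, key, and prompts below are placeholders, and you would redeploy with a different VLLM_MODEL between runs to compare checkpoints on the same prompts.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-service.up.railway.app/v1",  # placeholder domain
    api_key="YOUR_VLLM_API_KEY",                         # placeholder key
)

prompts = [
    "Summarize PagedAttention in one sentence.",
    "Write a haiku about KV caches.",
]

model_id = client.models.list().data[0].id  # whichever checkpoint is currently loaded
for prompt in prompts:
    out = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    print(f"[{model_id}] {prompt}\n{out.choices[0].message.content}\n")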
Dependencies for Self-Hosted vLLM
- vLLM (vllm/vllm-openai-cpu:v0.20.2) — inference engine with an OpenAI-compatible HTTP server; x86_64 CPU build requiring avx512f
Environment Variables Reference
| Variable | Purpose |
|---|---|
| VLLM_MODEL | Hugging Face model id to load (e.g. Qwen/Qwen2.5-0.5B-Instruct) |
| VLLM_API_KEY | Bearer token required on /v1/* endpoints |
| VLLM_DTYPE | Tensor dtype — bfloat16 for AVX-512 CPUs, float16 otherwise |
| VLLM_MAX_MODEL_LEN | Maximum context window in tokens |
| VLLM_CPU_KVCACHE_SPACE | GiB reserved for the KV cache (CPU backend) |
| HF_HOME | Hugging Face cache root (set to /data to use the volume) |
| HF_TOKEN | Optional — required for gated models (Llama, Mistral private weights) |
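These variables map straightforwardly onto vLLM's serve flags. The sketch below is a hypothetical entrypoint showing that mapping (the template's actual start command may differ); HF_HOME, HF_TOKEN, and VLLM_CPU_KVCACHE_SPACE are read directly from the environment by Hugging Face Hub and vLLM's CPU backend, so they need no flags:
import os
import subprocess

# Hypothetical entrypoint: translate the template's variables into vLLM CLI flags.
cmd = [
    "vllm", "serve", os.environ["VLLM_MODEL"],              # e.g. Qwen/Qwen2.5-0.5B-Instruct
    "--host", "0.0.0.0",
    "--port", "8000",
    "--dtype", os.environ.get("VLLM_DTYPE", "bfloat16"),
    "--max-model-len", os.environ.get("VLLM_MAX_MODEL_LEN", "4096"),
    "--api-key", os.environ["VLLM_API_KEY"],                # enforced on /v1/* only
]
subprocess.run(cmd, check=True)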
Deployment Dependencies
- Runtime: Docker (Railway), Python 3.12 inside the image
- Image: vllm/vllm-openai-cpu on Docker Hub
- Source: github.com/vllm-project/vllm
- Docs: docs.vllm.ai
Hardware Requirements for Self-Hosting vLLM
CPU inference is feasible only for small models. Use this as a starting point and scale memory to fit your chosen model.
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 2 vCPU (avx512f) | 4–8 vCPU (avx512f) |
| RAM | 4 GB (≤0.5B model) | 8 GB (≤1.5B model) |
| Storage | 5 GB volume | 20 GB volume (multiple models) |
| Runtime | Docker | Docker |
For 7B+ models or production-grade throughput, run vLLM on a GPU host elsewhere and keep this Railway service for prototyping, demos, and lightweight workloads.
Self-Hosting vLLM with Docker
Running the official CPU image takes a single docker run command once you have a model id:
docker run --rm -p 8000:8000 \
-e VLLM_API_KEY=your-secret-key \
-e VLLM_CPU_KVCACHE_SPACE=4 \
--shm-size=4g \
vllm/vllm-openai-cpu:v0.20.2 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 --port 8000 \
--dtype bfloat16 --max-model-len 4096
Once it's serving, point any OpenAI client at it. The following is a Python client example using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(
    base_url="https://your-service.up.railway.app/v1",  # placeholder: your Railway domain (or http://localhost:8000/v1 for the docker run above)
    api_key="YOUR_VLLM_API_KEY",                         # placeholder: your VLLM_API_KEY
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-0.5B-Instruct",
messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)
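Streaming works through the same SDK. A short sketch that continues from the client above, with stream=True delivering tokens as Server-Sent Events:
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    stream=True,  # tokens arrive incrementally as Server-Sent Events
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()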
Is vLLM Free to Self-Host?
vLLM itself is fully open source under the Apache 2.0 license — no per-token fees, no seat licensing, no usage caps. On Railway you only pay for the container's compute and storage. A small Qwen2.5-0.5B deployment with 8 GB RAM and a 5 GB volume costs roughly the same as any other small Railway service; you avoid OpenAI's per-token billing entirely.
FAQ
What is vLLM and why self-host it? vLLM is an open-source inference engine for large language models with PagedAttention, continuous batching, and an OpenAI-compatible HTTP API. Self-hosting gives you a private endpoint, no rate limits, no per-token billing, and full control over which model serves your traffic.
What does this Railway template deploy?
A single service running the vllm/vllm-openai-cpu Docker image, with a persistent volume mounted at /data for the Hugging Face model cache, Bearer-token API key authentication on /v1/*, and a public HTTPS URL on port 8000.
Why does the deploy use the CPU image instead of a GPU build?
Railway containers do not have GPU access, so the CPU build (vllm-openai-cpu) is the only viable option on Railway. CPU inference is fast enough for small models (≤1B params) and demo workloads; for larger models or production throughput, run vLLM on a GPU host and keep this Railway service for prototyping.
Can I switch the model after deploy?
Yes. Change VLLM_MODEL in the Variables tab to any Hugging Face causal-LM id and redeploy. The first deploy with a new model will download weights into the /data volume; subsequent restarts reuse the cache.
How do I serve a gated model like Llama 3?
Set HF_TOKEN to a Hugging Face access token that has accepted the model's license, then point VLLM_MODEL at the gated repo (e.g. meta-llama/Llama-3.2-1B-Instruct). vLLM picks up HF_TOKEN automatically when downloading.
Why is the API key a static value instead of ${{secret(32)}}?
Railway re-evaluates ${{secret(N)}} on every read, so the value the container starts with is unknowable to the human operator who needs to call the API. Using a static value (generated once and stored) ensures the key in Railway's UI matches what the container actually accepts. Rotate it at any time by editing the variable.
Does vLLM expose unauthenticated endpoints?
Yes — /health, /metrics, and the /docs page are unauthenticated by design. Only /v1/* (the OpenAI-compatible endpoints) require the Bearer token. If you need to lock down everything, put the service behind a reverse proxy or IP allowlist.
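A quick way to verify that boundary from Python; the domain and key are placeholders, as in the earlier examples:
import requests

BASE = "https://your-service.up.railway.app"  # placeholder: your Railway domain

print(requests.get(f"{BASE}/health").status_code)     # unauthenticated, expect 200
print(requests.get(f"{BASE}/v1/models").status_code)  # no token, expect 401
print(requests.get(
    f"{BASE}/v1/models",
    headers={"Authorization": "Bearer YOUR_VLLM_API_KEY"},  # placeholder key
).status_code)                                         # with token, expect 200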
vLLM vs Ollama vs TGI
| Feature | vLLM | Ollama | TGI |
|---|---|---|---|
| Primary use | Production serving | Local dev / personal | Enterprise (HF) |
| OpenAI API | Yes (native) | Yes (compatibility layer) | Yes |
| Concurrent throughput | Highest (PagedAttention) | Lowest (sequential) | High |
| CPU support | Yes (vllm-openai-cpu) | Yes (default) | Limited |
| Status | Active, recommended by HF | Active | Maintenance mode (Dec 2025) |
vLLM wins on throughput under concurrency. Ollama wins on local-dev ergonomics. If you already run TGI, plan a migration — Hugging Face themselves now point new deployments at vLLM.