Deploy vLLM | High-Throughput LLM Serving
Self-host vLLM on Railway with an OpenAI-compatible API
Deploy and Host vLLM on Railway
vLLM is a high-throughput inference and serving engine for large language models that exposes an OpenAI-compatible HTTP API. Self-host vLLM on Railway when you want a private, OpenAI-drop-in endpoint for any Hugging Face causal-LM — no rate limits, no per-token billing, no third-party telemetry on your prompts.
This Railway template deploys the official vllm/vllm-openai-cpu Docker image as a single service with a persistent volume for the Hugging Face model cache, Bearer-token API key authentication, and a public HTTPS URL. The default model (Qwen/Qwen2.5-0.5B-Instruct) is deliberately small and ungated so the deploy boots without configuration; swap VLLM_MODEL to any HF model id that fits in 8 GB RAM.
Getting Started with vLLM on Railway
After deploy, the public URL exposes an OpenAI-compatible API at /v1/*. Grab the auto-generated VLLM_API_KEY from the Railway service's Variables tab — that's your Bearer token. Test the endpoint with curl https://<your-domain>/v1/models -H "Authorization: Bearer $VLLM_API_KEY"; it should list the loaded model. Point any OpenAI SDK at https://<your-domain>/v1 with that key as the api_key and you're done. The first request after a cold boot can take 30–60 seconds while the model loads; subsequent requests are fast because the model stays resident in RAM. To swap models, change VLLM_MODEL to a different Hugging Face id (e.g. microsoft/Phi-3-mini-4k-instruct) and redeploy — the volume keeps weights cached, so you only pay the download once per model.
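For example, here is a minimal check with the Python OpenAI SDK; the domain and key below are placeholders for your own Railway URL and VLLM_API_KEY:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-service.up.railway.app/v1",  # placeholder: your Railway domain
    api_key="YOUR_VLLM_API_KEY",                         # placeholder: value from the Variables tab
)

# List the loaded model, then send a small chat completion
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)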
About Hosting vLLM
vLLM was built at UC Berkeley's Sky Computing Lab and pioneered PagedAttention — a memory-management technique that turns the KV cache into pageable virtual memory, so concurrent requests share GPU/CPU memory like processes share RAM. Hugging Face put TGI into maintenance mode in late 2025 and now recommends vLLM for new deployments, making vLLM the de-facto open-source standard for serving open-weight LLMs.
Key features:
- OpenAI-compatible API — /v1/chat/completions, /v1/completions, /v1/embeddings, and /v1/models work with any OpenAI SDK
- Bearer-token auth on /v1/* via VLLM_API_KEY (rotatable at any time)
- Continuous batching + PagedAttention for higher throughput under concurrency
- Huge model coverage — any Hugging Face causal-LM, including quantized (AWQ, GPTQ, INT4) and gated (Llama, Mistral) checkpoints with HF_TOKEN
- Streaming responses via Server-Sent Events
- Persistent model cache on the Railway volume so redeploys don't re-download weights
Why Deploy vLLM on Railway
A managed Railway deployment removes the rough edges of self-hosting:
- Public HTTPS URL with auto-generated TLS certificates
- Persistent volume keeps the Hugging Face cache warm across redeploys
- One-click memory bump if you load a larger model
- No GPU driver setup — the CPU image works on Railway out of the box
- Variables, logs, metrics, and rollbacks in one dashboard
- Pay only for the container hours you use
Common Use Cases
- Drop-in OpenAI replacement for a side project, agent, or RAG pipeline you don't want to send to OpenAI
- Private inference for sensitive prompts — internal tooling, regulated industries, or research where prompt logging is unacceptable
- Model evaluation harness — quickly stand up the same API surface across a series of HF checkpoints to A/B them (see the sketch after this list)
- Lightweight production endpoint for small models (≤1B params) where CPU latency is acceptable
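A rough sketch of the evaluation use case: the endpoint, key, and prompts below are placeholders, and you would redeploy with a different VLLM_MODEL between runs to compare checkpoints on the same prompts.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-service.up.railway.app/v1",  # placeholder domain
    api_key="YOUR_VLLM_API_KEY",                         # placeholder key
)

prompts = [
    "Summarize PagedAttention in one sentence.",
    "Write a haiku about KV caches.",
]

model_id = client.models.list().data[0].id  # whichever checkpoint is currently loaded
for prompt in prompts:
    out = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    print(f"[{model_id}] {prompt}\n{out.choices[0].message.content}\n")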
Dependencies for Self-Hosted vLLM
- vLLM (vllm/vllm-openai-cpu:v0.20.2) — inference engine with an OpenAI-compatible HTTP server; x86_64 CPU build requiring avx512f
Environment Variables Reference
| Variable | Purpose |
|---|---|
| VLLM_MODEL | Hugging Face model id to load (e.g. Qwen/Qwen2.5-0.5B-Instruct) |
| VLLM_API_KEY | Bearer token required on /v1/* endpoints |
| VLLM_DTYPE | Tensor dtype — bfloat16 for AVX-512 CPUs, float16 otherwise |
| VLLM_MAX_MODEL_LEN | Maximum context window in tokens |
| VLLM_CPU_KVCACHE_SPACE | GiB reserved for the KV cache (CPU backend) |
| HF_HOME | Hugging Face cache root (set to /data to use the volume) |
| HF_TOKEN | Optional — required for gated models (Llama, Mistral private weights) |
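These variables map straightforwardly onto vLLM's serve flags. The sketch below is a hypothetical entrypoint showing that mapping (the template's actual start command may differ); HF_HOME, HF_TOKEN, and VLLM_CPU_KVCACHE_SPACE are read directly from the environment by Hugging Face Hub and vLLM's CPU backend, so they need no flags:
import os
import subprocess

# Hypothetical entrypoint: translate the template's variables into vLLM CLI flags.
cmd = [
    "vllm", "serve", os.environ["VLLM_MODEL"],              # e.g. Qwen/Qwen2.5-0.5B-Instruct
    "--host", "0.0.0.0",
    "--port", "8000",
    "--dtype", os.environ.get("VLLM_DTYPE", "bfloat16"),
    "--max-model-len", os.environ.get("VLLM_MAX_MODEL_LEN", "4096"),
    "--api-key", os.environ["VLLM_API_KEY"],                # enforced on /v1/* only
]
subprocess.run(cmd, check=True)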
Deployment Dependencies
- Runtime: Docker (Railway), Python 3.12 inside the image
- Image: vllm/vllm-openai-cpu on Docker Hub
- Source: github.com/vllm-project/vllm
- Docs: docs.vllm.ai
Hardware Requirements for Self-Hosting vLLM
CPU inference is feasible only for small models. Use this as a starting point and scale memory to fit your chosen model.
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 2 vCPU (avx512f) | 4–8 vCPU (avx512f) |
| RAM | 4 GB (≤0.5B model) | 8 GB (≤1.5B model) |
| Storage | 5 GB volume | 20 GB volume (multiple models) |
| Runtime | Docker | Docker |
For 7B+ models or production-grade throughput, run vLLM on a GPU host elsewhere and keep this Railway service for prototyping, demos, and lightweight workloads.
Self-Hosting vLLM with Docker
Running the official CPU image takes a single docker run command once you have a model id:
docker run --rm -p 8000:8000 \
-e VLLM_API_KEY=your-secret-key \
-e VLLM_CPU_KVCACHE_SPACE=4 \
--shm-size=4g \
vllm/vllm-openai-cpu:v0.20.2 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 --port 8000 \
--dtype bfloat16 --max-model-len 4096
Once it's serving, point any OpenAI client at it. The following is a Python client example using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(
    base_url="https://your-service.up.railway.app/v1",  # placeholder: your Railway domain (or http://localhost:8000/v1 for the docker run above)
    api_key="YOUR_VLLM_API_KEY",                         # placeholder: your VLLM_API_KEY
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-0.5B-Instruct",
messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)
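Streaming works through the same SDK. A short sketch that continues from the client above, with stream=True delivering tokens as Server-Sent Events:
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    stream=True,  # tokens arrive incrementally as Server-Sent Events
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()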
Is vLLM Free to Self-Host?
vLLM itself is fully open source under the Apache 2.0 license — no per-token fees, no seat licensing, no usage caps. On Railway you only pay for the container's compute and storage. A small Qwen2.5-0.5B deployment with 8 GB RAM and a 5 GB volume costs roughly the same as any other small Railway service; you avoid OpenAI's per-token billing entirely.
FAQ
What is vLLM and why self-host it? vLLM is an open-source inference engine for large language models with PagedAttention, continuous batching, and an OpenAI-compatible HTTP API. Self-hosting gives you a private endpoint, no rate limits, no per-token billing, and full control over which model serves your traffic.
What does this Railway template deploy?
A single service running the vllm/vllm-openai-cpu Docker image, with a persistent volume mounted at /data for the Hugging Face model cache, Bearer-token API key authentication on /v1/*, and a public HTTPS URL on port 8000.
Why does the deploy use the CPU image instead of a GPU build?
Railway containers do not have GPU access, so the CPU build (vllm-openai-cpu) is the only viable option on Railway. CPU inference is fast enough for small models (≤1B params) and demo workloads; for larger models or production throughput, run vLLM on a GPU host and keep this Railway service for prototyping.
Can I switch the model after deploy?
Yes. Change VLLM_MODEL in the Variables tab to any Hugging Face causal-LM id and redeploy. The first deploy with a new model will download weights into the /data volume; subsequent restarts reuse the cache.
How do I serve a gated model like Llama 3?
Set HF_TOKEN to a Hugging Face access token that has accepted the model's license, then point VLLM_MODEL at the gated repo (e.g. meta-llama/Llama-3.2-1B-Instruct). vLLM picks up HF_TOKEN automatically when downloading.
Why is the API key a static value instead of ${{secret(32)}}?
Railway re-evaluates ${{secret(N)}} on every read, so the value the container starts with is unknowable to the human operator who needs to call the API. Using a static value (generated once and stored) ensures the key in Railway's UI matches what the container actually accepts. Rotate it at any time by editing the variable.
Does vLLM expose unauthenticated endpoints?
Yes — /health, /metrics, and the /docs page are unauthenticated by design. Only /v1/* (the OpenAI-compatible endpoints) require the Bearer token. If you need to lock down everything, put the service behind a reverse proxy or IP allowlist.
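A quick way to verify that boundary from Python; the domain and key are placeholders, as in the earlier examples:
import requests

BASE = "https://your-service.up.railway.app"  # placeholder: your Railway domain

print(requests.get(f"{BASE}/health").status_code)     # unauthenticated, expect 200
print(requests.get(f"{BASE}/v1/models").status_code)  # no token, expect 401
print(requests.get(
    f"{BASE}/v1/models",
    headers={"Authorization": "Bearer YOUR_VLLM_API_KEY"},  # placeholder key
).status_code)                                         # with token, expect 200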
vLLM vs Ollama vs TGI
| Feature | vLLM | Ollama | TGI |
|---|---|---|---|
| Primary use | Production serving | Local dev / personal | Enterprise (HF) |
| OpenAI API | Yes (native) | Yes (compatibility layer) | Yes |
| Concurrent throughput | Highest (PagedAttention) | Lowest (sequential) | High |
| CPU support | Yes (vllm-openai-cpu) | Yes (default) | Limited |
| Status | Active, recommended by HF | Active | Maintenance mode (Dec 2025) |
vLLM wins on throughput under concurrency. Ollama wins on local-dev ergonomics. If you already run TGI, plan a migration — Hugging Face themselves now point new deployments at vLLM.