Deploy Xinference (Xorbits Inference) | OpenAI-Compatible API

Self-host Xinference to serve LLMs, embeddings, rerankers & more.

Deploy and Host Xinference on Railway

Xinference (Xorbits Inference) is an open-source model serving platform that lets you run large language models, speech recognition, image generation, and embedding models behind an OpenAI-compatible API. Self-host Xinference on Railway when you want a single private endpoint that swaps OpenAI/Anthropic for local GGUF, Llama, Qwen, Mistral, and Whisper models — without rewriting any of your application code.

This Railway template deploys the official xprobe/xinference:latest-cpu image as a single service with a persistent volume for model weights, an auto-generated public domain on port 9997, and built-in JWT + API key authentication rendered from environment variables at boot. There is no external database — Xinference stores model registry, virtual environments, and HuggingFace cache directly on the attached volume.

Getting Started with Xinference on Railway

After the deploy turns green, open the public Railway URL with /ui/ appended (note the trailing slash) and log in with the XINFERENCE_ADMIN_USER and XINFERENCE_ADMIN_PASSWORD you set in the env vars. The web UI lists every supported model in the registry — pick a small Q4-quantized model that fits in 8 GB RAM (e.g. qwen2.5-instruct 7B Q4_K_M, bge-small-en-v1.5 for embeddings, or bge-reranker-base for reranking), click Launch, and watch it download to /data/huggingface. Once the model is running, point any OpenAI SDK at https://<your-railway-domain>/v1 and pass your XINFERENCE_API_KEY as the bearer token. The full Swagger reference lives at /docs.
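
For example, a minimal sketch using the OpenAI Python SDK; the domain is a placeholder for your Railway URL, the key is your XINFERENCE_API_KEY, and the model name must match whatever you launched (qwen2.5-instruct here is just an illustration):

from openai import OpenAI

# Point the SDK at the Railway deployment instead of api.openai.com.
client = OpenAI(
    base_url="https://<your-railway-domain>/v1",
    api_key="sk-XXXXXXXXXXXXX",  # your XINFERENCE_API_KEY
)

# The model name must match a model you launched from the web UI or CLI.
response = client.chat.completions.create(
    model="qwen2.5-instruct",
    messages=[{"role": "user", "content": "Say hello from Xinference."}],
)
print(response.choices[0].message.content)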

Xinference dashboard screenshot

About Hosting Xinference

Xinference is a model orchestration layer maintained by Xorbits. It wraps llama.cpp, transformers, sentence-transformers, and several other inference backends behind one HTTP API and one Python SDK. You launch any supported model with one click, scale replicas up and down, and call them through the same /v1/chat/completions, /v1/embeddings, /v1/audio/transcriptions, and /v1/images/generations endpoints OpenAI uses.

Key features:

  • OpenAI-compatible REST API — drop-in replacement for openai.OpenAI() clients
  • Web UI for model management — launch, stop, list, monitor without writing code
  • Model registry — built-in catalog of dozens of LLMs, embedding, rerank, audio, and image models
  • Multi-backend — llama.cpp (GGUF), transformers, sentence-transformers, FunASR, ChatTTS
  • Built-in auth — JWT sessions plus per-user API keys with admin/read/start/stop permissions
  • Custom model registration — point at any HuggingFace repo or local GGUF file

Why Deploy Xinference on Railway

Railway gives you a one-click path to a private OpenAI-compatible endpoint:

  • Public HTTPS URL provisioned automatically with TLS
  • Persistent volume keeps downloaded models across redeploys
  • Single env-var bundle configures auth, model source, and API key
  • Memory and replica scaling controlled from the Railway dashboard
  • Pay only for actual compute and storage usage

Common Use Cases

  • Private LLM gateway — swap api.openai.com for your Railway URL across an entire stack without touching client code
  • Self-hosted embeddings for RAG — run BGE, GTE, or Nomic embedding models for vector search pipelines (see the snippet after this list)
  • On-call rerank service — host BGE-Reranker or Cohere-compatible rerankers next to your existing search index
  • Whisper transcription endpoint — point voice apps at /v1/audio/transcriptions for fully self-hosted speech-to-text
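
As a sketch of the embeddings use case, assuming you launched bge-small-en-v1.5 and substituting your own domain and key:

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-railway-domain>/v1",
    api_key="sk-XXXXXXXXXXXXX",  # your XINFERENCE_API_KEY
)

# "bge-small-en-v1.5" must match an embedding model you launched in Xinference.
result = client.embeddings.create(
    model="bge-small-en-v1.5",
    input=["self-hosted embeddings for RAG", "vector search on Railway"],
)
print(len(result.data), "vectors of dimension", len(result.data[0].embedding))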

Dependencies for Xinference on Railway

This template ships a single service:

  • Xinference (xprobe/xinference:latest-cpu from Docker Hub) — serves the API, web UI, and runs all model inference

Environment Variables Reference

Variable                     Purpose
XINFERENCE_HOME              Persistent model + state directory (set to /data for the volume)
HF_HOME                      HuggingFace cache directory (set under the same volume)
XINFERENCE_MODEL_SRC         Model registry source: huggingface or modelscope
XINFERENCE_SECRET_KEY        JWT signing secret — must be a stable hex string
XINFERENCE_ADMIN_USER        Admin login username for the web UI
XINFERENCE_ADMIN_PASSWORD    Admin login password — bootstrap-only, do not rotate
XINFERENCE_API_KEY           Bearer token for /v1/* endpoints — must be sk- + 13 chars (16 total)
PORT                         Container port (9997)
RAILWAY_RUN_UID              Set to 0 so the container can write to the volume
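
If you need to generate a key in that format, a minimal sketch (the FAQ below describes the validator; any 13 alphanumeric characters after the sk- prefix should pass):

import secrets
import string

# Xinference expects keys of the form "sk-" + 13 alphanumeric characters (16 total).
alphabet = string.ascii_letters + string.digits
api_key = "sk-" + "".join(secrets.choice(alphabet) for _ in range(13))
print(api_key)  # e.g. sk-a1B2c3D4e5F6g -- set this as XINFERENCE_API_KEY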


Hardware Requirements for Self-Hosting Xinference

Xinference on Railway is CPU-only. Right-size the plan to the largest model you intend to run.

Resource   Minimum                              Recommended
CPU        2 vCPU                               4–8 vCPU
RAM        4 GB (small embedding models only)   8 GB (7B Q4 LLMs)
Storage    10 GB volume                         50 GB volume (multiple models)
Runtime    Linux x86_64                         Linux x86_64

Anything beyond a 7B Q4_K_M model needs more than 8 GB RAM and will OOM at launch — pick smaller quants or smaller models.

Self-Hosting Xinference Locally

Reproduce the Railway deploy locally with the same official image:

docker run -d --name xinference \
  -p 9997:9997 \
  -v xinference-data:/data \
  -e XINFERENCE_HOME=/data \
  -e HF_HOME=/data/huggingface \
  xprobe/xinference:latest-cpu \
  xinference-local --host 0.0.0.0 --port 9997

Then launch a model from the CLI:

docker exec xinference xinference launch \
  --model-name qwen2.5-instruct \
  --size-in-billions 7 \
  --model-format ggufv2 \
  --quantization Q4_K_M
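
Once the launch finishes, a quick sanity check from the host (a sketch; with no --auth-config the local server does not enforce keys, but the SDK still requires a non-empty api_key value):

from openai import OpenAI

# Local instance started with the docker run command above.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen2.5-instruct",  # must match the model launched above
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)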

For production deploys with auth, add --auth-config /path/to/auth.json to the start command — see the auth docs for the JSON schema.

Xinference vs Ollama

Feature                           Xinference         Ollama
OpenAI-compatible API             Yes                Yes
Web UI for model management       Built-in           Third-party
Embedding & rerank models         First-class        Limited
Audio (Whisper) & image gen       Yes                No
Built-in auth (JWT + API keys)    Yes                No
GGUF backend                      Yes (xllamacpp)    Yes (llama.cpp)

Both are great for self-hosting LLMs; Xinference is the better pick when you also need embeddings, reranking, and audio models behind one auth-protected endpoint.

How Much Does Xinference Cost to Self-Host?

Xinference itself is fully open-source under the Apache 2.0 license — there is no paid tier and no license fee. Hosting cost on Railway is purely infrastructure: the CPU-only image and a small 7B-quantized model fit comfortably in an 8 GB plan, so you pay for compute, RAM, and the storage your model files consume on the volume. Larger models or higher concurrency need a bigger plan.

FAQ — Self-Hosting Xinference on Railway

What is Xinference and why self-host it? Xinference is an open-source serving framework for LLMs, embeddings, rerankers, and audio/image models behind a single OpenAI-compatible API. Self-hosting on Railway gives you private inference, no per-token billing, and full control over which model runs.

What does this Railway template deploy? A single Xinference service backed by the official xprobe/xinference:latest-cpu image, with a persistent volume mounted at /data, JWT + API key authentication generated from environment variables, and a public HTTPS domain on port 9997.

Why does this template not include a database? Xinference is fully self-contained. Model weights, virtual environments, and the HuggingFace cache live on the attached volume, and there is no external metadata store — that's why the deploy is a single service.

How do I enable API key auth in self-hosted Xinference on Railway? The start command renders an auth JSON file from XINFERENCE_SECRET_KEY, XINFERENCE_ADMIN_USER, XINFERENCE_ADMIN_PASSWORD, and XINFERENCE_API_KEY at boot, then passes --auth-config to xinference-local. Just set those four env vars; auth is on by default.

Why must XINFERENCE_API_KEY be exactly 16 characters? Xinference's auth validator hard-fails any key that is not sk- + 13 alphanumeric characters (16 total). Longer keys crash the REST API at startup with ValueError: Api-Key should be a string started with 'sk-' with a total length of 16.

Can I run a 13B or 70B model on this template? Not on the default 8 GB Railway plan. Practical CPU-only inference tops out around 7B parameters in Q4 quantization. Bigger models need a higher-memory plan, and inference will be much slower because no GPU is attached.

Can I use Xinference as a drop-in replacement for the OpenAI SDK? Yes. Set the SDK's base_url to https://<your-railway-domain>/v1 and the API key to your XINFERENCE_API_KEY, then call chat.completions.create() as normal. Embeddings and audio transcription endpoints follow the same pattern.
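
As a sketch of the audio pattern, assuming you launched a Whisper model (whisper-large-v3 here is just an illustration) and substituting your own domain and key:

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-railway-domain>/v1",
    api_key="sk-XXXXXXXXXXXXX",  # your XINFERENCE_API_KEY
)

# The model name must match the Whisper variant you launched in Xinference.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )
print(transcript.text)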

