
Overview

vLLM is a high-throughput, memory-efficient inference engine for large language models. It uses advanced techniques like PagedAttention, continuous batching, and speculative decoding to serve open-source models at significantly higher throughput than standard inference frameworks. vLLM exposes an OpenAI-compatible API, making it a drop-in replacement for OpenAI in Nadoo AI. Key benefits:
  • High throughput — Serves 2-4x more requests per second than naive inference
  • Memory efficient — PagedAttention reduces GPU memory waste by up to 60%
  • OpenAI-compatible API — Standard /v1/chat/completions and /v1/completions endpoints
  • Any HuggingFace model — Deploy any model from the HuggingFace Hub
  • Production ready — Continuous batching, tensor parallelism, and streaming support
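Because the API is OpenAI-compatible, any OpenAI client can talk to vLLM by pointing it at the server's /v1 endpoint. A minimal stdlib sketch of the request format follows; the server URL and model name are placeholders for your own deployment:

```python
import json
import urllib.request

# Sketch of the OpenAI wire format that vLLM accepts. The URL and
# model name below are placeholders for your own deployment.
VLLM_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> bytes:
    """Serialize a standard /v1/chat/completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")

body = build_chat_request("meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello!")
req = urllib.request.Request(
    VLLM_URL, data=body, headers={"Content-Type": "application/json"}
)
# urllib.request.urlopen(req) would return the completion once a vLLM
# server (see Setup below) is running at VLLM_URL.
```

The same request works against api.openai.com, which is what makes vLLM a drop-in replacement.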

When to Use vLLM

vLLM is the best choice when you need:
  • High-volume production serving of open-source models
  • Maximum GPU utilization for cost-effective inference
  • Custom or fine-tuned models from HuggingFace
  • Self-hosted inference with no data leaving your infrastructure
  • Consistent low latency under concurrent load
For development and small-scale use, Ollama is simpler to set up. Use vLLM when you need production-grade throughput and GPU efficiency.

Setup

1. Install vLLM

Install vLLM on a GPU-equipped server:
pip install vllm
Requirements:
  • NVIDIA GPU with CUDA 11.8+
  • Python 3.9+
  • Sufficient GPU VRAM for your chosen model
2. Start the vLLM Server

Launch vLLM with your chosen model:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8080 \
  --tensor-parallel-size 1
The server starts an OpenAI-compatible API at http://localhost:8080.
3. Configure in Nadoo

Go to Admin > Model Providers > vLLM and enter:
| Field | Required | Description |
|---|---|---|
| Server URL | Yes | The vLLM server address (e.g., http://gpu-server:8080) |
| Model Name | Yes | The model identifier used when starting vLLM (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct) |
| API Key | No | Optional API key if your vLLM server requires authentication |
4. Test Connection

Click Test to verify the connection. The model will appear as an available LLM in your workspace.

Deployment Examples

Single GPU

Serve a 7B-8B parameter model on a single GPU:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8080 \
  --max-model-len 8192

Multi-GPU (Tensor Parallelism)

Serve a 70B parameter model across multiple GPUs:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --port 8080 \
  --tensor-parallel-size 4 \
  --max-model-len 4096

Docker

Run vLLM in a Docker container:
docker run --gpus all \
  -p 8080:8080 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8080
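Two additions are often useful when running the container (both drawn from vLLM's own Docker deployment docs, but verify against your image version): mounting the HuggingFace cache so downloaded weights survive container restarts, and passing a HuggingFace token for gated models such as Llama. The token value is a placeholder:

```shell
docker run --gpus all \
  -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<your_hf_token>" \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8080
```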

Custom or Fine-Tuned Models

Serve a fine-tuned model from a local path or HuggingFace:
# From HuggingFace
python -m vllm.entrypoints.openai.api_server \
  --model your-org/your-fine-tuned-model \
  --port 8080

# From local path
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model \
  --port 8080

Recommended Models

| Model | Parameters | GPU VRAM Required | Best For |
|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B-Instruct | 8B | 16 GB | General-purpose chat |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 70B | 4x 24 GB | High-quality reasoning |
| mistralai/Mistral-7B-Instruct-v0.3 | 7B | 16 GB | Efficient general tasks |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 46.7B | 2x 24 GB | Mixture-of-experts |
| codellama/CodeLlama-34b-Instruct-hf | 34B | 2x 24 GB | Code generation |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 8B | 16 GB | Reasoning and math |
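As a rough sanity check on the VRAM figures: weight storage alone at fp16/bf16 precision is about 2 bytes per parameter, and KV cache, activations, and CUDA overhead come on top, which is why the budgets above exceed this back-of-the-envelope estimate:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory (GiB) for model weights alone.
    Assumes fp16/bf16 (2 bytes per parameter); KV cache and runtime
    overhead are extra, so real requirements are higher."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(round(weight_memory_gb(8), 1))   # -> 14.9 (fits a 16 GB card, barely)
print(round(weight_memory_gb(70), 1))  # -> 130.4 (needs multiple GPUs)
```

The tight fit for an 8B model on a 16 GB card is also why reducing --max-model-len or using quantization (covered under Performance Tuning) matters in practice.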

Capabilities

Chat Completion

OpenAI-compatible chat completions with streaming support and function calling.
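When streaming is enabled, responses arrive as Server-Sent Events in the standard OpenAI format: `data:` lines carrying JSON chunks with content deltas, terminated by `data: [DONE]`. A minimal parsing sketch (the sample lines are hand-written illustrations of that format, not captured server output):

```python
import json

# Hand-written sample of an OpenAI-style SSE chat stream.
sample_stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]

def collect_stream(lines):
    """Concatenate content deltas from an OpenAI-style SSE chat stream."""
    parts = []
    for line in lines:
        payload = line.removeprefix("data: ")
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

print(collect_stream(sample_stream))  # -> Hello
```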

High Throughput

Continuous batching and PagedAttention enable 2-4x more requests per second than naive serving.

Any HuggingFace Model

Deploy any compatible model from the HuggingFace Hub, including your fine-tuned models.

GPU Efficient

PagedAttention reduces GPU memory waste, allowing you to serve larger models or more concurrent requests.

Performance Tuning

Key Server Parameters

| Parameter | Description | Default |
|---|---|---|
| --max-model-len | Maximum sequence length (reduce to save memory) | Model default |
| --tensor-parallel-size | Number of GPUs for tensor parallelism | 1 |
| --gpu-memory-utilization | Fraction of GPU memory to use (0.0-1.0) | 0.9 |
| --max-num-seqs | Maximum concurrent sequences | 256 |
| --enable-prefix-caching | Cache common prefixes to speed up similar requests | Disabled |
| --quantization | Weight quantization method (awq, gptq, squeezellm) | None |

Optimization Tips

If your use case does not need the model’s full context window, set --max-model-len to a smaller value (e.g., 4096 or 8192). This significantly reduces GPU memory usage.
AWQ and GPTQ quantization let you run larger models on fewer GPUs with minimal quality loss. Use --quantization awq with pre-quantized model variants.
If many requests share the same system prompt or prefix, enable --enable-prefix-caching to avoid recomputing those tokens for each request.
For models too large for a single GPU, increase --tensor-parallel-size to distribute the model across multiple GPUs. Use NVLink-connected GPUs for best inter-GPU bandwidth.
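Pulling several of these tips together, a tuned single-GPU launch might look like the following; the flag values are illustrative starting points for a shared-prefix chat workload, not universal recommendations:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8080 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching
```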

Environment Variables

When self-hosting Nadoo AI, configure vLLM via environment variables:
VLLM_SERVER_URL=http://gpu-server:8080
VLLM_MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct
VLLM_API_KEY=optional-api-key
If both environment variables and the admin UI configuration are set, the admin UI values take precedence.

Troubleshooting

  • Out-of-memory errors — The model requires more GPU VRAM than available. Reduce --max-model-len, use a quantized model, or distribute across more GPUs with --tensor-parallel-size.
  • Connection refused or timeout — The vLLM server is not running or is on a different port. Verify the server is started and the URL in Nadoo matches the server’s address and port.
  • Slow first request — The first request after server startup may be slow due to model loading and CUDA kernel compilation. Subsequent requests will be fast.
  • Model not supported — Not all HuggingFace models are supported by vLLM. Check the vLLM supported models list for compatibility.