Overview
vLLM is a high-throughput, memory-efficient inference engine for large language models. It uses advanced techniques like PagedAttention, continuous batching, and speculative decoding to serve open-source models at significantly higher throughput than standard inference frameworks. vLLM exposes an OpenAI-compatible API, making it a drop-in replacement for OpenAI in Nadoo AI.
Key benefits:
- High throughput — Serves 2-4x more requests per second than naive inference
- Memory efficient — PagedAttention reduces GPU memory waste by up to 60%
- OpenAI-compatible API — Standard /v1/chat/completions and /v1/completions endpoints
- Any HuggingFace model — Deploy any model from the HuggingFace Hub
- Production ready — Continuous batching, tensor parallelism, and streaming support
When to Use vLLM
vLLM is the best choice when you need:
- High-volume production serving of open-source models
- Maximum GPU utilization for cost-effective inference
- Custom or fine-tuned models from HuggingFace
- Self-hosted inference with no data leaving your infrastructure
- Consistent low latency under concurrent load
For development and small-scale use, Ollama is simpler to set up. Use vLLM when you need production-grade throughput and GPU efficiency.
Setup
Install vLLM
Install vLLM on a GPU-equipped server.
Requirements:
- NVIDIA GPU with CUDA 11.8+
- Python 3.9+
- Sufficient GPU VRAM for your chosen model
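With the requirements above in place, installation is typically a single pip command. A minimal sketch (vLLM publishes CUDA-enabled wheels on PyPI; pin a version for production):

```shell
# Create an isolated environment and install vLLM (CUDA wheels from PyPI)
python -m venv vllm-env
source vllm-env/bin/activate
pip install vllm
```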
Start the vLLM Server
Launch vLLM with your chosen model (see Deployment Examples below). The server starts an OpenAI-compatible API at http://localhost:8080.
Configure in Nadoo
Go to Admin > Model Providers > vLLM and enter:
| Field | Required | Description |
|---|---|---|
| Server URL | Yes | The vLLM server address (e.g., http://gpu-server:8080) |
| Model Name | Yes | The model identifier used when starting vLLM (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct) |
| API Key | No | Optional API key if your vLLM server requires authentication |
Deployment Examples
Single GPU
Serve a 7B-8B parameter model on a single GPU:
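A sketch of a single-GPU launch, assuming a recent vLLM release with the vllm serve CLI (older releases use python -m vllm.entrypoints.openai.api_server --model … instead); port 8080 matches the Nadoo configuration above:

```shell
# Serve an 8B instruct model on one GPU
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8080 \
  --gpu-memory-utilization 0.9
```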
Multi-GPU (Tensor Parallelism)
Serve a 70B parameter model across multiple GPUs:
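A sketch of a tensor-parallel launch (assumes the vllm serve CLI; 4x 24 GB GPUs match the VRAM guidance in the model table below):

```shell
# Shard a 70B model across 4 GPUs with tensor parallelism
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --port 8080 \
  --tensor-parallel-size 4
```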
Docker
Run vLLM in a Docker container:
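A sketch using the official vllm/vllm-openai image; the container listens on port 8000 by default, mapped here to 8080 on the host to match the examples above. Gated models additionally need -e HUGGING_FACE_HUB_TOKEN=… :

```shell
# Run the OpenAI-compatible vLLM server in Docker, reusing the host HF cache
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8080:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct
```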
Custom or Fine-Tuned Models
Serve a fine-tuned model from a local path or HuggingFace:
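A sketch for a local fine-tune (the path and name are examples, not real artifacts); --served-model-name sets the identifier clients send in API requests and in the Nadoo Model Name field:

```shell
# Serve a fine-tuned model from a local directory
vllm serve /models/my-finetuned-llama \
  --port 8080 \
  --served-model-name my-finetuned-llama
```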
Popular Models
| Model | Parameters | GPU VRAM Required | Best For |
|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B-Instruct | 8B | 16 GB | General-purpose chat |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 70B | 4x 24 GB | High-quality reasoning |
| mistralai/Mistral-7B-Instruct-v0.3 | 7B | 16 GB | Efficient general tasks |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 46.7B | 2x 24 GB | Mixture-of-experts |
| codellama/CodeLlama-34b-Instruct-hf | 34B | 2x 24 GB | Code generation |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 8B | 16 GB | Reasoning and math |
Capabilities
Chat Completion
OpenAI-compatible chat completions with streaming support and function calling.
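For example, a request against a local vLLM server follows the standard OpenAI chat-completions schema (model name is illustrative; add an Authorization header only if the server was started with an API key):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```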
High Throughput
Continuous batching and PagedAttention enable 2-4x more requests per second than naive serving.
Any HuggingFace Model
Deploy any compatible model from the HuggingFace Hub, including your fine-tuned models.
GPU Efficient
PagedAttention reduces GPU memory waste, allowing you to serve larger models or more concurrent requests.
Performance Tuning
Key Server Parameters
| Parameter | Description | Default |
|---|---|---|
| --max-model-len | Maximum sequence length (reduce to save memory) | Model default |
| --tensor-parallel-size | Number of GPUs for tensor parallelism | 1 |
| --gpu-memory-utilization | Fraction of GPU memory to use (0.0-1.0) | 0.9 |
| --max-num-seqs | Maximum concurrent sequences | 256 |
| --enable-prefix-caching | Cache common prefixes to speed up similar requests | Disabled |
| --quantization | Weight quantization method (awq, gptq, squeezellm) | None |
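The parameters above can be combined in a single launch command. A sketch with illustrative values (assumes the vllm serve CLI):

```shell
# Illustrative tuned launch: shorter context, prefix caching, higher memory use
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8080 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 128 \
  --enable-prefix-caching
```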
Optimization Tips
Reduce max-model-len for memory savings
If your use case does not need the model’s full context window, set --max-model-len to a smaller value (e.g., 4096 or 8192). This significantly reduces GPU memory usage.
Use quantization for larger models
AWQ and GPTQ quantization let you run larger models on fewer GPUs with minimal quality loss. Use --quantization awq with pre-quantized model variants.
Enable prefix caching for repeated prompts
If many requests share the same system prompt or prefix, enable --enable-prefix-caching to avoid recomputing those tokens for each request.
Scale with tensor parallelism
For models too large for a single GPU, increase --tensor-parallel-size to distribute the model across multiple GPUs. Use NVLink-connected GPUs for best inter-GPU bandwidth.
Environment Variables
When self-hosting Nadoo AI, configure vLLM via environment variables:
Troubleshooting
CUDA out of memory
The model requires more GPU VRAM than available. Reduce --max-model-len, use a quantized model, or distribute across more GPUs with --tensor-parallel-size.
Connection refused
The vLLM server is not running or is on a different port. Verify the server is started and the URL in Nadoo matches the server’s address and port.
Slow first response
The first request after server startup may be slow due to model loading and CUDA kernel compilation. Subsequent requests are fast.
Model not supported
Not all HuggingFace models are supported by vLLM. Check the vLLM supported models list for compatibility.