What is vLLM?
vLLM is an open-source inference engine designed to serve large language models efficiently. It maximizes throughput and minimizes latency when running LLM inference workloads. vLLM workers include the vLLM engine with GPU optimizations and support for both OpenAI’s API and Runpod’s native API. You can deploy any supported model from Hugging Face with minimal configuration and start serving requests immediately. The workers run on Runpod Serverless, which automatically scales based on demand.

How vLLM works
vLLM uses several advanced techniques to achieve high performance when serving LLMs. Understanding these can help you optimize your deployments and troubleshoot issues.

PagedAttention for memory efficiency
PagedAttention is the key innovation in vLLM, and it dramatically improves how GPU memory is used during inference. Traditional LLM serving wastes memory by pre-allocating large contiguous blocks for key-value (KV) caches. PagedAttention breaks the KV cache into smaller pages, similar to how operating systems manage memory. This reduces memory waste and allows vLLM to serve more requests concurrently on the same GPU, so you can handle higher throughput or serve larger models on smaller GPUs.
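As a mental model, the bookkeeping looks roughly like the toy allocator below. This is an illustrative sketch only; the class, page size, and method names are hypothetical and are not vLLM internals.

```python
# Toy sketch of paged KV-cache bookkeeping (not vLLM's actual implementation).
# Each sequence maps to a list of fixed-size pages instead of one large
# contiguous reservation, so memory is claimed only as tokens are generated.

PAGE_SIZE = 16  # tokens per page (hypothetical)

class PagedKVCache:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))  # pool shared by all sequences
        self.page_table = {}  # sequence id -> list of page ids

    def append_token(self, seq_id: str, num_tokens_so_far: int) -> None:
        """Allocate a new page only when the current page is full."""
        pages = self.page_table.setdefault(seq_id, [])
        if num_tokens_so_far % PAGE_SIZE == 0:  # first token, or current page full
            pages.append(self.free_pages.pop())

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the shared pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
```

Because pages are claimed only as tokens are generated and returned as soon as a sequence finishes, many more sequences can share the same GPU memory than with large contiguous reservations.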
Continuous batching

vLLM uses continuous batching (also called dynamic batching) to process multiple requests simultaneously. Unlike traditional batching, which waits for a batch to fill up before processing, continuous batching processes requests as they arrive and adds new requests to the batch as soon as previous ones complete. This keeps your GPU busy and reduces latency for individual requests, especially during periods of variable traffic.
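The scheduling idea can be sketched as a loop that admits waiting requests between decode steps. This is an illustrative sketch, not vLLM’s actual scheduler; the function and callback names are hypothetical.

```python
# Toy continuous-batching loop (illustrative, not vLLM's scheduler).
# New requests join the running batch between decode steps instead of waiting
# for the whole batch to finish.
from collections import deque

def continuous_batching(pending: deque, max_batch: int, decode_step, is_finished):
    running = []
    while pending or running:
        # Admit waiting requests whenever there is room in the batch.
        while pending and len(running) < max_batch:
            running.append(pending.popleft())
        # One decode step generates the next token for every running request.
        for request in running:
            decode_step(request)
        # Finished requests leave immediately, freeing slots for waiting ones.
        running = [r for r in running if not is_finished(r)]
```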
Request lifecycle

When you send a request to a vLLM worker endpoint:

- The request arrives at Runpod Serverless infrastructure.
- If no worker is available, the request is queued and a worker starts automatically.
- The worker loads your model from Hugging Face (or from the pre-baked Docker image).
- vLLM processes the request using PagedAttention and continuous batching.
- The response is returned to your application.
- If there are no more requests, the worker scales down to zero after a configured timeout.
vLLM workers use the same /run and /runsync operations as other Runpod Serverless endpoints. The only difference is the input format and the specialized LLM processing inside the worker.
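For example, a synchronous request from Python might look like the sketch below. The endpoint ID and API key are placeholders, and the input fields shown (prompt, sampling_params) are illustrative; check the vLLM worker’s documentation for the exact input schema your worker version accepts.

```python
# Minimal sketch of calling a vLLM worker through /runsync with the requests
# library. Endpoint ID, API key, and input fields are placeholders.
import requests

ENDPOINT_ID = "<YOUR_ENDPOINT_ID>"
API_KEY = "<YOUR_RUNPOD_API_KEY>"

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Explain PagedAttention in one sentence.",
            "sampling_params": {"max_tokens": 128, "temperature": 0.7},
        }
    },
    timeout=120,
)
print(response.json())
```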
Why use vLLM workers?
vLLM workers offer several advantages over other LLM deployment options.

Performance and efficiency
vLLM’s PagedAttention and continuous batching deliver significantly better throughput than traditional serving methods. You can serve 2-3x more requests per GPU compared to naive implementations, which directly translates to lower costs and better user experiences.

OpenAI API compatibility
vLLM workers provide a drop-in replacement for OpenAI’s API. If you’re already using the OpenAI Python client or any other OpenAI-compatible library, you can switch to your Runpod endpoint by changing just two lines of code: the API key and the base URL. Your existing prompts, parameters, and response handling code continue to work without modification.
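A minimal sketch of that switch with the OpenAI Python client is shown below. The base URL pattern and model name are placeholders; copy the exact values from your endpoint’s page in the Runpod console.

```python
# Sketch of pointing the OpenAI Python client at a Runpod vLLM endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="<YOUR_RUNPOD_API_KEY>",                              # changed line 1
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # changed line 2
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # the model your worker serves
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(completion.choices[0].message.content)
```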
Model flexibility

You can deploy virtually any model available on Hugging Face, including popular options like Llama, Mistral, Qwen, Gemma, and thousands of others. vLLM supports a wide range of model architectures out of the box, and new architectures are added regularly.

Auto-scaling and cost efficiency
Runpod Serverless automatically scales your vLLM workers from zero to many based on demand. You only pay for the seconds when workers are actively processing requests. This makes vLLM workers ideal for workloads with variable traffic patterns, or when you’re getting started and don’t want to pay for idle capacity.

Production-ready features
vLLM workers come with features that make them suitable for production deployments, including streaming responses, configurable context lengths, quantization support (AWQ, GPTQ), multi-GPU tensor parallelism, and comprehensive error handling.
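Streaming, for instance, works through the same OpenAI-compatible interface. The sketch below uses the same placeholder endpoint ID, API key, and model name as before.

```python
# Sketch of streaming tokens from a vLLM worker via the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    api_key="<YOUR_RUNPOD_API_KEY>",
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Stream a short story."}],
    stream=True,  # tokens arrive incrementally instead of in one response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```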
Deployment options

There are two ways to deploy vLLM workers on Runpod.

Using pre-built Docker images
This is the fastest and most common approach. Runpod provides pre-built vLLM worker images that you can deploy directly from the console. You specify your model name as an environment variable, and the worker downloads it from Hugging Face during initialization. This method is ideal for getting started quickly, testing different models, or deploying models that change frequently. However, model download time adds to your cold start latency.
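Conceptually, the configuration is just a model name plus a few optional environment variables on the endpoint, along the lines of the sketch below. The variable names follow the vLLM worker’s documented settings, but treat the exact set as an assumption and confirm it against the worker’s README.

```python
# Illustrative endpoint configuration for a pre-built vLLM worker image.
# Variable names are assumptions; confirm them in the vLLM worker's README.
endpoint_env = {
    "MODEL_NAME": "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face model to download
    "HF_TOKEN": "<YOUR_HUGGING_FACE_TOKEN>",             # only needed for gated models
    "MAX_MODEL_LEN": "8192",                             # context window to reserve
}
```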
Building custom Docker images with models baked in

For production deployments where cold start time matters, you can build a custom Docker image that includes your model weights. This eliminates download time and can reduce cold starts from minutes to seconds. This approach requires more upfront work but provides the best performance for production workloads with consistent traffic.
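One common pattern is to download the weights at image build time with a small script that your Dockerfile runs, so they ship inside the image. The sketch below assumes a hypothetical model and target directory; you would then point the worker’s model setting at that baked-in path.

```python
# download_model.py - sketch of a build-time script a Dockerfile could RUN so
# the weights ship inside the image. Model ID and target path are hypothetical.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="/models/mistral-7b-instruct",
)
```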
Compatible models

vLLM supports most model architectures available on Hugging Face. You can deploy models from families including Llama (1, 2, 3, 3.1, 3.2), Mistral and Mixtral, Qwen2 and Qwen2.5, Gemma and Gemma 2, Phi (2, 3, 3.5, 4), DeepSeek (V2, V3, R1), GPT-2, GPT-J, OPT, BLOOM, Falcon, MPT, StableLM, Yi, and many others. For a complete and up-to-date list of supported model architectures, see the vLLM supported models documentation.

Performance considerations
Several factors affect vLLM worker performance.

GPU selection is the most important factor. Larger models require more VRAM, and inference speed scales with GPU memory bandwidth. For 7B parameter models, an A10G or better is recommended. For 70B+ models, you’ll need an A100 or H100. See GPU types for details on available GPUs.

Model size directly impacts both loading time and inference speed. Smaller models (7B parameters) load quickly and generate tokens fast. Larger models (70B+ parameters) provide better quality but require more powerful GPUs and have higher latency.

Quantization reduces model size and memory requirements by using lower-precision weights. Methods like AWQ and GPTQ can reduce memory usage by 2-4x with minimal quality loss. This lets you run larger models on smaller GPUs or increase throughput on a given GPU.

Context length affects memory requirements and processing time. Longer contexts require more memory for the KV cache and take longer to process. Set MAX_MODEL_LEN to the minimum value that meets your needs.

Concurrent requests benefit from vLLM’s continuous batching, but too many concurrent requests can exceed GPU memory and cause failures. The MAX_NUM_SEQS environment variable controls the maximum number of concurrent sequences.
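To see how these two settings interact, a back-of-envelope KV-cache estimate helps. The dimensions below correspond to a Llama-2-7B-class model in FP16 and are illustrative; substitute your model’s configuration.

```python
# Back-of-envelope KV-cache sizing for a Llama-2-7B-class model in FP16.
# Dimensions are illustrative; substitute your model's config values.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # FP16

# Keys and values are both cached, hence the factor of 2.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token / 2**20)  # ~0.5 MiB per token

max_model_len = 4096   # MAX_MODEL_LEN
max_num_seqs = 16      # MAX_NUM_SEQS
worst_case_gib = kv_bytes_per_token * max_model_len * max_num_seqs / 2**30
print(worst_case_gib)  # ~32 GiB of KV cache in the worst case
```

PagedAttention allocates pages on demand, so the worst case is rarely reached, but the arithmetic shows why lowering MAX_MODEL_LEN or MAX_NUM_SEQS is the first lever to pull when a worker runs out of GPU memory.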