What is vLLM?
vLLM is an open-source inference engine designed to serve large language models efficiently. It maximizes throughput and minimizes latency when running LLM inference workloads. vLLM workers include the vLLM engine with GPU optimizations and support for both OpenAI’s API and Runpod’s native API. You can deploy any supported model from Hugging Face with minimal configuration and start serving requests immediately. The workers run on Runpod Serverless, which automatically scales based on demand.

How vLLM works
vLLM uses several advanced techniques to achieve high performance when serving LLMs. Understanding these can help you optimize your deployments and troubleshoot issues.

PagedAttention for memory efficiency
PagedAttention is the key innovation in vLLM, and it dramatically improves how GPU memory is used during inference. Traditional LLM serving wastes memory by pre-allocating large contiguous blocks for key-value (KV) caches. PagedAttention breaks the KV cache into smaller pages, similar to how operating systems manage memory. This reduces memory waste and allows vLLM to serve more requests concurrently on the same GPU, so you can handle higher throughput or serve larger models on smaller GPUs.
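As a mental model, the bookkeeping looks roughly like the toy allocator below. This is an illustrative sketch only; the class, page size, and method names are hypothetical and are not vLLM internals.

```python
# Toy sketch of paged KV-cache bookkeeping (not vLLM's actual implementation).
# Each sequence maps to a list of fixed-size pages instead of one large
# contiguous reservation, so memory is claimed only as tokens are generated.

PAGE_SIZE = 16  # tokens per page (hypothetical)

class PagedKVCache:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))  # pool shared by all sequences
        self.page_table = {}  # sequence id -> list of page ids

    def append_token(self, seq_id: str, num_tokens_so_far: int) -> None:
        """Allocate a new page only when the current page is full."""
        pages = self.page_table.setdefault(seq_id, [])
        if num_tokens_so_far % PAGE_SIZE == 0:  # first token, or current page full
            pages.append(self.free_pages.pop())

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the shared pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
```

Because pages are claimed only as tokens are generated and returned as soon as a sequence finishes, many more sequences can share the same GPU memory than with large contiguous reservations.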
Continuous batching

vLLM uses continuous batching (also called dynamic batching) to process multiple requests simultaneously. Unlike traditional batching, which waits for a batch to fill up before processing, continuous batching processes requests as they arrive and adds new requests to the batch as soon as previous ones complete. This keeps your GPU busy and reduces latency for individual requests, especially during periods of variable traffic.
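The scheduling idea can be sketched as a loop that admits waiting requests between decode steps. This is an illustrative sketch, not vLLM’s actual scheduler; the function and callback names are hypothetical.

```python
# Toy continuous-batching loop (illustrative, not vLLM's scheduler).
# New requests join the running batch between decode steps instead of waiting
# for the whole batch to finish.
from collections import deque

def continuous_batching(pending: deque, max_batch: int, decode_step, is_finished):
    running = []
    while pending or running:
        # Admit waiting requests whenever there is room in the batch.
        while pending and len(running) < max_batch:
            running.append(pending.popleft())
        # One decode step generates the next token for every running request.
        for request in running:
            decode_step(request)
        # Finished requests leave immediately, freeing slots for waiting ones.
        running = [r for r in running if not is_finished(r)]
```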
Request lifecycle

When you send a request to a vLLM worker endpoint:

- The request arrives at Runpod Serverless infrastructure.
- If no worker is available, the request is queued and a worker starts automatically.
- The worker loads your model from Hugging Face (or from the pre-baked Docker image).
- vLLM processes the request using PagedAttention and continuous batching.
- The response is returned to your application.
- If there are no more requests, the worker scales down to zero after a configured timeout.
vLLM workers use the same /run and /runsync operations as other Runpod Serverless endpoints. The only difference is the input format and the specialized LLM processing inside the worker.
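For example, a synchronous request from Python might look like the sketch below. The endpoint ID and API key are placeholders, and the input fields shown (prompt, sampling_params) are illustrative; check the vLLM worker’s documentation for the exact input schema your worker version accepts.

```python
# Minimal sketch of calling a vLLM worker through /runsync with the requests
# library. Endpoint ID, API key, and input fields are placeholders.
import requests

ENDPOINT_ID = "<YOUR_ENDPOINT_ID>"
API_KEY = "<YOUR_RUNPOD_API_KEY>"

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "Explain PagedAttention in one sentence.",
            "sampling_params": {"max_tokens": 128, "temperature": 0.7},
        }
    },
    timeout=120,
)
print(response.json())
```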
Why use vLLM workers?
vLLM workers offer several advantages over other LLM deployment options.

Performance and efficiency
vLLM’s PagedAttention and continuous batching deliver significantly better throughput than traditional serving methods. You can serve 2-3x more requests per GPU compared to naive implementations, which directly translates to lower costs and better user experiences.

OpenAI API compatibility
vLLM workers provide a drop-in replacement for OpenAI’s API. If you’re already using the OpenAI Python client or any other OpenAI-compatible library, you can switch to your Runpod endpoint by changing just two lines of code: the API key and the base URL. Your existing prompts, parameters, and response handling code continue to work without modification.
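A minimal sketch of that switch with the OpenAI Python client is shown below. The base URL pattern and model name are placeholders; copy the exact values from your endpoint’s page in the Runpod console.

```python
# Sketch of pointing the OpenAI Python client at a Runpod vLLM endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="<YOUR_RUNPOD_API_KEY>",                              # changed line 1
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # changed line 2
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # the model your worker serves
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(completion.choices[0].message.content)
```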
Model flexibility

You can deploy virtually any model available on Hugging Face, including popular options like Llama, Mistral, Qwen, Gemma, and thousands of others. vLLM supports a wide range of model architectures out of the box, and new architectures are added regularly.

Auto-scaling and cost efficiency
Runpod Serverless automatically scales your vLLM workers from zero to many based on demand. You only pay for the seconds when workers are actively processing requests. This makes vLLM workers ideal for workloads with variable traffic patterns, or when you’re getting started and don’t want to pay for idle capacity.

Production-ready features
vLLM workers come with features that make them suitable for production deployments, including streaming responses, configurable context lengths, quantization support (AWQ, GPTQ), multi-GPU tensor parallelism, and comprehensive error handling.
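Streaming, for instance, works through the same OpenAI-compatible interface. The sketch below uses the same placeholder endpoint ID, API key, and model name as before.

```python
# Sketch of streaming tokens from a vLLM worker via the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    api_key="<YOUR_RUNPOD_API_KEY>",
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Stream a short story."}],
    stream=True,  # tokens arrive incrementally instead of in one response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```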
Deployment options

There are two ways to deploy vLLM workers on Runpod.

Using pre-built Docker images
This is the fastest and most common approach. Runpod provides pre-built vLLM worker images that you can deploy directly from the console. You specify your model name as an environment variable, and the worker downloads it from Hugging Face during initialization. This method is ideal for getting started quickly, testing different models, or deploying models that change frequently. However, model download time adds to your cold start latency.
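Conceptually, the configuration is just a model name plus a few optional environment variables on the endpoint, along the lines of the sketch below. The variable names follow the vLLM worker’s documented settings, but treat the exact set as an assumption and confirm it against the worker’s README.

```python
# Illustrative endpoint configuration for a pre-built vLLM worker image.
# Variable names are assumptions; confirm them in the vLLM worker's README.
endpoint_env = {
    "MODEL_NAME": "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face model to download
    "HF_TOKEN": "<YOUR_HUGGING_FACE_TOKEN>",             # only needed for gated models
    "MAX_MODEL_LEN": "8192",                             # context window to reserve
}
```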
Building custom Docker images with models baked in

For production deployments where cold start time matters, you can build a custom Docker image that includes your model weights. This eliminates download time and can reduce cold starts from minutes to seconds. This approach requires more upfront work but provides the best performance for production workloads with consistent traffic.
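One common pattern is to download the weights at image build time with a small script that your Dockerfile runs, so they ship inside the image. The sketch below assumes a hypothetical model and target directory; you would then point the worker’s model setting at that baked-in path.

```python
# download_model.py - sketch of a build-time script a Dockerfile could RUN so
# the weights ship inside the image. Model ID and target path are hypothetical.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="/models/mistral-7b-instruct",
)
```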
Compatible models

vLLM supports most model architectures available on Hugging Face. You can deploy models from families including Llama (1, 2, 3, 3.1, 3.2), Mistral and Mixtral, Qwen2 and Qwen2.5, Gemma and Gemma 2, Phi (2, 3, 3.5, 4), DeepSeek (V2, V3, R1), GPT-2, GPT-J, OPT, BLOOM, Falcon, MPT, StableLM, Yi, and many others. For a complete and up-to-date list of supported model architectures, see the vLLM supported models documentation.

Performance considerations
Several factors affect vLLM worker performance.

GPU selection is the most important factor. Larger models require more VRAM, and inference speed scales with GPU memory bandwidth. For 7B parameter models, an A10G or better is recommended. For 70B+ models, you’ll need an A100 or H100. See GPU types for details on available GPUs.

Model size directly impacts both loading time and inference speed. Smaller models (7B parameters) load quickly and generate tokens fast. Larger models (70B+ parameters) provide better quality but require more powerful GPUs and have higher latency.

Quantization reduces model size and memory requirements by using lower-precision weights. Methods like AWQ and GPTQ can reduce memory usage by 2-4x with minimal quality loss. This lets you run larger models on smaller GPUs or increase throughput on a given GPU.

Context length affects memory requirements and processing time. Longer contexts require more memory for the KV cache and take longer to process. Set MAX_MODEL_LEN to the minimum value that meets your needs.

Concurrent requests benefit from vLLM’s continuous batching, but too many concurrent requests can exceed GPU memory and cause failures. The MAX_NUM_SEQS environment variable controls the maximum number of concurrent sequences.
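To see how these two settings interact, a back-of-envelope KV-cache estimate helps. The dimensions below correspond to a Llama-2-7B-class model in FP16 and are illustrative; substitute your model’s configuration.

```python
# Back-of-envelope KV-cache sizing for a Llama-2-7B-class model in FP16.
# Dimensions are illustrative; substitute your model's config values.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # FP16

# Keys and values are both cached, hence the factor of 2.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token / 2**20)  # ~0.5 MiB per token

max_model_len = 4096   # MAX_MODEL_LEN
max_num_seqs = 16      # MAX_NUM_SEQS
worst_case_gib = kv_bytes_per_token * max_model_len * max_num_seqs / 2**30
print(worst_case_gib)  # ~32 GiB of KV cache in the worst case
```

PagedAttention allocates pages on demand, so the worst case is rarely reached, but the arithmetic shows why lowering MAX_MODEL_LEN or MAX_NUM_SEQS is the first lever to pull when a worker runs out of GPU memory.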