How vLLM requests work
vLLM workers are queue-based Serverless endpoints. They use the same /run and /runsync operations as other Runpod endpoints, following the standard Serverless request structure.
The key difference is the input format. vLLM workers expect specific parameters for language model inference, such as prompts, messages, and sampling parameters. The worker’s handler processes these inputs using the vLLM engine and returns generated text.
Request operations
vLLM endpoints support both synchronous and asynchronous requests.
Asynchronous requests with /run
Use /run to submit a job that processes in the background. You’ll receive a job ID immediately, then poll for results using the /status endpoint.
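As a sketch using Python’s requests library (assuming the standard https://api.runpod.ai/v2/ENDPOINT_ID base URL; the endpoint ID, API key, and prompt below are placeholders):

```python
import time

import requests

ENDPOINT_ID = "your-endpoint-id"    # placeholder
RUNPOD_API_KEY = "your-api-key"     # placeholder

BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {RUNPOD_API_KEY}"}

# Submit the job asynchronously; /run returns a job ID immediately.
payload = {"input": {"prompt": "Explain what vLLM is in one sentence."}}
job = requests.post(f"{BASE_URL}/run", json=payload, headers=HEADERS).json()
job_id = job["id"]

# Poll /status until the job reaches a terminal state, then read the output.
while True:
    status = requests.get(f"{BASE_URL}/status/{job_id}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(status.get("output"))
```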
Synchronous requests with /runsync
Use /runsync to wait for the complete response in a single request. The client blocks until processing is complete.
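A minimal sketch of the same request against /runsync, again with a placeholder endpoint ID, API key, and prompt:

```python
import requests

ENDPOINT_ID = "your-endpoint-id"    # placeholder
RUNPOD_API_KEY = "your-api-key"     # placeholder

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {"Authorization": f"Bearer {RUNPOD_API_KEY}"}

payload = {
    "input": {
        "prompt": "Write a haiku about GPUs.",
        "sampling_params": {"max_tokens": 64},
    }
}

# The call blocks until generation finishes and returns the output inline.
response = requests.post(url, json=payload, headers=headers).json()
print(response.get("output"))
```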
Input formats
vLLM workers accept two input formats for text generation.
Messages format (for chat models)
Use the messages format for instruction-tuned models that expect conversation history. The worker automatically applies the model’s chat template.
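For example, a chat-style request input might look like the following (the messages and sampling values are illustrative; the final prompt depends on the model’s chat template):

```python
# Chat-style input: the worker applies the model's chat template to `messages`.
payload = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is retrieval-augmented generation?"},
        ],
        "sampling_params": {"temperature": 0.7, "max_tokens": 256},
    }
}
```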
Prompt format (for text completion)
Use the prompt format for base models or when you want to provide raw text without a chat template.
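A comparable completion-style input, with an illustrative prompt:

```python
# Raw-completion input: the prompt is passed to the model as-is.
payload = {
    "input": {
        "prompt": "The three laws of robotics are:",
        "sampling_params": {"max_tokens": 128},
    }
}
```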
Applying chat templates to prompts
If you use the prompt format but want the model’s chat template applied, set apply_chat_template to true.
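For instance, this illustrative input sends a plain prompt but asks the worker to wrap it in the model’s chat template before generation:

```python
# A plain prompt, with the model's chat template applied before generation.
payload = {
    "input": {
        "prompt": "Summarize the plot of Hamlet in two sentences.",
        "apply_chat_template": True,
        "sampling_params": {"max_tokens": 128},
    }
}
```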
Request input parameters
Here are all available parameters you can include in the input object of your request.
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | None | Prompt string to generate text based on. |
| messages | list[dict[str, str]] | None | List of messages with role and content keys. The model’s chat template will be applied automatically. Overrides prompt. |
| apply_chat_template | bool | false | Whether to apply the model’s chat template to the prompt. |
| sampling_params | dict | {} | Sampling parameters to control generation (see the Sampling parameters section below). |
| stream | bool | false | Whether to enable streaming of output. If true, responses are streamed as they are generated. |
| max_batch_size | int | env DEFAULT_BATCH_SIZE | The maximum number of tokens to stream per HTTP POST call. |
| min_batch_size | int | env DEFAULT_MIN_BATCH_SIZE | The minimum number of tokens to stream per HTTP POST call. |
| batch_size_growth_factor | int | env DEFAULT_BATCH_SIZE_GROWTH_FACTOR | The growth factor by which min_batch_size multiplies for each call until max_batch_size is reached. |
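As an illustration, a single input object can combine a prompt, template handling, streaming, and sampling options (all values below are examples, not recommendations):

```python
# Illustrative input object combining generation, sampling, and streaming options.
payload = {
    "input": {
        "prompt": "List three uses of serverless GPUs.",
        "apply_chat_template": True,
        "stream": True,
        "max_batch_size": 50,        # tokens per streamed HTTP call
        "sampling_params": {
            "temperature": 0.8,
            "max_tokens": 200,
        },
    }
}
```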
Sampling parameters
Sampling parameters control how the model generates text. Include them in the sampling_params dictionary in your request.
| Parameter | Type | Default | Description |
|---|---|---|---|
| n | int | 1 | Number of output sequences generated from the prompt. The top n sequences are returned. |
| best_of | int | n | Number of output sequences generated from the prompt. The top n sequences are returned from these best_of sequences. Must be ≥ n. Treated as beam width in beam search. |
| presence_penalty | float | 0.0 | Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. |
| frequency_penalty | float | 0.0 | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. |
| repetition_penalty | float | 1.0 | Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens, values < 1 encourage repetition. |
| temperature | float | 1.0 | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. |
| top_p | float | 1.0 | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | int | -1 | Controls the number of top tokens to consider. Set to -1 to consider all tokens. |
| min_p | float | 0.0 | The minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. |
| use_beam_search | bool | false | Whether to use beam search instead of sampling. |
| length_penalty | float | 1.0 | Penalizes sequences based on their length. Used in beam search. |
| early_stopping | bool or string | false | Controls the stopping condition in beam search. Can be true, false, or "never". |
| stop | string or list[str] | None | String(s) that stop generation when produced. The output will not contain these strings. |
| stop_token_ids | list[int] | None | List of token IDs that stop generation when produced. The output contains these tokens unless they are special tokens. |
| ignore_eos | bool | false | Whether to ignore the End-Of-Sequence token and continue generating tokens after it is produced. |
| max_tokens | int | 16 | Maximum number of tokens to generate per output sequence. |
| min_tokens | int | 0 | Minimum number of tokens to generate per output sequence before EOS or stop sequences can end generation. |
| skip_special_tokens | bool | true | Whether to skip special tokens in the output. |
| spaces_between_special_tokens | bool | true | Whether to add spaces between special tokens in the output. |
| truncate_prompt_tokens | int | None | If set, truncate the prompt to this many tokens. |
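For example, a sampling_params dictionary that lowers randomness, caps output length, and stops at a blank line might look like this (values are illustrative):

```python
# Illustrative sampling parameters for more deterministic, bounded output.
sampling_params = {
    "temperature": 0.2,       # closer to greedy decoding
    "top_p": 0.9,             # nucleus sampling over the top 90% probability mass
    "max_tokens": 300,        # hard cap on generated tokens
    "stop": ["\n\n"],         # stop when a blank line is produced
    "repetition_penalty": 1.1,
}

payload = {"input": {"prompt": "Explain KV caching:", "sampling_params": sampling_params}}
```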
Streaming responses
Enable streaming to receive tokens as they’re generated instead of waiting for the complete response. Set stream to true in your request input, and replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values in the example below.
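Here is one possible way to consume a stream with Python’s requests library, assuming the /stream operation described in the linked documentation; the exact shape of each streamed chunk depends on the worker’s output format:

```python
import time

import requests

ENDPOINT_ID = "your-endpoint-id"    # placeholder
RUNPOD_API_KEY = "your-api-key"     # placeholder

BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {RUNPOD_API_KEY}"}

# Submit the job with streaming enabled.
payload = {
    "input": {
        "prompt": "Tell me a short story about a robot.",
        "stream": True,
        "sampling_params": {"max_tokens": 500},
    }
}
job_id = requests.post(f"{BASE_URL}/run", json=payload, headers=HEADERS).json()["id"]

# Poll /stream for partial results until the job completes.
while True:
    chunk = requests.get(f"{BASE_URL}/stream/{job_id}", headers=HEADERS).json()
    for part in chunk.get("stream", []):
        print(part.get("output"), flush=True)
    if chunk.get("status") == "COMPLETED":
        break
    time.sleep(0.5)
```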
For more information on streaming, see the stream operation documentation.