Using VLLM as Backend
Lighteval allows you to use VLLM as a backend, providing significant speedups for model evaluation.
To use VLLM, simply set the model_args to the arguments you want to pass to the VLLM engine.
The full list of VLLM engine arguments is available in the VLLM documentation.
Basic Usage
lighteval vllm \
"model_name=HuggingFaceH4/zephyr-7b-beta" \
"extended|ifeval|0"
Parallelism Options
VLLM can distribute the model across multiple GPUs using data parallelism, pipeline parallelism, or tensor parallelism.
You can choose the parallelism method by setting the appropriate parameters in model_args.
Tensor Parallelism
For example, if you have 4 GPUs, you can split the model across them using tensor parallelism:
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
"model_name=HuggingFaceH4/zephyr-7b-beta,tensor_parallel_size=4" \
"extended|ifeval|0"
Data Parallelism
If your model fits on a single GPU, you can use data parallelism to speed up the evaluation:
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
"model_name=HuggingFaceH4/zephyr-7b-beta,data_parallel_size=4" \
"extended|ifeval|0"
Using a Configuration File
For more advanced configurations, you can use a YAML configuration file for the model.
An example configuration file is shown below and can be found at examples/model_configs/vllm_model_config.yaml.
lighteval vllm \
"examples/model_configs/vllm_model_config.yaml" \
"extended|ifeval|0"
model_parameters:
  model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
  revision: "main"
  dtype: "bfloat16"
  tensor_parallel_size: 1
  data_parallel_size: 1
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.9
  max_model_length: 2048
  swap_space: 4
  seed: 1
  trust_remote_code: True
  add_special_tokens: True
  multichoice_continuations_start_space: True
  pairwise_tokenization: True
  subfolder: null
  generation_parameters:
    presence_penalty: 0.0
    repetition_penalty: 1.0
    frequency_penalty: 0.0
    temperature: 1.0
    top_k: 50
    min_p: 0.0
    top_p: 1.0
    seed: 42
    stop_tokens: null
    max_new_tokens: 1024
    min_new_tokens: 0
In case of out-of-memory (OOM) issues, you might need to reduce the context size of the model (max_model_length) as well as the gpu_memory_utilization parameter.
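For example, the following sketch lowers both settings (the values are illustrative and should be tuned to your hardware):
lighteval vllm \
"model_name=HuggingFaceH4/zephyr-7b-beta,max_model_length=1024,gpu_memory_utilization=0.7" \
"extended|ifeval|0"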
Key VLLM Parameters
Memory Management
- gpu_memory_utilization: Controls how much GPU memory VLLM can use (default: 0.9)
- max_model_length: Maximum sequence length for the model
- swap_space: Amount of CPU memory to use for swapping (in GB)
Parallelism Settings
- tensor_parallel_size: Number of GPUs for tensor parallelism
- data_parallel_size: Number of GPUs for data parallelism
- pipeline_parallel_size: Number of GPUs for pipeline parallelism (see the example below)
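Pipeline parallelism is enabled the same way as the other modes. For instance, to split the model into two pipeline stages across 2 GPUs (a sketch mirroring the commands above):
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
"model_name=HuggingFaceH4/zephyr-7b-beta,pipeline_parallel_size=2" \
"extended|ifeval|0"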
Generation Parameters
- temperature: Controls randomness in generation (0.0 = deterministic; higher values increase randomness)
- top_p: Nucleus sampling parameter
- top_k: Top-k sampling parameter
- max_new_tokens: Maximum number of tokens to generate
- repetition_penalty: Penalty for repeating tokens
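These are set in the generation_parameters block of the YAML config shown above. For example, a near-deterministic, shorter-output setup might look like this (an illustrative excerpt, not a recommended default):
model_parameters:
  model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
  generation_parameters:
    temperature: 0.0
    top_p: 1.0
    max_new_tokens: 256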
Troubleshooting
Common Issues
- Out of Memory Errors: Reduce gpu_memory_utilization or max_model_length
- Worker Process Issues: Ensure VLLM_WORKER_MULTIPROC_METHOD=spawn is set for multi-GPU setups
- Model Loading Errors: Check that the model name and revision are correct