Lighteval documentation

Using VLLM as Backend

Lighteval allows you to use VLLM as a backend, providing significant speedups for model evaluation. To use VLLM, simply change the model_args to reflect the arguments you want to pass to VLLM.

See the vLLM documentation for the full list of supported engine arguments.

Basic Usage

lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta" \
    "extended|ifeval|0"

Parallelism Options

VLLM can make use of multiple GPUs, either by splitting the model across them (tensor or pipeline parallelism) or by running several full copies of it in parallel (data parallelism). You choose the method by setting the appropriate parameters in the model_args.

Tensor Parallelism

For example, if you have 4 GPUs, you can split the model across them using tensor parallelism:

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,tensor_parallel_size=4" \
    "extended|ifeval|0"

Data Parallelism

If your model fits on a single GPU, you can use data parallelism to speed up the evaluation:

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,data_parallel_size=4" \
    "extended|ifeval|0"

Using a Configuration File

For more advanced configurations, you can use a YAML configuration file for the model and pass its path in place of the model_args string. An example configuration file, which can also be found at examples/model_configs/vllm_model_config.yaml, is shown after the command below.

lighteval vllm \
    "examples/model_configs/vllm_model_config.yaml" \
    "extended|ifeval|0"
model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    revision: "main"
    dtype: "bfloat16"
    tensor_parallel_size: 1
    data_parallel_size: 1
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    max_model_length: 2048
    swap_space: 4
    seed: 1
    trust_remote_code: True
    add_special_tokens: True
    multichoice_continuations_start_space: True
    pairwise_tokenization: True
    subfolder: null
    generation_parameters:
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      seed: 42
      stop_tokens: null
      max_new_tokens: 1024
      min_new_tokens: 0

In case of out-of-memory (OOM) issues, you might need to reduce the model's context size (max_model_length) and lower the gpu_memory_utilization parameter.
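
For instance, a conservative sketch (the values are illustrative, not tuned):

lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,max_model_length=1024,gpu_memory_utilization=0.7" \
    "extended|ifeval|0"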

Key VLLM Parameters

Memory Management

  • gpu_memory_utilization: Controls how much GPU memory VLLM can use (default: 0.9)
  • max_model_length: Maximum sequence length for the model
  • swap_space: Amount of CPU memory to use for swapping (in GB)

Parallelism Settings

  • tensor_parallel_size: Number of GPUs for tensor parallelism
  • data_parallel_size: Number of GPUs for data parallelism
  • pipeline_parallel_size: Number of GPUs for pipeline parallelism (a combined example is sketched below)
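
These settings can be combined: in VLLM, each model replica uses tensor_parallel_size × pipeline_parallel_size GPUs. A sketch for an 8-GPU node, reusing the key=value syntax shown earlier (values are illustrative):

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,tensor_parallel_size=4,pipeline_parallel_size=2" \
    "extended|ifeval|0"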

Generation Parameters

  • temperature: Controls randomness in generation (0.0 = greedy/deterministic, higher values = more random)
  • top_p: Nucleus sampling parameter
  • top_k: Top-k sampling parameter
  • max_new_tokens: Maximum number of tokens to generate
  • repetition_penalty: Penalty for repeating tokens

Troubleshooting

Common Issues

  1. Out of Memory Errors: Reduce gpu_memory_utilization or max_model_length
  2. Worker Process Issues: Ensure VLLM_WORKER_MULTIPROC_METHOD=spawn is set for multi-GPU setups
  3. Model Loading Errors: Check that the model name and revision are correct (see the sketch below)
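
For the third case, the revision and trust_remote_code keys from the configuration file above can help pin the exact checkpoint; a sketch, assuming they can also be passed in model_args (values are illustrative):

lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,revision=main,trust_remote_code=True" \
    "extended|ifeval|0"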