Lighteval documentation

Using VLLM as Backend

Lighteval allows you to use VLLM as a backend, providing significant speedups for model evaluation. To use VLLM, simply change the model_args to reflect the arguments you want to pass to VLLM.

See the vLLM documentation for the full list of supported engine arguments.

Basic Usage

lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta" \
    "extended|ifeval|0"

Parallelism Options

VLLM can make use of multiple GPUs, either by splitting the model across them (tensor or pipeline parallelism) or by running several full copies of it in parallel (data parallelism). You choose the method by setting the appropriate parameters in the model_args.

Tensor Parallelism

For example, if you have 4 GPUs, you can split the model across them using tensor parallelism:

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,tensor_parallel_size=4" \
    "extended|ifeval|0"

Data Parallelism

If your model fits on a single GPU, you can use data parallelism to speed up the evaluation:

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,data_parallel_size=4" \
    "extended|ifeval|0"

Using a Configuration File

For more advanced configurations, you can use a YAML configuration file for the model and pass its path in place of the model_args string. An example configuration file, which can also be found at examples/model_configs/vllm_model_config.yaml, is shown after the command below.

lighteval vllm \
    "examples/model_configs/vllm_model_config.yaml" \
    "extended|ifeval|0"
model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    revision: "main"
    dtype: "bfloat16"
    tensor_parallel_size: 1
    data_parallel_size: 1
    pipeline_parallel_size: 1
    gpu_memory_utilization: 0.9
    max_model_length: 2048
    swap_space: 4
    seed: 1
    trust_remote_code: True
    add_special_tokens: True
    multichoice_continuations_start_space: True
    pairwise_tokenization: True
    subfolder: null
    generation_parameters:
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      seed: 42
      stop_tokens: null
      max_new_tokens: 1024
      min_new_tokens: 0

In case of out-of-memory (OOM) issues, you might need to reduce the model's context size (max_model_length) and lower the gpu_memory_utilization parameter.
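
For instance, a conservative sketch (the values are illustrative, not tuned):

lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,max_model_length=1024,gpu_memory_utilization=0.7" \
    "extended|ifeval|0"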

Key VLLM Parameters

Memory Management

  • gpu_memory_utilization: Controls how much GPU memory VLLM can use (default: 0.9)
  • max_model_length: Maximum sequence length for the model
  • swap_space: Amount of CPU memory to use for swapping (in GB)

Parallelism Settings

  • tensor_parallel_size: Number of GPUs for tensor parallelism
  • data_parallel_size: Number of GPUs for data parallelism
  • pipeline_parallel_size: Number of GPUs for pipeline parallelism (a combined example is sketched below)
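
These settings can be combined: in VLLM, each model replica uses tensor_parallel_size × pipeline_parallel_size GPUs. A sketch for an 8-GPU node, reusing the key=value syntax shown earlier (values are illustrative):

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,tensor_parallel_size=4,pipeline_parallel_size=2" \
    "extended|ifeval|0"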

Generation Parameters

  • temperature: Controls randomness in generation (0.0 = greedy/deterministic, higher values = more random)
  • top_p: Nucleus sampling parameter
  • top_k: Top-k sampling parameter
  • max_new_tokens: Maximum number of tokens to generate
  • repetition_penalty: Penalty for repeating tokens

Troubleshooting

Common Issues

  1. Out of Memory Errors: Reduce gpu_memory_utilization or max_model_length
  2. Worker Process Issues: Ensure VLLM_WORKER_MULTIPROC_METHOD=spawn is set for multi-GPU setups
  3. Model Loading Errors: Check that the model name and revision are correct (see the sketch below)
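
For the third case, the revision and trust_remote_code keys from the configuration file above can help pin the exact checkpoint; a sketch, assuming they can also be passed in model_args (values are illustrative):

lighteval vllm \
    "model_name=HuggingFaceH4/zephyr-7b-beta,revision=main,trust_remote_code=True" \
    "extended|ifeval|0"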