Lighteval documentation

Using SGLang as Backend

Lighteval allows you to use SGLang as a backend, providing significant speedups for model evaluation. To use SGLang, set model_args to the arguments you want to pass to SGLang.
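
SGLang is an optional dependency of Lighteval. In recent releases it can typically be installed as an extra (the exact extra name may vary between versions, so check the installation guide for your release):

pip install lighteval[sglang]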

Basic Usage

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0"

Parallelism Options

SGLang can distribute the model across multiple GPUs using data parallelism and tensor parallelism. You can choose the parallelism method by setting the appropriate parameters in the model_args.

Tensor Parallelism

For example, if you have 4 GPUs, you can split the model across them using tensor parallelism with tp_size:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=4" \
    "leaderboard|truthfulqa:mc|0"

Data Parallelism

If your model fits on a single GPU, you can use data parallelism with dp_size to speed up the evaluation:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,dp_size=4" \
    "leaderboard|truthfulqa:mc|0"

Using a Configuration File

For more advanced configurations, you can use a YAML configuration file for the model. An example configuration file is shown below and can be found at examples/model_configs/sglang_model_config.yaml.

lighteval sglang \
    "examples/model_configs/sglang_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0"

The full list of SGLang server arguments is described in the SGLang documentation.

model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    dtype: "auto"
    tp_size: 1
    dp_size: 1
    context_length: null
    random_seed: 1
    trust_remote_code: False
    device: "cuda"
    skip_tokenizer_init: False
    kv_cache_dtype: "auto"
    add_special_tokens: True
    pairwise_tokenization: False
    sampling_backend: null
    attention_backend: null
    mem_fraction_static: 0.8
    chunked_prefill_size: 4096
    generation_parameters:
      max_new_tokens: 1024
      min_new_tokens: 0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0

In case of out-of-memory (OOM) issues, you might need to reduce the model's context length (context_length) as well as the mem_fraction_static and chunked_prefill_size parameters.
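
As a rough starting point, the fragment below adjusts the example configuration above in that direction; the values are purely illustrative and depend on your GPU memory and model size:

model_parameters:
    context_length: 4096
    mem_fraction_static: 0.6
    chunked_prefill_size: 2048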

Key SGLang Parameters

Memory Management

  • mem_fraction_static: Fraction of GPU memory reserved for static allocation, i.e. model weights and the KV cache pool (default: 0.8)
  • chunked_prefill_size: Size of chunks for prefill operations (default: 4096)
  • context_length: Maximum context length for the model
  • kv_cache_dtype: Data type for key-value cache
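
These options can also be passed directly through model_args; the command below is a sketch that assumes the same keys shown in the YAML example are accepted as comma-separated key=value pairs:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,mem_fraction_static=0.7,chunked_prefill_size=2048,context_length=4096" \
    "leaderboard|truthfulqa:mc|0"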

Parallelism Settings

  • tp_size: Number of GPUs for tensor parallelism
  • dp_size: Number of GPUs for data parallelism
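
The two settings can be combined; assuming SGLang's usual semantics, the total number of GPUs used is tp_size × dp_size, so the following sketch would occupy 4 GPUs:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=2,dp_size=2" \
    "leaderboard|truthfulqa:mc|0"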

Model Configuration

  • dtype: Data type for model weights ("auto", "float16", "bfloat16", etc.)
  • device: Device to run the model on ("cuda", "cpu")
  • trust_remote_code: Whether to trust remote code from the model
  • skip_tokenizer_init: Skip tokenizer initialization for faster startup
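
For example, to evaluate a model in bfloat16 while allowing custom modeling code from the Hub (a sketch; boolean values are assumed to be parsed in model_args the same way they appear in the YAML config):

lighteval sglang \
    "model_name=HuggingFaceTB/SmolLM-1.7B-Instruct,dtype=bfloat16,trust_remote_code=True" \
    "leaderboard|truthfulqa:mc|0"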

Generation Parameters

  • temperature: Controls randomness in generation (0.0 = greedy/deterministic, higher values = more random)
  • top_p: Nucleus sampling parameter
  • top_k: Top-k sampling parameter
  • max_new_tokens: Maximum number of tokens to generate
  • repetition_penalty: Penalty for repeating tokens
  • presence_penalty: Penalty for token presence
  • frequency_penalty: Penalty for token frequency
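
For greedy, reproducible evaluations, a common pattern is to set temperature to 0.0 in the generation_parameters block of the YAML config shown above (an illustrative fragment):

model_parameters:
    generation_parameters:
      max_new_tokens: 1024
      temperature: 0.0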