Lighteval documentation

Using SGLang as Backend

Lighteval allows you to use SGLang as a backend, providing significant speedups for model evaluation. To use SGLang, set model_args to the arguments you want to pass to SGLang.
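
SGLang is an optional dependency of Lighteval. In recent releases it can typically be installed as an extra (the exact extra name may vary between versions, so check the installation guide for your release):

pip install lighteval[sglang]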

Basic Usage

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0"

Parallelism Options

SGLang can distribute the model across multiple GPUs using data parallelism and tensor parallelism. You can choose the parallelism method by setting the appropriate parameters in the model_args.

Tensor Parallelism

For example, if you have 4 GPUs, you can split the model across them using tensor parallelism with tp_size:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=4" \
    "leaderboard|truthfulqa:mc|0"

Data Parallelism

If your model fits on a single GPU, you can use data parallelism with dp_size to speed up the evaluation:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,dp_size=4" \
    "leaderboard|truthfulqa:mc|0"

Using a Configuration File

For more advanced configurations, you can use a YAML configuration file for the model. An example configuration file is shown below and can be found at examples/model_configs/sglang_model_config.yaml.

lighteval sglang \
    "examples/model_configs/sglang_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0"

The full list of SGLang server arguments is described in the SGLang documentation.

model_parameters:
    model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
    dtype: "auto"
    tp_size: 1
    dp_size: 1
    context_length: null
    random_seed: 1
    trust_remote_code: False
    device: "cuda"
    skip_tokenizer_init: False
    kv_cache_dtype: "auto"
    add_special_tokens: True
    pairwise_tokenization: False
    sampling_backend: null
    attention_backend: null
    mem_fraction_static: 0.8
    chunked_prefill_size: 4096
    generation_parameters:
      max_new_tokens: 1024
      min_new_tokens: 0
      temperature: 1.0
      top_k: 50
      min_p: 0.0
      top_p: 1.0
      presence_penalty: 0.0
      repetition_penalty: 1.0
      frequency_penalty: 0.0

In case of out-of-memory (OOM) issues, you might need to reduce the model's context length (context_length) as well as the mem_fraction_static and chunked_prefill_size parameters.
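
As a rough starting point, the fragment below adjusts the example configuration above in that direction; the values are purely illustrative and depend on your GPU memory and model size:

model_parameters:
    context_length: 4096
    mem_fraction_static: 0.6
    chunked_prefill_size: 2048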

Key SGLang Parameters

Memory Management

  • mem_fraction_static: Fraction of GPU memory reserved for static allocation, i.e. model weights and the KV cache pool (default: 0.8)
  • chunked_prefill_size: Size of chunks for prefill operations (default: 4096)
  • context_length: Maximum context length for the model
  • kv_cache_dtype: Data type for key-value cache
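
These options can also be passed directly through model_args; the command below is a sketch that assumes the same keys shown in the YAML example are accepted as comma-separated key=value pairs:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,mem_fraction_static=0.7,chunked_prefill_size=2048,context_length=4096" \
    "leaderboard|truthfulqa:mc|0"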

Parallelism Settings

  • tp_size: Number of GPUs for tensor parallelism
  • dp_size: Number of GPUs for data parallelism
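
The two settings can be combined; assuming SGLang's usual semantics, the total number of GPUs used is tp_size × dp_size, so the following sketch would occupy 4 GPUs:

lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=2,dp_size=2" \
    "leaderboard|truthfulqa:mc|0"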

Model Configuration

  • dtype: Data type for model weights ("auto", "float16", "bfloat16", etc.)
  • device: Device to run the model on ("cuda", "cpu")
  • trust_remote_code: Whether to trust remote code from the model
  • skip_tokenizer_init: Skip tokenizer initialization for faster startup
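
For example, to evaluate a model in bfloat16 while allowing custom modeling code from the Hub (a sketch; boolean values are assumed to be parsed in model_args the same way they appear in the YAML config):

lighteval sglang \
    "model_name=HuggingFaceTB/SmolLM-1.7B-Instruct,dtype=bfloat16,trust_remote_code=True" \
    "leaderboard|truthfulqa:mc|0"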

Generation Parameters

  • temperature: Controls randomness in generation (0.0 = greedy/deterministic, higher values = more random)
  • top_p: Nucleus sampling parameter
  • top_k: Top-k sampling parameter
  • max_new_tokens: Maximum number of tokens to generate
  • repetition_penalty: Penalty for repeating tokens
  • presence_penalty: Penalty for token presence
  • frequency_penalty: Penalty for token frequency
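
For greedy, reproducible evaluations, a common pattern is to set temperature to 0.0 in the generation_parameters block of the YAML config shown above (an illustrative fragment):

model_parameters:
    generation_parameters:
      max_new_tokens: 1024
      temperature: 0.0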