# Using SGLang as Backend

Lighteval allows you to use SGLang as a backend, providing significant speedups for model evaluation.
To use SGLang, simply change the `model_args` to reflect the arguments you want to pass to SGLang.

## Basic Usage

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0"
```

## Parallelism Options

SGLang can distribute the model across multiple GPUs using data parallelism and tensor parallelism.
You can choose the parallelism method by setting the appropriate parameters in the `model_args`.

### Tensor Parallelism

For example, if you have 4 GPUs, you can split the model across them using tensor parallelism with `tp_size`:

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=4" \
    "leaderboard|truthfulqa:mc|0"
```

### Data Parallelism

If your model fits on a single GPU, you can use data parallelism with `dp_size` to speed up the evaluation:

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,dp_size=4" \
    "leaderboard|truthfulqa:mc|0"
```

## Using a Configuration File

For more advanced configurations, you can use a YAML configuration file for the model.
An example configuration file is shown below and can be found at `examples/model_configs/sglang_model_config.yaml`.

```bash
lighteval sglang \
    "examples/model_configs/sglang_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0"
```

Documentation for SGLang server arguments can be found in the SGLang documentation.

```yaml
model_parameters:
  model_name: "HuggingFaceTB/SmolLM-1.7B-Instruct"
  dtype: "auto"
  tp_size: 1
  dp_size: 1
  context_length: null
  random_seed: 1
  trust_remote_code: False
  device: "cuda"
  skip_tokenizer_init: False
  kv_cache_dtype: "auto"
  add_special_tokens: True
  pairwise_tokenization: False
  sampling_backend: null
  attention_backend: null
  mem_fraction_static: 0.8
  chunked_prefill_size: 4096
  generation_parameters:
    max_new_tokens: 1024
    min_new_tokens: 0
    temperature: 1.0
    top_k: 50
    min_p: 0.0
    top_p: 1.0
    presence_penalty: 0.0
    repetition_penalty: 1.0
    frequency_penalty: 0.0
```

In case of out-of-memory (OOM) issues, you might need to reduce the context size of the model as well as reduce the `mem_fraction_static` and `chunked_prefill_size` parameters.
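
As a rough sketch, such reductions can be passed directly through `model_args`; the exact values below are illustrative and should be tuned for your GPU and model:

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,context_length=2048,mem_fraction_static=0.6,chunked_prefill_size=2048" \
    "leaderboard|truthfulqa:mc|0"
```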

## Key SGLang Parameters

### Memory Management

- `mem_fraction_static`: Fraction of GPU memory to allocate for static tensors (default: 0.8)
- `chunked_prefill_size`: Size of chunks for prefill operations (default: 4096)
- `context_length`: Maximum context length for the model
- `kv_cache_dtype`: Data type for the key-value cache

### Parallelism Settings

- `tp_size`: Number of GPUs for tensor parallelism
- `dp_size`: Number of GPUs for data parallelism
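
Tensor and data parallelism can also be combined when enough GPUs are available. As an illustrative sketch (assuming an 8-GPU node and a model that fits on 2 GPUs), you could shard each replica across 2 GPUs and run 4 replicas:

```bash
lighteval sglang \
    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tp_size=2,dp_size=4" \
    "leaderboard|truthfulqa:mc|0"
```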

### Model Configuration

- `dtype`: Data type for model weights ("auto", "float16", "bfloat16", etc.)
- `device`: Device to run the model on ("cuda", "cpu")
- `trust_remote_code`: Whether to trust remote code from the model
- `skip_tokenizer_init`: Skip tokenizer initialization for faster startup

### Generation Parameters

- `temperature`: Controls randomness in generation (0.0 = deterministic, higher values = more random)
- `top_p`: Nucleus sampling parameter
- `top_k`: Top-k sampling parameter
- `max_new_tokens`: Maximum number of tokens to generate
- `repetition_penalty`: Penalty for repeating tokens
- `presence_penalty`: Penalty for token presence
- `frequency_penalty`: Penalty for token frequency
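
For instance, a near-deterministic (greedy-style) evaluation could be expressed in the `generation_parameters` block of the YAML configuration shown above. The values below are illustrative, and omitted fields keep their defaults:

```yaml
model_parameters:
  # ... other model parameters as in the example above ...
  generation_parameters:
    max_new_tokens: 256       # cap the generation length
    temperature: 0.0          # 0.0 disables sampling randomness
    top_p: 1.0                # no nucleus filtering
    repetition_penalty: 1.0   # no repetition penalty
```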