Quick Tour
We recommend using the --help
flag to get more information about the
available options for each command.
lighteval --help
Lighteval can be used with several different commands, each optimized for different evaluation scenarios.
Available Commands
Evaluation Backends
- lighteval accelerate: Evaluate models on CPU or one or more GPUs using 🤗 Accelerate
- lighteval nanotron: Evaluate models in distributed settings using ⚡️ Nanotron
- lighteval vllm: Evaluate models on one or more GPUs using 🚀 VLLM
- lighteval custom: Evaluate custom models (can be anything)
- lighteval sglang: Evaluate models using SGLang as backend
- lighteval endpoint: Evaluate models using various endpoints as backend
  - lighteval endpoint inference-endpoint: Evaluate models using Hugging Face’s Inference Endpoints API
  - lighteval endpoint tgi: Evaluate models using 🔗 Text Generation Inference running locally
  - lighteval endpoint litellm: Evaluate models on any compatible API using LiteLLM
  - lighteval endpoint inference-providers: Evaluate models using Hugging Face’s inference providers as backend
Evaluation Utils
- lighteval baseline: Compute baselines for given tasks
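For instance, to compute the baseline scores for a single task (a sketch; the exact arguments may differ, check lighteval baseline --help):
lighteval baseline "leaderboard|truthfulqa:mc|0"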
Utils
- lighteval tasks: List or inspect tasks
  - lighteval tasks list: List all available tasks
  - lighteval tasks inspect: Inspect a specific task to see its configuration and samples
  - lighteval tasks create: Create a new task from a template
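For example, to browse the available tasks and then look at one of them in detail (a sketch; the exact arguments accepted by inspect may differ, check lighteval tasks --help):
lighteval tasks list
lighteval tasks inspect "leaderboard|truthfulqa:mc|0"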
Basic Usage
To evaluate GPT-2 on the TruthfulQA benchmark with 🤗 Accelerate, run:
lighteval accelerate \
"model_name=openai-community/gpt2" \
"leaderboard|truthfulqa:mc|0"
Here, we first choose a backend (either accelerate
, nanotron
, endpoint
, or vllm
), and then specify the model and task(s) to run.
Task Specification
The syntax for the task specification might be a bit hard to grasp at first. The format is as follows:
{suite}|{task}|{num_few_shot}
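For instance, in the "leaderboard|truthfulqa:mc|0" specification used above, leaderboard is the suite, truthfulqa:mc is the task, and 0 is the number of few-shot examples.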
Tasks have a function applied at the sample level and one at the corpus level. For example,
- an exact match can be applied per sample, then averaged over the corpus to give the final score
- samples can be left untouched, with Corpus BLEU then applied at the corpus level.
If the task you are looking at has a sample-level function (sample_level_fn) that can be parametrized, you can pass parameters directly in the CLI.
For example
{suite}|{task}@{parameter_name1}={value1}@{parameter_name2}={value2},...|0
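As a hypothetical illustration (the parameter names below are made up; the parameters actually available depend on the task’s sample_level_fn):
{suite}|{task}@normalize=True@k=5|0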
All officially supported tasks can be found at the tasks_list and in the extended folder. Moreover, community-provided tasks can be found in the community folder.
For more details on the implementation of the tasks, such as how prompts are constructed or which metrics are used, you can examine the implementation file.
Running Multiple Tasks
Running multiple tasks is supported, either with a comma-separated list or by specifying a file path.
The file should be structured like examples/tasks/recommended_set.txt.
When specifying a path to a file, it should start with ./.
lighteval accelerate \
"model_name=openai-community/gpt2" \
./path/to/lighteval/examples/tasks/recommended_set.txt
# or, e.g., "leaderboard|truthfulqa:mc|0,leaderboard|gsm8k|3"
Backend Configuration
General Information
The model-args
argument takes a string of comma-separated key=value pairs. The arguments
allowed vary depending on the backend you use and
correspond to the fields of the model configurations.
The model configurations can be found here.
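For example, with the accelerate backend you might pass several fields at once (a sketch; model_name appears in the examples in this guide, while revision and dtype are assumptions that depend on the backend’s model configuration):
lighteval accelerate \
"model_name=openai-community/gpt2,revision=main,dtype=float16" \
"leaderboard|truthfulqa:mc|0"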
All models allow you to post-process your reasoning model predictions
to remove the thinking tokens from the trace used to compute the metrics,
using --remove-reasoning-tags
and --reasoning-tags
to specify which
reasoning tags to remove (defaults to <think>
and </think>
).
Here’s an example with mistralai/Magistral-Small-2507
which outputs custom
thinking tokens:
lighteval vllm \
"model_name=mistralai/Magistral-Small-2507,dtype=float16,data_parallel_size=4" \
"lighteval|aime24|0" \
--remove-reasoning-tags \
--reasoning-tags="[('[THINK]','[/THINK]')]"
Nanotron
To evaluate a model trained with Nanotron on a single GPU:
Nanotron models cannot be evaluated without torchrun.
torchrun --standalone --nnodes=1 --nproc-per-node=1 \
src/lighteval/__main__.py nanotron \
--checkpoint-config-path ../nanotron/checkpoints/10/config.yaml \
--lighteval-config-path examples/nanotron/lighteval_config_override_template.yaml
The nproc-per-node
argument should match the data, tensor, and pipeline
parallelism configured in the lighteval_config_override_template.yaml
file.
That is: nproc-per-node = data_parallelism * tensor_parallelism * pipeline_parallelism.
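For example, with data_parallelism=2, tensor_parallelism=2, and pipeline_parallelism=1, you would launch torchrun with --nproc-per-node=4.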