Quick Tour
We recommend using the --help
flag to get more information about the
available options for each command.
lighteval --help
Lighteval can be used with several different commands, each optimized for different evaluation scenarios.
Available Commands
Evaluation Backends
- lighteval accelerate: Evaluate models on CPU or one or more GPUs using 🤗 Accelerate
- lighteval nanotron: Evaluate models in distributed settings using ⚡️ Nanotron
- lighteval vllm: Evaluate models on one or more GPUs using 🚀 VLLM
- lighteval custom: Evaluate custom models (can be anything)
- lighteval sglang: Evaluate models using SGLang as backend
- lighteval endpoint: Evaluate models using various endpoints as backend
  - lighteval endpoint inference-endpoint: Evaluate models using Hugging Face’s Inference Endpoints API
  - lighteval endpoint tgi: Evaluate models using 🔗 Text Generation Inference running locally
  - lighteval endpoint litellm: Evaluate models on any compatible API using LiteLLM
  - lighteval endpoint inference-providers: Evaluate models using Hugging Face’s inference providers as backend
Evaluation Utils
- lighteval baseline: Compute baselines for given tasks
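For instance, to compute the baseline scores for a single task (a sketch; the exact arguments may differ, check lighteval baseline --help):
lighteval baseline "leaderboard|truthfulqa:mc|0"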
Utils
- lighteval tasks: List or inspect tasks
  - lighteval tasks list: List all available tasks
  - lighteval tasks inspect: Inspect a specific task to see its configuration and samples
  - lighteval tasks create: Create a new task from a template
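For example, to browse the available tasks and then look at one of them in detail (a sketch; the exact arguments accepted by inspect may differ, check lighteval tasks --help):
lighteval tasks list
lighteval tasks inspect "leaderboard|truthfulqa:mc|0"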
Basic Usage
To evaluate GPT-2 on the TruthfulQA benchmark with 🤗 Accelerate, run:
lighteval accelerate \
"model_name=openai-community/gpt2" \
"leaderboard|truthfulqa:mc|0"
Here, we first choose a backend (either accelerate
, nanotron
, endpoint
, or vllm
), and then specify the model and task(s) to run.
Task Specification
The syntax for the task specification might be a bit hard to grasp at first. The format is as follows:
{suite}|{task}|{num_few_shot}
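For instance, in the "leaderboard|truthfulqa:mc|0" specification used above, leaderboard is the suite, truthfulqa:mc is the task, and 0 is the number of few-shot examples.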
Tasks have a function applied at the sample level and one at the corpus level. For example,
- an exact match can be applied per sample, then averaged over the corpus to give the final score
- samples can be left untouched, with Corpus BLEU then applied at the corpus level.
If the task you are looking at has a sample-level function (sample_level_fn) that can be parametrized, you can pass parameters directly in the CLI.
For example
{suite}|{task}@{parameter_name1}={value1}@{parameter_name2}={value2},...|0
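As a hypothetical illustration (the parameter names below are made up; the parameters actually available depend on the task’s sample_level_fn):
{suite}|{task}@normalize=True@k=5|0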
All officially supported tasks can be found at the tasks_list and in the extended folder. Moreover, community-provided tasks can be found in the community folder.
For more details on the implementation of the tasks, such as how prompts are constructed or which metrics are used, you can examine the implementation file.
Running Multiple Tasks
Running multiple tasks is supported, either with a comma-separated list or by specifying a file path.
The file should be structured like examples/tasks/recommended_set.txt.
When specifying a path to a file, it should start with ./.
lighteval accelerate \
"model_name=openai-community/gpt2" \
./path/to/lighteval/examples/tasks/recommended_set.txt
# or, e.g., "leaderboard|truthfulqa:mc|0,leaderboard|gsm8k|3"
Backend Configuration
General Information
The model-args
argument takes a string of comma-separated key=value pairs. The arguments
allowed vary depending on the backend you use and
correspond to the fields of the model configurations.
The model configurations can be found here.
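For example, with the accelerate backend you might pass several fields at once (a sketch; model_name appears in the examples in this guide, while revision and dtype are assumptions that depend on the backend’s model configuration):
lighteval accelerate \
"model_name=openai-community/gpt2,revision=main,dtype=float16" \
"leaderboard|truthfulqa:mc|0"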
All models allow you to post-process your reasoning model predictions
to remove the thinking tokens from the trace used to compute the metrics,
using --remove-reasoning-tags
and --reasoning-tags
to specify which
reasoning tags to remove (defaults to <think>
and </think>
).
Here’s an example with mistralai/Magistral-Small-2507
which outputs custom
thinking tokens:
lighteval vllm \
"model_name=mistralai/Magistral-Small-2507,dtype=float16,data_parallel_size=4" \
"lighteval|aime24|0" \
--remove-reasoning-tags \
--reasoning-tags="[('[THINK]','[/THINK]')]"
Nanotron
To evaluate a model trained with Nanotron on a single GPU:
Nanotron models cannot be evaluated without torchrun.
torchrun --standalone --nnodes=1 --nproc-per-node=1 \
src/lighteval/__main__.py nanotron \
--checkpoint-config-path ../nanotron/checkpoints/10/config.yaml \
--lighteval-config-path examples/nanotron/lighteval_config_override_template.yaml
The nproc-per-node
argument should match the data, tensor, and pipeline
parallelism configured in the lighteval_config_override_template.yaml
file.
That is: nproc-per-node = data_parallelism * tensor_parallelism * pipeline_parallelism.
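For example, with data_parallelism=2, tensor_parallelism=2, and pipeline_parallelism=1, you would launch torchrun with --nproc-per-node=4.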