Saving and Reading Results
Lighteval provides comprehensive logging and result management through the EvaluationTracker class. This system allows you to save results locally and optionally push them to various platforms for collaboration and analysis.
Saving Results Locally
Lighteval automatically saves results and evaluation details in the directory specified with the --output-dir option. The results are saved in {output_dir}/results/{model_name}/results_{timestamp}.json. An example of a result file is shown at the end of this page. The output path can be any fsspec-compliant path (local, S3, Hugging Face Hub, Google Drive, FTP, etc.).

To save detailed evaluation information, use the --save-details option. The details are saved in Parquet files at {output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet.
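You can inspect this layout directly from Python. The snippet below is a minimal sketch that only relies on the path patterns described above; the output directory and model name are hypothetical.

import glob

output_dir = "evals_doc"  # hypothetical value passed to --output-dir
model_name = "gpt2"       # hypothetical model name

# Results: {output_dir}/results/{model_name}/results_{timestamp}.json
print(sorted(glob.glob(f"{output_dir}/results/{model_name}/results_*.json")))

# Details (only with --save-details):
# {output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet
print(glob.glob(f"{output_dir}/details/{model_name}/*/*.parquet"))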
If you want results to be saved under a custom path structure, you can set the --results-path-template option. This lets you specify a string template for the path. The template must contain the following variables: output_dir, org, model. For example: {output_dir}/{org}_{model}. The template is then used to create the path for the results file.
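As an illustration of how the placeholders expand (the values below are hypothetical; the expansion is mimicked here with Python's str.format):

# Hypothetical template and values, expanded with str.format for illustration
results_path_template = "{output_dir}/{org}_{model}"
path = results_path_template.format(
    output_dir="./results",
    org="HuggingFaceH4",
    model="zephyr-7b-beta",
)
print(path)  # ./results/HuggingFaceH4_zephyr-7b-beta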
Pushing Results to the Hugging Face Hub
You can push results and evaluation details to the Hugging Face Hub. To do so, set the --push-to-hub option as well as the --results-org option. The results are saved in a dataset with the name {results_org}/{model_org}/{model_name}. To push the details as well, set the --save-details option.
The dataset created is private by default. You can make it public by setting the --public-run option.
Pushing Results to TensorBoard
You can push results to TensorBoard by setting the --push-to-tensorboard option. This creates a TensorBoard dashboard in the Hugging Face organization specified with the --results-org option.
Pushing Results to Weights & Biases or Trackio
You can push results to Weights & Biases by setting the --wandb option. This initializes a W&B run and logs the results. W&B arguments need to be set in your environment variables:
export WANDB_PROJECT="lighteval"
You can find a complete list of variables in the W&B documentation.
If Trackio is available in your environment (pip install lighteval[trackio]), it will be used to log and push results to a Hugging Face dataset. Choose the dataset name and organization with:
export WANDB_SPACE_ID="org/name"
How to Load and Investigate Details
Loading from Local Detail Files
from datasets import load_dataset
import glob

output_dir = "evals_doc"
model_name = "HuggingFaceH4/zephyr-7b-beta"
timestamp = "latest"
task = "lighteval|gsm8k|0"

if timestamp == "latest":
    path = f"{output_dir}/details/{model_name}/*/"
    timestamps = glob.glob(path)
    timestamp = sorted(timestamps)[-1].split("/")[-2]
    print(f"Latest timestamp: {timestamp}")

details_path = f"{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet"

# Load the details
details = load_dataset("parquet", data_files=details_path, split="train")

for detail in details:
    print(detail)
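Since details is a regular datasets.Dataset object, you can also convert it to a pandas DataFrame for easier filtering and inspection (to_pandas is part of the datasets API):

# Optional: turn the loaded details into a pandas DataFrame
df = details.to_pandas()
print(df.columns)
print(df.head())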
Loading from the Hugging Face Hub
from datasets import load_dataset

results_org = "SaylorTwift"
model_name = "HuggingFaceH4/zephyr-7b-beta"
sanitized_model_name = model_name.replace("/", "__")
task = "lighteval|gsm8k|0"
public_run = False

dataset_path = f"{results_org}/details_{sanitized_model_name}{'_private' if not public_run else ''}"
details = load_dataset(dataset_path, task.replace("|", "_"), split="latest")

for detail in details:
    print(detail)
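If you are not sure which tasks are available in a details dataset, you can list its configurations first. This is a small sketch reusing the dataset_path variable from the example above; get_dataset_config_names is part of the datasets library.

from datasets import get_dataset_config_names

# List the per-task configurations available in the details dataset
configs = get_dataset_config_names(dataset_path)
print(configs)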
Detail File Structure
The detail file contains the following columns:
__doc__: The document used for evaluation, containing the gold reference, few-shot examples, and other hyperparameters used for the task.
__model_response__: The model generations, log probabilities, and the input that was sent to the model.
__metric__: The value of the metrics for this sample.
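For example, using the details loaded in one of the snippets above, a single sample can be inspected column by column (the column names are those documented here):

# Inspect one sample, column by column
row = details[0]
print(row["__doc__"])             # gold reference, few-shot examples, task hyperparameters
print(row["__model_response__"])  # model generations, log probabilities, and the input sent to the model
print(row["__metric__"])          # per-sample metric values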
EvaluationTracker Configuration
The EvaluationTracker class provides several configuration options for customizing how results are saved and pushed:
Basic Configuration
from lighteval.logging.evaluation_tracker import EvaluationTracker

tracker = EvaluationTracker(
    output_dir="./results",
    save_details=True,
    push_to_hub=True,
    hub_results_org="your_username",
    public=False,
)
Advanced Configuration
tracker = EvaluationTracker(
    output_dir="./results",
    results_path_template="{output_dir}/custom/{org}_{model}",
    save_details=True,
    push_to_hub=True,
    push_to_tensorboard=True,
    hub_results_org="my-org",
    tensorboard_metric_prefix="eval",
    public=True,
    use_wandb=True,
)
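Note that when results_path_template is set, it is used in place of the default {output_dir}/results/{model_name} layout described earlier.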
Key Parameters
output_dir: Local directory to save evaluation results and logs
results_path_template: Template for the results directory structure
save_details: Whether to save detailed evaluation records (default: True)
push_to_hub: Whether to push results to the Hugging Face Hub (default: False)
push_to_tensorboard: Whether to push metrics to TensorBoard (default: False)
hub_results_org: Hugging Face Hub organization to push results to
tensorboard_metric_prefix: Prefix for TensorBoard metrics (default: "eval")
public: Whether to make Hub datasets public (default: False)
use_wandb: Whether to log to Weights & Biases or Trackio (default: False)
Result File Structure
The main results file contains several sections:
General Configuration
config_general: Overall evaluation configuration, including model information, timing, and system details
summary_general: General statistics about the evaluation run
Task-Specific Information
config_tasks: Configuration details for each evaluated task
summary_tasks: Task-specific statistics and metadata
versions: Version information for tasks and datasets
Results
results: The actual evaluation metrics and scores for each task
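These sections can be accessed directly after loading the file with the standard json module. The file path below is hypothetical; it follows the layout described in Saving Results Locally.

import json

# Hypothetical path; the timestamp in the filename is illustrative
results_file = "evals_doc/results/gpt2/results_2024-01-01T00-00-00.000000.json"

with open(results_file) as f:
    data = json.load(f)

print(data["config_general"]["model_name"])  # model information
print(data["results"])                       # metrics per task, including the "all" entry
print(data["versions"])                      # task versions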
Example of a Result File
{
"config_general": {
"lighteval_sha": "203045a8431bc9b77245c9998e05fc54509ea07f",
"num_fewshot_seeds": 1,
"max_samples": 1,
"job_id": "",
"start_time": 620979.879320166,
"end_time": 621004.632108041,
"total_evaluation_time_secondes": "24.752787875011563",
"model_name": "gpt2",
"model_sha": "607a30d783dfa663caf39e06633721c8d4cfcd7e",
"model_dtype": null,
"model_size": "476.2 MB"
},
"results": {
"lighteval|gsm8k|0": {
"em": 0.0,
"em_stderr": 0.0,
"maj@8": 0.0,
"maj@8_stderr": 0.0
},
"all": {
"em": 0.0,
"em_stderr": 0.0,
"maj@8": 0.0,
"maj@8_stderr": 0.0
}
},
"versions": {
"lighteval|gsm8k|0": 0
},
"config_tasks": {
"lighteval|gsm8k": {
"name": "gsm8k",
"prompt_function": "gsm8k",
"hf_repo": "gsm8k",
"hf_subset": "main",
"metric": [
{
"metric_name": "em",
"higher_is_better": true,
"category": "3",
"use_case": "5",
"sample_level_fn": "compute",
"corpus_level_fn": "mean"
},
{
"metric_name": "maj@8",
"higher_is_better": true,
"category": "5",
"use_case": "5",
"sample_level_fn": "compute",
"corpus_level_fn": "mean"
}
],
"hf_avail_splits": [
"train",
"test"
],
"evaluation_splits": [
"test"
],
"few_shots_split": null,
"few_shots_select": "random_sampling_from_train",
"generation_size": 256,
"generation_grammar": null,
"stop_sequence": [
"Question="
],
"num_samples": null,
"suite": [
"lighteval"
],
"original_num_docs": 1319,
"effective_num_docs": 1,
"must_remove_duplicate_docs": null,
"version": 0
}
},
"summary_tasks": {
"lighteval|gsm8k|0": {
"hashes": {
"hash_examples": "8517d5bf7e880086",
"hash_full_prompts": "8517d5bf7e880086",
"hash_input_tokens": "29916e7afe5cb51d",
"hash_cont_tokens": "37f91ce23ef6d435"
},
"padded": 0,
"non_padded": 2,
"effective_few_shots": 0.0,
}
},
"summary_general": {
"hashes": {
"hash_examples": "5f383c395f01096e",
"hash_full_prompts": "5f383c395f01096e",
"hash_input_tokens": "ac933feb14f96d7b",
"hash_cont_tokens": "9d03fb26f8da7277"
},
"padded": 0,
"non_padded": 2,
}
}