Saving and Reading Results
Lighteval provides comprehensive logging and result management through the EvaluationTracker class. This system allows you to save results locally and optionally push them to various platforms for collaboration and analysis.
Saving Results Locally
Lighteval automatically saves results and evaluation details in the directory specified with the --output-dir option. The results are saved in {output_dir}/results/{model_name}/results_{timestamp}.json. An example of a result file is shown at the end of this page. The output path can be any fsspec-compliant path (local, S3, Hugging Face Hub, Google Drive, FTP, etc.).

To save detailed evaluation information, use the --save-details option. The details are saved in Parquet files at {output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet.
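You can inspect this layout directly from Python. The snippet below is a minimal sketch that only relies on the path patterns described above; the output directory and model name are hypothetical.

import glob

output_dir = "evals_doc"  # hypothetical value passed to --output-dir
model_name = "gpt2"       # hypothetical model name

# Results: {output_dir}/results/{model_name}/results_{timestamp}.json
print(sorted(glob.glob(f"{output_dir}/results/{model_name}/results_*.json")))

# Details (only with --save-details):
# {output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet
print(glob.glob(f"{output_dir}/details/{model_name}/*/*.parquet"))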
If you want results to be saved under a custom path structure, you can set the --results-path-template option. This lets you specify a string template for the path. The template must contain the following variables: output_dir, org, model. For example: {output_dir}/{org}_{model}. The template is then used to create the path for the results file.
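As an illustration of how the placeholders expand (the values below are hypothetical; the expansion is mimicked here with Python's str.format):

# Hypothetical template and values, expanded with str.format for illustration
results_path_template = "{output_dir}/{org}_{model}"
path = results_path_template.format(
    output_dir="./results",
    org="HuggingFaceH4",
    model="zephyr-7b-beta",
)
print(path)  # ./results/HuggingFaceH4_zephyr-7b-beta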
Pushing Results to the Hugging Face Hub
You can push results and evaluation details to the Hugging Face Hub. To do so, set the --push-to-hub option as well as the --results-org option. The results are saved in a dataset with the name {results_org}/{model_org}/{model_name}. To push the details as well, set the --save-details option.
The dataset created is private by default. You can make it public by setting the --public-run option.
Pushing Results to TensorBoard
You can push results to TensorBoard by setting the --push-to-tensorboard option. This creates a TensorBoard dashboard in the Hugging Face organization specified with the --results-org option.
Pushing Results to Weights & Biases or Trackio
You can push results to Weights & Biases by setting the --wandb option. This initializes a W&B run and logs the results. W&B arguments need to be set in your environment variables:
export WANDB_PROJECT="lighteval"
You can find a complete list of variables in the W&B documentation.
If Trackio is available in your environment (pip install lighteval[trackio]), it will be used to log and push results to a Hugging Face dataset. Choose the dataset name and organization with:
export WANDB_SPACE_ID="org/name"
How to Load and Investigate Details
Loading from Local Detail Files
from datasets import load_dataset
import glob

output_dir = "evals_doc"
model_name = "HuggingFaceH4/zephyr-7b-beta"
timestamp = "latest"
task = "lighteval|gsm8k|0"

if timestamp == "latest":
    path = f"{output_dir}/details/{model_name}/*/"
    timestamps = glob.glob(path)
    timestamp = sorted(timestamps)[-1].split("/")[-2]
    print(f"Latest timestamp: {timestamp}")

details_path = f"{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet"

# Load the details
details = load_dataset("parquet", data_files=details_path, split="train")

for detail in details:
    print(detail)
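Since details is a regular datasets.Dataset object, you can also convert it to a pandas DataFrame for easier filtering and inspection (to_pandas is part of the datasets API):

# Optional: turn the loaded details into a pandas DataFrame
df = details.to_pandas()
print(df.columns)
print(df.head())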
Loading from the Hugging Face Hub
from datasets import load_dataset

results_org = "SaylorTwift"
model_name = "HuggingFaceH4/zephyr-7b-beta"
sanitized_model_name = model_name.replace("/", "__")
task = "lighteval|gsm8k|0"
public_run = False

dataset_path = f"{results_org}/details_{sanitized_model_name}{'_private' if not public_run else ''}"
details = load_dataset(dataset_path, task.replace("|", "_"), split="latest")

for detail in details:
    print(detail)
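If you are not sure which tasks are available in a details dataset, you can list its configurations first. This is a small sketch reusing the dataset_path variable from the example above; get_dataset_config_names is part of the datasets library.

from datasets import get_dataset_config_names

# List the per-task configurations available in the details dataset
configs = get_dataset_config_names(dataset_path)
print(configs)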
Detail File Structure
The detail file contains the following columns:
__doc__: The document used for evaluation, containing the gold reference, few-shot examples, and other hyperparameters used for the task.
__model_response__: The model generations, log probabilities, and the input that was sent to the model.
__metric__: The value of the metrics for this sample.
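For example, using the details loaded in one of the snippets above, a single sample can be inspected column by column (the column names are those documented here):

# Inspect one sample, column by column
row = details[0]
print(row["__doc__"])             # gold reference, few-shot examples, task hyperparameters
print(row["__model_response__"])  # model generations, log probabilities, and the input sent to the model
print(row["__metric__"])          # per-sample metric values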
EvaluationTracker Configuration
The EvaluationTracker class provides several configuration options for customizing how results are saved and pushed:
Basic Configuration
from lighteval.logging.evaluation_tracker import EvaluationTracker

tracker = EvaluationTracker(
    output_dir="./results",
    save_details=True,
    push_to_hub=True,
    hub_results_org="your_username",
    public=False,
)
Advanced Configuration
tracker = EvaluationTracker(
    output_dir="./results",
    results_path_template="{output_dir}/custom/{org}_{model}",
    save_details=True,
    push_to_hub=True,
    push_to_tensorboard=True,
    hub_results_org="my-org",
    tensorboard_metric_prefix="eval",
    public=True,
    use_wandb=True,
)
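Note that when results_path_template is set, it is used in place of the default {output_dir}/results/{model_name} layout described earlier.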
Key Parameters
output_dir: Local directory to save evaluation results and logs
results_path_template: Template for the results directory structure
save_details: Whether to save detailed evaluation records (default: True)
push_to_hub: Whether to push results to the Hugging Face Hub (default: False)
push_to_tensorboard: Whether to push metrics to TensorBoard (default: False)
hub_results_org: Hugging Face Hub organization to push results to
tensorboard_metric_prefix: Prefix for TensorBoard metrics (default: "eval")
public: Whether to make Hub datasets public (default: False)
use_wandb: Whether to log to Weights & Biases or Trackio (default: False)
Result File Structure
The main results file contains several sections:
General Configuration
config_general: Overall evaluation configuration, including model information, timing, and system details
summary_general: General statistics about the evaluation run
Task-Specific Information
config_tasks: Configuration details for each evaluated task
summary_tasks: Task-specific statistics and metadata
versions: Version information for tasks and datasets
Results
results: The actual evaluation metrics and scores for each task
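These sections can be accessed directly after loading the file with the standard json module. The file path below is hypothetical; it follows the layout described in Saving Results Locally.

import json

# Hypothetical path; the timestamp in the filename is illustrative
results_file = "evals_doc/results/gpt2/results_2024-01-01T00-00-00.000000.json"

with open(results_file) as f:
    data = json.load(f)

print(data["config_general"]["model_name"])  # model information
print(data["results"])                       # metrics per task, including the "all" entry
print(data["versions"])                      # task versions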
Example of a Result File
{
"config_general": {
"lighteval_sha": "203045a8431bc9b77245c9998e05fc54509ea07f",
"num_fewshot_seeds": 1,
"max_samples": 1,
"job_id": "",
"start_time": 620979.879320166,
"end_time": 621004.632108041,
"total_evaluation_time_secondes": "24.752787875011563",
"model_name": "gpt2",
"model_sha": "607a30d783dfa663caf39e06633721c8d4cfcd7e",
"model_dtype": null,
"model_size": "476.2 MB"
},
"results": {
"lighteval|gsm8k|0": {
"em": 0.0,
"em_stderr": 0.0,
"maj@8": 0.0,
"maj@8_stderr": 0.0
},
"all": {
"em": 0.0,
"em_stderr": 0.0,
"maj@8": 0.0,
"maj@8_stderr": 0.0
}
},
"versions": {
"lighteval|gsm8k|0": 0
},
"config_tasks": {
"lighteval|gsm8k": {
"name": "gsm8k",
"prompt_function": "gsm8k",
"hf_repo": "gsm8k",
"hf_subset": "main",
"metric": [
{
"metric_name": "em",
"higher_is_better": true,
"category": "3",
"use_case": "5",
"sample_level_fn": "compute",
"corpus_level_fn": "mean"
},
{
"metric_name": "maj@8",
"higher_is_better": true,
"category": "5",
"use_case": "5",
"sample_level_fn": "compute",
"corpus_level_fn": "mean"
}
],
"hf_avail_splits": [
"train",
"test"
],
"evaluation_splits": [
"test"
],
"few_shots_split": null,
"few_shots_select": "random_sampling_from_train",
"generation_size": 256,
"generation_grammar": null,
"stop_sequence": [
"Question="
],
"num_samples": null,
"suite": [
"lighteval"
],
"original_num_docs": 1319,
"effective_num_docs": 1,
"must_remove_duplicate_docs": null,
"version": 0
}
},
"summary_tasks": {
"lighteval|gsm8k|0": {
"hashes": {
"hash_examples": "8517d5bf7e880086",
"hash_full_prompts": "8517d5bf7e880086",
"hash_input_tokens": "29916e7afe5cb51d",
"hash_cont_tokens": "37f91ce23ef6d435"
},
"padded": 0,
"non_padded": 2,
"effective_few_shots": 0.0,
}
},
"summary_general": {
"hashes": {
"hash_examples": "5f383c395f01096e",
"hash_full_prompts": "5f383c395f01096e",
"hash_input_tokens": "ac933feb14f96d7b",
"hash_cont_tokens": "9d03fb26f8da7277"
},
"padded": 0,
"non_padded": 2,
}
}