Metric List

Automatic metrics for multiple-choice tasks

These metrics use the log-likelihood of the different possible targets.

  • loglikelihood_acc: Fraction of instances where the choice with the best log-probability was correct; we recommend using length normalization (see the sketch after this list).
  • loglikelihood_f1: Corpus level F1 score of the multichoice selection
  • mcc: Matthews correlation coefficient, a measure of agreement between the predicted and gold labels.
  • recall_at_k: Fraction of instances where the correct choice is ranked among the k best log-probabilities.
  • mrr: Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance
  • target_perplexity: Perplexity of the different choices available.
  • acc_golds_likelihood: Slightly different from the others: checks whether the average logprob of a single target is above or below 0.5.
  • multi_f1_numeric: Loglikelihood F1 score for multiple gold targets.
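
A minimal sketch of log-likelihood accuracy with length normalization, assuming the per-choice log-probabilities and target lengths have already been collected (the function and variable names below are illustrative, not Lighteval's internal API):

```python
# Sketch only: pick the choice whose log-probability (optionally normalized by
# the target length) is highest, and check it against the gold index.
def loglikelihood_acc(choice_logprobs, choice_lengths, gold_index, length_normalized=True):
    scores = [
        lp / length if length_normalized else lp
        for lp, length in zip(choice_logprobs, choice_lengths)
    ]
    prediction = max(range(len(scores)), key=scores.__getitem__)
    return 1.0 if prediction == gold_index else 0.0

# Gold is choice 1: without normalization the short choice 0 wins on raw
# log-probability; with length normalization choice 1 wins.
print(loglikelihood_acc([-6.0, -7.0, -18.0], [3, 4, 8], gold_index=1))                           # 1.0
print(loglikelihood_acc([-6.0, -7.0, -18.0], [3, 4, 8], gold_index=1, length_normalized=False))  # 0.0
```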

Automatic metrics for perplexity and language modeling

These metrics use the log-likelihood of the prompt (a sketch of the underlying formulas follows the list below).

  • word_perplexity: Perplexity (log probability of the input) weighted by the number of words of the sequence.
  • byte_perplexity: Perplexity (log probability of the input) weighted by the number of bytes of the sequence.
  • bits_per_byte: Average number of bits per byte according to model probabilities.
  • log_prob: Predicted output’s average log probability (input’s log prob for language modeling).
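
The three perplexity-style metrics differ only in the unit used to normalize the summed log-likelihood. A rough sketch, assuming the total natural-log likelihood of the sequence is already available (not Lighteval's internal code):

```python
import math

def word_perplexity(total_logprob, num_words):
    # exp of the negative average log-likelihood per word
    return math.exp(-total_logprob / num_words)

def byte_perplexity(total_logprob, num_bytes):
    # exp of the negative average log-likelihood per byte
    return math.exp(-total_logprob / num_bytes)

def bits_per_byte(total_logprob, num_bytes):
    # convert nats to bits, then average over bytes
    return -total_logprob / (num_bytes * math.log(2))
```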

Automatic metrics for generative tasks

These metrics need the model to generate an output. They are therefore slower.

  • Base:
    • exact_match: Fraction of instances where the prediction matches the gold. Several variations are available through parametrization (a minimal normalization sketch follows this list):
      • normalizing the strings before comparison (whitespace, articles, capitalization, …)
      • comparing the full string or only subsets (prefix, suffix, …)
    • maj_at_k: Model majority vote. Samples k generations from the model and assumes the most frequent is the actual prediction.
    • f1_score: Average F1 score in terms of word overlap between the model output and gold (normalization optional).
    • f1_score_macro: Corpus level macro F1 score.
    • f1_score_micro: Corpus level micro F1 score.
  • Summarization:
    • rouge: Average ROUGE score (Lin, 2004).
    • rouge1: Average ROUGE score (Lin, 2004) based on 1-gram overlap.
    • rouge2: Average ROUGE score (Lin, 2004) based on 2-gram overlap.
    • rougeL: Average ROUGE score (Lin, 2004) based on longest common subsequence overlap.
    • rougeLsum: Average ROUGE score (Lin, 2004) based on longest common subsequence overlap, computed at summary level (the text is split on newlines before scoring).
    • rouge_t5 (BigBench): Corpus level ROUGE score for all available ROUGE metrics.
    • faithfulness: Faithfulness scores based on the SummaC method of Laban et al. (2022).
    • extractiveness: Reports, based on (Grusky et al., 2018):
      • summarization_coverage: Extent to which the model-generated summaries are extractive fragments from the source document,
      • summarization_density: Extent to which the model-generated summaries are extractive summaries based on the source document,
      • summarization_compression: Extent to which the model-generated summaries are compressed relative to the source document.
    • bert_score: Reports the average BERTScore precision, recall, and f1 score (Zhang et al., 2020) between model generation and gold summary.
  • Translation:
    • bleu: Corpus level BLEU score (Papineni et al., 2002) - uses the sacrebleu implementation.
    • bleu_1: Average sample BLEU score (Papineni et al., 2002) based on 1-gram overlap - uses the nltk implementation.
    • bleu_4: Average sample BLEU score (Papineni et al., 2002) based on 4-gram overlap - uses the nltk implementation.
    • chrf: Character n-gram F-score (chrF).
    • ter: Translation edit/error rate.
  • Copyright:
    • copyright: Reports:
      • longest_common_prefix_length: Average length of longest common prefix between model generation and reference,
      • edit_distance: Average Levenshtein edit distance between model generation and reference,
      • edit_similarity: Average Levenshtein edit similarity (normalized by the length of longer sequence) between model generation and reference.
  • Math:
    • Both exact_match and maj_at_k can be used to evaluate mathematics tasks, with math-specific normalization to strip and filter LaTeX.
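
To make the exact_match parametrization concrete, here is a minimal sketch of a typical normalization (lowercasing, punctuation and article removal, whitespace collapsing) applied before comparison. The exact steps Lighteval applies depend on the chosen parameters, so treat the helper below as illustrative only:

```python
import re
import string

def normalize(text: str) -> str:
    # Illustrative normalization: lowercase, drop punctuation and English
    # articles, collapse whitespace. Lighteval's actual steps are configurable.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str, normalized: bool = True) -> float:
    if normalized:
        prediction, gold = normalize(prediction), normalize(gold)
    return float(prediction == gold)

print(exact_match("The answer is Paris.", "answer is paris"))                    # 1.0
print(exact_match("The answer is Paris.", "answer is paris", normalized=False))  # 0.0
```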

LLM-as-Judge

  • llm_judge_gpt3p5: Can be used for any generative task; the model is scored by a GPT-3.5 model through the OpenAI API (the general judge pattern is sketched after this list).
  • llm_judge_llama_3_405b: Can be used for any generative task; the model is scored by a Llama 3 405B model through the Hugging Face API.
  • llm_judge_multi_turn_gpt3p5: Can be used for any generative task; the model is scored by a GPT-3.5 model through the OpenAI API. Used for multi-turn tasks such as MT-Bench.
  • llm_judge_multi_turn_llama_3_405b: Can be used for any generative task; the model is scored by a Llama 3 405B model through the Hugging Face API. Used for multi-turn tasks such as MT-Bench.
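
All four metrics follow the same general pattern: the question and the evaluated model's answer are sent to the judge model with a scoring instruction, and a score is parsed from the judge's reply. The sketch below shows that pattern with the OpenAI client; the prompt wording and score parsing are assumptions for illustration and do not reproduce Lighteval's actual judge templates:

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_score(question: str, answer: str, model: str = "gpt-3.5-turbo") -> int:
    # Illustrative rubric; Lighteval's real judge prompts are task-specific.
    prompt = (
        "Rate the following answer to the question on a scale of 1 to 10.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with only the number."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"\d+", response.choices[0].message.content)
    return int(match.group()) if match else 0
```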