🧑‍⚖️ "Replacing Judges with Juries" using distilabel

Community Article Published May 3, 2024

TL;DR

distilabel is a framework to build pipelines for synthetic data generation and AI Feedback (AIF) as a Directed Acyclic Graph (DAG) of LLM-powered steps, and it comes with a growing collection of pre-defined tasks. "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" is a recent publication from Cohere that explores the problems of using a single large LLM as a judge of generations, and proposes instead a Panel of LLM evaluators (PoLL), the so-called juries, composed of a larger number of smaller LLMs, leading to more diverse, less intra-model-biased, and less expensive judgements of the generations.

Introduction

"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" is a paper published by Cohere (Path Verga et al.) that explores the problematic around using a single large model like GPT-4 from OpenAI to judge / score either a single LLM generation or a comparison between multiple LLM generations, since they claim it introduces intra-model bias and most of the times using models that large is often unnecessary. So on, they propose what they call a Panel of LLm evaluators (PoLL), the so called "juries", which is a pool of more and smaller LLMs to judge / score the LLM outputs and then use an aggregation or average pooling of those scores instead of the single score provided by the larger LLM, the so called "judge".

Using the proposed PoLL is not only cheaper, but it also mitigates intra-model bias thanks to its composition of disjoint model families; in the paper these are Claude Haiku from Anthropic, GPT-3.5 from OpenAI, and Command R from Cohere, the latter having openly released weights while the rest are commercial / proprietary models.

The idea of this post is to reproduce a similar pipeline where some LLMs (Gemma 1.1 7B Instruct, Llama 3 8B Instruct, Phi 3 Mini (4K) Instruct, and Mistral 7B Instruct v0.2; all of them open and available on the Hugging Face Hub) are used to generate completions for a given collection of instructions / prompts, and other LLMs (Claude Haiku, GPT-3.5, and Command R+) then judge those completions using the UltraFeedback prompt, so as to finally aggregate the scores, calculate the average score for each generation, and use it to binarize the dataset into a preference dataset based on the PoLL scores rather than solely on the GPT-4 score, as formerly done in UltraFeedback.

What's distilabel?

distilabel is a framework to build pipelines for synthetic data generation and AI Feedback (AIF), defining a series of steps and connecting them as a Directed Acyclic Graph (DAG), so as to easily combine data processing steps with steps running LLMs for diverse tasks such as text generation, preference rating, etc.

This post covers the implementation assuming distilabel v1.0.0 is used, since the previous versions were still experimental.

The basic concepts of distilabel are the following (a minimal sketch combining them is shown right after the list):

  • Step: a step is a process that receives data in batches as input and produces or alters the received data as output; it is the most basic building block.
  • GeneratorStep: a Step that only generates data, i.e. one that doesn't receive any input from previous steps.
  • GlobalStep: a Step that receives inputs and produces outputs like the default Step, but it is global, meaning it is blocking and won't be executed until all the batches from the previous steps have been processed.
  • Task: a special type of Step with a mandatory llm argument; when called, it prepares the input data and streams it to the LLM, and then handles and formats the outputs generated by the LLM according to the task.
  • Pipeline: the main class that orchestrates the execution of all the steps defined as part of it, handling the batching of the data as well as validation, logging, and any other related logic.
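
To make these concepts more tangible, here is a minimal, self-contained sketch combining a GeneratorStep, a custom Step defined via the step decorator, and a Pipeline. No LLM is involved to keep it dependency-free, and the LoadDataFromDicts arguments are based on my reading of the API, so double-check them against the documentation:

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts, StepInput, step
from distilabel.steps.typing import StepOutput


@step(inputs=["instruction"], outputs=["instruction_length"])
def InstructionLength(*inputs: StepInput) -> StepOutput:
    """Toy custom `Step` that counts the characters of each instruction."""
    for input in inputs:
        for item in input:
            item["instruction_length"] = len(item["instruction"])
        yield input


if __name__ == "__main__":
    with Pipeline(name="toy-pipeline") as pipeline:
        # `GeneratorStep` producing the data from an in-memory list of dicts
        load_data = LoadDataFromDicts(
            name="load_data",
            data=[{"instruction": "What is 2 + 2?"}],
        )
        instruction_length = InstructionLength(name="instruction_length")

        # Connect the steps into a DAG using the `rshift` operator
        load_data >> instruction_length

    distiset = pipeline.run()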

For more details about distilabel, I'd recommend checking the distilabel documentation, specifically the section dedicated to "Learn".

Installation

To install it you can use pip as follows, which will also install the anthropic, hf-inference-endpoints, and openai extras, required for the Anthropic, Inference Endpoints, and OpenAI integrations, respectively.

distilabel will be installed from the develop branch since it has some features used within this post, but feel free to pin it to v1.1.0 once it's released. See the GitHub Milestone at https://github.com/argilla-io/distilabel/milestone/8

pip install "distilabel[anthropic,hf-inference-endpoints,openai] @ git+https://github.com/argilla-io/distilabel.git@develop"

Additionally, you will need to set the following environment variables to run the Pipeline below (a quick sanity check for these is sketched right after the list):

  • ANTHROPIC_API_KEY: the Anthropic API key, required to send requests to the Anthropic models via their API.
  • HF_TOKEN: the Hugging Face authentication token, required to use the Inference Endpoints and to finally push the generated distilabel.Distiset to the Hugging Face Hub.
  • OPENAI_API_KEY: the OpenAI API key, required to send requests to the OpenAI models via their API.
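
As an optional convenience (not part of distilabel itself, just a small snippet), you can verify that these variables are set before launching the pipeline:

import os

# Fail fast if any of the required secrets is missing, instead of having the
# pipeline error out halfway through the generations.
for variable in ("ANTHROPIC_API_KEY", "HF_TOKEN", "OPENAI_API_KEY"):
    if not os.getenv(variable):
        raise EnvironmentError(f"Missing required environment variable: {variable}")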

Code

Building blocks

  • LoadHubDataset: a GeneratorStep that loads a dataset from the Hugging Face Hub and streams it in batches to the follow-up steps. In this case, since the dataset we're using is HuggingFaceH4/instruction-dataset, we need to rename the prompt column to instruction, as that's what the TextGeneration task expects as input.

  • TextGeneration: a Task that generates an assistant response for the instruction provided as input, producing the generation column in the output. The TextGeneration task expects an LLM as an argument, and in this case we'll be using InferenceEndpointsLLM to run the generation models (Llama 3 8B Instruct, Gemma 1.1 7B Instruct, Phi 3 Mini (4K) Instruct, and Mistral 7B Instruct v0.2) as serverless Inference Endpoints on the Hugging Face Hub.

  • CombineColumns: since the TextGeneration tasks connected to this step run in parallel, their outputs are not combined, so this step receives all the outputs from the previous steps as input and merges them into lists; for each input we end up with a column named generations that contains the generation value produced by each task. This also prepares the data for the next step, UltraFeedback, which expects an instruction and a list of generations as input.

  • UltraFeedback: a Task that implements the UltraFeedback prompts and post-processing, so as to judge a list of generations for a given instruction; originally that judge was GPT-4, but in this case we'll use smaller LLMs as mentioned in the introduction, since that's what the paper is about. We'll use the following LLM implementations:

    • InferenceEndpointsLLM: already mentioned above; in this case it will run CohereForAI/c4ai-command-r-plus as opposed to the CohereForAI/c4ai-command-r mentioned in the paper, but only because Command R+ has a serverless endpoint available on the Hugging Face Hub.
    • AnthropicLLM: an LLM implementation for Anthropic's models; in this case we'll be using Claude 3 Haiku, their fastest and most compact model, designed for near-instant responsiveness. That said, it hasn't proven very reliable with the current UltraFeedback prompts, as detailed further on.
    • OpenAILLM: an LLM implementation for OpenAI's models via their client, which can also be extended to other APIs compliant with the OpenAI client; in this case we'll use it for GPT-3.5 (gpt-3.5-turbo-0125).
  • CombineColumns: as mentioned above, a step that expects inputs from more than one step and combines the provided columns into lists under new column names; in this case we'll group ratings, rationales, and model_name so as to later calculate the average of the ratings, while keeping the rest for reference.

  • AveragePooling: a custom Step defined via the step decorator that expects poll_ratings as input, calculates the average of those ratings, and places the average for each generation in a list matching the number of generations, in this case four. It also showcases how easy it is to create custom Step implementations with distilabel via the step decorator; a worked example of what it computes is sketched right after this list.
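
As a worked example of what the last two steps compute, here is a small standalone sketch; the ratings below are made up for illustration purposes, assuming the two juries (Command R+ and GPT-3.5) rated four generations on a 1-5 scale:

# Hypothetical row right after `combine_ultrafeedback_columns`; the ratings are
# made up for illustration purposes only.
item = {
    "poll_ratings": [
        [4, 5, 3, 4],  # ratings from the Command R+ jury, one per generation
        [5, 5, 2, 4],  # ratings from the GPT-3.5 jury, one per generation
    ],
}

# `zip(*...)` transposes the ratings so that each tuple holds the scores that all
# the juries gave to the same generation, and the mean of each tuple is the PoLL
# score for that generation.
avg_poll_ratings = [sum(col) / len(col) for col in zip(*item["poll_ratings"])]
print(avg_poll_ratings)  # [4.5, 5.0, 2.5, 4.0]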

Implementation

from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    CombineColumns,
    KeepColumns,
    LoadHubDataset,
    StepInput,
    step,
)
from distilabel.steps.formatting import FormatTextGenerationDPO
from distilabel.steps.tasks import TextGeneration, UltraFeedback
from distilabel.steps.typing import StepOutput


@step(inputs=["poll_ratings"], outputs=["avg_poll_ratings"])
def AveragePooling(*inputs: StepInput) -> StepOutput:
    """Custom `Step` that calculates the average of the ratings for each generation."""
    for input in inputs:
        for item in input:
            item["avg_poll_ratings"] = [
                sum(col) / len(col) for col in zip(*item["poll_ratings"])
            ]
        yield input


if __name__ == "__main__":
    # We use `Pipeline` context manager to ensure all the steps defined inside
    # are included as part of the `pipeline`
    with Pipeline(name="replacing-judges-with-juries") as pipeline:
        # First we load the dataset from the Hugging Face Hub, but for local testing
        # one could just define a dataset as a list of dicts and provide that to `LoadDataFromDicts`
        load_dataset = LoadHubDataset(
            name="load_dataset",
            repo_id="HuggingFaceH4/instruction-dataset",
            split="test",
            num_examples=10,
            output_mappings={"prompt": "instruction"},
        )

        # We create a `TextGeneration` task running Llama 3 on serverless endpoints
        text_generation_llama3 = TextGeneration(
            name="text_generation_llama3",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )
        # We create a `TextGeneration` task running Gemma 1.1 on serverless endpoints
        text_generation_gemma = TextGeneration(
            name="text_generation_gemma",
            llm=InferenceEndpointsLLM(
                model_id="google/gemma-1.1-7b-it",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )
        # We create a `TextGeneration` task running Phi 3 on serverless endpoints
        text_generation_phi3 = TextGeneration(
            name="text_generation_phi3",
            llm=InferenceEndpointsLLM(
                model_id="microsoft/Phi-3-mini-4k-instruct",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )
        # We create a `TextGeneration` task running Mistral v0.2 on serverless endpoints
        text_generation_mistral = TextGeneration(
            name="text_generation_mistral",
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            input_batch_size=10,
            output_mappings={"model_name": "generation_model"},
        )

        # Combine the `generation` and `generation_model` columns from the previous step
        # under a single column name as a list
        combine_generation_columns = CombineColumns(
            name="combine_generation_columns",
            columns=["generation", "generation_model"],
            output_columns=["generations", "generation_models"],
        )

        # We create the UltraFeedback task with the `instruction-following` aspect to evaluate
        # the LLM capabilities on following instructions, running Command R+ on serverless
        # endpoints and GPT-3.5 from OpenAI
        ultrafeedback_cmdr_plus = UltraFeedback(
            name="ultrafeedback_cmdr_plus",
            llm=InferenceEndpointsLLM(
                model_id="CohereForAI/c4ai-command-r-plus",
            ),
            input_batch_size=5,
            aspect="instruction-following",
        )
        ultrafeedback_gpt35 = UltraFeedback(
            name="ultrafeedback_gpt35",
            llm=OpenAILLM(
                model="gpt-3.5-turbo-0125",
            ),
            input_batch_size=5,
            aspect="instruction-following",
        )

        # Then we combine again the generated `ratings` and `rationales` into a single column
        combine_ultrafeedback_columns = CombineColumns(
            name="combine_ultrafeedback_columns",
            columns=["ratings", "rationales", "model_name"],
            output_columns=["poll_ratings", "poll_rationales", "poll_models"],
        )

        # Finally, we call our custom task to calculate the average of the ratings for each generation
        avg_pooling = AveragePooling(name="avg_pooling", input_batch_size=1)

        # Here we define the orchestration of the steps using the `rshift` operator showing how the
        # different steps are connected to each other in the `Pipeline`
        (
          load_dataset
          >> [text_generation_llama3, text_generation_gemma, text_generation_phi3, text_generation_mistral]
          >> combine_generation_columns
          >> [ultrafeedback_cmdr_plus, ultrafeedback_gpt35]
          >> combine_ultrafeedback_columns
          >> avg_pooling
        )

Finally, once the Pipeline has been defined, you can run it as follows, providing some runtime parameters to mainly control the generation behaviour of the LLMs.

distiset = pipeline.run(
    parameters={
        "text_generation_llama3": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 1024,
                    "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"],
                },
            },
        },
        "text_generation_gemma": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 1024,
                    "stop_sequences": ["<eos>", "<end_of_turn>"],
                },
            },
        },
        "text_generation_phi3": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 1024,
                    "stop_sequences": ["</s>", "<|endoftext|>"],
                },
            },
        },
        "text_generation_mistral": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 1024,
                    "stop_sequences": ["</s>"],
                },
            },
        },
        # "ultrafeedback_haiku": {
        #     "llm": {"generation_kwargs": {"temperature": 0.7, "max_tokens": 4096}},
        # },
        "ultrafeedback_cmdr_plus": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 4096,
                    "stop_sequences": ["<EOS_TOKEN>", "<|END_OF_TURN_TOKEN|>"],
                },
            },
        },
        "ultrafeedback_gpt35": {
            "llm": {
                "generation_kwargs": {"temperature": 0.7, "max_new_tokens": 4096}
            },
        },
    }
)

Finally, we can optionally push the generated dataset, i.e. the distilabel.Distiset, to the Hugging Face Hub via the push_to_hub method, so that each subset generated by the leaf steps is pushed to the Hub; in this case, since there is only one leaf step, only that subset will be pushed, but if there were many, each leaf step's subset would be pushed under a different configuration on the Hub.

distiset.push_to_hub("replacing-judges-with-juries-distilabel")

🤗 Dataset available at alvarobartt/replacing-judges-with-juries-distilabel
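
Assuming the dataset is public, it can then be loaded back with the datasets library; note that the configuration name below is an assumption (it usually corresponds to the leaf step that produced the subset, avg_pooling in this case), so double-check the available configurations on the dataset card:

from datasets import load_dataset

# Assumption: the subset / configuration name matches the leaf step name; check
# the dataset card on the Hugging Face Hub for the exact configuration names.
dataset = load_dataset(
    "alvarobartt/replacing-judges-with-juries-distilabel",
    "avg_pooling",
)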


Notes (as of May 3rd, 2024)

Note that you can replace the LLMs used above with the ones of your choice; the idea behind using these was that the ones used for the TextGeneration task are available as serverless endpoints on the Hugging Face Hub, and the ones used for UltraFeedback are the same as in the official paper.

In order to make extensive use of the serverless Inference Endpoints on the Hugging Face Hub, subscribing to PRO is recommended (see pricing), since Inference for PROs will be enabled and you will get improved rate limits for the usage of the free Inference API.

I've encountered issues when using Claude Haiku with the UltraFeedback prompts, since apparently it's not able to generate outputs compliant with the expected formatting, but I'll investigate that; in the meantime, the code for running Claude Haiku with UltraFeedback has been commented out. That could be fixed by just ignoring the values with rating=None (see the sketch below), but until further investigation is done, I feel it's better to leave it aside for the moment.
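
For reference, a possible workaround along those lines (just a sketch of the idea, not something I've validated end to end) would be a variant of the custom averaging step that skips the None ratings before computing the mean:

from distilabel.steps import StepInput, step
from distilabel.steps.typing import StepOutput


@step(inputs=["poll_ratings"], outputs=["avg_poll_ratings"])
def AveragePoolingSkippingNones(*inputs: StepInput) -> StepOutput:
    """Variant of the `AveragePooling` step above that ignores `None` ratings,
    e.g. those coming from jury outputs that couldn't be parsed."""
    for input in inputs:
        for item in input:
            avg_ratings = []
            for col in zip(*item["poll_ratings"]):
                valid = [rating for rating in col if rating is not None]
                # If no jury produced a valid rating for this generation, keep `None`
                avg_ratings.append(sum(valid) / len(valid) if valid else None)
            item["avg_poll_ratings"] = avg_ratings
        yield input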

What's next?

Currently we're actively working on distilabel v1.1.0, trying to make it as developer friendly as possible and encouraging everyone in the community to build with distilabel, while also aiming to bridge the gap in synthetic data generation with open models and on consumer hardware.

References