Lighteval documentation

Evaluating Custom Models

Lighteval allows you to evaluate custom model implementations by creating a custom model class that inherits from LightevalModel. This is useful when you want to evaluate models that aren’t directly supported by the standard backends and providers (Transformers, VLLM, etc.), or if you want to add your own pre/post-processing logic.

Creating a Custom Model

Step 1: Create Your Model Implementation

Create a Python file containing your custom model implementation. The model must inherit from LightevalModel and implement all required methods.

Here’s a basic example:

from lighteval.models.abstract_model import LightevalModel
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.requests import Doc, SamplingMethod
from lighteval.utils.cache_management import SampleCache, cached

class MyCustomModel(LightevalModel):
    def __init__(self, config):
        super().__init__(config)
        # Initialize your model here...

        # Enable caching (recommended)
        self._cache = SampleCache(config)

    @cached(SamplingMethod.GENERATIVE)
    def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
        # Implement generation logic
        pass

    @cached(SamplingMethod.LOGPROBS)
    def loglikelihood(self, docs: list[Doc]) -> list[ModelResponse]:
        # Implement loglikelihood computation
        pass

    @cached(SamplingMethod.PERPLEXITY)
    def loglikelihood_rolling(self, docs: list[Doc]) -> list[ModelResponse]:
        # Implement rolling loglikelihood computation
        pass

Step 2: Model File Requirements

The custom model file should contain exactly one class that inherits from LightevalModel. This class will be automatically detected and instantiated when loading the model.

You can find a complete example of a custom model implementation in examples/custom_models/google_translate_model.py.
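
If you just want to check the expected layout, a deliberately trivial file could look like the sketch below: it echoes each prompt back and returns constant log probabilities, so the numbers are meaningless, but it shows the one-class-per-file structure and the shape of the return values. The ModelResponse fields (text, logprobs) and Doc attributes (query, choices) used here are assumptions to verify against your installed version, and depending on the version the base class may require additional members (for example a tokenizer property), so treat this as a layout sketch rather than a drop-in file.

# my_echo_model.py -- a toy custom model used only to illustrate file layout
from lighteval.models.abstract_model import LightevalModel
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.requests import Doc, SamplingMethod
from lighteval.utils.cache_management import SampleCache, cached

class EchoModel(LightevalModel):
    def __init__(self, config):
        super().__init__(config)
        self._cache = SampleCache(config)

    @cached(SamplingMethod.GENERATIVE)
    def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
        # "Generate" by echoing the prompt back.
        return [ModelResponse(text=[doc.query]) for doc in docs]

    @cached(SamplingMethod.LOGPROBS)
    def loglikelihood(self, docs: list[Doc]) -> list[ModelResponse]:
        # Pretend every choice is equally likely.
        return [ModelResponse(logprobs=[0.0] * len(doc.choices)) for doc in docs]

    @cached(SamplingMethod.PERPLEXITY)
    def loglikelihood_rolling(self, docs: list[Doc]) -> list[ModelResponse]:
        # Constant score for the whole sequence.
        return [ModelResponse(logprobs=[0.0]) for doc in docs]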

Running the Evaluation

You can evaluate your custom model using either the command-line interface or the Python API.

Using the Command Line

lighteval custom \
    "google-translate" \
    "examples/custom_models/google_translate_model.py" \
    "lighteval|wmt20:fr-de|0" \
    --max-samples 10

The command takes three required arguments:

  • Model name: Used for tracking in results/logs
  • Model implementation file path: Path to your Python file containing the custom model
  • Tasks: Tasks to evaluate on (same format as other backends)

Using the Python API

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.custom.custom_model import CustomModelConfig
from lighteval.pipeline import Pipeline, PipelineParameters, ParallelismManager

# Set up evaluation tracking
evaluation_tracker = EvaluationTracker(
    output_dir="results",
    save_details=True
)

# Configure the pipeline
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.CUSTOM,
)

# Configure your custom model
model_config = CustomModelConfig(
    model_name="my-custom-model",
    model_definition_file_path="path/to/my_model.py"
)

# Create and run the pipeline
pipeline = Pipeline(
    tasks="leaderboard|truthfulqa:mc|0",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config
)

pipeline.evaluate()
pipeline.save_and_push_results()

Required Methods

Your custom model must implement these core methods:

greedy_until

For generating text until a stop sequence or max tokens is reached. This is used for generative evaluations.

def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
    """
    Generate text until stop sequence or max tokens.

    Args:
        docs: list of documents containing prompts and generation parameters

    Returns:
        list of model responses with generated text
    """
    pass
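
As a rough sketch (not lighteval's reference implementation), the method body could look like the following inside your model class. It assumes a hypothetical self.client.generate(prompt) call standing in for your own inference code, and that doc.query holds the rendered prompt:

@cached(SamplingMethod.GENERATIVE)
def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
    responses = []
    for doc in docs:
        # `self.client.generate` is a placeholder for your own inference call;
        # forward stop sequences and token limits however your backend expects them.
        generated_text = self.client.generate(prompt=doc.query)
        responses.append(ModelResponse(text=[generated_text]))
    return responses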

loglikelihood

For computing log probabilities of specific continuations. This is used for multiple choice logprob evaluations.

def loglikelihood(self, docs: list[Doc]) -> list[ModelResponse]:
    """
    Compute log probabilities of continuations.

    Args:
        docs: list of documents containing context and continuation pairs

    Returns:
        list of model responses with log probabilities
    """
    pass
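
One possible shape for the body, assuming a hypothetical self.score(context, continuation) helper that returns the summed log probability of the continuation tokens, with the shared context in doc.query and the candidate continuations in doc.choices:

@cached(SamplingMethod.LOGPROBS)
def loglikelihood(self, docs: list[Doc]) -> list[ModelResponse]:
    responses = []
    for doc in docs:
        # Score each candidate continuation against the shared context.
        choice_logprobs = [
            self.score(context=doc.query, continuation=choice)
            for choice in doc.choices
        ]
        responses.append(ModelResponse(logprobs=choice_logprobs))
    return responses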

loglikelihood_rolling

For computing rolling log probabilities of sequences. This is used for perplexity metrics.

def loglikelihood_rolling(self, docs: list[Doc]) -> list[ModelResponse]:
    """
    Compute rolling log probabilities of sequences.

    Args:
        docs: list of documents containing text sequences

    Returns:
        list of model responses with rolling log probabilities
    """
    pass
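
For perplexity, the whole text is scored without a conditioning context. Reusing the same hypothetical self.score helper as above:

@cached(SamplingMethod.PERPLEXITY)
def loglikelihood_rolling(self, docs: list[Doc]) -> list[ModelResponse]:
    responses = []
    for doc in docs:
        # Score the full sequence with an empty context to get its total log probability.
        total_logprob = self.score(context="", continuation=doc.query)
        responses.append(ModelResponse(logprobs=[total_logprob]))
    return responses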

See the LightevalModel base class documentation for detailed method signatures and requirements.

Enabling Caching (Recommended)

Lighteval includes a caching system that can significantly speed up evaluations by storing and reusing model predictions. To enable caching in your custom model:

Step 1: Import Caching Components

from lighteval.utils.cache_management import SampleCache, cached

Step 2: Initialize Cache in Constructor

def __init__(self, config):
    super().__init__(config)
    # Your initialization code...
    self._cache = SampleCache(config)

Step 3: Add Cache Decorators

Add cache decorators to your prediction methods:

@cached(SamplingMethod.GENERATIVE)
def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
    # Your implementation...

For detailed information about the caching system, see the Caching Documentation.

Troubleshooting

Common Issues

  1. Import Errors: Ensure all required dependencies are installed
  2. Method Signature Errors: Verify your methods match the expected signatures
  3. Caching Issues: Check that cache decorators are applied correctly
  4. Performance Issues: Consider implementing batching and caching (see the batching sketch after this list)
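
A minimal batching sketch for greedy_until, assuming a hypothetical self.client.generate_batch(prompts) call that takes a list of prompts and returns one generated string per prompt:

@cached(SamplingMethod.GENERATIVE)
def greedy_until(self, docs: list[Doc]) -> list[ModelResponse]:
    batch_size = 8  # tune to your hardware or API limits
    responses = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start : start + batch_size]
        # One inference call per chunk of documents instead of one per document.
        outputs = self.client.generate_batch([doc.query for doc in batch])
        responses.extend(ModelResponse(text=[output]) for output in outputs)
    return responses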

Debugging Tips

  • Use the --max-samples flag to test with a small dataset
  • Enable detailed logging to see what’s happening
  • Test individual methods in isolation (see the sketch after this list)
  • Check the example implementations for reference
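
For instance, you can build a Doc by hand and call a method directly, without going through the pipeline. The exact required Doc fields can vary between versions, so check lighteval.tasks.requests.Doc in your install:

from lighteval.tasks.requests import Doc

# Hypothetical quick check: run one handcrafted document through the model.
doc = Doc(query="Question: 2 + 2 = ?\nAnswer:", choices=["4", "5"], gold_index=0)
model = MyCustomModel(config)  # build your config however your class expects it
print(model.greedy_until([doc])[0].text)
print(model.loglikelihood([doc])[0].logprobs)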

For more detailed information about custom model implementation, see the Model Reference.
