Simplifying Evaluation of Language Models with DeepEval

As machine learning engineers, we often face the challenge of evaluating the performance of our large language models (LLMs) before deploying them into production. DeepEval is a Python library that aims to make this process easier by providing a testing framework specifically designed for LLMs. Inspired by pytest, DeepEval lets you write tests for LLM applications, such as RAG (Retrieval-Augmented Generation) pipelines, in a Pythonic way.

How Does DeepEval Work?

DeepEval provides a simple and intuitive API for writing test cases for your LLMs. You can define individual test cases or run bulk test cases defined in a CSV file. Let's take a look at how you can use DeepEval to write test cases.

Individual Test Cases

To define an individual test case, you can use the assert_llm_output function provided by DeepEval. Here's an example:

from deepeval.test_utils import assert_llm_output

def generate_llm_output(input: str) -> str:
    # Placeholder standing in for a real LLM call; it returns a canned answer
    return "Our customer success phone line is 1200-231-231."

def test_llm_output():
    input = "What is the customer success phone line?"
    expected_output = "Our customer success phone line is 1200-231-231."
    output = generate_llm_output(input)
    assert_llm_output(output, expected_output, metric="entailment")
    assert_llm_output(output, expected_output, metric="exact")

You can then run the test case with pytest:

python -m pytest

# Output
Running tests ... ✅
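Conceptually, the "exact" metric reduces to a strict string comparison, while "entailment" checks whether the output semantically follows from the expected text. Here is a minimal sketch of what an exact check boils down to (illustrative only, not DeepEval's actual implementation):

```python
def exact_match(output: str, expected: str) -> bool:
    # Strict equality after trimming surrounding whitespace
    return output.strip() == expected.strip()

print(exact_match("Our customer success phone line is 1200-231-231.",
                  "Our customer success phone line is 1200-231-231. "))
# → True (trailing whitespace is ignored)
```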

Bulk Test Cases

If you have a large number of test cases, you can define them in a CSV file and import them into DeepEval. Here's an example:

from deepeval import BulkTestRunner, TestCases

class BulkTester(BulkTestRunner):
    @property
    def bulk_test_cases(self):
        # Column names here must match the headers in sample.csv
        return TestCases.from_csv(
            "sample.csv",
            input_column="input",
            expected_output_column="expected_output",
        )

tester = BulkTester()
tester.run(callable_fn=generate_llm_output)
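The CSV itself just needs a column for the input and one for the expected output. The layout below is an assumption for illustration (match the column names to whatever your CSV import expects); parsing it with the standard library shows the shape of the data:

```python
import csv
import io

# Illustrative contents of sample.csv; column names are an assumption
sample = """input,expected_output
What is the customer success phone line?,Our customer success phone line is 1200-231-231.
Do you ship internationally?,"Yes, we ship to over 40 countries."
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["input"], "->", row["expected_output"])
```

Note that quoted fields let expected outputs contain commas, as in the second row.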

Custom Metrics

DeepEval allows you to define custom metrics to evaluate the performance of your LLMs. To define a custom metric, you need to implement the measure and is_successful methods of the Metric class. Here's an example:

from deepeval.metric import Metric

class CustomMetric(Metric):
    def measure(self, a, b):
        # Record whether the check passed so is_successful can report it
        self.success = a > b
        return self.success

    def is_successful(self):
        return self.success

metric = CustomMetric()
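To see how the two methods work together, here is a self-contained stand-in (not DeepEval's actual base class) illustrating the contract a test runner relies on: measure computes and records a score, and is_successful reports whether the most recent measurement passed. The length-ratio metric and its thresholds are invented for this sketch:

```python
class Metric:
    """Hand-rolled stand-in for deepeval.metric.Metric, for illustration only."""
    def measure(self, *args):
        raise NotImplementedError

    def is_successful(self) -> bool:
        raise NotImplementedError

class LengthRatioMetric(Metric):
    """Passes when the output length is within 20% of the expected length (illustrative)."""
    minimum, maximum = 0.8, 1.2

    def measure(self, output: str, expected: str) -> float:
        # Store the score so is_successful can report on the last measurement
        self.score = len(output) / max(len(expected), 1)
        return self.score

    def is_successful(self) -> bool:
        return self.minimum <= self.score <= self.maximum

metric = LengthRatioMetric()
metric.measure("short answer", "a short answer")
print(metric.is_successful())
# → True (12 characters vs. 14 is within 20%)
```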

Benefits and Use Cases

DeepEval offers several benefits for machine learning engineers working with LLMs. It provides a Pythonic interface for writing test cases, making it easy to incorporate evaluation into your development workflow. By automating the evaluation process, DeepEval shortens the feedback loop and lets you iterate on your prompts, agents, and LLMs more efficiently.

Some use cases for DeepEval include:

  • Evaluating the performance of LLMs in production-grade applications.
  • Testing the accuracy and reliability of LLM responses.
  • Validating the behavior of LLMs across different input scenarios.

Future Directions

The DeepEval team has an exciting roadmap for the future. Some upcoming features and improvements include:

  • Web UI: A user-friendly web interface for managing and visualizing test cases and evaluation results.
  • Support for more metrics: DeepEval will provide support for additional evaluation metrics to cater to a wider range of use cases.
  • Integrations with LangChain and LlamaIndex: DeepEval will integrate tightly with popular frameworks like LangChain and LlamaIndex, further enhancing its capabilities.


DeepEval is a powerful Python library that simplifies the evaluation of language models. By providing a testing framework specifically designed for LLMs, DeepEval enables machine learning engineers to iterate on their models more efficiently and with greater confidence. With its intuitive API, support for custom metrics, and future roadmap, DeepEval is a valuable tool for anyone working with LLMs.

To get started with DeepEval, check out the official documentation and join the Discord community. Happy testing!