How to Use Promptfoo for LLM Testing

Stephen Collins · Feb 14, 2024 · 7 min read

What you will learn

  • What is the purpose of Promptfoo in LLM development?
    Promptfoo is designed to facilitate the evaluation of LLM output quality in a systematic and efficient manner, allowing developers to test prompts, models, and Retrieval-Augmented Generation setups against predefined test cases to identify the best-performing combinations for specific applications.
  • How can Promptfoo enhance LLM application development?
    Promptfoo can enhance LLM application development by performing side-by-side comparisons of LLM outputs, using caching and concurrent testing to speed up evaluations, automatically scoring outputs against predefined expectations, and integrating into existing workflows as a CLI or a library.
  • What are some key features of Promptfoo?
    Key features include testing JSON model responses, evaluating model costs, checking adherence to instructions, side-by-side comparisons of LLM outputs, faster evaluations through caching and concurrent testing, automatic output scoring, and compatibility with a wide range of LLM APIs.
  • What types of assertions does Promptfoo support for evaluating LLM outputs?
    Promptfoo supports several assertion types, including cost assertions, contains-json assertions for validating JSON structure, answer-relevance assertions for thematic accuracy, llm-rubric assertions for qualitative assessments, and model-graded-closedqa assertions for factual correctness and thematic relevance.
  • Why is Promptfoo compared to Jest in the context of LLM application testing?
    Much like Jest for JavaScript testing, Promptfoo offers a robust set of testing utilities that integrate easily into development workflows and CI/CD processes, significantly improving the efficiency, quality, and reliability of LLM applications.

“Untested software is broken software.”

As developers writing code for production environments, we deeply embrace this principle, and it holds especially true when working with large language models (LLMs). To build robust applications, the ability to systematically evaluate LLM outputs is indispensable. Relying on trial and error is not only inefficient but frequently leads to less-than-ideal results.

Enter Promptfoo, a CLI and library that brings a test-driven framework to LLM development. In this tutorial, I'll explore Promptfoo and showcase capabilities such as testing JSON model responses, evaluating model costs, and checking adherence to instructions, by walking you through a sample project focused on creative storytelling.

You can access all the code in the companion GitHub repository for this blog post.

What is Promptfoo?

Promptfoo is a comprehensive tool that facilitates the evaluation of LLM output quality in a systematic and efficient manner. It allows developers to test prompts, models, and Retrieval-Augmented Generation (RAG) setups against predefined test cases, thereby identifying the best-performing combinations for specific applications. With Promptfoo, developers can:

  • Perform side-by-side comparisons of LLM outputs to detect quality variances and regressions.
  • Utilize caching and concurrent testing to expedite evaluations.
  • Automatically score outputs based on predefined expectations.
  • Integrate Promptfoo into existing workflows either as a CLI or a library (see the sketch after this list).
  • Work with a wide range of LLM APIs, including OpenAI, Anthropic, Azure, Google, HuggingFace, open-source models like Llama, and even custom API providers.
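To illustrate the library option mentioned above, here is a minimal sketch of programmatic usage from Node.js. It mirrors the shape of the YAML config shown later in this post; treat the exact option names as something to confirm against Promptfoo's Node.js documentation rather than as a definitive API reference.

// in an ES module (or wrap in an async function)
import promptfoo from "promptfoo";

// the same idea as promptfooconfig.yaml, expressed as a plain object
const results = await promptfoo.evaluate({
  prompts: ["Write a diary entry from someone living in {{topic}}."],
  providers: ["openai:gpt-3.5-turbo-0613"],
  tests: [
    {
      vars: { topic: "a mysterious island" },
      assert: [{ type: "contains-json" }],
    },
  ],
});

// inspect scores and pass/fail results programmatically
console.log(JSON.stringify(results, null, 2));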

The philosophy behind Promptfoo is simple: embrace test-driven development for LLM applications to move beyond the inefficiencies of trial-and-error. This approach not only saves time but also ensures that your applications meet the desired quality standards before deployment.

Demo Project: Creative Storytelling with Promptfoo

To illustrate the capabilities of Promptfoo, let’s go over our demo project centered on creative storytelling. This project uses a configuration file (promptfooconfig.yaml) that defines the evaluation setup for generating diary entries set in various contexts, such as a mysterious island, a futuristic city, and an ancient Egyptian civilization.

Project Setup
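The companion repository is a small Node.js project, so getting it running locally looks roughly like the following (this assumes Node.js is installed and that promptfoo is listed as a dependency in the project's package.json; the exact environment variables depend on which providers you enable):

# install dependencies, including promptfoo
npm install

# provide API keys for the providers referenced in promptfooconfig.yaml
export OPENAI_API_KEY="sk-..."
export MISTRAL_API_KEY="..."

If you are starting from scratch rather than cloning the repo, npx promptfoo@latest init scaffolds a starter prompt and config for you.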

Writing the Prompt

The core of our evaluation is the prompt defined in prompt1.txt, which instructs the LLM to generate a diary entry from someone living in a specified context (e.g., a mysterious island). The output must be a JSON object containing metadata (person’s name, location, date) and the diary entry itself. Here’s the entire prompt1.txt for our project:

Write a diary entry from someone living in {{topic}}.
Return a JSON object with metadata and the diary entry.
The metadata should include the person's name, location, and the date.
The date should be the current date.
The diary entry key should be named "diary_entry" and its value should be a raw string.

An example of the expected output is:

{
  "metadata": {
    "name": "John Doe",
    "location": "New York",
    "date": "2020-01-01"
  },
  "diary_entry": "Today was a good day."
}

This is a fairly simple prompt asking the LLM for JSON output. Promptfoo uses Nunjucks templating, so the {{topic}} placeholder in prompt1.txt is filled in with variables defined in promptfooconfig.yaml.
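For example, when topic is set to "a mysterious island" (as in the first test case below), the opening line of the prompt renders as:

Write a diary entry from someone living in a mysterious island.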

More information can be found in Promptfoo's Input and output files documentation.

The promptfooconfig.yaml

The promptfooconfig.yaml file outlines the structure of our evaluation. It includes a description of the project, specifies the prompts, lists the LLM providers (with their configurations), and defines the tests with associated assertions to evaluate the output quality based on cost, content relevance, and specific JSON structure requirements. The example promptfooconfig.yaml isn’t too long, and here is the whole file:

description: "Creative Storytelling"
prompts: [prompt1.txt]
providers:
  - id: "mistral:mistral-medium"
    config:
      temperature: 0
      max_tokens: 1000
      safe_prompt: true
  - id: "openai:gpt-3.5-turbo-0613"
    config:
      temperature: 0
      max_tokens: 1000
  - id: "openai:gpt-4-0125-preview"
    config:
      temperature: 0
      max_tokens: 1000
tests:
  - vars:
      topic: "a mysterious island"
    assert:
      - type: cost
        threshold: 0.002
      - type: "contains-json"
        value:
          {
            "required": ["metadata", "diary_entry"],
            "type": "object",
            "properties":
              {
                "metadata":
                  {
                    "type": "object",
                    "required": ["name", "location", "date"],
                    "properties":
                      {
                        "name": { "type": "string" },
                        "location": { "type": "string" },
                        "date": { "type": "string", "format": "date" },
                      },
                  },
                "diary_entry": { "type": "string" },
              },
          }
  - vars:
      topic: "a futuristic city"
    assert:
      - type: answer-relevance
        value: "Ensure that the output contains content about a futuristic city"
      - type: "llm-rubric"
        value: "ensure that the output showcases innovation and detailed world-building"
  - vars:
      topic: "an ancient Egyptian civilization"
    assert:
      - type: "model-graded-closedqa"
        value: "References Egypt in some way"

The Assertions Explained

Promptfoo offers a versatile suite of assertions to evaluate LLM outputs against predefined conditions or expectations, ensuring the outputs meet specific quality standards. These assertions are categorized into deterministic eval metrics and model-assisted eval metrics. Here’s a deep dive into each assertion used in the preceding example promptfooconfig.yaml for our creative storytelling project.

Cost Assertion

The cost assertion verifies that the inference cost of generating an output stays below a predefined threshold, which is crucial for managing spend effectively, especially when scaling LLM applications. In our example, the assertion ensures that generating a diary entry for "a mysterious island" remains cost-effective, with a threshold of $0.002 per generation.
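To make that threshold concrete, here is a rough back-of-the-envelope calculation. It assumes OpenAI's gpt-3.5-turbo-0613 pricing at the time of writing (roughly $0.0015 per 1K prompt tokens and $0.002 per 1K completion tokens), figures that may well have changed since:

200 prompt tokens     x $0.0015 / 1K ≈ $0.0003
400 completion tokens x $0.002  / 1K ≈ $0.0008
total                                ≈ $0.0011 (under the $0.002 threshold)

The same request against gpt-4-0125-preview costs roughly an order of magnitude more per token, so it is far more likely to trip this assertion, which is exactly the kind of cost regression this check is meant to surface.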

Contains-JSON Assertion

This assertion (contains-json) checks whether the output contains valid JSON that matches a specific schema. It’s particularly useful for structured data outputs, ensuring they adhere to the expected format. In the creative storytelling example, this assertion validates the JSON structure of the diary entry, including required fields like metadata (with subfields name, location, and date) and diary_entry.

Answer-Relevance Assertion

The answer-relevance assertion evaluates whether the LLM output is relevant to the original query or topic. This ensures that the model’s responses are on-topic and meet the user’s intent. For the futuristic city prompt, this assertion confirms that the content indeed revolves around a futuristic city, aligning with the user’s request for thematic accuracy.

LLM-Rubric Assertion

An llm-rubric assertion uses a Language Model to grade the output against a specific rubric. This method is effective for qualitative assessments of outputs, such as creativity, detail, or adherence to a theme. For our futuristic city scenario, this assertion evaluates whether the output demonstrates innovation and detailed world-building, as expected for a narrative set in a futuristic environment.

Model-Graded-ClosedQA Assertion

The model-graded-closedqa assertion uses the Closed QA method (based on OpenAI Evals) to check that the output adheres to specific criteria, which is useful for factual correctness and thematic relevance. In the case of "an ancient Egyptian civilization," this assertion verifies that the output references Egypt in some manner, ensuring historical or thematic accuracy.
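The answer-relevance, llm-rubric, and model-graded-closedqa assertions all rely on a grading model behind the scenes. Promptfoo lets you control which model does the grading; one way to do that is a defaultTest override, sketched below (double-check the exact keys against the model-graded assertion docs):

defaultTest:
  options:
    # use GPT-4 as the grader for all model-graded assertions
    provider: "openai:gpt-4-0125-preview"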

Running the Evaluation

With Promptfoo, executing this evaluation is straightforward. Developers can run tests from the command line, letting Promptfoo compare outputs from different LLMs against the specified criteria. This process helps identify which LLM performs best for creative storytelling within the defined parameters. I've provided a simple test script (leveraging npx) in the project's package.json, which can be run as follows from the root of the repository:

npm run test
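If you are wiring this into your own project, the script is just a thin wrapper around the Promptfoo CLI. A minimal scripts entry in package.json might look like the following sketch (promptfoo eval picks up promptfooconfig.yaml from the current directory):

{
  "scripts": {
    "test": "npx promptfoo eval"
  }
}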

Analyzing the Results

Promptfoo produces matrix views that enable quick evaluation of outputs across multiple prompts and inputs in the terminal, as well as a web UI for more in-depth exploration of the test results. These features are invaluable for spotting trends, understanding model strengths and weaknesses, and making informed decisions about which LLM to use for your specific application.
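Once an evaluation run finishes and the terminal matrix has been printed, the web UI can be opened with a separate command:

npx promptfoo view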

For more information on viewing Promptfoo's test results, check out Promptfoo's Usage docs.

Why Choose Promptfoo?

Promptfoo stands out for several reasons:

  • Battle-tested: Designed for LLM applications serving millions of users, Promptfoo is both robust and adaptable.
  • Simple and Declarative: Define evaluations without extensive coding or the use of cumbersome notebooks.
  • Language Agnostic: Work in Python, JavaScript, or your preferred language.
  • Collaboration-Friendly: Share evaluations and collaborate with teammates effortlessly.
  • Open-Source and Private: Promptfoo is fully open-source and runs locally, ensuring your evaluations remain private.

Conclusion

Promptfoo may very well become the Jest of LLM application testing.

By integrating Promptfoo into your development workflow (and CI/CD process), you can significantly enhance the efficiency, quality, and reliability of your LLM applications.
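As one deliberately minimal sketch of the CI side, the GitHub Actions job below runs the evaluation on every push. The workflow name, Node version, and secret names are assumptions chosen for illustration, and Promptfoo also publishes its own GitHub Action that is worth checking before rolling your own:

name: llm-evals
on: [push]

jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # API keys for the providers under test, stored as repository secrets
      - run: npx promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}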

Whether you’re developing creative storytelling applications or any other LLM-powered project, Promptfoo offers the features and flexibility needed to add confidence to your LLM integrations through a robust set of testing utilities.