
Evaluations with Promptfoo

Prerequisites

To use Promptfoo you will need Node.js and npm installed on your system. For more information, follow this guide.

You can install Promptfoo using npm or run it directly using npx. In this guide we will use npx.

Note: For this example you will not need to run npx promptfoo@latest init, since an initialized promptfooconfig.yaml file already exists in this directory.

See the official docs here

Getting Started

The evaluation is orchestrated by the promptfooconfig.yaml file. In this file we define the following sections:

  • Prompts
    • Promptfoo enables you to import prompts in many different formats. You can read more about this here.
    • In this example we will load three prompts from the prompts.py file, the same ones used in guide.ipynb:
      • The functions are identical to those used in guide.ipynb except that, instead of calling the Claude API, they simply return the prompt. Promptfoo then handles the orchestration of calling the API and storing the results.
      • You can read more about prompt functions here. Using Python allows us to reuse the VectorDB class (defined in vectordb.py), which is necessary for RAG.
  • Providers
    • With Promptfoo you can connect to many different LLMs from different platforms; see here for more. In guide.ipynb we used Haiku with a default temperature of 0.0. We will use Promptfoo to experiment with an array of different temperature settings to identify the optimal choice for our use case.
  • Tests
    • We will use the same data that was used in guide.ipynb which can be found in this Google Sheet.
    • Promptfoo has a wide array of built-in tests, which can be found here.
    • In this example we will define tests in our dataset.csv for the conditions that change with each row, and a test in the promptfooconfig.yaml for conditions that are consistent across all test cases. Read more about this here.
  • Transform
    • In the defaultTest section we define a transform function. This is a Python function which extracts the specific output we want to test from the LLM response.
  • Output
    • We define the path for the output file. Promptfoo can output results in many formats, see here. Alternatively you can use Promptfoo's web UI, see here.
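The prompt-function and transform ideas above can be sketched in Python. This is a minimal illustration, not the actual code from prompts.py or this directory's config: the function names (build_prompt, get_transform), the question variable, and the answer-tag extraction convention are all assumptions made for the example.

```python
# Hypothetical sketch of a promptfoo-style prompt function and transform.
# The names and the <answer>-tag convention are assumptions for illustration;
# the real prompt functions live in prompts.py.

def build_prompt(context: dict) -> str:
    """Return the prompt text instead of calling the Claude API.

    Promptfoo passes each test case's variables in context["vars"],
    then handles calling the model itself.
    """
    question = context["vars"]["question"]
    return f"Answer the following question:\n\n{question}"


def get_transform(output: str, context: dict) -> str:
    """Extract the specific part of the model response we want to test.

    Here we assume, purely for illustration, that the answer is wrapped
    in <answer> tags; otherwise we fall back to the raw output.
    """
    start = output.find("<answer>")
    end = output.find("</answer>")
    if start != -1 and end != -1:
        return output[start + len("<answer>"):end].strip()
    return output.strip()
```

The key point is that the prompt function only builds and returns a string, and the transform only post-processes the response; Promptfoo sits between the two, running the model for every prompt/provider/test-case combination.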

Run the eval

To get started with Promptfoo open your terminal and navigate to this directory (./evaluation).

Before running your evaluation you must define the following environment variables:

export ANTHROPIC_API_KEY=YOUR_API_KEY
export VOYAGE_API_KEY=YOUR_API_KEY

From the evaluation directory, run the following command.

npx promptfoo@latest eval

If you would like to increase the concurrency of the requests (default = 4), run the following command.

npx promptfoo@latest eval -j 25

When the evaluation is complete, the terminal will print the results for each row in the dataset.

You can now go back to guide.ipynb to analyze the results!