claude-cookbooks/skills/retrieval_augmented_generation/evaluation/csvs/evaluation_results_detailed.csv
Alex Notov f0bf214841 Update references from 'Claude Cookbook' to 'Claude Cookbooks'
- Changed all instances of singular 'Claude Cookbook' to plural 'Claude Cookbooks'
- Updated URLs from anthropic-cookbooks to claude-cookbooks
- Applied changes across documentation, code, and data files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-17 12:09:41 -06:00

15 KiB

question,retrieval_precision,retrieval_recall,retrieval_mrr,e2e_correct
How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?,0.3333333333333333,0.5,0.5,False
"What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?",0.6666666666666666,1.0,1.0,True
"What are some key success metrics to consider when evaluating Claude's performance on a classification task, and how do they relate to choosing the right model to reduce latency?",0.3333333333333333,0.5,1.0,True
What are two ways that Claude for Sheets can improve prompt engineering workflows compared to using chained prompts?,0.3333333333333333,0.5,0.5,True
"What happens if a prompt for the Text Completions API is missing the ""\n\nHuman:"" and ""\n\nAssistant:"" turns?",0.6666666666666666,1.0,1.0,True
How do the additional tokens required for tool use in Claude API requests impact pricing compared to regular API requests?,0.6666666666666666,1.0,1.0,True
"When will the new Anthropic Developer Console features that show API usage, billing details, and rate limits be available?",0.3333333333333333,1.0,1.0,True
"When deciding whether to use chain-of-thought (CoT) for a task, what are two key factors to consider in order to strike the right balance between performance and latency?",0.6666666666666666,1.0,1.0,False
How can I use Claude to more easily digest the content of long PDF documents?,0.3333333333333333,0.5,0.3333333333333333,True
"According to the documentation, where can you view your organization's current API rate limits in the Claude Console?",0.6666666666666666,1.0,1.0,False
How can we measure the performance of the ticket classification system implemented using Claude beyond just accuracy?,0.0,0.0,0.0,False
How can you specify a system prompt using the Text Completions API versus the Messages API?,0.3333333333333333,0.5,1.0,True
How can you combine XML tags with chain of thought reasoning to create high-performance prompts for Claude?,0.0,0.0,0.0,False
"When evaluating the Claude model's performance for ticket routing, what three key metrics are calculated and what are the results for the claude-3-haiku-20240307 model on the 91 test samples?",0.0,0.0,0.0,False
"Before starting to engineer and improve a prompt in Claude, what key things does Anthropic recommend you have in place first?",0.0,0.0,0.0,False
How does the Messages API handle mid-response prompting compared to the Text Completions API?,0.6666666666666666,1.0,1.0,True
How does Claude's response differ when given a role through a system prompt compared to not having a specific role in the financial analysis example?,0.3333333333333333,1.0,0.5,True
"What are some quantitative metrics that can be used to measure the success of a sentiment analysis model, and how might specific targets for those metrics be determined?",0.3333333333333333,1.0,0.5,True
What is a power user tip mentioned in the documentation for creating high-performance prompts using XML tags?,0.6666666666666666,1.0,1.0,True
How can you use an LLM like Claude to automatically grade the outputs of other LLMs based on a rubric?,0.3333333333333333,0.5,0.3333333333333333,True
How can you access and deploy Voyage embeddings on AWS Marketplace?,0.3333333333333333,1.0,1.0,True
"When using tools just to get Claude to produce JSON output following a particular schema, what key things should you do in terms of tool setup and prompting?",0.3333333333333333,0.5,0.3333333333333333,False
What are the key differences between the legacy Claude Instant 1.2 model and the Claude 3 Haiku model in terms of capabilities and performance?,0.6666666666666666,0.6666666666666666,1.0,False
What is one key benefit of using examples when prompt engineering with Claude?,0.3333333333333333,1.0,0.5,True
"According to the Claude Documentation, what is one key advantage of using prompt engineering instead of fine-tuning when it comes to adapting an AI model to new domains or tasks?",0.3333333333333333,0.5,1.0,False
How can I quickly get started using the Claude for Sheets extension with a pre-made template?,0.6666666666666666,1.0,1.0,True
"How does the ""index"" field in the ""content_block_delta"" event relate to the text being streamed in a response?",0.3333333333333333,0.5,0.5,True
"How can you include an image as part of a Claude API request, and what image formats are currently supported?",0.0,0.0,0.0,False
What is the relationship between time to first token (TTFT) and latency when evaluating a language model's performance?,1.0,1.0,1.0,True
How can providing Claude with examples of handling certain edge cases like implicit requests or emotional prioritization help improve its performance in routing support tickets?,0.3333333333333333,0.5,1.0,True
"How does the stop_reason of ""tool_use"" relate to the overall workflow of integrating external tools with Claude?",0.3333333333333333,0.5,1.0,True
"According to the documentation, what error event and corresponding HTTP error code may be sent during periods of high usage for the Claude API when using streaming responses?",1.0,1.0,1.0,True
What are the two types of deltas that can be contained in a content_block_delta event when streaming responses from the Claude API?,0.6666666666666666,1.0,1.0,True
"On what date did Claude 3.5 Sonnet and tool use both become generally available across the Claude API, Amazon Bedrock, and Google Vertex AI?",0.6666666666666666,1.0,1.0,False
In what order did Anthropic launch Claude.ai and the Claude iOS app in Canada and Europe?,0.6666666666666666,1.0,1.0,True
"When the API response from Claude has a stop_reason of ""tool_use"", what does this indicate and what should be done next to continue the conversation?",0.3333333333333333,0.5,1.0,True
What Python libraries are used in the example code snippet for evaluating tone and style in a customer service chatbot?,0.0,0.0,0.0,True
What are the two main ways to authenticate when using the Anthropic Python SDK to access Claude models on Amazon Bedrock?,0.6666666666666666,1.0,1.0,True
"When deciding whether to implement leak-resistant prompt engineering strategies, what two factors should be considered and balanced?",1.0,1.0,1.0,True
How can selecting the appropriate Claude model based on your specific requirements help reduce latency in your application?,0.6666666666666666,1.0,1.0,True
How can you stream responses from the Claude API using the Python SDK?,0.3333333333333333,0.5,1.0,True
"How can you guide Claude's response by pre-filling part of the response, and what API parameter is used to generate a short response in this case?",0.0,0.0,0.0,True
"What is more important when building an eval set for an AI system - having a larger number of test cases with automated grading, or having fewer high-quality test cases graded by humans?",0.3333333333333333,0.5,1.0,True
What are the two required fields in a content_block_delta event for a text delta type?,0.6666666666666666,1.0,1.0,False
"What are two interactive ways to learn how to use Claude's capabilities, such as uploading PDFs and generating embeddings?",0.0,0.0,0.0,False
Why does breaking a task into distinct subtasks for chained prompts help improve Claude's accuracy on the overall task?,0.6666666666666666,1.0,1.0,True
How does the streaming format for Messages responses differ from Text Completions streaming responses?,0.3333333333333333,1.0,1.0,True
"What are two ways to start experimenting with Claude as a user, according to Anthropic's documentation?",0.0,0.0,0.0,False
How can using chain prompts help reduce errors and inconsistency in complex tasks handled by Claude?,0.6666666666666666,1.0,1.0,True
What HTTP status code does an overloaded_error event correspond to in a non-streaming context for the Claude API?,0.6666666666666666,1.0,1.0,True
What are the two ways to specify the format in which Voyage AI returns embeddings through its HTTP API?,0.3333333333333333,1.0,1.0,True
"When streaming API requests that use tools, how are the input JSON deltas for tool_use content blocks sent, and how can they be accumulated and parsed by the client?",0.3333333333333333,0.5,1.0,True
"What are the two interactive prompt engineering tutorials that Anthropic offers, and how do they differ?",0.6666666666666666,1.0,1.0,True
What are some of the key capabilities that make Claude suitable for enterprise use cases requiring integration with specialized applications and processing of large volumes of sensitive data?,0.3333333333333333,1.0,1.0,True
"As of June 2024, in which regions are Anthropic's Claude.ai API and iOS app available?",0.6666666666666666,0.6666666666666666,1.0,False
"What are the two main approaches for integrating Claude into a support ticket workflow, and how do they differ in terms of scalability and ease of implementation?",0.6666666666666666,1.0,1.0,True
"When did Anthropic release a prompt generator tool to help guide Claude in generating high-quality prompts, and through what interface is it available?",0.0,0.0,0.0,False
Which Claude 3 model provides the best balance of intelligence and speed for high-throughput tasks like sales forecasting and targeted marketing?,0.0,0.0,0.0,False
"How can you calculate the similarity between two Voyage embedding vectors, and what is this equivalent to since Voyage embeddings are normalized to length 1?",0.6666666666666666,1.0,1.0,True
How can using examples in prompts improve Claude's performance on complex tasks?,0.3333333333333333,0.5,1.0,True
"What are the two types of content block deltas that can be emitted when streaming responses with tool use, and what does each delta type contain?",1.0,0.75,1.0,True
What are two key capabilities of Claude that enable it to build interactive systems and personalized user experiences?,0.3333333333333333,1.0,0.5,False
"What are the key event types included in a raw HTTP stream response when using message streaming, and what is the typical order they occur in?",0.6666666666666666,1.0,1.0,True
What is the maximum number of images that can be included in a single request using the Claude API compared to the claude.ai interface?,0.3333333333333333,0.5,0.3333333333333333,True
"When Claude's response is cut off due to hitting the max_tokens limit and contains an incomplete tool use block, what should you do to get the full tool use?",0.0,0.0,0.0,False
What two steps are needed before running a classification evaluation on Claude according to the documentation?,0.0,0.0,0.0,False
How can you use the content parameter in the messages list to influence Claude's response?,0.0,0.0,0.0,False
What are two key advantages of prompt engineering over fine-tuning when it comes to model comprehension and general knowledge preservation?,0.5,0.5,1.0,True
What are the two main steps to get started with making requests to Claude models on Anthropic's Bedrock API?,0.3333333333333333,0.5,0.5,False
How can you check which Claude models are available in a specific AWS region using the AWS CLI?,0.3333333333333333,0.5,1.0,True
What argument can be passed to the voyageai.Client.embed() method or the Voyage HTTP API to specify whether the input text is a query or a document?,0.6666666666666666,1.0,1.0,True
How do the streaming API delta formats differ between tool_use content blocks and text content blocks?,0.6666666666666666,1.0,1.0,False
What are the image file size limits when uploading images to Claude using the API versus on claude.ai?,0.3333333333333333,1.0,1.0,True
What is one key consideration when selecting a Claude model for an enterprise use case that needs low latency?,0.6666666666666666,1.0,0.5,True
"What embedding model does Anthropic recommend for code retrieval, and how does its performance compare to alternatives according to Voyage AI?",0.6666666666666666,1.0,1.0,True
What are two ways the Claude Cookbooks can help developers learn to use Anthropic's APIs?,0.6666666666666666,1.0,1.0,True
How does the size of the context window impact a language model's ability to utilize retrieval augmented generation (RAG)?,0.6666666666666666,1.0,1.0,True
How can the Evaluation tool in Anthropic's Claude platform help improve prompts and build more robust AI applications?,0.3333333333333333,0.5,1.0,True
Which Claude model has the fastest comparative latency according to the comparison tables?,0.6666666666666666,1.0,1.0,True
How can you build up a conversation with multiple turns using the Anthropic Messages API in Python?,0.6666666666666666,1.0,1.0,True
How can using XML tags to provide a specific role or context help improve Claude's analysis of a legal contract compared to not using a role prompt?,0.3333333333333333,0.5,0.3333333333333333,True
What are the key differences between how Claude 3 Opus and Claude 3 Sonnet handle missing information when making tool calls?,0.0,0.0,0.0,True
What steps should be taken to ensure a reliable deployment of an automated ticket routing system using Claude into a production environment?,0.6666666666666666,1.0,1.0,True
How should you evaluate a model's performance on a ticket routing classifier?,0.3333333333333333,0.5,1.0,False
What two methods does Anthropic recommend for learning how to prompt engineer with Claude before diving into the techniques?,0.3333333333333333,0.5,1.0,True
What are the key differences between a pretrained large language model and Claude in terms of their training and capabilities?,0.3333333333333333,0.5,0.5,True
What are some key advantages of using prompt engineering instead of fine-tuning to adapt a pretrained language model for a specific task or domain?,0.6666666666666666,0.6666666666666666,1.0,True
How can you authenticate with GCP before running requests to access Claude models on Vertex AI?,0.3333333333333333,0.5,1.0,True
"What new capabilities and features were introduced by Anthropic on May 10th, 2024 and how do they enable users to create and tailor prompts for specific tasks?",0.3333333333333333,1.0,0.3333333333333333,True
On what date did both the Claude 3.5 Sonnet model and the Artifacts feature in Claude.ai become available?,0.6666666666666666,1.0,1.0,True
"When putting words in Claude's mouth to shape the response, what header and value can you use in the request to limit Claude's response to a single token?",0.3333333333333333,0.5,1.0,True
What does the temperature parameter do when working with large language models?,0.3333333333333333,0.5,1.0,True
What are two ways to specify API parameters when calling the Claude API using Claude for Sheets?,0.3333333333333333,0.3333333333333333,0.3333333333333333,False
How does prefilling the response with an opening curly brace ({ ) affect Claude's output when extracting structured data from text?,0.3333333333333333,1.0,1.0,True
What are some helpful resources provided by Anthropic to dive deeper into building with images using Claude?,0.3333333333333333,0.5,1.0,True
How do you specify the API key when creating a new Anthropic client in the Python and TypeScript SDK examples?,0.0,0.0,0.0,False
What are two key benefits of using the Anthropic Evaluation tool when developing prompts for an AI classification application?,0.3333333333333333,0.5,1.0,True
"What are the key differences between a pretrained language model like Claude's underlying model, and the final version of Claude available through Anthropic's API?",0.3333333333333333,0.3333333333333333,0.3333333333333333,True
What is the IPv6 address range used by Anthropic?,1.0,1.0,1.0,True
"When using the Python SDK to create a message with Claude, what are two ways you can specify your API key?",0.3333333333333333,0.5,1.0,True