claude-cookbooks/evaluation_results_detailed.csv at 4b36a1e1f64361397f4cbe4fbc7cdace338f16f8

mirror of https://github.com/anthropics/claude-cookbooks.git synced 2025-10-06 01:00:28 +03:00

Alex Notov f0bf214841 Update references from 'Claude Cookbook' to 'Claude Cookbooks'

- Changed all instances of singular 'Claude Cookbook' to plural 'Claude Cookbooks'
- Updated URLs from anthropic-cookbooks to claude-cookbooks
- Applied changes across documentation, code, and data files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-09-17 12:09:41 -06:00

15 KiB

1	question	retrieval_precision	retrieval_recall	retrieval_mrr	e2e_correct
2	How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?	0.3333333333333333	0.5	0.5	False
3	What embeddings provider does Anthropic recommend for customized domain-specific models, and what capabilities does this provider offer?	0.6666666666666666	1.0	1.0	True
4	What are some key success metrics to consider when evaluating Claude's performance on a classification task, and how do they relate to choosing the right model to reduce latency?	0.3333333333333333	0.5	1.0	True
5	What are two ways that Claude for Sheets can improve prompt engineering workflows compared to using chained prompts?	0.3333333333333333	0.5	0.5	True
6	What happens if a prompt for the Text Completions API is missing the "\n\nHuman:" and "\n\nAssistant:" turns?	0.6666666666666666	1.0	1.0	True
7	How do the additional tokens required for tool use in Claude API requests impact pricing compared to regular API requests?	0.6666666666666666	1.0	1.0	True
8	When will the new Anthropic Developer Console features that show API usage, billing details, and rate limits be available?	0.3333333333333333	1.0	1.0	True
9	When deciding whether to use chain-of-thought (CoT) for a task, what are two key factors to consider in order to strike the right balance between performance and latency?	0.6666666666666666	1.0	1.0	False
10	How can I use Claude to more easily digest the content of long PDF documents?	0.3333333333333333	0.5	0.3333333333333333	True
11	According to the documentation, where can you view your organization's current API rate limits in the Claude Console?	0.6666666666666666	1.0	1.0	False
12	How can we measure the performance of the ticket classification system implemented using Claude beyond just accuracy?	0.0	0.0	0.0	False
13	How can you specify a system prompt using the Text Completions API versus the Messages API?	0.3333333333333333	0.5	1.0	True
14	How can you combine XML tags with chain of thought reasoning to create high-performance prompts for Claude?	0.0	0.0	0.0	False
15	When evaluating the Claude model's performance for ticket routing, what three key metrics are calculated and what are the results for the claude-3-haiku-20240307 model on the 91 test samples?	0.0	0.0	0.0	False
16	Before starting to engineer and improve a prompt in Claude, what key things does Anthropic recommend you have in place first?	0.0	0.0	0.0	False
17	How does the Messages API handle mid-response prompting compared to the Text Completions API?	0.6666666666666666	1.0	1.0	True
18	How does Claude's response differ when given a role through a system prompt compared to not having a specific role in the financial analysis example?	0.3333333333333333	1.0	0.5	True
19	What are some quantitative metrics that can be used to measure the success of a sentiment analysis model, and how might specific targets for those metrics be determined?	0.3333333333333333	1.0	0.5	True
20	What is a power user tip mentioned in the documentation for creating high-performance prompts using XML tags?	0.6666666666666666	1.0	1.0	True
21	How can you use an LLM like Claude to automatically grade the outputs of other LLMs based on a rubric?	0.3333333333333333	0.5	0.3333333333333333	True
22	How can you access and deploy Voyage embeddings on AWS Marketplace?	0.3333333333333333	1.0	1.0	True
23	When using tools just to get Claude to produce JSON output following a particular schema, what key things should you do in terms of tool setup and prompting?	0.3333333333333333	0.5	0.3333333333333333	False
24	What are the key differences between the legacy Claude Instant 1.2 model and the Claude 3 Haiku model in terms of capabilities and performance?	0.6666666666666666	0.6666666666666666	1.0	False
25	What is one key benefit of using examples when prompt engineering with Claude?	0.3333333333333333	1.0	0.5	True
26	According to the Claude Documentation, what is one key advantage of using prompt engineering instead of fine-tuning when it comes to adapting an AI model to new domains or tasks?	0.3333333333333333	0.5	1.0	False
27	How can I quickly get started using the Claude for Sheets extension with a pre-made template?	0.6666666666666666	1.0	1.0	True
28	How does the "index" field in the "content_block_delta" event relate to the text being streamed in a response?	0.3333333333333333	0.5	0.5	True
29	How can you include an image as part of a Claude API request, and what image formats are currently supported?	0.0	0.0	0.0	False
30	What is the relationship between time to first token (TTFT) and latency when evaluating a language model's performance?	1.0	1.0	1.0	True
31	How can providing Claude with examples of handling certain edge cases like implicit requests or emotional prioritization help improve its performance in routing support tickets?	0.3333333333333333	0.5	1.0	True
32	How does the stop_reason of "tool_use" relate to the overall workflow of integrating external tools with Claude?	0.3333333333333333	0.5	1.0	True
33	According to the documentation, what error event and corresponding HTTP error code may be sent during periods of high usage for the Claude API when using streaming responses?	1.0	1.0	1.0	True
34	What are the two types of deltas that can be contained in a content_block_delta event when streaming responses from the Claude API?	0.6666666666666666	1.0	1.0	True
35	On what date did Claude 3.5 Sonnet and tool use both become generally available across the Claude API, Amazon Bedrock, and Google Vertex AI?	0.6666666666666666	1.0	1.0	False
36	In what order did Anthropic launch Claude.ai and the Claude iOS app in Canada and Europe?	0.6666666666666666	1.0	1.0	True
37	When the API response from Claude has a stop_reason of "tool_use", what does this indicate and what should be done next to continue the conversation?	0.3333333333333333	0.5	1.0	True
38	What Python libraries are used in the example code snippet for evaluating tone and style in a customer service chatbot?	0.0	0.0	0.0	True
39	What are the two main ways to authenticate when using the Anthropic Python SDK to access Claude models on Amazon Bedrock?	0.6666666666666666	1.0	1.0	True
40	When deciding whether to implement leak-resistant prompt engineering strategies, what two factors should be considered and balanced?	1.0	1.0	1.0	True
41	How can selecting the appropriate Claude model based on your specific requirements help reduce latency in your application?	0.6666666666666666	1.0	1.0	True
42	How can you stream responses from the Claude API using the Python SDK?	0.3333333333333333	0.5	1.0	True
43	How can you guide Claude's response by pre-filling part of the response, and what API parameter is used to generate a short response in this case?	0.0	0.0	0.0	True
44	What is more important when building an eval set for an AI system - having a larger number of test cases with automated grading, or having fewer high-quality test cases graded by humans?	0.3333333333333333	0.5	1.0	True
45	What are the two required fields in a content_block_delta event for a text delta type?	0.6666666666666666	1.0	1.0	False
46	What are two interactive ways to learn how to use Claude's capabilities, such as uploading PDFs and generating embeddings?	0.0	0.0	0.0	False
47	Why does breaking a task into distinct subtasks for chained prompts help improve Claude's accuracy on the overall task?	0.6666666666666666	1.0	1.0	True
48	How does the streaming format for Messages responses differ from Text Completions streaming responses?	0.3333333333333333	1.0	1.0	True
49	What are two ways to start experimenting with Claude as a user, according to Anthropic's documentation?	0.0	0.0	0.0	False
50	How can using chain prompts help reduce errors and inconsistency in complex tasks handled by Claude?	0.6666666666666666	1.0	1.0	True
51	What HTTP status code does an overloaded_error event correspond to in a non-streaming context for the Claude API?	0.6666666666666666	1.0	1.0	True
52	What are the two ways to specify the format in which Voyage AI returns embeddings through its HTTP API?	0.3333333333333333	1.0	1.0	True
53	When streaming API requests that use tools, how are the input JSON deltas for tool_use content blocks sent, and how can they be accumulated and parsed by the client?	0.3333333333333333	0.5	1.0	True
54	What are the two interactive prompt engineering tutorials that Anthropic offers, and how do they differ?	0.6666666666666666	1.0	1.0	True
55	What are some of the key capabilities that make Claude suitable for enterprise use cases requiring integration with specialized applications and processing of large volumes of sensitive data?	0.3333333333333333	1.0	1.0	True
56	As of June 2024, in which regions are Anthropic's Claude.ai API and iOS app available?	0.6666666666666666	0.6666666666666666	1.0	False
57	What are the two main approaches for integrating Claude into a support ticket workflow, and how do they differ in terms of scalability and ease of implementation?	0.6666666666666666	1.0	1.0	True
58	When did Anthropic release a prompt generator tool to help guide Claude in generating high-quality prompts, and through what interface is it available?	0.0	0.0	0.0	False
59	Which Claude 3 model provides the best balance of intelligence and speed for high-throughput tasks like sales forecasting and targeted marketing?	0.0	0.0	0.0	False
60	How can you calculate the similarity between two Voyage embedding vectors, and what is this equivalent to since Voyage embeddings are normalized to length 1?	0.6666666666666666	1.0	1.0	True
61	How can using examples in prompts improve Claude's performance on complex tasks?	0.3333333333333333	0.5	1.0	True
62	What are the two types of content block deltas that can be emitted when streaming responses with tool use, and what does each delta type contain?	1.0	0.75	1.0	True
63	What are two key capabilities of Claude that enable it to build interactive systems and personalized user experiences?	0.3333333333333333	1.0	0.5	False
64	What are the key event types included in a raw HTTP stream response when using message streaming, and what is the typical order they occur in?	0.6666666666666666	1.0	1.0	True
65	What is the maximum number of images that can be included in a single request using the Claude API compared to the claude.ai interface?	0.3333333333333333	0.5	0.3333333333333333	True
66	When Claude's response is cut off due to hitting the max_tokens limit and contains an incomplete tool use block, what should you do to get the full tool use?	0.0	0.0	0.0	False
67	What two steps are needed before running a classification evaluation on Claude according to the documentation?	0.0	0.0	0.0	False
68	How can you use the content parameter in the messages list to influence Claude's response?	0.0	0.0	0.0	False
69	What are two key advantages of prompt engineering over fine-tuning when it comes to model comprehension and general knowledge preservation?	0.5	0.5	1.0	True
70	What are the two main steps to get started with making requests to Claude models on Anthropic's Bedrock API?	0.3333333333333333	0.5	0.5	False
71	How can you check which Claude models are available in a specific AWS region using the AWS CLI?	0.3333333333333333	0.5	1.0	True
72	What argument can be passed to the voyageai.Client.embed() method or the Voyage HTTP API to specify whether the input text is a query or a document?	0.6666666666666666	1.0	1.0	True
73	How do the streaming API delta formats differ between tool_use content blocks and text content blocks?	0.6666666666666666	1.0	1.0	False
74	What are the image file size limits when uploading images to Claude using the API versus on claude.ai?	0.3333333333333333	1.0	1.0	True
75	What is one key consideration when selecting a Claude model for an enterprise use case that needs low latency?	0.6666666666666666	1.0	0.5	True
76	What embedding model does Anthropic recommend for code retrieval, and how does its performance compare to alternatives according to Voyage AI?	0.6666666666666666	1.0	1.0	True
77	What are two ways the Claude Cookbooks can help developers learn to use Anthropic's APIs?	0.6666666666666666	1.0	1.0	True
78	How does the size of the context window impact a language model's ability to utilize retrieval augmented generation (RAG)?	0.6666666666666666	1.0	1.0	True
79	How can the Evaluation tool in Anthropic's Claude platform help improve prompts and build more robust AI applications?	0.3333333333333333	0.5	1.0	True
80	Which Claude model has the fastest comparative latency according to the comparison tables?	0.6666666666666666	1.0	1.0	True
81	How can you build up a conversation with multiple turns using the Anthropic Messages API in Python?	0.6666666666666666	1.0	1.0	True
82	How can using XML tags to provide a specific role or context help improve Claude's analysis of a legal contract compared to not using a role prompt?	0.3333333333333333	0.5	0.3333333333333333	True
83	What are the key differences between how Claude 3 Opus and Claude 3 Sonnet handle missing information when making tool calls?	0.0	0.0	0.0	True
84	What steps should be taken to ensure a reliable deployment of an automated ticket routing system using Claude into a production environment?	0.6666666666666666	1.0	1.0	True
85	How should you evaluate a model's performance on a ticket routing classifier?	0.3333333333333333	0.5	1.0	False
86	What two methods does Anthropic recommend for learning how to prompt engineer with Claude before diving into the techniques?	0.3333333333333333	0.5	1.0	True
87	What are the key differences between a pretrained large language model and Claude in terms of their training and capabilities?	0.3333333333333333	0.5	0.5	True
88	What are some key advantages of using prompt engineering instead of fine-tuning to adapt a pretrained language model for a specific task or domain?	0.6666666666666666	0.6666666666666666	1.0	True
89	How can you authenticate with GCP before running requests to access Claude models on Vertex AI?	0.3333333333333333	0.5	1.0	True
90	What new capabilities and features were introduced by Anthropic on May 10th, 2024 and how do they enable users to create and tailor prompts for specific tasks?	0.3333333333333333	1.0	0.3333333333333333	True
91	On what date did both the Claude 3.5 Sonnet model and the Artifacts feature in Claude.ai become available?	0.6666666666666666	1.0	1.0	True
92	When putting words in Claude's mouth to shape the response, what header and value can you use in the request to limit Claude's response to a single token?	0.3333333333333333	0.5	1.0	True
93	What does the temperature parameter do when working with large language models?	0.3333333333333333	0.5	1.0	True
94	What are two ways to specify API parameters when calling the Claude API using Claude for Sheets?	0.3333333333333333	0.3333333333333333	0.3333333333333333	False
95	How does prefilling the response with an opening curly brace ({ ) affect Claude's output when extracting structured data from text?	0.3333333333333333	1.0	1.0	True
96	What are some helpful resources provided by Anthropic to dive deeper into building with images using Claude?	0.3333333333333333	0.5	1.0	True
97	How do you specify the API key when creating a new Anthropic client in the Python and TypeScript SDK examples?	0.0	0.0	0.0	False
98	What are two key benefits of using the Anthropic Evaluation tool when developing prompts for an AI classification application?	0.3333333333333333	0.5	1.0	True
99	What are the key differences between a pretrained language model like Claude's underlying model, and the final version of Claude available through Anthropic's API?	0.3333333333333333	0.3333333333333333	0.3333333333333333	True
100	What is the IPv6 address range used by Anthropic?	1.0	1.0	1.0	True
101	When using the Python SDK to create a message with Claude, what are two ways you can specify your API key?	0.3333333333333333	0.5	1.0	True
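The per-question scores above can be rolled up into overall averages. The following is a minimal sketch, not the cookbook's own evaluation code. It assumes the underlying file is a standard comma-separated CSV with the header shown in row 1, and that the leading row numbers in the listing are added by the file viewer and are not part of the file. `SAMPLE` and `summarize` are hypothetical names, and the sample holds only three rows copied from the table, so the printed averages describe that sample rather than the full set of 100 questions.

```python
import csv
import io
from statistics import mean

# Three rows copied from the table above, written as comma-separated CSV.
# The quoting and delimiter are assumptions about the raw file's layout.
SAMPLE = '''question,retrieval_precision,retrieval_recall,retrieval_mrr,e2e_correct
"What is the IPv6 address range used by Anthropic?",1.0,1.0,1.0,True
"How can you stream responses from the Claude API using the Python SDK?",0.3333333333333333,0.5,1.0,True
"How can you include an image as part of a Claude API request, and what image formats are currently supported?",0.0,0.0,0.0,False
'''

METRICS = ["retrieval_precision", "retrieval_recall", "retrieval_mrr"]


def summarize(csv_text):
    """Return the mean of each retrieval metric and the end-to-end accuracy."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {m: mean(float(r[m]) for r in rows) for m in METRICS}
    # e2e_correct is stored as the strings "True" / "False".
    summary["e2e_accuracy"] = mean(
        1.0 if r["e2e_correct"].strip() == "True" else 0.0 for r in rows
    )
    summary["n"] = len(rows)
    return summary


summary = summarize(SAMPLE)
for name, value in summary.items():
    print(f"{name}: {value}")
```

To run this on the real file, read its contents with `open(path, newline="")` and pass the text to `summarize` in place of `SAMPLE`.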
