# Claude Context MCP Evaluation
This directory contains the evaluation framework and experimental results for comparing the efficiency of code retrieval using Claude Context MCP versus traditional grep-only approaches.
## Overview
We conducted a controlled experiment to measure the impact of adding the Claude Context MCP tool to a baseline coding agent. The evaluation demonstrates significant improvements in token efficiency while maintaining comparable retrieval quality.
## Experimental Design
We designed a controlled experiment comparing two coding agents on identical retrieval tasks. The baseline agent uses simple tools: read, grep, and edit functions. The enhanced agent adds the Claude Context MCP tool to this same foundation. Both agents work on the same dataset with the same model to ensure a fair comparison, and both are implemented as ReAct agents using LangGraph's MCP integration.
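For illustration, a minimal sketch of how the enhanced agent can be wired together is below, assuming the current `langchain-mcp-adapters` and `langgraph` APIs; the server command, package version tag, and prompt are hypothetical stand-ins rather than the exact configuration used in this evaluation.

```python
# Minimal sketch of the enhanced agent: a LangGraph ReAct agent whose tool
# list includes the tools exposed by a Claude Context MCP server.
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


async def main() -> None:
    # Launch and connect to the Claude Context MCP server over stdio.
    client = MultiServerMCPClient(
        {
            "claude-context": {
                "command": "npx",
                "args": ["-y", "@zilliz/claude-context-mcp@latest"],
                "transport": "stdio",
            }
        }
    )
    mcp_tools = await client.get_tools()

    # The baseline agent would pass only read/grep/edit tools here; the
    # enhanced agent appends the MCP tools to that same base tool list.
    agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), tools=mcp_tools)
    result = await agent.ainvoke(
        {"messages": [{"role": "user", "content": "Locate the files that need changes for this issue."}]}
    )
    print(result["messages"][-1].content)


asyncio.run(main())
```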
We selected 30 instances from Princeton NLP's SWE-bench_Verified dataset, filtering for problems rated at 15-60 minutes of difficulty whose gold patches modify exactly 2 files. This subset represents typical coding tasks and enables quick validation. The dataset generation is implemented in `generate_subset_json.py`.
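As a rough sketch of that filtering (the `difficulty` field name, its label string, and the output filename are assumptions based on the public SWE-bench_Verified schema; `generate_subset_json.py` is the authoritative implementation):

```python
# Sketch of the subset selection: SWE-bench_Verified instances rated
# "15 min - 1 hour" whose gold patch touches exactly two files.
import json

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")


def files_in_patch(patch: str) -> int:
    # Each modified file contributes one "diff --git" header to the patch.
    return patch.count("diff --git")


subset = [
    inst
    for inst in ds
    if inst.get("difficulty") == "15 min - 1 hour" and files_in_patch(inst["patch"]) == 2
][:30]

with open("swe_bench_subset.json", "w") as f:  # hypothetical output path
    json.dump(subset, f, indent=2)
```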
We chose GPT-4o-mini as the default model for cost-effectiveness.
We ran each method 3 times independently, giving 6 total runs for statistical reliability, and measured token usage, tool calls, and retrieval precision, recall, and F1-score across all runs. The main entry point for running evaluations is `run_evaluation.py`.
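The retrieval-quality metrics can be computed at the file level by comparing the files an agent retrieves against the files changed by the gold patch. A minimal sketch (function and variable names are illustrative, not the evaluation's actual code):

```python
# Sketch of file-level precision/recall/F1 against the gold patch files.
def retrieval_scores(retrieved: set[str], gold: set[str]) -> dict[str, float]:
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Example: the agent found one of the two gold files plus one extra file.
print(retrieval_scores({"src/core.py", "src/utils.py"}, {"src/core.py", "src/io.py"}))
# -> {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```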
## Key Results
### Performance Summary
| Metric | Baseline (Grep Only) | With Claude Context MCP | Improvement |
|---|---|---|---|
| Average F1-Score | 0.40 | 0.40 | Comparable |
| Average Token Usage | 73,373 | 44,449 | -39.4% |
| Average Tool Calls | 8.3 | 5.3 | -36.3% |
### Key Findings
**Dramatic Efficiency Gains.** With Claude Context MCP, we achieved:
- 39.4% reduction in token consumption (28,924 tokens saved per instance)
- 36.3% reduction in tool calls (3.0 fewer calls per instance)
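These percentages follow directly from the averages in the summary table; a quick back-of-envelope check (the small gap on tool calls comes from the table's rounded averages):

```python
# Sanity check of the reported reductions using the table's averages.
baseline_tokens, mcp_tokens = 73_373, 44_449
baseline_calls, mcp_calls = 8.3, 5.3

print(baseline_tokens - mcp_tokens)                      # 28924 tokens saved per instance
print((baseline_tokens - mcp_tokens) / baseline_tokens)  # ~0.394 -> 39.4% reduction
print((baseline_calls - mcp_calls) / baseline_calls)     # ~0.361; the reported 36.3% uses unrounded averages
```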
## Conclusion
The results demonstrate that Claude Context MCP provides:
### Immediate Benefits
- **Cost Efficiency**: ~40% reduction in token usage directly reduces operational costs
- **Speed Improvement**: Fewer tool calls and tokens mean faster code localization and task completion
- **Better Quality**: Under a limited context window, the token savings leave more room for relevant code, so Claude Context yields better retrieval and answer results
### Strategic Advantages
- **Better Resource Utilization**: Under a fixed token budget, Claude Context MCP enables handling more tasks
- **Wider Applicability**: Lower per-task costs make more usage scenarios economical
- **Improved User Experience**: Faster responses with maintained accuracy
## Running the Evaluation
To reproduce these results:
1. **Install Dependencies**:

   For the Python environment, you can use `uv` to install the lockfile dependencies.

   ```bash
   cd evaluation && uv sync
   source .venv/bin/activate
   ```

   For the Node environment, make sure your `node` version satisfies Node.js >= 20.0.0 and < 24.0.0.

   Our evaluation results were tested on `claude-context-mcp@0.1.0`; you can change the `claude-context` MCP server setting in the `retrieval/custom.py` file to get the latest version or use a development version (see the configuration sketch after this list).

2. **Set Environment Variables**:

   ```bash
   export OPENAI_API_KEY=your_openai_api_key
   export MILVUS_ADDRESS=your_milvus_address
   ```

   For more configuration details, refer to the `claude-context` MCP server settings in the `retrieval/custom.py` file.

   ```bash
   export GITHUB_TOKEN=your_github_token
   ```

   You also need to prepare a `GITHUB_TOKEN` for automatically cloning the repositories; refer to the SWE-bench documentation for more details.

3. **Generate Dataset**:

   ```bash
   python generate_subset_json.py
   ```

4. **Run Baseline Evaluation**:

   ```bash
   python run_evaluation.py --retrieval_types grep --output_dir retrieval_results_grep
   ```

5. **Run Enhanced Evaluation**:

   ```bash
   python run_evaluation.py --retrieval_types cc,grep --output_dir retrieval_results_both
   ```

6. **Analyze Results**:

   ```bash
   python analyze_and_plot_mcp_efficiency.py
   ```
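For reference, a version-pinned server entry might look roughly like the following sketch. The dictionary shape mirrors a typical stdio MCP client configuration; the package name, keys, and environment variables are assumptions rather than the actual contents of `retrieval/custom.py`.

```python
# Hypothetical sketch of pinning the claude-context MCP server version.
# The real setting lives in retrieval/custom.py and may be shaped differently.
MCP_SERVERS = {
    "claude-context": {
        "command": "npx",
        # Pin to the version our results were tested on; use @latest to try newer releases.
        "args": ["-y", "@zilliz/claude-context-mcp@0.1.0"],
        "transport": "stdio",
        "env": {
            "OPENAI_API_KEY": "your_openai_api_key",
            "MILVUS_ADDRESS": "your_milvus_address",
        },
    }
}
```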
The evaluation framework is designed to be reproducible and can be easily extended to test additional configurations or datasets. Because LLM outputs are non-deterministic, exact numerical results may vary between runs and cannot be guaranteed to be identical. However, the core conclusions drawn from the analysis remain consistent and robust across runs.
## Results Visualization
The chart above shows the efficiency improvements achieved by Claude Context MCP: both token usage and tool calls are significantly reduced.
## Case Study
For a detailed analysis of why grep-only approaches have limitations and how semantic search addresses these challenges, please refer to our Case Study, which provides in-depth comparisons and analysis of these experimental results.
