diff --git a/.env.example b/.env.example
index 67a5ccb..de47a5a 100644
--- a/.env.example
+++ b/.env.example
@@ -82,3 +82,6 @@ SPLITTER_TYPE=ast
 # Additional ignore patterns to exclude files/directories (comma-separated)
 # Example: temp/**,*.backup,private/**,uploads/**
 # CUSTOM_IGNORE_PATTERNS=temp/**,*.backup,private/**
+
+# Whether to use hybrid search mode. If true, it will use both dense vector and BM25; if false, it will use only dense vector search.
+# HYBRID_MODE=true
diff --git a/.gitignore b/.gitignore
index 841356e..ed060fc 100644
--- a/.gitignore
+++ b/.gitignore
@@ -54,6 +54,9 @@ Thumbs.db
 *.crx
 *.pem
 
+__pycache__/
+*.log
+
 .claude/*
 CLAUDE.md
 
diff --git a/README.md b/README.md
index e74fa61..7f062bc 100644
--- a/README.md
+++ b/README.md
@@ -403,7 +403,7 @@ For more detailed MCP environment variable configuration, see our [Environment V
 
 ### 🔧 Implementation Details
 
-- 🔍 **Semantic Code Search**: Ask questions like *"find functions that handle user authentication"* and get relevant, context-rich code instantly.
+- 🔍 **Hybrid Code Search**: Ask questions like *"find functions that handle user authentication"* and get relevant, context-rich code instantly using advanced hybrid search (BM25 + dense vector).
 - 🧠 **Context-Aware**: Discover large codebase, understand how different parts of your codebase relate, even across millions of lines of code.
 - ⚡ **Incremental Indexing**: Efficiently re-index only changed files using Merkle trees.
 - 🧩 **Intelligent Code Chunking**: Analyze code in Abstract Syntax Trees (AST) for chunking.
diff --git a/docs/getting-started/environment-variables.md b/docs/getting-started/environment-variables.md
index c34f054..ed55fd8 100644
--- a/docs/getting-started/environment-variables.md
+++ b/docs/getting-started/environment-variables.md
@@ -40,6 +40,7 @@ Claude Context supports a global configuration file at `~/.context/.env` to simp
 ### Advanced Configuration
 | Variable | Description | Default |
 |----------|-------------|---------|
+| `HYBRID_MODE` | Enable hybrid search (BM25 + dense vector). Set to `false` for dense-only search | `true` |
 | `EMBEDDING_BATCH_SIZE` | Batch size for processing. Larger batch size means less indexing time | `100` |
 | `SPLITTER_TYPE` | Code splitter type: `ast`, `langchain` | `ast` |
 | `CUSTOM_EXTENSIONS` | Additional file extensions to include (comma-separated, e.g., `.vue,.svelte,.astro`) | None |
diff --git a/docs/getting-started/overview.md b/docs/getting-started/overview.md
index 8006408..e1f3796 100644
--- a/docs/getting-started/overview.md
+++ b/docs/getting-started/overview.md
@@ -6,8 +6,8 @@ Claude Context is a powerful semantic code search tool that gives AI coding assi
 
 ## Key Features
 
-### 🔍 Semantic Code Search
-Ask natural language questions like "find functions that handle user authentication" and get relevant code snippets from across your entire codebase.
+### 🔍 Hybrid Code Search
+Ask natural language questions like "find functions that handle user authentication" and get relevant code snippets from across your entire codebase using advanced hybrid search (BM25 + dense vector).
 
 ### 🧠 Context-Aware Understanding
 Discover relationships between different parts of your code, even across millions of lines. The system understands code structure, patterns, and dependencies.
@@ -38,8 +38,8 @@ Each code chunk is converted into high-dimensional vectors using state-of-the-ar
 ### 4. Vector Storage
 Embeddings are stored in a vector database (Milvus/Zilliz Cloud) for efficient similarity search.
 
-### 5. Semantic Search
-Natural language queries are converted to vectors and matched against stored code embeddings.
+### 5. Hybrid Search
+Natural language queries are processed using both dense vector embeddings and BM25 sparse retrieval, then combined with RRF (Reciprocal Rank Fusion) for optimal results.
 
 ## Architecture Components
 