Mirror of https://github.com/microsoft/graphrag.git (synced 2025-03-11 01:26:14 +03:00)
Next release docs (#1627)
* Wording updates
* Update yaml config and add notes to "deprecated" env
* Add basic search section
* Update versioning docs
* Minor edits for clarity
* Update init command
* Update init to add --force in docs
* Add NLP extraction params
* Move vector_store to root
* Add workflows to config
* Add FastGraphRAG docs
* Add metadata column changes
* Added documentation for multi index search
* Minor fixes
* Add config and table renames
* Update migration notebook and comments to specify v1
* Add frequency to entity table docs
* Add new chunking options for metadata
* Update output docs
* Minor edits and cleanup
* Add model ids to search configs
* Spruce up migration notebook
* Lint/format multi-index notebook
* SpaCy model note
* Update SpaCy footnote
* Updated multi_index_search.ipynb to remove ruff errors
* Add spacy to dictionary

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Dayenne Souza <ddesouza@microsoft.com>
Co-authored-by: dorbaker <dorbaker@microsoft.com>
@@ -47,6 +47,12 @@ This repository presents a methodology for using knowledge graph memory structur

Using _GraphRAG_ with your data out of the box may not yield the best possible results.
We strongly recommend fine-tuning your prompts following the [Prompt Tuning Guide](https://microsoft.github.io/graphrag/prompt_tuning/overview/) in our documentation.

## Versioning

Please see the [breaking changes](./breaking-changes.md) document for notes on our approach to versioning the project.

*Always run `graphrag init --root [path] --force` between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so backup if necessary.*

## Responsible AI FAQ

See [RAI_TRANSPARENCY.md](./RAI_TRANSPARENCY.md)
breaking-changes.md (new file, 92 lines)
@@ -0,0 +1,92 @@

# GraphRAG Data Model and Config Breaking Changes

This document contains notes about our versioning approach and a log of changes over time that may result in breakage. As of version 1.0 we are aligning more closely with standard [semantic versioning](https://semver.org/) practices. However, this is an ongoing research project that needs to balance experimental progress with stakeholder communication about big feature releases, so there may be times when we don't adhere perfectly to the spec.

There are five surface areas that may be impacted on any given release. They are:

- [CLI](https://microsoft.github.io/graphrag/cli/) - The CLI is the interface most project consumers are using. **Changes to the CLI will conform to standard semver.**
- [API](https://github.com/microsoft/graphrag/tree/main/graphrag/api) - The API layer is the primary interface we expect developers to use if they are consuming the project as a library in their own codebases. **Changes to the API layer modules will conform to standard semver.**
- Internals - Any code modules behind the CLI and API layers are considered "internal" and may change at any time without conforming to strict semver. This is intended to give the research team high flexibility to change our underlying implementation rapidly. We are not enforcing access via tightly controlled `__init__.py` files, so please understand that if you utilize modules other than the index or query API, they may break between releases in a non-semver-compliant manner.
- [settings.yaml](https://microsoft.github.io/graphrag/config/yaml/) - The settings.yaml file may have changes made to it as we adjust configurability. **Changes that affect settings.yaml will result in a minor version bump.** `graphrag init` will always emit a compatible starter config, so we recommend always running the command when updating GraphRAG between minor versions, and copying your endpoint information or other customizations over to the new file.
- [Data model](https://microsoft.github.io/graphrag/index/outputs/) - The output data model may change over time as we adjust our approach. **Changes to the data model will conform to standard semver.** Any changes to the output tables will be shimmed for backwards compatibility between major releases, and we'll provide a migration notebook for folks to upgrade without requiring a re-index.

> TL;DR: Always run `graphrag init --root [path] --force` between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so backup if necessary.

# v1

Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v1.ipynb) to convert older tables to the v1 format.

Note that one of the new requirements is that we write embeddings to a vector store during indexing. By default, this uses a local lancedb instance. When you re-generate the default config, a block will be added to reflect this. If you need to write to Azure AI Search instead, we recommend updating these settings before you index, so you don't need to do a separate vector ingest.
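
For illustration, a minimal sketch of what that generated block could look like (field names follow the `vector_store` settings documented in the YAML config; depending on your version the block may sit at the root of settings.yaml or under the embeddings section, and the exact defaults emitted by `graphrag init` may differ):

```yaml
vector_store:
  type: lancedb              # switch to azure_ai_search to write directly to AI Search
  db_uri: output/lancedb     # lancedb only: where the local database is created
  container_name: default    # prefix used for the vector indexes (tables)
  overwrite: true            # recreate collections at index time
```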

All of the breaking changes listed below are accounted for in the steps above.

## Updated data model

- We have streamlined the data model of the index in a few small ways to align tables more consistently and remove redundant content. Notably:
  - Consistent use of `id` and `human_readable_id` across all tables; this also ensures all int IDs are actually saved as ints and never strings
  - Alignment of fields from `create_final_entities` (such as name -> title) with `create_final_nodes`, and removal of redundant content across these tables
  - Rename of `document.raw_content` to `document.text`
  - Rename of `entity.name` to `entity.title`
  - Rename of `rank` to `combined_degree` in `create_final_relationships`, and removal of the `source_degree` and `target_degree` fields
  - Fixed community tables to use a proper UUID for the `id` field, and retain `community` and `human_readable_id` for the short IDs
  - Removal of all embeddings columns from parquet files in favor of direct vector store writes

### Migration

- Run the migration notebook (some recent changes may invalidate existing caches, so migrating the format is cheaper than re-indexing).

## New required Embeddings

### Change

- Added new required embeddings for `DRIFTSearch` and base RAG capabilities.

### Migration

- Run a new index, leveraging the existing cache.

## Vector Store required by default

### Change

- Vector store is now required by default for all search methods.

### Migration

- Run the `graphrag init` command to generate a new settings.yaml file with the vector store configuration.
- Run a new index, leveraging the existing cache.

## Deprecate timestamp paths

### Change

- Remove support for timestamp paths, i.e., those using `${timestamp}` directory nesting.
- Use the same directory for storage output and reporting output.

### Migration

- Ensure output directories no longer use `${timestamp}` directory nesting.

**Using Environment Variables**

- Ensure `GRAPHRAG_STORAGE_BASE_DIR` is set to a static directory, e.g., `output` instead of `output/${timestamp}/artifacts`.
- Ensure `GRAPHRAG_REPORTING_BASE_DIR` is set to a static directory, e.g., `output` instead of `output/${timestamp}/reports`.
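
For example, the equivalent static settings in a shell session (or a `.env` file) would look like this:

```sh
export GRAPHRAG_STORAGE_BASE_DIR="output"
export GRAPHRAG_REPORTING_BASE_DIR="output"
```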

[Full docs on using environment variables for configuration](https://microsoft.github.io/graphrag/config/env_vars/).

**Using Configuration File**

```yaml
# rest of settings.yaml file
# ...

storage:
  type: file
  base_dir: "output" # changed from "output/${timestamp}/artifacts"

reporting:
  type: file
  base_dir: "output" # changed from "output/${timestamp}/reports"
```

[Full docs on using YAML files for configuration](https://microsoft.github.io/graphrag/config/yaml/).
@@ -78,6 +78,7 @@ semversioner

mkdocs
fnllm
typer
spacy

# Library Methods
iterrows
@@ -1,12 +1,18 @@

# Default Configuration Mode (using Env Vars)

As of version 1.3, GraphRAG no longer supports a full complement of pre-built environment variables. Instead, we support variable replacement within the [settings.yml file](yaml.md) so you can specify any environment variables you like.

The only standard environment variable we expect, and include in the default settings.yml, is `GRAPHRAG_API_KEY`. If you are already using a number of the previous GRAPHRAG_* environment variables, you can insert them with template syntax into settings.yml and they will be adopted.
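
For example, a sketch of the template syntax in settings.yml (the `models` block shown here follows the YAML configuration docs; `MY_AOAI_ENDPOINT` is a hypothetical variable name you would export yourself):

```yaml
models:
  default_chat_model:
    api_key: ${GRAPHRAG_API_KEY}    # replaced from the environment when the config is loaded
    api_base: ${MY_AOAI_ENDPOINT}   # hypothetical: any variable you export can be templated this way
```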

> **The environment variables below are documented as an aid for migration, but they WILL NOT be read unless you use template syntax in your settings.yml.**

---

### Text-Embeddings Customization

By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the `GRAPHRAG_EMBEDDING_TARGET` environment variable to `all`.

If the embedding target is `all`, and you want to only embed a subset of these fields, you may specify which embeddings to skip using the `GRAPHRAG_EMBEDDING_SKIP` argument described below.

#### Embedded Fields

- `text_unit.text`
- `document.text`
@@ -17,11 +23,11 @@ If the embedding target is `all`, and you want to only embed a subset of these f

- `community.summary`
- `community.full_content`

### Input Data

Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, what fields are mapped over, and how timestamps are parsed, look for configuration values starting with `GRAPHRAG_INPUT_` below. In general, CSV-based data provides the most customizability. Each CSV should at least contain a `text` field (which can be mapped with environment variables), but it's helpful if they also have `title`, `timestamp`, and `source` fields. Additional fields can be included as well, which will land as extra fields on the `Document` table.

### Base LLM Settings

These are the primary settings for configuring LLM connectivity.
@@ -33,7 +39,7 @@ These are the primary settings for configuring LLM connectivity.

| `GRAPHRAG_API_ORGANIZATION` | | The AOAI organization. | `str` | `None` |
| `GRAPHRAG_API_PROXY` | | The AOAI proxy. | `str` | `None` |

### Text Generation Settings

These settings control the text generation model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
@@ -62,7 +68,7 @@ These settings control the text generation model used by the pipeline. Any setti

| `GRAPHRAG_LLM_TOP_P` | | The top_p to use for sampling. | `float` | 1 |
| `GRAPHRAG_LLM_N` | | The number of responses to generate. | `int` | 1 |

### Text Embedding Settings

These settings control the text embedding model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
@@ -78,8 +84,7 @@ These settings control the text embedding model used by the pipeline. Any settin

| `GRAPHRAG_EMBEDDING_MODEL` | | The model to use for the embedding client. | `str` | `text-embedding-3-small` |
| `GRAPHRAG_EMBEDDING_BATCH_SIZE` | | The number of texts to embed at once. [(Azure limit is 16)](https://learn.microsoft.com/en-us/azure/ai-ce) | `int` | 16 |
| `GRAPHRAG_EMBEDDING_BATCH_MAX_TOKENS` | | The maximum tokens per batch [(Azure limit is 8191)](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference) | `int` | 8191 |
| `GRAPHRAG_EMBEDDING_TARGET` | | The target fields to embed. Either `required` or `all`. | `str` | `required` |
| `GRAPHRAG_EMBEDDING_THREAD_COUNT` | | The number of threads to use for parallelization for embeddings. | `int` | |
| `GRAPHRAG_EMBEDDING_THREAD_STAGGER` | | The time to wait (in seconds) between starting each thread for embeddings. | `float` | 50 |
| `GRAPHRAG_EMBEDDING_CONCURRENT_REQUESTS` | | The number of concurrent requests to allow for the embedding client. | `int` | 25 |
@@ -89,41 +94,38 @@ These settings control the text embedding model used by the pipeline. Any settin

| `GRAPHRAG_EMBEDDING_MAX_RETRY_WAIT` | | The maximum number of seconds to wait between retries. | `int` | 10 |
| `GRAPHRAG_EMBEDDING_SLEEP_ON_RATE_LIMIT_RECOMMENDATION` | | Whether to sleep on rate limit recommendation. (Azure Only) | `bool` | `True` |

### Input Settings

These settings control the data input used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.

#### Plaintext Input Data (`GRAPHRAG_INPUT_FILE_TYPE`=text)

| Parameter | Description | Type | Required or Optional | Default |
| --------- | ----------- | ---- | -------------------- | ------- |
| `GRAPHRAG_INPUT_FILE_PATTERN` | The file pattern regexp to use when reading input files from the input directory. | `str` | optional | `.*\.txt$` |

#### CSV Input Data (`GRAPHRAG_INPUT_FILE_TYPE`=csv)

| Parameter | Description | Type | Required or Optional | Default |
| --------- | ----------- | ---- | -------------------- | ------- |
| `GRAPHRAG_INPUT_TYPE` | The input storage type to use when reading files. (`file` or `blob`) | `str` | optional | `file` |
| `GRAPHRAG_INPUT_FILE_PATTERN` | The file pattern regexp to use when reading input files from the input directory. | `str` | optional | `.*\.txt$` |
| `GRAPHRAG_INPUT_TEXT_COLUMN` | The 'text' column to use when reading CSV input files. | `str` | optional | `text` |
| `GRAPHRAG_INPUT_METADATA` | A list of CSV columns, comma-separated, to incorporate as JSON in a metadata column. | `str` | optional | `None` |
| `GRAPHRAG_INPUT_TITLE_COLUMN` | The 'title' column to use when reading CSV input files. | `str` | optional | `title` |
| `GRAPHRAG_INPUT_STORAGE_ACCOUNT_BLOB_URL` | The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net` | `str` | optional | `None` |
| `GRAPHRAG_INPUT_CONNECTION_STRING` | The connection string to use when reading CSV input files from Azure Blob Storage. | `str` | optional | `None` |
| `GRAPHRAG_INPUT_CONTAINER_NAME` | The container name to use when reading CSV input files from Azure Blob Storage. | `str` | optional | `None` |
| `GRAPHRAG_INPUT_BASE_DIR` | The base directory to read input files from. | `str` | optional | `None` |

### Data Mapping Settings

| Parameter | Description | Type | Required or Optional | Default |
| --------- | ----------- | ---- | -------------------- | ------- |
| `GRAPHRAG_INPUT_FILE_TYPE` | The type of input data, `csv` or `text` | `str` | optional | `text` |
| `GRAPHRAG_INPUT_ENCODING` | The encoding to apply when reading CSV/text input files. | `str` | optional | `utf-8` |

### Data Chunking

| Parameter | Description | Type | Required or Optional | Default |
| --------- | ----------- | ---- | -------------------- | ------- |
@@ -132,7 +134,7 @@ These settings control the data input used by the pipeline. Any settings with a

| `GRAPHRAG_CHUNK_BY_COLUMNS` | A comma-separated list of document attributes to groupby when performing TextUnit chunking. | `str` | optional | `id` |
| `GRAPHRAG_CHUNK_ENCODING_MODEL` | The encoding model to use for chunking. | `str` | optional | The top-level encoding model. |

### Prompting Overrides

| Parameter | Description | Type | Required or Optional | Default |
| --------- | ----------- | ---- | -------------------- | ------- |
@@ -150,7 +152,7 @@ These settings control the data input used by the pipeline. Any settings with a

| `GRAPHRAG_COMMUNITY_REPORTS_PROMPT_FILE` | The community reports extraction prompt to utilize. | `string` | optional | `None` |
| `GRAPHRAG_COMMUNITY_REPORTS_MAX_LENGTH` | The maximum number of tokens to generate per community report. | `int` | optional | 1500 |

### Storage

This section controls the storage mechanism used by the pipeline for exporting output tables.
@@ -162,7 +164,7 @@ This section controls the storage mechanism used by the pipeline used for export

| `GRAPHRAG_STORAGE_CONTAINER_NAME` | The Azure Storage container name to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_STORAGE_BASE_DIR` | The base path to data outputs. | `str` | optional | None |

### Cache

This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.
@@ -174,7 +176,7 @@ This section controls the cache mechanism used by the pipeline. This is used to

| `GRAPHRAG_CACHE_CONTAINER_NAME` | The Azure Storage container name to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_CACHE_BASE_DIR` | The base path to the cache files. | `str` | optional | None |

### Reporting

This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
@@ -186,7 +188,7 @@ This section controls the reporting mechanism used by the pipeline, for common e

| `GRAPHRAG_REPORTING_CONTAINER_NAME` | The Azure Storage container name to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_REPORTING_BASE_DIR` | The base path to the reporting outputs. | `str` | optional | None |

### Node2Vec Parameters

| Parameter | Description | Type | Required or Optional | Default |
| --------- | ----------- | ---- | -------------------- | ------- |
@@ -197,7 +199,7 @@ This section controls the reporting mechanism used by the pipeline, for common e

| `GRAPHRAG_NODE2VEC_ITERATIONS` | The number of iterations to run node2vec | `int` | optional | 3 |
| `GRAPHRAG_NODE2VEC_RANDOM_SEED` | The random seed to use for node2vec | `int` | optional | 597832 |

### Data Snapshotting

| Parameter | Description | Type | Required or Optional | Default |
| --------- | ----------- | ---- | -------------------- | ------- |
@@ -214,5 +216,4 @@ This section controls the reporting mechanism used by the pipeline, for common e

| `GRAPHRAG_ASYNC_MODE` | Which async mode to use. Either `asyncio` or `threaded`. | `str` | optional | `asyncio` |
| `GRAPHRAG_ENCODING_MODEL` | The text encoding model, used in tiktoken, to encode text. | `str` | optional | `cl100k_base` |
| `GRAPHRAG_MAX_CLUSTER_SIZE` | The maximum number of entities to include in a single Leiden cluster. | `int` | optional | 10 |
| `GRAPHRAG_UMAP_ENABLED` | Whether to enable UMAP layouts | `bool` | optional | False |
@@ -5,12 +5,13 @@ To start using GraphRAG, you must generate a configuration file. The `init` comm

## Usage

```sh
graphrag init [--root PATH] [--force, --no-force]
```

## Options

- `--root PATH` - The project root directory to initialize graphrag at. Default is the current directory.
- `--force`, `--no-force` - Optional, default is `--no-force`. Overwrite existing configuration and prompt files if they exist.

## Example
@@ -8,4 +8,4 @@ The default configuration mode is the simplest way to get started with the Graph

- [Init command](init.md) (recommended)
- [Using YAML for deeper control](yaml.md)
- [Purely using environment variables](env_vars.md) (not recommended)
@@ -19,21 +19,37 @@ llm:

## Indexing

### models

This is a dict of model configurations. The dict key is used to reference this configuration elsewhere when a model instance is desired. In this way, you can specify as many different models as you need, and reference them differentially in the workflow steps.

For example:

```yml
models:
  default_chat_model:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_chat
    model: gpt-4o
    model_supports_json: true
  default_embedding_model:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: text-embedding-ada-002
```

#### Fields

- `api_key` **str** - The OpenAI API key to use.
- `type` **openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding** - The type of LLM to use.
- `model` **str** - The model name.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- `max_tokens` **int** - The maximum number of output tokens.
- `request_timeout` **float** - The per-request timeout.
- `api_base` **str** - The API base url to use.
- `api_version` **str** - The API version.
- `organization` **str** - The client organization.
- `proxy` **str** - The proxy URL to use.
- `azure_auth_type` **api_key|managed_identity** - If using Azure, indicate how you want to authenticate requests.
- `audience` **str** - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
- `deployment_name` **str** - The deployment name to use (Azure).
- `model_supports_json` **bool** - Whether the model supports JSON-mode output.
@@ -46,41 +62,50 @@ This is the base LLM configuration section. Other steps may override this config

- `temperature` **float** - The temperature to use.
- `top_p` **float** - The top-p value to use.
- `n` **int** - The number of completions to generate.
- `parallelization_stagger` **float** - The threading stagger value.
- `parallelization_num_threads` **int** - The maximum number of work threads.
- `async_mode` **asyncio|threaded** - The async mode to use. Either `asyncio` or `threaded`.

### embed_text

By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be customized by setting the `target` and `names` fields.

Supported embeddings names are:

- `text_unit.text`
- `document.text`
- `entity.title`
- `entity.description`
- `relationship.description`
- `community.title`
- `community.summary`
- `community.full_content`

#### Fields

- `model_id` **str** - Name of the model definition to use for text embedding.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum batch # of tokens.
- `target` **required|all|selected|none** - Determines which set of embeddings to export.
- `names` **list[str]** - If target=selected, this should be an explicit list of the embeddings names we support.
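
As an illustration, a sketch of this section in settings.yaml (the `default_embedding_model` key refers to an entry in your `models` block; values are examples only):

```yml
embed_text:
  model_id: default_embedding_model
  target: selected
  names:
    - entity.description
    - community.full_content
```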

### vector_store

Where to put all vectors for the system. Configured for lancedb by default.

#### Fields

- `type` **str** - `lancedb` or `azure_ai_search`. Default=`lancedb`
- `db_uri` **str** (only for lancedb) - The database uri. Default=`storage.base_dir/lancedb`
- `url` **str** (only for AI Search) - AI Search endpoint
- `api_key` **str** (optional - only for AI Search) - The AI Search api key to use.
- `audience` **str** (only for AI Search) - Audience for managed identity token if managed identity authentication is used.
- `overwrite` **bool** (only used at index creation time) - Overwrite collection if it exists. Default=`True`
- `container_name` **str** - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=`default`
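
For example, a sketch of an Azure AI Search configuration (the endpoint and the `AZURE_AI_SEARCH_API_KEY` variable are placeholders you would supply yourself):

```yml
vector_store:
  type: azure_ai_search
  url: https://<search_service_name>.search.windows.net
  api_key: ${AZURE_AI_SEARCH_API_KEY}   # optional; omit to authenticate with managed identity (see audience)
  container_name: default
```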

### input

Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. In general, CSV-based data provides the most customizability. Each CSV should at least contain a `text` field. You can use the `metadata` list to specify additional columns from the CSV to include as headers in each text chunk, allowing you to repeat document content within each chunk for better LLM inclusion.

#### Fields

- `type` **file|blob** - The input type to use. Default=`file`
@@ -92,25 +117,26 @@ This is the base LLM configuration section. Other steps may override this config

- `file_encoding` **str** - The encoding of the input file. Default is `utf-8`
- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$` if in csv mode and `.*\.txt$` if in text mode.
- `file_filter` **dict** - Key/value pairs to filter. Default is None.
- `text_column` **str** - (CSV Mode Only) The text column name.
- `metadata` **list[str]** - (CSV Mode Only) The additional document attributes to include.

### chunks

These settings configure how we parse documents into text chunks. This is necessary because very large documents may not fit into a single context window, and graph extraction accuracy can be modulated. Also note the `metadata` setting in the input document config, which will replicate document metadata into each chunk.

#### Fields

- `size` **int** - The max chunk size in tokens.
- `overlap` **int** - The chunk overlap in tokens.
- `group_by_columns` **list[str]** - Group documents by fields before chunking.
- `encoding_model` **str** - The text encoding model to use for splitting on token boundaries.
- `prepend_metadata` **bool** - Determines if metadata values should be added at the beginning of each chunk. Default=`False`.
- `chunk_size_includes_metadata` **bool** - Specifies whether the chunk size calculation should include metadata tokens. Default=`False`.
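
A sketch of this section with illustrative values:

```yml
chunks:
  size: 1200                          # max tokens per chunk
  overlap: 100                        # tokens shared between consecutive chunks
  group_by_columns: [id]
  prepend_metadata: true              # copy document metadata into the head of each chunk
  chunk_size_includes_metadata: false # metadata tokens are not counted against `size`
```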

### cache

This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.

#### Fields

- `type` **file|memory|none|blob** - The cache type to use. Default=`file`
@@ -119,7 +145,9 @@ This is the base LLM configuration section. Other steps may override this config

- `base_dir` **str** - The base directory to write cache to, relative to the root.
- `storage_account_blob_url` **str** - The storage account blob URL to use.

### output

This section controls the storage mechanism used by the pipeline for exporting output tables.

#### Fields
@@ -131,6 +159,8 @@ This is the base LLM configuration section. Other steps may override this config

### update_index_storage

This section defines a secondary storage location for running incremental indexing, to preserve your original outputs.

#### Fields

- `type` **file|memory|blob** - The storage type to use. Default=`file`
@@ -141,6 +171,8 @@ This is the base LLM configuration section. Other steps may override this config

### reporting

This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.

#### Fields

- `type` **file|console|blob** - The reporting type to use. Default=`file`
@@ -149,65 +181,89 @@ This is the base LLM configuration section. Other steps may override this config

- `base_dir` **str** - The base directory to write reports to, relative to the root.
- `storage_account_blob_url` **str** - The storage account blob URL to use.

### extract_graph

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.

### summarize_descriptions

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.

### extract_graph_nlp

Defines settings for NLP-based graph extraction methods.

#### Fields

- `normalize_edge_weights` **bool** - Whether to normalize the edge weights during graph construction. Default=`True`.
- `text_analyzer` **dict** - Parameters for the NLP model.
  - `extractor_type` **regex_english|syntactic_parser|cfg** - Default=`regex_english`.
  - `model_name` **str** - Name of NLP model (for SpaCy-based models).
  - `max_word_length` **int** - Longest word to allow. Default=`15`.
  - `word_delimiter` **str** - Delimiter to split words. Default ' '.
  - `include_named_entities` **bool** - Whether to include named entities in noun phrases. Default=`True`.
  - `exclude_nouns` **list[str] | None** - List of nouns to exclude. If `None`, we use an internal stopword list.
  - `exclude_entity_tags` **list[str]** - List of entity tags to ignore.
  - `exclude_pos_tags` **list[str]** - List of part-of-speech tags to ignore.
  - `noun_phrase_tags` **list[str]** - List of noun phrase tags to ignore.
  - `noun_phrase_grammars` **dict[str, str]** - Noun phrase grammars for the model (cfg-only).
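
As a sketch (the SpaCy model name is illustrative; use whichever model you have installed):

```yml
extract_graph_nlp:
  normalize_edge_weights: true
  text_analyzer:
    extractor_type: regex_english
    model_name: en_core_web_md   # illustrative SpaCy model name
    max_word_length: 15
```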

### extract_claims

#### Fields

- `enabled` **bool** - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.

### community_reports

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.

### prune_graph

Parameters for manual graph pruning. This can be used to optimize the modularity of your graph clusters, by removing overly-connected or rare nodes.

#### Fields

- `min_node_freq` **int** - The minimum node frequency to allow.
- `max_node_freq_std` **float | None** - The maximum standard deviation of node frequency to allow.
- `min_node_degree` **int** - The minimum node degree to allow.
- `max_node_degree_std` **float | None** - The maximum standard deviation of node degree to allow.
- `min_edge_weight_pct` **int** - The minimum edge weight percentile to allow.
- `remove_ego_nodes` **bool** - Remove ego nodes.
- `lcc_only` **bool** - Only use the largest connected component.

### cluster_graph

These are the settings used for Leiden hierarchical clustering of the graph to create communities.

#### Fields

- `max_cluster_size` **int** - The maximum cluster size to export.
- `use_lcc` **bool** - Whether to only use the largest connected component.
- `seed` **int** - A randomization seed to provide if consistent run-to-run results are desired. We do provide a default in order to guarantee clustering stability.

### embed_graph

We use node2vec to embed the graph. This is primarily used for visualization, so it is not turned on by default. However, if you do prefer to embed the graph for secondary analysis, you can turn this on and we will persist the embeddings to your configured vector store.

#### Fields

- `enabled` **bool** - Whether to enable graph embeddings.
@@ -220,6 +276,8 @@ This is the base LLM configuration section. Other steps may override this config

### umap

Indicates whether we should run UMAP dimensionality reduction. This is used to provide an x/y coordinate to each graph node, suitable for visualization. If this is not enabled, nodes will receive a 0/0 x/y coordinate. If this is enabled, you *must* enable graph embedding as well.

#### Fields

- `enabled` **bool** - Whether to enable UMAP layouts.
@@ -230,15 +288,6 @@ This is the base LLM configuration section. Other steps may override this config

- `embeddings` **bool** - Export embeddings snapshots to parquet.
- `graphml` **bool** - Export graph snapshots to GraphML.

## Query
@@ -246,6 +295,8 @@ This is the base LLM configuration section. Other steps may override this config

#### Fields

- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `text_unit_prop` **float** - The text unit proportion.
- `community_prop` **float** - The community proportion.
@@ -262,6 +313,7 @@ This is the base LLM configuration section. Other steps may override this config

#### Fields

- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `map_prompt` **str** - The mapper prompt file to use.
- `reduce_prompt` **str** - The reducer prompt file to use.
- `knowledge_prompt` **str** - The knowledge prompt file to use.
@@ -288,7 +340,10 @@ This is the base LLM configuration section. Other steps may override this config

#### Fields

- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `reduce_prompt` **str** - The reducer prompt file to use.
- `temperature` **float** - The temperature to use for token generation.
- `top_p` **float** - The top-p value to use for token generation.
- `n` **int** - The number of completions to generate.
@@ -308,3 +363,25 @@ This is the base LLM configuration section. Other steps may override this config

- `local_search_top_p` **float** - The top-p value to use for token generation in local search.
- `local_search_n` **int** - The number of completions to generate in local search.
- `local_search_llm_max_gen_tokens` **int** - The maximum number of generated tokens for the LLM in local search.

### basic_search

#### Fields

- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `text_unit_prop` **float** - The text unit proportion.
- `community_prop` **float** - The community proportion.
- `conversation_history_max_turns` **int** - The maximum number of conversation history turns to include.
- `top_k_entities` **int** - The top k mapped entities.
- `top_k_relationships` **int** - The top k mapped relationships.
- `temperature` **float | None** - The temperature to use for token generation.
- `top_p` **float | None** - The top-p value to use for token generation.
- `n` **int | None** - The number of completions to generate.
- `max_tokens` **int** - The maximum tokens.
- `llm_max_tokens` **int** - The LLM maximum tokens.
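A minimal basic_search sketch, assuming the same YAML layout as the other query sections; only the field names above are documented, the values are examples (the prompt path matches the one used in the multi-index notebook).

```yaml
basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"
  conversation_history_max_turns: 5
  max_tokens: 12000
```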

### workflows

**list[str]** - This is a list of workflow names to run, in order. GraphRAG has built-in pipelines to configure this, but you can run exactly and only what you want by specifying the list here. Useful if you have done part of the processing yourself. A sketch of such a list follows.
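For example, a trimmed pipeline might be expressed as the list below; the workflow names here are hypothetical placeholders, so check the names your installed GraphRAG version actually registers before copying them.

```yaml
workflows:
  - create_base_text_units   # hypothetical name: chunk documents into text units
  - extract_graph            # hypothetical name: entity/relationship extraction
  - create_communities       # hypothetical name: Leiden clustering
```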
@@ -2,7 +2,7 @@
|
|||||||
"cells": [
|
"cells": [
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 5,
|
"execution_count": 18,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
@@ -35,7 +35,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 2,
|
"execution_count": 20,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
@@ -54,7 +54,7 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 3,
|
"execution_count": 21,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
@@ -65,10 +65,12 @@
|
|||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": null,
|
"execution_count": 22,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
|
"import numpy as np\n",
|
||||||
|
"\n",
|
||||||
"from graphrag.utils.storage import (\n",
|
"from graphrag.utils.storage import (\n",
|
||||||
" delete_table_from_storage,\n",
|
" delete_table_from_storage,\n",
|
||||||
" load_table_from_storage,\n",
|
" load_table_from_storage,\n",
|
||||||
@@ -78,6 +80,7 @@
|
|||||||
"final_documents = await load_table_from_storage(\"create_final_documents\", storage)\n",
|
"final_documents = await load_table_from_storage(\"create_final_documents\", storage)\n",
|
||||||
"final_text_units = await load_table_from_storage(\"create_final_text_units\", storage)\n",
|
"final_text_units = await load_table_from_storage(\"create_final_text_units\", storage)\n",
|
||||||
"final_entities = await load_table_from_storage(\"create_final_entities\", storage)\n",
|
"final_entities = await load_table_from_storage(\"create_final_entities\", storage)\n",
|
||||||
|
"final_covariates = await load_table_from_storage(\"create_final_covariates\", storage)\n",
|
||||||
"final_nodes = await load_table_from_storage(\"create_final_nodes\", storage)\n",
|
"final_nodes = await load_table_from_storage(\"create_final_nodes\", storage)\n",
|
||||||
"final_relationships = await load_table_from_storage(\n",
|
"final_relationships = await load_table_from_storage(\n",
|
||||||
" \"create_final_relationships\", storage\n",
|
" \"create_final_relationships\", storage\n",
|
||||||
@@ -110,6 +113,10 @@
|
|||||||
" right_on=\"parent\",\n",
|
" right_on=\"parent\",\n",
|
||||||
" how=\"left\",\n",
|
" how=\"left\",\n",
|
||||||
")\n",
|
")\n",
|
||||||
|
"# replace NaN children with empty list\n",
|
||||||
|
"final_communities[\"children\"] = final_communities[\"children\"].apply(\n",
|
||||||
|
" lambda x: x if isinstance(x, np.ndarray) else [] # type: ignore\n",
|
||||||
|
")\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# add children to the reports as well\n",
|
"# add children to the reports as well\n",
|
||||||
"final_community_reports = final_community_reports.merge(\n",
|
"final_community_reports = final_community_reports.merge(\n",
|
||||||
@@ -119,13 +126,12 @@
|
|||||||
" how=\"left\",\n",
|
" how=\"left\",\n",
|
||||||
")\n",
|
")\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# copy children into the reports as well\n",
|
|
||||||
"\n",
|
|
||||||
"# we renamed all the output files for better clarity now that we don't have workflow naming constraints from DataShaper\n",
|
"# we renamed all the output files for better clarity now that we don't have workflow naming constraints from DataShaper\n",
|
||||||
"await write_table_to_storage(final_documents, \"documents\", storage)\n",
|
"await write_table_to_storage(final_documents, \"documents\", storage)\n",
|
||||||
"await write_table_to_storage(final_text_units, \"text_units\", storage)\n",
|
"await write_table_to_storage(final_text_units, \"text_units\", storage)\n",
|
||||||
"await write_table_to_storage(final_entities, \"entities\", storage)\n",
|
"await write_table_to_storage(final_entities, \"entities\", storage)\n",
|
||||||
"await write_table_to_storage(final_relationships, \"relationships\", storage)\n",
|
"await write_table_to_storage(final_relationships, \"relationships\", storage)\n",
|
||||||
|
"await write_table_to_storage(final_covariates, \"covariates\", storage)\n",
|
||||||
"await write_table_to_storage(final_communities, \"communities\", storage)\n",
|
"await write_table_to_storage(final_communities, \"communities\", storage)\n",
|
||||||
"await write_table_to_storage(final_community_reports, \"community_reports\", storage)\n",
|
"await write_table_to_storage(final_community_reports, \"community_reports\", storage)\n",
|
||||||
"\n",
|
"\n",
|
||||||
@@ -135,6 +141,7 @@
|
|||||||
"await delete_table_from_storage(\"create_final_entities\", storage)\n",
|
"await delete_table_from_storage(\"create_final_entities\", storage)\n",
|
||||||
"await delete_table_from_storage(\"create_final_nodes\", storage)\n",
|
"await delete_table_from_storage(\"create_final_nodes\", storage)\n",
|
||||||
"await delete_table_from_storage(\"create_final_relationships\", storage)\n",
|
"await delete_table_from_storage(\"create_final_relationships\", storage)\n",
|
||||||
|
"await delete_table_from_storage(\"create_final_covariates\", storage)\n",
|
||||||
"await delete_table_from_storage(\"create_final_communities\", storage)\n",
|
"await delete_table_from_storage(\"create_final_communities\", storage)\n",
|
||||||
"await delete_table_from_storage(\"create_final_community_reports\", storage)"
|
"await delete_table_from_storage(\"create_final_community_reports\", storage)"
|
||||||
]
|
]
|
||||||
|
|||||||
docs/examples_notebooks/multi_index_search.ipynb (new file, 835 lines)
@@ -0,0 +1,835 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Copyright (c) 2024 Microsoft Corporation.\n",
|
||||||
|
"# Licensed under the MIT License."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Multi Index Search\n",
|
||||||
|
"This notebook demonstrates multi-index search using the GraphRAG API.\n",
|
||||||
|
"\n",
|
||||||
|
"Indexes created from Wikipedia state articles for Alaska, California, DC, Maryland, NY and Washington are used."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 2,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"['alaska', 'california', 'dc', 'maryland', 'ny', 'washington']\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"import asyncio\n",
|
||||||
|
"\n",
|
||||||
|
"import pandas as pd\n",
|
||||||
|
"\n",
|
||||||
|
"from graphrag.api.query import (\n",
|
||||||
|
" multi_index_basic_search,\n",
|
||||||
|
" multi_index_drift_search,\n",
|
||||||
|
" multi_index_global_search,\n",
|
||||||
|
" multi_index_local_search,\n",
|
||||||
|
")\n",
|
||||||
|
"from graphrag.config.create_graphrag_config import create_graphrag_config\n",
|
||||||
|
"\n",
|
||||||
|
"indexes = [\"alaska\", \"california\", \"dc\", \"maryland\", \"ny\", \"washington\"]\n",
|
||||||
|
"indexes = sorted(indexes)\n",
|
||||||
|
"\n",
|
||||||
|
"print(indexes)\n",
|
||||||
|
"\n",
|
||||||
|
"vector_store_configs = {\n",
|
||||||
|
" index: {\n",
|
||||||
|
" \"type\": \"lancedb\",\n",
|
||||||
|
" \"db_uri\": f\"inputs/{index}/lancedb\",\n",
|
||||||
|
" \"container_name\": \"default\",\n",
|
||||||
|
" \"overwrite\": True,\n",
|
||||||
|
" \"index_name\": f\"{index}\",\n",
|
||||||
|
" }\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"}"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"config_data = {\n",
|
||||||
|
" \"models\": {\n",
|
||||||
|
" \"default_chat_model\": {\n",
|
||||||
|
" \"model_supports_json\": True,\n",
|
||||||
|
" \"parallelization_num_threads\": 50,\n",
|
||||||
|
" \"parallelization_stagger\": 0.3,\n",
|
||||||
|
" \"async_mode\": \"threaded\",\n",
|
||||||
|
" \"type\": \"azure_openai_chat\",\n",
|
||||||
|
" \"model\": \"gpt-4o\",\n",
|
||||||
|
" \"auth_type\": \"azure_managed_identity\",\n",
|
||||||
|
" \"api_base\": \"<API_BASE_URL>\",\n",
|
||||||
|
" \"api_version\": \"2024-02-15-preview\",\n",
|
||||||
|
" \"deployment_name\": \"gpt-4o\",\n",
|
||||||
|
" },\n",
|
||||||
|
" \"default_embedding_model\": {\n",
|
||||||
|
" \"parallelization_num_threads\": 50,\n",
|
||||||
|
" \"parallelization_stagger\": 0.3,\n",
|
||||||
|
" \"async_mode\": \"threaded\",\n",
|
||||||
|
" \"type\": \"azure_openai_embedding\",\n",
|
||||||
|
" \"model\": \"text-embedding-3-large\",\n",
|
||||||
|
" \"auth_type\": \"azure_managed_identity\",\n",
|
||||||
|
" \"api_base\": \"<API_BASE_URL>\",\n",
|
||||||
|
" \"api_version\": \"2024-02-15-preview\",\n",
|
||||||
|
" \"deployment_name\": \"text-embedding-3-large\",\n",
|
||||||
|
" },\n",
|
||||||
|
" },\n",
|
||||||
|
" \"vector_store\": vector_store_configs,\n",
|
||||||
|
" \"local_search\": {\n",
|
||||||
|
" \"prompt\": \"prompts/local_search_system_prompt.txt\",\n",
|
||||||
|
" \"llm_max_tokens\": 12000,\n",
|
||||||
|
" },\n",
|
||||||
|
" \"global_search\": {\n",
|
||||||
|
" \"map_prompt\": \"prompts/global_search_map_system_prompt.txt\",\n",
|
||||||
|
" \"reduce_prompt\": \"prompts/global_search_reduce_system_prompt.txt\",\n",
|
||||||
|
" \"knowledge_prompt\": \"prompts/global_search_knowledge_system_prompt.txt\",\n",
|
||||||
|
" },\n",
|
||||||
|
" \"drift_search\": {\n",
|
||||||
|
" \"prompt\": \"prompts/drift_search_system_prompt.txt\",\n",
|
||||||
|
" \"reduce_prompt\": \"prompts/drift_search_reduce_prompt.txt\",\n",
|
||||||
|
" },\n",
|
||||||
|
" \"basic_search\": {\"prompt\": \"prompts/basic_search_system_prompt.txt\"},\n",
|
||||||
|
"}\n",
|
||||||
|
"parameters = create_graphrag_config(config_data, \".\")\n",
|
||||||
|
"loop = asyncio.get_event_loop()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Multi-index Global Search"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"entities = [pd.read_parquet(f\"inputs/{index}/entities.parquet\") for index in indexes]\n",
|
||||||
|
"communities = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/communities.parquet\") for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"community_reports = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/community_reports.parquet\") for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"\n",
|
||||||
|
"task = loop.create_task(\n",
|
||||||
|
" multi_index_global_search(\n",
|
||||||
|
" parameters,\n",
|
||||||
|
" entities,\n",
|
||||||
|
" communities,\n",
|
||||||
|
" community_reports,\n",
|
||||||
|
" indexes,\n",
|
||||||
|
" 1,\n",
|
||||||
|
" False,\n",
|
||||||
|
" \"Multiple Paragraphs\",\n",
|
||||||
|
" False,\n",
|
||||||
|
" \"Describe this dataset.\",\n",
|
||||||
|
" )\n",
|
||||||
|
")\n",
|
||||||
|
"results = await task"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Print report"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 19,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"## Overview of the Dataset\n",
|
||||||
|
"\n",
|
||||||
|
"The dataset is a comprehensive collection of reports that cover a wide array of topics, including historical events, cultural dynamics, economic influences, geographical regions, and environmental issues across various regions in the United States. Each report is uniquely identified by an ID and includes a title, occurrence weight, content, and rank. These elements help to organize the dataset and provide insights into the significance and relevance of each report.\n",
|
||||||
|
"\n",
|
||||||
|
"## Content and Structure\n",
|
||||||
|
"\n",
|
||||||
|
"The reports provide detailed information about specific entities and their relationships, highlighting their importance and impact in different contexts. Topics range from the historical significance of regions like Maryland and Washington D.C., to the cultural and economic landscapes of areas such as Washington State and Los Angeles. The dataset also delves into significant events and figures, such as the Good Friday Earthquake, the Trans-Alaska Pipeline, and the role of Jimi Hendrix in Seattle's cultural heritage [Data: Reports (120, 129, 40, 16, +more)].\n",
|
||||||
|
"\n",
|
||||||
|
"## Key Features\n",
|
||||||
|
"\n",
|
||||||
|
"Each report is structured into sections that provide insights into the main topics discussed, supported by data references to entities and relationships. The occurrence weight and rank of each report may indicate its relevance and significance within the dataset. This structure allows for a comprehensive understanding of the topics discussed, emphasizing the interconnectedness of various entities and their roles in broader socio-economic and cultural contexts.\n",
|
||||||
|
"\n",
|
||||||
|
"## Topics Covered\n",
|
||||||
|
"\n",
|
||||||
|
"The dataset includes a diverse range of topics, such as the strategic geopolitical position of Alaska, the cultural and economic significance of California, the historical and geographical significance of New York State, and the environmental health concerns in Washington. It also covers political transitions, such as the governorship change in Maryland and the 2022 special election in Alaska [Data: Reports (204, 143, 85, 122, 83, +more)].\n",
|
||||||
|
"\n",
|
||||||
|
"## Conclusion\n",
|
||||||
|
"\n",
|
||||||
|
"Overall, the dataset serves as a valuable resource for understanding the complexities and interdependencies of historical, cultural, economic, and geographical factors in shaping the identity and development of various regions in the United States. The detailed narratives and data references provide a multifaceted perspective on each topic, making it a rich source of information for research and analysis.\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"print(results[0])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Show context links back to original index"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"120 dc 26\n",
|
||||||
|
"Washington D.C. Founders and Influences\n",
|
||||||
|
"Washington D.C. Founders and Influences\n",
|
||||||
|
"129 dc 35\n",
|
||||||
|
"Smithsonian Institution and Its Museums\n",
|
||||||
|
"Smithsonian Institution and Its Museums\n",
|
||||||
|
"40 alaska 40\n",
|
||||||
|
"Good Friday Earthquake and Its Global Impact\n",
|
||||||
|
"Good Friday Earthquake and Its Global Impact\n",
|
||||||
|
"16 alaska 16\n",
|
||||||
|
"Trans-Alaska Pipeline and Prudhoe Bay\n",
|
||||||
|
"Trans-Alaska Pipeline and Prudhoe Bay\n",
|
||||||
|
"204 ny 36\n",
|
||||||
|
"Long Island and its Educational and Cultural Landscape\n",
|
||||||
|
"Long Island and its Educational and Cultural Landscape\n",
|
||||||
|
"143 maryland 5\n",
|
||||||
|
"Western Maryland and Appalachian Region\n",
|
||||||
|
"Western Maryland and Appalachian Region\n",
|
||||||
|
"85 california 38\n",
|
||||||
|
"California and Its Historical and Geopolitical Context\n",
|
||||||
|
"California and Its Historical and Geopolitical Context\n",
|
||||||
|
"122 dc 28\n",
|
||||||
|
"District of Columbia and Legal Framework\n",
|
||||||
|
"District of Columbia and Legal Framework\n",
|
||||||
|
"83 california 36\n",
|
||||||
|
"Southern California and Key Geographical Entities\n",
|
||||||
|
"Southern California and Key Geographical Entities\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"for report_id in [120, 129, 40, 16, 204, 143, 85, 122, 83]:\n",
|
||||||
|
" index_name = [i for i in results[1][\"reports\"] if i[\"id\"] == str(report_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_name\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" index_id = [i for i in results[1][\"reports\"] if i[\"id\"] == str(report_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_id\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" print(report_id, index_name, index_id)\n",
|
||||||
|
" index_reports = pd.read_parquet(\n",
|
||||||
|
" f\"inputs/{index_name}/create_final_community_reports.parquet\"\n",
|
||||||
|
" )\n",
|
||||||
|
" print([i for i in results[1][\"reports\"] if i[\"id\"] == str(report_id)][0][\"title\"]) # noqa: RUF015\n",
|
||||||
|
" print(\n",
|
||||||
|
" index_reports[index_reports[\"community\"] == int(index_id)][\"title\"].to_numpy()[\n",
|
||||||
|
" 0\n",
|
||||||
|
" ]\n",
|
||||||
|
" )"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Multi-index Local Search"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"nodes = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_nodes.parquet\") for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"entities = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_entities.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"community_reports = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_community_reports.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"covariates = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_covariates.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"text_units = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_text_units.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"relationships = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_relationships.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"\n",
|
||||||
|
"task = loop.create_task(\n",
|
||||||
|
" multi_index_local_search(\n",
|
||||||
|
" parameters,\n",
|
||||||
|
" nodes,\n",
|
||||||
|
" entities,\n",
|
||||||
|
" community_reports,\n",
|
||||||
|
" text_units,\n",
|
||||||
|
" relationships,\n",
|
||||||
|
" covariates,\n",
|
||||||
|
" indexes,\n",
|
||||||
|
" 1,\n",
|
||||||
|
" \"Multiple Paragraphs\",\n",
|
||||||
|
" False,\n",
|
||||||
|
" \"weather\",\n",
|
||||||
|
" )\n",
|
||||||
|
")\n",
|
||||||
|
"results = await task"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Print report"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 15,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"### Weather Patterns in California and Washington\n",
|
||||||
|
"\n",
|
||||||
|
"#### California's Climate\n",
|
||||||
|
"\n",
|
||||||
|
"California exhibits a wide range of climates due to its diverse geography, which includes coastal areas, mountains, and deserts. The state experiences a Mediterranean climate in the Central Valley and coastal regions, characterized by wet winters and dry summers. The Sierra Nevada mountains have an alpine climate with snow in winter and mild summers, while the eastern side of the mountains creates rain shadows, leading to desert conditions in areas like Death Valley, which is one of the hottest places on Earth [Data: Reports (47); Entities (500, 502, 506)].\n",
|
||||||
|
"\n",
|
||||||
|
"The state's climate diversity results in varying weather patterns, with northern regions receiving more rainfall than the south. The demand for water is high due to these climatic variations, and droughts have become more frequent, exacerbated by climate change and overextraction of water resources [Data: Reports (47); Claims (100)].\n",
|
||||||
|
"\n",
|
||||||
|
"#### Washington's Climate\n",
|
||||||
|
"\n",
|
||||||
|
"Washington State's climate is influenced by its location in the Pacific Northwest and its varied topography, including the Cascade Range and the Olympic Mountains. Western Washington has a marine climate with mild temperatures and significant rainfall, especially on the windward side of the mountains. The region is known for its cloudy and rainy weather, particularly in the winter months [Data: Reports (213); Sources (89)].\n",
|
||||||
|
"\n",
|
||||||
|
"Eastern Washington, in contrast, experiences a semi-arid climate due to the rain shadow effect of the Cascades. This area has less precipitation and more extreme temperature variations, with hot summers and cold winters. The state is also affected by climate patterns such as the Southern Oscillation, which includes El Niño and La Niña phases, impacting precipitation and temperature [Data: Entities (1960, 1961, 1962); Relationships (1805, 1806)].\n",
|
||||||
|
"\n",
|
||||||
|
"#### Environmental Impacts\n",
|
||||||
|
"\n",
|
||||||
|
"Both states face environmental challenges related to their weather patterns. California's diverse ecosystems are threatened by urbanization, logging, and climate change, which have led to increased wildfire risks and water scarcity. Efforts to manage these issues include water conservation projects and initiatives to revive traditional land management practices, such as controlled burns [Data: Reports (47)].\n",
|
||||||
|
"\n",
|
||||||
|
"Washington's commitment to environmental sustainability is reflected in its conservation efforts and the protection of natural areas like national parks. The state's weather patterns, influenced by atmospheric phenomena like the Pineapple Express, bring heavy rainfall, which can lead to flooding and other environmental impacts [Data: Reports (213); Entities (1950)].\n",
|
||||||
|
"\n",
|
||||||
|
"In summary, both California and Washington have unique weather patterns shaped by their geography and climate influences. These patterns have significant implications for environmental management and resource conservation in each state.\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"print(results[0])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Show context links back to original index"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"47 california 0\n",
|
||||||
|
"California: A Hub of Cultural, Economic, and Environmental Significance\n",
|
||||||
|
"California: A Hub of Cultural, Economic, and Environmental Significance\n",
|
||||||
|
"213 washington 0\n",
|
||||||
|
"Washington State: Economic and Cultural Hub\n",
|
||||||
|
"Washington State: Economic and Cultural Hub\n",
|
||||||
|
"500 california 161\n",
|
||||||
|
"Boca is a location in California where the lowest temperature in the state, −45 °F, was recorded on \n",
|
||||||
|
"Boca is a location in California where the lowest temperature in the state, −45 °F, was recorded on \n",
|
||||||
|
"502 california 163\n",
|
||||||
|
"Mammoth is a location in the Sierra Nevada, California, known for its mountain climate\n",
|
||||||
|
"Mammoth is a location in the Sierra Nevada, California, known for its mountain climate\n",
|
||||||
|
"506 california 167\n",
|
||||||
|
"Eureka is a city in California known for its cool summers in the Humboldt Bay region\n",
|
||||||
|
"Eureka is a city in California known for its cool summers in the Humboldt Bay region\n",
|
||||||
|
"1960 washington 104\n",
|
||||||
|
"The Southern Oscillation is a climate pattern that influences weather during the cold season, affect\n",
|
||||||
|
"The Southern Oscillation is a climate pattern that influences weather during the cold season, affect\n",
|
||||||
|
"1961 washington 105\n",
|
||||||
|
"El Niño is a phase of the Southern Oscillation that causes drier and less snowy conditions in Washin\n",
|
||||||
|
"El Niño is a phase of the Southern Oscillation that causes drier and less snowy conditions in Washin\n",
|
||||||
|
"1962 washington 106\n",
|
||||||
|
"La Niña is a phase of the Southern Oscillation that causes more rain and snow in Washington\n",
|
||||||
|
"La Niña is a phase of the Southern Oscillation that causes more rain and snow in Washington\n",
|
||||||
|
"1805 washington 92\n",
|
||||||
|
"El Niño is a phase of the Southern Oscillation\n",
|
||||||
|
"El Niño is a phase of the Southern Oscillation\n",
|
||||||
|
"1806 washington 93\n",
|
||||||
|
"La Niña is a phase of the Southern Oscillation\n",
|
||||||
|
"La Niña is a phase of the Southern Oscillation\n",
|
||||||
|
"1806 california 35\n",
|
||||||
|
"The lowest temperature in California was −45 °F (−43 °C) recorded in Boca on January 20, 1937.\n",
|
||||||
|
"The lowest temperature in California was −45 °F (−43 °C) recorded in Boca on January 20, 1937.\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"for report_id in [47, 213]:\n",
|
||||||
|
" index_name = [i for i in results[1][\"reports\"] if i[\"id\"] == str(report_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_name\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" index_id = [i for i in results[1][\"reports\"] if i[\"id\"] == str(report_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_id\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" print(report_id, index_name, index_id)\n",
|
||||||
|
" index_reports = pd.read_parquet(\n",
|
||||||
|
" f\"inputs/{index_name}/create_final_community_reports.parquet\"\n",
|
||||||
|
" )\n",
|
||||||
|
" print([i for i in results[1][\"reports\"] if i[\"id\"] == str(report_id)][0][\"title\"]) # noqa: RUF015\n",
|
||||||
|
" print(\n",
|
||||||
|
" index_reports[index_reports[\"community\"] == int(index_id)][\"title\"].to_numpy()[\n",
|
||||||
|
" 0\n",
|
||||||
|
" ]\n",
|
||||||
|
" )\n",
|
||||||
|
"for entity_id in [500, 502, 506, 1960, 1961, 1962]:\n",
|
||||||
|
" index_name = [i for i in results[1][\"entities\"] if i[\"id\"] == str(entity_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_name\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" index_id = [i for i in results[1][\"entities\"] if i[\"id\"] == str(entity_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_id\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" print(entity_id, index_name, index_id)\n",
|
||||||
|
" index_entities = pd.read_parquet(\n",
|
||||||
|
" f\"inputs/{index_name}/create_final_entities.parquet\"\n",
|
||||||
|
" )\n",
|
||||||
|
" print(\n",
|
||||||
|
" [i for i in results[1][\"entities\"] if i[\"id\"] == str(entity_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"description\"\n",
|
||||||
|
" ][:100]\n",
|
||||||
|
" )\n",
|
||||||
|
" print(\n",
|
||||||
|
" index_entities[index_entities[\"human_readable_id\"] == int(index_id)][\n",
|
||||||
|
" \"description\"\n",
|
||||||
|
" ].to_numpy()[0][:100]\n",
|
||||||
|
" )\n",
|
||||||
|
"for relationship_id in [1805, 1806]:\n",
|
||||||
|
" index_name = [ # noqa: RUF015\n",
|
||||||
|
" i for i in results[1][\"relationships\"] if i[\"id\"] == str(relationship_id)\n",
|
||||||
|
" ][0][\"index_name\"]\n",
|
||||||
|
" index_id = [ # noqa: RUF015\n",
|
||||||
|
" i for i in results[1][\"relationships\"] if i[\"id\"] == str(relationship_id)\n",
|
||||||
|
" ][0][\"index_id\"]\n",
|
||||||
|
" print(relationship_id, index_name, index_id)\n",
|
||||||
|
" index_relationships = pd.read_parquet(\n",
|
||||||
|
" f\"inputs/{index_name}/create_final_relationships.parquet\"\n",
|
||||||
|
" )\n",
|
||||||
|
" print(\n",
|
||||||
|
" [i for i in results[1][\"relationships\"] if i[\"id\"] == str(relationship_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"description\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" )\n",
|
||||||
|
" print(\n",
|
||||||
|
" index_relationships[index_relationships[\"human_readable_id\"] == int(index_id)][\n",
|
||||||
|
" \"description\"\n",
|
||||||
|
" ].to_numpy()[0]\n",
|
||||||
|
" )\n",
|
||||||
|
"for claim_id in [100]:\n",
|
||||||
|
" index_name = [i for i in results[1][\"claims\"] if i[\"id\"] == str(claim_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_name\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" index_id = [i for i in results[1][\"claims\"] if i[\"id\"] == str(claim_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_id\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" print(relationship_id, index_name, index_id)\n",
|
||||||
|
" index_claims = pd.read_parquet(\n",
|
||||||
|
" f\"inputs/{index_name}/create_final_covariates.parquet\"\n",
|
||||||
|
" )\n",
|
||||||
|
" print(\n",
|
||||||
|
" [i for i in results[1][\"claims\"] if i[\"id\"] == str(claim_id)][0][\"description\"] # noqa: RUF015\n",
|
||||||
|
" )\n",
|
||||||
|
" print(\n",
|
||||||
|
" index_claims[index_claims[\"human_readable_id\"] == int(index_id)][\n",
|
||||||
|
" \"description\"\n",
|
||||||
|
" ].to_numpy()[0]\n",
|
||||||
|
" )"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Multi-index Drift Search"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"nodes = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_nodes.parquet\") for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"entities = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_entities.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"community_reports = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_community_reports.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"text_units = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_text_units.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"relationships = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_relationships.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"\n",
|
||||||
|
"task = loop.create_task(\n",
|
||||||
|
" multi_index_drift_search(\n",
|
||||||
|
" parameters,\n",
|
||||||
|
" nodes,\n",
|
||||||
|
" entities,\n",
|
||||||
|
" community_reports,\n",
|
||||||
|
" text_units,\n",
|
||||||
|
" relationships,\n",
|
||||||
|
" indexes,\n",
|
||||||
|
" 1,\n",
|
||||||
|
" \"Multiple Paragraphs\",\n",
|
||||||
|
" False,\n",
|
||||||
|
" \"agriculture\",\n",
|
||||||
|
" )\n",
|
||||||
|
")\n",
|
||||||
|
"results = await task"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Print report"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 24,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"### Overview of Agriculture in Key U.S. Regions\n",
|
||||||
|
"\n",
|
||||||
|
"Agriculture in the United States is a diverse and regionally varied industry, with different areas specializing in specific crops and facing unique challenges. This overview highlights the agricultural dynamics in several key regions, including California, Washington, and Alaska, as well as the role of agriculture in the broader economic and environmental context.\n",
|
||||||
|
"\n",
|
||||||
|
"#### California's Agricultural Landscape\n",
|
||||||
|
"\n",
|
||||||
|
"California is a powerhouse in U.S. agriculture, with the Central Valley being a critical area for crop production. The region is known for producing a wide variety of crops, including almonds, grapes, and dairy products, supported by fertile soil and a favorable climate [Data: Sources (16, 29)]. However, water management is a significant challenge due to the state's dry climate and frequent droughts. The Sacramento and San Joaquin Rivers are vital for irrigation, but water scarcity remains a persistent issue, impacting crop yields and farming costs [Data: Sources (24, 21)].\n",
|
||||||
|
"\n",
|
||||||
|
"In Southern California, the agricultural sector is characterized by the production of citrus fruits, avocados, and strawberries. The region's Mediterranean climate is ideal for these crops, but water scarcity and urbanization pose challenges to agricultural expansion [Data: Reports (47); Sources (19, 20, 21, 22, 24)].\n",
|
||||||
|
"\n",
|
||||||
|
"#### Washington's Agricultural Contributions\n",
|
||||||
|
"\n",
|
||||||
|
"Washington State is a leading producer of apples, with the Yakima and Wenatchee–Okanogan regions being major contributors to the state's agricultural output. The state's climate, with dry, warm summers and cold winters, is ideal for apple cultivation, supported by extensive irrigation systems from the Columbia River [Data: Sources (93)]. Washington also produces significant quantities of hops, cherries, and potatoes, contributing to its diverse agricultural economy [Data: Sources (93)].\n",
|
||||||
|
"\n",
|
||||||
|
"The Columbia River plays a crucial role in supporting agriculture in Washington, providing essential irrigation for the Columbia Basin. However, environmental challenges such as water quality and climate change impact agricultural practices, necessitating sustainable water management strategies [Data: Reports (236); Sources (95)].\n",
|
||||||
|
"\n",
|
||||||
|
"#### Alaska's Agricultural Scene\n",
|
||||||
|
"\n",
|
||||||
|
"In Alaska, the Tanana Valley, particularly the Delta Junction area, is a notable agricultural region known for producing barley and hay. The region's short growing season is offset by long summer days, which provide ample sunlight for crop growth [Data: Sources (10)]. The development of local agriculture is supported by state programs and initiatives like the Alaska Grown program, which promotes local produce and supports farmers [Data: Sources (10)].\n",
|
||||||
|
"\n",
|
||||||
|
"#### Environmental and Economic Interplay\n",
|
||||||
|
"\n",
|
||||||
|
"Agriculture in these regions is deeply intertwined with environmental and economic factors. Water management is a common challenge across all areas, with efforts focused on improving irrigation efficiency and adopting sustainable practices to mitigate the impacts of climate change and water scarcity. Additionally, the economic contributions of agriculture are significant, providing employment and supporting local economies, but they also require balancing with environmental conservation efforts to ensure long-term sustainability.\n",
|
||||||
|
"\n",
|
||||||
|
"Overall, agriculture in the U.S. is a complex and dynamic sector, shaped by regional characteristics and broader environmental and economic trends. The ongoing challenges of water management, climate change, and economic diversification highlight the need for innovative solutions and adaptive strategies to sustain agricultural productivity and support rural communities.\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"print(results[0])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Show context links back to original index"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"What strategies is the USDA implementing in California to combat drought effects on agriculture? 47 california 0\n",
|
||||||
|
"California: A Hub of Cultural, Economic, and Environmental Significance\n",
|
||||||
|
"California: A Hub of Cultural, Economic, and Environmental Significance\n",
|
||||||
|
"What environmental challenges affect agriculture around the Columbia River? 236 washington 23\n",
|
||||||
|
"Columbia River and Its Regional Impact\n",
|
||||||
|
"Columbia River and Its Regional Impact\n",
|
||||||
|
"How does agriculture in the Tanana Valley impact the local economy? 10 alaska 10\n",
|
||||||
|
" Fort Greely. This area was largely set aside and developed under a state program spearheaded by Hammond during his second term as governor. Delta-area crops consist predominantly of barley and hay. West of Fairbanks lies another concentration of sma\n",
|
||||||
|
" Fort Greely. This area was largely set aside and developed under a state program spearheaded by Hammond during his second term as governor. Delta-area crops consist predominantly of barley and hay. West of Fairbanks lies another concentration of sma\n",
|
||||||
|
"What are the major crops produced in California's Central Valley, and how are they impacted by river water management? 16 california 0\n",
|
||||||
|
"California is a state in the Western United States, lying on the American Pacific Coast. It borders Oregon to the north, Nevada and Arizona to the east, and an international border with the Mexican state of Baja California to the south. With nearly 3\n",
|
||||||
|
"California is a state in the Western United States, lying on the American Pacific Coast. It borders Oregon to the north, Nevada and Arizona to the east, and an international border with the Mexican state of Baja California to the south. With nearly 3\n",
|
||||||
|
"What are the major crops produced in California's Central Valley, and how are they impacted by river water management? 19 california 3\n",
|
||||||
|
" population of San Francisco increased from 500 to 150,000. \n",
|
||||||
|
"\n",
|
||||||
|
"The seat of government for California under Spanish and later Mexican rule had been located in Monterey from 1777 until 1845. Pio Pico, the last Mexican governor of Alta California, had br\n",
|
||||||
|
" population of San Francisco increased from 500 to 150,000. \n",
|
||||||
|
"\n",
|
||||||
|
"The seat of government for California under Spanish and later Mexican rule had been located in Monterey from 1777 until 1845. Pio Pico, the last Mexican governor of Alta California, had br\n",
|
||||||
|
"What strategies is the USDA implementing in California to combat drought effects on agriculture? 20 california 4\n",
|
||||||
|
" Alien Land Act, excluding Asian immigrants from owning land. During World War II, Japanese Americans in California were interned in concentration camps; in 2020, California apologized.\n",
|
||||||
|
"Migration to California accelerated during the early 20th centur\n",
|
||||||
|
" Alien Land Act, excluding Asian immigrants from owning land. During World War II, Japanese Americans in California were interned in concentration camps; in 2020, California apologized.\n",
|
||||||
|
"Migration to California accelerated during the early 20th centur\n",
|
||||||
|
"What strategies is the USDA implementing in California to combat drought effects on agriculture? 21 california 5\n",
|
||||||
|
"ias region of North America, alongside Baja California Sur).\n",
|
||||||
|
"In the middle of the state lies the California Central Valley, bounded by the Sierra Nevada in the east, the coastal mountain ranges in the west, the Cascade Range to the north and by the T\n",
|
||||||
|
"ias region of North America, alongside Baja California Sur).\n",
|
||||||
|
"In the middle of the state lies the California Central Valley, bounded by the Sierra Nevada in the east, the coastal mountain ranges in the west, the Cascade Range to the north and by the T\n",
|
||||||
|
"What strategies is the USDA implementing in California to combat drought effects on agriculture? 22 california 6\n",
|
||||||
|
" seen in the climate of the Bay Area, where areas sheltered from the ocean experience significantly hotter summers and colder winters in contrast with nearby areas closer to the ocean.\n",
|
||||||
|
"\n",
|
||||||
|
"Northern parts of the state have more rain than the south. Calif\n",
|
||||||
|
" seen in the climate of the Bay Area, where areas sheltered from the ocean experience significantly hotter summers and colder winters in contrast with nearby areas closer to the ocean.\n",
|
||||||
|
"\n",
|
||||||
|
"Northern parts of the state have more rain than the south. Calif\n",
|
||||||
|
"What strategies is the USDA implementing in California to combat drought effects on agriculture? 24 california 8\n",
|
||||||
|
" and Trinity Rivers drain a large area in far northwestern California. The Eel River and Salinas River each drain portions of the California coast, north and south of San Francisco Bay, respectively. The Mojave River is the primary watercourse in the\n",
|
||||||
|
" and Trinity Rivers drain a large area in far northwestern California. The Eel River and Salinas River each drain portions of the California coast, north and south of San Francisco Bay, respectively. The Mojave River is the primary watercourse in the\n",
|
||||||
|
"What strategies is the USDA implementing in California to combat drought effects on agriculture? 29 california 13\n",
|
||||||
|
" Los Angeles and the Port of Long Beach in Southern California collectively play a pivotal role in the global supply chain, together hauling in about 40% of all imports to the United States by TEU volume. The Port of Oakland and Port of Hueneme are t\n",
|
||||||
|
" Los Angeles and the Port of Long Beach in Southern California collectively play a pivotal role in the global supply chain, together hauling in about 40% of all imports to the United States by TEU volume. The Port of Oakland and Port of Hueneme are t\n",
|
||||||
|
"How has Hanford's historical role affected current environmental policies in Eastern Washington? 93 washington 8\n",
|
||||||
|
"Washington is a leading agricultural state. For 2018, the total value of Washington's agricultural products was $10.6 billion. In 2014, Washington ranked first in the nation in production of red raspberries (90.5 percent of total U.S. production), ho\n",
|
||||||
|
"Washington is a leading agricultural state. For 2018, the total value of Washington's agricultural products was $10.6 billion. In 2014, Washington ranked first in the nation in production of red raspberries (90.5 percent of total U.S. production), ho\n",
|
||||||
|
"How has Hanford's historical role affected current environmental policies in Eastern Washington? 95 washington 10\n",
|
||||||
|
", dioxins, two chlorinated pesticides, DDE, dieldrin and PBDEs. As a result of the study, the department will investigate the sources of PCBs in the Wenatchee River, where unhealthy levels of PCBs were found in mountain whitefish. Based on the 2007 i\n",
|
||||||
|
", dioxins, two chlorinated pesticides, DDE, dieldrin and PBDEs. As a result of the study, the department will investigate the sources of PCBs in the Wenatchee River, where unhealthy levels of PCBs were found in mountain whitefish. Based on the 2007 i\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"for report_id in [47, 236]:\n",
|
||||||
|
" for question in results[1]:\n",
|
||||||
|
" resq = results[1][question]\n",
|
||||||
|
" if len(resq[\"reports\"]) == 0:\n",
|
||||||
|
" continue\n",
|
||||||
|
" if len([i for i in resq[\"reports\"] if i[\"id\"] == str(report_id)]) == 0:\n",
|
||||||
|
" continue\n",
|
||||||
|
" index_name = [i for i in resq[\"reports\"] if i[\"id\"] == str(report_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_name\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" index_id = [i for i in resq[\"reports\"] if i[\"id\"] == str(report_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_id\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" print(question, report_id, index_name, index_id)\n",
|
||||||
|
" index_reports = pd.read_parquet(\n",
|
||||||
|
" f\"inputs/{index_name}/create_final_community_reports.parquet\"\n",
|
||||||
|
" )\n",
|
||||||
|
" print([i for i in resq[\"reports\"] if i[\"id\"] == str(report_id)][0][\"title\"]) # noqa: RUF015\n",
|
||||||
|
" print(\n",
|
||||||
|
" index_reports[index_reports[\"community\"] == int(index_id)][\n",
|
||||||
|
" \"title\"\n",
|
||||||
|
" ].to_numpy()[0]\n",
|
||||||
|
" )\n",
|
||||||
|
" break\n",
|
||||||
|
"for source_id in [10, 16, 19, 20, 21, 22, 24, 29, 93, 95]:\n",
|
||||||
|
" for question in results[1]:\n",
|
||||||
|
" resq = results[1][question]\n",
|
||||||
|
" if len(resq[\"sources\"]) == 0:\n",
|
||||||
|
" continue\n",
|
||||||
|
" if len([i for i in resq[\"sources\"] if i[\"id\"] == str(source_id)]) == 0:\n",
|
||||||
|
" continue\n",
|
||||||
|
" index_name = [i for i in resq[\"sources\"] if i[\"id\"] == str(source_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_name\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" index_id = [i for i in resq[\"sources\"] if i[\"id\"] == str(source_id)][0][ # noqa: RUF015\n",
|
||||||
|
" \"index_id\"\n",
|
||||||
|
" ]\n",
|
||||||
|
" print(question, source_id, index_name, index_id)\n",
|
||||||
|
" index_sources = pd.read_parquet(\n",
|
||||||
|
" f\"inputs/{index_name}/create_final_text_units.parquet\"\n",
|
||||||
|
" )\n",
|
||||||
|
" print(\n",
|
||||||
|
" [i for i in resq[\"sources\"] if i[\"id\"] == str(source_id)][0][\"text\"][:250] # noqa: RUF015\n",
|
||||||
|
" )\n",
|
||||||
|
" print(index_sources.loc[int(index_id)][\"text\"][:250])\n",
|
||||||
|
" break"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"### Multi-index Basic Search"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"text_units = [\n",
|
||||||
|
" pd.read_parquet(f\"inputs/{index}/create_final_text_units.parquet\")\n",
|
||||||
|
" for index in indexes\n",
|
||||||
|
"]\n",
|
||||||
|
"\n",
|
||||||
|
"task = loop.create_task(\n",
|
||||||
|
" multi_index_basic_search(\n",
|
||||||
|
" parameters, text_units, indexes, False, \"industry in maryland\"\n",
|
||||||
|
" )\n",
|
||||||
|
")\n",
|
||||||
|
"results = await task"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Print report"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 25,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"# Industry in Maryland\n",
|
||||||
|
"\n",
|
||||||
|
"Maryland's economy is diverse and robust, with significant contributions from various sectors, including manufacturing, biotechnology, transportation, and agriculture. The state's strategic location near Washington, D.C., and its access to major transportation hubs like the Port of Baltimore, play a crucial role in its industrial landscape.\n",
|
||||||
|
"\n",
|
||||||
|
"## Manufacturing\n",
|
||||||
|
"\n",
|
||||||
|
"Manufacturing in Maryland is highly diversified, with no single sub-sector contributing more than 20% of the total. Key manufacturing industries include electronics, computer equipment, and chemicals. Historically, the primary metals sub-sector was significant, with the Sparrows Point steel factory once being the largest in the world. However, this sector has faced challenges from foreign competition, bankruptcies, and mergers [Data: Sources (0, 1)].\n",
|
||||||
|
"\n",
|
||||||
|
"## Biotechnology\n",
|
||||||
|
"\n",
|
||||||
|
"Maryland is a major center for life sciences research and development, hosting more than 400 biotechnology companies, making it the fourth largest nexus in this field in the United States. The state is home to prominent institutions and government agencies involved in research and development, such as Johns Hopkins University, the National Institutes of Health (NIH), and the Food and Drug Administration (FDA) [Data: Sources (0, 1)].\n",
|
||||||
|
"\n",
|
||||||
|
"## Transportation and the Port of Baltimore\n",
|
||||||
|
"\n",
|
||||||
|
"Transportation is a significant service activity in Maryland, centered around the Port of Baltimore. The port is a major hub for imports, particularly raw materials and bulk commodities, and is the number one auto port in the U.S. The port's strategic location allows for efficient distribution to manufacturing centers in the inland Midwest [Data: Sources (0, 1)].\n",
|
||||||
|
"\n",
|
||||||
|
"## Agriculture and Food Production\n",
|
||||||
|
"\n",
|
||||||
|
"Agriculture remains an important part of Maryland's economy, with large areas of fertile land in the coastal and Piedmont zones. The state is known for dairy farming, specialty horticulture crops, and a significant chicken-farming sector. Maryland's food-processing plants are the most significant type of manufacturing by value in the state [Data: Sources (0, 1)].\n",
|
||||||
|
"\n",
|
||||||
|
"## Conclusion\n",
|
||||||
|
"\n",
|
||||||
|
"Maryland's industrial landscape is characterized by its diversity and strategic advantages, including proximity to federal government operations and major transportation routes. The state's economy benefits from a mix of traditional industries like manufacturing and agriculture, alongside cutting-edge sectors such as biotechnology and transportation. This combination positions Maryland as a dynamic player in the national economy.\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"print(results[0])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"#### Show context links back to original text\n",
|
||||||
|
"\n",
|
||||||
|
"Note that original index name is not saved in context data for basic search"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 26,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
" highly diversified with no sub-sector contributing over 20 percent of the total. Typical forms of manufacturing include electronics, computer equipment, and chemicals. The once-mighty primary metals sub-sector, which once included what was then the \n",
|
||||||
|
"20%. Demographically, both Protestants and those identifying with no religion are more numerous than Catholics.\n",
|
||||||
|
"According to the Pew Research Center in 2014, 69 percent of Maryland's population identifies themselves as Christian. Nearly 52% of the ad\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"for source_id in [0, 1]:\n",
|
||||||
|
" print(results[1][\"sources\"][source_id][\"text\"][:250])"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "base",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.11.5"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 2
|
||||||
|
}
|
||||||
@@ -14,7 +14,7 @@ Figure 1: An LLM-generated knowledge graph built using GPT-4 Turbo.

GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches using plain text snippets. The GraphRAG process involves extracting a knowledge graph out of raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when performing RAG-based tasks.

To learn more about GraphRAG and how it can be used to enhance your language model's ability to reason about your private data, please visit the [Microsoft Research Blog Post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/).

## Solution Accelerator 🚀
@@ -32,7 +32,7 @@ Retrieval-Augmented Generation (RAG) is a technique to improve LLM outputs using
 - Baseline RAG struggles to connect the dots. This happens when answering a question requires traversing disparate pieces of information through their shared attributes in order to provide new synthesized insights.
 - Baseline RAG performs poorly when being asked to holistically understand summarized semantic concepts over large data collections or even singular large documents.
 
-To address this, the tech community is working to develop methods that extend and enhance RAG. Microsoft Research’s new approach, GraphRAG, uses LLMs to create a knowledge graph based on an input corpus. This graph, along with community summaries and graph machine learning outputs, are used to augment prompts at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.
+To address this, the tech community is working to develop methods that extend and enhance RAG. Microsoft Research’s new approach, GraphRAG, creates a knowledge graph based on an input corpus. This graph, along with community summaries and graph machine learning outputs, are used to augment prompts at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.
 
 ## The GraphRAG Process 🤖
@@ -41,7 +41,7 @@ GraphRAG builds upon our prior [research](https://www.microsoft.com/en-us/workla
 ### Index
 
 - Slice up an input corpus into a series of TextUnits, which act as analyzable units for the rest of the process, and provide fine-grained references in our outputs.
-- Extract all entities, relationships, and key claims from the TextUnits using an LLM.
+- Extract all entities, relationships, and key claims from the TextUnits.
 - Perform a hierarchical clustering of the graph using the [Leiden technique](https://arxiv.org/pdf/1810.08473.pdf). To see this visually, check out Figure 1 above. Each circle is an entity (e.g., a person, place, or organization), with the size representing the degree of the entity, and the color representing its community.
 - Generate summaries of each community and its constituents from the bottom-up. This aids in holistic understanding of the dataset.
@@ -57,3 +57,10 @@ At query time, these structures are used to provide materials for the LLM contex
 
 Using _GraphRAG_ with your data out of the box may not yield the best possible results.
 We strongly recommend to fine-tune your prompts following the [Prompt Tuning Guide](prompt_tuning/overview.md) in our documentation.
+
+## Versioning
+
+Please see the [breaking changes](https://github.com/microsoft/graphrag/blob/main/breaking-changes.md) document for notes on our approach to versioning the project.
+
+*Always run `graphrag init --root [path] --force` between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so backup if necessary.*
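As a concrete illustration of that guidance, a minimal upgrade sketch (the `./ragtest` project root is a placeholder):

```bash
# Hypothetical upgrade flow between minor versions; ./ragtest is a placeholder project root.
# Back up config and prompts first, because --force overwrites them.
cp ./ragtest/settings.yaml ./ragtest/settings.yaml.bak
cp -r ./ragtest/prompts ./ragtest/prompts.bak

# Regenerate the default config in the latest format, then re-apply any custom
# model, vector store, or prompt settings from the backups.
graphrag init --root ./ragtest --force
```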
@@ -26,13 +26,6 @@ stateDiagram-v2
     ExtractGraph --> EmbedGraph
 ```
 
-### Dataframe Message Format
-
-The primary unit of communication between workflows, and between workflow steps is an instance of `pandas.DataFrame`.
-Although side-effects are possible, our goal is to be _data-centric_ and _table-centric_ in our approach to data processing.
-This allows us to easily reason about our data, and to leverage the power of dataframe-based ecosystems.
-Our underlying dataframe technology may change over time, but our primary goal is to support the workflow schema while retaining single-machine ease of use and developer ergonomics.
-
 ### LLM Caching
 
 The GraphRAG library was designed with LLM interactions in mind, and a common setback when working with LLM APIs is various errors due to network latency, throttling, etc..
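For context on the table-centric design described in the removed section above, a purely illustrative workflow step (function and column names are hypothetical, not the project's API):

```python
import pandas as pd

def attach_degree(entities: pd.DataFrame, relationships: pd.DataFrame) -> pd.DataFrame:
    """Illustrative workflow step: consume dataframes, return a new dataframe.

    Counts how often each entity title appears as a relationship endpoint and
    attaches the count as a `degree` column. Column names are placeholders.
    """
    degree = (
        pd.concat([relationships["source"], relationships["target"]])
        .value_counts()
        .rename("degree")
    )
    return entities.merge(degree, left_on="title", right_index=True, how="left").fillna({"degree": 0})
```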
@@ -11,7 +11,6 @@ The knowledge model is a specification for data outputs that conform to our data
 - `Covariate` - Extracted claim information, which contains statements about entities which may be time-bound.
 - `Community` - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
 - `Community Report` - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
-- `Node` - This table contains layout information for rendered graph-views of the Entities and Documents which have been embedded and clustered.
 
 ## The Default Configuration Workflow
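To make the shape of these elements a little more concrete, a minimal, hypothetical sketch (field choices are illustrative and not the project's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class Covariate:
    """Extracted claim about an entity; may be time-bound."""
    subject_id: str
    description: str
    start_date: str | None = None
    end_date: str | None = None

@dataclass
class Community:
    """A cluster of entities produced by hierarchical community detection."""
    community: int
    level: int
    parent: int = -1
    entity_ids: list[str] = field(default_factory=list)

@dataclass
class CommunityReport:
    """Generated summary of a community for human reading and downstream search."""
    community: int
    title: str
    summary: str
    full_content: str
```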
@@ -48,7 +47,7 @@ flowchart TB
 subgraph phase6[Phase 6: Network Visualization]
 graph_outputs --> graph_embed[Graph Embedding]
 graph_embed --> umap_entities[Umap Entities]
-umap_entities --> combine_nodes[Final Nodes]
+umap_entities --> combine_nodes[Final Entities]
 end
 subgraph phase7[Phase 7: Text Embeddings]
 textUnits --> text_embed[Text Embedding]
@@ -186,7 +185,7 @@ In this phase of the workflow, we perform some steps to support network visualiz
 title: Network Visualization Workflows
 ---
 flowchart LR
-ag[Graph Table] --> ge[Node2Vec Graph Embedding] --> ne[Umap Entities] --> ng[Nodes Table]
+ag[Graph Table] --> ge[Node2Vec Graph Embedding] --> ne[Umap Entities] --> ng[Entities Table]
 ```
 
 ### Graph Embedding
@@ -195,7 +194,7 @@ In this step, we generate a vector representation of our graph using the Node2Ve
 
 ### Dimensionality Reduction
 
-For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then exported as a table of _Nodes_. The rows of this table include the UMAP dimensions as x/y coordinates.
+For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are reduced to two dimensions as x/y coordinates.
 
 ## Phase 7: Text Embedding
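As a rough sketch of the UMAP step described above (umap-learn is assumed; this is not the project's internal code):

```python
import numpy as np
import umap  # umap-learn

# Stand-in for node2vec-style graph embeddings: 100 nodes, 128 dimensions.
node_embeddings = np.random.default_rng(42).normal(size=(100, 128))

# Reduce to two dimensions to obtain x/y coordinates for visualization.
reducer = umap.UMAP(n_components=2, random_state=42)
coordinates = reducer.fit_transform(node_embeddings)
x, y = coordinates[:, 0], coordinates[:, 1]
```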
docs/index/methods.md (new file, 44 lines)
@@ -0,0 +1,44 @@
# Indexing Methods

GraphRAG is a platform for our research into RAG indexing methods that produce optimal context window content for language models. We have a standard indexing pipeline that uses a language model to extract the graph that our memory model is based upon. We may introduce additional indexing methods from time to time. This page documents those options.

## Standard GraphRAG

This is the method described in the original [blog post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/). Standard uses a language model for all reasoning tasks:

- entity extraction: the LLM is prompted to extract named entities from each text unit and provide a description for each.
- relationship extraction: the LLM is prompted to describe the relationship between each pair of entities in each text unit.
- entity summarization: the LLM is prompted to combine the descriptions for every instance of an entity found across the text units into a single summary.
- relationship summarization: the LLM is prompted to combine the descriptions for every instance of a relationship found across the text units into a single summary.
- claim extraction (optional): the LLM is prompted to extract and describe claims from each text unit.
- community report generation: entity and relationship descriptions (and optionally claims) for each community are collected and used to prompt the LLM to generate a summary report.

`graphrag index --method standard`. This is the default method, so the method param can actually be omitted.
## FastGraphRAG

FastGraphRAG is a method that replaces some of the language model reasoning with traditional natural language processing (NLP) methods. This is a hybrid technique that we developed as a faster and cheaper indexing alternative:

- entity extraction: entities are noun phrases extracted using NLP libraries such as NLTK and spaCy. There is no description; the source text unit is used instead.
- relationship extraction: relationships are defined as text unit co-occurrence between entity pairs. There is no description.
- entity summarization: not necessary.
- relationship summarization: not necessary.
- claim extraction (optional): unused.
- community report generation: the direct text unit content containing each entity noun phrase is collected and used to prompt the LLM to generate a summary report.

`graphrag index --method fast`

FastGraphRAG has a handful of NLP [options built in](https://microsoft.github.io/graphrag/config/yaml/#extract_graph_nlp). By default we use NLTK + regular expressions for the noun phrase extraction, which is very fast but primarily suitable for English. We have built in two additional methods using spaCy: semantic parsing and CFG. We use the `en_core_web_md` model by default for spaCy, but note that you can reference any [supported model](https://spacy.io/models/) that you have installed.

Note that we also generally configure the text chunking to produce much smaller chunks (50-100 tokens). This results in a better co-occurrence graph.

⚠️ Note on SpaCy models:

This package requires SpaCy models to function correctly. If the required model is not installed, the package will automatically download and install it the first time it is used.

You can install it manually by running `python -m spacy download <model_name>`, for example `python -m spacy download en_core_web_md`.
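As a rough illustration of the kind of noun-phrase extraction FastGraphRAG relies on (shown with spaCy; this is not the library's internal extractor):

```python
import spacy

# Assumes the model is installed, e.g. via `python -m spacy download en_core_web_md`.
nlp = spacy.load("en_core_web_md")

text = "The Port of Baltimore is a major hub for imports such as raw materials and bulk commodities."
doc = nlp(text)

# Noun chunks act as candidate entities; co-occurrence of entities within the
# same text unit would then define candidate relationships.
noun_phrases = sorted({chunk.text.lower() for chunk in doc.noun_chunks})
print(noun_phrases)
```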
## Choosing a Method

Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive than FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high-fidelity entities and graph exploration are important to your use case, we recommend staying with standard GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG is a reasonable and cheaper alternative.
@@ -10,40 +10,42 @@ All tables have two identifier fields:
 | id | str | Generated UUID, assuring global uniqueness |
 | human_readable_id | int | This is an incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually. |
 
-## create_final_communities
+## communities
 This is a list of the final communities generated by Leiden. Communities are strictly hierarchical, subdividing into children as the cluster affinity is narrowed.
 
 | name | type | description |
-| ---------------- | ----- | ----------- |
+| ------------------ | ----- | ----------- |
 | community | int | Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment. |
 | parent | int | Parent community ID.|
+| children | int[] | List of child community IDs.|
 | level | int | Depth of the community in the hierarchy. |
 | title | str | Friendly name of the community. |
 | entity_ids | str[] | List of entities that are members of the community. |
 | relationship_ids | str[] | List of relationships that are wholly within the community (source and target are both in the community). |
 | text_unit_ids | str[] | List of text units represented within the community. |
 | period | str | Date of ingest, used for incremental update merges. ISO8601 |
 | size | int | Size of the community (entity count), used for incremental update merges. |
 
-## create_final_community_reports
+## community_reports
 This is the list of summarized reports for each community.
 
 | name | type | description |
-| ----------------- | ----- | ----------- |
+| -------------------- | ----- | ----------- |
 | community | int | Short ID of the community this report applies to. |
 | parent | int | Parent community ID. |
+| children | int[] | List of child community IDs.|
 | level | int | Level of the community this report applies to. |
 | title | str | LM-generated title for the report. |
 | summary | str | LM-generated summary of the report. |
 | full_content | str | LM-generated full report. |
 | rank | float | LM-derived relevance ranking of the report based on member entity salience |
-| rank_explanation | str | LM-derived explanation of the rank. |
+| rating_explanation | str | LM-derived explanation of the rank. |
 | findings | dict | LM-derived list of the top 5-10 insights from the community. Contains `summary` and `explanation` values. |
 | full_content_json | json | Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users. |
 | period | str | Date of ingest, used for incremental update merges. ISO8601 |
 | size | int | Size of the community (entity count), used for incremental update merges. |
 
-## create_final_covariates
+## covariates
 (Optional) If claim extraction is turned on, this is a list of the extracted covariates. Note that claims are typically oriented around identifying malicious behavior such as fraud, so they are not useful for all datasets.
 
 | name | type | description |
@@ -59,7 +61,7 @@ This is the list of summarized reports for each community.
 | source_text | str | Short string of text containing the claimed behavior. |
 | text_unit_id | str | ID of the text unit the claim text was extracted from. |
 
-## create_final_documents
+## documents
 List of document content after import.
 
 | name | type | description |
@@ -67,9 +69,9 @@ List of document content after import.
 | title | str | Filename, unless otherwise configured during CSV import. |
 | text | str | Full text of the document. |
 | text_unit_ids | str[] | List of text units (chunks) that were parsed from the document. |
-| attributes | dict | (optional) If specified during CSV import, this is a dict of attributes for the document. |
+| metadata | dict | If specified during CSV import, this is a dict of metadata for the document. |
 
-## create_final_entities
+## entities
 List of all entities found in the data by the LM.
 
 | name | type | description |
@@ -78,22 +80,12 @@ List of all entities found in the data by the LM.
 | type | str | Type of the entity. By default this will be "organization", "person", "geo", or "event" unless configured differently or auto-tuning is used. |
 | description | str | Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions. |
 | text_unit_ids | str[] | List of the text units containing the entity. |
+| frequency | int | Count of text units the entity was found within. |
+| degree | int | Node degree (connectedness) in the graph. |
+| x | float | X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. |
+| y | float | Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. |
 
-## create_final_nodes
-This is graph-related information for the entities. It contains only information relevant to the graph such as community. There is an entry for each entity at every community level it is found within, so you may see "duplicate" entities.
-
-Note that the ID fields match those in create_final_entities and can be used for joining if additional information about a node is required.
-
-| name | type | description |
-| --------- | ----- | ----------- |
-| title | str | Name of the referenced entity. Duplicated from create_final_entities for convenient cross-referencing. |
-| community | int | Leiden community the node is found within. Entities are not always assigned a community (they may not be close enough to any), so they may have a ID of -1. |
-| level | int | Level of the community the entity is in. |
-| degree | int | Node degree (connectedness) in the graph. |
-| x | float | X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. |
-| y | float | Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. |
-
-## create_final_relationships
+## relationships
 List of all entity-to-entity relationships found in the data by the LM. This is also the _edge list_ for the graph.
 
 | name | type | description |
@@ -105,7 +97,7 @@ List of all entity-to-entity relationships found in the data by the LM. This is
 | combined_degree | int | Sum of source and target node degrees. |
 | text_unit_ids | str[] | List of text units the relationship was found within. |
 
-## create_final_text_units
+## text_units
 List of all text chunks parsed from the input documents.
 
 | name | type | description |
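For quick inspection of these tables, a hedged pandas sketch (the `output/*.parquet` paths assume the default file-based storage layout; adjust to your configured output location):

```python
import pandas as pd

# Assumed default output locations for the renamed tables.
entities = pd.read_parquet("output/entities.parquet")
communities = pd.read_parquet("output/communities.parquet")
relationships = pd.read_parquet("output/relationships.parquet")

print(entities[["title", "type", "frequency", "degree"]].head())
print(communities[["community", "parent", "children", "level", "title"]].head())
print(relationships[["combined_degree", "text_unit_ids"]].head())
```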
@@ -27,16 +27,15 @@ After you have a config file you can run the pipeline using the CLI or the Pytho
 
 ```bash
 # Via Poetry
-poetry run poe cli --root <data_root> # default config mode
+poetry run poe index --root <data_root> # default config mode
 ```
 
 ### Python API
 
-Please see the [examples folder](https://github.com/microsoft/graphrag/blob/main/examples/README.md) for a handful of functional pipelines illustrating how to create and run via a custom settings.yml or through custom python scripts.
+Please see the indexing API [python file](https://github.com/microsoft/graphrag/blob/main/graphrag/api/index.py) for the recommended method to call directly from Python code.
 
 ## Further Reading
 
 - To start developing within the _GraphRAG_ project, see [getting started](../developing.md)
 - To understand the underlying concepts and execution model of the indexing library, see [the architecture documentation](../index/architecture.md)
-- To get running with a series of examples, see [the examples documentation](https://github.com/microsoft/graphrag/blob/main/examples/README.md)
 - To read more about configuring the indexing engine, see [the configuration documentation](../config/overview.md)
docs/query/multi_index_search.md (new file, 20 lines)
@@ -0,0 +1,20 @@
# Multi Index Search 🔎

## Multi Dataset Reasoning

GraphRAG takes in unstructured data contained in text documents and uses large language models to “read” the documents in a targeted fashion and create a knowledge graph. This knowledge graph, or index, contains information about specific entities in the data, how the entities relate to one another, and high-level reports about communities and topics found in the data. Indexes can be searched by users to get meaningful information about the underlying data, including reports with citations that point back to the original unstructured text.

Multi-index search is a new capability that has been added to the GraphRAG python library to query multiple knowledge stores at once. Multi-index search allows for many new search scenarios, including:

- Combining knowledge from different domains – Many documents contain similar types of entities: person, place, thing. But GraphRAG can be tuned for highly specialized domains, such as science and engineering. With the recent updates to search, GraphRAG can now simultaneously query multiple datasets with completely different schemas and entity definitions.

- Combining knowledge with different access levels – Not all datasets are accessible to all people, even within an organization. Some datasets are publicly available. Some datasets, such as internal financial information or intellectual property, may only be accessible by a small number of employees at a company. Multi-index search allows multiple sources with different access controls to be queried at the same time, creating more nuanced and informative reports. Internal R&D findings can be seamlessly combined with open-source scientific publications.

- Combining knowledge in different locations – With multi-index search, indexes do not need to be in the same location or type of storage to be queried. Indexes in the cloud in Azure Storage can be queried at the same time as indexes stored on a personal computer. Multi-index search makes these types of data joins easy and accessible.

To search across multiple datasets, the underlying contexts from each index, based on the user query, are combined in-memory at query time, saving on computation and allowing the joint querying of indexes that cannot be joined inherently, whether due to access controls or differing schemas. Multi-index search automatically keeps track of provenance information, so that any references can be traced back to the correct indexes and correct original documents.

## How to Use

An example of a global search scenario can be found in the following [notebook](../examples_notebooks/multi_index_search.ipynb).
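The in-memory combination and provenance bookkeeping described above can be pictured with a small, purely illustrative sketch (this is not the GraphRAG API; all names are hypothetical):

```python
def merge_contexts(indexed_contexts: dict[str, list[dict]]) -> list[dict]:
    """Combine context records retrieved from several indexes for one query,
    tagging each record with the index it came from so that citations can be
    traced back to the right index and original document."""
    merged = []
    for index_name, records in indexed_contexts.items():
        for record in records:
            merged.append({**record, "index": index_name})
    return merged

# Two indexes with overlapping source ids remain distinguishable after merging.
combined = merge_contexts({
    "public_research": [{"source_id": 0, "text": "..."}],
    "internal_rnd": [{"source_id": 0, "text": "..."}],
})
```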
@@ -1,83 +0,0 @@
# GraphRAG Data Model and Config Breaking Changes

As we worked toward a cleaner codebase, data model, and configuration for the v1 release, we made a few changes that can break older indexes. During the development process we left shims in place to account for these changes, so that all old indexes will work up until v1.0. However, with the release of 1.0 we are removing these shims to allow the codebase to move forward without the legacy code elements. We are providing a migration notebook so this process should be fairly painless for most users:

1. Rename or move your settings.yml file to back it up.
2. Re-run `graphrag init` to generate a new default settings.yml.
3. Open your old settings.yml and copy any critical settings that you changed. For most people this is likely only the LLM and embedding config.
4. Run the notebook here: [./docs/examples_notebooks/index_migration.ipynb]()

Note that one of the new requirements is that we write embeddings to a vector store during indexing. By default, this uses a local lancedb instance. When you re-generate the default config, a block will be added to reflect this. If you need to write to Azure AI Search instead, we recommend updating these settings before you index, so you don't need to do a separate vector ingest.

All of the breaking changes listed below are accounted for in the four steps above.

## Updated data model

- We have streamlined the data model of the index in a few small ways to align tables more consistently and remove redundant content. Notably:
  - Consistent use of `id` and `human_readable_id` across all tables; this also insures all int IDs are actually saved as ints and never strings
  - Alignment of fields from `create_final_entities` (such as name -> title) with `create_final_nodes`, and removal of redundant content across these tables
  - Rename of `document.raw_content` to `document.text`
  - Rename of `entity.name` to `entity.title`
  - Rename `rank` to `combined_degree` in `create_final_relationships` and removal of `source_degree` and `target_degree` fields
  - Fixed community tables to use a proper UUID for the `id` field, and retain `community` and `human_readable_id` for the short IDs
  - Removal of all embeddings columns from parquet files in favor of direct vector store writes

### Migration

- Run a new index, leveraging existing cache.

## New required Embeddings

### Change

- Added new required embeddings for `DRIFTSearch` and base RAG capabilities.

### Migration

- Run a new index, leveraging existing cache.

## Vector Store required by default

### Change

- Vector store is now required by default for all search methods.

### Migration

- Run graphrag init command to generate a new settings.yaml file with the vector store configuration.
- Run a new index, leveraging existing cache.

## Deprecate timestamp paths

### Change

- Remove support for timestamp paths, those using `${timestamp}` directory nesting.
- Use the same directory for storage output and reporting output.

### Migration

- Ensure output directories no longer use `${timestamp}` directory nesting.

**Using Environment Variables**

- Ensure `GRAPHRAG_STORAGE_BASE_DIR` is set to a static directory, e.g., `output` instead of `output/${timestamp}/artifacts`.
- Ensure `GRAPHRAG_REPORTING_BASE_DIR` is set to a static directory, e.g., `output` instead of `output/${timestamp}/reports`

[Full docs on using environment variables for configuration](https://microsoft.github.io/graphrag/config/env_vars/).

**Using Configuration File**

```yaml
# rest of settings.yaml file
# ...

storage:
  type: file
  base_dir: "output" # changed from "output/${timestamp}/artifacts"

reporting:
  type: file
  base_dir: "output" # changed from "output/${timestamp}/reports"
```

[Full docs on using YAML files for configuration](https://microsoft.github.io/graphrag/config/yaml/).