Default Configuration Mode (using YAML/JSON)
The default configuration mode may be configured by using a settings.yml or settings.json file in the data project root. If a .env file is present along with this config file, then it will be loaded, and the environment variables defined therein will be available for token replacements in your configuration document using ${ENV_VAR} syntax. We initialize with YML by default in graphrag init but you may use the equivalent JSON form if preferred.
Many of these config values have defaults. Rather than replicate them here, please refer to the constants in the code directly.
For example:
# .env
GRAPHRAG_API_KEY=some_api_key

# settings.yml
llm:
  api_key: ${GRAPHRAG_API_KEY}
Config Sections
Indexing
models
This is a dict of model configurations. The dict key is used to reference this configuration elsewhere when a model instance is desired. In this way, you can specify as many different models as you need, and reference them differentially in the workflow steps.
For example:
models:
  default_chat_model:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_chat
    model: gpt-4o
    model_supports_json: true
  default_embedding_model:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: text-embedding-ada-002
Fields
- api_key str - The OpenAI API key to use.
- type openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding - The type of LLM to use.
- model str - The model name.
- encoding_model str - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- max_tokens int - The maximum number of output tokens.
- request_timeout float - The per-request timeout.
- api_base str - The API base url to use.
- api_version str - The API version.
- organization str - The client organization.
- proxy str - The proxy URL to use.
- azure_auth_type api_key|managed_identity - If using Azure, please indicate how you want to authenticate requests.
- audience str - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if api_key is not defined. Default=https://cognitiveservices.azure.com/.default
- deployment_name str - The deployment name to use (Azure).
- model_supports_json bool - Whether the model supports JSON-mode output.
- tokens_per_minute int - Set a leaky-bucket throttle on tokens-per-minute.
- requests_per_minute int - Set a leaky-bucket throttle on requests-per-minute.
- max_retries int - The maximum number of retries to use.
- max_retry_wait float - The maximum backoff time.
- sleep_on_rate_limit_recommendation bool - Whether to adhere to sleep recommendations (Azure).
- concurrent_requests int - The number of open requests to allow at once.
- temperature float - The temperature to use.
- top_p float - The top-p value to use.
- n int - The number of completions to generate.
- parallelization_stagger float - The threading stagger value.
- parallelization_num_threads int - The maximum number of work threads.
- async_mode asyncio|threaded - The async mode to use. Either asyncio or threaded.
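As a rough illustration of the Azure-specific fields, a models entry that authenticates with a managed identity might look like the sketch below. The endpoint, API version, deployment name, and throttle values are placeholders chosen for this example, not defaults.

```yaml
models:
  default_chat_model:
    type: azure_openai_chat
    model: gpt-4o
    # Placeholder endpoint and deployment; substitute your own resource.
    api_base: "https://<your-resource>.openai.azure.com"
    api_version: "2024-02-15-preview"
    deployment_name: "<your-deployment>"
    # No api_key is set, so a token is requested for the audience below.
    azure_auth_type: managed_identity
    audience: "https://cognitiveservices.azure.com/.default"
    model_supports_json: true
    # Illustrative throttle and retry settings only.
    tokens_per_minute: 50000
    requests_per_minute: 500
    max_retries: 5
```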
embed_text
By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be customized by setting the target and names fields.
Supported embeddings names are:
- text_unit.text
- document.text
- entity.title
- entity.description
- relationship.description
- community.title
- community.summary
- community.full_content
Fields
- model_id str - Name of the model definition to use for text embedding.
- batch_size int - The maximum batch size to use.
- batch_max_tokens int - The maximum batch # of tokens.
- target required|all|selected|none - Determines which set of embeddings to export.
- names list[str] - If target=selected, this should be an explicit list of the embeddings names we support.
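A minimal sketch of exporting only a chosen subset of embeddings, assuming a model named default_embedding_model is defined under models. The batch limits are illustrative values, not the shipped defaults.

```yaml
embed_text:
  model_id: default_embedding_model
  # Illustrative batch limits.
  batch_size: 16
  batch_max_tokens: 8000
  # With target=selected, only the names listed below are exported.
  target: selected
  names:
    - entity.description
    - community.full_content
    - text_unit.text
```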
vector_store
Where to put all vectors for the system. Configured for lancedb by default.
Fields
- type str - lancedb or azure_ai_search. Default=lancedb
- db_uri str (only for lancedb) - The database uri. Default=storage.base_dir/lancedb
- url str (only for AI Search) - AI Search endpoint.
- api_key str (optional - only for AI Search) - The AI Search api key to use.
- audience str (only for AI Search) - Audience for managed identity token if managed identity authentication is used.
- overwrite bool (only used at index creation time) - Overwrite collection if it exists. Default=True
- container_name str - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=default
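For comparison with the lancedb default, an Azure AI Search configuration could be sketched as follows. This assumes the fields sit directly under vector_store as listed above; the endpoint and the environment variable name are hypothetical.

```yaml
vector_store:
  type: azure_ai_search
  # Placeholder endpoint; substitute your own search service.
  url: "https://<your-search-service>.search.windows.net"
  # Hypothetical env var name, resolved from .env via ${...} replacement.
  api_key: ${AZURE_AI_SEARCH_API_KEY}
  container_name: default
  overwrite: true
```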
input
Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. In general, CSV-based data provides the most customizability. Each CSV should at least contain a text field. You can use the metadata list to specify additional columns from the CSV to include as headers in each text chunk, so that document-level attributes are repeated in every chunk and stay visible to the LLM.
Fields
- type file|blob - The input type to use. Default=file
- file_type text|csv - The type of input data to load. Either text or csv. Default is text
- base_dir str - The base directory to read input from, relative to the root.
- connection_string str - (blob only) The Azure Storage connection string.
- storage_account_blob_url str - The storage account blob URL to use.
- container_name str - (blob only) The Azure Storage container name.
- file_encoding str - The encoding of the input file. Default is utf-8
- file_pattern str - A regex to match input files. Default is .*\.csv$ if in csv mode and .*\.txt$ if in text mode.
- file_filter dict - Key/value pairs to filter. Default is None.
- text_column str - (CSV Mode Only) The text column name.
- metadata list[str] - (CSV Mode Only) The additional document attributes to include.
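As an illustration of CSV ingestion with metadata, the following sketch assumes a CSV whose columns are named body, title, and author. Those column names and the input directory are hypothetical.

```yaml
input:
  type: file
  file_type: csv
  base_dir: input
  file_encoding: utf-8
  # Single quotes keep the backslash literal in YAML.
  file_pattern: '.*\.csv$'
  # Hypothetical column names from the example CSV.
  text_column: body
  metadata:
    - title
    - author
```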
chunks
These settings configure how we parse documents into text chunks. Chunking is necessary because very large documents may not fit into a single context window, and chunk size can be tuned to modulate graph extraction accuracy. Also note the metadata setting in the input document config, which will replicate document metadata into each chunk.
Fields
- size int - The max chunk size in tokens.
- overlap int - The chunk overlap in tokens.
- group_by_columns list[str] - Group documents by fields before chunking.
- encoding_model str - The text encoding model to use for splitting on token boundaries.
- prepend_metadata bool - Determines if metadata values should be added at the beginning of each chunk. Default=False.
- chunk_size_includes_metadata bool - Specifies whether the chunk size calculation should include metadata tokens. Default=False.
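A short sketch of a chunking setup that prepends metadata and counts it against the chunk budget. The size, overlap, and grouping column are illustrative values picked for this example.

```yaml
chunks:
  # Illustrative sizes, in tokens.
  size: 1200
  overlap: 100
  # Hypothetical grouping column.
  group_by_columns: [id]
  # Put metadata values at the start of each chunk...
  prepend_metadata: true
  # ...and include those tokens when measuring chunk size.
  chunk_size_includes_metadata: true
```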
cache
This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.
Fields
- type file|memory|none|blob - The cache type to use. Default=file
- connection_string str - (blob only) The Azure Storage connection string.
- container_name str - (blob only) The Azure Storage container name.
- base_dir str - The base directory to write cache to, relative to the root.
- storage_account_blob_url str - The storage account blob URL to use.
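For instance, a blob-backed cache might be sketched like this. The environment variable name and the container name are hypothetical placeholders.

```yaml
cache:
  type: blob
  # Hypothetical env var name, resolved from .env via ${...} replacement.
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
  # Placeholder container name.
  container_name: graphrag-cache
  base_dir: cache
```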
output
This section controls the storage mechanism the pipeline uses for exporting output tables.
Fields
- type file|memory|blob - The storage type to use. Default=file
- connection_string str - (blob only) The Azure Storage connection string.
- container_name str - (blob only) The Azure Storage container name.
- base_dir str - The base directory to write output artifacts to, relative to the root.
- storage_account_blob_url str - The storage account blob URL to use.
update_index_storage
This section defines a secondary storage location for running incremental indexing, to preserve your original outputs.
Fields
- type file|memory|blob - The storage type to use. Default=file
- connection_string str - (blob only) The Azure Storage connection string.
- container_name str - (blob only) The Azure Storage container name.
- base_dir str - The base directory to write output artifacts to, relative to the root.
- storage_account_blob_url str - The storage account blob URL to use.
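To show how the primary and secondary locations relate, here is a minimal sketch that keeps original outputs and incremental-update outputs in separate directories. Both directory names are assumptions for this example.

```yaml
output:
  type: file
  # Original index artifacts are written here.
  base_dir: output
update_index_storage:
  type: file
  # Incremental indexing writes here, leaving the original outputs untouched.
  base_dir: update_output
```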
reporting
This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
Fields
- type file|console|blob - The reporting type to use. Default=file
- connection_string str - (blob only) The Azure Storage connection string.
- container_name str - (blob only) The Azure Storage container name.
- base_dir str - The base directory to write reports to, relative to the root.
- storage_account_blob_url str - The storage account blob URL to use.
extract_graph
Fields
- model_id str - Name of the model definition to use for API calls.
- prompt str - The prompt file to use.
- entity_types list[str] - The entity types to identify.
- max_gleanings int - The maximum number of gleaning cycles to use.
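A possible extract_graph block, assuming a chat model named default_chat_model is defined under models. The prompt path, entity types, and gleaning count are illustrative choices.

```yaml
extract_graph:
  model_id: default_chat_model
  # Assumed prompt location relative to the project root.
  prompt: prompts/extract_graph.txt
  # Illustrative entity types; tune these for your domain.
  entity_types: [organization, person, geo, event]
  max_gleanings: 1
```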
summarize_descriptions
Fields
- model_id str - Name of the model definition to use for API calls.
- prompt str - The prompt file to use.
- max_length int - The maximum number of output tokens per summarization.
extract_graph_nlp
Defines settings for NLP-based graph extraction methods.
Fields
- normalize_edge_weights bool - Whether to normalize the edge weights during graph construction. Default=True.
- text_analyzer dict - Parameters for the NLP model.
  - extractor_type regex_english|syntactic_parser|cfg - Default=regex_english.
  - model_name str - Name of NLP model (for SpaCy-based models)
  - max_word_length int - Longest word to allow. Default=15.
  - word_delimiter str - Delimiter to split words. Default ' '.
  - include_named_entities bool - Whether to include named entities in noun phrases. Default=True.
  - exclude_nouns list[str] | None - List of nouns to exclude. If None, we use an internal stopword list.
  - exclude_entity_tags list[str] - List of entity tags to ignore.
  - exclude_pos_tags list[str] - List of part-of-speech tags to ignore.
  - noun_phrase_tags list[str] - List of noun phrase tags to ignore.
  - noun_phrase_grammars dict[str, str] - Noun phrase grammars for the model (cfg-only).
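Because text_analyzer is a nested dict, its options sit one level below extract_graph_nlp. The sketch below assumes the syntactic_parser extractor with the SpaCy model en_core_web_md; the model choice is an assumption for illustration.

```yaml
extract_graph_nlp:
  normalize_edge_weights: true
  text_analyzer:
    extractor_type: syntactic_parser
    # Assumed SpaCy model; it must be installed separately.
    model_name: en_core_web_md
    max_word_length: 15
    word_delimiter: " "
    include_named_entities: true
    # null falls back to the internal stopword list.
    exclude_nouns: null
```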
extract_claims
Fields
- enabled bool - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- model_id str - Name of the model definition to use for API calls.
- prompt str - The prompt file to use.
- description str - Describes the types of claims we want to extract.
- max_gleanings int - The maximum number of gleaning cycles to use.
community_reports
Fields
- model_id str - Name of the model definition to use for API calls.
- prompt str - The prompt file to use.
- max_length int - The maximum number of output tokens per report.
- max_input_length int - The maximum number of input tokens to use when generating reports.
prune_graph
Parameters for manual graph pruning. This can be used to optimize the modularity of your graph clusters, by removing overly-connected or rare nodes.
Fields
- min_node_freq int - The minimum node frequency to allow.
- max_node_freq_std float | None - The maximum standard deviation of node frequency to allow.
- min_node_degree int - The minimum node degree to allow.
- max_node_degree_std float | None - The maximum standard deviation of node degree to allow.
- min_edge_weight_pct int - The minimum edge weight percentile to allow.
- remove_ego_nodes bool - Remove ego nodes.
- lcc_only bool - Only use largest connected component.
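A pruning configuration could be sketched as below. All thresholds are illustrative, and setting the two optional std fields to null is assumed here to leave those limits unset.

```yaml
prune_graph:
  # Illustrative thresholds; tune them against your own graph.
  min_node_freq: 2
  max_node_freq_std: null
  min_node_degree: 1
  max_node_degree_std: null
  min_edge_weight_pct: 40
  remove_ego_nodes: false
  lcc_only: false
```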
cluster_graph
These are the settings used for Leiden hierarchical clustering of the graph to create communities.
Fields
- max_cluster_size int - The maximum cluster size to export.
- use_lcc bool - Whether to only use the largest connected component.
- seed int - A randomization seed to provide if consistent run-to-run results are desired. We do provide a default in order to guarantee clustering stability.
embed_graph
We use node2vec to embed the graph. This is primarily used for visualization, so it is not turned on by default. However, if you do prefer to embed the graph for secondary analysis, you can turn this on and we will persist the embeddings to your configured vector store.
Fields
- enabled bool - Whether to enable graph embeddings.
- num_walks int - The node2vec number of walks.
- walk_length int - The node2vec walk length.
- window_size int - The node2vec window size.
- iterations int - The node2vec number of iterations.
- random_seed int - The node2vec random seed.
- strategy dict - Fully override the embed graph strategy.
umap
Indicates whether we should run UMAP dimensionality reduction. This is used to provide an x/y coordinate to each graph node, suitable for visualization. If this is not enabled, nodes will receive a 0/0 x/y coordinate. If this is enabled, you must enable graph embedding as well.
Fields
- enabled bool - Whether to enable UMAP layouts.
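Since UMAP layouts depend on graph embeddings, the two sections are typically enabled together. The node2vec values in this sketch are illustrative, not the shipped defaults.

```yaml
embed_graph:
  # Must be enabled for UMAP to have embeddings to reduce.
  enabled: true
  # Illustrative node2vec parameters.
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 42
umap:
  enabled: true
```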
snapshots
Fields
- embeddings bool - Export embeddings snapshots to parquet.
- graphml bool - Export graph snapshots to GraphML.
Query
local_search
Fields
- chat_model_id str - Name of the model definition to use for Chat Completion calls.
- embedding_model_id str - Name of the model definition to use for Embedding calls.
- prompt str - The prompt file to use.
- text_unit_prop float - The text unit proportion.
- community_prop float - The community proportion.
- conversation_history_max_turns int - The conversation history maximum turns.
- top_k_entities int - The top k mapped entities.
- top_k_relationships int - The top k mapped relations.
- temperature float | None - The temperature to use for token generation.
- top_p float | None - The top-p value to use for token generation.
- n int | None - The number of completions to generate.
- max_tokens int - The maximum tokens.
- llm_max_tokens int - The LLM maximum tokens.
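A local_search block might reference the model definitions by key, as in this sketch. It assumes models named default_chat_model and default_embedding_model exist; the prompt path and all numeric values are illustrative.

```yaml
local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  # Assumed prompt location relative to the project root.
  prompt: prompts/local_search_system_prompt.txt
  # Illustrative context-budget proportions and limits.
  text_unit_prop: 0.5
  community_prop: 0.25
  top_k_entities: 10
  top_k_relationships: 10
  max_tokens: 12000
```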
global_search
Fields
- chat_model_id str - Name of the model definition to use for Chat Completion calls.
- map_prompt str | None - The mapper prompt file to use.
- reduce_prompt str | None - The reducer prompt file to use.
- knowledge_prompt str | None - The knowledge prompt file to use.
- temperature float | None - The temperature to use for token generation.
- top_p float | None - The top-p value to use for token generation.
- n int | None - The number of completions to generate.
- max_tokens int - The maximum context size in tokens.
- data_max_tokens int - The data llm maximum tokens.
- map_max_tokens int - The map llm maximum tokens.
- reduce_max_tokens int - The reduce llm maximum tokens.
- concurrency int - The number of concurrent requests.
- dynamic_search_llm str - LLM model to use for dynamic community selection.
- dynamic_search_threshold int - Rating threshold to include a community report.
- dynamic_search_keep_parent bool - Keep parent community if any of the child communities are relevant.
- dynamic_search_num_repeats int - Number of times to rate the same community report.
- dynamic_search_use_summary bool - Use community summary instead of full_context.
- dynamic_search_concurrent_coroutines int - Number of concurrent coroutines to rate community reports.
- dynamic_search_max_level int - The maximum level of community hierarchy to consider if none of the processed communities are relevant.
drift_search
Fields
- chat_model_id str - Name of the model definition to use for Chat Completion calls.
- embedding_model_id str - Name of the model definition to use for Embedding calls.
- prompt str - The prompt file to use.
- reduce_prompt str - The reducer prompt file to use.
- temperature float - The temperature to use for token generation.
- top_p float - The top-p value to use for token generation.
- n int - The number of completions to generate.
- max_tokens int - The maximum context size in tokens.
- data_max_tokens int - The data llm maximum tokens.
- concurrency int - The number of concurrent requests.
- drift_k_followups int - The number of top global results to retrieve.
- primer_folds int - The number of folds for search priming.
- primer_llm_max_tokens int - The maximum number of tokens for the LLM in primer.
- n_depth int - The number of drift search steps to take.
- local_search_text_unit_prop float - The proportion of search dedicated to text units.
- local_search_community_prop float - The proportion of search dedicated to community properties.
- local_search_top_k_mapped_entities int - The number of top K entities to map during local search.
- local_search_top_k_relationships int - The number of top K relationships to map during local search.
- local_search_max_data_tokens int - The maximum context size in tokens for local search.
- local_search_temperature float - The temperature to use for token generation in local search.
- local_search_top_p float - The top-p value to use for token generation in local search.
- local_search_n int - The number of completions to generate in local search.
- local_search_llm_max_gen_tokens int - The maximum number of generated tokens for the LLM in local search.
basic_search
Fields
- chat_model_id str - Name of the model definition to use for Chat Completion calls.
- embedding_model_id str - Name of the model definition to use for Embedding calls.
- prompt str - The prompt file to use.
- text_unit_prop float - The text unit proportion.
- community_prop float - The community proportion.
- conversation_history_max_turns int - The conversation history maximum turns.
- top_k_entities int - The top k mapped entities.
- top_k_relationships int - The top k mapped relations.
- temperature float | None - The temperature to use for token generation.
- top_p float | None - The top-p value to use for token generation.
- n int | None - The number of completions to generate.
- max_tokens int - The maximum tokens.
- llm_max_tokens int - The LLM maximum tokens.
workflows
list[str] - This is a list of workflow names to run, in order. GraphRAG has built-in pipelines to configure this, but you can run exactly and only what you want by specifying the list here. Useful if you have done part of the processing yourself.