mirror of
https://github.com/microsoft/graphrag.git
synced 2025-03-11 01:26:14 +03:00
8.0 KiB
8.0 KiB
title, navtitle, tags, layout, date
| title | navtitle | tags | layout | date | |
|---|---|---|---|---|---|
| Default Configuration Mode (using JSON/YAML) | Using JSON or YAML |
|
page | 2023-01-03 |
The default configuration mode may be configured by using a config.json or config.yml file in the data project root. If a .env file is present along with this config file, then it will be loaded, and the environment variables defined therein will be available for token replacements in your configuration document using ${ENV_VAR} syntax.
For example:
# .env
API_KEY=some_api_key
# config.json
{
"llm": {
"api_key": "${API_KEY}"
}
}
Config Sections
input
Fields
typefile|blob - The input type to use. Default=filefile_typetext|csv - The type of input data to load. Eithertextorcsv. Default istextfile_encodingstr - The encoding of the input file. Default isutf-8file_patternstr - A regex to match input files. Default is.*\.csv$if in csv mode and.*\.txt$if in text mode.source_columnstr - (CSV Mode Only) The source column name.timestamp_columnstr - (CSV Mode Only) The timestamp column name.timestamp_formatstr - (CSV Mode Only) The source format.text_columnstr - (CSV Mode Only) The text column name.title_columnstr - (CSV Mode Only) The title column name.document_attribute_columnslist[str] - (CSV Mode Only) The additional document attributes to include.connection_stringstr - (blob only) The Azure Storage connection string.container_namestr - (blob only) The Azure Storage container name.base_dirstr - The base directory to read input from, relative to the root.storage_account_blob_urlstr - The storage account blob URL to use.
llm
This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.
Fields
api_keystr - The OpenAI API key to use.typeopenai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding - The type of LLM to use.modelstr - The model name.max_tokensint - The maximum number of output tokens.request_timeoutfloat - The per-request timeout.api_basestr - The API base url to use.api_versionstr - The API versionorganizationstr - The client organization.proxystr - The proxy URL to use.cognitive_services_endpointstr - The url endpoint for cognitive services.deployment_namestr - The deployment name to use (Azure).model_supports_jsonbool - Whether the model supports JSON-mode output.tokens_per_minuteint - Set a leaky-bucket throttle on tokens-per-minute.requests_per_minuteint - Set a leaky-bucket throttle on requests-per-minute.max_retriesint - The maximum number of retries to use.max_retry_waitfloat - The maximum backoff time.sleep_on_rate_limit_recommendationbool - Whether to adhere to sleep recommendations (Azure).concurrent_requestsint The number of open requests to allow at once.
parallelization
Fields
staggerfloat - The threading stagger value.num_threadsint - The maximum number of work threads.
async_mode
asyncio|threaded The async mode to use. Either asyncio or `threaded.
embeddings
Fields
llm(see LLM top-level config)parallelization(see Parallelization top-level config)async_mode(see Async Mode top-level config)batch_sizeint - The maximum batch size to use.batch_max_tokensint - The maximum batch #-tokens.targetrequired|all - Determines which set of embeddings to emit.skiplist[str] - Which embeddings to skip.strategydict - Fully override the text-embedding strategy.
chunks
Fields
sizeint - The max chunk size in tokens.overlapint - The chunk overlap in tokens.group_by_columnslist[str] - group documents by fields before chunking.strategydict - Fully override the chunking strategy.
cache
Fields
typefile|memory|none|blob - The cache type to use. Default=fileconnection_stringstr - (blob only) The Azure Storage connection string.container_namestr - (blob only) The Azure Storage container name.base_dirstr - The base directory to write cache to, relative to the root.storage_account_blob_urlstr - The storage account blob URL to use.
storage
Fields
typefile|memory|blob - The storage type to use. Default=fileconnection_stringstr - (blob only) The Azure Storage connection string.container_namestr - (blob only) The Azure Storage container name.base_dirstr - The base directory to write reports to, relative to the root.storage_account_blob_urlstr - The storage account blob URL to use.
reporting
Fields
typefile|console|blob - The reporting type to use. Default=fileconnection_stringstr - (blob only) The Azure Storage connection string.container_namestr - (blob only) The Azure Storage container name.base_dirstr - The base directory to write reports to, relative to the root.storage_account_blob_urlstr - The storage account blob URL to use.
entity_extraction
Fields
llm(see LLM top-level config)parallelization(see Parallelization top-level config)async_mode(see Async Mode top-level config)promptstr - The prompt file to use.entity_typeslist[str] - The entity types to identify.max_gleaningsint - The maximum number of gleaning cycles to use.strategydict - Fully override the entity extraction strategy.
summarize_descriptions
Fields
llm(see LLM top-level config)parallelization(see Parallelization top-level config)async_mode(see Async Mode top-level config)promptstr - The prompt file to use.max_lengthint - The maximum number of output tokens per summarization.strategydict - Fully override the summarize description strategy.
claim_extraction
Fields
enabledbool - Whether to enable claim extraction. default=Falsellm(see LLM top-level config)parallelization(see Parallelization top-level config)async_mode(see Async Mode top-level config)promptstr - The prompt file to use.descriptionstr - Describes the types of claims we want to extract.max_gleaningsint - The maximum number of gleaning cycles to use.strategydict - Fully override the claim extraction strategy.
community_reports
Fields
llm(see LLM top-level config)parallelization(see Parallelization top-level config)async_mode(see Async Mode top-level config)promptstr - The prompt file to use.max_lengthint - The maximum number of output tokens per report.max_input_lengthint - The maximum number of input tokens to use when generating reports.strategydict - Fully override the community reports strategy.
cluster_graph
Fields
max_cluster_sizeint - The maximum cluster size to emit.strategydict - Fully override the cluster_graph strategy.
embed_graph
Fields
enabledbool - Whether to enable graph embeddings.num_walksint - The node2vec number of walks.walk_lengthint - The node2vec walk length.window_sizeint - The node2vec window size.iterationsint - The node2vec number of iterations.random_seedint - The node2vec random seed.strategydict - Fully override the embed graph strategy.
umap
Fields
enabledbool - Whether to enable UMAP layouts.
snapshots
Fields
graphmlbool - Emit graphml snapshots.raw_entitiesbool - Emit raw entity snapshots.top_level_nodesbool - Emit top-level-node snapshots.
encoding_model
str - The text encoding model to use. Default is cl100k_base.
skip_workflows
list[str] - Which workflow names to skip.