alihan/graphrag-microsoft

mirror of https://github.com/microsoft/graphrag.git synced 2025-03-11 01:26:14 +03:00

Alonso Guevara 81b81cf60b Initial Release

2024-07-01 15:25:30 -06:00

Config Sections

input

Fields

type file|blob - The input type to use. Default=file
file_type text|csv - The type of input data to load. Either text or csv. Default is text.
file_encoding str - The encoding of the input file. Default is utf-8
file_pattern str - A regex to match input files. Default is .*\.csv$ if in csv mode and .*\.txt$ if in text mode.
source_column str - (CSV Mode Only) The source column name.
timestamp_column str - (CSV Mode Only) The timestamp column name.
timestamp_format str - (CSV Mode Only) The format of the timestamp column.
text_column str - (CSV Mode Only) The text column name.
title_column str - (CSV Mode Only) The title column name.
document_attribute_columns list[str] - (CSV Mode Only) The additional document attributes to include.
connection_string str - (blob only) The Azure Storage connection string.
container_name str - (blob only) The Azure Storage container name.
base_dir str - The base directory to read input from, relative to the root.
storage_account_blob_url str - The storage account blob URL to use.
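
As a small illustration of the file_pattern defaults listed above, the following Python sketch filters a list of file names with the two default regexes. The helper name select_input_files and the sample file names are hypothetical. How GraphRAG itself applies the pattern (for example against a full path or a bare file name) is not stated here, so using re.search on the bare name is an assumption.

```python
import re

# Default patterns quoted from the field list above.
DEFAULT_PATTERNS = {
    "csv": r".*\.csv$",
    "text": r".*\.txt$",
}


def select_input_files(names, file_type="text", file_pattern=None):
    """Return the names that match file_pattern (hypothetical helper).

    Falls back to the documented default for the given file_type
    when no explicit pattern is supplied.
    """
    pattern = re.compile(file_pattern or DEFAULT_PATTERNS[file_type])
    return [name for name in names if pattern.search(name)]


names = ["notes.txt", "table.csv", "readme.md", "archive.txt.bak"]
text_files = select_input_files(names)                  # text mode default
csv_files = select_input_files(names, file_type="csv")  # csv mode default
print(text_files, csv_files)
```

Because both defaults end in $, a name such as archive.txt.bak is not picked up in text mode.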

llm

This is the base LLM configuration section. Other steps may override this configuration with their own LLM configuration.

Fields

api_key str - The OpenAI API key to use.
type openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding - The type of LLM to use.
model str - The model name.
max_tokens int - The maximum number of output tokens.
request_timeout float - The per-request timeout.
api_base str - The API base url to use.
api_version str - The API version.
organization str - The client organization.
proxy str - The proxy URL to use.
cognitive_services_endpoint str - The url endpoint for cognitive services.
deployment_name str - The deployment name to use (Azure).
model_supports_json bool - Whether the model supports JSON-mode output.
tokens_per_minute int - Set a leaky-bucket throttle on tokens-per-minute.
requests_per_minute int - Set a leaky-bucket throttle on requests-per-minute.
max_retries int - The maximum number of retries to use.
max_retry_wait float - The maximum backoff time.
sleep_on_rate_limit_recommendation bool - Whether to adhere to sleep recommendations (Azure).
concurrent_requests int - The number of open requests to allow at once.
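
The idea that a step can override the base llm section can be sketched in Python as a dictionary merge. This is only an illustration under stated assumptions: the function resolve_llm is hypothetical, the shallow-merge semantics are assumed rather than taken from GraphRAG, and the model names and numeric values are placeholders, not defaults.

```python
# Base LLM settings, using field names from the list above.
# All values are illustrative placeholders.
base_llm = {
    "type": "openai_chat",
    "model": "example-chat-model",
    "max_retries": 10,
    "concurrent_requests": 25,
}

# A step-level section that carries its own partial llm block.
entity_extraction = {
    "llm": {"model": "example-chat-model-large", "max_tokens": 2000},
}


def resolve_llm(base, section):
    """Shallow-merge a section's llm block over the base (assumed semantics)."""
    return {**base, **section.get("llm", {})}


resolved = resolve_llm(base_llm, entity_extraction)
print(resolved)
```

In this sketch, keys the step does not set (such as type and max_retries) fall through from the base section, while model is replaced and max_tokens is added.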

parallelization

Fields

stagger float - The threading stagger value.
num_threads int - The maximum number of work threads.

async_mode

asyncio|threaded - The async mode to use. Either asyncio or threaded.
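
To make the two async_mode values concrete, here is a generic Python sketch that runs the same work either on an asyncio event loop or on a thread pool. It is not GraphRAG's implementation: the names work and run_all are hypothetical, and tying num_threads to the pool size is an assumption made for illustration.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def work(x):
    """Stand-in for one unit of pipeline work (hypothetical)."""
    return x * x


async def _run_asyncio(items):
    async def task(x):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return work(x)

    # gather preserves the order of its arguments in the result list.
    return await asyncio.gather(*(task(x) for x in items))


def run_all(items, async_mode="asyncio", num_threads=4):
    """Dispatch on async_mode; both branches return results in input order."""
    if async_mode == "asyncio":
        return list(asyncio.run(_run_asyncio(items)))
    if async_mode == "threaded":
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            return list(pool.map(work, items))
    raise ValueError(f"unknown async_mode: {async_mode!r}")


print(run_all([1, 2, 3], "asyncio"))
print(run_all([1, 2, 3], "threaded"))
```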

embeddings

Fields

llm (see LLM top-level config)
parallelization (see Parallelization top-level config)
async_mode (see Async Mode top-level config)
batch_size int - The maximum batch size to use.
batch_max_tokens int - The maximum number of tokens per batch.
target required|all - Determines which set of embeddings to emit.
skip list[str] - Which embeddings to skip.
strategy dict - Fully override the text-embedding strategy.
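
The interaction between batch_size and batch_max_tokens can be sketched as a greedy batching loop. This is a minimal illustration, not GraphRAG's actual batching code: make_batches and count_words are hypothetical names, and a whitespace word count stands in for a real tokenizer.

```python
def count_words(text):
    """Crude token count: whitespace-separated words (assumption)."""
    return len(text.split())


def make_batches(texts, batch_size, batch_max_tokens, count_tokens=count_words):
    """Greedily pack texts so no batch exceeds either limit.

    A single text larger than batch_max_tokens still gets a batch of its own.
    """
    batches, current, current_tokens = [], [], 0
    for text in texts:
        n = count_tokens(text)
        full = len(current) >= batch_size
        too_long = current_tokens + n > batch_max_tokens
        # Close the current batch if adding this text would break a limit.
        if current and (full or too_long):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


texts = ["a b c", "d e", "f g h i", "j", "k l"]
batches = make_batches(texts, batch_size=3, batch_max_tokens=5)
print(batches)
```

With these inputs the token limit, not the item limit, closes each batch: the first two texts already total five words.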

chunks

Fields

size int - The max chunk size in tokens.
overlap int - The chunk overlap in tokens.
group_by_columns list[str] - Group documents by these fields before chunking.
strategy dict - Fully override the chunking strategy.
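
How size and overlap relate can be shown with a sliding window over a token sequence. The sketch below is a simplified illustration under the assumption that consecutive chunks share exactly overlap tokens; chunk_tokens is a hypothetical helper, and integers stand in for real token IDs.

```python
def chunk_tokens(tokens, size, overlap):
    """Split tokens into windows of at most `size`, sharing `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


tokens = list(range(10))
chunks = chunk_tokens(tokens, size=4, overlap=1)
print(chunks)
```

Each chunk here begins with the last token of the previous one, which is what an overlap of 1 means in this sketch.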

cache

Fields

type file|memory|none|blob - The cache type to use. Default=file
connection_string str - (blob only) The Azure Storage connection string.
container_name str - (blob only) The Azure Storage container name.
base_dir str - The base directory to write cache to, relative to the root.
storage_account_blob_url str - The storage account blob URL to use.

storage

Fields

type file|memory|blob - The storage type to use. Default=file
connection_string str - (blob only) The Azure Storage connection string.
container_name str - (blob only) The Azure Storage container name.
base_dir str - The base directory to write output artifacts to, relative to the root.
storage_account_blob_url str - The storage account blob URL to use.

reporting

Fields

type file|console|blob - The reporting type to use. Default=file
connection_string str - (blob only) The Azure Storage connection string.
container_name str - (blob only) The Azure Storage container name.
base_dir str - The base directory to write reports to, relative to the root.
storage_account_blob_url str - The storage account blob URL to use.

entity_extraction

Fields

llm (see LLM top-level config)
parallelization (see Parallelization top-level config)
async_mode (see Async Mode top-level config)
prompt str - The prompt file to use.
entity_types list[str] - The entity types to identify.
max_gleanings int - The maximum number of gleaning cycles to use.
strategy dict - Fully override the entity extraction strategy.

summarize_descriptions

Fields

llm (see LLM top-level config)
parallelization (see Parallelization top-level config)
async_mode (see Async Mode top-level config)
prompt str - The prompt file to use.
max_length int - The maximum number of output tokens per summarization.
strategy dict - Fully override the summarize description strategy.

claim_extraction

Fields

enabled bool - Whether to enable claim extraction. Default=False
llm (see LLM top-level config)
parallelization (see Parallelization top-level config)
async_mode (see Async Mode top-level config)
prompt str - The prompt file to use.
description str - Describes the types of claims we want to extract.
max_gleanings int - The maximum number of gleaning cycles to use.
strategy dict - Fully override the claim extraction strategy.

community_reports

Fields

llm (see LLM top-level config)
parallelization (see Parallelization top-level config)
async_mode (see Async Mode top-level config)
prompt str - The prompt file to use.
max_length int - The maximum number of output tokens per report.
max_input_length int - The maximum number of input tokens to use when generating reports.
strategy dict - Fully override the community reports strategy.

cluster_graph

Fields

max_cluster_size int - The maximum cluster size to emit.
strategy dict - Fully override the cluster_graph strategy.

embed_graph

Fields

enabled bool - Whether to enable graph embeddings.
num_walks int - The node2vec number of walks.
walk_length int - The node2vec walk length.
window_size int - The node2vec window size.
iterations int - The node2vec number of iterations.
random_seed int - The node2vec random seed.
strategy dict - Fully override the embed graph strategy.
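
To give a feel for num_walks, walk_length, and random_seed, the following Python sketch generates uniform random walks over a small adjacency list. It is a deliberate simplification: node2vec proper uses biased second-order walks and then trains a skip-gram model on them (which is where window_size and iterations come in), and both of those parts are omitted. The function random_walks and the toy graph are hypothetical.

```python
import random


def random_walks(adjacency, num_walks, walk_length, random_seed):
    """Generate num_walks uniform random walks starting from every node."""
    rng = random.Random(random_seed)  # seeded for reproducible walks
    walks = []
    for _ in range(num_walks):
        for start in sorted(adjacency):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
walks = random_walks(graph, num_walks=2, walk_length=4, random_seed=42)
print(len(walks), walks[0])
```

Reusing the same random_seed reproduces the same walks, which is the usual reason for exposing a seed in this kind of setting.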

umap

Fields

enabled bool - Whether to enable UMAP layouts.

snapshots

Fields

graphml bool - Emit graphml snapshots.
raw_entities bool - Emit raw entity snapshots.
top_level_nodes bool - Emit top-level-node snapshots.

encoding_model

str - The text encoding model to use. Default is cl100k_base.

skip_workflows

list[str] - Which workflow names to skip.
