By default, the GraphRAG indexer will only emit embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be generated by setting the GRAPHRAG_EMBEDDING_TARGET environment variable to all.
If the embedding target is all, and you want to only embed a subset of these fields, you may specify which embeddings to skip using the GRAPHRAG_EMBEDDING_SKIP argument described below.
Embedded Fields
text_unit.text
document.raw_content
entity.name
entity.description
relationship.description
community.title
community.summary
community.full_content
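The interaction between the two variables can be pictured with a small sketch. It assumes that GRAPHRAG_EMBEDDING_SKIP holds a comma-separated list of field names, that any target value other than all means "required only", and that the caller supplies the set of required fields. None of these details is confirmed by the text above, and select_embedded_fields is a hypothetical helper, not part of GraphRAG.

```python
ALL_FIELDS = [
    "text_unit.text",
    "document.raw_content",
    "entity.name",
    "entity.description",
    "relationship.description",
    "community.title",
    "community.summary",
    "community.full_content",
]


def select_embedded_fields(env, required_fields):
    """Return the fields to embed (hypothetical helper, not GraphRAG's own code)."""
    # Assumption: the default target value is called "required".
    target = env.get("GRAPHRAG_EMBEDDING_TARGET", "required").strip().lower()
    if target != "all":
        return [f for f in ALL_FIELDS if f in required_fields]
    # Assumption: the skip list is comma-separated.
    raw_skip = env.get("GRAPHRAG_EMBEDDING_SKIP", "")
    skip = {s.strip() for s in raw_skip.split(",") if s.strip()}
    return [f for f in ALL_FIELDS if f not in skip]


env = {
    "GRAPHRAG_EMBEDDING_TARGET": "all",
    "GRAPHRAG_EMBEDDING_SKIP": "entity.name, community.title",
}
fields = select_embedded_fields(env, required_fields=set())
```

With these example values, every field except entity.name and community.title is selected.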
Input Data
Our pipeline can ingest .csv or .txt data from an input folder. These files can be nested within subfolders. To configure how input data is handled, which fields are mapped over, and how timestamps are parsed, look for configuration values starting with GRAPHRAG_INPUT_ below. In general, CSV-based data provides the most customizability. Each CSV should contain at least a text field (which can be mapped with environment variables), but it is helpful if it also has title, timestamp, and source fields. Additional fields can be included as well; they will land as extra fields on the Document table.
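To make the CSV layout concrete, here is a minimal sketch of a file with the required text column, the three recommended columns, and one extra column. The sample rows, the author column, and the split_row helper are illustrative assumptions, not GraphRAG code.

```python
import csv
import io

# Hypothetical sample file: "text" is the required column; "title", "timestamp",
# and "source" are the recommended ones; "author" is an extra field.
SAMPLE_CSV = """text,title,timestamp,source,author
"GraphRAG builds a knowledge graph from text.",Intro,2024-01-15,blog,alice
"Communities are summarized into reports.",Reports,2024-02-01,wiki,bob
"""

KNOWN_COLUMNS = {"text", "title", "timestamp", "source"}


def split_row(row):
    """Separate the well-known columns from any extra fields in one CSV row."""
    if not row.get("text"):
        raise ValueError("each row needs a non-empty text field")
    known = {k: v for k, v in row.items() if k in KNOWN_COLUMNS}
    extra = {k: v for k, v in row.items() if k not in KNOWN_COLUMNS}
    return known, extra


rows = [split_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

In this sketch the extra dictionary stands in for the additional fields that would end up on the Document table.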
Base LLM Settings
These are the primary settings for configuring LLM connectivity.
| Parameter | Required? | Description | Type | Default Value |
| --- | --- | --- | --- | --- |
| GRAPHRAG_API_KEY | Yes for OpenAI. Optional for AOAI | The API key. (Note: OPENAI_API_KEY is also used as a fallback.) If not defined when using AOAI, managed identity will be used. | str | None |
| GRAPHRAG_API_BASE | For AOAI | The API base URL. | str | None |
| GRAPHRAG_API_VERSION | For AOAI | The AOAI API version. | str | None |
| GRAPHRAG_API_ORGANIZATION | | The AOAI organization. | str | None |
| GRAPHRAG_API_PROXY | | The AOAI proxy. | str | None |
Text Generation Settings
These settings control the text generation model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
| Parameter | Required? | Description | Type | Default Value |
| --- | --- | --- | --- | --- |
| GRAPHRAG_LLM_TYPE | For AOAI | The LLM operation type. Either openai_chat or azure_openai_chat. | str | openai_chat |
| GRAPHRAG_LLM_DEPLOYMENT_NAME | For AOAI | The AOAI model deployment name. | str | None |
| GRAPHRAG_LLM_API_KEY | Yes (uses fallback) | The API key. If not defined when using AOAI, managed identity will be used. | str | None |
| GRAPHRAG_LLM_API_BASE | For AOAI (uses fallback) | The API base URL. | str | None |
| GRAPHRAG_LLM_API_VERSION | For AOAI (uses fallback) | The AOAI API version. | str | None |
| GRAPHRAG_LLM_API_ORGANIZATION | For AOAI (uses fallback) | The AOAI organization. | str | None |
| GRAPHRAG_LLM_API_PROXY | | The AOAI proxy. | str | None |
| GRAPHRAG_LLM_MODEL | | The LLM model. | str | gpt-4-turbo-preview |
| GRAPHRAG_LLM_MAX_TOKENS | | The maximum number of tokens. | int | 4000 |
| GRAPHRAG_LLM_REQUEST_TIMEOUT | | The maximum number of seconds to wait for a response from the chat client. | int | 180 |
| GRAPHRAG_LLM_MODEL_SUPPORTS_JSON | | Indicates whether the given model supports JSON output mode. True to enable. | str | None |
| GRAPHRAG_LLM_THREAD_COUNT | | The number of threads to use for LLM parallelization. | int | 50 |
| GRAPHRAG_LLM_THREAD_STAGGER | | The time to wait (in seconds) between starting each thread. | float | 0.3 |
| GRAPHRAG_LLM_CONCURRENT_REQUESTS | | The number of concurrent requests to allow for the LLM client. | int | 25 |
| GRAPHRAG_LLM_TPM | | The number of tokens per minute to allow for the LLM client. 0 = Bypass | int | 0 |
| GRAPHRAG_LLM_RPM | | The number of requests per minute to allow for the LLM client. 0 = Bypass | int | 0 |
| GRAPHRAG_LLM_MAX_RETRIES | | The maximum number of retries to attempt when a request fails. | int | 10 |
| GRAPHRAG_LLM_MAX_RETRY_WAIT | | The maximum number of seconds to wait between retries. | int | 10 |
| GRAPHRAG_LLM_SLEEP_ON_RATE_LIMIT_RECOMMENDATION | | Whether to sleep on rate limit recommendation. (Azure Only) | bool | True |
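The "uses fallback" behavior noted for several of these settings can be sketched as a lookup chain. The precedence shown (the LLM-specific variable first, then the base variable, then OPENAI_API_KEY for the API key only) is an assumption drawn from the descriptions in this document, and resolve_llm_setting is a hypothetical helper rather than GraphRAG's actual loader.

```python
def resolve_llm_setting(env, name):
    """Look up GRAPHRAG_LLM_<name>, falling back to the base GRAPHRAG_<name>."""
    for key in (f"GRAPHRAG_LLM_{name}", f"GRAPHRAG_{name}"):
        value = env.get(key)
        if value:
            return value
    # The base API key is documented as also falling back to OPENAI_API_KEY.
    if name == "API_KEY":
        return env.get("OPENAI_API_KEY") or None
    return None


# Example environment with placeholder values.
example_env = {
    "GRAPHRAG_API_KEY": "base-key",
    "GRAPHRAG_API_BASE": "https://base.example.invalid",
    "GRAPHRAG_LLM_API_BASE": "https://llm.example.invalid",
}
api_key = resolve_llm_setting(example_env, "API_KEY")
api_base = resolve_llm_setting(example_env, "API_BASE")
```

Here the API key comes from the base setting because no LLM-specific key is defined, while the API base comes from the LLM-specific variable.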
Text Embedding Settings
These settings control the text embedding model used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
| Parameter | Required? | Description | Type | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_EMBEDDING_TYPE | For AOAI | The embedding client to use. Either openai_embedding or azure_openai_embedding. | str | openai_embedding |
| GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME | For AOAI | The AOAI deployment name. | str | None |
| GRAPHRAG_EMBEDDING_API_KEY | Yes (uses fallback) | The API key to use for the embedding client. If not defined when using AOAI, managed identity will be used. | str | None |
| GRAPHRAG_EMBEDDING_API_BASE | For AOAI (uses fallback) | The API base URL. | str | None |
| GRAPHRAG_EMBEDDING_API_VERSION | For AOAI (uses fallback) | The AOAI API version to use for the embedding client. | str | None |
| GRAPHRAG_EMBEDDING_API_ORGANIZATION | For AOAI (uses fallback) | The AOAI organization to use for the embedding client. | | |
| | | Whether to sleep on rate limit recommendation. (Azure Only) | bool | True |
Input Settings
These settings control the data input used by the pipeline. Any settings with a fallback will use the base LLM settings, if available.
Plaintext Input Data (GRAPHRAG_INPUT_FILE_TYPE=text)
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_INPUT_FILE_PATTERN | The file pattern regexp to use when reading input files from the input directory. | str | optional | .*\.txt$ |
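As a quick illustration of what the default pattern selects, the sketch below applies it to a few made-up file names. It assumes the pattern is matched against a path relative to the input directory using Python regular-expression semantics; the file names are hypothetical.

```python
import re

# Default GRAPHRAG_INPUT_FILE_PATTERN for plaintext input.
pattern = re.compile(r".*\.txt$")

# Hypothetical relative paths inside the input folder.
candidates = ["notes.txt", "reports/2024/summary.txt", "data.csv", "notes.txt.bak"]
matched = [name for name in candidates if pattern.match(name)]
```

Only the names ending in .txt are kept, including the one in a nested subfolder.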
CSV Input Data (GRAPHRAG_INPUT_FILE_TYPE=csv)
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_INPUT_TYPE | The input storage type to use when reading files. (file or blob) | str | optional | file |
| GRAPHRAG_INPUT_FILE_PATTERN | The file pattern regexp to use when reading input files from the input directory. | str | optional | .*\.csv$ |
| GRAPHRAG_INPUT_SOURCE_COLUMN | The 'source' column to use when reading CSV input files. | str | optional | source |
| GRAPHRAG_INPUT_TIMESTAMP_COLUMN | The 'timestamp' column to use when reading CSV input files. | str | optional | None |
| GRAPHRAG_INPUT_TIMESTAMP_FORMAT | The timestamp format to use when parsing timestamps in the timestamp column. | str | optional | None |
| GRAPHRAG_INPUT_TEXT_COLUMN | The 'text' column to use when reading CSV input files. | str | optional | text |
| GRAPHRAG_INPUT_DOCUMENT_ATTRIBUTE_COLUMNS | A comma-separated list of CSV columns to incorporate as document fields. | str | optional | id |
| GRAPHRAG_INPUT_TITLE_COLUMN | The 'title' column to use when reading CSV input files. | str | optional | title |
| GRAPHRAG_INPUT_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_INPUT_CONNECTION_STRING | The connection string to use when reading CSV input files from Azure Blob Storage. | str | optional | None |
| GRAPHRAG_INPUT_CONTAINER_NAME | The container name to use when reading CSV input files from Azure Blob Storage. | str | optional | None |
| GRAPHRAG_INPUT_BASE_DIR | The base directory to read input files from. | str | optional | None |
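The timestamp format setting pairs with the timestamp column. The sketch below assumes the format uses strftime-style directives, which this document does not state explicitly; the format string and sample value are illustrative only.

```python
from datetime import datetime

# Example value for GRAPHRAG_INPUT_TIMESTAMP_FORMAT (assumed strftime-style).
timestamp_format = "%Y-%m-%d %H:%M:%S"

# A value as it might appear in the configured timestamp column.
parsed = datetime.strptime("2024-03-09 14:30:00", timestamp_format)
```

If the format does not match the column values, parsing fails, so it is worth checking one sample row before indexing a large dataset.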
Data Mapping Settings
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_INPUT_FILE_TYPE | The type of input data, csv or text. | str | optional | text |
| GRAPHRAG_INPUT_ENCODING | The encoding to apply when reading CSV/text input files. | str | optional | utf-8 |
Data Chunking
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_CHUNK_SIZE | The chunk size in tokens for text-chunk analysis windows. | str | optional | 300 |
| GRAPHRAG_CHUNK_OVERLAP | The chunk overlap in tokens for text-chunk analysis windows. | str | optional | 100 |
| GRAPHRAG_CHUNK_BY_COLUMNS | A comma-separated list of document attributes to group by when performing TextUnit chunking. | str | optional | id |
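How chunk size and overlap relate can be shown with a simple sliding window over a token list. This is a sketch under the assumption that consecutive chunks share exactly the overlap count of tokens; GraphRAG's actual chunker may differ, and chunk_tokens is a hypothetical function.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=100):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the input
    return chunks


tokens = list(range(700))  # stand-in for token ids
chunks = chunk_tokens(tokens)
```

With the defaults of 300 and 100, each window starts 200 tokens after the previous one, so a 700-token document yields three chunks in this sketch.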
Prompting Overrides
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE | The path (relative to the root) of an entity extraction prompt template text file. | str | optional | None |
| GRAPHRAG_ENTITY_EXTRACTION_MAX_GLEANINGS | The maximum number of redrives (gleanings) to invoke when extracting entities in a loop. | int | optional | 0 |
| GRAPHRAG_ENTITY_EXTRACTION_ENTITY_TYPES | A comma-separated list of entity types to extract. | str | optional | organization,person,event,geo |
| GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE | The path (relative to the root) of a description summarization prompt template text file. | str | optional | None |
| GRAPHRAG_SUMMARIZE_DESCRIPTIONS_MAX_LENGTH | The maximum number of tokens to generate per description summarization. | int | optional | 500 |
| GRAPHRAG_CLAIM_EXTRACTION_ENABLED | Whether claim extraction is enabled for this pipeline. | bool | optional | False |
| GRAPHRAG_CLAIM_EXTRACTION_DESCRIPTION | The claim_description prompting argument to utilize. | str | optional | "Any claims or facts that could be relevant to threat analysis." |
| GRAPHRAG_CLAIM_EXTRACTION_PROMPT_FILE | The claim extraction prompt to utilize. | str | optional | None |
| GRAPHRAG_CLAIM_EXTRACTION_MAX_GLEANINGS | The maximum number of redrives (gleanings) to invoke when extracting claims in a loop. | int | optional | 0 |
| GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE | The community report extraction prompt to utilize. | str | optional | None |
| GRAPHRAG_COMMUNITY_REPORT_MAX_LENGTH | The maximum number of tokens to generate per community report. | int | optional | 1500 |
Storage
This section controls the storage mechanism the pipeline uses to emit output tables.
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_STORAGE_TYPE | The type of storage to use. Options are file, memory, or blob. | str | optional | file |
| GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_STORAGE_CONNECTION_STRING | The Azure Storage connection string to use when in blob mode. | str | optional | None |
| GRAPHRAG_STORAGE_CONTAINER_NAME | The Azure Storage container name to use when in blob mode. | str | optional | None |
| GRAPHRAG_STORAGE_BASE_DIR | The base path to data outputs. | str | optional | None |
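The way these storage settings fit together in blob mode can be sketched as a small validation step. The rule that blob mode needs a container name plus either a connection string or a blob endpoint is an assumption inferred from the descriptions above, and storage_config and blob_endpoint are hypothetical helpers, not GraphRAG functions.

```python
def storage_config(env):
    """Collect storage settings and check the blob-mode combination (hypothetical)."""
    config = {
        "type": env.get("GRAPHRAG_STORAGE_TYPE", "file"),
        "base_dir": env.get("GRAPHRAG_STORAGE_BASE_DIR"),
        "connection_string": env.get("GRAPHRAG_STORAGE_CONNECTION_STRING"),
        "container_name": env.get("GRAPHRAG_STORAGE_CONTAINER_NAME"),
        "blob_url": env.get("GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL"),
    }
    if config["type"] not in ("file", "memory", "blob"):
        raise ValueError(f"unknown storage type: {config['type']}")
    if config["type"] == "blob":
        # Assumption: blob mode needs a container plus either a connection
        # string or a blob endpoint (the latter for managed identity).
        if not config["container_name"]:
            raise ValueError("blob storage needs GRAPHRAG_STORAGE_CONTAINER_NAME")
        if not (config["connection_string"] or config["blob_url"]):
            raise ValueError("blob storage needs a connection string or blob URL")
    return config


def blob_endpoint(storage_account_name):
    """Build the endpoint in the documented https://<name>.blob.core.windows.net form."""
    return f"https://{storage_account_name}.blob.core.windows.net"


# Example with a placeholder account and container name.
example = storage_config({
    "GRAPHRAG_STORAGE_TYPE": "blob",
    "GRAPHRAG_STORAGE_CONTAINER_NAME": "graphrag-output",
    "GRAPHRAG_STORAGE_STORAGE_ACCOUNT_BLOB_URL": blob_endpoint("myaccount"),
})
```

The cache and reporting sections follow the same naming pattern, so the same kind of check could be adapted by swapping the variable prefix.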
Cache
This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results.
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_CACHE_TYPE | The type of cache to use. Options are file, memory, none, or blob. | str | optional | file |
| GRAPHRAG_CACHE_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_CACHE_CONNECTION_STRING | The Azure Storage connection string to use when in blob mode. | str | optional | None |
| GRAPHRAG_CACHE_CONTAINER_NAME | The Azure Storage container name to use when in blob mode. | str | optional | None |
| GRAPHRAG_CACHE_BASE_DIR | The base path to the cache files. | str | optional | None |
Reporting
This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_REPORTING_TYPE | The type of reporter to use. Options are file, console, or blob. | str | optional | file |
| GRAPHRAG_REPORTING_STORAGE_ACCOUNT_BLOB_URL | The Azure Storage blob endpoint to use when in blob mode and using managed identity. Will have the format https://<storage_account_name>.blob.core.windows.net | str | optional | None |
| GRAPHRAG_REPORTING_CONNECTION_STRING | The Azure Storage connection string to use when in blob mode. | str | optional | None |
| GRAPHRAG_REPORTING_CONTAINER_NAME | The Azure Storage container name to use when in blob mode. | str | optional | None |
| GRAPHRAG_REPORTING_BASE_DIR | The base path to the reporting outputs. | str | optional | None |
Node2Vec Parameters
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_NODE2VEC_ENABLED | Whether to enable Node2Vec. | bool | optional | False |
| GRAPHRAG_NODE2VEC_NUM_WALKS | The Node2Vec number of walks to perform. | int | optional | 10 |
| GRAPHRAG_NODE2VEC_WALK_LENGTH | The Node2Vec walk length. | int | optional | 40 |
| GRAPHRAG_NODE2VEC_WINDOW_SIZE | The Node2Vec window size. | int | optional | 2 |
| GRAPHRAG_NODE2VEC_ITERATIONS | The number of iterations to run Node2Vec. | int | optional | 3 |
| GRAPHRAG_NODE2VEC_RANDOM_SEED | The random seed to use for Node2Vec. | int | optional | 597832 |
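Because environment variables arrive as strings, typed settings such as these need to be coerced before use. The sketch below reads the Node2Vec settings with the defaults listed above; the accepted boolean spellings and the read_typed helper are assumptions for illustration, not GraphRAG's actual parsing rules.

```python
NODE2VEC_DEFAULTS = {
    "GRAPHRAG_NODE2VEC_ENABLED": False,
    "GRAPHRAG_NODE2VEC_NUM_WALKS": 10,
    "GRAPHRAG_NODE2VEC_WALK_LENGTH": 40,
    "GRAPHRAG_NODE2VEC_WINDOW_SIZE": 2,
    "GRAPHRAG_NODE2VEC_ITERATIONS": 3,
    "GRAPHRAG_NODE2VEC_RANDOM_SEED": 597832,
}


def read_typed(env, defaults):
    """Read settings from env, coercing each to the type of its default."""
    result = {}
    for key, default in defaults.items():
        raw = env.get(key)
        if raw is None:
            result[key] = default
        elif isinstance(default, bool):
            # Checked before int because bool is a subclass of int in Python.
            # Assumption: these spellings count as true.
            result[key] = raw.strip().lower() in ("true", "1", "yes")
        else:
            result[key] = type(default)(raw)
    return result


settings = read_typed(
    {"GRAPHRAG_NODE2VEC_ENABLED": "True", "GRAPHRAG_NODE2VEC_NUM_WALKS": "20"},
    NODE2VEC_DEFAULTS,
)
```

Settings that are not present in the environment keep their documented defaults.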
Data Snapshotting
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_SNAPSHOT_GRAPHML | Whether to enable GraphML snapshots. | bool | optional | False |
| GRAPHRAG_SNAPSHOT_RAW_ENTITIES | Whether to enable raw entity snapshots. | bool | optional | False |
| GRAPHRAG_SNAPSHOT_TOP_LEVEL_NODES | Whether to enable top-level node snapshots. | bool | optional | False |
Miscellaneous Settings
| Parameter | Description | Type | Required or Optional | Default |
| --- | --- | --- | --- | --- |
| GRAPHRAG_ASYNC_MODE | Which async mode to use. Either asyncio or threaded. | str | optional | asyncio |
| GRAPHRAG_ENCODING_MODEL | The text encoding model, used in tiktoken, to encode text. | str | optional | cl100k_base |
| GRAPHRAG_MAX_CLUSTER_SIZE | The maximum number of entities to include in a single Leiden cluster. | | | |