353 Commits

Author SHA1 Message Date
Nathan Evans
66c2cfb3ce Support JSON input files (#1777)
* Add csv loader tests

* Add test loader tests

* Add json input support

* Remove temp path constraint

* Reuse loader code

* Semver

* Set file pattern automatically based on type, if empty

* Remove pattern from smoke test config

* Spelling

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-03-10 14:04:07 -07:00
Nathan Evans
bcb74789f1 Next release docs (#1627)
* Wording updates

* Update yaml config and add notes to "deprecated" env

* Add basic search section

* Update versioning docs

* Minor edits for clarity

* Update init command

* Update init to add --force in docs

* Add NLP extraction params

* Move vector_store to root

* Add workflows to config

* Add FastGraphRAG docs

* add metadata column changes

* Added documentation for multi index search.

* Minor fixes.

* Add config and table renames

* Update migration notebook and comments to specify v1

* Add frequency to entity table docs

* add new chunking options for metadata

* Update output docs

* Minor edits and cleanup

* Add model ids to search configs

* Spruce up migration notebook

* Lint/format multi-index notebook

* SpaCy model note

* Update SpaCy footnote

* Updated multi_index_search.ipynb to remove ruff errors.

* add spacy to dictionary

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Dayenne Souza <ddesouza@microsoft.com>
Co-authored-by: dorbaker <dorbaker@microsoft.com>
2025-03-03 14:46:00 -08:00
Nathan Evans
bd06d8b4f0 Context property bag ("state") (#1774)
* Add pipeline state property bag to run context

* Move state creation out of context util

* Move callbacks into PipelineRunContext

* Semver

* Rename state.json to context.json to avoid confusion with stats.json

* Expand smoke test row count

* Add util to create storage and cache
2025-02-28 09:31:48 -08:00
Nathan Evans
a15942629b Add more verb tests (#1773)
* Add NLP verb test

* Add finalize_graph tests

* Add more thorough final column assertions
2025-02-27 09:31:46 -08:00
Alonso Guevara
b4b8b81c0a Remove spacy model from toml (#1771)
* Remove spacy model from toml

* Semver
2025-02-26 10:58:02 -06:00
Alonso Guevara
716f93dd8b Release v2.0.0 (#1769)
* Release v2.0.0

* snapshots...
v2.0.0
2025-02-25 17:52:30 -06:00
Alonso Guevara
facf68148a Fix summarization and relationship grouping on Inc Indexing (#1768)
* Fix summarization for large descriptions on incremental indexing

* Semver

* Ruff
2025-02-25 17:29:55 -06:00
Nathan Evans
ede6a74546 Pipeline callbacks (#1729)
* Add pipeline_start and pipeline_end callbacks

* Collapse redundant callback/logger logic

* Remove redundant reporting config classes

* Remove a few out-of-date type ignores

* Semver

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-25 15:07:51 -08:00
Nathan Evans
e40476153d Speed up smoke tests (#1736)
* Move verb tests to regular CI

* Clean up env vars

* Update smoke runtime expectations

* Rework artifact assertions

* Fix plural in name

* remove redundant artifact len check

* Remove redundant artifact len check

* Adjust graph output expectations

* Update community expectations

* Include all workflow output

* Adjust text unit expectations

* Adjust assertions per dataset

* Fix test config param name

* Update nan allowed for optional model fields

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-25 13:24:35 -08:00
Nathan Evans
61a309b182 Incremental model alignment (#1766)
* Use shared schema lists for all final columns

* Semver
2025-02-25 13:14:42 -06:00
Alonso Guevara
0144b3fd88 Update FNLLM (#1738)
* Add ModelProvider to Query package.

* Spellcheck + others

* Semver

* Fix tests

* Format

* Fix Pyright

* Fix tests

* Fix for smoke tests

* Update fnllm version

* Semver

* Ruff
2025-02-24 20:30:45 -06:00
Nathan Evans
5dd9fc53cd Move embeddings snapshots (#1737)
* Move embedding snapshots to the workflow runner

* Semver

* Rename input tables
2025-02-24 17:38:01 -08:00
Alonso Guevara
e0d233fe10 Feat/llm provider query (#1735)
* Add ModelProvider to Query package.

* Spellcheck + others

* Semver

* Fix tests

* Format

* Fix Pyright

* Fix tests

* Fix for smoke tests
2025-02-24 18:35:51 -06:00
Nathan Evans
faa05b691f Fix text unit incremental ID updates (#1734)
* Increment text_unit ids during incremental

* Semver
2025-02-24 14:58:00 -08:00
Nathan Evans
a932b2d342 Fix StopAsyncIteration catch (#1730) 2025-02-21 11:46:44 -08:00
Derek Worthen
54885b8ab1 Refactor config defaults (#1723)
* Refactor config defaults

- Implement type-safe, hierarchical dataclass for config
defaults instead of namespaced constants.
- Allow for instantiating config directly from defaults data structure.

* fix vector_store db_uri default
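
The "hierarchical dataclass for config defaults" idea described in this commit can be illustrated with a minimal sketch. All class names, field names, and values below are hypothetical and chosen only for illustration; they are not taken from the repository.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class VectorStoreDefaults:
    # Hypothetical fields and values, for illustration only.
    type: str = "lancedb"
    db_uri: str = "output/lancedb"


@dataclass(frozen=True)
class ConfigDefaults:
    # Nested defaults take the place of flat, namespaced constants
    # (e.g. a module-level VECTOR_STORE_DB_URI = "...").
    root_dir: str = "."
    vector_store: VectorStoreDefaults = field(default_factory=VectorStoreDefaults)


# A single typed object holds every default value.
DEFAULTS = ConfigDefaults()


@dataclass
class VectorStoreConfig:
    # The config model reads its defaults from the typed structure.
    type: str = DEFAULTS.vector_store.type
    db_uri: str = DEFAULTS.vector_store.db_uri


# Instantiate a config section directly from the defaults data structure.
config = VectorStoreConfig(**asdict(DEFAULTS.vector_store))
```

Because the defaults are attributes rather than loose constants, a type checker can flag a misspelled name such as `DEFAULTS.vector_store.db_url`.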

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-20 13:01:29 -06:00
Alonso Guevara
7bdeaee94a Create Language Model Providers and Registry methods. Remove fnllm coupling (#1724)
* Base structure

* Add fnllm providers and Mock LLM

* Remove fnllm coupling, introduce llm providers

* Ruff + Tests fix

* Spellcheck

* Semver

* Format

* Default MockChat params

* Fix more tests

* Fix embedding smoke test

* Fix embeddings smoke test

* Fix MockEmbeddingLLM

* Rename LLM to model. Package organization

* Fix prompt tuning

* Oops

* Oops II
2025-02-20 08:56:20 -06:00
Nathan Evans
a42772d368 Query callbacks (#1721)
* Add callbacks to global search

* Add callbacks to local search

* Add streaming callbacks in local search CLI

* Add callbacks to basic search

* Add callbacks to DRIFT search

* Semver

* Return generators directly in API

* Guard callbacks
2025-02-19 13:00:07 -08:00
Nathan Evans
efcaf9636d Tuck flow functions under their workflows (#1720)
* Move flow functions to workflow

* Remove redundant workflow_name variable

* Semver
2025-02-18 15:33:36 -06:00
Alonso Guevara
7f020826be Fix/json mode community reports (#1713)
* Patch json mode on Community Reports

* Semversioner

* Wording oopsie
2025-02-14 16:51:42 -06:00
Nathan Evans
96219a2182 Register workflows (#1691)
* Add workflow registration

* Add ability to mutate config by workflows

* Separate graph finalization

* Separate graph pruning

* Semver

* Update tests

* Update smoke tests

* Fix iterrows on create_graph

* Remove prune_graph from llm construction

* Update test data

* Remove prune_graph from smoke tests
2025-02-14 13:21:31 -08:00
Nathan Evans
981fd31963 Community children (#1704)
* Add children to the community tables

* Replace NaN children with empty list

* Replace subcommunity logic with built-in parent/child fields

* Remove restore_community_hierarchy

* Add children and frequency to migration notebook

* Format

* Semver

* Add children to reports

* Update tests

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-13 17:03:51 -08:00
Nathan Evans
35b639399b Incremental flow rework (#1696)
* Rework update output structure

* Semver

* Fix unit test

* Update frequency in incremental

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-02-13 18:22:32 -06:00
Alonso Guevara
5ef2399a6f Chore/remove iterrows (#1708)
* Remove most iterrow usages

* Semver

* Ruff

* Pyright

* Format
2025-02-13 17:32:54 -06:00
Josh Bradley
f14cda2b6d Improve default llm retry logic to be more optimized (#1701) 2025-02-13 16:56:37 -05:00
Josh Bradley
b8b949f3bb Cleanup query api - remove code duplication (#1690)
* consolidate query api functions and remove code duplication

* refactor and remove more code duplication

* Add semversioner file

* fix basic search

* fix drift search and update base class function names

* update example notebooks
2025-02-13 16:31:08 -05:00
Nathan Evans
fe461417b5 Export NLP community reports prompt (#1697)
* Properly export the NLP community reports prompt

* Semver

* Fix verb tests
2025-02-12 10:41:39 -08:00
Dayenne Souza
b94290ec2b add option to add metadata into text chunks (#1681)
* add new options

* add metadata json into input document

* remove doc change

* add metadata column into text loader

* prepend_metadata

* run fix

* fix tests and patch

* fix test

* add warning for metadata tokens > config size

* fix typo and run fix

* fix test_integration

* fix test

* run check

* rename and fix chunking

* fix

* fix

* fix test verbs

* fix

* fix tests

* fix chunking

* fix index

* fix cosmos test

* fix vars

* fix after PR

* fix
2025-02-12 09:38:03 -08:00
KennyZhang1
b9dc7b90d5 Fix/streamline workflow miq bugs (#1694)
* Add vector store id reference to embeddings config.

* added communities to links and maxvals

* Consistent naming

* Update entity_ids to include index_name

* added consistent logging messages to miq cli

* semversioner

---------

Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-02-11 16:13:28 -05:00
Nathan Evans
a6a78d5897 Nlp cache (#1689)
* Add cache to build_noun_graph

* Semver
2025-02-10 11:00:51 -08:00
Nathan Evans
c02ab0984a Streamline workflows (#1674)
* Remove create_final_nodes

* Rename final entity output to "entities"

* Remove duplicate code from graph extraction

* Rename create_final_relationships output to "relationships"

* Rename create_final_communities output to "communities"

* Combine compute_communities and create_final_communities

* Rename create_final_covariates output to "covariates"

* Rename create_final_community_reports output to "community_reports"

* Rename create_final_text_units output to "text_units"

* Rename create_final_documents output to "documents"

* Remove transient snapshots config

* Move create_final_entities to finalize_entities operation

* Move create_final_relationships flow to finalize_relationships operation

* Reuse some community report functions

* Collapse most of graph and text unit-based report generation

* Unify schemas files

* Move community reports extractor

* Move NLP report prompt to prompts folder

* Fix a few pandas warnings

* Rename embeddings config to embed_text

* Rename claim_extraction config to extract_claims

* Remove nltk from standard graph extraction

* Fix verb tests

* Fix extract graph config naming

* Fix moved file reference

* Create v1-to-v2 migration notebook

* Semver

* Fix smoke test artifact count

* Raise tpm/rpm on smoke tests

* Update drift settings for smoke tests

* Reuse project directory var in api notebook

* Format

* Format
2025-02-07 11:11:03 -08:00
KennyZhang1
83cc2daf91 Multi-index query CLI support (#1675)
* Add vector store id reference to embeddings config.

* changed structure of output config section

* added cli integration for multi index global

* added cli integration for multi index local

* added cli integration for multi index drift and basic

* finished local testing of multi-index cli

* ruff fixes

* partially refactored test code to align with new output section

* more test changes for new output structure

* semversioner

* refactored to align with new multi index config proposal

* locally tested new multi-index output proposal

* cleaned up tests to align with new structure

---------

Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
2025-02-07 12:56:48 -05:00
Alonso Guevara
0805924a35 Fix/drift n depth (#1676)
* Fix n_depth param

* Semver

* Change smoke tests params for drift

* Reduce log printing for expected exceptions
2025-02-05 17:22:34 -06:00
JunHo Kim (김준호)
a4d35bc66f Fix typo in DEVELOPING.md instructions (#1631)
Corrected "this values" to "these values" for improved clarity. This ensures the documentation is more accurate and professional.

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-02-04 13:16:57 -08:00
JunHo Kim (김준호)
30f36316af Fix typo in table formatting in env_vars documentation (#1632)
Corrected a missing backtick in a note within the `GRAPHRAG_API_KEY` description. This ensures proper code formatting and improves readability in the documentation. No content was altered aside from formatting adjustments.

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-02-04 13:14:58 -08:00
Dayenne Souza
ad5b5120ec remove unused columns and rename document_attribute_columns (#1672)
* remove unused columns and change property document_attribute_columns to metadata

* format file

* fix 'metadata' column on output

* run check

* fix test on nltk

* remove docs changes
2025-02-03 14:37:06 -03:00
Nathan Evans
907d271f4e Fix recursive report generation (#1669) 2025-01-30 11:03:25 -08:00
Nathan Evans
53b06aa2ac Add generate_text_embeddings to FGR (#1667) 2025-01-29 14:31:48 -08:00
Derek Worthen
94bd2bb816 Require explicit azure auth settings when using AOI. (#1665)
* Require explicit azure auth settings when using AOI.

- Must set LanguageModel.azure_auth_type to either
"api_key" or "managed_identity" when using AOI.

* Fix smoke tests

* Use general auth_type property instead of azure_auth_type

* Remove unused error type

* Update validation

* Update validation comment
2025-01-29 12:28:47 -08:00
Nathan Evans
d31750f44d NLP graph extraction (#1652)
* Add NLP extraction workflow

* Add text unit community summarization

* Add CLI flag for indexing method

* Regenerate poetry.lock

* Fix claims loading

* Merge fixes

* Add workflow overrides to config

* Semver

* Add graph pruning config

* Remove degree re-compute from pruning

* Switch to percentile for edge weight pruning

* Add NLP extraction config

* Add new NLP extractor options

* Add FGR workflows to util method

* Use a generator factory for workflows

* Update pruning defaults

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-28 12:27:03 -08:00
Derek Worthen
eeee84e9d9 Add vector store id reference to embeddings config. (#1662) 2025-01-28 10:46:41 -08:00
KennyZhang1
1bbce33f42 Multi-index querying for API layer (#1644)
* added multi-global-query function header

* ported over code for merging dataframes

* added connection to global streaming api function

* added function header for update context helper

* implemented and incorporated update_context function

* Updated to make sure 'parent' column in final_communities gets incremented for multi index.

* first cut at multi_local_search function

* several minor changes and fixes

* Updated multi index local search.

* Cleaned up code.

* fixed lambda function ruff errors

* fixed more ruff errors

* moved query api helpers to util file

* moved index api helpers to util file

* merged in code left out of conflict

* changed GraphRagConfig object to support lists of vector stores

* Updated with fixes for multi_local_search.

* Minor updates.

* Minor updates.

* Updates for ruff check.

* Minor updates.

* removed redundant vector_store_configs arg

* ruff formatting changes

* semversioner

* Minor fix.

* spellcheck fixes

* ruff

* test fix for cicd errors

* another test fix

* added explicit typing for ci tests

* added dict type check for vector_store during indexing

* more ruff fixes

* moved type check

* Removed streaming. Added multi drift and basic searches.

* Formatting changes.

* Updates for pyright.

* Update for ruff.

* Ruff formatted.

* first cut at fixing vector store typing errors

* got multi local search working with new config

* ruff and test fixes

* added fix for embeddings type error

* renamed multi index api functions

* ruff

* convert config model to dict[VectorStoreConfig]

* modified tests to support new vector_store model

* ruff fixes

* changed some test setups to match new model

* changed ci/cd settings files to match new structure

* Fix stderror check

* fixed bug in vector_store_config validation

* ruff

* add database_name field to vectorstoreconfig

* removed print statements

* small refactoring for PR comments

* modified default config in test

* modified vector store config unit test

---------

Co-authored-by: dorbaker <dorbaker@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-27 17:26:38 -05:00
Shamik
053bf60162 Update auto_prompt_tuning.md (#1659)
Updated the auto prompt tuning doc with `--selection-method` instead of only `--method` as per the latest API.

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-27 13:33:25 -06:00
Alonso Guevara
6b33977360 Add smoke tests for drift (#1658) 2025-01-24 12:31:37 -06:00
Derek Worthen
c644338bae Refactor config (#1593)
* Refactor config

- Add new ModelConfig to represent LLM settings
    - Combines LLMParameters, ParallelizationParameters, encoding_model, and async_mode
- Add top level models config that is a list of available LLM ModelConfigs
- Remove LLMConfig inheritance and delete LLMConfig
    - Replace the inheritance with a model_id reference to the ModelConfig listed in the top level models config
- Remove all fallbacks and hydration logic from create_graphrag_config
    - This removes the automatic env variable overrides
- Support env variables within config files using Templating
    - This requires "$" to be escaped with extra "$" so ".*\\.txt$" becomes ".*\\.txt$$"
- Update init content to initialize new config file with the ModelConfig structure
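
The "$" escaping rule noted above matches the behavior of Python's standard `string.Template`. The sketch below assumes that class is the templating mechanism, which the commit message does not state; the variable name `EXAMPLE_API_KEY` and the config keys are hypothetical.

```python
import os
from string import Template

# Hypothetical environment variable, set here so the example is self-contained.
os.environ["EXAMPLE_API_KEY"] = "secret"

# A config fragment: ${...} is substituted, while "$$" yields a literal "$".
raw_config = r'''
api_key: ${EXAMPLE_API_KEY}
file_pattern: ".*\\.txt$$"
'''

rendered = Template(raw_config).substitute(os.environ)

# Without the extra "$", the trailing "$" is read as an invalid placeholder.
try:
    Template(r'file_pattern: ".*\\.txt$"').substitute(os.environ)
    unescaped_fails = False
except ValueError:
    unescaped_fails = True
```

After substitution, `rendered` contains the environment value and the regex with a single trailing `$`, while the unescaped pattern is rejected with a `ValueError`.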

* Use dict of ModelConfig instead of list

* Add model validations and unit tests

* Fix ruff checks

* Add semversioner change

* Fix unit tests

* validate root_dir in pydantic model

* Rename ModelConfig to LanguageModelConfig

* Rename ModelConfigMissingError to LanguageModelConfigMissingError

* Add validation for unexpected API keys

* Allow skipping pydantic validation for testing/mocking purposes.

* Add default lm configs to verb tests

* smoke test

* remove config from flows to fix llm arg mapping

* Fix embedding llm arg mapping

* Remove timestamp from smoke test outputs

* Remove unused "subworkflows" smoke test properties

* Add models to smoke test configs

* Update smoke test output path

* Send logs to logs folder

* Fix output path

* Fix csv test file pattern

* Update placeholder

* Format

* Instantiate default model configs

* Fix unit tests for config defaults

* Fix migration notebook

* Remove create_pipeline_config

* Remove several unused config models

* Remove indexing embedding and input configs

* Move embeddings function to config

* Remove skip_workflows

* Remove skip embeddings in favor of explicit naming

* fix unit test spelling mistake

* self.models[model_id] is already a language model. Remove redundant casting.

* update validation errors to instruct users to rerun graphrag init

* instantiate LanguageModelConfigs with validation

* skip validation in unit tests

* update verb tests to use default model settings instead of skipping validation

* test using llm settings

* cleanup verb tests

* remove unsafe default model config

* remove the ability to skip pydantic validation

* remove None union types when default values are set

* move vector_store from embeddings to top level of config and delete resolve_paths

* update vector store settings

* fix vector store and smoke tests

* fix serializing vector_store settings

* fix vector_store usage

* fix vector_store type

* support cli overrides for loading graphrag config

* rename storage to output

* Add --force flag to init

* Remove run_id and resume, fix Drift config assignment

* Ruff

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-01-21 17:52:06 -06:00
Nathan Evans
47adfe16f0 Fix DRIFT search on Azure AI Search (#1645)
* Add vector field to retrievable fields for Azure AI Search

* Add DRIFT and Basic search to smoke tests

* Semver

* Format

* Remove DRIFT smoke test for now (brittle)
2025-01-21 17:28:46 -06:00
Alonso Guevara
dd884c0ce2 Release v1.2.0 (#1625) v1.2.0 2025-01-15 15:49:07 -06:00
Alonso Guevara
3defab2ea4 Reduce Drift Response and Streaming endpoint (#1624)
* Adding basic wrappers for reduce in drift

* Add response_type parameter to run_drift_search and enhance reduce response functionality

* Add streaming endpoint

* Semver

* Spellcheck

* Ruff checks

* Count tokens on reduce

* Use list comprehension and remove llm_params map in favor of just using kwargs
2025-01-15 14:23:25 -06:00
KennyZhang1
4637270e9a Implement CosmosDB vector store (#1587) 2025-01-14 02:47:08 -05:00
Alonso Guevara
e21a38f2ab Fix/notebooks (#1614)
* Add new inputs and missing vector store for retrieving vectors

* Format

* Semver

* Remove .Identifier files

* Fix spellcheck

* Remove unnecessary input file for notebooks
2025-01-13 17:41:39 -06:00