doc: add notes of the key codes

2025-09-16 23:52:00 +03:00 · 2025-03-17 01:43:24 +08:00
parent 97d3ad9849
commit ade33bade2
1 changed files with 21 additions and 17 deletions
--- a/docs/notes_of_codes.md
+++ b/docs/notes_of_codes.md
@@ -1,12 +1,31 @@
-### GMM Clustering
+From the contributor [hhh2210](https://github.com/hhh2210).
+## Text Chunking
+**Core Code**: `extract_hierarchical_entities` in `hirag/_op.py`
+
+**Key Steps**:
+- The function processes text chunks to extract entities and relationships
+- It uses LLM prompts defined in `PROMPTS["hi_entity_extraction"]` for entity extraction
+- Each chunk is processed to extract entities via `_process_single_content_entity`
+- Embeddings are created for all extracted entities
+- It also extracts relationships between entities via `_process_single_content_relation`
+
+## Entity extraction
+- Happens in `_process_single_content_entity` and `_process_single_content_relation` functions within `extract_hierarchical_entities`. These functions:
+    - Use an LLM to extract entities with structured prompts
+    - Extract entity attributes like name, type, description, and source
+    - Store entities in a knowledge graph and vector database
+- The extracted entity information is stored in the knowledge graph and processed by the `_handle_single_entity_extraction` function (line 165), which parses entity attributes from LLM output.
+
+## GMM Clustering
 **Core Code**: Functions in `hirag/_cluster_utils.py`
+
 **Key Steps**:
 - Uses `sklearn.mixture.GaussianMixture` for clustering
 - Automatically determines optimal number of clusters with `get_optimal_clusters`
 - Applies dimension reduction with UMAP before clustering
 - Returns clusters as labels and probabilities

-### Summarization of Entities
+## Summarization of Entities
 - For each cluster from GMM clustering, generates a prompt with all entities in the cluster
 - Uses LLM to generate summary entities for the cluster
 - Parses the LLM response to extract new higher-level entities and relationships
@@ -17,18 +36,3 @@
 - "Identify at least one attribute entity for the given entity description list"
 - Generate entities matching types from the meta attribute list: `["organization", "person", "location", "event", "product", "technology", "industry", "mathematics", "social sciences"]`

-### Entity extraction
-- Happens in `_process_single_content_entity` and `_process_single_content_relation` functions within `extract_hierarchical_entities`. These functions:
-    - Use an LLM to extract entities with structured prompts
-    - Extract entity attributes like name, type, description, and source
-    - Store entities in a knowledge graph and vector database
-- The extracted entity information is stored in the knowledge graph and processed by the `_handle_single_entity_extraction` function (line 165), which parses entity attributes from LLM output.
-
-### Text Chunking
-**Core Code**: `extract_hierarchical_entities` in `hirag/_op.py`
-**Key Steps**:
-- The function processes text chunks to extract entities and relationships
-- It uses LLM prompts defined in `PROMPTS["hi_entity_extraction"]` for entity extraction
-- Each chunk is processed to extract entities via `_process_single_content_entity`
-- Embeddings are created for all extracted entities
-- It also extracts relationships between entities via `_process_single_content_relation`