A clearly commented version and a pipeline summary in my_progress.md

This commit is contained in:
hhh2210
2025-03-16 19:47:24 +08:00
parent d5ff164818
commit aae6377500
4 changed files with 57 additions and 2 deletions

View File

@@ -140,6 +140,13 @@ async def _handle_entity_relation_summary(
description: str,
global_config: dict,
) -> str:
"""Summarize the entity or relation description,is used during entity extraction and when merging nodes or edges in the knowledge graph
Args:
entity_or_relation_name: entity or relation name
description: description
global_config: global configuration
"""
use_llm_func: callable = global_config["cheap_model_func"]
llm_max_tokens = global_config["cheap_model_max_token_size"]
tiktoken_model_name = global_config["tiktoken_model_name"]
@@ -311,6 +318,17 @@ async def extract_hierarchical_entities(
entity_vdb: BaseVectorStorage,
global_config: dict,
) -> Union[BaseGraphStorage, None]:
"""Extract entities and relations from text chunks
Args:
chunks: text chunks
knowledge_graph_inst: knowledge graph instance
entity_vdb: entity vector database
global_config: global configuration
Returns:
Union[BaseGraphStorage, None]: knowledge graph instance
"""
use_llm_func: callable = global_config["best_model_func"]
entity_extract_max_gleaning = global_config["entity_extract_max_gleaning"]

View File

@@ -199,7 +199,10 @@ class NetworkXStorage(BaseGraphStorage):
async def _leiden_clustering(self):
"""
Cluster the graph with the hierarchical_leiden function from the graspologic library.
The Leiden algorithm is used in the HiRAG.ainsert method.
"""
from graspologic.partition import hierarchical_leiden
graph = NetworkXStorage.stable_largest_connected_component(self._graph)
community_mapping = hierarchical_leiden(
graph,

View File

@@ -533,7 +533,7 @@ Entity description list: {entity_description_list}
#######
Output:
"""
# Entity type definitions
PROMPTS["DEFAULT_ENTITY_TYPES"] = ["organization", "person", "geo", "event"]
PROMPTS["META_ENTITY_TYPES"] = ["organization", "person", "location", "event", "product", "technology", "industry", "mathematics", "social sciences"]
PROMPTS["DEFAULT_TUPLE_DELIMITER"] = "<|>"

my_progress.md Normal file
View File

@@ -0,0 +1,34 @@
### GMM Clustering
**Core Code**: Functions in `hirag/_cluster_utils.py`
**Key Steps**:
- Uses `sklearn.mixture.GaussianMixture` for clustering
- Automatically determines optimal number of clusters with `get_optimal_clusters`
- Applies dimension reduction with UMAP before clustering
- Returns clusters as labels and probabilities
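The clustering steps above can be sketched as follows. This is an illustrative sketch, not the repo's actual code: the function names echo `hirag/_cluster_utils.py`, but the bodies (BIC-based model selection, default parameters) are assumptions, and the UMAP dimension-reduction step is omitted for brevity.

```python
# Hedged sketch of GMM clustering with automatic cluster-count selection.
# Assumes `embeddings` is an (n, d) NumPy array of entity embeddings.
import numpy as np
from sklearn.mixture import GaussianMixture

def get_optimal_clusters(embeddings, max_clusters=10, random_state=0):
    """Pick the cluster count with the lowest BIC (an assumed criterion)."""
    max_clusters = min(max_clusters, len(embeddings))
    candidates = range(1, max_clusters + 1)
    bics = []
    for n in candidates:
        gm = GaussianMixture(n_components=n, random_state=random_state)
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    return list(candidates)[int(np.argmin(bics))]

def gmm_cluster(embeddings, random_state=0):
    """Return (hard labels, soft membership probabilities) per entity."""
    n = get_optimal_clusters(embeddings, random_state=random_state)
    gm = GaussianMixture(n_components=n, random_state=random_state).fit(embeddings)
    labels = gm.predict(embeddings)        # one cluster label per entity
    probs = gm.predict_proba(embeddings)   # soft probabilities, rows sum to 1
    return labels, probs
```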
### Summarization of Entities
- For each cluster from GMM clustering, generates a prompt with all entities in the cluster
- Uses LLM to generate summary entities for the cluster
- Parses the LLM response to extract new higher-level entities and relationships
- Creates embeddings for these summary entities
- Adds these summaries to the next layer in the hierarchy
**Prompt Design**: The `summary_clusters` prompt instructs the LLM to:
- "Identify at least one attribute entity for the given entity description list"
- Generate entities matching types from the meta attribute list: `["organization", "person", "location", "event", "product", "technology", "industry", "mathematics", "social sciences"]`
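A minimal sketch of how one cluster's entities might be packed into the summarization prompt. The wording is paraphrased from the section above; the real `summary_clusters` template lives in the prompt file, and `build_cluster_prompt` is a hypothetical helper.

```python
# Illustrative prompt assembly for one GMM cluster (not the repo's code).
META_ENTITY_TYPES = ["organization", "person", "location", "event",
                     "product", "technology", "industry", "mathematics",
                     "social sciences"]

def build_cluster_prompt(entities):
    """entities: list of (name, description) pairs from one cluster."""
    desc_list = "\n".join(f"- {name}: {desc}" for name, desc in entities)
    return (
        "Identify at least one attribute entity for the given entity "
        "description list, using only these meta entity types: "
        f"{META_ENTITY_TYPES}\n"
        f"Entity description list:\n{desc_list}\n"
        "Output:"
    )
```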
### Entity Extraction
- Happens in the `_process_single_content_entity` and `_process_single_content_relation` functions within `extract_hierarchical_entities`. These functions:
- Use an LLM to extract entities with structured prompts
- Extract entity attributes like name, type, description, and source
- Store entities in a knowledge graph and vector database
- The extracted entity information is stored in the knowledge graph and processed by the `_handle_single_entity_extraction` function (line 165), which parses entity attributes from LLM output.
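The parsing step can be sketched like this, in the spirit of `_handle_single_entity_extraction`. The `"<|>"` delimiter comes from `PROMPTS["DEFAULT_TUPLE_DELIMITER"]` above; the record layout (`"entity"<|>name<|>type<|>description`) and field normalization are assumptions for illustration.

```python
# Hedged sketch of parsing one extracted-entity record from LLM output.
TUPLE_DELIMITER = "<|>"  # matches PROMPTS["DEFAULT_TUPLE_DELIMITER"]

def handle_single_entity(record: str, source_id: str):
    """Parse one delimited record; return an entity dict or None."""
    fields = [f.strip().strip('"') for f in record.split(TUPLE_DELIMITER)]
    if len(fields) < 4 or fields[0].lower() != "entity":
        return None  # not a well-formed entity record
    return {
        "entity_name": fields[1].upper(),
        "entity_type": fields[2].upper(),
        "description": fields[3],
        "source_id": source_id,  # the text chunk this entity came from
    }
```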
### Text Chunking
**Core Code**: `extract_hierarchical_entities` in `hirag/_op.py`
**Key Steps**:
- The function processes text chunks to extract entities and relationships
- It uses LLM prompts defined in `PROMPTS["hi_entity_extraction"]` for entity extraction
- Each chunk is processed to extract entities via `_process_single_content_entity`
- Embeddings are created for all extracted entities
- It also extracts relationships between entities via `_process_single_content_relation`
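The per-chunk flow above can be sketched end to end with the LLM and embedding model stubbed out. Everything here is illustrative: the single-line `name|type|description` format and the `extract_from_chunks` helper are assumptions, not the actual `extract_hierarchical_entities` implementation.

```python
# Minimal sketch of the chunk -> entities -> embeddings pipeline.
def extract_from_chunks(chunks, llm, embed):
    """chunks: {chunk_id: text}; llm and embed are caller-supplied callables."""
    all_entities = []
    for chunk_id, text in chunks.items():
        # 1. Ask the LLM to extract entities from this chunk.
        raw = llm(f"Extract entities from:\n{text}")
        # 2. Parse one entity per line: name|type|description (assumed format).
        for line in raw.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3:
                name, etype, desc = parts
                all_entities.append({
                    "name": name, "type": etype,
                    "description": desc, "source_id": chunk_id,
                })
    # 3. Embed each extracted entity for the vector database.
    vectors = [embed(e["name"] + " " + e["description"]) for e in all_entities]
    return all_entities, vectors
```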