Memory Issue: ECE Partitioning Consumes Excessive Memory with Large File Sets
Issue Description
When processing a large number of input files, the ECE partitioning method consumes excessive memory and often crashes with out-of-memory errors. This makes it unusable for large-scale knowledge graph processing.
Root Cause Analysis
The memory issue stems from two main sources in the ECE partitioning implementation:
1. Pre-tokenization Process
The `_pre_tokenize()` method in `PartitionService` loads all nodes and edges into memory simultaneously:

```python
def _pre_tokenize(self) -> None:
    """Pre-tokenize all nodes and edges to add token length information."""
    logger.info("Starting pre-tokenization of nodes and edges...")
    nodes = self.kg_instance.get_all_nodes()
    edges = self.kg_instance.get_all_edges()
```

This process iterates through every node and edge to calculate token lengths, which becomes memory-intensive with large knowledge graphs.
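A bounded-memory alternative would stream entities in fixed-size batches instead of materializing full lists. This is a minimal sketch under that assumption; the `batched()` helper below is generic illustration code, not part of the project:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield fixed-size batches so only one batch is resident at a time."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical usage: tokenize nodes batch by batch instead of all at once.
# for node_batch in batched(kg_instance.iter_nodes(), 1000):
#     tokenize(node_batch)
```

With a streaming accessor on the knowledge graph side, peak memory would be bounded by the batch size rather than the total graph size.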
2. ECE Partitioner Memory Usage
The ECE partitioner maintains multiple large data structures in memory:
- Complete node and edge lists
- Adjacency list representation
- Node and edge dictionaries
- Combined units list
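The cumulative effect of holding these parallel structures can be illustrated with a toy graph. This is only an illustration of the memory pattern described above, not the actual ECE partitioner code:

```python
import sys

# Toy graph: 1000 nodes in a ring, 1000 edges.
nodes = [f"node_{i}" for i in range(1000)]
edges = [(f"node_{i}", f"node_{(i + 1) % 1000}") for i in range(1000)]

# Each structure below holds the same entities again, so container
# overhead multiplies with graph size instead of being shared.
node_dict = {n: {"id": n} for n in nodes}
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)
units = list(nodes) + list(edges)  # combined units list

container_bytes = sum(
    sys.getsizeof(s) for s in (nodes, edges, node_dict, adjacency, units)
)
print(f"container overhead alone: {container_bytes} bytes")
```

Even before token-length metadata is attached, every node participates in at least three containers, so total memory scales as a multiple of the raw graph size.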
Reproduction Steps
1. Process a large dataset (1000+ files) using the Web UI
2. Select ECE as the partitioning method
3. Observe memory usage growing rapidly during the partitioning phase
4. System crashes with an OOM error when memory is exhausted
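To quantify the growth while reproducing, Python's standard `tracemalloc` module can report peak allocations around the partitioning call. The list comprehension below is only a stand-in for the real workload:

```python
import tracemalloc

tracemalloc.start()
# Stand-in for the partitioning call; replace with the real workload.
workload = [list(range(100)) for _ in range(10_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```

Comparing peak readings across runs with 100, 1000, and 10000 input files would confirm how memory scales with input size.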
Expected Behavior
ECE partitioning should handle large datasets gracefully without consuming excessive memory.
Actual Behavior
Memory usage grows linearly with the number of nodes/edges, leading to system crashes.