
Memory Issue: ECE Partitioning Consumes Excessive Memory with Large File Sets #128

@mdjhacker

Description


Issue Description

When processing a large number of input files, the ECE partitioning method consumes excessive memory and often crashes with out-of-memory errors. This makes it unusable for large-scale knowledge graph processing.


Root Cause Analysis

The memory issue stems from two main sources in the ECE partitioning implementation:

1. Pre-tokenization Process

The _pre_tokenize() method in PartitionService loads all nodes and edges into memory simultaneously:

def _pre_tokenize(self) -> None:
    """Pre-tokenize all nodes and edges to add token length information."""
    logger.info("Starting pre-tokenization of nodes and edges...")
    
    nodes = self.kg_instance.get_all_nodes()
    edges = self.kg_instance.get_all_edges()

This process then iterates through every node and edge to calculate token lengths, which becomes memory-intensive with large knowledge graphs because every object stays resident for the entire pass.
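One possible mitigation would be to tokenize in fixed-size batches instead of materializing both full lists up front. The sketch below is illustrative only: `iter_batches` and `pre_tokenize_batched` are hypothetical helpers (not part of the current PartitionService), and the whitespace split stands in for whatever tokenizer the service actually uses.

```python
from typing import Dict, Iterable, Iterator, List


def iter_batches(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield fixed-size batches so only one batch is resident at a time."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def pre_tokenize_batched(nodes: Iterable[dict], batch_size: int = 256) -> int:
    """Annotate each node with a token_length field, one batch at a time.

    A whitespace split stands in for the real tokenizer; the point is
    that peak memory is bounded by batch_size, not the graph size.
    """
    annotated = 0
    for batch in iter_batches(nodes, batch_size):
        for node in batch:
            node["token_length"] = len(node.get("content", "").split())
            annotated += 1
    return annotated
```

If get_all_nodes()/get_all_edges() could return iterators (e.g. a database cursor) rather than lists, this pattern would keep peak memory proportional to the batch size.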

2. ECE Partitioner Memory Usage

The ECE partitioner maintains multiple large data structures in memory:

  • Complete node and edge lists
  • Adjacency list representation
  • Node and edge dictionaries
  • Combined units list
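Several of these structures duplicate the same objects. One way to cut the overlap, sketched here under the assumption that edges can be streamed as (source, target) id pairs, is an adjacency map keyed by node id only, so full node/edge objects are not held a second time (the function name is hypothetical, not from the codebase):

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_adjacency_ids(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map over node ids only.

    Storing ids instead of full node/edge dicts avoids keeping a second
    copy of every object alongside the node and edge lists.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return adjacency
```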

Reproduction Steps

  1. Process a large dataset (1000+ files) using the Web UI
  2. Select ECE as the partitioning method
  3. Observe memory usage growing rapidly during the partitioning phase
  4. System crashes with OOM error when memory is exhausted

Expected Behavior

ECE partitioning should handle large datasets gracefully without consuming excessive memory.

Actual Behavior

Memory usage grows linearly with the number of nodes/edges, leading to system crashes.
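To make the linear growth concrete, here is a back-of-envelope model. The per-object byte counts are purely illustrative assumptions (actual sizes depend on node/edge payloads and Python object overhead), but they show how quickly resident memory scales with graph size:

```python
def estimate_partition_memory_mb(
    num_nodes: int,
    num_edges: int,
    bytes_per_node: int = 2048,  # assumed average, not measured
    bytes_per_edge: int = 1024,  # assumed average, not measured
) -> float:
    """Rough linear model of memory held by the in-memory partition
    structures: total bytes = nodes * per-node + edges * per-edge."""
    total_bytes = num_nodes * bytes_per_node + num_edges * bytes_per_edge
    return total_bytes / (1024 * 1024)


# Under these assumptions, 1M nodes and 2M edges already need ~3.9 GB,
# before counting the duplicate dictionaries and combined units list.
print(estimate_partition_memory_mb(1_000_000, 2_000_000))
```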
