@danilojsl commented on Dec 22, 2025

Description

New Feature: Add structural metadata (domPath, orderTableIndex, orderImageIndex) across document readers
Bug Fix: Fix duplicated elements (e.g., section titles and diagnosis entries) in the parsed output of HTML files

Motivation and Context

Previously, our document parsers (HTMLReader, XMLReader, WordReader, PowerPointReader, ExcelReader) relied heavily on positional or page-based coordinates for layout metadata.
However, non-PDF formats such as HTML, XML, DOC(X), PPT(X), and XLS(X) do not have fixed pages, so page-based coordinates do not map cleanly onto them.
To ensure deterministic element referencing and structural traceability across all document types, we needed to adopt a unified DOM-like metadata model.

This change standardizes metadata extraction so every element can be uniquely identified and re-located within its source document, independent of visual layout.
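
To illustrate the idea, here is a minimal Python sketch of how a structural path and running table/image counters could be derived for an HTML document. It is not the reader implementation from this PR: the `StructuralIndexer` class, the `index_html` helper, the XPath-like `tag[n]` path format, and the zero-based counters are all assumptions made for the example. Only the metadata key names (domPath, orderTableIndex, orderImageIndex) come from the description above. It also assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StructuralIndexer(HTMLParser):
    """Hypothetical helper: records a DOM-style path for every element,
    plus document-order counters for tables and images."""

    def __init__(self):
        super().__init__()
        self.stack = []              # path segments of the currently open elements
        self.sibling_counts = [{}]   # one dict per open level: tag -> count so far
        self.elements = []           # collected metadata, in document order
        self.table_count = 0
        self.image_count = 0

    def handle_starttag(self, tag, attrs):
        # Number this element among same-tag siblings under the current parent.
        counts = self.sibling_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        segment = f"{tag}[{counts[tag]}]"
        meta = {"tag": tag, "domPath": "/" + "/".join(self.stack + [segment])}

        # Assumed convention: zero-based indices in document order.
        if tag == "table":
            meta["orderTableIndex"] = self.table_count
            self.table_count += 1
        elif tag == "img":
            meta["orderImageIndex"] = self.image_count
            self.image_count += 1
        self.elements.append(meta)

        if tag not in VOID_TAGS:
            self.stack.append(segment)
            self.sibling_counts.append({})

    def handle_endtag(self, tag):
        # Close the innermost open element only if it matches this end tag.
        if self.stack and self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.sibling_counts.pop()


def index_html(html):
    """Return one metadata dict per element, in document order."""
    parser = StructuralIndexer()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    sample = ("<html><body><h1>Title</h1><table></table>"
              "<p><img src='a.png'></p><table></table></body></html>")
    for element in index_html(sample):
        print(element)
```

Because each path segment carries its position among same-tag siblings, two elements in the same document never share a path, which is the property that lets an element be re-located without any page or coordinate information.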

How Has This Been Tested?

  • Unit tests
  • Local tests
  • Google Colab

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl force-pushed the feature/SPARKNLP-1322-Store-Structural-Position-for-Tables-and-Images branch from cc2d966 to 4473a47 on December 22, 2025, 13:25