SPARKNLP-1322 Store Structural Position for Tables and Images #14713

danilojsl · 2025-12-22T13:19:34Z

Description

New Feature: Adds structural metadata (domPath, orderTableIndex, orderImageIndex) across document readers.
Bug Fix: Resolves a bug where certain elements (e.g., section titles and diagnosis entries) appeared multiple times in the parsed output of HTML files.

Motivation and Context

Previously, our document parsers (HTMLReader, XMLReader, WordReader, PowerPointReader, ExcelReader) relied heavily on positional or page-based coordinates for layout metadata.
However, non-PDF formats such as HTML, XML, DOC(X), PPT(X), and XLS(X) do not have fixed pages, so those coordinates cannot reliably locate an element.
To ensure deterministic element referencing and structural traceability across all document types, we needed to adopt a unified DOM-like metadata model.

This change standardizes metadata extraction so every element can be uniquely identified and re-located within its source document, independent of visual layout.
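
To illustrate the idea, here is a minimal sketch of how a structural path and per-type order index could be derived for tables and images in an HTML document. It is not the reader code from this PR: the `StructuralIndexer` class, the `index_structure` helper, the XPath-like `tag[n]` path format, the `elementType` key, and the 0-based order indices are all assumptions made for illustration. It also assumes well-formed markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the path stack.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class StructuralIndexer(HTMLParser):
    """Hypothetical helper: records a DOM-like path for each table and image."""

    def __init__(self):
        super().__init__()
        self.path = []            # "tag[n]" segments of the currently open elements
        self.child_counts = [{}]  # per-depth sibling counters, keyed by tag name
        self.table_count = 0
        self.image_count = 0
        self.elements = []        # collected metadata dicts, in document order

    def handle_starttag(self, tag, attrs):
        # Number this element among same-tag siblings (1-based, XPath style).
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        segment = f"{tag}[{counts[tag]}]"
        dom_path = "/" + "/".join(self.path + [segment])

        if tag == "table":
            self.elements.append({"elementType": "Table", "domPath": dom_path,
                                  "orderTableIndex": self.table_count})
            self.table_count += 1
        elif tag == "img":
            self.elements.append({"elementType": "Image", "domPath": dom_path,
                                  "orderImageIndex": self.image_count})
            self.image_count += 1

        if tag not in VOID_TAGS:
            self.path.append(segment)
            self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if tag in VOID_TAGS or not self.path:
            return
        if self.path[-1].split("[")[0] == tag:
            self.path.pop()
            self.child_counts.pop()


def index_structure(html):
    """Return structural metadata for every <table> and <img> in `html`."""
    indexer = StructuralIndexer()
    indexer.feed(html)
    indexer.close()
    return indexer.elements


if __name__ == "__main__":
    sample = (
        "<html><body>"
        "<div><h1>Report</h1><table><tr><td>a</td></tr></table></div>"
        "<div><img src='x.png'><table><tr><td>b</td></tr></table></div>"
        "</body></html>"
    )
    for element in index_structure(sample):
        print(element)
```

Because the path depends only on the element's position in the tree, the same document always yields the same identifiers, regardless of how it is rendered.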

How Has This Been Tested?

Unit tests
Local tests
Google Colab

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

danilojsl added 2 commits December 22, 2025 08:24

[SPARKNLP-1322] Adding structural position to Readers

5797328

[SPARKNLP-1322] Fix duplicate elements from nested divs

4473a47

danilojsl force-pushed the feature/SPARKNLP-1322-Store-Structural-Position-for-Tables-and-Images branch from cc2d966 to 4473a47 Compare December 22, 2025 13:25

[SPARKNLP-1322] Adding missing test

d870d00

danilojsl requested a review from DevinTDHa December 22, 2025 14:01

danilojsl self-assigned this Dec 22, 2025

danilojsl added enhancement bug-fix labels Dec 22, 2025
