[PDFBOX-5580] PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.0.27, 3.0.0 PDFBox
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

Problem

We recently encountered duplicate text in our clients' PDF documents. Applications typically create it to simulate bold text when no bold variant of a font is available. Fortunately, PDFBox's PDFTextStripperByArea has logic, inherited from the normal PDFTextStripper, to ignore exact duplicates at the same position in these situations. We therefore changed setSuppressDuplicateOverlappingText from false to true.

With this setting, however, text for multiple regions is not extracted correctly under the following conditions:

When multiple regions overlap each other and would yield exactly the same text, the text of the first region is extracted correctly, but every following region with the same text remains empty.

We believe this is a bug in how PDFTextStripperByArea handles duplicate suppression.

Possible cause

While investigating this problem, we found that PDFTextStripperByArea swaps charactersByArticle for each region and interprets a single page multiple times (once per region). In PDFTextStripper, a private HashMap, characterListMapping, keeps track of possible duplicate symbols and their positions. This HashMap is not reset after each region is extracted, so characters already recorded for an earlier region are ignored for subsequent regions.
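
The suspected mechanism can be illustrated with a small standalone model. This is only a sketch of the hypothesis described above, not PDFBox's actual implementation: the class SharedDuplicateFilterModel, the Glyph type, the sample glyph positions, and the resetPerRegion flag are all hypothetical names introduced for illustration. Only the two rectangles are taken from the reproduction code in this report. The model keeps one "already seen" set for all regions and shows that a second, overlapping region ends up empty unless the set is cleared between regions.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the suspected mechanism. None of this is PDFBox code;
// every name in this class is hypothetical.
class SharedDuplicateFilterModel {

    // Stand-in for a positioned character on the page.
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Two made-up characters that lie inside both sample regions.
    static List<Glyph> samplePage() {
        List<Glyph> page = new ArrayList<>();
        page.add(new Glyph("H", 50, 325));
        page.add(new Glyph("i", 58, 325));
        return page;
    }

    // The two overlapping rectangles used in the reproduction code.
    static Map<String, Rectangle2D> sampleRegions() {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("A", new Rectangle2D.Double(45, 319, 124, 19));
        regions.put("B", new Rectangle2D.Double(43, 319, 130, 19));
        return regions;
    }

    // Walks the page once per region. The "seen" set plays the role of the
    // duplicate-tracking map: a character at a known position is dropped.
    static Map<String, String> extract(List<Glyph> page,
                                       Map<String, Rectangle2D> regions,
                                       boolean resetPerRegion) {
        Set<String> seen = new HashSet<>();
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
            if (resetPerRegion) {
                seen.clear(); // the reset that appears to be missing
            }
            StringBuilder text = new StringBuilder();
            for (Glyph g : page) {
                String key = g.text + "@" + g.x + "," + g.y;
                // Set.add returns false if the key was already recorded.
                if (region.getValue().contains(g.x, g.y) && seen.add(key)) {
                    text.append(g.text);
                }
            }
            result.put(region.getKey(), text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shared state across regions: prints {A=Hi, B=}
        System.out.println(extract(samplePage(), sampleRegions(), false));
        // State cleared before each region: prints {A=Hi, B=Hi}
        System.out.println(extract(samplePage(), sampleRegions(), true));
    }
}
```

In this model the second region is empty only because the set survives from the first region, which matches the behavior observed with PDFTextStripperByArea.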

Because the HashMap is private, we could not subclass PDFTextStripperByArea and adjust its behavior to verify this finding.

Workaround

Extracting the regions one at a time for each page works correctly. So far we have not seen any performance disadvantage.

Reproduction

The attached PDF file does not actually contain duplicate overlapping text, since none is needed to reproduce the issue.

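import java.awt.geom.Rectangle2D;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// PDDocument.load(File) is the PDFBox 2.0.x API; on 3.0.x use Loader.loadPDF(File).
try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
    final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    // The problem only occurs with duplicate suppression enabled.
    stripper.setSuppressDuplicateOverlappingText(true);
    stripper.setPageEnd("");

    // Two overlapping regions that should yield the same text.
    final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
    final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);

    stripper.addRegion("A", areaA);
    stripper.addRegion("B", areaB);

    // Both regions are extracted in a single call.
    stripper.extractRegions(doc.getPage(0));

    // Observed: "A" contains the text, "B" is empty.
    System.out.println("A: " + stripper.getTextForRegion("A"));
    System.out.println("B: " + stripper.getTextForRegion("B"));
}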

Attachments

test.pdf
23/Mar/23 11:04
85 kB
Sebastian Holzki

People

Assignee: Unassigned

Reporter: Sebastian Holzki

Votes: 0

Watchers: 3

Dates

Created: 23/Mar/23 11:04

Updated: 18/May/24 14:03