  1. PDFBox
  2. PDFBOX-5580

PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.27, 3.0.0 PDFBox
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      Problem

      We recently encountered duplicated text in our clients' PDF documents. Applications typically produce it to simulate bold text when no bold variant of a font is available. Fortunately, PDFBox's PDFTextStripperByArea has logic, inherited from the normal PDFTextStripper, that ignores exact duplicates at the same position in these situations. We therefore switched from setSuppressDuplicateOverlappingText(false) to setSuppressDuplicateOverlappingText(true).

      With this setting, however, the text of multiple regions is not extracted correctly when certain conditions are met:

      When several regions overlap each other and would yield exactly the same text, the text of the first region is extracted correctly, but every following region with the same text remains empty.

      We believe this is a bug caused by duplicate suppression not being handled correctly in PDFTextStripperByArea.

      Possible cause

      While investigating this problem, we found that PDFTextStripperByArea swaps charactersByArticle for multiple regions and interprets a single page multiple times (once for each region). In PDFTextStripper, a private HashMap named characterListMapping keeps track of possible duplicate symbols and their positions. This HashMap is not reset after each region extraction, so characters already recorded for one region are ignored for subsequent regions.

      Since the HashMap is private, we were not able to subclass PDFTextStripperByArea and adjust its behavior to test this finding.
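
      The suspected mechanism can be illustrated with a small, self-contained model. This is not PDFBox code: the class SuppressionStateModel, its Glyph type, and the resetPerRegion flag are hypothetical names, and a HashSet stands in for characterListMapping. The sketch only shows how duplicate-tracking state that is shared across regions empties every later overlapping region, and how clearing it per region avoids that.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of the suspected cause; none of these names exist in PDFBox.
final class SuppressionStateModel {

    /** A piece of text at a position on the page. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Collects the text of each region, dropping glyphs that were already seen
     * at the same position. With resetPerRegion == false the "seen" set is
     * shared by all regions, which models state that is never reset.
     */
    static Map<String, String> extract(List<Glyph> page,
                                       Map<String, Rectangle2D> regions,
                                       boolean resetPerRegion) {
        Set<String> seen = new HashSet<>(); // stand-in for characterListMapping
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
            if (resetPerRegion) {
                seen.clear();
            }
            StringBuilder text = new StringBuilder();
            for (Glyph g : page) {
                if (!region.getValue().contains(g.x, g.y)) {
                    continue; // glyph lies outside this region
                }
                String key = g.text + "@" + g.x + "," + g.y;
                if (seen.add(key)) { // add() returns false for a duplicate
                    text.append(g.text);
                }
            }
            result.put(region.getKey(), text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> page = Arrays.asList(new Glyph("H", 50, 325), new Glyph("i", 56, 325));
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("A", new Rectangle2D.Double(45, 319, 124, 19));
        regions.put("B", new Rectangle2D.Double(43, 319, 130, 19));

        System.out.println(extract(page, regions, false)); // prints {A=Hi, B=}
        System.out.println(extract(page, regions, true));  // prints {A=Hi, B=Hi}
    }
}
```

      In the shared-state run, region B stays empty because every glyph it contains was already recorded while region A was filled, which matches the behavior described above.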

      Workaround

      Extracting the regions one at a time for every page works correctly. So far we have not seen any performance disadvantage.
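
      A minimal sketch of this workaround, assuming that "one at a time" may be done with a fresh PDFTextStripperByArea per region so that no duplicate-tracking state carries over between regions. The class RegionAtATime and its extract method are hypothetical helpers, not part of PDFBox; only setSuppressDuplicateOverlappingText, addRegion, extractRegions and getTextForRegion come from the library.

```java
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// Hypothetical helper illustrating the workaround; not part of PDFBox.
final class RegionAtATime {

    /**
     * Extracts each region in its own pass over the page, using a new stripper
     * per region so that state from one region cannot affect the next.
     */
    static Map<String, String> extract(PDPage page, Map<String, Rectangle2D> regions)
            throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.setSuppressDuplicateOverlappingText(true);
            stripper.addRegion(region.getKey(), region.getValue());
            stripper.extractRegions(page); // only one region is registered here
            result.put(region.getKey(), stripper.getTextForRegion(region.getKey()));
        }
        return result;
    }
}
```

      The helper can be called as RegionAtATime.extract(doc.getPage(0), regions), with a map holding the regions "A" and "B" from the reproduction. The page content is interpreted once per region, which is the cost this workaround accepts.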

      Reproduction

      The attached PDF file does not actually include duplicate overlapping text since this is not needed to reproduce the issue.

       

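      import java.awt.geom.Rectangle2D;
      import java.io.File;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.text.PDFTextStripperByArea;

      // PDFBox 2.0.x API; on 3.0.0, open the file with Loader.loadPDF(new File(...)) instead.
      try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
          final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
          stripper.setSuppressDuplicateOverlappingText(true);
          stripper.setPageEnd("");

          // Two overlapping regions that cover the same text.
          final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
          final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);

          stripper.addRegion("A", areaA);
          stripper.addRegion("B", areaB);

          // Both regions are extracted in a single call.
          stripper.extractRegions(doc.getPage(0));

          System.out.println("A: " + stripper.getTextForRegion("A")); // extracted correctly
          System.out.println("B: " + stripper.getTextForRegion("B")); // remains empty
      }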

       

       

      Attachments

        1. test.pdf
          85 kB
          Sebastian Holzki

          People

            Assignee: Unassigned
            Reporter: Sebastian Holzki (sebho)
            Votes: 0
            Watchers: 3
