  1. PDFBox
  2. PDFBOX-5747

Surrogate pairs with combining diacritics are incorrectly ordered on text extraction

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.30
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      When extending PDFTextStripper, the writeString override receives a List<TextPosition>. For each TextPosition in that list, getUnicode() should return the Unicode text of the extracted glyph.

      However, for glyphs that require a surrogate pair in UTF-16 (such as some mathematical symbols, e.g. 𝑋) and that are modified by a combining diacritic (such as U+0302 COMBINING CIRCUMFLEX ACCENT), the extracted UTF-16 code units are out of order.

      The attached PDF contains 𝑋̂. This is composed of 𝑋 (U+1D44B), which is represented as the surrogate pair \uD835\uDC4B, followed by the combining diacritic \u0302.
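
      To make the composition above concrete, here is a minimal standalone sketch that uses only the Java standard library (no PDFBox). The class name SurrogateDemo and the helper escape are illustrative names, not part of any library.

```java
class SurrogateDemo {
    /** Renders every UTF-16 code unit of s as a four-digit hex escape. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D44B lies outside the Basic Multilingual Plane,
        // so Java stores it as two chars (a surrogate pair).
        String x = new String(Character.toChars(0x1D44B));
        // Append the combining circumflex accent after the base symbol.
        String xHat = x + "\u0302";

        System.out.println(escape(xHat));
        System.out.println(xHat.length());                          // number of UTF-16 code units
        System.out.println(xHat.codePointCount(0, xHat.length()));  // number of code points
    }
}
```

      Running main prints \uD835\uDC4B\u0302, then 3, then 2: three UTF-16 code units that encode two code points.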

      However, when extracted, we get \uD835\u0302\uDC4B (the combining diacritic is placed between the two halves of the surrogate pair). This is ill-formed UTF-16, and when encoded as JSON it will break most parsers. The expected output is \uD835\uDC4B\u0302.
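
      Until this is fixed, a caller-side workaround might look like the sketch below. It is not PDFBox code and not a proposed patch: SurrogateRepair, hasLoneSurrogate, repair and isCombiningMark are hypothetical helper names, and the sketch assumes the only damage is one or more combining marks sitting between the two halves of a surrogate pair, as in the string reported above.

```java
final class SurrogateRepair {
    /** True if s contains a surrogate code unit that is not part of a well-formed pair. */
    static boolean hasLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    /** Moves combining marks found between the halves of a surrogate pair to after the pair. */
    static String repair(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                int j = i + 1;
                while (j < s.length() && isCombiningMark(s.charAt(j))) {
                    j++;
                }
                // Only rewrite when at least one mark was skipped and a low surrogate follows.
                if (j > i + 1 && j < s.length() && Character.isLowSurrogate(s.charAt(j))) {
                    out.append(c).append(s.charAt(j)).append(s, i + 1, j);
                    i = j + 1;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** True for BMP characters in the Unicode general categories Mn, Mc or Me. */
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }
}
```

      In a PDFTextStripper subclass, a helper like repair could be applied to the text handed to writeString before it is serialized, with hasLoneSurrogate serving as a guard or as a regression check. Input that is already well formed passes through unchanged.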

      Attachments

        1. invchar.pdf (7 kB, P Crossa)

          People

            Assignee: Unassigned
            Reporter: P Crossa (pcrossa)
            Votes: 0
            Watchers: 1
