[PDFBOX-5747] Surrogate pairs with combining diacritics are incorrectly ordered on text extraction - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.30
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

When extending PDFTextStripper, the writeString override receives a List<TextPosition>. Calling getUnicode() on each element should return the Unicode representation of the extracted text.

However, for glyphs that require a surrogate pair (such as some mathematical symbols, e.g. 𝑋) and are modified with a combining diacritic (such as the combining circumflex ^), the extracted UTF-16 code units are out of order.

The attached PDF contains 𝑋̂. This is composed of 𝑋, which is represented as the surrogate pair \uD835\uDC4B, followed by the combining diacritic \u0302.
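To make the composition concrete, here is a minimal JDK-only sketch that builds the expected sequence from the code point U+1D44B and the combining mark U+0302. The class and method names (SurrogateDemo, expected, hex) are made up for illustration and are not part of PDFBox.

```java
final class SurrogateDemo {
    // U+1D44B lies outside the Basic Multilingual Plane, so UTF-16 needs two code units for it.
    static final int MATH_ITALIC_X = 0x1D44B;
    // U+0302 COMBINING CIRCUMFLEX ACCENT fits in a single code unit.
    static final char COMBINING_CIRCUMFLEX = '\u0302';

    /** Builds the expected UTF-16 sequence: the surrogate pair first, then the combining mark. */
    static String expected() {
        return new StringBuilder()
                .appendCodePoint(MATH_ITALIC_X)
                .append(COMBINING_CIRCUMFLEX)
                .toString();
    }

    /** Formats each UTF-16 code unit as four hex digits, separated by spaces. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(expected())); // prints D835 DC4B 0302
    }
}
```

The string has three UTF-16 code units but only two code points, which is why the two halves of the pair must stay adjacent.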

However, when extracted, we get \uD835\u0302\uDC4B (the combining diacritic is placed between the two code units of the surrogate pair). This is an invalid UTF-16 sequence, and when encoded as JSON it will break most parsers. The expected output would be \uD835\uDC4B\u0302.
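Until this is fixed, a caller-side workaround could look roughly like the sketch below, applied to each string returned by getUnicode(). This is not a fix inside PDFBox, and SurrogateRepair, isWellFormed, repair and isCombiningMark are hypothetical names. It assumes the only damage is one or more combining marks sitting between a high and a low surrogate.

```java
final class SurrogateRepair {
    /** Returns true if every high surrogate is directly followed by a low surrogate and no low surrogate stands alone. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    /** Moves combining marks found between a high and a low surrogate to just after the pair. */
    static String repair(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                int j = i + 1;
                while (j < s.length() && isCombiningMark(s.charAt(j))) {
                    j++;
                }
                // Only rewrite when marks were skipped and a low surrogate follows them.
                if (j > i + 1 && j < s.length() && Character.isLowSurrogate(s.charAt(j))) {
                    out.append(c).append(s.charAt(j)).append(s, i + 1, j);
                    i = j + 1;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** True for code units in the Unicode general categories Mn, Me or Mc. */
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    public static void main(String[] args) {
        String broken = "\uD835\u0302\uDC4B";
        System.out.println(isWellFormed(broken));         // prints false
        System.out.println(isWellFormed(repair(broken))); // prints true
    }
}
```

Strings that are already well formed pass through repair unchanged, so it should be safe to call it on every extracted string before serializing.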

Attachments

invchar.pdf (7 kB), 26/Dec/23 12:51, P Crossa

People

Assignee: Unassigned

Reporter: P Crossa

Votes: 0

Watchers: 1

Dates

Created: 26/Dec/23 12:58

Updated: 26/Dec/23 13:03