Details
Bug
Status: Open
Major
Resolution: Unresolved
2.0.30
None
None
Description
When extending PDFTextStripper, the writeString override receives a List<TextPosition>. When iterating over that list, each getUnicode() call should return the Unicode representation of the corresponding extracted text.
However, for glyphs that require a surrogate pair (such as some mathematical symbols, e.g. 𝑋) and that are modified with a combining diacritic (such as the combining circumflex, U+0302), the extracted UTF-16 code units are out of order.
The attached PDF contains 𝑋̂. This is composed of 𝑋, which is represented as the surrogate pair \uD835\uDC4B, followed by the combining diacritic \u0302.
When extracted, however, we get \uD835\u0302\uDC4B (the combining diacritic is placed between the two halves of the surrogate pair). This is not well-formed UTF-16, because the high and low surrogates are left unpaired, and when serialized to JSON it will break most parsers. The expected output is \uD835\uDC4B\u0302.
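To make the difference between the two sequences concrete, here is a minimal sketch that uses only the Java standard library (no PDFBox calls) to check whether a string contains unpaired surrogates. The class name SurrogateCheck and the method isWellFormedUtf16 are hypothetical names chosen for illustration; they are not part of PDFBox. A check like this could be applied to the concatenated getUnicode() values to detect the problem described above.

```java
final class SurrogateCheck {

    /**
     * Hypothetical helper: returns true if every high surrogate in s is
     * immediately followed by a low surrogate, and no low surrogate
     * appears on its own.
     */
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be directly followed by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate without a preceding high surrogate is unpaired.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Expected extraction: surrogate pair first, then the combining circumflex.
        String expected = "\uD835\uDC4B\u0302";
        // Sequence reported in this issue: the diacritic splits the pair.
        String actual = "\uD835\u0302\uDC4B";

        System.out.println(isWellFormedUtf16(expected)); // prints "true"
        System.out.println(isWellFormedUtf16(actual));   // prints "false"
    }
}
```

The expected string consists of two code points (U+1D44B and U+0302), whereas the reported string consists of three, two of which are lone surrogates.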