Details
Bug
Status: Closed
Major
Resolution: Fixed
2.11.0
None
Description
The available() method of org.apache.commons.io.input.CharSequenceInputStream erroneously returns values larger than the actual number of available bytes in some cases.
The underlying issue is that CharSequenceInputStream makes incorrect assumptions about the relationship between chars and bytes. With CodingErrorAction.REPLACE, 2 chars (1 supplementary code point) can be encoded as a single byte (the replacement char ?). Additionally, if CharSequenceInputStream is ever extended to support specifying a CharsetEncoder, CodingErrorAction.IGNORE would probably cause similar issues. There might also be some uncommon charsets which can encode 2 chars to 1 byte, though I am not aware of such a charset yet.
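The char-to-byte mismatch can be reproduced with the JDK alone, without Commons IO. The sketch below is only an illustration: it uses US-ASCII instead of Big5 (an assumption made here because US-ASCII is guaranteed to be available on every JVM), and encodedLength is a hypothetical helper name, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodedLengthDemo {

    // Hypothetical helper: encodes the whole input with the given error action
    // and returns the number of bytes the encoder actually produced.
    static int encodedLength(CharSequence input, Charset charset, CodingErrorAction action)
            throws CharacterCodingException {
        ByteBuffer encoded = charset.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .encode(CharBuffer.wrap(input));
        return encoded.remaining();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // One supplementary code point (U+10000), stored as 2 chars (a surrogate pair).
        String supplementary = "\uD800\uDC00";

        System.out.println("chars: " + supplementary.length());
        // US-ASCII cannot map U+10000, so REPLACE emits the single replacement byte '?'.
        System.out.println("bytes with REPLACE: "
                + encodedLength(supplementary, StandardCharsets.US_ASCII, CodingErrorAction.REPLACE));
        // IGNORE drops the unmappable code point entirely.
        System.out.println("bytes with IGNORE: "
                + encodedLength(supplementary, StandardCharsets.US_ASCII, CodingErrorAction.IGNORE));
    }
}
```

Under these assumptions the 2 input chars yield 1 byte with REPLACE and 0 bytes with IGNORE, so any estimate of available bytes derived from the number of remaining chars can overshoot.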
This was originally mentioned in pull request #293. That PR also proposed replacing the underlying CharSequenceInputStream implementation with ReaderInputStream, because using CharsetEncoder is error-prone in general, so it might be good to avoid having two classes implementing logic on top of it. (CharSequenceInputStream is potentially missing a call to CharsetEncoder.flush; see also IO-714.)
Example
In the example below available() erroneously returns 2 even though only 1 byte can be read.
    Charset charset = Charset.forName("Big5");
    CharSequenceInputStream in = new CharSequenceInputStream("\uD800\uDC00", charset);
    // BUG: available() returns 2 but only 1 byte is read afterwards
    System.out.println("Available: " + in.available());
    // Note: readAllBytes() is a method added in Java 9
    System.out.println("Actually read: " + in.readAllBytes().length);