Details
Sub-task
Status: Open
Major
Resolution: Unresolved
Impala 4.3.0
None
None
Description
Python 3 makes a clear distinction between bytes (bytes) and Unicode strings (str). To handle this correctly, each piece of code needs to be explicit about whether it is working on Unicode strings or bytes.
The typical way to fix this for text is to implement a "Unicode sandwich" where the input path is converted to Unicode as early as possible and the output path is converted to bytes as late as possible. This leaves all internal code working on Unicode strings.
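A minimal sketch of the sandwich, assuming UTF-8 input and output. The function names (read_text, shout, write_text, process) are hypothetical and only illustrate the shape; they are not taken from the Impala code base.

```python
def read_text(raw_bytes, encoding="utf-8"):
    # Input edge: decode bytes to a Unicode string as early as possible.
    return raw_bytes.decode(encoding)


def shout(text):
    # Internal logic: works only on Unicode strings (str), never on bytes.
    return text.upper()


def write_text(text, encoding="utf-8"):
    # Output edge: encode back to bytes as late as possible.
    return text.encode(encoding)


def process(raw_bytes):
    # bytes -> str -> (internal work) -> bytes
    return write_text(shout(read_text(raw_bytes)))
```

Only read_text and write_text know about the encoding, so everything between them can assume str.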
Some parts of our code deal with bytes directly (e.g. tests/util/get_parquet_metadata.py has code that deals with the bytes of a Parquet file). Almost everything else should be dealing with Unicode strings.
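As an illustration of code that should stay on bytes, here is a small sketch that checks the 4-byte "PAR1" magic at the start and end of an unencrypted Parquet file. This is not the code from tests/util/get_parquet_metadata.py; the helper name has_parquet_magic is hypothetical.

```python
import io

# Magic marker is a bytes literal; comparing it to a str would never match
# in Python 3.
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(stream):
    """Return True if a seekable binary stream starts and ends with PAR1."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < 2 * len(PARQUET_MAGIC):
        return False
    stream.seek(0)
    header = stream.read(len(PARQUET_MAGIC))
    stream.seek(size - len(PARQUET_MAGIC))
    footer = stream.read(len(PARQUET_MAGIC))
    return header == PARQUET_MAGIC and footer == PARQUET_MAGIC
```

The stream has to be opened in binary mode (for example open(path, "rb")) so that read() returns bytes rather than str.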
This is also a good time to fix warnings about the unicode() builtin and basestring, neither of which exists in Python 3.
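One possible replacement pattern for Python 3 only code, as a sketch: isinstance(x, basestring) becomes a check against str, and unicode(x) becomes either str(x) or an explicit decode when x is bytes. The helper name to_text is hypothetical, and the UTF-8 default is an assumption.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string (str)."""
    if isinstance(value, bytes):
        # Bytes need an explicit decode; str(b"abc") would give "b'abc'".
        return value.decode(encoding)
    if isinstance(value, str):
        # Replaces the old isinstance(value, basestring) check.
        return value
    # Replaces unicode(value) for non-string objects.
    return str(value)
```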