[IMPALA-11978] Implement Unicode sandwich for python code - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 4.3.0
Fix Version/s: None
Component/s: Infrastructure
Labels: None

Description

Python 3 makes a clear distinction between bytes and strings (Unicode). To handle this appropriately, each piece of code needs to be explicit about whether it is working with Unicode strings or bytes.

The typical way to fix this for text is to implement a "Unicode sandwich": input is decoded to Unicode as early as possible, and output is encoded to bytes as late as possible. This leaves all internal code working on Unicode strings.
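
As a rough illustration of the pattern, here is a minimal sketch assuming Python 3 and UTF-8 data. The function names (read_lines, shout, write_lines) are hypothetical and do not come from the Impala code base; only the two boundary functions ever touch bytes.

```python
import io


def read_lines(raw_stream, encoding="utf-8"):
    """Input boundary: read bytes and decode to str immediately."""
    return raw_stream.read().decode(encoding).splitlines()


def shout(lines):
    """Internal logic: sees only str, never bytes."""
    return [line.upper() for line in lines]


def write_lines(raw_stream, lines, encoding="utf-8"):
    """Output boundary: encode to bytes at the last possible moment."""
    raw_stream.write("\n".join(lines).encode(encoding))


# In-memory byte streams stand in for files, sockets, or subprocess pipes.
source = io.BytesIO("café\nnaïve".encode("utf-8"))
sink = io.BytesIO()
write_lines(sink, shout(read_lines(source)))
```

Because shout() only ever receives str, it behaves the same regardless of how the data was encoded on disk or on the wire.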

Some parts of our code deal with bytes directly (e.g. tests/util/get_parquet_metadata.py has code that deals with the bytes of a Parquet file). Almost everything else should be dealing with Unicode strings.
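
For contrast, here is a sketch of code that legitimately stays in bytes. It is not taken from tests/util/get_parquet_metadata.py; the function name footer_length is hypothetical. It relies only on the Parquet file layout, in which a file starts and ends with the 4-byte magic PAR1 and the 4 bytes before the trailing magic hold the footer length as a little-endian integer.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # bytes literal: this is binary data, not text


def footer_length(data):
    """Return the footer length stored near the end of a Parquet file.

    `data` is the raw file content as bytes; it is never decoded to str.
    """
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != PARQUET_MAGIC or data[-4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic")
    # 4-byte little-endian unsigned integer just before the trailing magic.
    (length,) = struct.unpack("<I", data[-8:-4])
    return length
```

Comparing against b"PAR1" rather than "PAR1" matters in Python 3, where a bytes slice never compares equal to a str.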

This is also a good time to fix warnings about the unicode() builtin and basestring, neither of which exists in Python 3.
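
One possible way to remove direct uses of unicode() and basestring while code still has to run on both interpreters is a small compatibility shim. This is a sketch under that assumption; the names text_type, string_types, and ensure_text are hypothetical here, and the issue does not say which approach will be used.

```python
import sys

if sys.version_info[0] >= 3:
    text_type = str          # replaces unicode
    string_types = (str,)    # replaces basestring
else:
    text_type = unicode            # noqa: F821 (Python 2 only)
    string_types = (basestring,)   # noqa: F821 (Python 2 only)


def ensure_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding bytes if necessary."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, string_types):
        return value
    raise TypeError("expected bytes or text, got %s" % type(value).__name__)
```

Call sites can then use isinstance(x, string_types) and ensure_text(x) instead of referring to the Python 2 names directly.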

People

Assignee: Unassigned

Reporter: Joe McDonnell

Votes: 0

Watchers: 2

Dates

Created: 06/Mar/23 20:42

Updated: 06/Mar/23 20:42