IMPALA-12718

trim() functions lack UTF-8 support


Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • Impala 4.4.0

    Description

      The following string functions lack UTF-8 support:

      BTRIM(STRING a, STRING chars_to_trim)
      LTRIM(STRING a, STRING chars_to_trim)
      RTRIM(STRING a, STRING chars_to_trim)
      

      Here is an issue reported by our user:

      [localhost:21050] default> select rtrim('价格,', ',');
      +-----------------------+
      | rtrim('价格,', ',') |
      +-----------------------+
      | 价�                   |
      +-----------------------+

      The result is the same with utf8_mode=true. Note that the comma in the strings above is the Chinese (fullwidth) punctuation mark ',' (U+FF0C), not the English (ASCII) comma ','.

      The cause is that the Chinese character ',' is treated as a set of individual bytes rather than as a single character. The UTF-8 encodings of these characters are:

      • '价': 0xe4 0xbb 0xb7
      • '格': 0xe6 0xa0 0xbc
      • ',': 0xef 0xbc 0x8c

      Each character is encoded as 3 bytes. The last byte of '格' is 0xbc, which also appears in the bytes of ',', so it is removed as well. The result is '价' followed by the first two bytes of '格'. Those two trailing bytes are not a valid UTF-8 sequence, so the last character is displayed as '�'.
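
The byte-level behaviour described above can be reproduced outside Impala. The following is a minimal Python sketch, not Impala's actual implementation: `rtrim_bytes` and `rtrim_chars` are hypothetical helper names introduced here for illustration. The first treats chars_to_trim as a set of bytes, which is the behaviour the description attributes to the bug. The second treats it as a set of whole characters, which is presumably what a UTF-8 aware trim would do.

```python
FULLWIDTH_COMMA = "\uff0c"  # Chinese punctuation mark, UTF-8: ef bc 8c


def rtrim_bytes(s: str, chars_to_trim: str) -> bytes:
    """Hypothetical byte-wise rtrim: every byte of chars_to_trim is trimmable."""
    data = s.encode("utf-8")
    trim_set = set(chars_to_trim.encode("utf-8"))
    end = len(data)
    # Strip trailing bytes while they belong to the byte set.
    while end > 0 and data[end - 1] in trim_set:
        end -= 1
    return data[:end]


def rtrim_chars(s: str, chars_to_trim: str) -> str:
    """Hypothetical character-wise rtrim: only whole code points are trimmable."""
    trim_set = set(chars_to_trim)
    end = len(s)
    # Strip trailing characters while they belong to the character set.
    while end > 0 and s[end - 1] in trim_set:
        end -= 1
    return s[:end]


text = "价格" + FULLWIDTH_COMMA

broken = rtrim_bytes(text, FULLWIDTH_COMMA)
print(broken.hex(" "))  # e4 bb b7 e6 a0 (the 0xbc of '格' was trimmed too)
print(broken.decode("utf-8", errors="replace"))  # '价' plus a replacement character

print(rtrim_chars(text, FULLWIDTH_COMMA))  # 价格
```

The byte-wise version stops only at 0xa0, the first byte from the end that is not in {0xef, 0xbc, 0x8c}, which leaves a truncated sequence for '格'.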

            People

              eyizoha Zihao Ye
              stigahuang Quanlong Huang