Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-17355

Casting to Decimal along with UNION ALL gives incosistent results

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.1.0, 2.1.1
    • None
    • Parser, UDF
    • None
    • CentOS 7.2

    Description

      Extra trailing zeros are added when running "union all" on the tables containing decimal data types.

      Version: Hive 2.1

      Steps to repro:-

      1) CREATE TABLE `decisample`(
        `a` decimal(8,2),
        `b` int,
        `c` decimal(5,2))
      ROW FORMAT SERDE
        'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION
        'maprfs:/user/hive/warehouse/decisample'
      
      2) CREATE TABLE `decisample3`(
        `a` decimal(8,2),
        `b` int,
        `c` decimal(5,2))
      ROW FORMAT SERDE
        'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION
        'maprfs:/user/hive/warehouse/decisample3'
      
      3)hive> select * from decisample3;
      OK
      1.00    2       3.00
      7.00    8       9.00
      
      4)hive> select * from decisample;
      OK
      4.00    5       6.00
      
      5) query:- 
      select a1.a, '' as a1b,'' as a1c from decisample a1 union all select a2.a,a2.b,a2.c from decisample3 a2;
      
      o/p:-
      OK
      4.00            NULL
      1.00    2       3.000000000000000000
      7.00    8       9.000000000000000000
      Time taken: 87.993 seconds, Fetched: 3 row(s)
      
      6)select a2.a,a2.b,a2.c from decisample3 a2 union all select a1.a, '' as a1b,'' as a1c from decisample a1;
      
      o/p:-
      4.00
      1.00    2       3
      7.00    8       9
      
      

      Steps 5 is yielding 18 trailing zeros where as step 6 query is yieldings no trailing zero.

      Observation:

      1. Hive is trying to run the UNION ALL after ensuring the SELECT's are semantically same(equal number of columns and same datatypes). To do this, it is implicitly type casting the values where required.

      From the explain plan, type casting is not consistent when done 2 different ways:
      a) select-1 UNION ALL select-2 (Query-5 in above comment)
      vs
      b) select-2 UNION ALL select-2 (Query-6 in above comment)

      Showing only the "expresssions" part of execution plans

      Query-5:
      ========

      ..
      ..
      Map Operator Tree:
                TableScan
                  alias: a1
                  Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: a (type: decimal(8,2)), '' (type: string), null (type: decimal(38,18))
                    outputColumnNames: _col0, _col1, _col2
      ..
      ..
      TableScan
                  alias: a2
                  Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), CAST( c AS decimal(38,18)) (type: decimal(38,18))
      
      
      

      Query-6:
      ========

      ..
      ..
      Map Operator Tree:
                TableScan
                  alias: a2
                  Statistics: Num rows: 2 Data size: 22 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: a (type: decimal(8,2)), UDFToString(b) (type: string), UDFToString(c) (type: string)
      ..
      ..
      TableScan
                  alias: a1
                  Statistics: Num rows: 1 Data size: 11 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: a (type: decimal(8,2)), '' (type: string), '' (type: string)
      ..
      ..
      
      
      

      Attaching the execution plans for both queries for reference.

      2. The reason for 18 zeros in query-5 above is due to casting NULL to Decimal.
      And by default, the precision and scale are taken as (38,18) in Hive. This could be the reason for 18 zeros.

      3. This is repeating every time implicit type casting is happening on EMPTY strings.

      If excluding few columns in one of the SELECT statement is absolutely necessary, then the only Workaround is to explicitly type cast the empty strings to same Datatypes as the Other Select statement which included the columns.

      For ex:
      Q1:

      select a,b,c from decisample3
      union all
      select a,cast(' ' as int),cast(' ' as decimal) from decisample;

      Q2:

      select a,cast(' ' as int),cast(' ' as decimal) from decisample
      union all
      select a,b,c from decisample3;

      Both the above queries will give consistent result now.

      cast(' ' as int) ---> this was cast to INT, same as datatype of "b"
      cast(' ' as decimal) ---> this was cast to decimal, same as datatype of "c"

      Attachments

        Activity

          People

            Unassigned Unassigned
            adityaar Aditya Allamraju
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: