Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-25449

datediff() gives wrong output when run in a tez task with some non-UTC timezone

    XMLWordPrintableJSON

Details

    Description

      Repro (thanks Qiaosong Dong) -

      Add -Duser.timezone=GMT+8 to tez.task.launch.cmd-opts

      create external table test_dt(id string, dt date);
      insert into test_dt values('11', '2021-07-06'), ('22', '2021-07-07');
      
      select datediff(dt1.dt, '2021-07-01') from test_dt dt1 left join test_dt dt on dt1.id = dt.id;
      +------+
      | _c0  |
      +------+
      | 6    |
      | 7    |
      +------+
      

      Expected output -

      +------+
      | _c0  |
      +------+
      | 5    |
      | 6    |
      +------+
      

      Cause

      This happens because in VectorUDFDateDiffColScalar class
      1. For second argument(scalar) , we use java.text.SimpleDateFormat to parse the date strings which interprets it to be in local timezone.

      2. For first column we get a column vector which represents the date as epoch day. This is always in UTC.

      Solution

      We need to check other variants of datediff UDFs as well and change the parsing mechanism to always interpret date strings in UTC.

      I did a quick change in VectorUDFDateDiffColScalar which fixes the issue.

      -          date.setTime(formatter.parse(new String(bytesValue, "UTF-8")).getTime());
      -          baseDate = DateWritableV2.dateToDays(date);
      +          org.apache.hadoop.hive.common.type.Date hiveDate
      +              = org.apache.hadoop.hive.common.type.Date.valueOf(new String(bytesValue, "UTF-8"));
      +          date.setTime(hiveDate.toEpochMilli());
      +          baseDate = hiveDate.toEpochDay();
      

      Attachments

        Issue Links

          Activity

            People

              sumitverma4 Sumit Verma
              ShubhamChaurasia Shubham Chaurasia
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m