Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-22561

Data loss on map join for bucketed, partitioned table

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Patch Available
    • Blocker
    • Resolution: Unresolved
    • 3.1.2
    • None
    • None
    • None

    Description

      A map join on a column (which is neither involved in bucketing and partition) causes data loss. 

      Steps to reproduce:

      Env: [hive-dev-box|https://github.com/kgyrtkirk/hive-dev-box] hive 3.1.2.

      Create tables:

       

      CREATE TABLE `testj2`(
        `id` int, 
        `bn` string, 
        `cn` string, 
        `ad` map<string,int>, 
        `mi` array<int>)
      PARTITIONED BY ( 
        `br` string)
      CLUSTERED BY ( 
        bn) 
      INTO 2 BUCKETS
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      TBLPROPERTIES (
        'bucketing_version'='2');
      
      CREATE TABLE `testj1`(
        `id` int, 
        `can` string, 
        `cn` string, 
        `ad` map<string,int>, 
        `av` boolean, 
        `mi` array<int>)
      PARTITIONED BY ( 
        `brand` string)
      CLUSTERED BY ( 
        can) 
      INTO 2 BUCKETS
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      TBLPROPERTIES (
        'bucketing_version'='2');
      

      insert some data in both:

      insert into testj1 values (100, 'mes_1', 'customer_1',  map('city1', 560077), false, array(5, 10), 'brand_1'),
      (101, 'mes_2', 'customer_2',  map('city2', 560078), true, array(10, 20), 'brand_2'),
      (102, 'mes_3', 'customer_3',  map('city3', 560079), false, array(15, 30), 'brand_3'),
      (103, 'mes_4', 'customer_4',  map('city4', 560080), true, array(20, 40), 'brand_4'),
      (104, 'mes_5', 'customer_5',  map('city5', 560081), false, array(25, 50), 'brand_5');
      
      insert into table testj2 values (100, 'tv_0', 'customer_0', map('city0', 560076),array(0, 0, 0), 'tv'),
      (101, 'tv_1', 'customer_1', map('city1', 560077),array(20, 25, 30), 'tv'),
      (102, 'tv_2', 'customer_2', map('city2', 560078),array(40, 50, 60), 'tv'),
      (103, 'tv_3', 'customer_3', map('city3', 560079),array(60, 75, 90), 'tv'),
      (104, 'tv_4', 'customer_4', map('city4', 560080),array(80, 100, 120), 'tv');
      

      Do a join between them:

      select t1.id, t1.can, t1.cn, t2.bn,t2.ad, t2.br FROM testj1 t1 JOIN testj2 t2 on (t1.id = t2.id) order by t1.id;
      

      Observed results:

      In the plan, I can see a map join. Disabling it gives the correct result.

       

       

      Attachments

        1. HIVE-22561.1.branch-3.1.patch
          39 kB
          Aditya Shah
        2. HIVE-22561.branch-3.1.patch
          39 kB
          Aditya Shah
        3. HIVE-22561.patch
          39 kB
          Aditya Shah
        4. image-2019-11-28-20-46-25-432.png
          43 kB
          Aditya Shah
        5. Screenshot 2019-11-28 at 8.45.17 PM.png
          43 kB
          Aditya Shah

        Activity

          People

            aditya-shah Aditya Shah
            aditya-shah Aditya Shah
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated: