DRILL-5941: Skip header / footer logic works incorrectly for Hive tables when file has several input splits


Details

    Description

      To reproduce
      1. Create a csv file with two columns (key, value) and 3000029 rows, where the first row is a header.
      The data file size should be greater than the chunk size of 256 MB. Copy the file to the distributed file system.
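      A minimal sketch of generating such a file (the output file name and the value padding are assumptions; only the row count, the 256 MB threshold and the target location come from the steps above):

      import java.io.BufferedWriter;
      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Paths;

      public class GenerateCsv {
        public static void main(String[] args) throws IOException {
          // 1 header line + 3000028 data rows = 3000029 lines; the padded value
          // column makes the file larger than the 256 MB chunk size.
          try (BufferedWriter out = Files.newBufferedWriter(Paths.get("h_table.csv"))) {
            out.write("key,value");
            out.newLine();
            for (long key = 1; key <= 3000028; key++) {
              out.write(key + ",value_" + "x".repeat(100));
              out.newLine();
            }
          }
          // Then copy the file to the distributed file system, e.g.:
          //   hadoop fs -put h_table.csv /tmp/h_table/
        }
      }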

      2. Create table in Hive:

      CREATE EXTERNAL TABLE `h_table`(
        `key` bigint,
        `value` string)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
      STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION
        'maprfs:/tmp/h_table'
      TBLPROPERTIES (
       'skip.header.line.count'='1');
      

      3. Execute the query select * from hive.h_table in Drill (i.e. query the data through the Hive plugin). The query returns fewer rows than expected. The expected result is 3000028 rows (the total row count minus one header row).
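      The row count can be checked through Drill's JDBC driver, for example (a sketch; it assumes a drillbit is reachable on localhost and the Drill JDBC driver is on the classpath):

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class CountCheck {
        public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
               Statement stmt = conn.createStatement();
               ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM hive.h_table")) {
            rs.next();
            // Expected: 3000028; with the bug, one extra row is lost
            // for every input split after the first.
            System.out.println("row count = " + rs.getLong(1));
          }
        }
      }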

      The root cause
      Since the file is greater than the default chunk size, it is split into several fragments, known as input splits. For example:

      maprfs:/tmp/h_table/h_table.csv:0+268435456
      maprfs:/tmp/h_table/h_table.csv:268435457+492782112
      

      TextHiveReader is responsible for handling the skip header and / or footer logic.
      Currently Drill creates a reader for each input split and applies the skip header and / or footer logic to every input split. Instead, the above mentioned input splits should be read by one reader, so that the skip header / footer logic is applied only once per file.
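      A self-contained toy sketch of the effect (not Drill code; the tiny file and split boundaries are made up to keep the example small) shows how skipping the header once per split loses a data row from every split after the first:

      import java.util.Arrays;
      import java.util.List;

      public class SkipHeaderPerSplitDemo {
        public static void main(String[] args) {
          // Simulated file: one header line followed by four data rows.
          List<String> file = Arrays.asList("key,value", "1,a", "2,b", "3,c", "4,d");

          // The file is divided into two "input splits".
          List<List<String>> splits = Arrays.asList(file.subList(0, 3), file.subList(3, 5));

          int skipHeaderLineCount = 1;

          // Current behaviour: a reader per split, each skipping the first line of its own split.
          int rowsRead = 0;
          for (List<String> split : splits) {
            rowsRead += split.size() - skipHeaderLineCount;
          }
          System.out.println("rows read with per-split skip: " + rowsRead); // 3, a data row was lost

          // Expected behaviour: the header is skipped once for the whole file.
          System.out.println("expected rows: " + (file.size() - skipHeaderLineCount)); // 4
        }
      }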
