Hive / HIVE-21924

Split text files even if header/footer exists


Details

    Description

      https://github.com/apache/hive/blob/967a1cc98beede8e6568ce750ebeb6e0d048b8ea/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L494-L503

          int headerCount = 0;
          int footerCount = 0;
          if (table != null) {
            headerCount = Utilities.getHeaderCount(table);
            footerCount = Utilities.getFooterCount(table, conf);
            if (headerCount != 0 || footerCount != 0) {
              // Input file has header or footer, cannot be splitted.
              HiveConf.setLongVar(conf, ConfVars.MAPREDMINSPLITSIZE, Long.MAX_VALUE);
            }
          }
      

      This piece of code makes CSV files (or any text files with a header/footer) non-splittable whenever a header or footer is present.
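To illustrate why raising the minimum split size to Long.MAX_VALUE disables splitting, here is a minimal sketch. It assumes the underlying input format uses the split-size formula of Hadoop's old-API FileInputFormat.computeSplitSize; the class name SplitSizeSketch and the sample sizes are made up for illustration.

```java
public class SplitSizeSketch {
    // Same arithmetic as the old-API (mapred) FileInputFormat.computeSplitSize:
    // the minimum split size wins over both the goal size and the block size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // illustrative block size
        long goalSize = 64L * 1024 * 1024;   // illustrative totalSize / numSplits

        // Default-like minimum: the split size is bounded by goal/block size.
        System.out.println(computeSplitSize(goalSize, 1L, blockSize));

        // Minimum forced to Long.MAX_VALUE: every file fits in a single split.
        System.out.println(computeSplitSize(goalSize, Long.MAX_VALUE, blockSize));
    }
}
```

With the minimum pinned to Long.MAX_VALUE, no file can exceed the split size, so each file is read by a single task regardless of its length.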
      If only a header is present, we can find the offset just after the first line break and start splitting from there. Similarly, for a footer we could read a few KB of data at the end of the file and find the offset of the last line break. These offsets determine the data range that can be split. A few extra reads during split generation are cheaper than not splitting the file at all.
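The idea above could be sketched roughly as follows. This is not the code from the attached patches; it is a standalone illustration on a local file using java.io.RandomAccessFile. The class HeaderFooterOffsets and the methods dataStart and dataEnd are hypothetical names. It assumes lines end in '\n' and that no line break is embedded inside a quoted field.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class HeaderFooterOffsets {
    /** Offset of the first byte after headerCount lines (file length if the file is shorter). */
    static long dataStart(RandomAccessFile f, int headerCount) throws IOException {
        long len = f.length();
        long pos = 0;
        int seen = 0;
        byte[] buf = new byte[4096];
        f.seek(0);
        while (seen < headerCount && pos < len) {
            int n = f.read(buf, 0, (int) Math.min(buf.length, len - pos));
            if (n <= 0) break;
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n' && ++seen == headerCount) return pos + i + 1;
            }
            pos += n;
        }
        return seen < headerCount ? len : pos;
    }

    /** Offset where the last footerCount lines begin (0 if the file is shorter). */
    static long dataEnd(RandomAccessFile f, int footerCount) throws IOException {
        long len = f.length();
        if (footerCount == 0 || len == 0) return len;
        long end = len;
        f.seek(len - 1);
        if (f.read() == '\n') end = len - 1; // ignore the terminating line break
        int seen = 0;
        byte[] buf = new byte[4096];
        while (end > 0) { // scan backwards in small chunks from the end of the file
            int n = (int) Math.min(buf.length, end);
            long start = end - n;
            f.seek(start);
            f.readFully(buf, 0, n);
            for (int i = n - 1; i >= 0; i--) {
                if (buf[i] == '\n' && ++seen == footerCount) return start + i + 1;
            }
            end = start;
        }
        return 0;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("header-footer", ".csv");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "h\na\nb\nf\n".getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile f = new RandomAccessFile(tmp, "r")) {
            // Only the bytes in [dataStart, dataEnd) would be handed to split generation.
            System.out.println("data range: [" + dataStart(f, 1) + ", " + dataEnd(f, 1) + ")");
        }
    }
}
```

For the four-line sample file with one header and one footer line, the printed range is [2, 6), which covers only the two data rows. A real implementation would read through the Hadoop FileSystem API and would also need to handle compressed files, which this sketch ignores.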

      Attachments

        1. HIVE-21924.patch
          46 kB
          Mustafa İman
        2. HIVE-21924.6.patch
          61 kB
          Mustafa İman
        3. HIVE-21924.5.patch
          61 kB
          Mustafa İman
        4. HIVE-21924.4.patch
          60 kB
          Mustafa İman
        5. HIVE-21924.3.patch
          60 kB
          Mustafa İman
        6. HIVE-21924.2.patch
          51 kB
          Mustafa İman


          People

            Mustafa İman (mustafaiman)
            Prasanth Jayachandran (prasanth_j)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 4h 40m
