Hive / HIVE-22004

Non-acid to acid conversion doesn't handle random filenames

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: Transactions

    Description

      Currently, the only filename patterns supported for the files of a non-acid table being converted to acid (the original files) are those created by Hive itself (e.g. 000000, 000000_COPY_1, bucket_00000, etc.). However, Hive can read non-acid tables whose files have random filenames. We should support the same for acid tables.

      One way to handle this would be to rename such files. Renaming is not a costly operation on HDFS, but for non-acid tables located on a blobstore such as S3, renaming files with random names would add costly steps to the acid conversion.

      Current scenario: each original file is assigned a logical bucket id derived from its filename. Files whose names match no recognized pattern are assigned -1 and ignored.
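
      The current behaviour can be illustrated with a minimal sketch. The class name, method name, and regular expressions below are hypothetical; they are modeled only on the example filenames in this issue and are not copied from Hive's source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not Hive's actual implementation.
final class OriginalFileBucketSketch {
    // Assumed pattern for names such as 000000, 000000_0, 000000_COPY_1.
    private static final Pattern ORIGINAL = Pattern.compile(
        "^([0-9]{1,9})(_[0-9]+)?(_copy_[0-9]+)?$", Pattern.CASE_INSENSITIVE);
    // Assumed pattern for names such as bucket_00000.
    private static final Pattern BUCKET = Pattern.compile("^bucket_([0-9]{1,9})$");

    // Returns the logical bucket id parsed from the filename,
    // or -1 when the name matches no recognized pattern.
    static int logicalBucketId(String fileName) {
        Matcher m = ORIGINAL.matcher(fileName);
        if (m.matches()) {
            return Integer.parseInt(m.group(1));
        }
        m = BUCKET.matcher(fileName);
        if (m.matches()) {
            return Integer.parseInt(m.group(1));
        }
        return -1; // unrecognized: such files are currently ignored
    }
}
```

      In this sketch a file with a random name, for example part-00000-abc.parquet, yields -1, which corresponds to the case this issue is about.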

      Proposed alternatives:

      1) Assume logical bucket id 0 for all files with random names and let them belong to the same bucket, similar to how we handle multiple files with the same bucket id (_copy_N).
      2) Sort the files with random names lexicographically and assign them bucket ids sequentially, similar to how multiple files are handled for a non-bucketed table, where the bucket id is simply extracted from the filename.
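
      The two alternatives can be sketched side by side. This is only an illustration: the class and method names are hypothetical, and the input is assumed to be the list of filenames that matched no recognized pattern.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two proposals; not Hive code.
final class RandomFileBucketAssignmentSketch {

    // Alternative 1: every file with a random name gets logical bucket id 0.
    static Map<String, Integer> assignAllToBucketZero(List<String> randomNames) {
        Map<String, Integer> bucketIds = new LinkedHashMap<>();
        for (String name : randomNames) {
            bucketIds.put(name, 0);
        }
        return bucketIds;
    }

    // Alternative 2: sort the names lexicographically, then assign 0, 1, 2, ...
    static Map<String, Integer> assignSequentially(List<String> randomNames) {
        List<String> sorted = new ArrayList<>(randomNames);
        Collections.sort(sorted); // natural String order, i.e. lexicographic
        Map<String, Integer> bucketIds = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            bucketIds.put(sorted.get(i), i);
        }
        return bucketIds;
    }
}
```

      For the names part-b.orc, data.orc and part-a.orc, the second method would assign data.orc to 0, part-a.orc to 1 and part-b.orc to 2, while the first assigns all three to 0. An assumption not stated in the issue: with alternative 1, the files sharing bucket 0 would likely still need a deterministic order among themselves, as the _copy_N files do.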

          People

            aditya-shah Aditya Shah
            Votes: 0
            Watchers: 2