Hive / HIVE-22371

CTAS not working with non-ACID managed tables


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: Query Planning
    • Labels: None

    Description

      I tested the master branch at the commit for HIVE-21344 (f16509a5c9187f592c48c253ee001fc3a5e0d508), which was committed on 12 Oct.

      When I submitted the query below, it finished without any errors.

      create table call_center
      stored as orc
      as select * from tpcds_text_2.call_center;
      

      However, "select count(*) from call_center" returned 0, and the data layout in HDFS looks wrong:

      • Two `call_center` directories were created, one in the warehouse directory and another in the external warehouse directory.
      • The `call_center` directory in the external warehouse is empty.
      > hdfs dfs -du -h $WAREHOUSE_PATH
      5.0 K 14.9 K $WAREHOUSE_PATH/call_center
      0 0 $WAREHOUSE_PATH/tpcds_text_2.db

      > hdfs dfs -du -h $EXTERNAL_WAREHOUSE_PATH
      2.1 G 2.1 G $EXTERNAL_WAREHOUSE_PATH/2
      0 0 $EXTERNAL_WAREHOUSE_PATH/call_center
      

      After a few hours of digging, I suspect this bug was introduced in HIVE-22158, which creates every non-ACID managed table in the external warehouse directory by default. In the example above, call_center is intended as a managed table but is not explicitly specified as ACID. Hence, it should be created in the external warehouse directory.

      However, the table call_center created in the external warehouse directory is empty, while another non-empty table of the same name is created in the warehouse directory. This is because in the current implementation, the (buggy) compiled query plan proceeds as follows:

      1. Write data to a temporary directory
      2. Move the data to the warehouse directory ($WAREHOUSE_PATH/call_center)
      3. Create a table using data in the warehouse directory

      Without the bug, step 2 would move the data to the external warehouse directory, and step 3 would create a table using the data in the external warehouse directory. The crux of the problem is that the query compiler checks only whether the query includes the keyword "external". It does not account for the changes made in HIVE-22158, so the compiler should be updated accordingly.
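
      To make the mismatch concrete, here is a minimal sketch of the two decision rules as I understand them. It is not Hive code: the class, enum, and method names are hypothetical, and it ignores any configuration that may affect the HIVE-22158 behavior.

```java
/** Hypothetical sketch; none of these names come from the Hive code base. */
public final class CtasLocationSketch {

    /** Which warehouse directory the CTAS output is moved to. */
    public enum Warehouse { MANAGED, EXTERNAL }

    // Rule the compiler appears to apply today (assumption based on the
    // description above): only the "external" keyword is considered.
    public static Warehouse keywordOnly(boolean hasExternalKeyword) {
        return hasExternalKeyword ? Warehouse.EXTERNAL : Warehouse.MANAGED;
    }

    // Rule that would match HIVE-22158 as described above: a table that is
    // not ACID also lives in the external warehouse directory.
    public static Warehouse keywordAndAcid(boolean hasExternalKeyword, boolean isAcid) {
        return (hasExternalKeyword || !isAcid) ? Warehouse.EXTERNAL : Warehouse.MANAGED;
    }

    public static void main(String[] args) {
        // The CTAS from this report: no "external" keyword, not declared ACID.
        System.out.println(keywordOnly(false));           // prints MANAGED
        System.out.println(keywordAndAcid(false, false)); // prints EXTERNAL
    }
}
```

      For the query in this report the two rules disagree, which would explain why the data is moved to $WAREHOUSE_PATH/call_center while the table itself is created under $EXTERNAL_WAREHOUSE_PATH/call_center.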


          People

            Assignee: Nishant Goel (nisgoel)
            Reporter: Jaechang Kim (jc5201)
