HIVE-26533

Column data type is lost when an Avro table with a BYTE column is written through spark-sql


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.2
    • Fix Version/s: None
    • Component/s: None

    Description

      Describe the bug

      We are trying to store a table in the Avro file format through the spark-sql interface. The table's schema contains a column with the BYTE data type, and the column's name contains uppercase letters.

      When we INSERT a valid value (e.g. -128), we see the following warning:

      WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.

      Finally, when we perform a DESC on the table, we observe that the BYTE data type has been converted to int and that the column name's case has been lost (it is converted to lowercase).

      Steps to reproduce

      On Spark 3.2.1 (commit 4f25b3f712), launch spark-sql with the Avro package:

      ./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1

      Then execute the following:

      spark-sql> create table hive_tinyint_avro(c0 INT, C1 BYTE) ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe" STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat";
      22/08/28 15:44:21 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
      Time taken: 0.359 seconds
      spark-sql> insert into hive_tinyint_avro select 0, cast(-128 as byte);
      22/08/28 15:44:28 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
      22/08/28 15:44:29 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
      Time taken: 1.605 seconds
      spark-sql> desc hive_tinyint_avro;
      22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
      22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
      c0                      int
      c1                      int // Data type and case-sensitivity lost
      Time taken: 0.068 seconds, Fetched 2 row(s)
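
      To see which schema won, the table's full metadata can be inspected from the same session; we would expect the column section to show the narrowed, lower-cased c1 int alongside the AvroSerDe:

      spark-sql> describe formatted hive_tinyint_avro;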

      Expected behavior

      We expect the column name's case and the BYTE data type to be preserved. We tried other formats, namely Parquet and ORC, and both behave consistently with this expectation.

      Here are the logs from our attempt at doing the same with Parquet:

      spark-sql> create table hive_tinyint_parquet(c0 INT, C1 BYTE) stored as PARQUET;
      Time taken: 0.134 seconds
      spark-sql> insert into hive_tinyint_parquet select 0, cast(-128 as byte);
      Time taken: 0.995 seconds
      spark-sql> desc hive_tinyint_parquet;
      c0                      int
      C1                      tinyint  // Data type and case-sensitivity preserved
      Time taken: 0.092 seconds, Fetched 2 row(s)
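
      The analogous ORC session (commands only; we did not keep the log) is:

      spark-sql> create table hive_tinyint_orc(c0 INT, C1 BYTE) stored as ORC;
      spark-sql> insert into hive_tinyint_orc select 0, cast(-128 as byte);
      spark-sql> desc hive_tinyint_orc;

      As with Parquet, C1 comes back as tinyint with its case intact.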

      Root Cause

      TypeInfoToSchema's createAvroPrimitive is where Hive's BYTE, SHORT, and INT are all converted to Avro's INT:

            case BYTE:   // Hive TINYINT
              schema = Schema.create(Schema.Type.INT);
              break;
            case SHORT:  // Hive SMALLINT
              schema = Schema.create(Schema.Type.INT);
              break;
            case INT:    // all three widths collapse to the same Avro INT
              schema = Schema.create(Schema.Type.INT);
              break;
      

       
      Once converted into an Avro schema, we lose track of the actual Hive schema specified by the user. Therefore, once TINYINT/BYTE has been converted to INT, the original type can no longer be recovered from the AvroSerDe instance.
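
      To make the loss concrete, here is a minimal sketch (our own illustration against the Avro Schema API, not Hive code): once BYTE and INT have both been mapped to Schema.Type.INT, the resulting schemas are indistinguishable, so no reverse mapping can recover TINYINT.

      import org.apache.avro.Schema;

      public class WidthLossDemo {
        public static void main(String[] args) {
          // What createAvroPrimitive produces for Hive BYTE ...
          Schema fromByte = Schema.create(Schema.Type.INT);
          // ... and what it produces for Hive INT: the same primitive schema.
          Schema fromInt = Schema.create(Schema.Type.INT);

          System.out.println(fromByte.equals(fromInt)); // true
          System.out.println(fromByte);                 // "int" -- no trace of TINYINT left
        }
      }

      Since the AvroSerDe works off this Avro schema from then on, the narrowed type is presumably what the metastore hands back to Spark, matching the struct<c0:int,c1:int> in the warning above.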
       

          People

            Assignee: Unassigned
            Reporter: x/sys (xsys)
