HIVE-26533

Column data type is lost when an Avro table with a BYTE column is written through spark-sql


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.2
    • Fix Version/s: None
    • Component/s: None

    Description

      Describe the bug

      We are trying to store a table in the Avro file format through the spark-sql interface. The table's schema contains a column with the BYTE data type, and the column's name contains uppercase letters.

      When we INSERT a valid value (e.g. -128), we see the following warning:

      WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.

      Finally, when we perform a DESC on the table, we observe that the BYTE data type has been converted to int and that the column name's case has been lost (it is converted to lowercase).

      Steps to reproduce

      On Spark 3.2.1 (commit 4f25b3f712), launch spark-sql with the Avro package:

      ./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1

      Then execute the following:

      spark-sql> create table hive_tinyint_avro(c0 INT, C1 BYTE) ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe" STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat";
      22/08/28 15:44:21 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
      Time taken: 0.359 seconds
      spark-sql> insert into hive_tinyint_avro select 0, cast(-128 as byte);
      22/08/28 15:44:28 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
      22/08/28 15:44:29 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
      Time taken: 1.605 seconds
      spark-sql> desc hive_tinyint_avro;
      22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
      22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
      c0                      int
      c1                      int // Data type and case-sensitivity lost
      Time taken: 0.068 seconds, Fetched 2 row(s)
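
      To see which schema won, the table's full metadata can be inspected from the same session; we would expect the column section to show the narrowed, lower-cased c1 int alongside the AvroSerDe:

      spark-sql> describe formatted hive_tinyint_avro;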

      Expected behavior

      We expect the column name's case and the BYTE data type to be preserved. We tried other formats, namely Parquet and ORC, and both behave consistently with this expectation.

      Here are the logs from our attempt at doing the same with Parquet:

      spark-sql> create table hive_tinyint_parquet(c0 INT, C1 BYTE) stored as PARQUET;
      Time taken: 0.134 seconds
      spark-sql> insert into hive_tinyint_parquet select 0, cast(-128 as byte);
      Time taken: 0.995 seconds
      spark-sql> desc hive_tinyint_parquet;
      c0                      int
      C1                      tinyint  // Data type and case-sensitivity preserved
      Time taken: 0.092 seconds, Fetched 2 row(s)
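
      The analogous ORC session (commands only; we did not keep the log) is:

      spark-sql> create table hive_tinyint_orc(c0 INT, C1 BYTE) stored as ORC;
      spark-sql> insert into hive_tinyint_orc select 0, cast(-128 as byte);
      spark-sql> desc hive_tinyint_orc;

      As with Parquet, C1 comes back as tinyint with its case intact.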

      Root Cause

      TypeInfoToSchema's createAvroPrimitive is where Hive's BYTE, SHORT, and INT are all converted to Avro's INT:

            case BYTE:   // Hive TINYINT
              schema = Schema.create(Schema.Type.INT);
              break;
            case SHORT:  // Hive SMALLINT
              schema = Schema.create(Schema.Type.INT);
              break;
            case INT:    // all three widths collapse to the same Avro INT
              schema = Schema.create(Schema.Type.INT);
              break;
      

       
      Once converted into an Avro schema, we lose track of the actual Hive schema specified by the user. Therefore, once TINYINT/BYTE has been converted to INT, the original type can no longer be recovered from the AvroSerDe instance.
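
      To make the loss concrete, here is a minimal sketch (our own illustration against the Avro Schema API, not Hive code): once BYTE and INT have both been mapped to Schema.Type.INT, the resulting schemas are indistinguishable, so no reverse mapping can recover TINYINT.

      import org.apache.avro.Schema;

      public class WidthLossDemo {
        public static void main(String[] args) {
          // What createAvroPrimitive produces for Hive BYTE ...
          Schema fromByte = Schema.create(Schema.Type.INT);
          // ... and what it produces for Hive INT: the same primitive schema.
          Schema fromInt = Schema.create(Schema.Type.INT);

          System.out.println(fromByte.equals(fromInt)); // true
          System.out.println(fromByte);                 // "int" -- no trace of TINYINT left
        }
      }

      Since the AvroSerDe works off this Avro schema from then on, the narrowed type is presumably what the metastore hands back to Spark, matching the struct<c0:int,c1:int> in the warning above.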
       

          People

            Assignee: Unassigned
            Reporter: x/sys (xsys)
