Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- 3.1.2
- None
- None
Description
Describe the bug
We are trying to store a table through the spark-sql interface with the Avro file format. The table's schema contains a column with the BYTE data type. Additionally, the column's name contains uppercase letters.
When we INSERT a valid value (e.g. -128), we see the following warning:
WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
Finally, when we run DESC on the table, we observe that the BYTE data type has been converted to int and that the case of the column name has been lost (the name is converted to lowercase).
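The fallback described by the warning can be illustrated with a minimal sketch. It is not Spark's actual code: the class, the `Field` type, and the `resolve` helper are hypothetical names, and the rule modeled here (keep the Spark-recorded schema only when it matches the metastore schema apart from name case) is an assumption inferred from the warning text. The Parquet metastore schema in `main` is likewise assumed, not taken from a log.

```java
import java.util.Arrays;
import java.util.List;

public class SchemaFallbackSketch {
    // Hypothetical stand-in for a column: a name plus a type name.
    static final class Field {
        final String name;
        final String type;
        Field(String name, String type) { this.name = name; this.type = type; }
    }

    // True when both schemas have identical types and names that differ at most in case.
    static boolean equalsIgnoringCase(List<Field> a, List<Field> b) {
        if (a.size() != b.size()) return false;
        for (int i = 0; i < a.size(); i++) {
            Field x = a.get(i);
            Field y = b.get(i);
            if (!x.name.equalsIgnoreCase(y.name) || !x.type.equals(y.type)) return false;
        }
        return true;
    }

    // Assumed rule: keep the case-preserving Spark schema when the two agree,
    // otherwise fall back to the (lowercased) metastore schema.
    static List<Field> resolve(List<Field> sparkSchema, List<Field> metastoreSchema) {
        return equalsIgnoringCase(sparkSchema, metastoreSchema) ? sparkSchema : metastoreSchema;
    }

    static String describe(List<Field> schema) {
        StringBuilder sb = new StringBuilder();
        for (Field f : schema) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(f.name).append(' ').append(f.type);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Field> spark = Arrays.asList(new Field("c0", "int"), new Field("C1", "tinyint"));
        // Avro case from this report: the metastore reports int for C1, so the types differ.
        List<Field> avroMetastore = Arrays.asList(new Field("c0", "int"), new Field("c1", "int"));
        // Assumed Parquet case: only the name case differs.
        List<Field> parquetMetastore = Arrays.asList(new Field("c0", "int"), new Field("c1", "tinyint"));

        System.out.println(describe(resolve(spark, avroMetastore)));
        System.out.println(describe(resolve(spark, parquetMetastore)));
    }
}
```

Under these assumptions, the type mismatch alone is enough to discard the Spark-recorded schema, which would explain why the column name case is lost together with the data type.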
Steps to reproduce
On Spark 3.2.1 (commit 4f25b3f712), using spark-sql with the Avro package:
./bin/spark-sql --packages org.apache.spark:spark-avro_2.12:3.2.1
Execute the following:
spark-sql> create table hive_tinyint_avro(c0 INT, C1 BYTE) ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.avro.AvroSerDe" STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat";
22/08/28 15:44:21 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 0.359 seconds
spark-sql> insert into hive_tinyint_avro select 0, cast(-128 as byte);
22/08/28 15:44:28 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:29 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
Time taken: 1.605 seconds
spark-sql> desc hive_tinyint_avro;
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
22/08/28 15:44:32 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<c0:int,c1:int>) is different from the schema when this table was created by Spark SQL(struct<c0:int,C1:tinyint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
c0 int
c1 int // Data type and case-sensitivity lost
Time taken: 0.068 seconds, Fetched 2 row(s)
Expected behavior
We expect the case sensitivity and data type to be preserved. We tried other formats like Parquet & ORC and the outcome is consistent with this expectation.
Here are the logs from our attempt at doing the same with Parquet:
spark-sql> create table hive_tinyint_parquet(c0 INT, C1 BYTE) stored as PARQUET;
Time taken: 0.134 seconds
spark-sql> insert into hive_tinyint_parquet select 0, cast(-128 as byte);
Time taken: 0.995 seconds
spark-sql> desc hive_tinyint_parquet;
c0 int
C1 tinyint // Data type and case-sensitivity preserved
Time taken: 0.092 seconds, Fetched 2 row(s)
Root Cause
TypeInfoToSchema's createAvroPrimitive is where Hive's BYTE, SHORT & INT are all converted into Avro's INT:
case BYTE:
  schema = Schema.create(Schema.Type.INT);
  break;
case SHORT:
  schema = Schema.create(Schema.Type.INT);
  break;
case INT:
  schema = Schema.create(Schema.Type.INT);
  break;
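Once the Hive types are converted into an Avro schema, the actual Hive schema specified by the user is no longer tracked. Because BYTE, SHORT and INT all map to the same Avro type, a TINYINT/BYTE column cannot be told apart from INT afterwards, and the original type is lost in the AvroSerde instance.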
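The loss can be shown with a small self-contained sketch. It does not use Hive or Avro classes: `HiveType`, `AvroType`, `toAvro` and `fromAvro` are hypothetical stand-ins, and the reverse mapping (Avro INT back to Hive INT) is an assumption about how the schema is read back, consistent with the int column seen in the DESC output.

```java
public class LossyTypeMappingSketch {
    // Hypothetical stand-ins for the Hive primitive categories and the Avro type involved.
    enum HiveType { BYTE, SHORT, INT }
    enum AvroType { INT }

    // Mirrors the many-to-one mapping quoted above: all three Hive types become Avro INT.
    static AvroType toAvro(HiveType type) {
        switch (type) {
            case BYTE:
            case SHORT:
            case INT:
                return AvroType.INT;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // Assumed reverse mapping: Avro INT carries no width information, so it can only
    // be read back as a single Hive type.
    static HiveType fromAvro(AvroType type) {
        switch (type) {
            case INT:
                return HiveType.INT;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    static HiveType roundTrip(HiveType type) {
        return fromAvro(toAvro(type));
    }

    public static void main(String[] args) {
        for (HiveType t : HiveType.values()) {
            System.out.println(t + " -> " + toAvro(t) + " -> " + roundTrip(t));
        }
    }
}
```

In this sketch every input, including BYTE, comes back as INT after the round trip, which is the behavior observed for column C1.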
Issue Links
- is related to: HIVE-14509 AvroSerde mutates tinyint and smallint columns when specifying native columns (Open)