Details
Description
I have JSON files of objects with a nested structure, roughly of the form:
{ "id": 123, "event": "login", "meta_data": { "user": "user1" } }
...
{ "id": 125, "event": "login", "meta_data": { "user": "user2" } }
I load the data via Spark, save it as a Parquet file, then read it back and register a temp table:

rdd = sql_context.jsonFile()
rdd.saveAsParquetFile()
rdd = sql_context.parquetFile()
rdd.registerTempTable('events')
With predicate pushdown disabled, this query works without issue:

select count(1) from events where meta_data.user = "user1"

With predicate pushdown enabled, the same query fails with an error saying meta_data.user is not in the schema:
Py4JJavaError: An error occurred while calling o218.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
.....
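As a workaround until the pushdown bug is fixed, the Parquet filter pushdown can be turned off for the session so the nested-field filter is evaluated by Spark itself rather than handed to the Parquet reader. A minimal sketch, assuming the Spark 1.x PySpark API used above and a hypothetical Parquet path (the config key is `spark.sql.parquet.filterPushdown`):

```python
# Workaround sketch: disable Parquet predicate pushdown so the filter on
# the nested column meta_data.user is applied by Spark after the scan,
# instead of being pushed into the Parquet reader (which rejects it).
# Assumes an existing SQLContext named sql_context (Spark 1.x API) and a
# hypothetical path "/data/events.parquet".
sql_context.setConf("spark.sql.parquet.filterPushdown", "false")

events = sql_context.parquetFile("/data/events.parquet")
events.registerTempTable("events")
sql_context.sql(
    'select count(1) from events where meta_data.user = "user1"'
).collect()
```

This trades the I/O savings of pushdown for correctness; only filters on nested columns are affected, so re-enabling the flag is safe once the duplicate issue below is resolved.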
I suspect this is related to another bug I filed where nested structure is not preserved with Spark SQL.
Issue Links
- duplicates SPARK-17636 "Parquet predicate pushdown for nested fields" (Resolved)