[PARQUET-1441] SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.11.0
Component/s: parquet-avro
Labels:
- pull-request-available

Description

The following unit test, added to TestAvroSchemaConverter, fails:

@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";

  Configuration conf = new Configuration(false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}

while this one, which sets parquet.avro.add-list-element-records to false, succeeds:

@Test
public void testConvertedSchemaToStringCantRedefineList() throws Exception {
  String parquet = "message spark_schema {\n" +
      "  optional group annotation {\n" +
      "    optional group transcriptEffects (LIST) {\n" +
      "      repeated group list {\n" +
      "        optional group element {\n" +
      "          optional group effects (LIST) {\n" +
      "            repeated group list {\n" +
      "              optional binary element (UTF8);\n" +
      "            }\n" +
      "          }\n" +
      "        }\n" +
      "      }\n" +
      "    }\n" +
      "  }\n" +
      "}\n";
 
  Configuration conf = new Configuration(false);
  conf.setBoolean("parquet.avro.add-list-element-records", false);
  AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
  Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
  schema.toString();
}

I don't see a way to make the code path in AvroIndexedRecordConverter respect this configuration, which results in the following stack trace downstream:

  Cause: org.apache.avro.SchemaParseException: Can't redefine: list
  at org.apache.avro.Schema$Names.put(Schema.java:1128)
  at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
  at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
  at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
  at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
  at org.apache.avro.Schema.toString(Schema.java:324)
  at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
  at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
  at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
  at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:66)
  at org.apache.parquet.avro.AvroCompatRecordMaterializer.<init>(AvroCompatRecordMaterializer.java:34)
  at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
  at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
  at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
  at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
...
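
The top frames of the trace (Schema$Names.put called while writing a record's name during toString) suggest that the converted schema contains two different record definitions that share the name "list", one per nested LIST group. The sketch below is a simplified stdlib-only model of that kind of name tracking, not Avro's actual code: the class NameRegistrySketch, the Node type, and the toJson methods are all hypothetical names introduced for illustration, and the exception type is a stand-in for SchemaParseException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of name tracking during schema serialization. */
class NameRegistrySketch {

    /** A named node with children; stands in for a named record schema. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    /** Serializes a tree, allowing each name to be defined only once. */
    static String toJson(Node node, Map<String, Node> seen) {
        Node known = seen.get(node.name);
        if (known == node) {
            // Same definition seen before: emit a reference by name.
            return "\"" + node.name + "\"";
        }
        if (known != null) {
            // A different definition already owns this name.
            throw new IllegalStateException("Can't redefine: " + node.name);
        }
        seen.put(node.name, node);
        StringBuilder sb = new StringBuilder("{\"name\":\"" + node.name + "\",\"fields\":[");
        for (int i = 0; i < node.children.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(toJson(node.children.get(i), seen));
        }
        return sb.append("]}").toString();
    }

    static String toJson(Node root) {
        return toJson(root, new HashMap<>());
    }

    public static void main(String[] args) {
        // Mirrors the failing shape: an outer "list" whose element nests a second, different "list".
        Node inner = new Node("list");
        Node outer = new Node("list", new Node("element", inner));
        try {
            toJson(new Node("annotation", outer));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Can't redefine: list"
        }
    }
}
```

In this model, reusing the same definition under one name is fine (it is written as a reference), while a second, different definition with the same name fails. That matches the observation above that the schema converts without error and only fails once it is serialized with toString().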

See also the downstream issues:
https://issues.apache.org/jira/browse/SPARK-25588
https://github.com/bigdatagenomics/adam/issues/2058

Issue Links

duplicates
- PARQUET-1409 Can write but read parquet file with nested arrays (Open)

links to
- GitHub Pull Request #555
- GitHub Pull Request #560
