  Parquet / PARQUET-1441

SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter

Details

    Description

      The following unit test, added to TestAvroSchemaConverter, fails on the final schema.toString() call:

      @Test
      public void testConvertedSchemaToStringCantRedefineList() throws Exception {
        String parquet = "message spark_schema {\n" +
            "  optional group annotation {\n" +
            "    optional group transcriptEffects (LIST) {\n" +
            "      repeated group list {\n" +
            "        optional group element {\n" +
            "          optional group effects (LIST) {\n" +
            "            repeated group list {\n" +
            "              optional binary element (UTF8);\n" +
            "            }\n" +
            "          }\n" +
            "        }\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}\n";
      
        Configuration conf = new Configuration(false);
        AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
        Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
        schema.toString();
      }
      

      while this one, which differs only in setting parquet.avro.add-list-element-records to false, succeeds:

      @Test
      public void testConvertedSchemaToStringCantRedefineList() throws Exception {
        String parquet = "message spark_schema {\n" +
            "  optional group annotation {\n" +
            "    optional group transcriptEffects (LIST) {\n" +
            "      repeated group list {\n" +
            "        optional group element {\n" +
            "          optional group effects (LIST) {\n" +
            "            repeated group list {\n" +
            "              optional binary element (UTF8);\n" +
            "            }\n" +
            "          }\n" +
            "        }\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}\n";
       
        Configuration conf = new Configuration(false);
        conf.setBoolean("parquet.avro.add-list-element-records", false);
        AvroSchemaConverter avroSchemaConverter = new AvroSchemaConverter(conf);
        Schema schema = avroSchemaConverter.convert(MessageTypeParser.parseMessageType(parquet));
        schema.toString();
      }
      

      I don't see a way to make the code path in AvroIndexedRecordConverter respect this configuration, so reading such a file downstream fails with the following stack trace:

        Cause: org.apache.avro.SchemaParseException: Can't redefine: list
        at org.apache.avro.Schema$Names.put(Schema.java:1128)
        at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
        at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
        at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
        at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
        at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
        at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
        at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
        at org.apache.avro.Schema.toString(Schema.java:324)
        at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
        at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
        at org.apache.parquet.avro.AvroIndexedRecordConverter$AvroArrayConverter.<init>(AvroIndexedRecordConverter.java:333)
        at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:172)
        at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
        at org.apache.parquet.avro.AvroIndexedRecordConverter.newConverter(AvroIndexedRecordConverter.java:168)
        at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:94)
        at org.apache.parquet.avro.AvroIndexedRecordConverter.<init>(AvroIndexedRecordConverter.java:66)
        at org.apache.parquet.avro.AvroCompatRecordMaterializer.<init>(AvroCompatRecordMaterializer.java:34)
        at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:144)
        at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:136)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
      ...
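
      For illustration only, here is a minimal, stdlib-only sketch of why the nested lists collide. It assumes that, with the default configuration, each "repeated group list" is converted to a record named "list", so the outer "list" record ends up containing a different record with the same name. The names RedefineListSketch, Rec and toJson are hypothetical; this is a simplified model of the name check seen in the trace (Schema$Names.put), not Avro's actual code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class RedefineListSketch {

    /** Hypothetical stand-in for a named record schema. */
    static final class Rec {
        final String name;
        // Each entry is either a String (a primitive type name) or a nested Rec.
        final List<Object> fieldTypes;

        Rec(String name, Object... fieldTypes) {
            this.name = name;
            this.fieldTypes = Arrays.asList(fieldTypes);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Rec
                && name.equals(((Rec) o).name)
                && fieldTypes.equals(((Rec) o).fieldTypes);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, fieldTypes);
        }
    }

    /** Serializes a type, tracking every record name already written in 'seen'. */
    static String toJson(Object type, Map<String, Rec> seen) {
        if (!(type instanceof Rec)) {
            return "\"" + type + "\"";
        }
        Rec rec = (Rec) type;
        Rec previous = seen.get(rec.name);
        if (rec.equals(previous)) {
            // Same definition seen before: a name reference is enough.
            return "\"" + rec.name + "\"";
        }
        if (previous != null) {
            // Same name, different definition: the case this issue reports.
            throw new IllegalStateException("Can't redefine: " + rec.name);
        }
        seen.put(rec.name, rec);
        StringBuilder sb = new StringBuilder("{\"name\":\"" + rec.name + "\",\"fields\":[");
        for (int i = 0; i < rec.fieldTypes.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(toJson(rec.fieldTypes.get(i), seen));
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        // Mirrors the nesting in the test schema: list -> element -> list.
        Rec innerList = new Rec("list", "string");
        Rec element = new Rec("element", innerList);
        Rec outerList = new Rec("list", element);
        try {
            toJson(outerList, new HashMap<>());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Can't redefine: list"
        }
    }
}
```

      In this model the failure goes away as soon as the two records no longer share a name, which is consistent with the second test passing when the "list" wrapper records are not generated.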
      

      See also the downstream issues:
      https://issues.apache.org/jira/browse/SPARK-25588
      https://github.com/bigdatagenomics/adam/issues/2058

            People

              nkollar Nándor Kollár
              heuermh Michael Heuer