Parquet / PARQUET-651

parquet-avro fails to correctly decode an array of records with a single field named "element"

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.0, 1.8.0, 1.8.1, 1.9.0
    • Fix Version/s: 1.9.0, 1.8.2
    • Component/s: parquet-avro
    • Labels: None

    Description

      Found this issue while investigating SPARK-16344.

      Given the following Parquet schema:

      message root {
        optional group f (LIST) {
          repeated group list {
            optional group element {
              optional int64 element;
            }
          }
        }
      }
      

      parquet-avro decodes it as something like this:

      record SingleElement {
        long element;
      }
      
      record NestedSingleElement {
        SingleElement element;
      }
      
      record Spark16344Wrong {
        array<NestedSingleElement> f;
      }
      

      while the correct interpretation should be:

      record SingleElement {
        long element;
      }
      
      record Spark16344 {
        array<SingleElement> f;
      }
      

      The reason is that the element syntactic group for LIST in

      <list-repetition> group <name> (LIST) {
        repeated group list {
          <element-repetition> <element-type> element;
        }
      }
      

      is recognized as a record field named element. The problematic code lies in AvroRecordConverter.isElementType(). We should probably check the standard 3-level layout first before falling back to the legacy 2-level layout.
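      The proposed ordering can be sketched as follows. This is only an illustration, not the actual AvroRecordConverter code: the Node class and the isStandardThreeLevel and resolveElement methods are hypothetical stand-ins for the parquet-mr schema types, and the check assumes the standard layout is identified purely by a repeated group named list with a single field named element.

```java
import java.util.Arrays;
import java.util.List;

public class ListLayoutSketch {
    /** Hypothetical stand-in for a Parquet schema node (not the parquet-mr Type API). */
    public static final class Node {
        public final String name;
        public final boolean repeated;
        public final List<Node> children; // empty for primitive fields

        public Node(String name, boolean repeated, Node... children) {
            this.name = name;
            this.repeated = repeated;
            this.children = Arrays.asList(children);
        }
    }

    /** True when the repeated node looks like the standard 3-level wrapper: repeated group list { ... element; } */
    public static boolean isStandardThreeLevel(Node repeatedNode) {
        return repeatedNode.repeated
            && repeatedNode.name.equals("list")
            && repeatedNode.children.size() == 1
            && repeatedNode.children.get(0).name.equals("element");
    }

    /** Returns the node that represents one array element. */
    public static Node resolveElement(Node repeatedNode) {
        if (isStandardThreeLevel(repeatedNode)) {
            // Standard layout first: "list" is a synthetic wrapper, so unwrap it.
            return repeatedNode.children.get(0);
        }
        // Legacy 2-level fallback: the repeated node is itself the element.
        return repeatedNode;
    }

    public static void main(String[] args) {
        // Schema from this issue: list -> element (group) -> element (int64).
        Node leaf = new Node("element", false);
        Node record = new Node("element", false, leaf);
        Node list = new Node("list", true, record);

        Node element = resolveElement(list);
        System.out.println(element.name + " has " + element.children.size() + " field(s)");
    }
}
```

      With this ordering, the array element for the schema above resolves to the inner group with the single int64 field, which corresponds to SingleElement rather than NestedSingleElement. The sketch simply prefers the standard reading; it cannot tell apart a legacy 2-level list whose repeated group happens to be named list and has a single field named element.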

            People

              Assignee: Ryan Blue (rdblue)
              Reporter: Cheng Lian (lian cheng)