Spark / SPARK-34365

Support configurable Avro schema field matching for positional or by-name

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.1
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      When reading an Avro dataset (using either the dataset's own schema or one supplied via 'avroSchema'), or when writing an Avro dataset with a schema supplied via 'avroSchema', Catalyst fields are currently matched to Avro fields by field name.

      This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at least on the write path, we would match the schemas positionally (a "structural" comparison). While I agree that by-name matching is a much more sensible default, I propose that we make this behavior configurable using an option for the Avro datasource. Even at the time that SPARK-27762 was handled, there was interest in making this behavior configurable, but it appears to have gone unaddressed.

      There is precedent for making this behavior configurable: SPARK-32864 added this support for ORC. Beyond that precedent, Hive performs matching positionally (ref), so this is behavior that Hadoop/Hive ecosystem users are familiar with:

      Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance.
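
To make the difference between the two strategies concrete, here is a minimal, Spark-independent sketch in Python. The helper names (`match_by_name`, `match_by_position`) and the example schemas are hypothetical and are not Spark APIs. Real matching in Spark also has to deal with types and nested records, which this sketch leaves out.

```python
from typing import Dict, List, Tuple

# A field is modeled as (field name, type name); purely illustrative.
Field = Tuple[str, str]


def match_by_name(catalyst: List[Field], avro: List[Field]) -> Dict[str, str]:
    """Pair each Catalyst field with the Avro field of the same name."""
    avro_names = {name for name, _ in avro}
    missing = [name for name, _ in catalyst if name not in avro_names]
    if missing:
        raise ValueError(f"no Avro field named: {missing}")
    return {name: name for name, _ in catalyst}


def match_by_position(catalyst: List[Field], avro: List[Field]) -> Dict[str, str]:
    """Pair the i-th Catalyst field with the i-th Avro field, ignoring names."""
    if len(catalyst) != len(avro):
        raise ValueError("field counts differ")
    return {c_name: a_name for (c_name, _), (a_name, _) in zip(catalyst, avro)}


# Hypothetical schemas: same shape, different field names.
catalyst_fields = [("id", "long"), ("name", "string")]
avro_fields = [("user_id", "long"), ("user_name", "string")]

# Positional matching pairs the fields up despite the differing names.
positional = match_by_position(catalyst_fields, avro_fields)

# By-name matching cannot pair them, because no names are shared.
try:
    match_by_name(catalyst_fields, avro_fields)
    by_name_failed = False
except ValueError:
    by_name_failed = True
```

With these inputs, positional matching maps `id` to `user_id` and `name` to `user_name`, while by-name matching raises an error. The trade-off is that positional matching will just as readily pair two unrelated fields whose order happens to line up, which is why by-name matching is the safer default.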

            People

              Assignee: xkrogen Erik Krogen
              Reporter: xkrogen Erik Krogen