  ORC / ORC-200

json-schema and convert commands should support schema evolution of JSON documents


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: Java
    • Labels: None

    Description

      Using the command (sample payloads attached):
      java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v1.json

      Produces the following output:
      create table tbl (
      about string,
      address string,
      age tinyint,
      balance string,
      company string,
      email string,
      eyeColor string,
      favoriteFruit string,
      friends array <struct <
      id: tinyint,
      name: string>>,
      gender string,
      greeting string,
      guid string,
      id binary,
      index tinyint,
      isActive boolean,
      latitude decimal(8,6),
      longitude decimal(8,6),
      name string,
      phone string,
      picture string,
      registered timestamp,
      tags array <string>
      )

      Because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for the fields instance variable, the generated DDL is sorted alphabetically by field name instead of following the order in which the fields appear in the JSON document. This causes problems for the convert command as well.
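
      The ordering difference can be reproduced outside ORC with a minimal sketch. It does not use the real StructType class: FieldOrderDemo, keyOrder, and the sample field order are hypothetical, because the attached example-v1.json is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FieldOrderDemo {
    // Inserts the field names in document order and returns the order in
    // which the given Map implementation hands them back.
    static List<String> keyOrder(Map<String, String> fields, List<String> documentOrder) {
        for (String name : documentOrder) {
            fields.put(name, "string"); // placeholder type, not real type inference
        }
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        // Hypothetical document order, chosen only for illustration.
        List<String> documentOrder = Arrays.asList("id", "index", "guid", "isActive", "balance");

        // TreeMap iterates keys in sorted order, so document order is lost.
        System.out.println("TreeMap:       " + keyOrder(new TreeMap<String, String>(), documentOrder));

        // LinkedHashMap iterates keys in insertion order.
        System.out.println("LinkedHashMap: " + keyOrder(new LinkedHashMap<String, String>(), documentOrder));
    }
}
```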

      java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc

      <output omitted for brevity>

      "schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
      "schema": [
      {
      "columnId": 0,
      "columnType": "STRUCT",
      "childColumnNames": [
      "about",
      "address",
      "age",
      "balance",
      "company",
      "email",
      "eyeColor",
      "favoriteFruit",
      "friends",
      "gender",
      "greeting",
      "guid",
      "id",
      "index",
      "isActive",
      "latitude",
      "longitude",
      "name",
      "phone",
      "picture",
      "registered",
      "tags"
      ],
      <output omitted for brevity>

      This causes major problems when a field is later added to the JSON document, e.g.:
      java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v2.json

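      Compare where newField is added in the example-v2.json document with where it appears in the output below: it is placed in its alphabetical position, between name and phone, which moves every column after it. This also affects the convert command.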

      create table tbl (
      about string,
      address string,
      age tinyint,
      balance string,
      company string,
      email string,
      eyeColor string,
      favoriteFruit string,
      friends array <struct <
      id: tinyint,
      name: string>>,
      gender string,
      greeting string,
      guid string,
      id binary,
      index tinyint,
      isActive boolean,
      latitude decimal(8,6),
      longitude decimal(8,6),
      name string,
      newField string,
      phone string,
      picture string,
      registered timestamp,
      tags array <string>
      )

      The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so that field order is preserved across changes to the JSON schema.
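
      A rough illustration of why this matters for schema evolution, assuming the new field is appended at the end of the v2 document (the attachment is not reproduced here, so that placement is an assumption). ColumnPositionDemo and positionOf are hypothetical names and are not part of orc-tools.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ColumnPositionDemo {
    // Inserts the field names in document order, then returns the position
    // of `column` in the key order produced by the given Map implementation.
    static int positionOf(Map<String, String> fields, List<String> documentOrder, String column) {
        for (String name : documentOrder) {
            fields.put(name, "string"); // placeholder type
        }
        return new ArrayList<>(fields.keySet()).indexOf(column);
    }

    public static void main(String[] args) {
        // Trimmed-down stand-ins for the v1 and v2 documents.
        List<String> v1 = Arrays.asList("name", "phone", "tags");
        List<String> v2 = Arrays.asList("name", "phone", "tags", "newField");

        // Sorted keys: newField lands before phone, so phone moves from 1 to 2.
        System.out.println("TreeMap v1 phone at "
                + positionOf(new TreeMap<String, String>(), v1, "phone"));
        System.out.println("TreeMap v2 phone at "
                + positionOf(new TreeMap<String, String>(), v2, "phone"));

        // Insertion order: phone stays at 1 and newField is last.
        System.out.println("LinkedHashMap v1 phone at "
                + positionOf(new LinkedHashMap<String, String>(), v1, "phone"));
        System.out.println("LinkedHashMap v2 phone at "
                + positionOf(new LinkedHashMap<String, String>(), v2, "phone"));
    }
}
```

      With insertion order, existing columns keep their positions when a field is appended. With sorted order, any new name that sorts before an existing one shifts that column.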

      A pull request with test cases is incoming.

      Attachments

        1. example-v2.json
          1 kB
          Shawn Hooton
        2. example-v1.json
          1 kB
          Shawn Hooton

          People

            Assignee: codingogre Shawn Hooton
            Reporter: codingogre Shawn Hooton