ORC-200

json-schema and convert commands should support schema evolution of json documents


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: Java
    • Labels: None

    Description

      Using the command (sample payloads attached):
      java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v1.json

      Produces the following output:
      create table tbl (
      about string,
      address string,
      age tinyint,
      balance string,
      company string,
      email string,
      eyeColor string,
      favoriteFruit string,
      friends array <struct <
      id: tinyint,
      name: string>>,
      gender string,
      greeting string,
      guid string,
      id binary,
      index tinyint,
      isActive boolean,
      latitude decimal(8,6),
      longitude decimal(8,6),
      name string,
      phone string,
      picture string,
      registered timestamp,
      tags array <string>
      )

      Notice that because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for the fields instance variable, the generated DDL is sorted alphabetically by field name rather than following the field order of the JSON document. This causes problems for the convert command as well.
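
The ordering difference can be reproduced outside ORC. The following minimal sketch is not taken from the ORC code base; the class name FieldOrderDemo and the three sample fields are made up for illustration. It inserts the same keys into a java.util.TreeMap and a java.util.LinkedHashMap and returns the iteration order of each:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; it only illustrates map iteration order.
class FieldOrderDemo {
    // Inserts three field names in "document order" and returns the
    // order in which the given map iterates over them.
    static List<String> iterationOrder(Map<String, String> fields) {
        fields.put("name", "string");
        fields.put("age", "tinyint");
        fields.put("email", "string");
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        // TreeMap sorts keys by natural String ordering.
        System.out.println(iterationOrder(new TreeMap<>()));       // [age, email, name]
        // LinkedHashMap keeps the order in which keys were inserted.
        System.out.println(iterationOrder(new LinkedHashMap<>())); // [name, age, email]
    }
}
```

Only the map implementation changes between the two calls, which is why the choice of map type alone is enough to reorder the generated columns.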

      java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc

      <output omitted for brevity>

      "schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
      "schema": [
      {
      "columnId": 0,
      "columnType": "STRUCT",
      "childColumnNames": [
      "about",
      "address",
      "age",
      "balance",
      "company",
      "email",
      "eyeColor",
      "favoriteFruit",
      "friends",
      "gender",
      "greeting",
      "guid",
      "id",
      "index",
      "isActive",
      "latitude",
      "longitude",
      "name",
      "phone",
      "picture",
      "registered",
      "tags"
      ],
      <output omitted for brevity>

      This causes major problems when a field is later added to the JSON document, e.g.:
      java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v2.json

      Note where the newField field is added in the example-v2.json document, then compare that with the output below, where it is sorted in between name and phone. This also affects the convert command.

      create table tbl (
      about string,
      address string,
      age tinyint,
      balance string,
      company string,
      email string,
      eyeColor string,
      favoriteFruit string,
      friends array <struct <
      id: tinyint,
      name: string>>,
      gender string,
      greeting string,
      guid string,
      id binary,
      index tinyint,
      isActive boolean,
      latitude decimal(8,6),
      longitude decimal(8,6),
      name string,
      newField string,
      phone string,
      picture string,
      registered timestamp,
      tags array <string>
      )

      The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so order is maintained across changes to the JSON schema.
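
As a rough illustration of the proposed change, the sketch below keeps fields in a java.util.LinkedHashMap and renders a column list in iteration order. It is not the actual StructType implementation: the class OrderedStructSketch, its addField and toDdl methods, and the sample fields are all hypothetical, and it assumes that fields from an older document are registered before those from a newer one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for a struct schema; not the real StructType class.
class OrderedStructSketch {
    // LinkedHashMap iterates in insertion order, so each column keeps
    // the position at which it was first seen.
    private final Map<String, String> fields = new LinkedHashMap<>();

    // Records a field; a name that is already present keeps its position and type.
    void addField(String name, String type) {
        fields.putIfAbsent(name, type);
    }

    // Renders a DDL-style column list in iteration order.
    String toDdl(String table) {
        StringJoiner cols =
            new StringJoiner(",\n  ", "create table " + table + " (\n  ", "\n)");
        fields.forEach((name, type) -> cols.add(name + " " + type));
        return cols.toString();
    }

    public static void main(String[] args) {
        OrderedStructSketch schema = new OrderedStructSketch();
        // Fields as they might appear in a first version of the document.
        schema.addField("name", "string");
        schema.addField("age", "tinyint");
        // A later version repeats an old field and adds a new one.
        schema.addField("name", "string");
        schema.addField("newField", "string");
        System.out.println(schema.toDdl("tbl"));
    }
}
```

In this sketch the existing columns keep their positions and newField is appended after them, instead of being sorted in between existing columns as with a TreeMap.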

      Pull request with test cases incoming.


          People

            codingogre Shawn Hooton