[ORC-200] json-schema and convert commands should support schema evolution of json documents - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.5.0
Fix Version/s: None
Component/s: Java
Labels: None

Description

Using the command (sample payloads attached):
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v1.json

Produces the following output:
create table tbl (
about string,
address string,
age tinyint,
balance string,
company string,
email string,
eyeColor string,
favoriteFruit string,
friends array <struct <
id: tinyint,
name: string>>,
gender string,
greeting string,
guid string,
id binary,
index tinyint,
isActive boolean,
latitude decimal(8,6),
longitude decimal(8,6),
name string,
phone string,
picture string,
registered timestamp,
tags array <string>
)

Because org/apache/orc/tools/json/StructType.java stores its fields instance variable in a java.util.TreeMap, the generated DDL lists the columns alphabetically rather than in the order in which they appear in the JSON document. The convert command is affected in the same way.
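The ordering difference can be reproduced outside of ORC with plain java.util collections. The sketch below is illustrative only: the class FieldOrderDemo, the method fieldOrder, and the four-field subset are hypothetical and are not taken from StructType.java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; it does not reproduce the ORC tools code.
public class FieldOrderDemo {

    // Inserts a few field names in an assumed document order and
    // returns the order in which the map iterates over them.
    static List<String> fieldOrder(Map<String, String> fields) {
        fields.put("index", "tinyint");
        fields.put("guid", "string");
        fields.put("isActive", "boolean");
        fields.put("age", "tinyint");
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        // TreeMap iterates in sorted key order.
        System.out.println("TreeMap:       " + fieldOrder(new TreeMap<>()));
        // LinkedHashMap iterates in insertion order.
        System.out.println("LinkedHashMap: " + fieldOrder(new LinkedHashMap<>()));
    }
}
```

With the TreeMap the names come back as age, guid, index, isActive; with the LinkedHashMap they come back in the order they were inserted.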

java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc

"schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
"schema": [
{
"columnId": 0,
"columnType": "STRUCT",
"childColumnNames": [
"about",
"address",
"age",
"balance",
"company",
"email",
"eyeColor",
"favoriteFruit",
"friends",
"gender",
"greeting",
"guid",
"id",
"index",
"isActive",
"latitude",
"longitude",
"name",
"phone",
"picture",
"registered",
"tags"
],
<output omitted for brevity>

This causes major problems when a field is later added to the JSON document, for example:
java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v2.json

Compare where the newField field is added in the example-v2.json document with its position in the output below, where it is sorted between name and phone. The convert command is affected in the same way.

create table tbl (
about string,
address string,
age tinyint,
balance string,
company string,
email string,
eyeColor string,
favoriteFruit string,
friends array <struct <
id: tinyint,
name: string>>,
gender string,
greeting string,
guid string,
id binary,
index tinyint,
isActive boolean,
latitude decimal(8,6),
longitude decimal(8,6),
name string,
newField string,
phone string,
picture string,
registered timestamp,
tags array <string>
)

The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so order is maintained across changes to the JSON schema.
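A minimal sketch of why insertion order matters for evolution, assuming the new field is appended at the end of the v2 document (the attached payloads may place it elsewhere). The class SchemaEvolutionDemo, its helper methods, and the shortened field lists are hypothetical and only illustrate the map behaviour, not the ORC tools themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; it does not reproduce the ORC tools code.
public class SchemaEvolutionDemo {

    // Adds the field names to the map in document order and
    // returns the resulting column order.
    static List<String> columns(Map<String, String> fields, List<String> documentOrder) {
        for (String name : documentOrder) {
            fields.put(name, "string"); // the type is irrelevant for this illustration
        }
        return new ArrayList<>(fields.keySet());
    }

    // True when the older column list is a leading prefix of the newer one,
    // i.e. existing columns keep their positions and new ones are appended.
    static boolean isPrefix(List<String> older, List<String> newer) {
        return newer.size() >= older.size()
            && newer.subList(0, older.size()).equals(older);
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList("name", "phone", "picture");
        List<String> v2 = Arrays.asList("name", "phone", "picture", "newField");

        // Sorted map: newField lands between name and phone.
        System.out.println("TreeMap v2:       " + columns(new TreeMap<>(), v2));
        System.out.println("  v1 is prefix:   "
            + isPrefix(columns(new TreeMap<>(), v1), columns(new TreeMap<>(), v2)));

        // Insertion-ordered map: newField stays where the document put it.
        System.out.println("LinkedHashMap v2: " + columns(new LinkedHashMap<>(), v2));
        System.out.println("  v1 is prefix:   "
            + isPrefix(columns(new LinkedHashMap<>(), v1), columns(new LinkedHashMap<>(), v2)));
    }
}
```

Under that assumption the v1 column list remains a prefix of the v2 list only with the LinkedHashMap; with the TreeMap the existing columns phone and picture shift by one position.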

A pull request with test cases is incoming.

Attachments

- example-v2.json, 25/May/17 05:24, 1 kB, Shawn Hooton
- example-v1.json, 25/May/17 05:24, 1 kB, Shawn Hooton

People

Assignee: Shawn Hooton

Reporter: Shawn Hooton

Votes: 0

Watchers: 2

Dates

Created: 25/May/17 05:24

Updated: 25/May/17 15:44