Details
- New Feature
- Status: In Progress
- Major
- Resolution: Unresolved
- 2.4.4
- None
- None
Description
Semi-structured data is widely used in the data industry for reporting events, and it comes in many formats. Click events in product analytics can be stored as JSON. Some application logs take the form of delimited key=value text. Other data may be in XML.
The goal of this project is to let users signal to Spark that a column holds semi-structured data, so that Spark can "auto-parse" such columns on the fly. The proposal is to store this information as part of the column metadata, in the following fields:
- format: The format of the semi-structured column, e.g. json, xml, avro
- options: Options for parsing these columns
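As a rough illustration of the two fields above, the following Python sketch models the metadata as a plain dictionary. The class name `SemiStructuredMeta`, the `KNOWN_FORMATS` set, and the sample option `allowComments` are assumptions made for this example only; the proposal does not specify how the metadata is encoded.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical set of accepted format names; the proposal only lists
# json, xml, and avro as examples.
KNOWN_FORMATS = {"json", "xml", "avro"}


@dataclass(frozen=True)
class SemiStructuredMeta:
    """Illustrative holder for the proposed 'format' and 'options' fields."""
    format: str
    options: Dict[str, str] = field(default_factory=dict)

    def to_metadata(self) -> dict:
        # Reject formats this sketch does not know about.
        if self.format not in KNOWN_FORMATS:
            raise ValueError(f"unsupported format: {self.format}")
        # Return a plain dict, the shape a column metadata map might take.
        return {"format": self.format, "options": dict(self.options)}


# Metadata that could be attached to a column such as 'raw'.
raw_meta = SemiStructuredMeta("json", {"allowComments": "true"}).to_metadata()
print(raw_meta)
```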
Then imagine having the following data:
+------------+-------+--------------------+
| ts         | event | raw                |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+
SELECT raw.field FROM data
will return "value".
Similarly, given the following data:
+------------+-------+----------------------+
| ts         | event | raw                  |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+
SELECT raw.field1 FROM data
will return v1.
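The two lookups above can be sketched in plain Python to show the intended behaviour: the column metadata selects a parser, and the requested field is read from the parsed value. This is not Spark code. The format name `kv`, the option names `pairDelimiter` and `keyValueDelimiter`, and the helper `select_field` are hypothetical and serve only to illustrate the dispatch.

```python
import json
from typing import Any, Callable, Dict


def parse_json(raw: str, options: Dict[str, str]) -> Dict[str, Any]:
    # Assumes the column holds a top-level JSON object.
    return json.loads(raw)


def parse_key_value(raw: str, options: Dict[str, str]) -> Dict[str, Any]:
    # Delimiters come from the (hypothetical) options, with defaults
    # matching the example row: field1=v1|field2=v2
    pair_sep = options.get("pairDelimiter", "|")
    kv_sep = options.get("keyValueDelimiter", "=")
    result: Dict[str, Any] = {}
    for pair in raw.split(pair_sep):
        if not pair:
            continue  # skip empty segments, e.g. from a trailing delimiter
        key, _, value = pair.partition(kv_sep)
        result[key] = value
    return result


# Map each format name to its parser; "kv" is an assumed name.
PARSERS: Dict[str, Callable[[str, Dict[str, str]], Dict[str, Any]]] = {
    "json": parse_json,
    "kv": parse_key_value,
}


def select_field(raw: str, meta: Dict[str, Any], name: str) -> Any:
    """Mimic SELECT raw.<name>: parse according to the metadata, then look up."""
    parser = PARSERS[meta["format"]]
    return parser(raw, meta.get("options", {})).get(name)


json_meta = {"format": "json", "options": {}}
kv_meta = {"format": "kv",
           "options": {"pairDelimiter": "|", "keyValueDelimiter": "="}}

print(select_field('{"field":"value"}', json_meta, "field"))      # value
print(select_field("field1=v1|field2=v2", kv_meta, "field1"))     # v1
```

A missing field yields `None` in this sketch; how Spark would treat a missing field or a malformed value is not stated in the proposal.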
As a first step, we will introduce the function "as_json", which accomplishes this for JSON columns.