[SPARK-30334] Add metadata around semi-structured columns to Spark - ASF JIRA

Details

Type: New Feature
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.4
Fix Version/s: None
Component/s: SQL
Labels: None

Description

Semi-structured data is widely used in the data industry to report events in a variety of formats. Click events in product analytics can be stored as JSON. Some application logs take the form of delimited key=value text. Other data may be in XML.

The goal of this project is to be able to signal to Spark that such a column exists, which will then enable Spark to "auto-parse" the column on the fly. The proposal is to store this information as part of the column metadata, in the following fields:

- format: The format of the semi-structured column, e.g. json, xml, avro

- options: Options for parsing these columns
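
To make the two metadata fields concrete, here is a minimal sketch in plain Python (no Spark dependency). The keys "format" and "options" come from the proposal above; the ColumnSpec class and the is_semi_structured helper are hypothetical names used only for illustration and are not part of any Spark API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical stand-in for a column definition that carries metadata."""
    name: str
    data_type: str
    metadata: Dict[str, Any] = field(default_factory=dict)


# A string column flagged as holding JSON, with no extra parsing options.
raw_column = ColumnSpec(
    name="raw",
    data_type="string",
    metadata={"format": "json", "options": {}},
)


def is_semi_structured(column: ColumnSpec) -> bool:
    # Assumption: the presence of a "format" key is what marks the column.
    return "format" in column.metadata
```

In Spark itself, the natural home for such a dictionary would presumably be the per-field metadata that a schema field already carries, but how the proposal wires this in is not specified here.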

Then imagine having the following data:

+------------+-------+--------------------+
|     ts     | event |        raw         |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+

SELECT raw.field FROM data

will return "value".

Or, given the following data:

+------------+-------+----------------------+
|     ts     | event |         raw          |
+------------+-------+----------------------+
| 2019-10-12 | click | field1=v1|field2=v2  |
+------------+-------+----------------------+

SELECT raw.field1 FROM data

will return v1.
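
The two lookups above can be sketched in plain Python as a dispatch on the metadata's format field. This is only an illustration of the intended behavior, not the implementation in the linked pull request. The format name "kv" and the option names "pairDelimiter" and "kvSeparator" are assumptions, since the proposal does not name them.

```python
import json
from typing import Any, Dict


def parse_raw(raw: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Parse a raw string column value according to its column metadata."""
    fmt = metadata.get("format")
    options = metadata.get("options", {})
    if fmt == "json":
        return json.loads(raw)
    if fmt == "kv":  # hypothetical name for delimited key=value text
        pair_delim = options.get("pairDelimiter", "|")  # assumed option name
        kv_sep = options.get("kvSeparator", "=")        # assumed option name
        parsed: Dict[str, Any] = {}
        for item in raw.split(pair_delim):
            key, sep, value = item.partition(kv_sep)
            if sep:  # skip fragments that contain no separator
                parsed[key] = value
        return parsed
    raise ValueError(f"unsupported format: {fmt!r}")


def resolve_field(raw: str, metadata: Dict[str, Any], field_name: str) -> Any:
    """Mimic SELECT raw.<field_name>: parse on the fly, then look up one key."""
    return parse_raw(raw, metadata).get(field_name)


json_meta = {"format": "json", "options": {}}
kv_meta = {"format": "kv", "options": {"pairDelimiter": "|", "kvSeparator": "="}}

first = resolve_field('{"field":"value"}', json_meta, "field")
second = resolve_field("field1=v1|field2=v2", kv_meta, "field1")
```

Here first and second hold the values the two queries are described as returning. A missing key resolves to None in this sketch; how Spark would treat a missing field is not stated in the proposal.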

As a first step, we will introduce the function "as_json", which accomplishes this for JSON columns.

Issue Links

links to

GitHub Pull Request #26987

People

Assignee: Unassigned

Reporter: Burak Yavuz

Votes: 0

Watchers: 2

Dates

Created: 23/Dec/19 17:17

Updated: 25/Apr/24 19:20