Spark / SPARK-21706

Support Custom PartitionSpec Provider for Kinesis Firehose or similar


Details

    Description

      Many people are using Kinesis Firehose to ingest data into an S3-based data lake. Kinesis Firehose produces a directory layout like this:

      s3://data-lake-bucket/my-prefix/2017/08/11/10/my-stream-2017-08-11-11-10-10
      s3://data-lake-bucket/my-prefix/2017/08/11/11/my-stream-2017-08-11-11-11-10
        .
        .
        .
      s3://data-lake-bucket/my-prefix/2017/08/12/00/my-stream-2017-08-12-00-01-01
      

      Spark (like Hive) does not support this kind of partitioning. It would therefore be great if one could configure a CustomPartitionDiscoverer or PartitionSpecProvider that supplies a custom partition mapping, so that a date range of files can be selected easily afterwards. Unfortunately, the partition discovery is deeply integrated into DataSource.
      Could this be encapsulated more cleanly so that the default behaviour can be intercepted?
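      Until such a hook exists, a workaround is to enumerate the Firehose hour prefixes for the desired range by hand and pass them to the reader as explicit paths. A minimal Scala sketch, assuming Spark 2.x and the bucket/prefix from the example above (object and app names are purely illustrative):

      import java.time.LocalDateTime
      import org.apache.spark.sql.SparkSession

      object FirehoseDateRangeLoad {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("firehose-range").getOrCreate()

          // Hourly timestamps covering the requested range (inclusive).
          val start = LocalDateTime.of(2017, 8, 11, 10, 0)
          val end   = LocalDateTime.of(2017, 8, 12, 0, 0)
          val hours = Iterator.iterate(start)(_.plusHours(1)).takeWhile(!_.isAfter(end)).toSeq

          // Build the yyyy/MM/dd/HH directories that Firehose writes for each hour.
          val paths = hours.map { t =>
            f"s3://data-lake-bucket/my-prefix/${t.getYear}%04d/${t.getMonthValue}%02d/${t.getDayOfMonth}%02d/${t.getHour}%02d/"
          }

          // Pass the explicit directory list instead of relying on partition discovery.
          val df = spark.read.json(paths: _*)
          df.show()
        }
      }

      This avoids listing the whole prefix, but every consumer has to repeat the mapping; a pluggable provider would move it into the data source itself.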

      Another partitioning scheme that I've seen a lot in this context is:

      s3://data-lake-bucket/prefix/2017-08-11/file.1.json
      s3://data-lake-bucket/prefix/2017-08-11/file.2.json
        .
        .
        .
      s3://data-lake-bucket/prefix/2017-08-12/file.1.json
      
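      For this layout a similar workaround is to read everything under the prefix and derive the date from the file path, e.g. via input_file_name(). Note that this gives no partition pruning at file-listing time, which is exactly what a pluggable PartitionSpecProvider could provide. A sketch under the same assumptions as above:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

      object DateFolderAsPartition {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("date-folder").getOrCreate()

          // The yyyy-MM-dd directory is not in key=value form, so partition discovery
          // ignores it; recover it from the path as a regular column instead.
          val df = spark.read
            .json("s3://data-lake-bucket/prefix/*/")
            .withColumn("dt", regexp_extract(input_file_name(), "/(\\d{4}-\\d{2}-\\d{2})/", 1))

          df.filter(col("dt").between("2017-08-11", "2017-08-12")).show()
        }
      }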


          People

            Assignee: Unassigned
            Reporter: Sebastian Herold
