Spark / SPARK-21706

Support Custom PartitionSpec Provider for Kinesis Firehose or similar


Details

    Description

      Many people are using Kinesis Firehose to ingest data into an S3-based data lake. Kinesis Firehose produces a directory layout like this:

      s3://data-lake-bucket/my-prefix/2017/08/11/10/my-stream-2017-08-11-11-10-10
      s3://data-lake-bucket/my-prefix/2017/08/11/11/my-stream-2017-08-11-11-11-10
        .
        .
        .
      s3://data-lake-bucket/my-prefix/2017/08/12/00/my-stream-2017-08-12-00-01-01
      

      Spark (like Hive) does not support this kind of partitioning. It would therefore be great if one could configure a CustomPartitionDiscoverer or PartitionSpecProvider that supplies a custom partition mapping, so that a date range of files can be selected easily afterwards. Unfortunately, the partition discovery is deeply integrated into DataSource.
      Could this be encapsulated more cleanly so that the default behaviour can be intercepted?
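      Until such a hook exists, a workaround is to enumerate the Firehose hour prefixes for the desired range by hand and pass them to the reader as explicit paths. A minimal Scala sketch, assuming Spark 2.x and the bucket/prefix from the example above (object and app names are purely illustrative):

      import java.time.LocalDateTime
      import org.apache.spark.sql.SparkSession

      object FirehoseDateRangeLoad {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("firehose-range").getOrCreate()

          // Hourly timestamps covering the requested range (inclusive).
          val start = LocalDateTime.of(2017, 8, 11, 10, 0)
          val end   = LocalDateTime.of(2017, 8, 12, 0, 0)
          val hours = Iterator.iterate(start)(_.plusHours(1)).takeWhile(!_.isAfter(end)).toSeq

          // Build the yyyy/MM/dd/HH directories that Firehose writes for each hour.
          val paths = hours.map { t =>
            f"s3://data-lake-bucket/my-prefix/${t.getYear}%04d/${t.getMonthValue}%02d/${t.getDayOfMonth}%02d/${t.getHour}%02d/"
          }

          // Pass the explicit directory list instead of relying on partition discovery.
          val df = spark.read.json(paths: _*)
          df.show()
        }
      }

      This avoids listing the whole prefix, but every consumer has to repeat the mapping; a pluggable provider would move it into the data source itself.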

      Another partitioning scheme that I've seen a lot in this context is:

      s3://data-lake-bucket/prefix/2017-08-11/file.1.json
      s3://data-lake-bucket/prefix/2017-08-11/file.2.json
        .
        .
        .
      s3://data-lake-bucket/prefix/2017-08-12/file.1.json
      
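      For this layout a similar workaround is to read everything under the prefix and derive the date from the file path, e.g. via input_file_name(). Note that this gives no partition pruning at file-listing time, which is exactly what a pluggable PartitionSpecProvider could provide. A sketch under the same assumptions as above:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

      object DateFolderAsPartition {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("date-folder").getOrCreate()

          // The yyyy-MM-dd directory is not in key=value form, so partition discovery
          // ignores it; recover it from the path as a regular column instead.
          val df = spark.read
            .json("s3://data-lake-bucket/prefix/*/")
            .withColumn("dt", regexp_extract(input_file_name(), "/(\\d{4}-\\d{2}-\\d{2})/", 1))

          df.filter(col("dt").between("2017-08-11", "2017-08-12")).show()
        }
      }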


          People

            Assignee: Unassigned
            Reporter: Sebastian Herold
