Spark / SPARK-44076

SPIP: Python Data Source API


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      This proposal introduces a simple Python API for data sources. It would let Python developers create data sources without learning Scala or dealing with the complexity of the current data source APIs, making Spark more accessible to the wider Python developer community. The proposed approach builds on the recently introduced Python user-defined table functions (SPARK-43797), with extensions to support data sources.

      SPIP: https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing
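
      To illustrate the kind of API the description refers to, here is a minimal sketch of a read-only data source. It assumes the class names exposed by `pyspark.sql.datasource` in Spark 4.0 (`DataSource`, `DataSourceReader`, `InputPartition`). The names `RangeDataSource`, `RangeReader`, `demo_range`, and the options `end` and `numPartitions` are hypothetical and exist only for this example. So that the sketch runs without a Spark installation, it falls back to small stand-in base classes when PySpark is unavailable and exercises the reader directly rather than through a session.

```python
try:
    # Real base classes, assuming a PySpark version that ships this module.
    from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
except ImportError:
    # Stand-ins (an assumption for this sketch) so it runs without PySpark.
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass

    class InputPartition:
        def __init__(self, value):
            self.value = value


class RangeReader(DataSourceReader):
    """Hypothetical reader that yields (id, squared) tuples."""

    def __init__(self, options):
        # Option values arrive as strings, so convert them explicitly.
        self.end = int(options.get("end", 10))
        self.num_partitions = int(options.get("numPartitions", 2))

    def partitions(self):
        # Split [0, end) into contiguous chunks, one per partition.
        step = max(1, -(-self.end // self.num_partitions))  # ceiling division
        return [
            InputPartition((start, min(start + step, self.end)))
            for start in range(0, self.end, step)
        ]

    def read(self, partition):
        # Called once per partition; yields one tuple per output row.
        start, stop = partition.value
        for i in range(start, stop):
            yield (i, i * i)


class RangeDataSource(DataSource):
    """Hypothetical data source with the short name 'demo_range'."""

    @classmethod
    def name(cls):
        return "demo_range"

    def schema(self):
        # Schema given as a DDL string.
        return "id int, squared int"

    def reader(self, schema):
        return RangeReader(self.options)


# Exercise the reader locally, without a Spark session.
source = RangeDataSource({"end": "5", "numPartitions": "2"})
reader = source.reader(source.schema())
rows = [row for part in reader.partitions() for row in reader.read(part)]
print(rows)
```

      With a running Spark 4.0 session, a source like this would typically be registered with `spark.dataSource.register(RangeDataSource)` and loaded with `spark.read.format("demo_range").option("end", 5).load()`, which corresponds to the registration and `DataFrameReader` sub-tasks listed on this issue.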

      Sub-Tasks

        1. Initial support for Python data source read API (Sub-task; Resolved; Allison Wang)
        2. Support registering Python data sources (Sub-task; Resolved; Allison Wang)
        3. Support loading Python data sources in DataFrameReader (Sub-task; Resolved; Allison Wang)
        4. Add InputPartition to DataSourceReader interface (Sub-task; Resolved; Allison Wang)
        5. Add Python data source write API (Sub-task; Resolved; Allison Wang)
        6. Make Python data source registration session level (Sub-task; Resolved; Allison Wang)
        7. Plan Python data source read using mapInArrow (Sub-task; Resolved; Allison Wang)
        8. Change saveMode to overwrite for DataSourceWriter constructor (Sub-task; Resolved; Allison Wang)
        9. Support spark.read.schema(...) for Python data source API (Sub-task; Resolved; Unassigned)
        10. Respect column names when Python data source read function outputs named Row objects (Sub-task; Resolved; Allison Wang)
        11. Initial support for Python data source write API (Sub-task; Resolved; Allison Wang)
        12. Support spark.read.load() with non-empty path for Python data source API (Sub-task; Open; Unassigned)
        13. Support creating table using a Python data source in SQL (Sub-task; Resolved; Hyukjin Kwon)
        14. Support `commit` and `abort` API for Python data source write (Sub-task; Resolved; Allison Wang)
        15. Support overwrite mode for Python data source write (Sub-task; Resolved; Allison Wang)
        16. Investigate runtime registration and feasibility of overwriting the datasource (Sub-task; Resolved; Unassigned)
        17. Statically register Python Data Source (Sub-task; Resolved; Hyukjin Kwon)
        18. Update `path` handling in Python data source (Sub-task; Resolved; Allison Wang)
        19. Allow non-deterministic Python UDFs in MapInPandas/MapInArrow (Sub-task; Resolved; Allison Wang)
        20. Support create table using DSv2 sources (Sub-task; Resolved; Allison Wang)
        21. Support CTAS using DSv2 sources (Sub-task; Resolved; Allison Wang)
        22. Support INSERT INTO/OVERWRITE using DSv2 sources (Sub-task; Resolved; Allison Wang)
        23. Add documentation for Python data source API (Sub-task; Resolved; Allison Wang)
        24. Refactor Python Data Source instance loading (Sub-task; Resolved; Hyukjin Kwon)
        25. Support PythonSQLMetrics.pythonMetrics (Sub-task; Resolved; Hyukjin Kwon)
        26. Add a new API in DSv2 DataWriter to write an iterator of records (Sub-task; Resolved; Allison Wang)
        27. Block Python data source registration with name conflicts (Sub-task; Resolved; Allison Wang)
        28. Improve error messages for invalid save mode (Sub-task; Resolved; Allison Wang)
        29. Check Python executable when looking up available Data Sources (Sub-task; Resolved; Hyukjin Kwon)
        30. Improve Python data source error classes and messages (Sub-task; Resolved; Allison Wang)
        31. Python data source options should be a case insensitive dictionary (Sub-task; Resolved; Allison Wang)
        32. Improve error messages for unsupported data source save mode (Sub-task; Resolved; Allison Wang)
        33. Log full exception when failed to lookup Python Data Sources (Sub-task; Resolved; Hyukjin Kwon)
        34. Disallow re-registration of statically registered data sources (Sub-task; Open; Unassigned)
        35. Improve error messages for DATA_SOURCE_NOT_FOUND error (Sub-task; Resolved; Allison Wang)
        36. Make DataSourceManager isolated and self clone-able (Sub-task; Resolved; Hyukjin Kwon)
        37. Refactor Python Data Source to align with other built-in Data Sources (Sub-task; Resolved; Hyukjin Kwon)
        38. Skip test_datasource if PyArrow is not installed (Sub-task; Resolved; Hyukjin Kwon)
        39. Skip V2 table lookup when a table is in V1 table cache (Sub-task; Resolved; Allison Wang)
        40. Make daemon mode configurable when creating Python workers (Sub-task; Resolved; Allison Wang)
        41. Support Python data source API with Spark Connect (Sub-task; Resolved; Allison Wang)
        42. Fix docstring links and type hints in Python Data Source (Sub-task; Resolved; Hyukjin Kwon)
        43. Document Python Data Source API in API reference page (Sub-task; Resolved; Hyukjin Kwon)
        44. Remove the private[sql] modifier for Python data sources (Sub-task; Resolved; Allison Wang)

          People

            Unassigned
            Allison Wang (allisonwang-db)
            Hyukjin Kwon
