Spark / SPARK-26696

Dataset encoder should be publicly accessible

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL

    Description

      As a platform, Spark should enable framework developers to accomplish outside the Spark codebase much of what can be accomplished inside it. One obstacle is a historical pattern of excessive data hiding in Spark, e.g., `expr` in `Column` not being accessible. This issue is an example of the same pattern in `Dataset`.

      Consider a transformation with the signature `def foo[A](ds: Dataset[A]): Dataset[A]` whose implementation requires the use of `toDF()`. Getting back to `Dataset[A]` requires calling `.as[A]`, which in turn requires an implicit `Encoder[A]`. A naive approach would change the function signature to `foo[A : Encoder]`, but this is poor API design that unnecessarily carries implicits from user code into framework code. We know an `Encoder[A]` exists because we have access to an instance of `Dataset[A]`... but its `encoder` is not accessible.
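
      The naive approach described above can be sketched as follows. This is only an illustration: the object name `EncoderWorkaround` and the use of `distinct()` as a stand-in for an arbitrary DataFrame-only step are assumptions, not part of the original report.

```scala
import org.apache.spark.sql.{Dataset, Encoder}

object EncoderWorkaround {
  // Hypothetical transformation. The round trip through toDF() drops the
  // element type, so .as[A] needs an implicit Encoder[A]. The context bound
  // `A : Encoder` pushes that requirement onto every caller.
  def foo[A : Encoder](ds: Dataset[A]): Dataset[A] =
    ds.toDF()      // Dataset[A] => untyped DataFrame
      .distinct()  // stand-in for any DataFrame-only operation (assumption)
      .as[A]       // back to Dataset[A]; resolves the implicit Encoder[A]
}
```

      Callers must now have an `Encoder[A]` in implicit scope (for example via `import spark.implicits._`), even though the `Dataset[A]` they pass in was already built with one.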

      The solution is simple: make `encoder` a `@transient val`, just as is the case with `queryExecution`.
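
      With `encoder` readable on the instance, the transformation can keep its original signature. The sketch below assumes a Spark version that includes this fix (3.0.0 per the Fix Version field); the object name `EncoderFromDataset` and the `distinct()` step are illustrative assumptions.

```scala
import org.apache.spark.sql.{Dataset, Encoder}

object EncoderFromDataset {
  // Same hypothetical transformation, but with no context bound:
  // the encoder is taken from the Dataset instance itself.
  def foo[A](ds: Dataset[A]): Dataset[A] = {
    // Assumes Dataset.encoder is publicly accessible (Spark 3.0.0+).
    implicit val enc: Encoder[A] = ds.encoder
    ds.toDF()      // Dataset[A] => untyped DataFrame
      .distinct()  // stand-in for any DataFrame-only operation (assumption)
      .as[A]       // uses the encoder recovered from ds
  }
}
```

      Framework code written this way needs nothing from the caller beyond the `Dataset[A]` itself.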

            People

              Assignee: Simeon Simeonov (simeons)
              Reporter: Simeon Simeonov (simeons)
              Votes: 0
              Watchers: 2
