Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25687

A dataset can store a column as sequence of Vectors but not directly vectors

Rank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Incomplete
    • 2.3.1
    • None
    • ML, SQL

    Description

      A dataset can store an array of vectors but not a vector. This is inconsistent.

      To reproduce:

      {     import org.apache.spark.sql.Row     import org.apache.spark.ml.linalg.\{Vectors, DenseVector, Vector}

          import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
          import org.apache.spark.sql.types._
          import spark.implicits._

          val rdd = sc.parallelize(Seq(Row(Seq(Vectors.dense(Array(1.0, 2.0)).toSparse))))
          val arrayOfVectorsDS = spark.createDataFrame(rowRDD= rdd, schema = new StructType(Array(StructField(name = "value", dataType = ArrayType(elementType = VectorType))))).as[Seq[Vector]]
      //    val vectorsDS = arrayOfVectorsDS.flatMap(a => a)
          .show
      }

       If the line before ".show" is uncommented this code will throw the well known error: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.

       

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            forchard Francisco Orchard
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment