Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25687

A dataset can store a column as sequence of Vectors but not directly vectors

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Incomplete
    • 2.3.1
    • None
    • ML, SQL

    Description

      A dataset can store an array of vectors but not a vector. This is inconsistent.

      To reproduce:

      {     import org.apache.spark.sql.Row     import org.apache.spark.ml.linalg.\{Vectors, DenseVector, Vector}

          import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
          import org.apache.spark.sql.types._
          import spark.implicits._

          val rdd = sc.parallelize(Seq(Row(Seq(Vectors.dense(Array(1.0, 2.0)).toSparse))))
          val arrayOfVectorsDS = spark.createDataFrame(rowRDD= rdd, schema = new StructType(Array(StructField(name = "value", dataType = ArrayType(elementType = VectorType))))).as[Seq[Vector]]
      //    val vectorsDS = arrayOfVectorsDS.flatMap(a => a)
          .show
      }

       If the line before ".show" is uncommented this code will throw the well known error: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            forchard Francisco Orchard
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: