Spark / SPARK-31299

pyspark.ml.clustering IllegalArgumentException with DataFrame created from Rows

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Labels: None

    Description

      I hope this is the right place and way to report a bug in (at least) the PySpark API:

      BisectingKMeans is used in the following example only for illustration; the error occurs with all clustering algorithms:

      from pyspark.sql import Row
      from pyspark.mllib.linalg import DenseVector
      from pyspark.ml.clustering import BisectingKMeans
      
      data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
       Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
       Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
       Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
       Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
      
      kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
      model = kmeans.fit(data)
      

      The .fit() call in the last line fails with the following error:

      Py4JJavaError: An error occurred while calling o51.fit.
      : java.lang.IllegalArgumentException: requirement failed: Column test_features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
      

      As can be seen, the type reported for the column is identical to the first entry in the list of allowed types, yet the call still fails the type check.

      See my StackOverflow question for more context: https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml

          People

            Assignee: Unassigned
            Reporter: Lukas Thaler