Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-33398

AnalysisException when loading a PipelineModel with Spark 3

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.0.1
    • 3.0.2, 3.1.0
    • MLlib
      • Databricks runtime 7.3 ML
      • Spark 3.0.1
      • Python 3.7

    Description

      I am upgrading my Spark version from 2.4.5 to 3.0.1 and I cannot load anymore the PipelineModel objects that use a "DecisionTreeClassifier" stage.

      In my code I load several PipelineModel, all the PipelineModel with stages ["CountVectorizer_[uid]", "LinearSVC_[uid]"] are loading fine whereas the models with stages
      ["CountVectorizer_[uid]","DecisionTreeClassifier_[uid]"] are throwing the following exception:

      AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];

      Here is the code I am using and the full stacktrace:

      from pyspark.ml.pipeline import PipelineModel
      PipelineModel.load("/path/to/model")
      
      AnalysisException                         Traceback (most recent call last)
      <command-1278858167154148> in <module>
      ----> 1 RalentModel = PipelineModel.load(MODELES_ATTRIBUTS + "RalentModel_DT")/databricks/spark/python/pyspark/ml/util.py in load(cls, path)
          368     def load(cls, path):
          369         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
      --> 370         return cls.read().load(path)
          371 
          372 /databricks/spark/python/pyspark/ml/pipeline.py in load(self, path)
          289         metadata = DefaultParamsReader.loadMetadata(path, self.sc)
          290         if 'language' not in metadata['paramMap'] or metadata['paramMap']['language'] != 'Python':
      --> 291             return JavaMLReader(self.cls).load(path)
          292         else:
          293             uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)/databricks/spark/python/pyspark/ml/util.py in load(self, path)
          318         if not isinstance(path, basestring):
          319             raise TypeError("path should be a basestring, got type %s" % type(path))
      --> 320         java_obj = self._jread.load(path)
          321         if not hasattr(self._clazz, "_from_java"):
          322             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
         1303         answer = self.gateway_client.send_command(command)
         1304         return_value = get_return_value(
      -> 1305             answer, self.gateway_client, self.target_id, self.name)
         1306 
         1307         for temp_arg in temp_args:/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
          131                 # Hide where the exception came from that shows a non-Pythonic
          132                 # JVM exception message.
      --> 133                 raise_from(converted)
          134             else:
          135                 raise/databricks/spark/python/pyspark/sql/utils.py in raise_from(e)
      AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
      

      These pipeline models where saved using Spark 2.4.3, I can load them fine using Spark 2.4.5.

      I tried to investigate further and load each stage separately. Loading the CountVectorizerModel with

      from pyspark.ml.feature import CountVectorizerModel
      CountVectorizerModel.read().load("/path/to/model/stages/0_CountVectorizer_efce893314a9")
      

      yields a CountVectorizerModel, but my code fails when trying to load the DecisionTreeClassificationModel:

      DecisionTreeClassificationModel.read().load("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0")
      AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
      

      And here is the content of the "data" of my Decision Tree Classifier:

      spark.read.parquet("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0/data").show()
      
      +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
      | id|prediction|            impurity|impurityStats|                gain|leftChild|rightChild|           split|
      +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
      |  0|       0.0|  0.3926234384295062| [90.0, 33.0]| 0.16011830963990054|        1|        16|[190, [0.5], -1]|
      |  1|       0.0|  0.2672722508516028| [90.0, 17.0]| 0.11434106988303855|        2|        15|[512, [0.5], -1]|
      |  2|       0.0|  0.1652892561983472|  [90.0, 9.0]| 0.06959547629404085|        3|        14|[583, [0.5], -1]|
      |  3|       0.0| 0.09972299168975082|  [90.0, 5.0]|0.026984966852376356|        4|        11|[480, [0.5], -1]|
      |  4|       0.0|0.043933846736523306|  [87.0, 2.0]|0.021717299239076976|        5|        10|[555, [1.5], -1]|
      |  5|       0.0|0.022469008264462766|  [87.0, 1.0]|0.011105371900826402|        6|         7|[833, [0.5], -1]|
      |  6|       0.0|                 0.0|  [86.0, 0.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      |  7|       0.0|                 0.5|   [1.0, 1.0]|                 0.5|        8|         9|  [0, [0.5], -1]|
      |  8|       0.0|                 0.0|   [1.0, 0.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      |  9|       1.0|                 0.0|   [0.0, 1.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      | 10|       1.0|                 0.0|   [0.0, 1.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      | 11|       0.0|                 0.5|   [3.0, 3.0]|                 0.5|       12|        13| [14, [1.5], -1]|
      | 12|       0.0|                 0.0|   [3.0, 0.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      | 13|       1.0|                 0.0|   [0.0, 3.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      | 14|       1.0|                 0.0|   [0.0, 4.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      | 15|       1.0|                 0.0|   [0.0, 8.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      | 16|       1.0|                 0.0|  [0.0, 16.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
      +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
      
      

      Attachments

        Issue Links

          Activity

            People

              podongfeng Ruifeng Zheng
              LoicH LoicH
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: