Spark / SPARK-25733

The method toLocalIterator() with dataframe doesn't work


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: Spark in standalone mode, with 48 cores available.

      spark-defaults.conf is as below:
      spark.pyshark.python /usr/bin/python3.6
      spark.driver.memory 4g
      spark.executor.memory 8g

      All other configurations are at their defaults.
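The first key above is written as spark.pyshark.python, while the property documented by Spark is spark.pyspark.python. Whether that spelling is in the real file or only in this report is not known. As a rough sketch, a check like the one below could flag keys that are near-misses of known property names. The names parse_spark_defaults, suspicious_keys and KNOWN_KEYS are hypothetical, the reference list is deliberately tiny, and the parser assumes whitespace-separated key/value lines as shown in this report.

```python
import difflib

# Assumed reference list: three real Spark property names; extend as needed.
KNOWN_KEYS = [
    "spark.pyspark.python",
    "spark.driver.memory",
    "spark.executor.memory",
]


def parse_spark_defaults(text):
    """Parse whitespace-separated 'key value' lines into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def suspicious_keys(conf, known=KNOWN_KEYS, cutoff=0.8):
    """Map each unknown key to the most similar known key, if any is close."""
    hints = {}
    for key in conf:
        if key in known:
            continue
        match = difflib.get_close_matches(key, known, n=1, cutoff=cutoff)
        if match:
            hints[key] = match[0]
    return hints


# The three lines exactly as written in this report.
SAMPLE = """\
spark.pyshark.python /usr/bin/python3.6
spark.driver.memory 4g
spark.executor.memory 8g
"""

if __name__ == "__main__":
    for bad, good in suspicious_keys(parse_spark_defaults(SAMPLE)).items():
        print("possible typo:", bad, "(did you mean", good + "?)")
```

This only compares spellings; it does not show that the key has any bearing on the failure described below.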

    Description

      The dataset I used is attached.

      First I loaded a DataFrame from local disk:

      df = spark.read.load('report_dataset')

      There are about 200 partitions stored in S3, and the largest partition is 28.37 MB.

      After the data was loaded, I ran "df.take(1)" to test the DataFrame, and the expected output was printed:

      [Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html', sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], next_word=575, line_num=12)]


      Then I tried to convert the DataFrame to a local iterator and print one row for testing, using the code below:

      for row in df.toLocalIterator():
          print(row)
          break

      But nothing was printed after that code ran.
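For reference, the same "look at the first row" intent can be written without a for/break loop. The sketch below is not a fix for the behaviour reported here, only a more explicit way to take the first n items from whatever toLocalIterator() returns. peek_rows and FakeFrame are hypothetical names; FakeFrame is a stand-in so the snippet runs without a Spark session, and with a real DataFrame only peek_rows would be needed.

```python
from itertools import islice


def peek_rows(df, n=1):
    """Return up to n rows from df.toLocalIterator() as a list."""
    return list(islice(df.toLocalIterator(), n))


class FakeFrame:
    """Stand-in for a DataFrame (assumption: only toLocalIterator is used)."""

    def __init__(self, rows):
        self._rows = rows

    def toLocalIterator(self):
        # Yield rows lazily, one at a time, like an iterator over partitions.
        return iter(self._rows)


if __name__ == "__main__":
    frame = FakeFrame([{"line_num": 12}, {"line_num": 13}])
    print(peek_rows(frame))
```

With a live session the call would be peek_rows(df), where df is the DataFrame loaded above.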

       

      Then I ran "df.take(1)" again, and the error below was reported:
      ERROR:root:Exception while sending command.
      Traceback (most recent call last):
        File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, in send_command
          raise Py4JNetworkError("Answer from Java side is empty")
      py4j.protocol.Py4JNetworkError: Answer from Java side is empty

      During handling of the above exception, another exception occurred:
      ERROR:root:Exception while sending command.
      Traceback (most recent call last):
        File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, in send_command
          raise Py4JNetworkError("Answer from Java side is empty")
      py4j.protocol.Py4JNetworkError: Answer from Java side is empty

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 985, in send_command
          response = connection.send_command(command)
        File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1164, in send_command
          "Error while receiving", e, proto.ERROR_ON_RECEIVE)
      py4j.protocol.Py4JNetworkError: Error while receiving
      ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37735)
      Traceback (most recent call last):
        File "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
          exec(code_obj, self.user_global_ns, self.user_ns)
        File "<ipython-input-7-3959105b378f>", line 1, in <module>
          df.take(1)
        File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 504, in take
          return self.limit(num).collect()
        File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 493, in limit
          jdf = self._jdf.limit(num)
        File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
          answer, self.gateway_client, self.target_id, self.name)
        File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in get_return_value
          format(target_id, ".", name))
      py4j.protocol.Py4JError: An error occurred while calling o29.limit

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1863, in showtraceback
          stb = value.render_traceback()
      AttributeError: 'Py4JError' object has no attribute 'render_traceback'

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 929, in _get_connection
          connection = self.deque.pop()
      IndexError: pop from an empty deque

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 1067, in start
          self.socket.connect((self.address, self.port))
      ConnectionRefusedError: [Errno 111] Connection refused
       
      ---------------------------------------------------------------------------
      Py4JError                                 Traceback (most recent call last)
      <ipython-input-7-3959105b378f> in <module>()
      ----> 1 df.take(1)

      /opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py in take(self, num)
          502         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
          503         """
      --> 504         return self.limit(num).collect()
          505
          506     @since(1.3)

      /opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py in limit(self, num)
          491         []
          492         """
      --> 493         jdf = self._jdf.limit(num)
          494         return DataFrame(jdf, self.sql_ctx)
          495

      /opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
         1255         answer = self.gateway_client.send_command(command)
         1256         return_value = get_return_value(
      -> 1257             answer, self.gateway_client, self.target_id, self.name)
         1258
         1259         for temp_arg in temp_args:

      /opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
           61     def deco(*a, **kw):
           62         try:
      ---> 63             return f(*a, **kw)
           64         except py4j.protocol.Py4JJavaError as e:
           65             s = e.java_exception.toString()

      /opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
          334                 raise Py4JError(
          335                     "An error occurred while calling {0}{1}{2}".
      --> 336                     format(target_id, ".", name))
          337             else:
          338                 type = answer[1]

      Py4JError: An error occurred while calling o29.limit
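The "Connection refused" errors against 127.0.0.1:37735 mean that nothing was accepting connections on that port when df.take(1) was retried. That would be consistent with the JVM side of the Py4J gateway having exited, though this is an assumption: the report includes no driver-side logs. A minimal way to check this from the Python side is a plain TCP probe. port_is_open is a hypothetical helper that uses only the standard library, and the port number is the one from the log above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing is listening or the host is unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port taken from the error message in this report; it changes per session.
    print("gateway reachable:", port_is_open("127.0.0.1", 37735))
```

If the probe reports the port as closed, restarting the Spark session is needed before any further DataFrame call can succeed.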
       
       

       

      Attachments

        1. report_dataset.zip.001
          59.00 MB
          Bihui Jin
        2. report_dataset.zip.002
          53.10 MB
          Bihui Jin

      People

        Assignee: Unassigned
        Reporter: Bihui Jin
        Votes: 0
        Watchers: 2