Spark / SPARK-26911

Spark does not see a column in the table


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: PySpark (Spark 2.3.1)

    Description

      Spark cannot resolve a column that actually exists in the list of input columns:

      org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, flid.bank_id, flid.wr_id, flid.link_id]; 

       

       

      ---------------------------------------------------------------------------
      Py4JJavaError                             Traceback (most recent call last)
      /usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
           62         try:
      ---> 63             return f(*a, **kw)
           64         except py4j.protocol.Py4JJavaError as e:
      
      /usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
          327                     "An error occurred while calling {0}{1}{2}.\n".
      --> 328                     format(target_id, ".", name), value)
          329             else:
      
      Py4JJavaError: An error occurred while calling o35.sql.
      : org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
      'Project ['multiples.id, 'multiples.link_id]
      
      

       

      Query:

      q = f"""
      with flid as (
      select * from flow_log_by_id
      )
      select multiples.id, multiples.link_id
      from (select fl.id, fl.link_id
      from (select id from {flow_log_by_id} group by id having count(*) > 1) multiples
      join {flow_log_by_id} fl on fl.id = multiples.id) multiples
      join {level_link} ll
      on multiples.link_id = ll.link_id_old and ll.link_id_new in (select link_id from flid where id = multiples.id)
      """
      flow_subset_test_result = spark.sql(q)
      

       The `with flid` CTE is used because without it Spark does not find the `flow_log_by_id` table, which looks like a separate issue. In plain SQL the same query works without problems.
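      Below is a minimal, self-contained sketch (not part of the original report): the sample rows are invented, and only the column names that appear in the query and the error message are reused. The second query is an assumed workaround that moves the correlated IN predicate from the join's ON clause into a WHERE clause, on the guess that the analyzer only resolves the outer reference `multiples.id` there; it should be equivalent for an inner join, but it has not been verified against the real flow_log_by_id / level_link schemas. The duplicated alias `multiples` from the original query is renamed to `dup` in the inner subquery for readability.

      # Sketch only: tiny stand-in tables with invented rows; the column names come
      # from the reported query and the error message above.
      from pyspark.sql import SparkSession
      from pyspark.sql.utils import AnalysisException

      spark = SparkSession.builder.appName("SPARK-26911-sketch").getOrCreate()

      spark.createDataFrame([(1, 10), (1, 11), (2, 20)], ["id", "link_id"]) \
          .createOrReplaceTempView("flow_log_by_id")
      spark.createDataFrame([(10, 11)], ["link_id_old", "link_id_new"]) \
          .createOrReplaceTempView("level_link")

      # Same shape as the reported query: correlated IN subquery inside the join's
      # ON clause. On 2.3.1 this is expected to raise the AnalysisException above.
      q_on = """
      with flid as (select * from flow_log_by_id)
      select multiples.id, multiples.link_id
      from (select fl.id, fl.link_id
            from (select id from flow_log_by_id group by id having count(*) > 1) dup
            join flow_log_by_id fl on fl.id = dup.id) multiples
      join level_link ll
        on multiples.link_id = ll.link_id_old
       and ll.link_id_new in (select link_id from flid where id = multiples.id)
      """

      # Assumed workaround: identical predicate, but placed in WHERE instead of ON.
      q_where = """
      with flid as (select * from flow_log_by_id)
      select multiples.id, multiples.link_id
      from (select fl.id, fl.link_id
            from (select id from flow_log_by_id group by id having count(*) > 1) dup
            join flow_log_by_id fl on fl.id = dup.id) multiples
      join level_link ll
        on multiples.link_id = ll.link_id_old
      where ll.link_id_new in (select link_id from flid where id = multiples.id)
      """

      try:
          spark.sql(q_on).show()
      except AnalysisException as e:
          print("ON-clause form failed to analyze:", e)

      spark.sql(q_where).show()  # with the sample rows this should return (1, 10)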

          People

            Assignee: Unassigned
            Reporter: Vitaly Larchenkov (Sonique)
