Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version: 2.3.1
- Fix Version: None
- Component: None
- Environment: PySpark (Spark 2.3.1)
Description
Spark cannot resolve a column that actually exists in the input column list:

```
org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, flid.bank_id, flid.wr_id, flid.link_id];
```

Full traceback:

```
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     62     try:
---> 63         return f(*a, **kw)
     64     except py4j.protocol.Py4JJavaError as e:

/usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o35.sql.
: org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
'Project ['multiples.id, 'multiples.link_id]
```
Query:

```python
q = f"""
with flid as (
    select * from flow_log_by_id
)
select multiples.id, multiples.link_id
from (
    select fl.id, fl.link_id
    from (
        select id
        from {flow_log_by_id}
        group by id
        having count(*) > 1
    ) multiples
    join {flow_log_by_id} fl on fl.id = multiples.id
) multiples
join {level_link} ll
    on multiples.link_id = ll.link_id_old
    and ll.link_id_new in (select link_id from flid where id = multiples.id)
"""
flow_subset_test_result = spark.sql(q)
```
The `with flid` CTE is used because without it Spark does not find the `flow_log_by_id` table, which looks like a separate issue. In plain SQL the same query works without problems.