Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-45583

Spark SQL returning incorrect values for full outer join on keys with the same name.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.1
    • 3.5.0
    • SQL
    • None

    Description

      The following query gives the wrong results.

       

      WITH people as (
        SELECT * FROM (VALUES 
          (1, 'Peter'), 
          (2, 'Homer'), 
          (3, 'Ned'),
          (3, 'Jenny')
        ) AS Idiots(id, FirstName)
      ), location as (
        SELECT * FROM (VALUES
          (1, 'sample0'),
          (1, 'sample1'),
          (2, 'sample2')  
        ) as Locations(id, address)
      )SELECT
        *
      FROM
        people
      FULL OUTER JOIN
        location
      ON
        people.id = location.id

      We find the following table:

      id: integer FirstName: string id: integer address: string
      2 Homer 2 sample2
      null Ned null null
      null Jenny null null
      1 Peter 1 sample0
      1 Peter 1 sample1

      But clearly the first `id` column is wrong, the nulls should be 3.

      If we rename the id column in (only) the person table to pid we get the correct results:

      pid: integer FirstName: string id: integer address: string
      2 Homer 2 sample2
      3 Ned null null
      3 Jenny null null
      1 Peter 1 sample0
      1 Peter 1 sample1

      Attachments

        Activity

          People

            Unassigned Unassigned
            hcampbell Huw
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: