Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-23304

Two SQL execution results with the same semantics are not the same

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 2.3.7
    • None
    • Hive, SQL
    • None
    • hadoop version 2.7.7

      hive version 2.3.7

      hive.execution.engine=mr

       

    Description

      The simplified question is like:

               sql_1: select * from (subquery)

               sql_2: create  temporary  table with subquery

                       and  then  select * from  temporary table

      the result of sql_1 and sql_2 is diffrent.

       

      The specific questions are as follows:

      execute sql

       

      SELECT AVG(a.rate), a.movieid, MAX(b.userid)
      FROM t_rating a
      	INNER JOIN (
      		SELECT a.movieid AS movieid, a.userid AS userid, a.rate
      		FROM t_rating a
      			JOIN (
      				SELECT COUNT(*) AS rate_times, a.userid AS userid
      				FROM t_rating a
      					JOIN (
      						SELECT *
      						FROM t_user
      						WHERE sex = 'F'
      					) b
      					ON a.userid = b.userid
      				GROUP BY a.userid
      				ORDER BY rate_times DESC
      				LIMIT 0, 1
      			) b
      			ON a.userid = b.userid
      		ORDER BY a.rate DESC
      		LIMIT 0, 10
      	) b
      	ON a.movieid = b.movieid
      GROUP BY a.movieid

      The result is below.But this is not the result I want 

      4.063136456211812	162	1150
      4.476190476190476	904	1150
      4.249370277078086	951	1150
      4.14167916041979	1230	1150
      3.6464646464646466	1966	1150
      4.163043478260869	2330	1150
      3.7039473684210527	3163	1150
      4.387453874538745	3307	1150
      4.047363717605005	3671	1150
      3.8265682656826567	3675	1150

      So I divided sql into two steps

      fisrt: create temporary table with subquery

       

      create temporary table tmp as
      SELECT a.movieid AS movieid, a.userid AS userid, a.rate
      		FROM t_rating a
      			JOIN (
      				SELECT COUNT(*) AS rate_times, a.userid AS userid
      				FROM t_rating a
      					JOIN (
      						SELECT *
      						FROM t_user
      						WHERE sex = 'F'
      					) b
      					ON a.userid = b.userid
      				GROUP BY a.userid
      				ORDER BY rate_times DESC
      				LIMIT 0, 1
      			) b
      			ON a.userid = b.userid
      		ORDER BY a.rate DESC
      		LIMIT 0, 10

      second: use temporary table replace subquery

       

      SELECT AVG(a.rate), a.movieid, MAX(b.userid)
      FROM t_rating a INNER JOIN tmp b
      	ON a.movieid = b.movieid
      GROUP BY a.movieid

      the result 

      4.52054794520548	745	1150
      4.4498902706656915	750	1150
      4.476190476190476	904	1150
      4.280748663101604	905	1150
      3.7314890154597236	1094	1150
      4.188888888888889	1236	1150
      4.21043771043771	1256	1150
      3.747422680412371	1279	1150
      4.0739348370927315	2064	1150
      4.125390450691656	2997	1150

      Why the results of two executions are different?

       

       

       

       

       

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            gumengchao gumengchao
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: