Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-45599

Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

    XMLWordPrintableJSON

Details

    Description

      I think this actually impacts all versions that have ever supported percentile and it may impact other things because the bug is in OpenHashMap.

       

      I am really surprised that we caught this bug because everything has to hit just wrong to make it happen. in python/pyspark if you run

       

      from math import *
      from pyspark.sql.types import *
      
      data = [(1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,), (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), (-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), (-5.046677974902737e+132,), (-5.490780063080251e-09,), (1.703824427218836e-55,), (-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), (5.120795466142678e-215,), (-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), (3.4543959813437507e-304,), (-7.590734560275502e-63,), (9.376528689861087e+117,), (-2.1696969883753554e-292,), (7.227411393136537e+206,), (-2.428999624265911e-293,), (5.741383583382542e-14,), (-1.4882040107841963e+286,), (2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), (-1.1041009815645263e+203,), (1.8461539468514548e-225,), (-5.620339412794757e-251,), (3.5103766991437114e-60,), (2.4925669515657655e+165,), (3.217759099462207e+108,), (-8.796717685143486e+203,), (2.037360925124577e+292,), (-6.542279108216022e+206,), (-7.951172614280046e-74,), (6.226527569272003e+152,), (-5.673977270111637e-84,), (-1.0186016078084965e-281,), (1.7976931348623157e+308,), (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), (1.7976931348623157e+308,), (4.3214483342777574e-117,), (-7.973642629411105e-89,), (-1.1028137694801181e-297,), (2.9000325280299273e-39,), (-1.077534929323113e-264,), (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), (-1.831402251805194e+65,), (-2.664533698035492e+203,), (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), (-9.607772864590422e+217,), (3.437191836077251e+209,), (1.9846569552093057e-137,), (-3.010452936419635e-233,), (1.4309793775440402e-87,), (-2.9383643865423363e-103,), (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), (2.187766760184779e+306,), (7.679268835670585e+223,), (6.3131466321042515e+153,), (1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,), (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), (-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), (-5.046677974902737e+132,), (-5.490780063080251e-09,), (1.703824427218836e-55,), (-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), (5.120795466142678e-215,), (-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), (3.4543959813437507e-304,), (-7.590734560275502e-63,), (9.376528689861087e+117,), (-2.1696969883753554e-292,), (7.227411393136537e+206,), (-2.428999624265911e-293,), (5.741383583382542e-14,), (-1.4882040107841963e+286,), (2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), (-1.1041009815645263e+203,), (1.8461539468514548e-225,), (-5.620339412794757e-251,), (3.5103766991437114e-60,), (2.4925669515657655e+165,), (3.217759099462207e+108,), (-8.796717685143486e+203,), (2.037360925124577e+292,), (-6.542279108216022e+206,), (-7.951172614280046e-74,), (6.226527569272003e+152,), (-5.673977270111637e-84,), (-1.0186016078084965e-281,), (1.7976931348623157e+308,), (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), (1.7976931348623157e+308,), (4.3214483342777574e-117,), (-7.973642629411105e-89,), (-1.1028137694801181e-297,), (2.9000325280299273e-39,), (-1.077534929323113e-264,), (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), (-1.831402251805194e+65,), (-2.664533698035492e+203,), (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), (-9.607772864590422e+217,), (3.437191836077251e+209,), (1.9846569552093057e-137,), (-3.010452936419635e-233,), (1.4309793775440402e-87,), (-2.9383643865423363e-103,), (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), (2.187766760184779e+306,), (7.679268835670585e+223,), (6.3131466321042515e+153,), (1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,), (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), (-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), (-5.046677974902737e+132,), (-5.490780063080251e-09,), (1.703824427218836e-55,), (-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), (5.120795466142678e-215,), (-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), (3.4543959813437507e-304,), (-7.590734560275502e-63,), (9.376528689861087e+117,), (-2.1696969883753554e-292,), (7.227411393136537e+206,), (-2.428999624265911e-293,), (5.741383583382542e-14,), (-1.4882040107841963e+286,), (2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), (-1.1041009815645263e+203,), (1.8461539468514548e-225,), (-5.620339412794757e-251,), (3.5103766991437114e-60,), (2.4925669515657655e+165,), (3.217759099462207e+108,), (-8.796717685143486e+203,), (2.037360925124577e+292,), (-6.542279108216022e+206,), (-7.951172614280046e-74,), (6.226527569272003e+152,), (-5.673977270111637e-84,), (-1.0186016078084965e-281,), (1.7976931348623157e+308,), (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), (1.7976931348623157e+308,), (4.3214483342777574e-117,), (-7.973642629411105e-89,), (-1.1028137694801181e-297,), (2.9000325280299273e-39,), (-1.077534929323113e-264,), (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), (-1.831402251805194e+65,), (-2.664533698035492e+203,), (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), (-9.607772864590422e+217,), (3.437191836077251e+209,), (1.9846569552093057e-137,), (-3.010452936419635e-233,), (1.4309793775440402e-87,), (-2.9383643865423363e-103,), (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), (2.187766760184779e+306,), (7.679268835670585e+223,), (6.3131466321042515e+153,), (1.779652973678931e+173,), (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), (-3.085825028509117e+74,), (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), (-5.682293414619055e+46,), (-4.585039307326895e+166,), (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), (-5.046677974902737e+132,), (-5.490780063080251e-09,), (1.703824427218836e-55,), (-1.1961155424160076e+102,), (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), (5.120795466142678e-215,), (-9.01991342808203e+282,), (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), (3.4543959813437507e-304,), (-7.590734560275502e-63,), (9.376528689861087e+117,), (-2.1696969883753554e-292,), (7.227411393136537e+206,), (-2.428999624265911e-293,), (5.741383583382542e-14,), (-1.4882040107841963e+286,), (2.1973064836362255e-159,), (0.028096279323357867,), (8.475809563703283e-64,), (3.002803065141241e-139,), (-1.1041009815645263e+203,), (1.8461539468514548e-225,), (-5.620339412794757e-251,), (3.5103766991437114e-60,), (2.4925669515657655e+165,), (3.217759099462207e+108,), (-8.796717685143486e+203,), (2.037360925124577e+292,), (-6.542279108216022e+206,), (-7.951172614280046e-74,), (6.226527569272003e+152,), (-5.673977270111637e-84,), (-1.0186016078084965e-281,), (1.7976931348623157e+308,), (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), (1.7976931348623157e+308,), (4.3214483342777574e-117,)]
      
      df = spark.createDataFrame(SparkContext.getOrCreate().parallelize(data, numSlices=4), StructType([StructField('val',DoubleType(),True)]))
      
      df.selectExpr('percentile(val, 0.1)').show(truncate=False)

      You will get back a result of -5.924228780007003E136 but the correct answer is -4.739483957565084E136 which can be verified by changing the number of slices used to import the data into spark.

       

      What is happening is that we are getting super unlucky. In the 4th input partition the data is read in from 7.849390806334983E226 to the end. This works fine and we get an OpenHashMap with an entry for both 0.0 and -0.0

       

       OpenHashMap((1.0075153091760986E-236,1), (0.0,1), (-2.646144697462316E-35,1), (-7.951172614280046E-74,1), (-3.468683249247593E-196,1), (5.741383583382542E-14,1), (-2.664533698035492E203,1), (7.227411393136537E206,1), (-3.588518231990927E12,1), (1.9042708096454302E195,1), (-1.9569489404314425E128,1), (7.849390806334983E226,1), (2.187766760184779E306,1), (2.4925669515657655E165,1), (-5.620339412794757E-251,1), (-0.0,1), (-1.1041009815645263E203,1), (-2.3016388448634844E-155,1), (2.1973064836362255E-159,1), (-1.831402251805194E65,1), (1.4403274475565667E41,1), (-3.085825028509117E74,1), (-6.542279108216022E206,1), (-9.871721037428167E119,1), (8.475809563703283E-64,1), (-5.673977270111637E-84,1), (5.120795466142678E-215,1), (-5.046677974902737E132,1), (-4.5154178008513483E-122,1), (1.9846569552093057E-137,1), (-3.3885098786542755E-128,1), (7.679268835670585E223,1), (4.920675036053339E9,1), (-1.0,1), (-4.585039307326895E166,1), (-9.01991342808203E282,1), (5.4470705929955455E-86,1), (9.247723870123388E-295,1), (-1.8891559842111865E63,1), (-4.696878567317712E-162,1), (-1.4882040107841963E286,1), (-5.936844510098297E-82,1), (6.226527569272003E152,1), (-1.1961155424160076E102,1), (-1.6663254121185628E-256,1), (4.4501477170144023E-308,1), (-9.607772864590422E217,1), (-3.010452936419635E-233,1), (4.051866849943636E-254,1), (1.4309793775440402E-87,1), (2.5212410617263588E-282,1), (3.4543959813437507E-304,1), (0.028096279323357867,1), (-7.590734560275502E-63,1), (5.211702553315461E-259,1), (-1.0186016078084965E-281,1), (3.437191836077251E209,1), (NaN,1), (NaN,1), (NaN,1), (8.391630779050713E-135,1), (-5.490780063080251E-9,1), (-2.9383643865423363E-103,1), (2.0738138203216883E201,1), (1.8461539468514548E-225,1), (1.822129180806602E-245,1), (NaN,1), (4.205809391029644E137,1), (2.037360925124577E292,1), (-2.428999624265911E-293,1), (3.002803065141241E-139,1), (6.3131466321042515E153,1), (3.5103766991437114E-60,1), (3.217759099462207E108,1), (-8.796717685143486E203,1), (-2.2385155698231885E285,1), (2.176024662699802E-210,1), (1.703824427218836E-55,1), (9.376528689861087E117,1), (1.7976931348623157E308,2), (NaN,1), (5.891823952773268E98,1), (-2.1696969883753554E-292,1), (4.3214483342777574E-117,1), (Infinity,2), (1.779652973678931E173,1), (-5.234708055733116E12,1), (-5.682293414619055E46,1))
      

      But when we go to deserialize the map after the shuffle we get a different result out with the entry for -0.0 gone. That is because the deserialize code uses update that will overwrite the count if the keys are the same. But -0.0 and 0.0 are not the same. Unless they happen to hash to the same position when they are being added in. In that case they end up being equal to each other and the count for 0.0 is replaced with the count for -0.0 and we lose one row in the data.

       

      Because the keys are stored in OpenHashSet that is where the bug actually is. https://github.com/apache/spark/blob/7e82e1bc43e0297c3036d802b3a151d2b93db2f6/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala#L139-L159

       

      I see a few ways to fix this.

      1. Update OpenHashSet/OpenHashMap to do the right thing for floats and doubles around -0.0 and 0.0
      2. normalize nans and zeros before doing percentiles
      3. reinterpret the bits for float/double as an int/long before putting them into the map and do the reverse when we pull them out.  That would also have the advantage of making NaN == NaN which would reduce the size of the map in those cases for percentile.
      4. Update the deserialize code for percentile to not call update, but instead to call counts.changeValue(key, count, _ + count)

      I am not sure if something similar can happen in other places, but I know for hash aggregate/etc we normalize the floating point values because of things like this.

      Attachments

        Activity

          People

            nchammas Nicholas Chammas
            revans2 Robert Joseph Evans
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: