Beam / BEAM-10824

Hash in stats.ApproximateUniqueCombineFn is non-deterministic

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P1
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Missing
    • Component/s: sdk-py-core

    Description

      Python's built-in hash() function is not deterministic across processes: for str and bytes values it is salted with a per-process random seed unless PYTHONHASHSEED is fixed. As a result, different workers map identical values to different hashes. In a distributed processing model, this leads to overestimating the number of unique values by several orders of magnitude (roughly 1000x in my experience).

      https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218
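
A minimal sketch of the effect described above, not the Beam code itself. It assumes each worker is a separate Python process with its own hash seed, simulated here by launching fresh interpreters with different PYTHONHASHSEED values. The names worker_builtin_hashes and stable_hash are hypothetical, and SHA-256 is only a stand-in for any process-independent hash; it is not necessarily what the fix uses.

```python
import hashlib
import os
import subprocess
import sys


def worker_builtin_hashes(values, seed):
    """Hash `values` with built-in hash() in a fresh interpreter.

    Hypothetical helper: the child process stands in for one worker, and
    `seed` is passed through PYTHONHASHSEED to give it its own hash salt.
    """
    code = "import sys; print(' '.join(str(hash(v)) for v in sys.argv[1:]))"
    result = subprocess.run(
        [sys.executable, "-c", code, *values],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True,
        text=True,
        check=True,
    )
    return [int(h) for h in result.stdout.split()]


def stable_hash(value):
    """Process-independent 64-bit hash (assumption: SHA-256 as a stand-in)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


values = [f"user-{i}" for i in range(100)]

# Two simulated workers see the same 100 values but salt hash() differently.
worker_a = worker_builtin_hashes(values, seed="1")
worker_b = worker_builtin_hashes(values, seed="2")

# Merging per-worker hashes: identical values no longer collapse together.
distinct_builtin = len(set(worker_a) | set(worker_b))

# With a stable hash, both workers agree, so the merged set stays at 100.
distinct_stable = len(
    {stable_hash(v) for v in values} | {stable_hash(v) for v in values}
)

print("distinct hashes, built-in hash():", distinct_builtin)
print("distinct hashes, stable hash:   ", distinct_stable)
```

With only two simulated workers the merged set of built-in hashes already exceeds the true number of values; any estimator that merges hashes across many workers inherits that inflation.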

       

       

            People

              Assignee: Monica Song (monicadsong)
              Reporter: Monica Song (monicadsong)
              Votes: 0
              Watchers: 2


                Time Tracking

                  Estimated: 24h
                  Remaining: 3h
                  Logged: 21h