Spark / SPARK-33841

Jobs disappear intermittently from the SHS under high load

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.0.1, 3.1.0, 3.2.0
    • Fix Version/s: 3.0.2, 3.1.0, 3.2.0
    • Component/s: Spark Core
    • Labels: None
    • Environment: SHS is running locally on Ubuntu 19.04

    Description

      A job that was displayed in the SHS disappeared after some time and then showed up again several minutes later.

      The issue is caused by SPARK-29043, which was intended to improve the concurrent performance of the History Server. That change breaks the "app deletion" logic because the processing of event log entries is not properly synchronized. Since the SHS now filters out all event log entries that are currently being processed, those entries are never updated with the new lastProcessed time. Any entry that completes processing right after the filtering step and before the check for stale entries is therefore identified as stale and is deleted from the UI until the next checkForLogs run. This happens because the updated lastProcessed time is used as the staleness criterion, and entries that were not updated with the new time match it.
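      The interleaving described above can be illustrated with a small, single-threaded model. This is only a sketch under stated assumptions, not the History Server code: `check_for_logs`, `listing`, `processing`, `between_steps`, and `guard` are hypothetical names, the timestamps are arbitrary integers, and the `guard` option shows one possible mitigation (remembering which entries were skipped and excluding them from the stale check), which is not necessarily the fix that was merged.

```python
def check_for_logs(listing, processing, scan_time, between_steps=None, guard=False):
    """Simplified model of one scan.

    listing:    dict mapping log path -> last-processed time (hypothetical store)
    processing: set of log paths that a worker is currently processing
    """
    # Step 1: entries still being processed are filtered out, so only the
    # remaining entries receive the new last-processed time.
    skipped = {path for path in listing if path in processing}
    for path in listing:
        if path not in skipped:
            listing[path] = scan_time

    # Hook that simulates a worker finishing between the two steps.
    if between_steps is not None:
        between_steps()

    # Step 2: anything with an old last-processed time that is no longer
    # being processed is treated as stale and removed from the listing.
    stale = [
        path
        for path, last_processed in listing.items()
        if last_processed < scan_time
        and path not in processing
        and not (guard and path in skipped)  # possible mitigation, an assumption
    ]
    for path in stale:
        del listing[path]
    return stale


def run(guard):
    listing = {"app-1": 100, "app-2": 100}
    processing = {"app-2"}  # app-2 is in flight when the scan starts
    stale = check_for_logs(
        listing,
        processing,
        scan_time=200,
        between_steps=lambda: processing.discard("app-2"),  # worker finishes here
        guard=guard,
    )
    return stale, listing


racy_stale, racy_listing = run(guard=False)
guarded_stale, guarded_listing = run(guard=True)
```

      Without the guard, `app-2` finishes processing between the two steps, still carries its old timestamp, and is removed even though its log still exists. With the guard, it stays in the listing and can pick up a fresh timestamp on a later scan.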

      The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. In this case, around 800 (82.6 MB) copies of an event log file were created using the shs-monitor script. The total number of applications reported by the SHS then behaved strangely: at first the number increased as expected, but on the next page refresh it decreased. No errors were logged by the SHS.
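      For readers without the shs-monitor script, the log-generation step might look roughly like the sketch below. It is not that script. `make_copies` and the `app-copy-NNNN` naming scheme are hypothetical, and the code assumes that each event log line is a JSON object and that the application ID is stored in the `App ID` field of the `SparkListenerApplicationStart` event. A real event log may reference the application ID in other places as well, and uploading the files to S3 is not covered here.

```python
import json
import tempfile
from pathlib import Path


def make_copies(source, out_dir, count):
    """Write `count` copies of an event log, each with a distinct app ID."""
    lines = source.read_text().splitlines()
    out_dir.mkdir(parents=True, exist_ok=True)
    app_ids = []
    for i in range(count):
        app_id = f"app-copy-{i:04d}"  # hypothetical naming scheme
        rewritten = []
        for line in lines:
            event = json.loads(line)
            # Assumption: the start event carries the ID in its "App ID" field.
            if event.get("Event") == "SparkListenerApplicationStart":
                event["App ID"] = app_id
                line = json.dumps(event)
            rewritten.append(line)
        (out_dir / app_id).write_text("\n".join(rewritten) + "\n")
        app_ids.append(app_id)
    return app_ids


# Usage with a tiny stand-in event log written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source-log"
    src.write_text(
        json.dumps({"Event": "SparkListenerApplicationStart", "App ID": "app-0"})
        + "\n"
        + json.dumps({"Event": "SparkListenerApplicationEnd", "Timestamp": 1})
        + "\n"
    )
    ids = make_copies(src, Path(tmp) / "event-logs", 800)
    written = sorted(p.name for p in (Path(tmp) / "event-logs").iterdir())
```

      Giving every copy its own application ID is a guess at what keeps the copies from being treated as the same application. The generated directory would then be uploaded to the event log location that the SHS scans.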

          People

            Assignee: Vladislav Glinskiy (vladglinskiy)
            Reporter: Vladislav Glinskiy (vladglinskiy)