Spark / SPARK-37329

File system delegation tokens are leaked

Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: Security, YARN
    • Labels: None

    Description

      On a very busy Hadoop cluster (with HDFS at-rest encryption), we found that KMS accumulated millions of delegation tokens that were not cancelled even after their jobs had finished. KMS ran out of memory within a day because of this delegation token leak.

      We were able to reproduce the bug on a smaller test cluster and found that when a Spark job starts, it acquires two delegation tokens, but only one is cancelled properly after the job finishes. The other is left over and lingers for up to 7 days (the default Hadoop delegation token lifetime).

      YARN handles the lifecycle of a delegation token properly if its renewer is 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation token with the job issuer as the renewer, simply to get the token renewal interval. The token is then ignored but not cancelled.

      Proposal: cancel the second delegation token immediately after the token renewal interval has been obtained.
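
      To make the leak and the proposed fix concrete, here is a minimal toy model in Python. It does not use Spark or Hadoop APIs: `TokenStore`, `renewal_interval`, `run_job`, the `cancel_after` flag, and the one-day renewal interval are all hypothetical names and values chosen for illustration. The in-memory store stands in for a token issuer such as KMS.

```python
import itertools


class TokenStore:
    """Toy stand-in for a token issuer such as KMS (hypothetical, not a Hadoop API)."""

    RENEW_INTERVAL_MS = 24 * 60 * 60 * 1000  # assumed renewal interval: one day

    def __init__(self):
        self._ids = itertools.count(1)
        self.live = {}  # token id -> {"renewer": ..., "issue_date": ...}

    def issue(self, renewer, now_ms=0):
        token_id = next(self._ids)
        self.live[token_id] = {"renewer": renewer, "issue_date": now_ms}
        return token_id

    def renew(self, token_id, caller):
        token = self.live[token_id]
        if token["renewer"] != caller:
            raise PermissionError("only the designated renewer may renew a token")
        # Renewal returns the new expiration time.
        return token["issue_date"] + self.RENEW_INTERVAL_MS

    def cancel(self, token_id):
        self.live.pop(token_id, None)


def renewal_interval(store, user, cancel_after=False):
    """Obtain a probe token with `user` as renewer, only to learn the renewal interval."""
    probe = store.issue(renewer=user)
    try:
        new_expiration = store.renew(probe, caller=user)
        return new_expiration - store.live[probe]["issue_date"]
    finally:
        if cancel_after:
            # Proposed fix: cancel the probe token as soon as the interval is known.
            store.cancel(probe)


def run_job(store, user, cancel_probe):
    job_token = store.issue(renewer="yarn")  # token whose lifecycle YARN manages
    interval = renewal_interval(store, user, cancel_after=cancel_probe)
    store.cancel(job_token)  # YARN cancels its token when the job finishes
    return interval


leaky, fixed = TokenStore(), TokenStore()
run_job(leaky, "alice", cancel_probe=False)
run_job(fixed, "alice", cancel_probe=True)
print(len(leaky.live), len(fixed.live))
```

      In this model, each job run without the fix leaves one token behind in the store, which is how a busy cluster could accumulate millions of them. With `cancel_after=True` the store is empty once the job finishes.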

      Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has probably been present since day 1.

          People

            Assignee: Unassigned
            Reporter: Wei-Chiu Chuang (weichiu)