Spark / SPARK-44306

Group FileStatus with few RPC calls within Yarn Client

Details

    Description

      Obtaining the FileStatus of each resource one by one is inefficient. At our company, we run Spark on Hadoop Yarn with HDFS, and we see two major drawbacks with the current behavior:

      1. Each getFileStatus call incurs network delay, so the total delay can be large and adds uncertainty to the overall Spark job runtime. We measured this overhead in our cluster: p50 is around 10 s, p80 is 1 min, and p100 is up to 15 min. When HDFS is overloaded, the delays become more severe.
      2. Our cluster issues nearly 100 million getFileStatus calls to HDFS daily. For each user, most resources come from the same HDFS directory (see our engineering blog post for why we took this approach). We can therefore replace nearly 100 million getFileStatus calls with about 0.1 million listStatus calls daily, which also reduces load on the HDFS side.

      In short, we need a more efficient way to fetch the FileStatus of each resource.
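
The grouping idea described above can be illustrated with a minimal sketch. It is not the Yarn client's code: `CountingFS` is a hypothetical in-memory stand-in for HDFS that only counts simulated RPCs, and `statuses_one_by_one` and `statuses_grouped` are illustrative names. The two methods are loosely modeled on Hadoop's `FileSystem.getFileStatus(Path)` and `FileSystem.listStatus(Path)`.

```python
from collections import defaultdict
import posixpath


class CountingFS:
    """Hypothetical in-memory file system that counts simulated RPCs."""

    def __init__(self, files):
        self.files = dict(files)  # path -> length in bytes
        self.rpc_calls = 0

    def get_file_status(self, path):
        # One simulated RPC per file.
        self.rpc_calls += 1
        return {"path": path, "length": self.files[path]}

    def list_status(self, directory):
        # One simulated RPC per directory; returns its direct children.
        self.rpc_calls += 1
        return [
            {"path": p, "length": n}
            for p, n in self.files.items()
            if posixpath.dirname(p) == directory
        ]


def statuses_one_by_one(fs, resources):
    """Baseline: one call per resource."""
    return {p: fs.get_file_status(p) for p in resources}


def statuses_grouped(fs, resources):
    """Group resources by parent directory, then list each directory once."""
    by_dir = defaultdict(set)
    for p in resources:
        by_dir[posixpath.dirname(p)].add(p)

    result = {}
    for directory, wanted in by_dir.items():
        for status in fs.list_status(directory):
            # Keep only the entries that were actually requested.
            if status["path"] in wanted:
                result[status["path"]] = status
    return result


if __name__ == "__main__":
    # Assumed layout: 100 jars in one directory plus one config file elsewhere.
    files = {f"/user/alice/libs/dep{i}.jar": 1000 + i for i in range(100)}
    files["/user/alice/conf/app.conf"] = 42

    baseline_fs = CountingFS(files)
    statuses_one_by_one(baseline_fs, list(files))

    grouped_fs = CountingFS(files)
    statuses_grouped(grouped_fs, list(files))

    print("one by one:", baseline_fs.rpc_calls, "calls")
    print("grouped:   ", grouped_fs.rpc_calls, "calls")
```

One trade-off to keep in mind: listing a directory that holds many unrelated files may return far more entries than needed, so a real implementation would probably fall back to per-file calls when only a few resources share a directory.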

          People

            shuwang SHU WANG