IMPALA-6112

Improve the thread pool size detection logic while loading partitioned table block metadata

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.11.0
    • Fix Version/s: None
    • Component/s: Catalog

    Description

      IMPALA-5429 added thread-pool-based loading of block metadata to increase loading throughput. The pool size is chosen by the following method:

        /**
         * Returns the thread pool size to load the file metadata of this table.
         * 'numPaths' is the number of paths for which the file metadata should be loaded.
         *
         * We use different thread pool sizes for HDFS and non-HDFS tables since the latter
         * supports much higher throughput of RPC calls for listStatus/listFiles. For
         * simplicity, the filesystem type is determined based on the table's root path and
         * not for each partition individually. Based on our experiments, S3 showed a linear
         * speedup (up to ~100x) with an increasing number of loading threads, whereas the
         * HDFS throughput was limited to ~5x in unsecured clusters and up to ~3.7x in secure
         * clusters. We narrowed it down to scalability bottlenecks in HDFS RPC implementation
         * (HADOOP-14558) on both the server and the client side.
         */
        private int getLoadingThreadPoolSize(int numPaths) throws CatalogException {
          Preconditions.checkState(numPaths > 0);
          FileSystem tableFs;
          try {
            tableFs = (new Path(getLocation())).getFileSystem(CONF);
          } catch (IOException e) {
            throw new CatalogException("Invalid table path for table: " + getFullName(), e);
          }
          int threadPoolSize = FileSystemUtil.supportsStorageIds(tableFs) ?
              MAX_HDFS_PARTITIONS_PARALLEL_LOAD : MAX_NON_HDFS_PARTITIONS_PARALLEL_LOAD;
          // Thread pool size need not exceed the number of paths to be loaded.
          return Math.min(numPaths, threadPoolSize);
        }
      

      As the method comment says, the thread pool size is chosen based on the filesystem type of the table's base directory. Because a single partitioned table can have partitions on several filesystem types, this is not always optimal. For example, if the table's base directory is on HDFS but 90% of the partitions are on S3, the method returns the smaller HDFS thread pool size, which is sub-optimal. This JIRA tracks fixing that behavior.
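
      One possible direction, shown only as a minimal sketch and not as the planned fix: inspect the partition locations themselves and pick the limit that matches the majority filesystem type. Everything here is an assumption made for illustration. The class name, the isHdfsLike() helper (a scheme-based stand-in for FileSystemUtil.supportsStorageIds()), the use of plain location strings instead of Hadoop Path objects, and the two limit values are hypothetical placeholders.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class LoadingPoolSizeSketch {
  // Placeholder limits, assumed for illustration only.
  static final int MAX_HDFS_PARTITIONS_PARALLEL_LOAD = 5;
  static final int MAX_NON_HDFS_PARTITIONS_PARALLEL_LOAD = 20;

  // Hypothetical stand-in for FileSystemUtil.supportsStorageIds():
  // treats only locations with the "hdfs" scheme as HDFS-like.
  static boolean isHdfsLike(String location) {
    return "hdfs".equalsIgnoreCase(URI.create(location).getScheme());
  }

  // Picks the pool size from the partition locations rather than from the
  // table's root path: the limit follows the majority filesystem type.
  static int getLoadingThreadPoolSize(List<String> partitionLocations) {
    if (partitionLocations.isEmpty()) {
      throw new IllegalArgumentException("At least one path is required");
    }
    long hdfsPaths =
        partitionLocations.stream().filter(LoadingPoolSizeSketch::isHdfsLike).count();
    long nonHdfsPaths = partitionLocations.size() - hdfsPaths;
    int limit = nonHdfsPaths > hdfsPaths
        ? MAX_NON_HDFS_PARTITIONS_PARALLEL_LOAD
        : MAX_HDFS_PARTITIONS_PARALLEL_LOAD;
    // Thread pool size need not exceed the number of paths to be loaded.
    return Math.min(partitionLocations.size(), limit);
  }

  public static void main(String[] args) {
    // 10% of the partitions on HDFS, 90% on S3, as in the example above.
    List<String> locations = new ArrayList<>();
    for (int i = 0; i < 3; i++) locations.add("hdfs://nn:8020/warehouse/t/p=" + i);
    for (int i = 3; i < 30; i++) locations.add("s3a://bucket/warehouse/t/p=" + i);
    System.out.println("Pool size: " + getLoadingThreadPoolSize(locations));
  }
}
```

      A majority vote is the simplest policy. Alternatives such as one pool per filesystem type, or a size weighted by the share of paths on each filesystem, would avoid overloading the HDFS NameNode when a large minority of partitions lives there.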

            People

              Assignee: Unassigned
              Reporter: Bharath Vissapragada (bharathv)