HBase / HBASE-21286

Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: snapshots
    • Labels: None

    Description

      Although this step is named computeHDFSBlocksDistribution, it runs regardless of the file system that holds the snapshot. For example, we observed a significant slowdown with a snapshot stored in S3 (~26k regions, 5 column families, 2 files per column family): getSplits takes ~40 min because of the S3 calls that list the files in order to find the best locations.

      Parallelizing this operation can reduce the overall setup time. The thread pool size should be configurable; a good choice could be the existing "hbase.snapshot.thread.pool.max" property, which is also used in RestoreSnapshotHelper.
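
A minimal sketch of the idea, not the attached patch: each region's locality computation is submitted to a fixed-size thread pool, and the results are collected in region order. The class ParallelLocalitySketch, the helpers computeAll and poolSizeFrom, the default of 8 threads, and the use of a plain Map in place of the Hadoop Configuration are all assumptions made for illustration; only the property name comes from this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper; names and defaults are illustrative, not HBase API. */
final class ParallelLocalitySketch {

  /** Property name proposed in this issue. */
  static final String POOL_SIZE_KEY = "hbase.snapshot.thread.pool.max";

  /** Assumed default for this sketch only. */
  static final int DEFAULT_POOL_SIZE = 8;

  /** Reads the pool size from a plain map that stands in for the job configuration. */
  static int poolSizeFrom(Map<String, String> conf) {
    String value = conf.get(POOL_SIZE_KEY);
    return value == null ? DEFAULT_POOL_SIZE : Integer.parseInt(value.trim());
  }

  /**
   * Applies computeOne to every region on a fixed-size pool.
   * Results come back in the same order as the input list.
   */
  static <R, D> List<D> computeAll(List<R> regions, Function<R, D> computeOne, int poolSize)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, poolSize));
    try {
      List<Callable<D>> tasks = new ArrayList<>();
      for (R region : regions) {
        // One task per region; in the real code this would wrap the file listing.
        tasks.add(() -> computeOne.apply(region));
      }
      List<D> results = new ArrayList<>();
      // invokeAll blocks until all tasks finish and preserves task order.
      for (Future<D> future : pool.invokeAll(tasks)) {
        // get() rethrows a task failure wrapped in ExecutionException.
        results.add(future.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}
```

In getSplits, the per-region function would be whatever currently computes the block distribution for one region, so the slow file listings overlap instead of running one after another. A caller could invoke it as computeAll(regions, computeOne, poolSizeFrom(conf)).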

      Attachments

        1. HBASE-21286.branch-1.4.001.patch
          4 kB
          Lavinia-Stefania Sirbu
        2. HBASE-21286.branch-1.4.002.patch
          4 kB
          Lavinia-Stefania Sirbu
        3. HBASE-21286.branch-1.4.003.patch
          4 kB
          Lavinia-Stefania Sirbu

          People

            Assignee: Lavinia-Stefania Sirbu (lavinia.sirbu)
            Reporter: Lavinia-Stefania Sirbu (lavinia.sirbu)
            Votes: 0
            Watchers: 4
