Hive / HIVE-24368

Optimise AcidUtils::getAcidFilesForStats for ACID tables

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: HiveServer2
    • Labels: None

    Description

      After an insert, Hive gathers statistics for the ACID table. This becomes expensive over time because of the growing number of delta folders that have to be scanned.

      https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L2648

       

      public static List<FileStatus> getAcidFilesForStats(
          Table table, Path dir, Configuration jc, FileSystem fs) throws IOException {
        ...
        Directory acidInfo = AcidUtils.getAcidState(fs, dir, jc, idList, null, false, hdfsDirSnapshots);
        ...
        // ... + other calls
        ...
      }

      The runtime of this step keeps increasing as more deltas are generated.
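
To make the growth concrete, here is a minimal sketch that is not Hive code. It assumes a local temporary directory as a stand-in for an HDFS partition and uses only java.nio.file; the class name DeltaListingSketch, both method names, and the directory and file names are hypothetical and merely mimic a delta layout. It counts how many directory listings are needed when every delta folder is visited individually, which is the kind of cost that grows with the number of deltas.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Illustration only: a local-filesystem stand-in for an ACID partition directory. */
public class DeltaListingSketch {

    /** Creates a temp directory with `deltas` delta-like subdirectories, one file in each. */
    public static Path createPartition(int deltas) throws IOException {
        Path partition = Files.createTempDirectory("acid_partition_");
        for (int i = 1; i <= deltas; i++) {
            // Hypothetical names that only imitate a delta folder; nothing parses them here.
            Path delta = Files.createDirectory(
                partition.resolve(String.format("delta_%07d_%07d", i, i)));
            Files.createFile(delta.resolve("bucket_00000"));
        }
        return partition;
    }

    /** Collects all data files by listing each delta directory; listCalls[0] counts listings. */
    public static List<Path> listFilesPerDelta(Path partition, int[] listCalls) throws IOException {
        List<Path> deltas = new ArrayList<>();
        try (Stream<Path> children = Files.list(partition)) {
            listCalls[0]++; // one listing for the partition directory itself
            children.filter(Files::isDirectory).forEach(deltas::add);
        }
        List<Path> files = new ArrayList<>();
        for (Path delta : deltas) {
            try (Stream<Path> children = Files.list(delta)) {
                listCalls[0]++; // one more listing per delta directory
                children.filter(Files::isRegularFile).forEach(files::add);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Temp directories are left behind for the OS to clean up; fine for a throwaway demo.
        for (int deltas : new int[] {10, 100, 1000}) {
            Path partition = createPartition(deltas);
            int[] listCalls = {0};
            long start = System.nanoTime();
            int found = listFilesPerDelta(partition, listCalls).size();
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("deltas=%d files=%d listCalls=%d elapsed=%dus%n",
                deltas, found, listCalls[0], micros);
        }
    }
}
```

In this sketch the number of listing calls is always the delta count plus one, so the work scales linearly with the number of deltas. Timings on a local disk will be far smaller than on a remote filesystem and are only indicative.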

            People

              Assignee: Unassigned
              Reporter: Rajesh Balamohan (rajesh.balamohan)
              Votes: 0
              Watchers: 3
