IMPALA-12962

Estimated metadata size of a table doesn't match the actual Java object size

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: Catalog

    Description

      Catalogd shows the top-25 largest tables in its WebUI at the "/catalog" endpoint. The estimated metadata size is computed in HdfsTable#getTHdfsTable():
      https://github.com/apache/impala/blob/0d49c9d6cc7fc0903d60a78d8aaa996af0249c06/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L2414-L2451
      The current formula is

      • memUsageEstimate = numPartitions * 2KB + numFiles * 500B + numBlocks * 150B + (optional) incrementalStats
      • (optional) incrementalStats = numPartitions * numColumns * 200B
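
      As a minimal sketch, the formula above can be written out as follows. The class and method names are hypothetical and this is not the code in HdfsTable; it also assumes that "2KB" means 2048 bytes.

```java
public final class TableMemEstimate {
    // Per-item constants taken from the formula quoted in this issue.
    // Assumption: "2KB" is read as 2048 bytes.
    static final long PER_PARTITION_BYTES = 2 * 1024;
    static final long PER_FILE_BYTES = 500;
    static final long PER_BLOCK_BYTES = 150;
    static final long PER_PARTITION_COLUMN_STATS_BYTES = 200;

    /** Returns the estimated metadata size in bytes for one table. */
    static long estimateBytes(long numPartitions, long numFiles, long numBlocks,
                              long numColumns, boolean hasIncrementalStats) {
        long estimate = numPartitions * PER_PARTITION_BYTES
                + numFiles * PER_FILE_BYTES
                + numBlocks * PER_BLOCK_BYTES;
        if (hasIncrementalStats) {
            // Optional term: one stats entry per (partition, column) pair.
            estimate += numPartitions * numColumns * PER_PARTITION_COLUMN_STATS_BYTES;
        }
        return estimate;
    }

    public static void main(String[] args) {
        // Example table: 1000 partitions, 50000 files, 60000 blocks, 100 columns.
        System.out.println(estimateBytes(1000, 50000, 60000, 100, true));
    }
}
```

      Note that no term depends on the length of any string, so two tables with the same counts get the same estimate regardless of how long their comments or properties are.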

      This formula is fine for comparing tables against each other, but it can't be used to estimate the max heap size of catalogd. For example, it ignores column comments and tblproperties, both of which can contain long strings. Column names should also be counted, since they add up for wide tables.
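
      To illustrate the gap, here is a rough sketch that counts the characters held in column names, column comments, and tblproperties. The class and method names are hypothetical. The result is only a lower bound on the bytes these strings occupy, assuming at least one byte per character, and it ignores object headers and collection overhead, so the real heap footprint is larger.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public final class StringPayloadLowerBound {
    /** Sums the lengths of all non-null strings in the list. */
    static long totalChars(List<String> strings) {
        long total = 0;
        for (String s : strings) {
            if (s != null) total += s.length();
        }
        return total;
    }

    /**
     * Counts the characters the current formula does not account for:
     * column names, column comments, and tblproperties keys and values.
     */
    static long uncountedChars(List<String> columnNames, List<String> columnComments,
                               Map<String, String> tblProperties) {
        long total = totalChars(columnNames) + totalChars(columnComments);
        for (Map.Entry<String, String> e : tblProperties.entrySet()) {
            total += e.getKey().length();
            if (e.getValue() != null) total += e.getValue().length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "name");
        List<String> comments = Arrays.asList("primary key", null);
        Map<String, String> props = Collections.singletonMap("k", "vv");
        System.out.println(uncountedChars(names, comments, props));
    }
}
```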

      We can compare the estimated sizes with measurements from ehcache-sizeof or jamm and update the formula accordingly. Alternatively, we could use these libraries to measure the sizes directly, provided they don't hurt performance.

      CC MikaelSmith 

          People

            Assignee: Unassigned
            Reporter: Quanlong Huang (stigahuang)