IMPALA-4623: Parquet Scanner - reduce NN RPC

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.2.4
    • Fix Version/s: Impala 2.10.0
    • Component/s: Backend

    Description

      The current Parquet scanner implementation treats each column in a row group as a separate "scan range". Reading a scan range triggers one fopen RPC to the NameNode (NN), so Impala issues one RPC per column per row group. The NameNode can process fopen RPCs only at a limited rate, which can become a limiting factor for query performance.

      Fundamentally, there is no need to issue a fopen for each column. Impala should issue at most one fopen for each row group.
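
      To make the difference concrete, here is a minimal sketch that counts open calls for a single Parquet file under both schemes. The function name `open_calls` and the file shape (8 row groups, 50 columns) are hypothetical values chosen for illustration; they do not come from Impala.

```python
def open_calls(row_groups: int, columns: int, per_row_group: bool) -> int:
    """Count file-open RPCs for one Parquet file (illustrative model only)."""
    if per_row_group:
        # Proposed upper bound: at most one open per row group.
        return row_groups
    # Behavior described in this issue: one open per column per row group.
    return row_groups * columns


# Hypothetical file: 8 row groups, 50 columns read by the query.
current = open_calls(8, 50, per_row_group=False)
proposed = open_calls(8, 50, per_row_group=True)
print(current, proposed)  # 400 8
```

      In this model the number of RPCs no longer grows with the number of columns a query reads, only with the number of row groups.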

      The current workaround, the file handle cache, is not practical because each cached file handle has a large (1 KB) memory footprint. A file handle also cannot be shared by concurrent readers, so if 10 queries read the same file at the same time, 10 file handles must be cached.
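
      The cost of non-shareable handles can be sketched with a toy cache in which a handle is held by exactly one reader at a time. The class `ExclusiveHandleCache`, the file path, and the tuple used as a "handle" are all hypothetical; this is not Impala's cache implementation, only a model of the constraint described above.

```python
from collections import defaultdict

HANDLE_BYTES = 1024  # 1 KB per-handle footprint quoted in the description


class ExclusiveHandleCache:
    """Toy model: a cached handle can be held by only one reader at a time."""

    def __init__(self):
        self.idle = defaultdict(list)  # path -> handles not currently in use
        self.total = 0                 # handles created so far (all stay cached)

    def acquire(self, path):
        if self.idle[path]:
            return self.idle[path].pop()  # reuse an idle handle, no new open
        self.total += 1                   # no idle handle: a new open is needed
        return (path, self.total)

    def release(self, handle):
        self.idle[handle[0]].append(handle)


cache = ExclusiveHandleCache()
# 10 concurrent queries read the same file, so each needs its own handle.
held = [cache.acquire("/warehouse/t/part-0.parq") for _ in range(10)]
for h in held:
    cache.release(h)
print(cache.total, cache.total * HANDLE_BYTES)  # 10 10240
```

      A later sequential reader would reuse one of the idle handles, but the cache still holds one handle per peak concurrent reader of each file.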

            People

              Assignee: joemcdonnell Joe McDonnell
              Reporter: alan@cloudera.com Alan Choi