Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2473

Excessive memory usage by scan nodes

    XMLWordPrintableJSON

Details

    Description

      In some situations the Hdfs scan nodes, e.g. the parquet scanner, use excessive amounts of memory. The below query consumes over 2GB of memory in the scanner. The problem appears to be that very large batches (8MB) are being produced by the scanner and the current methods for limiting memory usage (queue length I believe) do not work well with large batches.

      [tarmstrong-box.ca.cloudera.com:21000] > use tpch25_nested_parquet;
      Query: use tpch25_nested_parquet
      [tarmstrong-box.ca.cloudera.com:21000] > select
      t10.c_custkey AS int_col,
      COALESCE(t10.c_custkey, t8.int_col_t5, t9.s_suppkey) AS int_col_t11
      FROM customer t1
      INNER JOIN (
      SELECT
      (COALESCE(t3.item.l_partkey, t3.item.l_partkey, t3.item.l_partkey)) - (MIN(t3.item.l_partkey)) AS int_col,
      COALESCE(SUM(t3.item.l_partkey), t3.item.l_partkey, t3.item.l_partkey) AS int_col_t5,
      t3.item.l_partkey AS int_col_t6
      FROM t1.c_orders.item.o_lineitems t2                                                                                                     LEFT JOIN t1.c_orders.item.o_lineitems t3 ON (t3.item.l_shipinstruct) = (t2.item.l_comment)
      LEFT JOIN t1.c_orders.item.o_lineitems t4 ON (((t4.item.l_suppkey) = (t2.pos)) AND ((t4.item.l_shipdate) = (t2.item.l_comment))) AND ((t4.item.l_comment) = (t2.item.l_comment))
      WHERE
      (t3.pos) NOT IN (t3.item.l_partkey, 184.99)
      GROUP BY
      t3.item.l_partkey
      HAVING
      (MAX(t2.pos)) IN ((t3.item.l_partkey) - (t3.item.l_partkey), t3.item.l_partkey)
      ) t8
      INNER JOIN supplier t9 ON (t9.s_nationkey) = (t8.int_col)
      LEFT JOIN customer t10 ON (t10.c_custkey) = (t9.s_nationkey);
      Query: select t10.c_custkey AS int_col,
      COALESCE(t10.c_custkey, t8.int_col_t5, t9.s_suppkey) AS int_col_t11
      FROM customer t1
      INNER JOIN (
      SELECT
      (COALESCE(t3.item.l_partkey, t3.item.l_partkey, t3.item.l_partkey)) - (MIN(t3.item.l_partkey)) AS int_col,
      COALESCE(SUM(t3.item.l_partkey), t3.item.l_partkey, t3.item.l_partkey) AS int_col_t5,
      t3.item.l_partkey AS int_col_t6
      FROM t1.c_orders.item.o_lineitems t2
      LEFT JOIN t1.c_orders.item.o_lineitems t3 ON (t3.item.l_shipinstruct) = (t2.item.l_comment)
      LEFT JOIN t1.c_orders.item.o_lineitems t4 ON (((t4.item.l_suppkey) = (t2.pos)) AND ((t4.item.l_shipdate) = (t2.item.l_comment))) AND ((t4.item.l_comment) = (t2.item.l_comment))
      WHERE
      (t3.pos) NOT IN (t3.item.l_partkey, 184.99)
      GROUP BY
      t3.item.l_partkey
      HAVING
      (MAX(t2.pos)) IN ((t3.item.l_partkey) - (t3.item.l_partkey), t3.item.l_partkey)
      ) t8
      INNER JOIN supplier t9 ON (t9.s_nationkey) = (t8.int_col)
      LEFT JOIN customer t10 ON (t10.c_custkey) = (t9.s_nationkey)
      

      Workaround
      Setting num_scanner_threads to a lower number will reduce memory usage at some cost to scan performance. Setting batch_size to a lower value will also reduce memory usage at a cost to query performance.

      Attachments

        Activity

          People

            tarmstrong Tim Armstrong
            tarmstrong Tim Armstrong
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: