HIVE-26346: Default Tez memory limits occasionally result in killing container


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.3
    • Fix Version/s: None
    • Component/s: Tez
    • Labels: None

    Description

      When inserting data into Hive, the insert occasionally fails with an error like:

      FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1605060173780_0039_2_00, diagnostics=[Task failed, taskId=task_1605060173780_0039_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1605060173780_0039_01_000002 finished with diagnostics set to [Container failed, exitCode=-104. [2020-11-11 02:35:11.768]Container [pid=16810,containerID=container_1605060173780_0039_01_000002] is running 7729152B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container.

      Specifically, the TezChild container uses a small amount of physical memory beyond its limit, so YARN kills the container.

      Identifying how to resolve this is somewhat fraught:

      • Our docs offer no clear troubleshooting advice for this error. Googling led to several forums with a mix of good and awful advice; https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279 is probably the best one.
      • The issue itself comes down to Tez allocating 80% of the container memory limit to the Java heap (Xmx). Depending on other memory usage (stack memory, JIT, other JVM overhead), the remaining 20% can be too little headroom, so the whole process exceeds the container's physical limit. By comparison, when running in a cgroup, Java defaults Xmx to 25% of the memory limit.
      • Identifying the right parameters to tune, and verifying they'd been set correctly, was a bit challenging. We ended up playing with tez.container.max.java.heap.fraction, hive.tez.container.size, and yarn.scheduler.minimum-allocation-mb, then verified they took effect by watching the process arguments (with htop) for changes in Xmx (see the sketch after this list). We definitely had some missteps figuring out when it's hive.tez.container vs tez.container.
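
      One quick way to confirm which heap size actually took effect is to inspect the JVM arguments of the running task containers. This is only a sketch, not from the issue itself; it assumes you can shell into the worker node and that the Tez task JVMs show up as TezChild in the process list:

        # Print the -Xmx value of any running TezChild JVMs.
        # With the default heap fraction of 0.8, a 1024 MB container gives roughly -Xmx819m;
        # raising the container to 2048 MB should show about -Xmx1638m, and a heap
        # fraction of 0.75 on a 1024 MB container about -Xmx768m.
        ps -eo args | grep '[T]ezChild' | grep -o -e '-Xmx[^ ]*'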

      In the end, any of the following seemed to work for us (a launch-time sketch follows this list):

      • SET yarn.scheduler.minimum-allocation-mb=2048
      • SET tez.container.max.java.heap.fraction=0.75
      • SET hive.tez.container.size=2048
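
      For reference, one minimal way to apply the session-level settings is to pass them as --hiveconf overrides when launching the failing insert, as in the sketch below. The JDBC URL and table names are placeholders, not from this issue; note that yarn.scheduler.minimum-allocation-mb is a cluster-side YARN scheduler setting that normally lives in yarn-site.xml rather than a Hive session:

        # Placeholder connection URL and table names; adjust for your cluster.
        beeline -u 'jdbc:hive2://localhost:10000/default' \
          --hiveconf hive.tez.container.size=2048 \
          --hiveconf tez.container.max.java.heap.fraction=0.75 \
          -e 'INSERT INTO target_table SELECT * FROM source_table;'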

People

    • Assignee: Unassigned
    • Reporter: Michael Smith (MikaelSmith)