  Phoenix / PHOENIX-2883

Region close during automatic disabling of index for rebuilding can lead to RS abort

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete

    Description

      (disclaimer: still performing due-diligence on this one)

      I've been helping a user this week with what is thought to be a race condition in secondary index updates. This user has a relatively heavy write-based workload with a few tables that each have at least one index.

      When the region distribution is changing (concretely, we were doing a rolling restart of the cluster with the load balancer left enabled, in the hope of retaining as much availability as possible), I've seen the following general outline in the logs:

      • An index update fails with ERROR 2008 (INT10): the index metadata cache has expired or is missing
      • The index is taken offline to be asynchronously rebuilt
      • A flush on the data table's region is queued for quite some time
      • RS is asked to close a region (due to a move, commonly)
      • RS aborts because the memstore for the data table's region is in an inconsistent state, e.g. "Assertion failed while closing store <region> <colfam> flushableSize expected=0, actual= 193392. Current memstoreSize=-552208. Maybe a coprocessor operation failed and left the memstore in a partially updated state."
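
      To make the last step concrete, here is a toy Java sketch of the kind of accounting invariant that close-time assertion appears to check: after the final flush, a store should have nothing left to flush, and the region-wide counter should not be negative. This is not HBase or Phoenix code. Every class and method name is hypothetical, and the two "failure path" methods are assumptions made only to show how the two numbers can drift apart. It is not a diagnosis of this issue.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model only: none of these names are HBase or Phoenix APIs.
final class MemstoreAccountingSketch {
    // Region-wide running total, loosely analogous to "memstoreSize" in the log line.
    private final AtomicLong regionMemstoreSize = new AtomicLong();
    // Bytes held by one store that a flush would still have to write out.
    private long storeFlushableSize = 0;

    // Consistent path: the store and the region counter move together.
    void applyWrite(long bytes) {
        storeFlushableSize += bytes;
        regionMemstoreSize.addAndGet(bytes);
    }

    // Hypothetical failure path: cells reach the store without being counted.
    void writeToStoreOnly(long bytes) {
        storeFlushableSize += bytes;
    }

    // Hypothetical failure path: the counter is rolled back for bytes it never added.
    void rollBackCounterOnly(long bytes) {
        regionMemstoreSize.addAndGet(-bytes);
    }

    // Flush: write out everything in the store and release its share of the counter.
    void flush() {
        regionMemstoreSize.addAndGet(-storeFlushableSize);
        storeFlushableSize = 0;
    }

    long flushableSize() {
        return storeFlushableSize;
    }

    long memstoreSize() {
        return regionMemstoreSize.get();
    }

    // Close-time check in the spirit of the assertion: nothing left over, nothing negative.
    boolean isSafeToClose() {
        return storeFlushableSize == 0 && regionMemstoreSize.get() == 0;
    }

    public static void main(String[] args) {
        MemstoreAccountingSketch region = new MemstoreAccountingSketch();
        region.applyWrite(1_000);
        region.flush();                   // final flush before close
        region.writeToStoreOnly(300);     // assumed: a late write lands after that flush
        region.rollBackCounterOnly(800);  // assumed: a failed operation undoes too much
        System.out.println("flushableSize expected=0, actual=" + region.flushableSize());
        System.out.println("memstoreSize=" + region.memstoreSize());
        System.out.println("safe to close: " + region.isSafeToClose());
    }
}
```

      In this toy model the store and the counter stay in agreement only while every write goes through applyWrite. Any path that touches one without the other leaves a positive flushable size or a negative total at close, which has the same shape as the numbers in the abort message.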

      Some relevant HBase issues include HBASE-10514 and HBASE-10844.

      I have been talking to ayingshu and devaraj about it, but we haven't found anything conclusive yet. Will dump findings here.

            People

              elserj Josh Elser