Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-2681

Account for already-running tasks before sending new ones upon processing tablet reports

    XMLWordPrintableJSON

Details

    Description

      I've seen a case where the master will reschedule the same delete tablet task for a given tablet multiple times, e.g. because it received a new tablet report that the tablet still exists on a given tserver. This results in significant log-spam, and ends up sending excessive RPCs to the tablet servers. Here are some master logs demonstrating this (note the repeated attempt numbers):

       

      I0129 05:09:43.918886 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))
      W0129 05:09:43.919509 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet
      I0129 05:09:43.919517 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8226 ms (attempt = 10)
      I0129 05:09:43.960479 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))
      W0129 05:09:43.961150 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet
      I0129 05:09:43.961158 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8235 ms (attempt = 10)
      I0129 05:09:44.016152 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))
      W0129 05:09:44.016383 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet
      I0129 05:09:44.016391 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8206 ms (attempt = 10)
      I0129 05:09:44.226428 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))
      W0129 05:09:44.226753 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet
      I0129 05:09:44.226773 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8207 ms (attempt = 10)
      I0129 05:09:44.234709 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))
      W0129 05:09:44.234923 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet
      I0129 05:09:44.234936 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8199 ms (attempt = 10)
      I0129 05:09:44.322351 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))
      W0129 05:09:44.322854 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet
      I0129 05:09:44.322860 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8207 ms (attempt = 10)

       

      Here is a similar log for alter schema requests:

       

      I0129 05:09:50.798943 22188 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60047 ms (attempt = 27)
      W0129 05:09:50.798954 22188 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e
      I0129 05:09:50.889256 22191 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60044 ms (attempt = 27)
      W0129 05:09:50.889267 22191 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e
      I0129 05:09:50.928457 22188 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60001 ms (attempt = 27)
      W0129 05:09:50.928469 22188 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e
      I0129 05:09:51.024364 22188 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60036 ms (attempt = 27)
      W0129 05:09:51.024372 22188 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e
      I0129 05:09:51.061228 22191 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60001 ms (attempt = 27)
      W0129 05:09:51.061239 22191 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e
      I0129 05:09:51.080605 22191 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60038 ms (attempt = 27)
      W0129 05:09:51.080643 22191 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e

       

      Kudu should take into account existing, retrying tasks running on the master to make sure we don't stack up in this way.

      I've seen this confirmed with Kudu 1.7.1, but looking through the code at CatalogManager::ProcessTabletReport, this seems possible on master too.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              awong Andrew Wong
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: