CASSANDRA-19427

Fix concurrent access of ClientWarn causing AIOBE for SELECT WHERE IN queries with multiple coordinator-local partitions

Details

    • Degradation - Other Exception
    • Normal
    • Low Hanging Fruit
    • User Report
    • All
    • Demonstration branch included but no new unit tests; this fixes a transient issue that depends on thread scheduling

    Description

      On one of our clusters, we noticed rare but periodic ArrayIndexOutOfBoundsExceptions:

       

      message="Uncaught exception on thread Thread[ReadStage-3,5,main]"
      exception="java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException
      at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2579)
      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
      at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
      at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
      at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.lang.ArrayIndexOutOfBoundsException"

       

       

      The error was in a Runnable, so the stacktrace didn't directly indicate where the error was coming from. We enabled JFR to log the underlying exception that was thrown:
       

      message="Uncaught exception on thread Thread[ReadStage-2,5,main]" exception="java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 0
      at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2579)
      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
      at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
      at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
      at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 0
      at java.base/java.util.ArrayList.add(ArrayList.java:487)
      at java.base/java.util.ArrayList.add(ArrayList.java:499)
      at org.apache.cassandra.service.ClientWarn$State.add(ClientWarn.java:84)
      at org.apache.cassandra.service.ClientWarn$State.access$000(ClientWarn.java:77)
      at org.apache.cassandra.service.ClientWarn.warn(ClientWarn.java:51)
      at org.apache.cassandra.db.ReadCommand$1MetricRecording.onClose(ReadCommand.java:596)
      at org.apache.cassandra.db.transform.BasePartitions.runOnClose(BasePartitions.java:70)
      at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:95)
      at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2260)
      at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2575)
      ... 6 more"

       
       

      An ArrayIndexOutOfBoundsException (AIOBE) in ArrayList.add(E) should only be possible when multiple threads call the method on the same list at the same time without synchronization.

       

      This was seen while executing a SELECT WHERE IN query with multiple partition keys. The exception can happen when the coordinator dispatches multiple local reads in org.apache.cassandra.service.reads.AbstractReadExecutor#makeRequests. In this case, more than one local read exceeds the tombstone warning threshold, so several tombstone warnings are added to the same ClientWarn.State reference from different threads. Currently, org.apache.cassandra.service.ClientWarn.State#warnings is an ArrayList, which isn't safe for concurrent modification, so the AIOBE is thrown.
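
      The race described above can be sketched outside Cassandra. The class below is a hypothetical stand-in, not Cassandra code: several threads add to one unsynchronized ArrayList, much as multiple local reads would add to one shared warnings list. Because the failure depends on thread scheduling, a given run may or may not throw or lose elements, so no particular output should be expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; names are illustrative and not part of Cassandra.
final class UnsafeWarningsDemo {
    /**
     * Starts `threads` writers that each add `perThread` entries to a single
     * unsynchronized ArrayList. Returns { final list size, number of adds that threw }.
     */
    static int[] run(int threads, int perThread) {
        List<String> warnings = new ArrayList<>(); // shared and NOT thread-safe
        AtomicInteger failures = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // release all writers together to encourage contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    try {
                        warnings.add("tombstone warning");
                    } catch (RuntimeException e) {
                        // e.g. ArrayIndexOutOfBoundsException from a racing add
                        failures.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        start.countDown();
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { warnings.size(), failures.get() };
    }

    public static void main(String[] args) {
        int[] result = run(4, 100_000);
        // Under a race, the size may fall short of 400000 and failures may be non-zero.
        System.out.println("size=" + result[0] + " failures=" + result[1]);
    }
}
```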

       

      I have a patch available for this, and I'm preparing it now. The patch is simple: it changes org.apache.cassandra.service.ClientWarn.State#warnings to a thread-safe CopyOnWriteArrayList. I also have a jvm-dtest that demonstrates the issue but doesn't need to be merged. It shows how a SELECT WHERE IN query with local reads that add client warnings can add to the same ClientWarn.State from different threads. I'll push that in a separate branch just for demonstration purposes.
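
      A minimal sketch of the approach described above, assuming a simplified holder class. WarningState and its methods are hypothetical and only mirror the idea of ClientWarn.State#warnings; the actual change is in the fix branch. With a CopyOnWriteArrayList, concurrent adds neither throw nor lose entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified stand-in for ClientWarn.State; not the actual Cassandra class.
final class WarningState {
    // Each add() on a CopyOnWriteArrayList is atomic with respect to other writers.
    private final List<String> warnings = new CopyOnWriteArrayList<>();

    void add(String warning) {
        warnings.add(warning);
    }

    List<String> snapshot() {
        return new ArrayList<>(warnings);
    }

    /** Starts `threads` writers that each add `perThread` warnings; returns the final count. */
    static int concurrentAdds(int threads, int perThread) {
        WarningState state = new WarningState();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // release all writers together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    state.add("warning " + id + "/" + i);
                }
            });
            workers[t].start();
        }
        start.countDown();
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return state.snapshot().size();
    }

    public static void main(String[] args) {
        System.out.println(concurrentAdds(4, 1_000)); // prints 4000
    }
}
```

      CopyOnWriteArrayList copies its backing array on every write, so it suits lists that stay small and are written rarely, which is the usual shape of a per-request warnings list.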

       

      Demonstration branch: https://github.com/apache/cassandra/compare/trunk...aratno:cassandra:CASSANDRA-19427-aiobe-clientwarn-demo

      Fix branch: https://github.com/apache/cassandra/compare/trunk...aratno:cassandra:CASSANDRA-19427-aiobe-clientwarn-fix (PR linked below)

       

      This appears to have been an issue since at least 3.11, the earliest release I checked.

          People

            aratnofsky Abe Ratnofsky
            aratnofsky Abe Ratnofsky
            Abe Ratnofsky
            Caleb Rackliffe, Stefan Miklosovic
              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 1h 40m