Cassandra: CASSANDRA-15304

Timeout in receiving streams while repairing causes corruption


Details

    • Environment: Java 8, OpenJDK, Linux, x86, HDD

    Description

      I have a 4-node cluster. When running a repair, node3 streams SSTables to node1. If node3 hangs for some reason (in my case I/O, CPU, or GC pressure) and the transfer times out (this can also happen because of a network problem), node1 is left with corrupted SSTable files and no error is reported.
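      One general way for a receiver to avoid exposing a half-written file is to stream into a temporary file and atomically rename it only after every expected byte has arrived, deleting the temporary file on any failure. The following is a minimal sketch under that assumption; the class SafeStreamReceiver, the method receive and the ".tmp" suffix are hypothetical names, and this is not Cassandra's streaming implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical receiver for illustration only; NOT Cassandra's streaming code.
public final class SafeStreamReceiver {

    /**
     * Copies exactly expectedLength bytes from in to target.
     * Bytes are written to a temporary sibling file first; the data becomes
     * visible under its final name only after all bytes have arrived.
     * On any failure (timeout, premature EOF, I/O error) the temporary
     * file is deleted and the exception is rethrown.
     */
    public static void receive(InputStream in, long expectedLength, Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        boolean published = false;
        try {
            try (OutputStream out = Files.newOutputStream(tmp)) {
                byte[] buf = new byte[8192];
                long remaining = expectedLength;
                while (remaining > 0) {
                    int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                    if (n < 0) {
                        // Sender went away before the full payload arrived.
                        throw new IOException("stream ended after "
                                + (expectedLength - remaining) + " of " + expectedLength + " bytes");
                    }
                    out.write(buf, 0, n);
                    remaining -= n;
                }
            }
            // Atomic rename: readers see either no file or the complete file.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            published = true;
        } finally {
            if (!published) {
                // Never leave a partial file behind for compaction to pick up.
                Files.deleteIfExists(tmp);
            }
        }
    }
}
```

      If the underlying socket has a read timeout set (Socket.setSoTimeout), a stalled sender surfaces as a SocketTimeoutException, which is an IOException subclass, so the hang described above would go through the same cleanup path. A production version would also fsync the data before the rename; the sketch omits that for brevity.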

      When node1 later starts compacting, it does indeed report a corruption error:

      org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /var/lib/cassandra/data/keyspace/cf-28189320ff8211e7961c1fd53f574685/md-15146-big-Data.db
       at org.apache.cassandra.io.util.CompressedChunkReader$Mmap.readChunk(CompressedChunkReader.java:227) ~[apache-cassandra-3.11.4.jar:3.11.4]
       at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:158) ~[apache-cassandra-3.11.4.jar:3.11.4]
       at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:39) ~[apache-cassandra-3.11.4.jar:3.11.4]
       at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache
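      The exception is raised by the compressed chunk reader, presumably because a chunk read back from disk fails verification, which is consistent with a file whose contents were never fully written. The sketch below illustrates per-chunk checksum verification in general using java.util.zip.CRC32. The class ChunkVerifier, the trailing 4-byte checksum layout and CorruptChunkException are assumptions for illustration only; they do not reproduce Cassandra's actual SSTable format or reader.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical illustration; NOT Cassandra's on-disk format or reader.
public final class ChunkVerifier {

    /** Thrown when a chunk is truncated or its stored checksum does not match. */
    public static final class CorruptChunkException extends RuntimeException {
        public CorruptChunkException(String msg) {
            super(msg);
        }
    }

    /** Returns the chunk bytes followed by a 4-byte CRC32 of those bytes. */
    public static byte[] seal(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return ByteBuffer.allocate(chunk.length + 4)
                .put(chunk)
                .putInt((int) crc.getValue())
                .array();
    }

    /** Checks the trailing checksum and returns the chunk bytes without it. */
    public static byte[] verify(byte[] sealed, long offset) {
        if (sealed.length < 4) {
            throw new CorruptChunkException("chunk at offset " + offset + " is truncated");
        }
        int len = sealed.length - 4;
        CRC32 crc = new CRC32();
        crc.update(sealed, 0, len);
        int stored = ByteBuffer.wrap(sealed, len, 4).getInt();
        if (stored != (int) crc.getValue()) {
            throw new CorruptChunkException("checksum mismatch in chunk at offset " + offset);
        }
        byte[] data = new byte[len];
        System.arraycopy(sealed, 0, data, 0, len);
        return data;
    }
}
```

      Verification of this kind only detects the damage at read time (here, during compaction); it cannot prevent a partially received file from being kept in the first place.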
      

      Going back through the log, I found a timeout that occurred while node1 was writing that file.
      At the same time, node3 reports:

      ERROR [ReadRepairStage:60] 2019-09-06 09:28:02,918 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:60,5,main]
      org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
              at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:202) ~[apache-cassandra-3.11.4.jar:3.11.4]
              at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:175) ~[apache-cassandra-3.11.4.jar:3.11.4]
              at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:92) ~[apache-cassandra-3.11.4.jar:3.11.4]
              at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:79) ~[apache-cassandra-3.11.4.jar:3.11.4]
              at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.4.jar:3.11.4]
              at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.4.jar:3.11.4]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_222]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_222]
              at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.4.jar:3.11.4]
              at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_222]
      INFO  [ScheduledTasks:1] 2019-09-06 09:28:02,920 MessagingService.java:1236 - MUTATION messages were dropped in last 5000 ms: 12 internal and 0 cross node. Mean internal dropped latency: 4085 ms and Mean cross-node dropped latency: 4030 ms
      INFO  [ScheduledTasks:1] 2019-09-06 09:28:02,920 StatusLogger.java:47 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
      INFO  [ScheduledTasks:1] 2019-09-06 09:28:02,924 StatusLogger.java:51 - MutationStage                    16      2475      595615310         0                 0
      ...
      ...
      INFO  [ScheduledTasks:1] 2019-09-06 09:28:07,930 MessagingService.java:1236 - MUTATION messages were dropped in last 5000 ms: 621 internal and 783 cross node. Mean internal dropped latency: 3176 ms and Mean cross-node dropped latency: 2801 ms
      ....
      INFO  [ScheduledTasks:1] 2019-09-06 09:28:12,939 MessagingService.java:1236 - READ messages were dropped in last 5000 ms: 0 internal and 8 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 7247 ms
      ...
      

      This happens almost every time, and I'm often unable to scrub the SSTable because of CASSANDRA-15284.

          People

            Assignee: Unassigned
            Reporter: Gianluigi Tiesi (sherpya)
            Votes: 0
            Watchers: 2