I noticed that replication sources could get stuck during some tests that involved an RS restart. In these cases, upon RS restart, the newly created normal source reached the end of the WAL without recognising that the file was still open for write, which led to the WAL being removed from the source queue. As a result, no new operations get replicated unless this log goes to a recovery queue.
Checking this further, my understanding is that, during restart, the RS starts the replication services, which initialise ReplicationSourceManager and a ReplicationSource for each WAL group id, in the sequence below:
At this point, the ReplicationSources have no paths yet, so the WAL reader thread is not running. ReplicationSourceManager is registered as a WAL listener so that it gets notified whenever a new WAL file is available. While the ReplicationSourceManager and ReplicationSource instances are being created, a WALFileLengthProvider instance is obtained from WALProvider and cached by both of them. The default implementation of this WALFileLengthProvider is below, on the WALProvider interface:
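A simplified model of the behaviour described above may help when reading the rest of this report. It is only a sketch: `Wal`, `FileLengthProvider` and `lengthProviderOver` are hypothetical stand-ins, not the real HBase types, and the only property it is meant to show is that the provider consults the current list of WALs on every call and answers "empty" when that list is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;
import java.util.function.Supplier;

final class LengthProviderSketch {
    // Hypothetical stand-in for a WAL: reports a length only for a file it is writing.
    interface Wal {
        OptionalLong getLogFileSizeIfBeingWritten(String path);
    }

    // Hypothetical stand-in for WALFileLengthProvider.
    interface FileLengthProvider {
        OptionalLong getLogFileSizeIfBeingWritten(String path);
    }

    // Builds a provider that asks every WAL currently returned by getWALs.
    // The supplier is evaluated on each call, so the answer depends on what
    // getWALs returns at that moment.
    static FileLengthProvider lengthProviderOver(Supplier<List<Wal>> getWALs) {
        return path -> getWALs.get().stream()
            .map(w -> w.getLogFileSizeIfBeingWritten(path))
            .filter(OptionalLong::isPresent)
            .findAny()
            .orElse(OptionalLong.empty());
    }

    public static void main(String[] args) {
        List<Wal> wals = new ArrayList<>();
        FileLengthProvider provider = lengthProviderOver(() -> wals);

        // No WAL registered yet: the provider cannot tell that the file is open.
        System.out.println(provider.getLogFileSizeIfBeingWritten("wal.1").isPresent());

        // Once a WAL writing "wal.1" is visible, the same provider reports its length.
        wals.add(p -> p.equals("wal.1") ? OptionalLong.of(42L) : OptionalLong.empty());
        System.out.println(provider.getLogFileSizeIfBeingWritten("wal.1").isPresent());
    }
}
```

In this model the first lookup is empty and the second is present, even though the provider instance itself never changes.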
Notice that if WALProvider.getWALs returns an empty list, this WALFileLengthProvider instance will always return an empty result. This is relevant because when ReplicationSource finally starts the ReplicationSourceWALReader thread, it passes this WALFileLengthProvider to it, and WALEntryStream (inside the WAL reader) uses it to determine whether the WAL is being written (and should therefore be kept in the queue) here:
Here is the code snippet for the WALEntryStream.readNextEntryAndRecordReaderPosition() method, which relies on the WALFileLengthProvider:
The problem can occur when the WAL file is actually created in the AbstractFSWALProvider.getWAL() method (snippet shown below). The line marked as #2 in the snippet below triggers notification of the registered WALListeners, including ReplicationSourceManager, which then starts the ReplicationSourceWALReader thread. If the ReplicationSourceWALReader thread reaches point #1 from the snippet above before the thread running AbstractFSWALProvider.getWAL() gets to point #3 from the snippet below, then WALFileLengthProvider returns empty and the WAL is not considered open, causing it to be dequeued:
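The ordering issue can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the HBase code: `PublishAfterNotifySketch` and its members are hypothetical, the WAL is represented by a plain string, and the listener is invoked synchronously to stand in for the worst-case interleaving in which the reader thread runs between the notification (#2) and the assignment of the WAL field (#3).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

final class PublishAfterNotifySketch {
    // Hypothetical stand-in for the provider's WAL field.
    private volatile String wal;
    // Hypothetical stand-in for the registered WAL listeners.
    private final Consumer<PublishAfterNotifySketch> listener;

    PublishAfterNotifySketch(Consumer<PublishAfterNotifySketch> listener) {
        this.listener = listener;
    }

    // Models a getWALs that never creates a WAL: empty until the field is set.
    List<String> getWALs() {
        String w = wal;
        return w == null ? Collections.<String>emptyList() : Collections.singletonList(w);
    }

    synchronized String getWAL() {
        if (wal != null) {
            return wal;
        }
        String created = "wal.1"; // create the WAL
        listener.accept(this);    // notify listeners (models point #2)
        wal = created;            // publish the WAL (models point #3)
        return created;
    }

    public static void main(String[] args) {
        List<Integer> seenByListener = new ArrayList<>();
        PublishAfterNotifySketch provider =
            new PublishAfterNotifySketch(p -> seenByListener.add(p.getWALs().size()));
        provider.getWAL();
        // The listener ran before the field was assigned, so it saw no WALs.
        System.out.println("WALs seen by listener: " + seenByListener.get(0));
        System.out.println("WALs seen afterwards: " + provider.getWALs().size());
    }
}
```

In the real scenario the outcome depends on thread timing, which is why the problem only shows up intermittently.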
This can be fixed by making AbstractFSWALProvider.getWALs reuse the AbstractFSWALProvider.getWAL method to obtain the WAL instance. Do we really have scenarios where we want getWALs to return no WAL instance? Another possibility would be to synchronize getWALs on the same lock currently used by getWAL.
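To make the first option concrete, here is a rough threaded model of it. It is a sketch only: `GetWalsViaGetWalSketch`, `createLock` and the other names are hypothetical, a plain monitor replaces whatever lock the real provider uses, and the `Thread.sleep` merely widens the window between notification and publication. Because getWALs goes through getWAL, a reader started at the notification step blocks on the creation lock until the WAL field has been assigned, and so never observes an empty list.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class GetWalsViaGetWalSketch {
    private volatile String wal;                    // hypothetical WAL field
    private final Object createLock = new Object(); // hypothetical creation lock
    private Thread reader;
    final AtomicInteger walsSeenByReader = new AtomicInteger(-1);

    String getWAL() {
        String copy = wal;
        if (copy != null) {
            return copy;
        }
        synchronized (createLock) {
            copy = wal;
            if (copy != null) {
                return copy;
            }
            copy = "wal.1";  // create the WAL
            startReader();   // notify listeners, which start a reader (models #2)
            pause(50);       // widen the window before publication
            wal = copy;      // publish the WAL (models #3)
            return copy;
        }
    }

    // Proposed behaviour: delegate to getWAL instead of returning an empty list.
    List<String> getWALs() {
        return Collections.singletonList(getWAL());
    }

    private void startReader() {
        reader = new Thread(() -> walsSeenByReader.set(getWALs().size()));
        reader.start();
    }

    void awaitReader() {
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        GetWalsViaGetWalSketch provider = new GetWalsViaGetWalSketch();
        provider.getWAL();
        provider.awaitReader();
        System.out.println("WALs seen by reader: " + provider.walsSeenByReader.get());
    }
}
```

The second option (taking the same lock inside getWALs) would give the reader the same blocking behaviour without creating a WAL as a side effect of listing them.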
I am proposing an initial patch with the first solution. After some tests, it does seem to be enough to fix the problem.