Fix concurrency issue between AbandonPendingBacklog() and CheckBacklogForTimeouts(), and remove backlog locking #2430

kornelpal · 2023-04-05T03:44:13Z

There is only one ProcessBacklogAsync() thread running at a time, and all current backlog locks are taken within that thread, so these locks are not needed. On the other hand, AbandonPendingBacklog() can run concurrently with the ProcessBacklogAsync() thread, which runs CheckBacklogForTimeouts(), but AbandonPendingBacklog() does not lock the backlog. As a result, CheckBacklogForTimeouts() can leave a dequeued message abandoned in an uncompleted (hung) state. This fix resolves the concurrency issue by introducing an _abandonPendingBacklogException field, which also makes it possible to remove the lock. The "failed" message is completed with the thrown exception to make any potential concurrency issues more visible.

mgravell · 2023-04-05T06:39:47Z

This feels like we're solving the wrong problem; IMO we should be fixing whatever gap it is falling in currently. I need to look carefully at what is going on here, but I don't see that we should need the new exception bits.

kornelpal · 2023-04-05T12:36:53Z

I've created this lock-free fix inspired by this comment.

At one point #2397 had the following fix to the same problem using a lock, if you prefer that:

private void AbandonPendingBacklog(Exception ex)
{
    while (true)
    {
        Message? next;
        lock (_backlog)
        {
            if (!BacklogTryDequeue(out next)) break;
        }

        Multiplexer?.OnMessageFaulted(next, ex);
        next.SetExceptionAndComplete(ex, this);
    }
}

mgravell · 2023-04-05T15:55:53Z

> that can result in concurrency issues

Can you please be very explicit about which concurrency issue we're discussing? What is the actual symptom that we're looking at resolving here? To understand whether this change resolves it, I first need a clear picture of what "it" is.

So: talk me through it; what scenario are we discussing?

kornelpal · 2023-04-05T20:48:41Z

Although PhysicalBridge._backlog is a ConcurrentQueue, PhysicalBridge.CheckBacklogForTimeouts() is using it in a non-thread-safe way. The existing comment from that method describes it best:

Because peeking at the backlog, checking message and then dequeuing, is not thread-safe, we do have to use
a lock here, for mutual exclusion of backlog DEQUEUERS. Unfortunately.

When not all dequeuers lock the backlog, CheckBacklogForTimeouts() can dequeue a message and then abandon it without the message ever being completed.

Code from inside the lock in CheckBacklogForTimeouts() annotated by me for the problematic scenario:

// There is a message in the backlog, so no break.
if (!_backlog.TryPeek(out message)) break;
// The message peeked at has timed out, so no break.
if (!message.HasTimedOut(now, timeout, out var _)) break;
// Another thread without locking the backlog already dequeued the previous message
// between the TryPeek() and BacklogTryDequeue() calls.
// Scenario 1; there were no messages left: This is not really an issue.
// Scenario 2; another message (message2) was dequeued: It may or may not be timed out,
// but the current logic does not care, just abandons the message and it will not be completed
// as it is not stored anywhere else. This is a problem for async messages only,
// not for sync (wait timeout), or F+F (not completed otherwise either).
if (!BacklogTryDequeue(out var message2) || (message != message2))
{
    // In both Scenario 1 and 2 the backlog processing thread fails,
    // but a new one will be started by the heartbeat or by adding a message to the backlog.
    throw new RedisException("Thread safety bug detected! A queue message disappeared while we had the backlog lock");
}

Methods dequeuing from the backlog:

CheckBacklogForTimeouts(): Properly locks the backlog, and only runs on the backlog processing thread.
ProcessBacklogAsync(): Properly locks the backlog, and only runs on the backlog processing thread.
AbandonPendingBacklog(): Does not lock the backlog and can run concurrently with the backlog processing thread. Called from PhysicalBridge.Dispose(), ~PhysicalBridge() and PhysicalBridge.OnConnectionFailed().

Since the two methods that actually lock the backlog cannot run concurrently, the current lock is just an overhead.

On the other hand, because AbandonPendingBacklog() does not lock the backlog, the concurrency issue described in the annotated code above can occur, leaving one task per occurrence in a hung state.
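To make the interleaving concrete, here is a minimal, single-threaded sketch that replays the problematic order of operations on a plain ConcurrentQueue. It is not the library's code: the class name, method name and string "messages" are made up for illustration, and the second dequeuer is simulated inline rather than on a real thread.

```csharp
using System;
using System.Collections.Concurrent;

public static class PeekThenDequeueDemo
{
    // Replays the race deterministically: returns true when the item that was
    // dequeued is not the item that was peeked at (Scenario 2 above).
    public static bool InterleavedDequeueMismatches()
    {
        var backlog = new ConcurrentQueue<string>();
        backlog.Enqueue("message1");
        backlog.Enqueue("message2");

        // Timeout checker: peeks at the head and decides it has timed out.
        backlog.TryPeek(out var peeked);

        // Another dequeuer that does not take the lock (simulated inline)
        // removes the head between the peek and the dequeue.
        backlog.TryDequeue(out var takenByOther);

        // Timeout checker: dequeues what it assumes is the peeked message,
        // but actually receives the next one.
        backlog.TryDequeue(out var dequeued);

        // "message2" is now held by nobody else; if the checker throws instead
        // of completing it, an awaiting caller would never be released.
        return takenByOther == peeked && dequeued != peeked;
    }

    public static void Main()
    {
        Console.WriteLine(InterleavedDequeueMismatches());
    }
}
```

Each individual ConcurrentQueue call is thread-safe; it is the peek, check, dequeue sequence as a whole that is not atomic, which is why every dequeuer has to take part in whatever mutual exclusion is used.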

kornelpal · 2023-04-06T05:14:30Z

I've added a test for the Dispose() case. It fails without the fix and succeeds with the fix. It should be possible to trigger the issue with BacklogPolicy.AbortPendingOnConnectionFailure = true too, but I don't know how to simulate a connection failure with a large backlog.

kornelpal · 2023-04-06T12:32:17Z

I just realized that clearing _abandonPendingBacklogException at the end of AbandonPendingBacklog() can result in CheckBacklogForTimeouts() failing when AbandonPendingBacklog() is running on multiple threads in parallel, so more complexity (like a wrapper for the backlog) would be needed for reliable bug detection in CheckBacklogForTimeouts().

kornelpal · 2023-04-12T17:41:55Z

I have one more idea, inspired by PhysicalBridge.HasPendingCallerFacingItems(): instead of removing items, CheckBacklogForTimeouts() could be changed to enumerate the items, and ProcessBridgeBacklogAsync() could be changed to ignore completed items. This way the concurrency issue would be eliminated and there would be no need for a lock or the exception field. Although it adds some compute overhead, checking for timeout in ProcessBridgeBacklogAsync() again might be simpler than adding tweaks in other places to complete timed-out sync messages and identify timed-out F+F messages (which never have a result box). Update: this might not be a good option, as it keeps all the messages in the backlog during an extended outage.
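A rough sketch of that enumerate-and-skip idea, under the assumption that each message can be completed exactly once via an atomic flag. SketchMessage, MarkTimedOut and Drain are hypothetical names invented for this illustration, not types or methods from StackExchange.Redis.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for a backlog message; not the library's Message type.
public sealed class SketchMessage
{
    public const int Pending = 0, TimedOut = 1, Sent = 2;

    private int _state = Pending;

    public SketchMessage(DateTime deadline) => Deadline = deadline;

    public DateTime Deadline { get; }

    public int State => Volatile.Read(ref _state);

    // Only the first caller moves the message out of Pending; later callers get false.
    public bool TryComplete(int newState) =>
        Interlocked.CompareExchange(ref _state, newState, Pending) == Pending;
}

public static class EnumeratingBacklogSketch
{
    // Timeout check: enumerates the queue without removing anything, so it
    // cannot race with a dequeuer over which item is at the head.
    public static int MarkTimedOut(ConcurrentQueue<SketchMessage> backlog, DateTime now)
    {
        int marked = 0;
        foreach (var message in backlog)
        {
            if (message.Deadline <= now && message.TryComplete(SketchMessage.TimedOut)) marked++;
        }
        return marked;
    }

    // Backlog processing: the only place that dequeues; messages that were
    // already completed by the timeout check are simply skipped.
    public static int Drain(ConcurrentQueue<SketchMessage> backlog)
    {
        int sent = 0;
        while (backlog.TryDequeue(out var message))
        {
            if (message.TryComplete(SketchMessage.Sent)) sent++;
        }
        return sent;
    }
}
```

Note that timed-out messages stay in the queue until the next drain, which is the drawback mentioned in the update above.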

Commits:

9035679 Fix concurrency issue between AbandonPendingBacklog() and CheckBacklogForTimeouts(), and remove backlog locking.

0a7b544 Add test for Dispose() case.
