
Durable function stopped but with status Running - Split brain detected error #1046

potomato opened this issue Feb 23, 2024 · 4 comments

I use a small number of durable functions for frequent (per-minute) singleton jobs. They are launched by a regular timer-triggered function, and each orchestration function invokes one worker function.

Periodically we find that the orchestration and worker functions have stopped being invoked, because the timer-triggered function receives a Running status when it queries the orchestration instance, even though the orchestration hasn't actually run for (potentially) days.
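
Roughly, the timer trigger does something like this (an illustrative Python sketch rather than our exact code; the orchestrator name and instance id are placeholders):

```python
import azure.durable_functions as df
import azure.functions as func

INSTANCE_ID = "singleton-job"  # placeholder: one fixed instance id per job


async def main(mytimer: func.TimerRequest, starter: str) -> None:
    client = df.DurableOrchestrationClient(starter)

    # Only start a new orchestration if the previous one is no longer active.
    status = await client.get_status(INSTANCE_ID)
    runtime = status.runtime_status if status else None
    if runtime in (None,
                   df.OrchestrationRuntimeStatus.Completed,
                   df.OrchestrationRuntimeStatus.Failed,
                   df.OrchestrationRuntimeStatus.Terminated):
        await client.start_new("MyOrchestrator", INSTANCE_ID)
    # Once an instance is stuck reporting Running, this branch never runs
    # again, which matches the behaviour we see.
```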

I have captured and investigated Application Insights logs, which include "Split brain detected: Another worker already updated the history for this instance - the 4 history event result(s) will be discarded". It seems the Completed status was among the discarded events.

I don't believe another worker was running, though; there was only ever one of each of HostInstanceId, InvocationId and ProcessId in the logs.
The error seems to have come from Azure Storage rejecting an update with a "precondition not met" failure, i.e. a conditional update whose precondition was not satisfied.
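
For anyone less familiar with the Table Storage semantics involved, this is the kind of ETag-guarded write that produces that failure (a generic azure-data-tables sketch, not the extension's actual code, which uses the .NET storage SDK; the connection string, table name and keys are placeholders):

```python
from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError
from azure.data.tables import TableClient, UpdateMode

conn_str = "<storage-connection-string>"   # placeholder
table_name = "<TaskHubName>History"        # placeholder history table name
instance_id = "singleton-job"              # placeholder
row_key = "0000000000000000"               # placeholder history row key

table = TableClient.from_connection_string(conn_str, table_name=table_name)
entity = table.get_entity(partition_key=instance_id, row_key=row_key)
entity["SomeColumn"] = "new value"

try:
    table.update_entity(
        entity,
        mode=UpdateMode.MERGE,
        etag=entity.metadata["etag"],
        match_condition=MatchConditions.IfNotModified,  # only if nobody wrote since we read
    )
except ResourceModifiedError:
    # HTTP 412 Precondition Failed: the ETag no longer matches, which the
    # extension reports as "another worker already updated the history".
    pass
```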

I've attached a snippet of logs from the incident. I have more detail if required.

What exactly happened to cause this, what can I look for in the logs, and what actions can I take to mitigate it? Currently we have to go and clean up the Storage tables by hand to make it work again (see the sketch below for the API-level cleanup I'd expect to use).
DurableFailureLogsForGH.csv
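
For reference, that API-level cleanup looks something like this (an illustrative sketch; whether terminate/purge actually succeed once the tables are already inconsistent is exactly what I'm unsure about, and the orchestrator name is a placeholder):

```python
import azure.durable_functions as df


async def reset_stuck_singleton(starter: str, instance_id: str) -> None:
    client = df.DurableOrchestrationClient(starter)

    # Terminate whatever the Instances table still reports as Running,
    # purge its history, then start a fresh instance under the same id.
    await client.terminate(instance_id, reason="Stuck in Running after split-brain")
    await client.purge_instance_history(instance_id)
    await client.start_new("MyOrchestrator", instance_id)  # placeholder name
```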

Thanks.

cgillum (Collaborator) commented Feb 23, 2024

I was working on an investigation recently where I saw very similar symptoms. An orchestration got stuck in a "Running" state and the root cause seems to be an unexpected "split brain" warning resulting in an ExecutionCompleted history event being discarded. It also seemed like a potentially bogus split-brain warning because there was only a single worker running.

My theory at this point is that some error was returned from Azure Storage that was miscategorized as an eTag violation. I didn't have access to the Azure Storage logs to confirm what the actual error was. By any chance, did this happen recently enough that you're able to get the Azure Storage logs and see what the error was?
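
To make that theory concrete, the distinction I have in mind is roughly the following (illustrative Python only; the actual provider is the .NET DurableTask.AzureStorage code):

```python
from azure.core.exceptions import HttpResponseError, ResourceModifiedError


def is_etag_conflict(err: Exception) -> bool:
    """Only a genuine 412 Precondition Failed should be treated as
    'another worker already updated the history' (split brain)."""
    return isinstance(err, ResourceModifiedError) or (
        isinstance(err, HttpResponseError) and err.status_code == 412
    )

# If some other failure (timeout, throttling, a 5xx) ends up classified as an
# eTag conflict, we would log a bogus split-brain warning and discard a
# perfectly good ExecutionCompleted event, which is what the symptoms suggest.
```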

cgillum (Collaborator) commented Feb 23, 2024

Looking at your logs, regardless of whether the warning was categorized correctly, the fundamental issue seems to be that the History table was updated to show that the orchestration completed, but the Instances table was never updated because of the error. Since we aren't able to run a transaction across these two tables, the orchestration is left in an inconsistent state. We don't correctly recover from this because the worker only considers the runtime status in the History table when deciding whether to discard work items; it doesn't consult the Instances table.
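
In simplified terms, a completion checkpoint is two separate writes with nothing tying them together (illustrative Python with in-memory stand-ins for the two tables, not the actual provider code):

```python
# In-memory stand-ins for the two Azure tables; illustration only.
history_table = {}    # instance_id -> list of history events
instances_table = {}  # instance_id -> runtime status that status queries read


def checkpoint_completion(instance_id: str, new_events: list) -> None:
    # Step 1: append the new history rows, including ExecutionCompleted.
    # In the real provider this is an ETag-guarded Table Storage write, and a
    # 412 here is what surfaces as the "split brain" warning.
    history_table.setdefault(instance_id, []).extend(new_events)

    # Step 2: update the Instances row, which is what GetStatus reads.
    # There is no transaction spanning steps 1 and 2, so if step 1 succeeds
    # server-side but is reported to the client as a failure, step 2 never
    # runs: History says Completed while Instances still says Running.
    instances_table[instance_id] = "Completed"
```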

This particular case is odd because the History update operation succeeded on the Azure Storage side but was reported as a failure on the client side, preventing us from updating the Instances table with the Completed status (the Instances table is what we use for status query operations). The Azure Storage logs could help here because we'd want to confirm whether the operation actually succeeded or failed from the storage API's perspective. That would at least let us know whether the problem is in Azure Storage itself or in the Azure Storage SDK we're using.

Regardless, I think there is a behavior we need to fix on our side: double-check the status of the orchestration in the Instances table before deciding to discard a work item. I can see other corner-case failure conditions where this would be beneficial.
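
Roughly the shape of check I have in mind, where the reconciliation step is just one option (a hypothetical sketch, not a design commitment):

```python
FINISHED = ("Completed", "Failed", "Terminated")


def decide(history_status: str, instances_status: str) -> str:
    """Decide what to do with a work item, given what the History table and
    the Instances table each report for the orchestration. Pure decision
    logic; the storage reads are omitted."""
    if history_status in FINISHED and instances_status == "Running":
        # The two tables disagree (the state described in this issue):
        # reconcile the Instances row instead of silently discarding.
        return "repair-instances-then-discard"
    if history_status in FINISHED:
        return "discard"
    return "process"


# The stuck case from the attached logs:
print(decide("Completed", "Running"))  # repair-instances-then-discard
```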

@cgillum cgillum added the bug label Feb 23, 2024
potomato (Author) commented Feb 23, 2024

Thanks very much for your replies. Your comments sound promising.

Unfortunately I don't think we have diagnostic logging on the storage account where this happened. I'll check on Monday morning (it's Friday night here) but I don't think we'll get the detail we'd like for this one. I will look at environments that have logging to see if we have any Split brain errors, and if so will report back with storage logs.

But as you say, if there's a way to handle the error better and keep state more consistent then we'll be better off.

Thanks for your work on Durable Functions, and also Dapr!

potomato (Author) commented
Hi, I checked across all our subscriptions and unfortunately there isn't one where storage logs are turned on and where we've also had a Split Brain trace.

If it happens in an environment with logging turned on I can capture it then.

In the meantime is there anything I can do to help progress the fix?

Thanks.
