
Durable function stopped but with status Running - Split brain detected error #1046

potomato opened this issue Feb 23, 2024 · 4 comments

I use a small number of durable functions for frequent (per-minute) singleton jobs. They are launched by a regular timer-triggered function, and each orchestration function invokes one worker function.

Periodically we find that the orchestration and worker functions have stopped being invoked, because the timer-triggered function receives a Running status when it queries the orchestration instance, even though the orchestration hasn't actually run for (potentially) days.
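
Roughly, the timer trigger does something like this (an illustrative Python sketch rather than our exact code; the orchestrator name and instance id are placeholders):

```python
import azure.durable_functions as df
import azure.functions as func

INSTANCE_ID = "singleton-job"  # placeholder: one fixed instance id per job


async def main(mytimer: func.TimerRequest, starter: str) -> None:
    client = df.DurableOrchestrationClient(starter)

    # Only start a new orchestration if the previous one is no longer active.
    status = await client.get_status(INSTANCE_ID)
    runtime = status.runtime_status if status else None
    if runtime in (None,
                   df.OrchestrationRuntimeStatus.Completed,
                   df.OrchestrationRuntimeStatus.Failed,
                   df.OrchestrationRuntimeStatus.Terminated):
        await client.start_new("MyOrchestrator", INSTANCE_ID)
    # Once an instance is stuck reporting Running, this branch never runs
    # again, which matches the behaviour we see.
```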

I have captured and investigated Application Insights logs, which include "Split brain detected: Another worker already updated the history for this instance - the 4 history event result(s) will be discarded". It seems the Completed status was among the discarded events.

I don't believe another worker was running, though; there was only ever one of each of HostInstanceId, InvocationId and ProcessId in the logs.
The error seems to have come from Azure Storage rejecting an update with a "precondition not met" failure, i.e. a conditional update whose precondition was not satisfied.
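
For anyone less familiar with the Table Storage semantics involved, this is the kind of ETag-guarded write that produces that failure (a generic azure-data-tables sketch, not the extension's actual code, which uses the .NET storage SDK; the connection string, table name and keys are placeholders):

```python
from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError
from azure.data.tables import TableClient, UpdateMode

conn_str = "<storage-connection-string>"   # placeholder
table_name = "<TaskHubName>History"        # placeholder history table name
instance_id = "singleton-job"              # placeholder
row_key = "0000000000000000"               # placeholder history row key

table = TableClient.from_connection_string(conn_str, table_name=table_name)
entity = table.get_entity(partition_key=instance_id, row_key=row_key)
entity["SomeColumn"] = "new value"

try:
    table.update_entity(
        entity,
        mode=UpdateMode.MERGE,
        etag=entity.metadata["etag"],
        match_condition=MatchConditions.IfNotModified,  # only if nobody wrote since we read
    )
except ResourceModifiedError:
    # HTTP 412 Precondition Failed: the ETag no longer matches, which the
    # extension reports as "another worker already updated the history".
    pass
```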

I've attached a snippet of logs from the incident. I have more detail if required.

What exactly happened to cause this, what can I look for in the logs, and what actions can I take to mitigate it? Currently we have to go and clean up the Storage tables by hand to make it work again (see the sketch below for the API-level cleanup I'd expect to use).
DurableFailureLogsForGH.csv
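
For reference, that API-level cleanup looks something like this (an illustrative sketch; whether terminate/purge actually succeed once the tables are already inconsistent is exactly what I'm unsure about, and the orchestrator name is a placeholder):

```python
import azure.durable_functions as df


async def reset_stuck_singleton(starter: str, instance_id: str) -> None:
    client = df.DurableOrchestrationClient(starter)

    # Terminate whatever the Instances table still reports as Running,
    # purge its history, then start a fresh instance under the same id.
    await client.terminate(instance_id, reason="Stuck in Running after split-brain")
    await client.purge_instance_history(instance_id)
    await client.start_new("MyOrchestrator", instance_id)  # placeholder name
```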

Thanks.

cgillum (Collaborator) commented Feb 23, 2024

I was working on an investigation recently where I saw very similar symptoms. An orchestration got stuck in a "Running" state and the root cause seems to be an unexpected "split brain" warning resulting in an ExecutionCompleted history event being discarded. It also seemed like a potentially bogus split-brain warning because there was only a single worker running.

My theory at this point is that some error was returned from Azure Storage that was miscategorized as an eTag violation. I didn't have access to the Azure Storage logs to confirm what the actual error was. By any chance, did this happen recently enough that you're able to get the Azure Storage logs and see what the error was?
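
To make that theory concrete, the distinction I have in mind is roughly the following (illustrative Python only; the actual provider is the .NET DurableTask.AzureStorage code):

```python
from azure.core.exceptions import HttpResponseError, ResourceModifiedError


def is_etag_conflict(err: Exception) -> bool:
    """Only a genuine 412 Precondition Failed should be treated as
    'another worker already updated the history' (split brain)."""
    return isinstance(err, ResourceModifiedError) or (
        isinstance(err, HttpResponseError) and err.status_code == 412
    )

# If some other failure (timeout, throttling, a 5xx) ends up classified as an
# eTag conflict, we would log a bogus split-brain warning and discard a
# perfectly good ExecutionCompleted event, which is what the symptoms suggest.
```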

cgillum (Collaborator) commented Feb 23, 2024

Looking at your logs, regardless of whether the warning was categorized correctly, the fundamental issue seems to be that the History table was updated to show that the orchestration completed, but the Instances table was never updated because of the error. Since we aren't able to run a transaction across these two tables, the orchestration is left in an inconsistent state. We don't correctly recover from this because the worker only considers the runtime status in the History table when deciding whether to discard work items; it doesn't consult the Instances table.
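
In simplified terms, a completion checkpoint is two separate writes with nothing tying them together (illustrative Python with in-memory stand-ins for the two tables, not the actual provider code):

```python
# In-memory stand-ins for the two Azure tables; illustration only.
history_table = {}    # instance_id -> list of history events
instances_table = {}  # instance_id -> runtime status that status queries read


def checkpoint_completion(instance_id: str, new_events: list) -> None:
    # Step 1: append the new history rows, including ExecutionCompleted.
    # In the real provider this is an ETag-guarded Table Storage write, and a
    # 412 here is what surfaces as the "split brain" warning.
    history_table.setdefault(instance_id, []).extend(new_events)

    # Step 2: update the Instances row, which is what GetStatus reads.
    # There is no transaction spanning steps 1 and 2, so if step 1 succeeds
    # server-side but is reported to the client as a failure, step 2 never
    # runs: History says Completed while Instances still says Running.
    instances_table[instance_id] = "Completed"
```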

This particular case is odd because the History update operation succeeded on the Azure Storage side but was reported as a failure on the client side, preventing us from updating the Instances table with the Completed status (the Instances table is what we use for status query operations). The Azure Storage logs could help here because we'd want to confirm whether the operation actually succeeded or failed from the storage API's perspective. That would at least let us know whether the problem is in Azure Storage itself or in the Azure Storage SDK we're using.

Regardless, I think there is a behavior we need to fix on our side: double-check the status of the orchestration in the Instances table before deciding to discard a work item. I can see other corner-case failure conditions where this would be beneficial.
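
Roughly the shape of check I have in mind, where the reconciliation step is just one option (a hypothetical sketch, not a design commitment):

```python
FINISHED = ("Completed", "Failed", "Terminated")


def decide(history_status: str, instances_status: str) -> str:
    """Decide what to do with a work item, given what the History table and
    the Instances table each report for the orchestration. Pure decision
    logic; the storage reads are omitted."""
    if history_status in FINISHED and instances_status == "Running":
        # The two tables disagree (the state described in this issue):
        # reconcile the Instances row instead of silently discarding.
        return "repair-instances-then-discard"
    if history_status in FINISHED:
        return "discard"
    return "process"


# The stuck case from the attached logs:
print(decide("Completed", "Running"))  # repair-instances-then-discard
```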

@cgillum cgillum added the bug label Feb 23, 2024
potomato (Author) commented Feb 23, 2024

Thanks very much for your replies. Your comments sound promising.

Unfortunately I don't think we have diagnostic logging on the storage account where this happened. I'll check on Monday morning (it's Friday night here) but I don't think we'll get the detail we'd like for this one. I will look at environments that have logging to see if we have any Split brain errors, and if so will report back with storage logs.

But as you say, if there's a way to handle the error better and keep state more consistent then we'll be better off.

Thanks for your work on Durable Functions, and also Dapr!

potomato (Author) commented
Hi, I checked across all our subscriptions and unfortunately there isn't one where storage logs are turned on and where we've also had a Split Brain trace.

If it happens in an environment with logging turned on I can capture it then.

In the meantime is there anything I can do to help progress the fix?

Thanks.
