Prevent performance degradation from TypeMissingException #925
Conversation
Hi @cgillum, @sebastianburckhardt, could you help with a review here, please?
This looks good to me as per the explanation here: #886 (comment)
@cgillum - mind giving this a quick check?
The proposed fix in its current state could result in a tight failure loop, depending on the implementation of `IOrchestrationService`. I think we need to think a bit more about how to handle cases like this so that we don't put ourselves into either extreme: global back-offs that slow everything down, or rapid retries that can lead to other problems.

I'm open to suggestions and further discussion on how we can solve this.
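For reference, the extensibility point named throughout this thread already lets a provider choose its own delay after a processing failure. Below is a minimal sketch of a provider-side policy, assuming the hook keeps the signature implied by its name; the exception check and the numbers are illustrative, not the behavior of any shipping provider:

```csharp
using System;
using DurableTask.Core.Exceptions; // TypeMissingException

// Hedged sketch of a provider-side back-off policy for the hook named in
// this thread. The numbers and the exception check are illustrative.
class BackoffPolicySketch
{
    // Mirrors IOrchestrationService.GetDelayInSecondsAfterOnProcessException.
    public int GetDelayInSecondsAfterOnProcessException(Exception exception)
    {
        if (exception is TypeMissingException)
        {
            // The orchestration/activity type is not registered on this
            // worker: back off instead of spinning in a tight retry loop.
            return 10;
        }

        return 0; // no extra delay for other failures
    }
}
```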
```diff
@@ -387,20 +386,13 @@ async Task ProcessWorkItemAsync(WorkItemDispatcherContext context, object workItem)
                 this.LogHelper.ProcessWorkItemFailed(
                     context,
                     workItemId,
-                    $"Backing off for {BackOffIntervalOnInvalidOperationSecs} seconds",
+                    $"Work item will be re-processed after lock expires",
```
The actual behavior will be different from what's implied by this message. The call to `SafeReleaseWorkItem` further down in this method will release the lock, so the work item may actually get processed immediately, depending on how the `IOrchestrationService` decides to implement its `AbortWorkItem` logic.
I agree, this would be the behavior with the current implementation; I missed that part during my initial analysis. Can you provide some details about why the abort and release were enforced as part of the processing of a work item in the first place?

With the current code, I see the following execution path (sketched below):

1. The task is processed successfully, in which case it is safely released if the method is implemented by the provider.
2. The processing fails, in which case the task is aborted and then released.
   - 2.1. If the exception is a `TypeMissingException`, force a global back-off of the worker.
   - 2.2. For other exceptions, let the providers decide whether to apply a global back-off through the `GetDelayInSecondsAfterOnProcessException` method.
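A minimal sketch of that current shape, with stubs standing in for the framework's internals (illustrative, not the verbatim source):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative sketch of the current flow described in the list above.
class CurrentFlowSketch
{
    static Task ProcessAsync() => Task.CompletedTask;             // work item processing
    static Task SafeAbortWorkItemAsync() => Task.CompletedTask;   // provider abandon logic
    static Task SafeReleaseWorkItemAsync() => Task.CompletedTask; // explicit lock release

    static async Task ProcessWorkItemAsync()
    {
        try
        {
            await ProcessAsync();
        }
        catch (Exception)
        {
            // Failure path: abort (plus, today, the global back-off for
            // TypeMissingException discussed in this thread)...
            await SafeAbortWorkItemAsync();
        }
        finally
        {
            // ...but the release runs on success *and* failure, which is
            // why a failed work item can be picked up again immediately.
            await SafeReleaseWorkItemAsync();
        }
    }
}
```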
My proposal would be to unify the exception flow and let the providers decide how to handle the exception in terms of global back-off, while remaining backward compatible with the current capabilities of the providers. The exception flow would change as follows: for all exceptions, let the providers decide whether to apply a global back-off through the `GetDelayInSecondsAfterOnProcessException` method, and perform only the abort operation on the exception flow, skipping the release step.

By not including the SafeRelease mechanism on the exception path, I think we create a clear mechanism for the providers to handle this specific scenario of failed work attempts, through custom implementations of `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem`.

I think skipping the SafeRelease on the exception flow does not present any risks, since providers that have not yet implemented the two methods above will rely on the automatic expiration of the lock in this case, without an explicit release.
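To make the proposal concrete, here is a hedged sketch of the changed flow. The delegate fields stand in for the real `IOrchestrationService` and dispatcher members discussed above; the identifiers are illustrative rather than the framework's exact API:

```csharp
using System;
using System.Threading.Tasks;

// Hedged sketch of the proposed dispatcher error path.
class ProposedFlowSketch<TWorkItem>
{
    public Func<TWorkItem, Task> Process = _ => Task.CompletedTask;
    public Func<TWorkItem, Task> SafeReleaseWorkItem = _ => Task.CompletedTask;
    public Func<TWorkItem, Task> SafeAbortWorkItem = _ => Task.CompletedTask;
    public Func<Exception, int> GetDelayInSecondsAfterOnProcessException = _ => 0;

    public async Task ProcessWorkItemAsync(TWorkItem workItem)
    {
        try
        {
            await this.Process(workItem);

            // Success path: release the lock explicitly, exactly as today.
            await this.SafeReleaseWorkItem(workItem);
        }
        catch (Exception exception)
        {
            // Failure path: abort only. The provider's AbortWorkItem logic
            // (or the automatic lock expiry) decides when the item becomes
            // visible again; no explicit release happens here.
            await this.SafeAbortWorkItem(workItem);

            // Global back-off only if the provider asks for one.
            int delaySeconds = this.GetDelayInSecondsAfterOnProcessException(exception);
            if (delaySeconds > 0)
            {
                await Task.Delay(TimeSpan.FromSeconds(delaySeconds));
            }
        }
    }
}
```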
I have pushed a new version with the proposed implementation; let me know what you think.
> Can you provide some details about why the abort and release were enforced as part of the processing of a work item in the first place?

This is how the framework was originally designed, before I was involved in the project. But even so, the design seems reasonable to me. My understanding of message timeouts is that they are intended to allow alternate VMs to pick up a message if the current VM becomes non-responsive. I would expect message-handling failures to be dealt with using some other mechanism.

I'll respond to your other proposal on the main PR discussion thread.
Just to summarize, I understand that you're proposing two things:

1. Remove the global back-off for `TypeMissingException` and let providers control any back-off through `GetDelayInSecondsAfterOnProcessException`.
2. Remove the release step from the error-handling flow, so failed work items are only aborted.

I'm fine with (1) but worried about the unintended consequences of (2).

When you say "providers that have not yet implemented the 2 methods from above", which two methods are you referring to? By removing the release from the error-handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. That is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this. Let me know if I'm misunderstanding something.
By the 2 methods from above I mean `GetDelayInSecondsAfterOnProcessException` and `AbortWorkItem`.

To verify whether there would indeed be a penalty for this change in the existing providers, I went through each provider's implementation of these two methods. After analyzing them, my conclusion is that none of the existing providers would suffer a change in their behaviour around failed tasks. Please let me know if I missed something or if this is not aligned with the long-term vision of the OrchestrationService.
@cgillum Gentle reminder to take a look at the above.
Hi @moldovangeorge. I read your last response but I'm still not sure it covers my previous question:

> By removing the release from the error handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. This is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this.

Can you re-explain to me why we're moving the "release" out of the `finally` block and instead only executing it in the `try` block?
Hi @cgillum, sure: the reason for excluding the SafeReleaseWorkItem from the error-handling flow is to avoid excessive retries and to give the provider implementation the liberty to choose how to handle broken tasks via its `AbortWorkItem` implementation. The lock-expiry penalty you describe would apply only if there were a provider which implements neither of the two methods and therefore depends on the explicit release.

My analysis from the previous comment tried to find out whether there is any DTF provider that would be affected by this change, and I could not find any provider that would suffer a change in the behaviour related to failed tasks after this change is merged.
Thanks @moldovangeorge - I agree that removing the global backoff is a good thing, and I understand the concern about creating a tight failure loop for failures generally. One more thing I'd like to confirm before going forward with this. If I understand correctly:

1. On the success path, the work item is still released explicitly, exactly as today.
2. On the failure path, the work item is only aborted; whether (and how quickly) it becomes visible again is decided by the provider's `AbortWorkItem` and `GetDelayInSecondsAfterOnProcessException` implementations, or by lock expiry.

Is this accurate? If so, I think this is fine. However, I am a bit concerned about this:

> I could not find any provider that would suffer a change in the behaviour related to failed tasks.

As you mentioned, DurableTask.AzureStorage (the most popular backend) is protected because the abandon logic already does exponential backoff. DurableTask.ServiceBus (the original/oldest provider) might be okay because I think there should be some kind of poison message handling in the Service Bus layer to protect it from runaway errors. I'm a bit worried about the Service Fabric implementation, however. I wonder if we should include

Adding @shankarsama and @sebastianburckhardt for FYI on this behavior change, since it will affect the provider implementations you maintain. I'm inclined to accept this change, so please speak up if you have concerns.
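For reference, "protected because the abandon logic already does exponential backoff" could look roughly like the following. This is a hedged sketch of the idea, not DurableTask.AzureStorage's actual code; the identifiers and the growth/cap policy are illustrative assumptions:

```csharp
using System;
using System.Threading.Tasks;

// Hedged sketch of per-item exponential back-off inside a provider's
// abandon path, in the spirit of what is attributed to
// DurableTask.AzureStorage above.
class AbandonBackoffSketch
{
    int consecutiveFailures;

    public async Task AbandonWithBackoffAsync(Func<Task> abandonCore)
    {
        this.consecutiveFailures++;

        // 1s, 2s, 4s, ... capped at 60s, so a repeatedly failing work item
        // cannot drive a tight failure loop.
        double delaySeconds = Math.Min(Math.Pow(2, this.consecutiveFailures - 1), 60);
        await Task.Delay(TimeSpan.FromSeconds(delaySeconds));

        await abandonCore(); // the provider's real abandon operation
    }

    // Reset once an item is processed successfully.
    public void Reset() => this.consecutiveFailures = 0;
}
```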
Yes, @cgillum, your understanding from above is accurate. For the 3 providers that you mentioned (ServiceFabric, ServiceBus, and Netherite), yes, the behaviour for
Hi @cgillum, @shankarsama, @sebastianburckhardt, are there any updates on this matter?
@moldovangeorge: Just to build up context - is there a particular storage provider you're most interested in here? Say, if we improved this behavior just for Azure Storage but not for the other backends, would that work for your scenario? I'm trying to figure out whether a scoped change like that would be easier to merge. As it stands, affecting all storage providers will require multiple stakeholders to chime in, and that coordination is tricky.
@davidmrdavid This is not fixable at the provider level. Without this change, all providers suffer from the same vulnerability: a serious performance degradation from `TypeMissingException`. Since the root cause for this is found in the DTF Core, there is no provider-level alternative for fixing this.
@moldovangeorge, I took a closer look at the code, and I see what you mean now. I'm just trying to reduce the coordination effort needed to merge this: for example, the Netherite SME is not available at this time to chime in (on vacation), and they could probably be a blocker here.

A way to make this immediately safer to merge would be to turn it into opt-in behavior. Perhaps this could be a boolean setting that users opt into, but that is turned off by default. That would allow us to merge this behavior more easily, without having to worry much about implicitly changing the behavior of storage providers.
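For illustration, the opt-in could be a single flag on the dispatcher or service settings; the name below is hypothetical, invented for this sketch:

```csharp
// Hypothetical opt-in setting (name invented for illustration). The
// default of false preserves today's behavior of releasing failed items.
public class WorkItemDispatcherSettings
{
    // When true, a failed work item is only aborted, never released, so
    // its visibility is governed by AbortWorkItem or by lock expiry.
    public bool SkipReleaseOnProcessingError { get; set; } = false;
}
```

On the error path, the dispatcher would then guard the release with something like `if (!settings.SkipReleaseOnProcessingError) { await SafeReleaseWorkItemAsync(workItem); }`.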
Hi @davidmrdavid, as long as the necessary people will eventually sign off on this, I'm fine waiting. This has been open for 9 months already, and at this point I would rather wait a little longer, since transforming this into an opt-in behaviour might add some clutter in a critical part of the Core project.
Fair enough. I'll tag @sebastianburckhardt here for his 2 cents once he's back from OOF.
This change addresses the issue described here: #886