
Prevent performance degradation from TypeMissingException #925

Open
moldovangeorge wants to merge 1 commit into main from georgemoldovan/prevent_performance_degradation

Conversation

moldovangeorge

This change addresses the issue described here: #886

@moldovangeorge
Author

Hey @cgillum, @sebastianburckhardt, could you help with a review here, please?

@davidmrdavid left a comment
Collaborator

This looks good to me as per the explanation here: #886 (comment)

@davidmrdavid
Collaborator

@cgillum - mind giving this a quick check?

@cgillum left a comment
Collaborator

The proposed fix in its current state could result in a tight failure loop, depending on the implementation of IOrchestrationService. I think we need to think a bit more about how to handle cases like this so that we don't put ourselves into either extreme: global backoffs that slow everything down, or rapid retries that can lead to other problems.

I'm open to suggestions and further discussion on how we can solve this.

@@ -387,20 +386,13 @@ async Task ProcessWorkItemAsync(WorkItemDispatcherContext context, object workIt
      this.LogHelper.ProcessWorkItemFailed(
          context,
          workItemId,
-         $"Backing off for {BackOffIntervalOnInvalidOperationSecs} seconds",
+         $"Work item will be re-processed after lock expires",
Collaborator

The actual behavior will be different from what's implied by this message. The call to SafeReleaseWorkItem further down in this method will release the lock, so the work item may actually get processed again immediately, depending on how the IOrchestrationService decides to implement its AbortWorkItem logic.

Author

I agree, this would be the behavior with the current implementation; I missed that part during my initial analysis. Can you provide some details about why the abort and release were enforced as part of the processing of a work item in the first place?
With the current code, I see the following execution paths:

  1. The task is processed successfully, in which case it is safely released if the method is implemented by the provider.
  2. The processing fails, in which case the task is aborted and then released.
    2.1 If the exception is a TypeMissingException, a global back-off of the whole worker is forced.
    2.2 For other exceptions, the providers decide whether to apply a global back-off through the GetDelayInSecondsAfterOnProcessException method.

My proposal for this would be to unify the exception flow and let the providers decide how to handle the exception in terms of global back-off, while remaining backward compatible with the current capabilities of the providers. So the exception flow would change as follows:

For all exceptions, let the providers decide whether they will apply a global back-off through the GetDelayInSecondsAfterOnProcessException method, and perform only the Abort operation on the exception flow, skipping the Release step.
By not including the SafeRelease mechanism on the exception path, I think we will create a clear mechanism for the providers to handle this specific scenario of failed work attempts, through custom implementations of GetDelayInSecondsAfterOnProcessException and AbortWorkItem.
I think skipping the SafeRelease on the exception flow does not present any risks, since providers that have not yet implemented the 2 methods from above will rely on the automatic expiration of the lock in this case, without an explicit release.
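
Roughly, the flow I have in mind looks like this (a simplified sketch only, not the actual WorkItemDispatcher code; member names are placeholders):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative sketch only: the proposed shape of the work-item processing flow,
// not the actual WorkItemDispatcher code (member names here are placeholders).
class WorkItemFlowSketch
{
    readonly Func<object, Task> processWorkItem;                  // the actual work-item processing
    readonly Func<object, Task> safeAbortWorkItem;                // maps to AbortWorkItem / AbandonTaskOrchestrationWorkItemAsync
    readonly Func<object, Task> safeReleaseWorkItem;              // maps to SafeReleaseWorkItem / ReleaseTaskOrchestrationWorkItemAsync
    readonly Func<Exception, int> getDelayAfterProcessException;  // maps to GetDelayInSecondsAfterOnProcessException

    public WorkItemFlowSketch(
        Func<object, Task> process,
        Func<object, Task> abort,
        Func<object, Task> release,
        Func<Exception, int> getDelay)
    {
        this.processWorkItem = process;
        this.safeAbortWorkItem = abort;
        this.safeReleaseWorkItem = release;
        this.getDelayAfterProcessException = getDelay;
    }

    public async Task ProcessAsync(object workItem)
    {
        try
        {
            await this.processWorkItem(workItem);

            // Success path only: release the lock explicitly.
            await this.safeReleaseWorkItem(workItem);
        }
        catch (Exception e)
        {
            // Exception path: abort only, no release. The provider's abort logic decides
            // whether the work item becomes visible again immediately or only after the
            // lock expires.
            await this.safeAbortWorkItem(workItem);

            // The provider also decides whether to apply a global back-off (0 = none).
            int delaySecs = this.getDelayAfterProcessException(e);
            if (delaySecs > 0)
            {
                await Task.Delay(TimeSpan.FromSeconds(delaySecs));
            }
        }
    }
}
```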

I have pushed a new version with the proposed implementation; let me know what you think.

Collaborator

Can you provide some details about why the abort and release were enforced as part of the processing of a work item in the first place?

This is how the framework was originally designed, before I was involved in the project. But even still, the design seems reasonable to me. My understanding of message timeouts is that they are intended to allow alternate VMs to pick up a message if the current VM becomes non-responsive. I would expect message handling failures to be dealt with using some other mechanism.

I'll respond to your other proposal on the main PR discussion thread.

This change addresses the issue described here: Azure#886
@moldovangeorge moldovangeorge force-pushed the georgemoldovan/prevent_performance_degradation branch from 2491752 to 9ecebcb on December 19, 2023, 11:31
@cgillum
Collaborator

cgillum commented Dec 19, 2023

Just to summarize, I understand that you're proposing two things:

  1. Remove the catching of TypeMissingException and instead rely on the generic exception handling mechanism for all exception types.
  2. Only release the work item in the success case.

I'm fine with (1) but worried about the unintended consequences of (2).

I think skipping the SafeRelease on the exception flow does not present any risks, since providers that have not yet implemented the 2 methods from above will rely on the automatic expiration of the lock in this case, without an explicit release.

When you say "providers that have not yet implemented the 2 methods from above", which two methods are you referring to?

By removing the release from the error handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. This is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this.

Let me know if I'm misunderstanding something.

@moldovangeorge
Author

moldovangeorge commented Dec 20, 2023

By the 2 methods from above I mean GetDelayInSecondsAfterOnProcessException and AbortWorkItem. These 2 methods should determine the behavior for failed tasks for a given provider. AbortWorkItem maps to AbandonTaskOrchestrationWorkItemAsync in the IOrchestrationService.
Removing the release from the error handling flow will only result in the message remaining locked if a given provider does not implement the GetDelayInSecondsAfterOnProcessException and AbortWorkItem methods but does implement ReleaseTaskOrchestrationWorkItemAsync. I will analyze whether such a provider exists today.
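
For reference, the members involved look roughly like this (a paraphrased excerpt with simplified parameter types; see IOrchestrationService in DurableTask.Core for the authoritative signatures):

```csharp
using System;
using System.Threading.Tasks;

// Paraphrased excerpt of the IOrchestrationService members discussed here; parameter
// types are simplified, so treat this as a sketch rather than the real interface.
public interface IOrchestrationServiceExcerpt
{
    // Abandons a locked orchestration work item after a processing failure
    // (the dispatcher's AbortWorkItem path ends up here).
    Task AbandonTaskOrchestrationWorkItemAsync(object workItem);

    // Explicitly releases the lock on an orchestration work item.
    Task ReleaseTaskOrchestrationWorkItemAsync(object workItem);

    // Lets the provider request a global back-off (in seconds) after a processing exception.
    int GetDelayInSecondsAfterOnProcessException(Exception exception);
}
```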

To verify whether this change will indeed introduce a penalty in the existing providers, I performed the following analysis:

  • DurableTask.ServiceBus: Implements both AbandonTaskOrchestrationWorkItemAsync and ReleaseTaskOrchestrationWorkItemAsync, so no performance penalty will be caused by long lock retention. The current implementation of GetDelayInSecondsAfterOnProcessException inserts a global back-off for transient exceptions and returns 0 otherwise. This means that tasks that fail with TypeMissingException will be retried immediately and indefinitely until a worker is eventually able to process them. It is not ideal, but I think it's a better middle ground than the current situation, where such an exception will cause all workers to eventually stop processing until that task is forcefully removed from the queue.
  • DurableTask.AzureStorage: Same as DurableTask.ServiceBus, but with better handling of AbandonTaskOrchestrationWorkItemAsync (as you described here) to avoid tight failure loops, so no penalty for this provider. The current implementation of GetDelayInSecondsAfterOnProcessException is a global back-off of 10 seconds for all exceptions. This, coupled with this provider's implementation of the Abort operation, makes sense to me and has a very low chance of either stopping processing or over-retrying tasks that are going to fail anyway.
  • DurableTask.AzureServiceFabric: Same as DurableTask.ServiceBus, no penalty for this provider. The current implementation of GetDelayInSecondsAfterOnProcessException is a 1- or 2-second global back-off for FabricNotReadableException and TimeoutException, and 0 otherwise. Same vulnerability for over-retrying as the ServiceBus provider.
  • DurableTask.Netherite: Implements AbandonTaskOrchestrationWorkItemAsync but not ReleaseTaskOrchestrationWorkItemAsync. Since the release was not doing anything anyway, there is no penalty here either. GetDelayInSecondsAfterOnProcessException returns a 0-second delay. Same vulnerability as the ServiceBus provider.
  • DurableTask.SqlServer: No implementation for any of the 3 methods (ReleaseTaskOrchestrationWorkItemAsync, AbandonTaskOrchestrationWorkItemAsync, GetDelayInSecondsAfterOnProcessException), so this provider already had a penalty for any task that hit an exception, because the lock was not released.
  • DurableTask.Emulator: Out of scope for this change.

After analyzing the above, my conclusion is:
Before this change, all providers had an issue: if a few tasks that generated TypeMissingException were present in the queue, this would cause serious performance degradation, eventually leading workers to stop completely.
After this change, providers will keep the same flow for handling different exceptions as before, no performance penalty will be introduced, and they will be able to handle TypeMissingException explicitly through their implementations of GetDelayInSecondsAfterOnProcessException and AbandonTaskOrchestrationWorkItemAsync. Specifically, most of the providers, with their current implementations, will replace this performance degradation with continuous retrying of the tasks that generate the TypeMissingException. My personal opinion is that this is a good middle ground that is easily fixable in all the providers by implementing the same type of logic (exponential back-off for a given work item) that is now present in the DurableTask.AzureStorage provider's AbandonTaskOrchestrationWorkItemAsync method.
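
To make that last point concrete, the kind of per-work-item exponential back-off I have in mind could look roughly like this (a hypothetical provider-side sketch, not the actual DurableTask.AzureStorage code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical provider-side sketch: back off per failing work item when abandoning it,
// instead of delaying the whole worker. Not the actual DurableTask.AzureStorage code.
class AbandonBackoffSketch
{
    readonly ConcurrentDictionary<string, int> abandonCounts = new ConcurrentDictionary<string, int>();

    // Called from the provider's abandon (AbortWorkItem) path.
    public async Task DelayBeforeAbandonAsync(string workItemId)
    {
        int attempt = this.abandonCounts.AddOrUpdate(workItemId, 1, (_, count) => count + 1);

        // Exponential back-off applied only to this work item, capped at 5 minutes.
        double delaySeconds = Math.Min(Math.Pow(2, attempt), 300);
        await Task.Delay(TimeSpan.FromSeconds(delaySeconds));

        // ...then re-queue / unlock the work item via the backend-specific mechanism.
    }

    // Called once the work item is eventually processed successfully.
    public void Reset(string workItemId) => this.abandonCounts.TryRemove(workItemId, out _);
}
```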

Please let me know if I missed something or if this is not aligned with the long-term vision of the OrchestrationService.

@moldovangeorge
Author

@moldovangeorge please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="UiPath"

@moldovangeorge
Author

@cgillum Gentle reminder to take a look at the above.

@cgillum left a comment
Collaborator

Hi @moldovangeorge. I read your last response but I'm still not sure if it covers my previous question:

By removing the release from the error handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. This is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this.

Can you re-explain to me why we're moving the "release" out of the finally block and instead only executing it in the try block?

@moldovangeorge
Author

Hi @moldovangeorge. I read your last response but I'm still not sure if it covers my previous question:

By removing the release from the error handling flow, you're basically guaranteeing that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout, which could be several minutes. This is a huge penalty to pay, particularly for transient issues, and our users will definitely notice and complain about this.

Can you re-explain to me why we're moving the "release" out of the finally block and instead only executing it in the try block?

Hey @cgillum, sure:
The presence of SafeReleaseWorkItem on the exception path was forcing fast-retry behaviour on the exception path for the DTF providers that implemented the SafeRelease logic. While the global back-off was in place, this was not an issue, because the retries were delayed by the global back-off anyway.
With this change, now that the global back-off disappears, keeping SafeReleaseWorkItem on the error handling flow would result in an immediate retry of the task (for the providers that implement ReleaseTaskOrchestrationWorkItemAsync), pushing us to the other extreme: from global back-off to immediate and never-ending retries.

So the reason for excluding the SafeReleaseWorkItem from the error handling flow is to avoid excessive retries and to give the liberty to the provider implementation to choose how to handle broken tasks via the GetDelayInSecondsAfterOnProcessException and AbortWorkItem methods.
The removal of the SafeReleaseWorkItem would

guarantee that any unhandled exception in message processing will result in the message remaining locked for the full length of the message timeout

only if there were a provider that

does not implement the GetDelayInSecondsAfterOnProcessException and AbortWorkItem methods, but does implement ReleaseTaskOrchestrationWorkItemAsync.

My analysis in the previous comment tried to find out whether any DTF provider would be affected by this change, and I could not find any provider whose behaviour for failed tasks would change after this change is merged.

@cgillum
Collaborator

cgillum commented Mar 28, 2024

Thanks @moldovangeorge - I agree that removing the global backoff is a good thing, and I understand the concern about creating a tight failure loop for failures generally. One more thing I'd like to confirm before going forward with this:

...give the liberty to the provider implementation to choose how to handle broken tasks via the GetDelayInSecondsAfterOnProcessException and AbortWorkItem methods.

If I understand correctly:

  • GetDelayInSecondsAfterOnProcessException allows backend implementations to introduce the global backoff behavior if they want it. The default is no global backoff.
  • AbortWorkItem lets implementations decide if they want to release the lock on the work item or let it remain locked until the lock timeout expires. The default (based on current implementations) is that the work item will remain locked until the timeout expires.

Is this accurate? If so, I think this is fine. However, I am a bit concerned about this:

most of the providers, with their current implementation will replace this performance degradation with continuous retrying of the tasks that generate the TypeMissingException.

As you mentioned, DurableTask.AzureStorage (the most popular backend) is protected because the abandon logic already does exponential backoff. DurableTask.ServiceBus (the original/oldest provider) might be okay because I think there should be some kind of poison message handling in the Service Bus layer to protect it from runaway errors. I'm a bit worried about the Service Fabric implementation, however. I wonder if we should include TypeMissingException in the list of exceptions handled by GetDelayInSecondsAfterOnProcessException to avoid an unexpected change in behavior. It seems that the Netherite provider could also experience an unpleasant behavior change if there's a TypeMissingException.
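
Concretely, I'm imagining something like this on the provider side (a hypothetical sketch; the namespace and delay values are assumptions, not code from this PR):

```csharp
using System;
using DurableTask.Core.Exceptions; // assumed location of TypeMissingException

// Hypothetical provider-side sketch: add TypeMissingException to the exceptions that
// trigger a back-off in GetDelayInSecondsAfterOnProcessException, so a work item with a
// missing type is not retried in a tight loop. Delay values are illustrative only.
class BackoffPolicySketch
{
    public int GetDelayInSecondsAfterOnProcessException(Exception exception)
    {
        switch (exception)
        {
            case TypeMissingException _:
                return 10; // back off on missing-type failures instead of retrying immediately
            case TimeoutException _:
                return 2;  // example of existing transient-failure handling
            default:
                return 0;  // no global back-off
        }
    }
}
```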

Adding @shankarsama and @sebastianburckhardt for FYI on this behavior change since this will affect the provider implementations you maintain. I'm inclined to accept this change, so please speak up if you have concerns.

@moldovangeorge
Author

GetDelayInSecondsAfterOnProcessException allows backend implementations to introduce the global backoff behavior if they want it. The default is no global backoff.
AbortWorkItem lets implementations decide if they want to release the lock on the work item or let it remain locked until the lock timeout expires. The default (based on current implementations) is that the work item will remain locked until the timeout expires.

Yes, @cgillum your understanding from above is accurate.

For the 3 providers that you mentioned (ServiceFabric, ServiceBus, and Netherite), yes, the behaviour for TypeMissingException will change from worker degradation until a complete halt to continuous retries of tasks that generate TypeMissingException. My personal opinion is that this is a good step forward that can be further improved by implementing in these providers the same type of behaviour that the Azure Storage provider has for GetDelayInSecondsAfterOnProcessException and AbortWorkItem. From the perspective of a user of this technology, I would rather have a task stuck in the queue and infinitely retrying than have my workers completely stopped because of a few poison tasks. It's not perfect, but it's easily fixable at the provider level.

@moldovangeorge
Author

Hey @cgillum, @shankarsama, @sebastianburckhardt, are there any updates on this matter?

@davidmrdavid
Collaborator

davidmrdavid commented May 3, 2024

@moldovangeorge: Just to build up context - is there a particular storage provider you're most interested in here? Say, if we improved this behavior just for Azure Storage but not for the other backends, would that work for your scenario? I'm trying to figure out if a scoped change like that would be easier to merge. As it stands, affecting all storage providers will require multiple stakeholders to chime in, and that coordination is tricky.

@moldovangeorge
Author

@davidmrdavid This is not fixable at the provider level. Without this change, all providers suffer from the same vulnerability: a serious performance degradation from TypeMissingException. Since the root cause for this is found in the DTF Core, there is no provider-level alternative for fixing this.

@davidmrdavid
Collaborator

@moldovangeorge: I took a closer look at the code, and I see what you mean now. I'm just trying to reduce the coordination effort needed to merge this; for example, the Netherite SME is not available to chime in at this time (on vacation) and could probably be a blocker here.

A way to make this immediately safer to merge would be to turn it into opt-in behavior. Perhaps this could be a boolean setting that users opt into but that is turned off by default. That would allow us to merge this behavior more easily, without having to worry much about implicitly changing the behavior of the storage providers.
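
Something along these lines, purely as a sketch (the setting's name and where it would live are illustrative, not part of this PR):

```csharp
// Hypothetical opt-in flag; the name and location are illustrative only.
public class DispatcherBehaviorSettingsSketch
{
    // false (default): keep today's behavior, releasing the work item even on the exception path.
    // true: skip the release on the exception path and let the provider's abandon/back-off
    //       logic decide when the work item becomes visible again.
    public bool SkipReleaseOnProcessException { get; set; } = false;
}
```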

@moldovangeorge
Author

Hey @davidmrdavid, as long as the necessary people eventually sign off on this, I'm fine with waiting. This has already been open for 9 months, and at this point I would rather wait a little longer, since turning this into opt-in behaviour might add some clutter to a critical part of the Core project.

@davidmrdavid
Collaborator

Fair enough. I'll tag @sebastianburckhardt here for his 2 cents once he's back from OOF.
