feat: Add sticky clientid shared subscription strategies #12676

Open — hholstil wants to merge 1 commit into master from feat-sharedsub-sticky-clientid
Conversation

@hholstil hholstil commented Mar 11, 2024

Add new sticky_clientid and sticky_clientid_leastpubs strategies for shared subscriptions.

Release version: v/e5.7.0

Summary

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Added property-based tests for code which performs user input validation
  • Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@hholstil hholstil requested review from lafirest and a team as code owners March 11, 2024 12:03
Review thread on apps/emqx/src/emqx_shared_sub.erl (outdated, resolved):

- `sticky_clientid`

Dispatch messages from a publisher (distinguished by its client ID) always to the same subscriber, as long as both the publisher and subscriber sessions exist. The initial subscriber is selected using the hash_clientid strategy. The difference from hash_clientid is that the subscriber associated with a given publisher will not change even if new subscribers join the group.
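
For illustration, a minimal sketch of how such a pick could work (this is not the PR's actual code; the table handle and function names are hypothetical):

```erlang
%% Illustrative sketch only, not the PR's implementation.
%% Tab is assumed to be an ETS table mapping publisher client IDs to
%% their bound subscriber; the function names are hypothetical.
pick_sticky_clientid(Tab, PublisherClientId, Subscribers) when Subscribers =/= [] ->
    case ets:lookup(Tab, PublisherClientId) of
        [{_, Sub}] ->
            case lists:member(Sub, Subscribers) of
                true -> Sub;                                   %% keep the existing binding
                false -> bind(Tab, PublisherClientId, Subscribers)
            end;
        [] ->
            bind(Tab, PublisherClientId, Subscribers)
    end.

%% Initial pick: same idea as the hash_clientid strategy.
bind(Tab, PublisherClientId, Subscribers) ->
    N = erlang:phash2(PublisherClientId, length(Subscribers)),
    Sub = lists:nth(N + 1, Subscribers),
    true = ets:insert(Tab, {PublisherClientId, Sub}),
    Sub.
```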
Member commented:

If I understand the intention of this correctly, you want to make it possible for a publisher client to pair up with a subscriber client and survive publisher client reconnects?

The challenges are

  1. The publishing client's reconnect will trigger session termination, and I noticed that session termination un-sticks the client.
  2. When clustered, if a client reconnects, it may connect to another node in the cluster; however, the stickiness state is local (in an ETS table).

@hholstil (Author) commented Mar 14, 2024:

The main goal is to make sure the subscriber does not change when new subscribers get added to the shared subscription group (or when subscribers are removed, unless it's the current subscriber). I think surviving the client's session termination would not be necessary.

  1. I couldn't figure out a better way to make sure that the sticky session data gets cleaned up when it's no longer needed. If the client sets the session expiry interval to a non-zero value the sticky session should stay until that expires regardless of reconnection, right?

  2. Do you think it would make sense to share the sticky state between cluster nodes? What would be the best way to accomplish this?

Member replied:

> The main goal is to make sure the subscriber does not change when new subscribers get added to the shared subscription group (or when subscribers are removed, unless it's the current subscriber)

the current sticky strategy should achieve this? Or is it the initial random pick you don’t like?

> I couldn't figure out a better way to make sure that the sticky session data gets cleaned up when it's no longer needed. If the client sets the session expiry interval to a non-zero value the sticky session should stay until that expires regardless of reconnection, right?

process dictionary should do, no need for ets

> Do you think it would make sense to share the sticky state between cluster nodes? What would be the best way to accomplish this?

no, that would be too expensive

@hholstil (Author) replied:

> the current sticky strategy should achieve this? Or is it the initial random pick you don’t like?

The current hash_clientid strategy works well if the subscribers for the group stay constant, but not when they get added/removed (due to Kubernetes autoscaling, for example). The hash_clientid strategy picks the Nth subscriber in the list for each message based on the hash, but it's not necessarily the same subscriber if the list/count has changed.

The initial pick is a separate thing, I think it would be nice to have an option for using the subscriber with the least amount of load (which is the goal with the sticky_clientid_leastpubs strategy).

> process dictionary should do, no need for ets

But regardless of which one is used, the sticky state for each client should be cleared when it's no longer needed, right? Otherwise the table would just keep growing in size for each new client that publishes to the shared subscription. (I'm thinking of clients that connect with random client IDs and never reuse them, so the entries become stale.)

Member replied:

> The current hash_clientid strategy works well if the subscribers for the group stay constant, but not when they get added/removed (due to kubernetes autoscaling, for example). The hash_clientid strategy picks the Nth subscriber in the list for each message based on the hash but it's not necessarily the same subscriber if the list/count has changed.

I meant the current sticky strategy, but did NOT mean the current hash_clientid strategy.

> The initial pick is a separate thing, I think it would be nice to have an option for using the subscriber with the least amount of load (which is the goal with the sticky_clientid_leastpubs strategy).

To this, I agree. Maybe worth a separate PR.

> But regardless of which one is used, the sticky state for each client should be cleared when it's no longer needed, right?

Process dictionary goes away when the client process terminates, so there is no need for any cleanup.

@zmstone (Member) commented Mar 18, 2024:

Hmm, there is perhaps some misunderstanding about how EMQX message dispatch works.
The sticky strategy binds a PUBLISH client to a SUBSCRIBE client as long as they are both alive.
If another PUBLISH client sends a message, the sticky strategy allows a fresh (random) initialization of the binding for this client, but will not change the bindings of other clients.

Same for the process dictionary: the stickiness state is stored in the publisher's process dictionary, not the subscriber's.
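
A rough sketch of what that looks like, for illustration (the key shape and the random re-pick are assumptions, not EMQX's exact code):

```erlang
%% Illustrative sketch: stickiness kept in the publisher channel's own
%% process dictionary, so it disappears automatically when the process dies.
%% The {shared_sub_sticky, Group, Topic} key shape is an assumption.
pick_sticky(Group, Topic, Subscribers) when Subscribers =/= [] ->
    Key = {shared_sub_sticky, Group, Topic},
    case erlang:get(Key) of
        undefined ->
            repick(Key, Subscribers);
        Sub ->
            case lists:member(Sub, Subscribers) of
                true  -> Sub;                     %% keep the existing binding
                false -> repick(Key, Subscribers) %% bound subscriber is gone
            end
    end.

%% Fresh (random) initialization of the binding, as in the sticky strategy.
repick(Key, Subscribers) ->
    Sub = lists:nth(rand:uniform(length(Subscribers)), Subscribers),
    erlang:put(Key, Sub),
    Sub.
```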

@hholstil (Author) commented Mar 20, 2024:

Ok, thanks for the information. I have indeed misunderstood this (I'm new both to Erlang and the EMQX codebase).

Earlier when I tested the sticky strategy I got all my publishing clients stickied to just one subscriber in the group which kind of "confirmed" my (wrong) assumption. I think there might have been something weird happening when I changed the setting on a running cluster... I re-tested with the latest version on a clean deployment and it works as expected.

So it seems that what I really want to do with this MR would be like the current sticky strategy but with the initial pick made either by hash_clientid or by choosing the "least loaded" subscriber (instead of random).

The ideal case would be if the sticky state could also survive as long as the client's session (not just connection) and be shared among cluster nodes but if you think it's not feasible then let's forget about that.

Member replied:

Maybe add a config: `initial_sticky_pick = random | hash_clientid | hash_topic | least_loaded`
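
For concreteness, a sketch of what such a schema field might look like (the field placement, default, and description text are assumptions, not an agreed API):

```erlang
%% Sketch only: a possible schema field for the suggested option.
%% Exact placement in the EMQX schema and the default value are assumptions.
{initial_sticky_pick,
    hoconsc:mk(
        hoconsc:enum([random, hash_clientid, hash_topic, least_loaded]),
        #{default => random, desc => "How the initial subscriber is picked for the sticky strategy"}
    )}
```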

@hholstil (Author) replied:

I have updated the PR, but didn't change the configuration to use `initial_sticky_pick` yet. I can still do that if this otherwise looks OK.

Member replied:

Thank you. I'll have a look soon.

@hholstil hholstil force-pushed the feat-sharedsub-sticky-clientid branch from f50a2eb to d5a10ee Compare March 28, 2024 10:41
Code excerpt under review:

```erlang
    public, set, {write_concurrency, true}
]),
emqx_hooks:put(
    'client.disconnected', {?MODULE, on_client_disconnected, [self()]}, ?HP_LOWEST
```
Member commented:

Sorry for not thinking it through during previous reviews.
I am afraid this is not quite reliable, because a client process can be killed (or crash), in which case it has no chance to call the hooks.

We would need to monitor the publisher clients; however, it would be too expensive to keep track of the monitoring references, because there could potentially be millions of publisher clients vs only a few subscribers.

Also, in case of clustering, nodes may have different views of which one is least loaded, so I am unsure if this is really going to be more balanced than a random or hash-based strategy.

@hholstil (Author) replied:

I see. My goal with this is to improve the behavior when a new subscriber joins the group and is able/expected to take a bigger share of (new) publisher clients than the existing subscriptions. Assuming the clients have the same kind of behavior on average and are equally distributed among nodes, I think it would work for that purpose even if the balancing isn't perfect.

Anyway, I'm wondering whether there would be a better/alternative way to accomplish this. Perhaps some kind of "weighted random" where newly joined subscriber(s) would have a higher chance of getting selected (with the weights being adjusted to equal after some time). It could be done without having to monitor the publishers. Do you think it would make sense?

Member replied:

Weighted random might work.
E.g. bump a counter for each pick; for the sticky_prioritize_leastload strategy, pick with the condition `Rand > 1 - (PickCount + 1) / (TotalCount + N)`.
Maybe reset the counters every 10 seconds to reflect real-time load.
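
One possible reading of this, as a rough sketch (it favours subscribers with fewer recent picks via inverse weighting rather than reproducing the exact condition above; all names are hypothetical):

```erlang
%% Sketch: counter-based weighted random pick that favours subscribers with
%% fewer recent picks. Counters are assumed to be reset periodically
%% (e.g. every 10 seconds) to reflect real-time load.
%% SubsWithCounts :: [{SubscriberPid, PickCount}]
weighted_pick(SubsWithCounts) ->
    Total = lists:sum([C || {_, C} <- SubsWithCounts]),
    %% a subscriber with fewer picks gets a larger weight
    Weights = [{Sub, Total + 1 - C} || {Sub, C} <- SubsWithCounts],
    WeightSum = lists:sum([W || {_, W} <- Weights]),
    pick_by_weight(Weights, rand:uniform() * WeightSum).

pick_by_weight([{Sub, _}], _R) ->
    Sub;
pick_by_weight([{Sub, W} | _], R) when R =< W ->
    Sub;
pick_by_weight([{_, W} | Rest], R) ->
    pick_by_weight(Rest, R - W).
```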

@hholstil (Author) replied:

I have updated the algorithm to weighted random based on the filtered rate of new clients.
I'm sure the filter implementation could still be tweaked further (the LEASTLOAD_RATE_FILTER_TIME_CONSTANT value needs to be rather large to produce the desired effect; maybe it should also be configurable), but I think this could work reasonably well.
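
For illustration, one way such a filtered rate can be computed is an exponential low-pass filter; the constant and the update shape below are assumptions, not the PR's exact implementation:

```erlang
%% Sketch of an exponential (low-pass) rate filter for "new clients per ms".
%% The time constant is an assumed placeholder, not the PR's value.
-define(RATE_FILTER_TIME_CONSTANT_MS, 60000).

%% PrevRate: previously filtered rate; NewClients: clients seen since the
%% last update; DeltaMs: milliseconds since the last update.
update_rate(PrevRate, NewClients, DeltaMs) ->
    Alpha = 1.0 - math:exp(-DeltaMs / ?RATE_FILTER_TIME_CONSTANT_MS),
    InstantRate = NewClients / max(DeltaMs, 1),
    PrevRate + Alpha * (InstantRate - PrevRate).
```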

Add new sticky_clientid and sticky_prioritize_leastload strategies
for shared subscriptions.
@hholstil hholstil force-pushed the feat-sharedsub-sticky-clientid branch from d5a10ee to cb4a0f8 Compare April 25, 2024 09:34