feat: Add sticky clientid shared subscription strategies #12676
base: master
Conversation
changes/ce/feat-12676.en.md
Outdated
- `sticky_clientid`

Dispatch messages from a publisher (distinguished by its client ID) always to the same subscriber, as long as both the publisher and subscriber sessions exist. The initial subscriber is selected using the `hash_clientid` strategy. The difference from `hash_clientid` is that the subscriber associated with a given publisher will not change even if new subscribers join the group.
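For context, a minimal sketch of the behavior described above, written in Python purely for illustration (EMQX itself is Erlang; the class, function names, and hash function here are assumptions, not the actual implementation):

```python
import hashlib

def hash_pick(client_id: str, subscribers: list):
    """Initial pick: hash the publisher's client ID onto the subscriber
    list (mirrors the hash_clientid idea; the hash is an assumption)."""
    idx = int(hashlib.md5(client_id.encode()).hexdigest(), 16) % len(subscribers)
    return subscribers[idx]

class StickyClientIdDispatcher:
    """Hypothetical sketch of sticky_clientid: the first dispatch for a
    publisher picks via hash; the binding is then kept even if new
    subscribers join the group later."""
    def __init__(self):
        self.bindings = {}  # publisher client ID -> bound subscriber

    def dispatch(self, publisher_id, subscribers):
        sub = self.bindings.get(publisher_id)
        if sub is None or sub not in subscribers:
            # No binding yet, or the bound subscriber left: (re)pick.
            sub = hash_pick(publisher_id, subscribers)
            self.bindings[publisher_id] = sub
        return sub
```

The key property is that `dispatch` returns the remembered subscriber even after the subscriber list changes, unlike a plain hash over the current list.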
If I understand the intention correctly, you want to make it possible for a publisher client to pair up with a subscriber client, with the pairing surviving a publisher client reconnect?

The challenges are:
- A publishing client reconnect triggers session termination, and I noticed that session termination un-sticks the client.
- When clustered, a reconnecting client may connect to another node in the cluster, but the stickiness state is local (in an ETS table).
- The main goal is to make sure the subscriber does not change when new subscribers get added to the shared subscription group (or when subscribers are removed, unless it's the current subscriber). I think surviving the client's session termination would not be necessary.
- I couldn't figure out a better way to make sure the sticky session data gets cleaned up when it's no longer needed. If the client sets the session expiry interval to a non-zero value, the sticky session should stay until that expires regardless of reconnection, right?
- Do you think it would make sense to share the sticky state between cluster nodes? What would be the best way to accomplish this?
> The main goal is to make sure the subscriber does not change when new subscribers get added to the shared subscription group (or when subscribers are removed, unless it's the current subscriber)

The current `sticky` strategy should achieve this? Or is it the initial random pick you don't like?

> I couldn't figure out a better way to make sure that the sticky session data gets cleaned up when it's no longer needed. If the client sets the session expiry interval to a non-zero value the sticky session should stay until that expires regardless of reconnection, right?

The process dictionary should do; no need for ETS.

> Do you think it would make sense to share the sticky state between cluster nodes? What would be the best way to accomplish this?

No, that would be too expensive.
> The current `sticky` strategy should achieve this? Or is it the initial random pick you don't like?

The current `hash_clientid` strategy works well if the subscribers for the group stay constant, but not when they get added/removed (due to Kubernetes autoscaling, for example). The `hash_clientid` strategy picks the Nth subscriber in the list for each message based on the hash, but it's not necessarily the same subscriber if the list/count has changed.
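The instability of an Nth-subscriber hash pick can be shown with a small Python sketch (the hash function and names are illustrative assumptions, not EMQX's actual implementation): the pick for a fixed client ID is stable against a fixed list, but can move when the list grows.

```python
import hashlib

def hash_clientid_pick(client_id: str, subscribers: list):
    """Pick the Nth subscriber based on a hash of the publisher's
    client ID (illustrative; the real hash may differ)."""
    n = int(hashlib.sha1(client_id.encode()).hexdigest(), 16) % len(subscribers)
    return subscribers[n]

subs3 = ["s1", "s2", "s3"]
subs4 = ["s1", "s2", "s3", "s4"]  # one subscriber added, e.g. by autoscaling

# Some client IDs now hash onto a different subscriber than before,
# because the modulus changed from 3 to 4.
moved = [c for c in (f"client-{i}" for i in range(100))
         if hash_clientid_pick(c, subs3) != hash_clientid_pick(c, subs4)]
```

With a stable list the mapping never changes; adding a subscriber reshuffles a fraction of the clients, which is exactly the behavior the sticky variants aim to avoid.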
The initial pick is a separate thing; I think it would be nice to have an option for using the subscriber with the least amount of load (which is the goal of the `sticky_clientid_leastpubs` strategy).

> The process dictionary should do; no need for ETS.

But regardless of which one is used, the sticky state for each client should be cleared when it's no longer needed, right? Otherwise the table would just keep growing for each new client that publishes to the shared subscription. (I'm thinking about clients connecting with random client IDs and not reusing them, so they become stale.)
> The current `hash_clientid` strategy works well if the subscribers for the group stay constant, but not when they get added/removed (due to Kubernetes autoscaling, for example). The `hash_clientid` strategy picks the Nth subscriber in the list for each message based on the hash but it's not necessarily the same subscriber if the list/count has changed.

I meant the current `sticky` strategy, not the current `hash_clientid` strategy.

> The initial pick is a separate thing, I think it would be nice to have an option for using the subscriber with the least amount of load (which is the goal with the `sticky_clientid_leastpubs` strategy).

To this, I agree. Maybe worth a separate PR.

> But regardless of which one is used, the sticky state for each client should be cleared when it's no longer needed, right?

The process dictionary goes away when the client process terminates, so there is no need for any cleanup.
Hmm, there is perhaps some misunderstanding about how EMQX message dispatch works.

The `sticky` strategy binds a PUBLISH client to a SUBSCRIBE client as long as they are both alive. If another PUBLISH client sends a message, the `sticky` strategy allows a fresh (random) initialization of the binding for this client, but will not change the bindings of other clients.

The same goes for the process dictionary: the stickiness state is stored in the publisher's process dictionary, not the subscriber's.
Ok, thanks for the information. I have indeed misunderstood this (I'm new both to Erlang and the EMQX codebase).
Earlier, when I tested the `sticky` strategy, I got all my publishing clients stuck to just one subscriber in the group, which kind of "confirmed" my (wrong) assumption. I think there might have been something weird happening when I changed the setting on a running cluster... I re-tested with the latest version on a clean deployment and it works as expected.

So it seems that what I really want with this MR is something like the current `sticky` strategy, but with the initial pick made either by `hash_clientid` or by choosing the "least loaded" subscriber (instead of random).
The ideal case would be if the sticky state could also survive as long as the client's session (not just connection) and be shared among cluster nodes but if you think it's not feasible then let's forget about that.
Maybe add a config: `initial_sticky_pick = random | hash_clientid | hash_topic | least_loaded`
I have updated the PR, but didn't change the configuration yet to use `initial_sticky_pick`. I can still do that if this looks otherwise OK.
Thank you. I'll have a look soon.
Force-pushed from `f50a2eb` to `d5a10ee`.
apps/emqx/src/emqx_shared_sub.erl
Outdated
        public, set, {write_concurrency, true}
    ]),
    emqx_hooks:put(
        'client.disconnected', {?MODULE, on_client_disconnected, [self()]}, ?HP_LOWEST
Sorry for not thinking it through during previous reviews.

I am afraid this is not quite reliable, because a client process can be killed (or crash), in which case it has no chance to call the hooks. We would need to monitor the publisher clients; however, it would be too expensive to keep track of the monitoring references, because there could potentially be millions of publisher clients versus only a few subscribers.

Also, in the case of clustering, nodes may have different views of which subscriber is least loaded, so I am unsure whether this is really going to be more balanced than a random or hash-based strategy.
I see. My goal with this is to improve the behavior when a new subscriber joins the group and is able/expected to take a bigger share of (new) publisher clients than the existing subscribers. Assuming the clients have the same kind of behavior on average and are equally distributed among nodes, I think it would work for that purpose even if the balancing isn't perfect.

Anyway, I'm wondering whether there is a better/alternative way to accomplish this. Perhaps some kind of "weighted random", where newly joined subscriber(s) would have a higher chance of getting selected (with the weights adjusted back to equal after some time). It could be done without having to monitor the publishers. Do you think it would make sense?
Weighted random might work. E.g. bump a counter for each pick; for the `sticky_prioritize_leastload` strategy, pick with the condition `Rand > 1 - (PickCount + 1) / (TotalCount + N)`. Maybe reset the counters every 10 seconds to reflect real-time load.
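One possible reading of this counter-based idea, sketched in Python for illustration (the class name and the exact weighting are assumptions; this favors subscribers with fewer recent picks rather than reproducing the quoted condition literally):

```python
import random

class WeightedLeastLoadPicker:
    """Hypothetical sketch: each subscriber's chance of being picked
    shrinks with its recent pick count; counters can be reset
    periodically so the weights track near-real-time load."""
    def __init__(self, subscribers):
        self.counts = {s: 0 for s in subscribers}

    def pick(self):
        total = sum(self.counts.values())
        n = len(self.counts)
        # Weight favors subscribers with fewer recent picks; the +1
        # smoothing keeps every subscriber's weight strictly positive.
        weights = [1 - (self.counts[s] + 1) / (total + n + 1)
                   for s in self.counts]
        choice = random.choices(list(self.counts), weights=weights)[0]
        self.counts[choice] += 1
        return choice

    def reset(self):
        # Called on a timer (e.g. every 10 seconds, per the suggestion).
        for s in self.counts:
            self.counts[s] = 0
```

A freshly joined subscriber starts with a zero counter and therefore the highest weight, so it naturally absorbs a larger share of new picks until the counters even out.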
I have updated the algorithm to weighted random, based on the filtered rate of new clients. I'm sure the filter implementation could still be tweaked further (the `LEASTLOAD_RATE_FILTER_TIME_CONSTANT` value needs to be rather large to produce the desired effect; maybe it should also be configurable), but I think this could work reasonably well.
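A hedged sketch of the filtered-rate idea, assuming a first-order exponential low-pass filter over event inter-arrival times (the actual filter in the PR may differ; `time_constant_s` here plays the role of `LEASTLOAD_RATE_FILTER_TIME_CONSTANT`):

```python
import math

class RateFilter:
    """Hypothetical per-subscriber estimator of the rate of newly
    assigned clients, smoothed by a first-order low-pass filter."""
    def __init__(self, time_constant_s: float = 60.0):
        self.tau = time_constant_s
        self.rate = 0.0       # filtered events per second
        self.last_ts = None   # timestamp of the previous event

    def on_new_client(self, now_s: float) -> float:
        if self.last_ts is not None:
            dt = now_s - self.last_ts
            # Discrete-time low-pass: blend the instantaneous rate
            # (one event per dt seconds) into the running estimate.
            alpha = 1.0 - math.exp(-dt / self.tau)
            self.rate += alpha * (1.0 / dt - self.rate)
        self.last_ts = now_s
        return self.rate
```

With a large time constant the estimate reacts slowly, which matches the observation that the constant "needs to be rather large to produce the desired effect": short bursts barely move the weight, while a sustained imbalance does.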
Add new sticky_clientid and sticky_prioritize_leastload strategies for shared subscriptions.
Force-pushed from `d5a10ee` to `cb4a0f8`.
Add new sticky_clientid and sticky_clientid_leastpubs strategies for shared subscriptions.
Release version: v/e5.7.0
Summary
PR Checklist
Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:
- Added `changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md` files

Checklist for CI (.github/workflows) changes
- `changes/` dir for user-facing artifacts update