Mute just restarted nodes in leader_balancer #18497

ztlpn · 2024-05-15T15:20:18Z

Mute just restarted nodes in leader_balancer, as their health reports can have incomplete partition info, and they are probably busy recovering partitions anyway.

Fixes #17150

Backports Required

Release Notes

Improvements

Don't try to transfer leadership to just restarted nodes when balancing leaders.

vbotbuildovich · 2024-05-15T17:28:59Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d17-82bc-48cf-b433-0c9851414504

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d1f-a86b-476f-af6b-18eede55586a

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d1f-a86e-4f3a-bf44-258891b862ca

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f81-40f2-8711-9c4910f8634f

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68ec-4efa-96e8-a7e7a72f486f

bharathv · 2024-05-16T02:17:51Z

tests/rptest/tests/leadership_transfer_test.py

        for s, count in shard2leaders.items():
-            expected_min = math.floor(expected_on_shard * 0.8)
+            # Check with a lot of slack because leader balancer may not be able to achieve


wonder if we should mark this as ok_to_fail instead, so we don't lose track of tightening the check once the underlying issue is fixed.

Well if it is marked ok_to_fail we will surely lose track :) and I don't think we can mark individual assertions ok_to_fail... Also even in this form the check is somewhat useful
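To make the trade-off in this thread concrete, here is a minimal sketch of a per-shard leader-count check with a configurable slack factor. Only `shard2leaders`, `expected_on_shard`, and the original `0.8` factor come from the diff above. The helper name `underloaded_shards`, the `slack` parameter, and the value `0.5` are assumptions for illustration, since the relaxed factor actually used in the test is not visible in this excerpt.

```python
import math


def underloaded_shards(shard2leaders, expected_on_shard, slack):
    """Return the shards whose leader count is below the relaxed minimum.

    `slack` is the fraction of the expected per-shard count that is still
    accepted. The 0.5 used below is an assumed value, not the one in the PR.
    """
    expected_min = math.floor(expected_on_shard * slack)
    return {s: count for s, count in shard2leaders.items() if count < expected_min}


# Hypothetical leader counts per shard, with 10 leaders expected on each.
shard2leaders = {0: 10, 1: 6, 2: 2}

strict = underloaded_shards(shard2leaders, expected_on_shard=10, slack=0.8)
relaxed = underloaded_shards(shard2leaders, expected_on_shard=10, slack=0.5)
print(strict, relaxed)
```

With the relaxed factor the check still flags a badly unbalanced shard, which matches the point that the check remains somewhat useful in its loosened form.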

vbotbuildovich · 2024-05-16T16:55:30Z

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f86-4639-be96-7e0e03f9e76b:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c581-409c-b052-78e58d78c3c4:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f81-40f2-8711-9c4910f8634f:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f89-46ed-b36f-d8f40d5f346a:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c585-4e68-89d5-245de545bb40:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c583-4ebe-a3d7-96108f6a4b42:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c57e-4a43-97a7-8423e98bb3c6:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8300-c364-4b33-9d07-7b67d4bb629d:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fbe-433f-9aa9-bf906ba7c3c4:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fbc-46e2-94ab-2fea83f5e43d:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fc2-4294-9018-8ace8d69812e:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e3-4246-a9f1-0e9ff37873de:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e6-4789-bc86-964fc89cb749:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e9-47b9-9923-9e87cf7f343c:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68ec-4efa-96e8-a7e7a72f486f:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

bharathv · 2024-05-21T17:28:36Z

@ztlpn is this ready for review? Lots of failures, so unsure if they are related or not.

ztlpn · 2024-05-21T17:32:06Z

@bharathv they are related, though this is more of a test problem. Currently discussing with the storage team how to fix the test.

ztlpn · 2024-05-21T23:07:55Z

merged #18603, retrying ci...

ztlpn · 2024-05-21T23:08:09Z

/ci-repeat

Just-restarted nodes may have incomplete health reports because not all partitions have started yet. Also, right after a restart the node is probably busy catching up and replicating data that was produced in its absence. For these two reasons, just-restarted nodes are bad candidates for leadership transfers, so mute them.
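The idea can be sketched roughly as follows. This is an illustration in Python, not the leader_balancer implementation: the `NodeHealth` type, the `uptime_seconds` field, the helper names, and the 60-second threshold are all hypothetical, and the PR text does not say how "just restarted" is detected or what threshold is used.

```python
from dataclasses import dataclass

# Assumed threshold for illustration only; the real value is not stated in this PR.
RESTART_MUTE_SECONDS = 60.0


@dataclass(frozen=True)
class NodeHealth:
    node_id: int
    uptime_seconds: float  # assumed to be derivable from the node's health report


def muted_nodes(reports, min_uptime=RESTART_MUTE_SECONDS):
    """Ids of nodes that restarted too recently to be given leadership."""
    return {r.node_id for r in reports if r.uptime_seconds < min_uptime}


def transfer_candidates(reports, min_uptime=RESTART_MUTE_SECONDS):
    """Ids of nodes that remain eligible as leadership transfer targets."""
    muted = muted_nodes(reports, min_uptime)
    return [r.node_id for r in reports if r.node_id not in muted]


reports = [NodeHealth(1, 3600.0), NodeHealth(2, 12.0), NodeHealth(3, 900.0)]
print(transfer_candidates(reports))
```

Node 2 is excluded here because its assumed uptime is below the threshold; once it has been up long enough it becomes a candidate again.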

ztlpn · 2024-06-11T20:27:19Z

After merging #18744 this should be enough to fix #17150

ztlpn requested review from bharathv, bashtanov and mmaslankaprv May 15, 2024 15:20

github-actions bot added the area/redpanda label May 15, 2024

bharathv reviewed May 16, 2024

ztlpn force-pushed the fix-17150 branch from ca483a4 to e68016b Compare May 16, 2024 14:26

ztlpn requested a review from bharathv May 16, 2024 14:35

ztlpn marked this pull request as draft June 10, 2024 11:04

ztlpn added 2 commits June 11, 2024 22:24

tests: increase timeout in TopicRecoveryTest.test_many_partitions

5084da5

Because in this test we wait for the set of objects in S3 to stabilize, it is dependent on leader balancer timings and the previous commit makes it fail. Give it more time to stabilize.
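As a rough illustration of why such a wait is timing-sensitive, here is a minimal sketch of a "wait until stable" poll loop. It is not the helper the test uses: `wait_until_stable`, its parameters, and the injectable `clock`/`sleep` arguments are hypothetical names chosen for this example.

```python
import time


def wait_until_stable(snapshot, timeout_sec, stable_for_sec=1.0, poll_sec=0.1,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll snapshot() until its value is unchanged for stable_for_sec.

    Returns the stable value, or raises TimeoutError if the deadline passes
    first. `clock` and `sleep` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_sec
    last = snapshot()
    last_change = clock()
    while clock() < deadline:
        if clock() - last_change >= stable_for_sec:
            return last
        sleep(poll_sec)
        current = snapshot()
        if current != last:
            # Any change (for example a new object uploaded after a
            # leadership transfer) restarts the stability window.
            last, last_change = current, clock()
    raise TimeoutError("snapshot did not stabilize before the timeout")
```

Every change to the observed set restarts the stability window, so anything that delays background activity, such as leadership moving later than before, pushes the point of stability out and can exceed a timeout that used to be sufficient.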

ztlpn force-pushed the fix-17150 branch from e68016b to 5084da5 Compare June 11, 2024 20:24

ztlpn changed the title ~~tests: relax check in AutomaticLeadershipBalancingTest~~ Mute just restarted nodes in leader_balancer Jun 11, 2024

ztlpn marked this pull request as ready for review June 11, 2024 20:25
