Mute just restarted nodes in leader_balancer #18497

Open
wants to merge 2 commits into
base: dev
from

Contributor

@ztlpn ztlpn commented May 15, 2024

Mute just restarted nodes in leader_balancer, as their health reports can have incomplete partition info, and they are probably busy recovering partitions anyway.

Fixes #17150
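
The muting idea described above can be sketched roughly as follows. This is only an illustration in Python, not the balancer's actual code: `NodeStatus`, `muted_nodes`, `transfer_candidates`, and the 60-second `RESTART_MUTE_SECONDS` threshold are all hypothetical names and values chosen for the example, since the description does not say how "just restarted" is measured.

```python
import time
from dataclasses import dataclass

# Hypothetical threshold: the PR does not state how long a node
# counts as "just restarted".
RESTART_MUTE_SECONDS = 60.0


@dataclass
class NodeStatus:
    node_id: int
    started_at: float  # time (in seconds) at which the node last started


def muted_nodes(nodes, now=None, mute_for=RESTART_MUTE_SECONDS):
    """Return ids of nodes that restarted less than `mute_for` seconds ago."""
    now = time.monotonic() if now is None else now
    return {n.node_id for n in nodes if now - n.started_at < mute_for}


def transfer_candidates(nodes, now=None, mute_for=RESTART_MUTE_SECONDS):
    """Return ids of nodes that are still eligible to receive leadership."""
    muted = muted_nodes(nodes, now, mute_for)
    return [n.node_id for n in nodes if n.node_id not in muted]
```

In this sketch a muted node is simply skipped as a transfer target until it has been up longer than the threshold, by which time its health report is more likely to cover all of its partitions.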

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • Don't try to transfer leadership to just restarted nodes when balancing leaders.

for s, count in shard2leaders.items():
    expected_min = math.floor(expected_on_shard * 0.8)
    # Check with a lot of slack because leader balancer may not be able to achieve
Contributor

wonder if we should mark this as ok_to_fail instead, so we don't lose track of tightening the check once the underlying issue is fixed.

Contributor Author

Well if it is marked ok_to_fail we will surely lose track :) and I don't think we can mark individual assertions ok_to_fail... Also even in this form the check is somewhat useful
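
For reference, the relaxed check under discussion could look roughly like the sketch below. Only `shard2leaders`, `expected_on_shard`, and the 0.8 factor come from the quoted snippet. The helper name `underloaded_shards` and the comparison `count < expected_min` are assumptions, because the quoted diff is cut off before the assertion itself.

```python
import math


def underloaded_shards(shard2leaders, expected_on_shard, slack=0.8):
    """Return {shard: count} for shards below the relaxed minimum.

    Assumption: the test compares each shard's leader count against
    floor(expected_on_shard * slack); the quoted diff does not show this.
    """
    expected_min = math.floor(expected_on_shard * slack)
    return {
        s: count
        for s, count in shard2leaders.items()
        if count < expected_min
    }
```

A test could then assert that the returned dict is empty and print it on failure, which keeps the check useful even with the extra slack.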

@vbotbuildovich
Collaborator

vbotbuildovich commented May 16, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f86-4639-be96-7e0e03f9e76b:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c581-409c-b052-78e58d78c3c4:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f81-40f2-8711-9c4910f8634f:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f89-46ed-b36f-d8f40d5f346a:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c585-4e68-89d5-245de545bb40:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c583-4ebe-a3d7-96108f6a4b42:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c57e-4a43-97a7-8423e98bb3c6:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8300-c364-4b33-9d07-7b67d4bb629d:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fbe-433f-9aa9-bf906ba7c3c4:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fbc-46e2-94ab-2fea83f5e43d:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fc2-4294-9018-8ace8d69812e:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e3-4246-a9f1-0e9ff37873de:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e6-4789-bc86-964fc89cb749:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e9-47b9-9923-9e87cf7f343c:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68ec-4efa-96e8-a7e7a72f486f:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

@bharathv
Contributor

@ztlpn is this ready for review? Lots of failures, so unsure if they are related or not.

@ztlpn
Contributor Author

ztlpn commented May 21, 2024

@bharathv they are related, though this is more of a test problem. Currently discussing with the storage team how to fix the test.

@ztlpn
Contributor Author

ztlpn commented May 21, 2024

merged #18603, retrying ci...

@ztlpn
Contributor Author

ztlpn commented May 21, 2024

/ci-repeat

@ztlpn ztlpn marked this pull request as draft June 10, 2024 11:04
Just restarted nodes may have their health reports incomplete because
not all partitions have started yet. Also right after restart the node
is probably busy catching up and replicating data that was produced in
its absence. Because of these two reasons just restarted nodes are bad
candidates for leadership transfers, mute them.
Because in this test we wait for the set of objects in S3 to stabilize,
it is dependent on leader balancer timings and the previous commit makes
it fail. Give it more time to stabilize.
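
A "wait until the set of objects stabilizes" loop of the kind this commit message refers to might look like the sketch below. It is a generic illustration, not the test's actual helper: `wait_until_stable`, `snapshot`, and all parameter names and defaults are hypothetical. "Giving it more time" would correspond to raising `timeout_s`.

```python
import time


def wait_until_stable(snapshot, stable_for=3, timeout_s=120.0, backoff_s=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll `snapshot()` until it returns the same value `stable_for` times in a row.

    Returns the stable value, or raises TimeoutError once `timeout_s` has passed.
    """
    deadline = clock() + timeout_s
    last = snapshot()
    unchanged = 1  # number of consecutive identical snapshots seen so far
    while unchanged < stable_for:
        if clock() >= deadline:
            raise TimeoutError("object set did not stabilize in time")
        sleep(backoff_s)
        current = snapshot()
        unchanged = unchanged + 1 if current == last else 1
        last = current
    return last
```

Here `snapshot` would be a callable that lists the current object keys in the bucket. Background activity such as leadership transfers can keep changing that listing, which is why the timeout matters.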
@ztlpn ztlpn changed the title tests: relax check in AutomaticLeadershipBalancingTest Mute just restarted nodes in leader_balancer Jun 11, 2024
@ztlpn ztlpn marked this pull request as ready for review June 11, 2024 20:25
@ztlpn
Contributor Author

ztlpn commented Jun 11, 2024

After merging #18744 this should be enough to fix #17150

3 participants