Mute just restarted nodes in leader_balancer #18497
base: dev
Conversation
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d17-82bc-48cf-b433-0c9851414504
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d1f-a86b-476f-af6b-18eede55586a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d1f-a86e-4f3a-bf44-258891b862ca
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f81-40f2-8711-9c4910f8634f
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68ec-4efa-96e8-a7e7a72f486f
```python
for s, count in shard2leaders.items():
    # Check with a lot of slack because leader balancer may not be able to achieve
    expected_min = math.floor(expected_on_shard * 0.8)
```
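A minimal, self-contained sketch of the slack check the diff fragment above belongs to (the helper name, the example values, and the assertion message are assumptions; `shard2leaders` and `expected_on_shard` come from the test):

```python
import math


def check_leader_balance(shard2leaders, expected_on_shard, slack=0.8):
    """Verify each shard holds at least a `slack` fraction of the expected
    leader count. This is a loose lower bound, not a check for perfect
    balance, because the leader balancer may never fully converge."""
    expected_min = math.floor(expected_on_shard * slack)
    for shard, count in shard2leaders.items():
        assert count >= expected_min, (
            f"shard {shard} has {count} leaders, expected at least {expected_min}")


# Hypothetical example: three shards, ~10 leaders expected on each.
check_leader_balance({"n1/0": 10, "n2/0": 9, "n3/0": 11}, expected_on_shard=10)
```

With `slack=0.8` and `expected_on_shard=10`, any shard with fewer than 8 leaders fails the check; this is the "lot of slack" the comment thread below debates tightening.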
wonder if we should mark this as ok_to_fail instead, so we don't lose track of tightening the check once the underlying issue is fixed.
Well, if it is marked ok_to_fail we will surely lose track :) and I don't think we can mark individual assertions ok_to_fail... Also, even in this form the check is somewhat useful.
new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f86-4639-be96-7e0e03f9e76b:
new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c581-409c-b052-78e58d78c3c4:
new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f81-40f2-8711-9c4910f8634f:
new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f89-46ed-b36f-d8f40d5f346a:
new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c585-4e68-89d5-245de545bb40:
new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c583-4ebe-a3d7-96108f6a4b42:
new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c57e-4a43-97a7-8423e98bb3c6:
new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8300-c364-4b33-9d07-7b67d4bb629d:
new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fbe-433f-9aa9-bf906ba7c3c4:
new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fbc-46e2-94ab-2fea83f5e43d:
new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fc2-4294-9018-8ace8d69812e:
new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e3-4246-a9f1-0e9ff37873de:
new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e6-4789-bc86-964fc89cb749:
new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e9-47b9-9923-9e87cf7f343c:
new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68ec-4efa-96e8-a7e7a72f486f:
@ztlpn is this ready for review? Lots of failures, so unsure if they are related or not.
@bharathv they are related, though this is more of a test problem. Currently discussing with the storage team how to fix the test.
merged #18603, retrying ci...
/ci-repeat |
Just restarted nodes may have incomplete health reports because not all partitions have started yet. Also, right after a restart the node is probably busy catching up and replicating data that was produced in its absence. For these two reasons, just restarted nodes are bad candidates for leadership transfers, so mute them.
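The muting logic described above can be sketched as follows. This is a simplified illustration, not Redpanda's actual C++ implementation; the class and method names, and the timeout value, are all assumptions:

```python
import time


class LeaderBalancerMuter:
    """Sketch: exclude recently restarted nodes from leadership transfer
    targets. A node stays muted for `node_mute_timeout` seconds after it
    comes back up, since its health report may miss partitions that have
    not started yet and it is likely busy recovering."""

    def __init__(self, node_mute_timeout=300.0):
        self.node_mute_timeout = node_mute_timeout
        self._start_times = {}  # node_id -> last observed startup time

    def on_node_up(self, node_id, now=None):
        # Record when the node was (re)started.
        self._start_times[node_id] = now if now is not None else time.monotonic()

    def is_muted(self, node_id, now=None):
        now = now if now is not None else time.monotonic()
        started = self._start_times.get(node_id)
        # Nodes with no recorded restart, or restarted long ago, are eligible.
        return started is not None and (now - started) < self.node_mute_timeout

    def transfer_candidates(self, nodes, now=None):
        return [n for n in nodes if not self.is_muted(n, now)]
```

The design choice here is a time-based mute rather than waiting for the node's health report to become complete, which keeps the balancer's decision purely local.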
Because this test waits for the set of objects in S3 to stabilize, it depends on leader balancer timings, and the previous commit makes it fail. Give it more time to stabilize.
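A stabilization wait of the kind described above might look like the sketch below. The function name, polling scheme, and all timing values are assumptions, not the test's actual code:

```python
import time


def wait_until_stable(list_objects, stable_for=30.0, timeout=600.0, poll=5.0):
    """Poll list_objects() until the returned set of objects is unchanged
    for `stable_for` seconds; raise TimeoutError if that never happens
    within `timeout`. Loosening `timeout` is the "give it more time" knob."""
    deadline = time.monotonic() + timeout
    prev, prev_at = None, None
    while time.monotonic() < deadline:
        cur = set(list_objects())
        now = time.monotonic()
        if cur != prev:
            # Set changed (or first observation): restart the stability clock.
            prev, prev_at = cur, now
        elif now - prev_at >= stable_for:
            return cur
        time.sleep(poll)
    raise TimeoutError("object set in S3 did not stabilize in time")
```

Because leadership transfers trigger uploads, a balancer that keeps moving leaders keeps perturbing the object set, which is why the wait is sensitive to leader balancer timings.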
Mute just restarted nodes in the leader_balancer, as their health reports can have incomplete partition info, and they are probably busy recovering partitions anyway.
Fixes #17150
Backports Required
Release Notes
Improvements