Define an official performance validation suite for etcd #16467
Comments
Talked with @mborsz, who is a member of Kubernetes SIG Scalability, about how we should approach performance testing of etcd. We came to the conclusion that we need 3 things:
Based on the above points, the work is:
Should the etcd SLIs be part of the contract? Ref: https://docs.google.com/document/d/1NUZDiJeiIH5vo_FMaTWf0JtrQKCx0kpEaIIuPoj9P6A/edit#heading=h.tlkin1a8b8bl
Potentially. Let's try to get some SLIs proposed initially and see how they fit in relation to the current contract. I have been meaning to sit down and list potential SLIs here that we can cherry-pick from; feel free to do the same 🙏🏻
Recording a discussion from KubeCon NA: along with identifying service level indicators as a starting point for this work, we can also take lessons from Kubernetes SIG Scalability to identify a set of dimensions that define the envelope our new performance validation suite will operate within: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md We can review the older benchmark tooling to get a starting point on dimensions and iterate from there.
I expect the performance test suite to help detect or prevent issues like #17529, or to do so through the Kubernetes traffic in the robustness tests. Do we think there is a general gap in performance testing? I can help address it.
Thanks @chaochn47. Yes, my expectation is that the updated performance validation suite, once complete, will catch issues like the one linked earlier. @ivanvc is currently getting some basic prow jobs set up that will run some existing tools like
I don't think so. Performance and correctness are quite different beasts that need different approaches. Checking correctness adds a lot of overhead, while performance measurement wants as little noise as possible in order to give reproducible results. What failed in #17529 was an unknown throughput breaking point that was hiding a correctness issue beneath it. I think we can use performance testing to discover more such breaking points, and then try to simulate them during correctness testing. This was already done in the e2e test that you provided in #17555. Failpoint
Hi Team - @ivanvc and I would like to propose the first service level indicator. We are keen for your feedback on this first one before we move on to proposing additional ones.
Mutating calls being
Please let us know what you think. If this first SLI is accepted we will be updating
If we intend to optimize etcd performance in Kubernetes, IMHO we should generate Kubernetes-like traffic. For example, the rw-heatmap tool uses a mix of read-only and write-only transactions and does not simulate watch traffic. Hopefully this is already on the roadmap.
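As a rough illustration of what a Kubernetes-like mix could mean, the sketch below draws a reproducible sequence of operation types that includes watches alongside reads and writes. The operation names, the weights in `DEFAULT_MIX`, and the `generate_workload` helper are all hypothetical; the weights are placeholders, not measurements from any real cluster, and the sketch does not talk to etcd.

```python
import random

# Hypothetical operation weights for illustration only.
# They are NOT measured from a real Kubernetes control plane.
DEFAULT_MIX = {"range": 0.6, "put": 0.3, "watch": 0.1}


def generate_workload(num_ops, mix=None, seed=0):
    """Return a list of operation names drawn according to the given weights.

    A fixed seed keeps the sequence reproducible, which matters when two
    benchmark runs are meant to be compared against each other.
    """
    mix = DEFAULT_MIX if mix is None else mix
    rng = random.Random(seed)
    ops = list(mix)
    weights = [mix[op] for op in ops]
    return rng.choices(ops, weights=weights, k=num_ops)


if __name__ == "__main__":
    plan = generate_workload(1000)
    for op in DEFAULT_MIX:
        print(op, plan.count(op))
```

A real driver would replace each name with the corresponding client call; the point here is only that watch traffic becomes an explicit, tunable dimension of the workload.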
We need both. This issue is important but is not getting enough attention. Unfortunately I don't have enough time to lead this. Is there someone who could work on this with my guidance?
/assign I can help, since I have recently been looking into etcd performance.
@serathius, @chaochn47 - Please let us know if the first etcd SLI drafted above looks OK. Agreed that watch is critical; there should be an SLI relating to it as well. We intend to work iteratively to propose a larger table of SLIs, as the Kubernetes project has done.
@chaochn47 Can you start by creating a document where we can begin discussing the SLIs? Maybe just copy the Kubernetes SLIs that make sense for etcd and we can iterate on that.
@serathius This is the bare-minimum doc, etcd performance work stream, that I wrote off the top of my head. I will fill in more details and a PoC soon.
What would you like to be added?
The current performance validation process for etcd relies heavily on the Kubernetes scalability tests. While this approach has been valuable, we need to create an official performance validation suite for etcd that is maintained within the project and is therefore more accessible and better integrated into regular project activity.
In my mind this will include developing a comprehensive suite of performance tests that cover various real-world usage scenarios, integrating these tests into some form of on-demand or scheduled etcd CI pipeline, and making that pipeline available to day-to-day work. For example, a pull request proposing a Go version upgrade could be validated for performance regressions.
With this issue I would like to capture the recent discussion in #16463 (comment) and the intent to create an independent, dedicated performance validation mechanism for etcd, so that we do not lose sight of this work. We can use this issue to track ideas and further conversation before starting any work.
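To make the pull request example concrete, here is a minimal sketch of a regression gate that compares a latency percentile between a baseline run and a candidate run. The function names, the choice of p99, and the 10% tolerance are assumptions made for illustration; nothing in this issue has settled on a metric or a threshold yet.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def has_regression(baseline_ms, candidate_ms, pct=99, tolerance=0.10):
    """Return True if the candidate percentile exceeds baseline by > tolerance.

    The default percentile and tolerance are placeholders, not agreed SLIs.
    """
    base = percentile(baseline_ms, pct)
    cand = percentile(candidate_ms, pct)
    return cand > base * (1 + tolerance)


if __name__ == "__main__":
    # Made-up latency samples in milliseconds, for demonstration only.
    baseline = list(range(1, 101))
    candidate = [2 * x for x in baseline]
    print("regression detected:", has_regression(baseline, candidate))
```

A CI job could run the same benchmark against the main branch and the pull request branch, feed both sample sets to `has_regression`, and fail the job when it returns True. Low-noise, reproducible runs are a precondition for a gate like this to be meaningful.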
References:
rw-heatmaps using golang and rename it to rw-benchmark #15060
Why is this needed?
Sub task tracking