Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client-go: data consistency checker for list requests #124963

Conversation

p0lyn0mial
Copy link
Contributor

@p0lyn0mial p0lyn0mial commented May 20, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds data consistency checker to client-go for checking data retrieved via list requests from the cache and directly from etcd. The detector can be enabled by setting KUBE_LIST_FROM_CACHE_INCONSISTENCY_DETECTOR env var.

// CheckListFromCacheDataConsistencyIfRequested performs a data consistency check only when
// the KUBE_LIST_FROM_CACHE_INCONSISTENCY_DETECTOR environment variable was set during a binary startup
// for requests that have a high chance of being served from the watch-cache.
//
// The consistency check is meant to be enforced only in the CI, not in production.
// The check ensures that data retrieved by a list api call from the watch-cache
// is exactly the same as data received by the list api call from etcd.
//
// Note that this function will panic when data inconsistency is detected.
// This is intentional because we want to catch it in the CI.
//
// Note that this function doesn't examine the ListOptions to determine
// if the original request has hit the cache because it would be challenging
// to maintain consistency with the server-side implementation.
// For simplicity, we assume that the first request retrieved data from
// the cache (even though this might not be true for some requests)
// and issue the second call to get data from etcd for comparison.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. area/code-generation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. sig/auth Categorizes an issue or PR as relevant to SIG Auth. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 20, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2024
@k8s-ci-robot k8s-ci-robot requested a review from deads2k May 20, 2024 10:14
@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 20, 2024
@p0lyn0mial
Copy link
Contributor Author

/assign @wojtek-t
/cc @serathius

//
// if ResourceVersion = "" and ConsistendListFromCache is disabled or RequestWatchProgress isn't supported,
// then the request will be served from the storage.
func wasListRequestServedFromStorage(opts metav1.ListOptions) bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copied from the cacher.go - it looks like it does what we want expect the case mentioned in the comment and after clarifying rows 1 and 3 from the KEP.

//
// Note that this function will panic when data inconsistency is detected.
// This is intentional because we want to catch it in the CI.
func CheckListAgainstCacheDataConsistencyIfRequested[T runtime.Object](ctx context.Context, identity string, listItemsFn listItemsFunc[T], optionsUsedToReceiveList metav1.ListOptions, receivedList runtime.Object) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to find a better home for this function. Depending on where it is defined and whether it will be public, we could consider adding more validation for opts (e.g., checking if it was actually a list request).

@@ -64,7 +115,7 @@ func checkWatchListDataConsistencyIfRequested(ctx context.Context, identity stri
// it is guarded by an environmental variable.
// we cannot manipulate the environmental variable because
// it will affect other tests in this package.
func checkDataConsistency(ctx context.Context, identity string, lastSyncedResourceVersion string, listItemsFn listItemsFunc, retrieveCollectedItemsFn retrieveCollectedItemsFunc) {
func checkDataConsistency[T runtime.Object, U any](ctx context.Context, identity string, lastSyncedResourceVersion string, listItemsFn listItemsFunc[T], retrieveCollectedItemsFn retrieveCollectedItemsFunc[U]) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to find a better home for this function. This function is common for checkWatchListDataConsistencyIfRequested and CheckListAgainstCacheDataConsistencyIfRequested.

@p0lyn0mial p0lyn0mial force-pushed the upstream-data-consistency-checker-for-list-requests branch from 6fe1b36 to 6583cc6 Compare May 20, 2024 10:27
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2024
@@ -461,6 +462,11 @@ func new$.type|publicPlural$(c *$.GroupGoName$$.Version$Client) *$.type|privateP
var listTemplate = `
// List takes label and field selectors, and returns the list of $.resultType|publicPlural$ that match those selectors.
func (c *$.type|privatePlural$) List(ctx context.Context, opts $.ListOptions|raw$) (result *$.resultType|raw$List, err error) {
defer func() {
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really what you want? We rather want to call it if error was non-nil, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ups, yeah, it should have been if err == nil - thanks.

@@ -697,7 +697,7 @@ func (r *Reflector) watchList(stopCh <-chan struct{}) (watch.Interface, error) {
// we utilize the temporaryStore to ensure independence from the current store implementation.
// as of today, the store is implemented as a queue and will be drained by the higher-level
// component as soon as it finishes replacing the content.
checkWatchListConsistencyIfRequested(stopCh, r.name, resourceVersion, r.listerWatcher, temporaryStore)
checkWatchListDataConsistencyIfRequested(wait.ContextForChannel(stopCh), fmt.Sprintf("watch-list reflector with name: %q", r.name), resourceVersion, wrapListFuncWithContext(r.listerWatcher.List), temporaryStore.List)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, seeing how you want to use it now and where are the problems, let's first merge this commit as a #124446

checkDataConsistency(ctx, identity, lastSyncedResourceVersion, listItemsFn, func() []runtime.Object { return rawListItems })
}

// wasListRequestServedFromStorage based on the passed ListOptions determines
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rather evolve it to something like:

"if continuation wasn't set, the call LIST with RV= and ResourceVersionMatch=Exact and rest parameters the same"

Seems simpler, potentially makes this check unnecessarily, but it's not a big deal.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2024
@p0lyn0mial p0lyn0mial force-pushed the upstream-data-consistency-checker-for-list-requests branch from 6583cc6 to f7457cb Compare May 27, 2024 13:41
@p0lyn0mial p0lyn0mial changed the title [POC][WIP] data consistency checker for list requests [WIP] data consistency checker for list requests May 27, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2024
@p0lyn0mial
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 27, 2024
@p0lyn0mial
Copy link
Contributor Author

/test pull-kubernetes-node-e2e-containerd

@p0lyn0mial p0lyn0mial force-pushed the upstream-data-consistency-checker-for-list-requests branch from 57c4f77 to 472043e Compare May 29, 2024 10:33
@p0lyn0mial p0lyn0mial force-pushed the upstream-data-consistency-checker-for-list-requests branch from 472043e to 42da3cf Compare May 29, 2024 11:04
@deads2k
Copy link
Contributor

deads2k commented Jun 4, 2024

I approve of the approach, though I suspect further refactoring will be necessary for streaming list.

@p0lyn0mial
Copy link
Contributor Author

I approve of the approach, though I suspect further refactoring will be necessary for streaming list.

thanks, yes, support for streaming list will require more changes to the code generation.

@p0lyn0mial p0lyn0mial force-pushed the upstream-data-consistency-checker-for-list-requests branch 2 times, most recently from c7d63a2 to 06b7d97 Compare June 6, 2024 06:04
@p0lyn0mial
Copy link
Contributor Author

OK, so this PR has a commit that always enables the detector for list requests. Here is the summary of test failures.

I think that we could initially enable the detector for an e2e job like pull-kubernetes-e2e-gce. If that's okay then I could create an issue to capture the following failures and fix them at a later point in time.

Unit Tests (3 failures)

  1. TestEventIsSorted
    The test produces fake data via the getFakeEvents() method, which doesn’t return a stable output (the time changes between invocations). For example:
EventTime: v1.MicroTime{Time: s"2024-06-05 11:01:04.54382 +0000 UTC"}
EventTime: v1.MicroTime{Time: s"2024-06-05 11:01:04.538752 +0000 UTC}
  1. Test_buildClientCertificateManager
    Sets up a fake/custom server which cannot handle the additional request made by the detector.

  2. TestDrain
    Sets up a fake/custom rest-client which is then used by a real client. The additional call made by the detector is intercepted by the test and, since it is unexpected, the test fails.

Integration Tests (2 failures)

  1. TestExecPluginViaClient
    This test is failing because it has a custom round tripper for collecting some data. The detector makes an additional call, causing the data to not match expectations.

  2. TestLegacyServiceAccountTokenTracking
    This test is failing because it has a custom round tripper (actually a warning handler) for collecting warnings. The detector makes an additional call, resulting in more warnings than expected.

@p0lyn0mial p0lyn0mial force-pushed the upstream-data-consistency-checker-for-list-requests branch from 06b7d97 to 448180d Compare June 6, 2024 08:32
@p0lyn0mial p0lyn0mial changed the title [WIP] data consistency checker for list requests client-go: data consistency checker for list requests Jun 6, 2024
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 6, 2024
@fedebongio
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 6, 2024
@wojtek-t
Copy link
Member

wojtek-t commented Jun 7, 2024

Yes - the most important thing is that we understand the failures in unit/integration tests and those are test issues, not real issues. So that's fine.
I'm +1 on enabling that for e2e tests only (at least for now).

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 7, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: dbdb31c5b4c899b9617fe73668262c8f5c915e79

@wojtek-t
Copy link
Member

wojtek-t commented Jun 7, 2024

/kind feature
/retest

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jun 7, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2024
@k8s-ci-robot k8s-ci-robot merged commit 3532601 into kubernetes:master Jun 7, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/code-generation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Closed / Done
Development

Successfully merging this pull request may close these issues.

None yet

5 participants