Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: TestDatabasesTablesV2 failed #124395

Closed
cockroach-teamcity opened this issue May 19, 2024 · 4 comments · Fixed by #124579
Closed

server: TestDatabasesTablesV2 failed #124395

cockroach-teamcity opened this issue May 19, 2024 · 4 comments · Fixed by #124579
Assignees
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-observability

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 19, 2024

server.TestDatabasesTablesV2 failed on master @ 93ad913106b6f0f6ec98bc2cfa788ff6d8085bd4:

=== RUN   TestDatabasesTablesV2
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestDatabasesTablesV23023245521
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:144: cluster virtualization disabled in global scope due to issue: #76378 (expected label: C-bug)
    api_v2_sql_schema_test.go:148: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/server/api_v2_sql_schema_test.go:148
        	Error:      	Received unexpected error:
        	            	Get "https://127.0.0.1:33161/api/v2/databases/testdb/tables/public.testtable/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestDatabasesTablesV2
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestDatabasesTablesV23023245521
--- FAIL: TestDatabasesTablesV2 (54.20s)

Parameters:

  • attempt=1
  • race=true
  • run=2
  • shard=5
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/obs-prs @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-38868

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability labels May 19, 2024
@abarganier abarganier self-assigned this May 20, 2024
@abarganier
Copy link
Member

This failed last week here, which I chalked up to a flake.

Given the newest failures, I'm going to give this a closer look.

@abarganier
Copy link
Member

abarganier commented May 20, 2024

I was able to reproduce locally under --stress --race.

From what I can tell, the test cluster/server is receiving signals to shutdown before the test completes. The specific call that's failing alone does not seem to be problematic.

It seems that this query is failing to finish executing. When we attempt to read through the result set, it.Next() returns the error:

‹admin-show-grants›: failed to read query result: query execution canceled

Looking at the timestamps in the logs, quite a bit of time passes from the time the test cluster is initialized, until the TableDetails endpoint is invoked. However, the amount of time spent within the TableDetails endpoint before the test cluster begins shutting down is only ~10 seconds, which should not trigger the request timeout.

  • 17:51:04 - Test cluster initialized
  • 17:51:07 - First indication of the test interacting with the test cluster (
  • 17:52:53 - TableDetails endpoint request received
  • 17:53:01 - Endpoint runs admin-show-grants query
  • 17:53:03.417 - "TestCluster quiescing nodes"
  • 17:53:03.744 - admin-show-grants query returns
  • 17:53:03.785 - Attempting to iterate result set fails

Seems like the endpoint itself is responding in an acceptable amount of time. However, the amount of time it takes for the test to reach the point where it makes a request seems to be too long.

Plenty of things happen between the test's first interaction with the test cluster and the timeout during the specific call to TableDetails. I'm going to see if I can pinpoint the portion of the test that's taking so long.

@abarganier
Copy link
Member

I added some timers to various parts of the test.

api_v2_sql_schema_test.go:85: Table & DB creation time - start: 18:42:56.787315, duration 22.221347291s
api_v2_sql_schema_test.go:97: /databases/ call - start: 18:43:37.780579, duration 1.719976542s
api_v2_sql_schema_test.go:112: /databases/testdb/ call - start: 18:43:39.50957, duration 3.060485459s
api_v2_sql_schema_test.go:125: /databases/testdb/grants/ call - start: 18:43:42.642032, duration 1.945669334s
api_v2_sql_schema_test.go:139: /databases/testdb/tables/ call - start: 18:43:44.588601, duration 2.728289625s
api_v2_sql_schema_test.go:156: /databases/testdb2/tables/ call (should 404) - start: 18:43:47.317876, duration 515.812625ms

...and the cluster log indicating that the cluster has received a signal to shut down:

I240520 18:43:57.957227 51 testutils/testcluster/testcluster.go:148 ⋮ [-] 1191  TestCluster quiescing nodes

22 seconds for creation of DBs/tables/users/roles required by the test feels a bit long, but is that the true problem?

From the first moment the test begins to interact with the test cluster until the test cluster receives a signal to shut down, only around ~1 minute passes.

I'm going to look into the default timeouts and see if this is expected, and/or if we need to increase the test timeout.

@dhartunian
Copy link
Collaborator

@abarganier I think it's significant that the test fails right after a test that checks for a 404 Not Found response. Is it possible that the helper we use to retrieve table data leaks resources in the error case? That could cause this odd error result.

This is the handler which looks innocuous:

func (a *apiV2Server) tableDetails(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
username := authserver.UserFromHTTPAuthInfoContext(ctx)
ctx = a.sqlServer.AnnotateCtx(ctx)
pathVars := mux.Vars(r)
req := &serverpb.TableDetailsRequest{
Database: pathVars["database_name"],
Table: pathVars["table_name"],
}
resp, err := a.admin.tableDetailsHelper(ctx, req, username)
if err != nil {
if status.Code(err) == codes.NotFound || isNotFoundError(err) {
http.Error(w, "database or table not found", http.StatusNotFound)
} else {
srverrors.APIV2InternalError(ctx, err, w)
}
return
}
apiutil.WriteJSONResponse(ctx, w, 200, tableDetailsResponse(*resp))
}

But the tableDetailsHelper is quite hefty and has all sorts of stuff going on.

@dhartunian dhartunian added P-2 Issues/test failures with a fix SLA of 3 months and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 22, 2024
dhartunian added a commit to dhartunian/cockroach that referenced this issue May 22, 2024
Previously, we'd use a 10s timeout on HTTP clients for all test runs.
This change modifies this to be `40s` under race.

Resolves cockroachdb#124395
Resolves cockroachdb#122892

Release note: None

<pkg>: <short description - lowercase, no final period>

<what was there before: Previously, ...>
<why it needed to change: This was inadequate because ...>
<what you did about it: To address this, this patch ...>
dhartunian added a commit to dhartunian/cockroach that referenced this issue May 22, 2024
Previously, we'd use a 10s timeout on HTTP clients for all test runs.
This change modifies this to be `40s` under race.

Resolves cockroachdb#124395
Resolves cockroachdb#122892

Release note: None
craig bot pushed a commit that referenced this issue May 23, 2024
124579: server: increase test http client timeouts under race r=kyle-a-wong a=dhartunian

Previously, we'd use a 10s timeout on HTTP clients for all test runs.
This change modifies this to be `30s` under race.

Resolves #124395
Resolves #122892

Release note: None

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
@craig craig bot closed this as completed in 8c99abc May 23, 2024
blathers-crl bot pushed a commit that referenced this issue May 23, 2024
Previously, we'd use a 10s timeout on HTTP clients for all test runs.
This change modifies this to be `30s` under race.

Resolves #124395
Resolves #122892

Release note: None
blathers-crl bot pushed a commit that referenced this issue May 23, 2024
Previously, we'd use a 10s timeout on HTTP clients for all test runs.
This change modifies this to be `30s` under race.

Resolves #124395
Resolves #122892

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-observability
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants