Timeout awaiting response - random short streaks #2700

Open
Crossbow78 opened this issue Apr 18, 2024 · 3 comments

Comments

@Crossbow78

Our .NET 8 application is running in Kubernetes, and we have 3 service pods using a Redis cache that is hosted as a dedicated redis server pod in the same Kubernetes namespace. The redis pod has persistent storage and uses an append-only file.

We're experiencing short streaks of redis client timeouts at random times throughout the day. For example, one day it's 20 failing requests within a period of 10 seconds in the morning; the next day it's 10 failing requests within 5 seconds in the afternoon.

There are around 200k distributed cache requests per day, and even though the failure rate is very low (about 10-100 requests per day), the failures do have a negative effect on the application. The values stored and retrieved are very small and rarely exceed a few kilobytes.

We do not see CPU spikes on either the redis client (application) or the redis server pod.

All exceptions have very similar metrics, and look as follows:

Timeout awaiting response (outbound=7KiB, inbound=0KiB, 5512ms elapsed, timeout is 5000ms), command=HSET, next: HSET b1547e0e-ca13-4531-ae31-3174bc7dea2b, inst: 0, qu: 0, qs: 1, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, last-in: 7293, cur-in: 0, sync-ops: 0, async-ops: 24789596, serverEndpoint: redis-myserviceapi:6379, conn-sec: 2355933.28, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: myserviceapi-69b7b7ff56-njnds(SE.Redis-v2.7.17.27058), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767), POOL: (Threads=3,QueuedItems=0,CompletedItems=92320083,Timers=14), v: 2.7.17.27058 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

I cannot see anything suspicious here, although I must admit I find it hard to interpret these numbers.
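(Side note on the thread-pool numbers: the linked Timeouts article lists thread-pool growth throttling as a common client-side cause, and Min=1 in the IOCP/WORKER stats above means the pool starts from a single thread before ramping up. Whether that matters here is unclear, since QueuedItems=0 in the same dump, but a minimal sketch of raising the minimums at startup looks like this; the values are illustrative only, not a recommendation.)

```csharp
using System;
using System.Threading;

// Illustrative values only: raising the thread-pool minimums is one of the
// client-side mitigations discussed in the linked Timeouts article.
ThreadPool.GetMinThreads(out var minWorker, out var minIocp);
Console.WriteLine($"Current minimums: worker={minWorker}, iocp={minIocp}");
ThreadPool.SetMinThreads(workerThreads: 50, completionPortThreads: 50);
```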

The SLOWLOG on the redis server shows no correlation with the client timeout exceptions.

How could we proceed in finding the root cause of (and a mitigation for) these timeouts? How can we even tell whether this is caused by the server or the client?
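A minimal sketch of one way to gather more client-side signal, assuming access to the application's shared ConnectionMultiplexer (named muxer here): subscribe to the client's diagnostic events and log them alongside the timeouts.

```csharp
using System;
using StackExchange.Redis;

// Sketch only: "muxer" stands in for the application's shared ConnectionMultiplexer.
// Logging these events makes it easier to correlate connection-level blips with
// the timeout streaks.
static void HookDiagnostics(ConnectionMultiplexer muxer)
{
    muxer.ConnectionFailed += (_, e) =>
        Console.WriteLine($"ConnectionFailed {e.EndPoint} {e.FailureType}: {e.Exception?.Message}");
    muxer.ConnectionRestored += (_, e) =>
        Console.WriteLine($"ConnectionRestored {e.EndPoint} {e.FailureType}");
    muxer.ErrorMessage += (_, e) =>
        Console.WriteLine($"Server error from {e.EndPoint}: {e.Message}");
    muxer.InternalError += (_, e) =>
        Console.WriteLine($"InternalError at {e.Origin}: {e.Exception?.Message}");
}
```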

@aaronvarga commented Apr 25, 2024

I don't mean to hijack the thread, but I'm experiencing very similar behavior. I also don't see anything super obvious, though I too find these numbers difficult to interpret. Here are a few of the errors I'm receiving:

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 60187ms elapsed, timeout is 18000ms), command=PING, next: PING, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: DequeueResult, ws: Idle, in: 0, last-in: 28, cur-in: 7, sync-ops: 0, async-ops: 39104, serverEndpoint: redacted.redis.cache.windows.net:6380, conn-sec: 1173267.98, aoc: 0, mc: 1/1/0, mgr: 8 of 10 available, clientName: redacted(SE.Redis-v2.7.4.20928), IOCP: (Busy=4,Free=996,Min=25,Max=1000), WORKER: (Busy=9,Free=1014,Min=25,Max=1023), POOL: (Threads=15,QueuedItems=25,CompletedItems=21339081,Timers=89), v: 2.7.4.20928 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 26188ms elapsed, timeout is 18000ms), command=EXPIRE, next: EXPIRE MyCacheKey, inst: 0, qu: 0, qs: 0, aw: True, bw: SpinningDown, rs: DequeueResult, ws: Writing, in: 0, last-in: 0, cur-in: 8, sync-ops: 0, async-ops: 52696, serverEndpoint: redacted.redis.cache.windows.net:6380, conn-sec: 850021.5, aoc: 0, mc: 1/1/0, mgr: 7 of 10 available, clientName: redacted(SE.Redis-v2.7.4.20928), IOCP: (Busy=0,Free=1000,Min=25,Max=1000), WORKER: (Busy=10,Free=1013,Min=25,Max=1023), POOL: (Threads=31,QueuedItems=0,CompletedItems=15462543,Timers=141), v: 2.7.4.20928 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 24641ms elapsed, timeout is 18000ms), command=HMGET, next: HMGET MyCacheKey, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: DequeueResult, ws: Idle, in: 0, last-in: 0, cur-in: 868, sync-ops: 0, async-ops: 62872, serverEndpoint: redacted.redis.cache.windows.net:6380, conn-sec: 929628.06, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: redacted(SE.Redis-v2.7.4.20928), IOCP: (Busy=1,Free=999,Min=25,Max=1000), WORKER: (Busy=5,Free=1018,Min=25,Max=1023), POOL: (Threads=8,QueuedItems=0,CompletedItems=16944007,Timers=102), v: 2.7.4.20928 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
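(For anyone wondering where the 18000ms in these messages comes from: it's the client's configured synchronous/asynchronous operation timeout. A hedged sketch of how that is typically set; the endpoint and values below are placeholders, not the actual configuration behind these dumps.)

```csharp
using StackExchange.Redis;

// Placeholder endpoint and values, shown only to illustrate which settings the
// "timeout is 18000ms" in the messages above refers to.
var options = ConfigurationOptions.Parse("redacted.redis.cache.windows.net:6380,ssl=true,abortConnect=false");
options.SyncTimeout = 18000;   // budget (ms) for synchronous operations
options.AsyncTimeout = 18000;  // budget (ms) for asynchronous operations
var muxer = ConnectionMultiplexer.Connect(options);
```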

@NickCraver (Collaborator)

@Crossbow78 Is there a chance the nodes are swapping on the network side somewhere? It appears that we legitimately did not get a response. Given the low volume, I'd even suggest a long-running packet capture scoped to just those commands, to see what's happening. I don't see anything odd in your client stats that indicates a problem, unless that outbound is common. Do all of your occurrences have an outbound buffer that hasn't been sent?

@aaronvarga Yours is very perplexing: those timeout differentials tell us that the timer which processes the backlog isn't firing for up to 40 seconds, which is absolutely crazy lag. Is the machine pegged on CPU, or low on resources in general? That's one of the most extreme timer lags we've had reported.
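(One hedged way to test the timer-lag theory independently of Redis: run a cheap periodic timer in the same process and log whenever a tick arrives far later than scheduled. A minimal sketch; the 1-second period and 2-second drift threshold are arbitrary.)

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical watchdog (not part of StackExchange.Redis): if this 1-second timer's
// ticks arrive tens of seconds late, the lag is local (thread pool, timers, CPU
// throttling) rather than anything on the Redis server.
var sw = Stopwatch.StartNew();
var last = sw.Elapsed;
// Keep a reference to the timer so it is not garbage collected.
var watchdog = new Timer(_ =>
{
    var now = sw.Elapsed;
    var drift = now - last - TimeSpan.FromSeconds(1);
    if (drift > TimeSpan.FromSeconds(2))
        Console.WriteLine($"{DateTime.UtcNow:O} timer tick drifted by {drift.TotalSeconds:F1}s");
    last = now;
}, null, dueTime: TimeSpan.FromSeconds(1), period: TimeSpan.FromSeconds(1));
```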

@Crossbow78 (Author)

Thanks for your response, @NickCraver!

I have collected all exception metrics in an Excel sheet to get a better overview, and I can see that in 90% of the cases both 'outbound' and 'inbound' are 0.

Your remark about nodes triggered me to double-check how our pods are distributed over the underlying k8s cluster nodes. I found that 2 application pods were running on different nodes than the redis pod, and 1 was running on the same node. The distribution of client timeouts, however, is spread evenly across all 3 application pods.

I'm not sure I understand what kind of network swapping could be involved, nor what a long-running scoped packet capture would look like. If you could elaborate, I'd be happy to give it a try to gain more insight.
