Timeout awaiting response - random short streaks #2700

Open
Crossbow78 opened this issue Apr 18, 2024 · 3 comments

Comments

@Crossbow78

Our .NET 8 application is running in Kubernetes, and we have 3 service pods using a Redis cache that is hosted as a dedicated redis server pod in the same Kubernetes namespace. The redis pod has persistent storage and uses an append-only file.

We're experiencing short streaks of redis client timeouts at random times throughout the day. For example, one day it's 20 failing requests within a period of 10 seconds in the morning; the next day it's 10 failing requests within 5 seconds in the afternoon.

There are around 200k distributed cache requests per day, and even though the failure rate is very low (about 10-100 requests per day), the failures do have a negative effect on the application. The values stored and retrieved are very small and rarely exceed a few kilobytes.

We do not see CPU spikes on either the redis client (application) or the redis server pod.

All exceptions have very similar metrics, and look as follows:

Timeout awaiting response (outbound=7KiB, inbound=0KiB, 5512ms elapsed, timeout is 5000ms), command=HSET, next: HSET b1547e0e-ca13-4531-ae31-3174bc7dea2b, inst: 0, qu: 0, qs: 1, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, last-in: 7293, cur-in: 0, sync-ops: 0, async-ops: 24789596, serverEndpoint: redis-myserviceapi:6379, conn-sec: 2355933.28, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: myserviceapi-69b7b7ff56-njnds(SE.Redis-v2.7.17.27058), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767), POOL: (Threads=3,QueuedItems=0,CompletedItems=92320083,Timers=14), v: 2.7.17.27058 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

I cannot see anything suspicious here, although I must admit I find it hard to interpret these numbers.
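(Side note on the thread-pool numbers: the linked Timeouts article lists thread-pool growth throttling as a common client-side cause, and Min=1 in the IOCP/WORKER stats above means the pool starts from a single thread before ramping up. Whether that matters here is unclear, since QueuedItems=0 in the same dump, but a minimal sketch of raising the minimums at startup looks like this; the values are illustrative only, not a recommendation.)

```csharp
using System;
using System.Threading;

// Illustrative values only: raising the thread-pool minimums is one of the
// client-side mitigations discussed in the linked Timeouts article.
ThreadPool.GetMinThreads(out var minWorker, out var minIocp);
Console.WriteLine($"Current minimums: worker={minWorker}, iocp={minIocp}");
ThreadPool.SetMinThreads(workerThreads: 50, completionPortThreads: 50);
```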

The SLOWLOG on the redis server shows no correlation with the client timeout exceptions.

How could we proceed in finding the root cause of (and a mitigation for) these timeouts? How can we even tell whether this is caused by the server or the client?
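A minimal sketch of one way to gather more client-side signal, assuming access to the application's shared ConnectionMultiplexer (named muxer here): subscribe to the client's diagnostic events and log them alongside the timeouts.

```csharp
using System;
using StackExchange.Redis;

// Sketch only: "muxer" stands in for the application's shared ConnectionMultiplexer.
// Logging these events makes it easier to correlate connection-level blips with
// the timeout streaks.
static void HookDiagnostics(ConnectionMultiplexer muxer)
{
    muxer.ConnectionFailed += (_, e) =>
        Console.WriteLine($"ConnectionFailed {e.EndPoint} {e.FailureType}: {e.Exception?.Message}");
    muxer.ConnectionRestored += (_, e) =>
        Console.WriteLine($"ConnectionRestored {e.EndPoint} {e.FailureType}");
    muxer.ErrorMessage += (_, e) =>
        Console.WriteLine($"Server error from {e.EndPoint}: {e.Message}");
    muxer.InternalError += (_, e) =>
        Console.WriteLine($"InternalError at {e.Origin}: {e.Exception?.Message}");
}
```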

@aaronvarga commented Apr 25, 2024

I don't mean to hijack the thread, but I'm experiencing very similar behavior. I also don't see anything super obvious, though I too find these numbers difficult to interpret. Here are a few of the errors I'm receiving:

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 60187ms elapsed, timeout is 18000ms), command=PING, next: PING, inst: 0, qu: 0, qs: 0, aw: False, bw: Inactive, rs: DequeueResult, ws: Idle, in: 0, last-in: 28, cur-in: 7, sync-ops: 0, async-ops: 39104, serverEndpoint: redacted.redis.cache.windows.net:6380, conn-sec: 1173267.98, aoc: 0, mc: 1/1/0, mgr: 8 of 10 available, clientName: redacted(SE.Redis-v2.7.4.20928), IOCP: (Busy=4,Free=996,Min=25,Max=1000), WORKER: (Busy=9,Free=1014,Min=25,Max=1023), POOL: (Threads=15,QueuedItems=25,CompletedItems=21339081,Timers=89), v: 2.7.4.20928 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 26188ms elapsed, timeout is 18000ms), command=EXPIRE, next: EXPIRE MyCacheKey, inst: 0, qu: 0, qs: 0, aw: True, bw: SpinningDown, rs: DequeueResult, ws: Writing, in: 0, last-in: 0, cur-in: 8, sync-ops: 0, async-ops: 52696, serverEndpoint: redacted.redis.cache.windows.net:6380, conn-sec: 850021.5, aoc: 0, mc: 1/1/0, mgr: 7 of 10 available, clientName: redacted(SE.Redis-v2.7.4.20928), IOCP: (Busy=0,Free=1000,Min=25,Max=1000), WORKER: (Busy=10,Free=1013,Min=25,Max=1023), POOL: (Threads=31,QueuedItems=0,CompletedItems=15462543,Timers=141), v: 2.7.4.20928 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Timeout awaiting response (outbound=0KiB, inbound=0KiB, 24641ms elapsed, timeout is 18000ms), command=HMGET, next: HMGET MyCacheKey, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: DequeueResult, ws: Idle, in: 0, last-in: 0, cur-in: 868, sync-ops: 0, async-ops: 62872, serverEndpoint: redacted.redis.cache.windows.net:6380, conn-sec: 929628.06, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: redacted(SE.Redis-v2.7.4.20928), IOCP: (Busy=1,Free=999,Min=25,Max=1000), WORKER: (Busy=5,Free=1018,Min=25,Max=1023), POOL: (Threads=8,QueuedItems=0,CompletedItems=16944007,Timers=102), v: 2.7.4.20928 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
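(For anyone wondering where the 18000ms in these messages comes from: it's the client's configured synchronous/asynchronous operation timeout. A hedged sketch of how that is typically set; the endpoint and values below are placeholders, not the actual configuration behind these dumps.)

```csharp
using StackExchange.Redis;

// Placeholder endpoint and values, shown only to illustrate which settings the
// "timeout is 18000ms" in the messages above refers to.
var options = ConfigurationOptions.Parse("redacted.redis.cache.windows.net:6380,ssl=true,abortConnect=false");
options.SyncTimeout = 18000;   // budget (ms) for synchronous operations
options.AsyncTimeout = 18000;  // budget (ms) for asynchronous operations
var muxer = ConnectionMultiplexer.Connect(options);
```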

@NickCraver (Collaborator)

@Crossbow78 Is there a chance the nodes are swapping on the network side somewhere? It appears that we legitimately did not get a response. Given the low volume, I'd even suggest a long-running packet capture scoped to just those commands, to see what's happening. I don't see anything odd in your client stats that indicates a problem, unless that outbound is common. Do all of your occurrences have an outbound buffer that hasn't been sent?

@aaronvarga Yours is very perplexing: those timeout differentials tell us that the timer which processes the backlog isn't firing for up to 40 seconds, which is absolutely crazy lag. Is the machine pegged on CPU, or low on resources in general? That's one of the most extreme timer lags we've had reported.
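(One hedged way to test the timer-lag theory independently of Redis: run a cheap periodic timer in the same process and log whenever a tick arrives far later than scheduled. A minimal sketch; the 1-second period and 2-second drift threshold are arbitrary.)

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical watchdog (not part of StackExchange.Redis): if this 1-second timer's
// ticks arrive tens of seconds late, the lag is local (thread pool, timers, CPU
// throttling) rather than anything on the Redis server.
var sw = Stopwatch.StartNew();
var last = sw.Elapsed;
// Keep a reference to the timer so it is not garbage collected.
var watchdog = new Timer(_ =>
{
    var now = sw.Elapsed;
    var drift = now - last - TimeSpan.FromSeconds(1);
    if (drift > TimeSpan.FromSeconds(2))
        Console.WriteLine($"{DateTime.UtcNow:O} timer tick drifted by {drift.TotalSeconds:F1}s");
    last = now;
}, null, dueTime: TimeSpan.FromSeconds(1), period: TimeSpan.FromSeconds(1));
```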

@Crossbow78 (Author)

Thanks for your response, @NickCraver!

I have collected all exception metrics in an Excel sheet to get a better overview, and I can see that in 90% of the cases both 'outbound' and 'inbound' are 0.

Your remark about nodes triggered me to double-check how our pods are distributed over the underlying k8s cluster nodes. I found that 2 application pods were running on different nodes than the redis pod, and 1 was running on the same node. The distribution of client timeouts, however, is spread evenly across all 3 application pods.

I'm not sure I understand what kind of network swapping could be involved, nor what a long-running scoped packet capture would look like. If you could elaborate, I'd be happy to give it a try to gain more insight.
