linux: exploit eventfd in EPOLLET mode to avoid reading per wakeup #4400

base: v1.x
Conversation
Register the eventfd with EPOLLET to enable edge-triggered notification, which lets us eliminate the overhead of reading the eventfd via a system call on each wakeup event. When the eventfd counter reaches the maximum value of an unsigned 64-bit integer, which may never happen during the lifetime of the process, we rewind the counter and retry. This optimization saves one system call per event-loop wakeup, eliminating the overhead of read(2) as well as the extra latency on each epoll wakeup.

Signed-off-by: Andy Pan <i@andypan.me>
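The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not libuv's actual code: `async_init` and `async_send` are hypothetical names, and it is Linux-only.

```c
/* Sketch of the PR's approach: register an eventfd with EPOLLET so the
 * loop is woken on each write without having to read(2) the counter back
 * down on every wakeup. Illustrative names, Linux-only. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int async_init(int *epfd, int *efd) {
  struct epoll_event ev;

  *efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
  if (*efd < 0) return -1;

  *epfd = epoll_create1(EPOLL_CLOEXEC);
  if (*epfd < 0) return -1;

  memset(&ev, 0, sizeof(ev));
  ev.events = EPOLLIN | EPOLLET;  /* edge-triggered: no read per wakeup */
  ev.data.fd = *efd;
  return epoll_ctl(*epfd, EPOLL_CTL_ADD, *efd, &ev);
}

/* Wake the loop. If the 64-bit counter ever saturates (write fails with
 * EAGAIN), rewind it with a single read and retry, as the description
 * above outlines. */
int async_send(int efd) {
  static const uint64_t one = 1;
  uint64_t drain;

  if (write(efd, &one, sizeof(one)) == (ssize_t) sizeof(one)) return 0;
  if (errno != EAGAIN) return -1;
  if (read(efd, &drain, sizeof(drain)) != (ssize_t) sizeof(drain)) return -1;
  return write(efd, &one, sizeof(one)) == (ssize_t) sizeof(one) ? 0 : -1;
}
```

The rewind branch in `async_send` should be hit at most once in a blue moon: the counter would have to accumulate nearly 2^64 sends without the loop ever being destroyed.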
My initial reaction is that this is not how edge triggering would usually work for epoll, so I am skeptical of how it would work here (it would work if this were kevent, for which there is a different PR open now, but it is not). The man page for eventfd does not seem to mention that you are permitted to skip the read syscall. The eventfd_read and eventfd_write functions are documented to be thin wrappers around read/write, so I would prefer to stay without those added wrappers as well. Is there an undocumented Linux kernel bug or feature here such that edge triggering on an eventfd does not follow normal edge-trigger semantics (which normally require 2 read calls every time: to trigger EAGAIN and re-arm the edge trigger)?
I'm not sure what these "normal edge trigger semantics" look like in your mind, but I'm sure that "which normally requires 2 read calls every time to trigger EAGAIN and re-arm the edge trigger" is wrong: I don't think we need to re-arm the events of an eventfd. Back to the rationale behind this PR, the main difference between level-triggered and edge-triggered mode here is whether the kernel puts the fd back on the ready list. I'll quote the relevant kernel code:
Every time we write to the eventfd, the kernel queues a wakeup; whether the fd is put back on the ready list afterwards is decided by this branch in the kernel's ep_send_events():

```c
} else if (!(epi->event.events & EPOLLET)) {
        /*
         * If this file has been added with Level
         * Trigger mode, we need to insert back inside
         * the ready list, so that the next call to
         * epoll_wait() will check again the events
         * availability. At this point, no one can insert
         * into ep->rdllist besides us. The epoll_ctl()
         * callers are locked out by
         * ep_send_events() holding "mtx" and the
         * poll callback will queue them in ep->ovflist.
         */
        list_add_tail(&epi->rdllink, &ep->rdllist);
        ep_pm_stay_awake(epi);
}
```
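The user-visible effect of that branch can be checked with a small runnable sketch (illustrative names, Linux-only): count how many consecutive `epoll_wait()` calls report an eventfd readable when nobody ever reads it.

```c
/* Compare LT and ET delivery for an eventfd that is never read.
 * Illustrative sketch, Linux-only. */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns how many consecutive epoll_wait() calls (capped at 3) report
 * the fd readable after a single write, without reading the counter. */
int wakeups_without_read(uint32_t events) {
  struct epoll_event ev;
  uint64_t one = 1;
  int efd, epfd, hits;

  efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
  epfd = epoll_create1(EPOLL_CLOEXEC);
  memset(&ev, 0, sizeof(ev));
  ev.events = events;
  ev.data.fd = efd;
  epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

  (void) write(efd, &one, sizeof(one));
  for (hits = 0; hits < 3; hits++)
    if (epoll_wait(epfd, &ev, 1, 0) != 1)
      break;  /* no longer on the ready list */

  close(efd);
  close(epfd);
  return hits;
}
```

With plain `EPOLLIN` the fd is re-added to the ready list every time (the `else if` branch above), so the cap of 3 is reached; with `EPOLLIN | EPOLLET` it is reported exactly once.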
Just in case, one more thing about reading an eventfd: a single read(2) returns the whole 8-byte counter and resets it to zero. Therefore, reading until EAGAIN is returned is redundant for an eventfd.
I think that's the common understanding when using an fd that points to a socket, for example, right? Is this different in the case of eventfd?
Yes, we often do that when working with sockets, and that's legit.
To clarify, sockets and eventfd share the same underlying epoll machinery.
Hmm, I still don't get it. Let's say we have the loop thread plus 4 others.
Where does EPOLLET play here? Is this an optimization for the case in which we read the value before all threads have incremented it? So say we read 2, then another thread performs the write and we'd be woken up again to read a 1, and so on?
Sorry, I don't get your example because I don't see what it has to do with EPOLLET.
Perhaps some sample code would help. Granted, if one of the threads performs the write after the loop has read the value, a new wakeup will happen, and that is OK, because we don't know how much later that was. What am I missing?
If you were talking about N threads each calling uv_async_send to wake one specific thread t1, and the number of wakeups being less than N because t1 read the eventfd and reset it to zero, then that's true regardless of whether we use the old implementation with LT or the new one with ET. If you want to trigger exactly N wakeups, you need to specify EFD_SEMAPHORE when creating the eventfd.
Again, this will happen with either LT or ET.
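For the record, the difference between a plain eventfd and one created with `EFD_SEMAPHORE` can be shown with a short sketch (the function name is made up, Linux-only):

```c
/* Plain eventfd: N writes coalesce into one counter value that a single
 * read(2) consumes and resets. With EFD_SEMAPHORE, each read decrements
 * the counter by one, so N writes yield N reads (and, with epoll, up to
 * N wakeups). Illustrative sketch, Linux-only. */
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Write `1` three times, then count how many nonblocking reads succeed. */
int reads_after_three_writes(int flags) {
  uint64_t one = 1, val;
  int efd, reads, i;

  efd = eventfd(0, flags | EFD_NONBLOCK | EFD_CLOEXEC);
  for (i = 0; i < 3; i++)
    (void) write(efd, &one, sizeof(one));

  reads = 0;
  while (read(efd, &val, sizeof(val)) == (ssize_t) sizeof(val))
    reads++;  /* loop exits once read() fails with EAGAIN */
  close(efd);
  return reads;
}
```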
Correct, that was the case I was making.
Right, but uv_async is defined so that it wakes up at least once, regardless of the number of times uv_async_send is called.
Ok. So then, can you please elaborate on what case this PR helps with?
To sum up: before this PR, we register the eventfd in level-triggered mode and must read(2) it on every wakeup; with this PR, we register it with EPOLLET and skip that read entirely.
So, what I am pointing out is that this contradicts the documentation for epoll_wait.
What is strange is that it gives an exception for stream-oriented files, but then makes a claim that directly contradicts the documented behavior here. To be clear, I am content to assume that this is a kernel bug that we can (ab)use in our favor for eventfd, but it would be good to make sure we can actually rely on epoll_wait not exactly following this part of its documentation. According to the documentation for epoll, it seems a new call to epoll_wait should not report the fd again until it has been drained.
In what case would that manifest? When someone calls uv_async_send we'll get a wakeup and read from the fd, thus resetting it to zero; we cannot avoid that. So in what circumstance would we leave the counter at non-zero until called again?
This would be true under one specific circumstance: the interval between two writes is extremely small. So if you issued a new write right after the previous one without a gap, an eventfd in ET mode would be woken up only once. By contrast, if you issued a new write some time after the previous write, it would be woken up twice. This is because Linux coalesces multiple ready events on the same file descriptor within a short time window and only triggers the wakeup callback once. As for the confusing description in the man pages, it's just a common pattern for working with EPOLLET. And I don't think this is a kernel bug; based on the source code of Linux, it's solid, and it looks more like a feature to me.
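The coalescing described here is observable from user space. A rough sketch (illustrative, Linux-only): two writes issued before the next `epoll_wait()` produce a single edge-triggered event, while a write issued after the wait re-arms the edge and produces a second wakeup.

```c
/* Demonstrate ET coalescing of back-to-back eventfd writes.
 * Illustrative sketch, Linux-only; the counter is never read. */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Count ET wakeups for two writes, optionally separated by an epoll_wait. */
int et_wakeups(int wait_between) {
  struct epoll_event ev;
  uint64_t one = 1;
  int efd, epfd, wakeups = 0;

  efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
  epfd = epoll_create1(EPOLL_CLOEXEC);
  memset(&ev, 0, sizeof(ev));
  ev.events = EPOLLIN | EPOLLET;
  ev.data.fd = efd;
  epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

  (void) write(efd, &one, sizeof(one));
  if (wait_between)
    wakeups += epoll_wait(epfd, &ev, 1, 0);  /* consume the first edge */
  (void) write(efd, &one, sizeof(one));
  wakeups += epoll_wait(epfd, &ev, 1, 0);
  wakeups += epoll_wait(epfd, &ev, 1, 0);    /* no further edge: adds 0 */

  close(efd);
  close(epfd);
  return wakeups;
}
```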
We can avoid that by using
It looks like the kernel may not care whether the read ever actually happened, as it doesn't have a cheap way to check whether the item in the queue is currently LT or ET, so it just sets it after every write? I am not entirely certain how the kernel decides when to clear the bit (for LT).
Sorry if it feels like I am ignoring that you made that comment before, but the documentation for this function does not seem to line up with your description; instead it repeatedly states that reading the file is required afterwards, while reiterating that this "common pattern of working with EPOLLET" that you are using should not be relied upon, as it may work in most cases but is not guaranteed to work correctly all of the time or in all cases. Do you know if there is any kernel documentation for eventfd that might clarify this?
What's your main concern here? I'm still not quite sure.
The man page for epoll_wait specifically says this PR is not correct, but the kernel implementation may allow it. Do we trust that the Linux kernel developers will never update the implementation to follow the man page more closely, given that the kernel documentation for this appears to be non-existent?
There is actually also a second, more complicated concern: this syscall must have sequentially-consistent ordering. The read/write pair used to guarantee that, but does epoll_wait also guarantee that? (Edit: we can probably fix this by adding explicit fences, if necessary.)
I actually don't think this PR contradicts the man page.
When we receive a readable event from the eventfd, all we need to know is that at least one write happened; we never care about the counter's value.
Maybe this could make things more interesting: https://lwn.net/Articles/865400/
Alright, yeah, that email from Linus stating that this PR relies on a kernel bug that probably won't get fixed (https://lwn.net/ml/linux-kernel/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com/) is probably compelling.
Note that this stated usage is still wrong (per the documentation and implementation), but this case is fundamentally different because it never reads from the fd and only ever cares about the occurrence of the write syscall itself (which Linus's email indicates is a kernel bug that probably won't get fixed), never about the quantity of data written (which is not an existing kernel bug, but may be an existing reliability issue in some applications).
Well, it is shocking to me... Normally you don't expect something like epoll, whose design and source code are that sophisticated, to be fundamentally broken when it was implemented. But thanks for the link; it's frustrating but useful.
Sorry, why would it be wrong to read a socket fd in ET mode until EAGAIN is returned? Could you clarify that with more context?
If you do read until EAGAIN, it is correct, but it wastes a syscall every time you go to call epoll_wait.
Oh, I think we were just not on the same page about that. I was specifically talking about EAGAIN in ET mode; it's clear that we don't have to read until EAGAIN in LT mode. Now we are in sync.
So, what are we going to do with this PR? Does it still seem worth going through with?
I'd say some benchmarks would help decide.
Great, thanks. I think we already have a benchmark for this, so just posting the comparison numbers for that may suffice.
Which benchmark test should I run?

Environment:

Benchmark command:

```
UV_USE_IO_URING=0 build/uv_run_benchmarks_a million_async
```

Result:

```
/* libuv:v1.x */
ok 1 - million_async
# 4,835,403 async events in 5.0 seconds (967,080/s, 1,048,576 unique handles seen)

/* panjf2000:eventfd-et */
ok 1 - million_async
# 5,096,626 async events in 5.0 seconds (1,019,325/s, 1,048,576 unique handles seen)
```

Something like this?

Updated: upgraded to Ubuntu 24.04 (Noble Numbat)

Environment:

```
/* libuv:v1.x */
ok 1 - million_async
# 5,483,795 async events in 5.0 seconds (1,096,759/s, 1,048,576 unique handles seen)

/* panjf2000:eventfd-et */
ok 1 - million_async
# 6,118,112 async events in 5.0 seconds (1,223,622/s, 1,048,576 unique handles seen)
```
Apropos EPOLLET and reading until EAGAIN: that was the common wisdom for years (because it was true for a long time) but things changed during the 2.6 (2.4? 2.5?) window.
I think this PR should Just Work(TM) as it stands. A kernel regression would be horrendously hard to debug though.
One reason for not merging is the following (hopefully hypothetical) scenario: another thread or process accidentally polling the file descriptor, thereby consuming the edge event and stopping the libuv state machine in its tracks.
Level-triggered I/O is robust against such snafus because libuv keeps receiving events until it actually drains the eventfd.
Merging should therefore hinge on:

- Someone demonstrably needing peak uv_async_t performance, and
- The scenario above being sufficiently implausible that it's likely to stay a hypothetical.
Node.js would likely benefit from (1) but node.js developers sometimes do incredibly stupid shit such that (2) is not completely impossible.
(E.g., I occasionally got bug reports along the lines of "app crashes after `for (let fd = 0; fd < 1024; fd++) fs.closeSync(fd)`" 🤦)
True, and I also understand your concern here. But on the other hand, I think it's a little unfair to punish the node.js developers who are rigorous about their code for negligence that might be committed by inconsiderate ones. On further reflection, do you think it would be pragmatic to make this behavior opt-in?
Seems pretty unlikely that the other thread will be calling epoll_wait specifically on a random fd. We know at least Electron does this correctly (even though it previously could have done the other thing). Apropos, using ET in more places might be worth exploring too.
Any updates?
Hi, folks! Any new thoughts on this? @vtjnash @bnoordhuis @saghul
Would you be willing to submit a documentation update to include this promise in the external documentation for the kernel, mentioning the performance advantages and the previous commits and LKML discussions that seemed to indicate this behavior is expected to keep working in the future? See https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man/man7/epoll.7
I would if I could. But I've done some research on this whole matter. This quote from Linus:

makes it clear that Linux won't intentionally break any existing applications with a new patch. Another reference from the man pages:

makes me believe that we can count on this behavior. Secondly, I also tried to seek out similar applications of this pattern in other projects.

Updated: another application of this pattern is in nginx. nginx is not doing it the exact same way as those networking frameworks mentioned above do: it doesn't read the eventfd on each wakeup.
Furthermore, I think to some extent there is already a statement about triggering multiple events in the existing documentation.
Ping @bnoordhuis @vtjnash @saghul