linux: exploit eventfd in EPOLLET mode to avoid readding per wakeup #4400

Open
wants to merge 5 commits into base: v1.x

Conversation

panjf2000 (Contributor)

Register the eventfd with EPOLLET to enable edge-triggered notification, which lets us eliminate the overhead of reading the eventfd via a system call on each wakeup event.

When the eventfd counter reaches the maximum value of an unsigned 64-bit integer, which may never happen during the entire lifetime of the process, we rewind the counter and retry.

This optimization saves one system call on each event-loop wakeup, eliminating the overhead of read(2) as well as the extra latency for each epoll wakeup.


Signed-off-by: Andy Pan <i@andypan.me>
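To make the mechanism concrete, here is a minimal standalone sketch of the idea described above (hypothetical names, not the actual libuv patch; it assumes the eventfd was created with EFD_NONBLOCK):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

/* Register the eventfd edge-triggered: epoll_wait() reports a wakeup when a
 * write bumps the counter, and the loop is not forced to read(2) the fd on
 * every wakeup just to stop it from being reported again. */
static int register_wakeup(int epoll_fd, int wakeup_fd) {
  struct epoll_event ev;
  ev.events = EPOLLIN | EPOLLET;
  ev.data.fd = wakeup_fd;
  return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, wakeup_fd, &ev);
}

/* Waker side: add 1 to the counter. If the 64-bit counter ever hits its
 * ceiling, write() fails with EAGAIN on a non-blocking eventfd, so rewind
 * the counter by draining it once and retry. */
static void wakeup(int wakeup_fd) {
  static const uint64_t one = 1;
  uint64_t drained;

  while (write(wakeup_fd, &one, sizeof(one)) == -1 && errno == EAGAIN)
    (void) read(wakeup_fd, &drained, sizeof(drained));
}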

vtjnash (Member) commented May 9, 2024

My initial reaction is that this is not how edge triggering usually works for epoll, so I am skeptical of how it would work here (it would work if this were kevent, and there is a different PR open now for that, but it is not). The man page for eventfd does not seem to mention that you are permitted to skip the read syscall. eventfd_read and eventfd_write are documented to be thin wrappers around read/write, so I would prefer to avoid those added wrappers as well.

Is this an undocumented linux kernel bug or feature here that edge trigger on eventfd does not follow normal edge trigger semantics (which normally requires 2 read calls every time to trigger EAGAIN and re-arm the edge trigger)?

panjf2000 (Contributor, Author) commented May 9, 2024

Is this an undocumented linux kernel bug or feature here that edge trigger on eventfd does not follow normal edge trigger semantics (which normally requires 2 read calls every time to trigger EAGAIN and re-arm the edge trigger)?

I'm not sure what that "normal edge trigger semantics" looks like in your mind, but I'm sure that "which normally requires 2 read calls every time to trigger EAGAIN and re-arm the edge trigger" is wrong: we don't need to re-arm EPOLLET events when we haven't specified EPOLLONESHOT.

Back to the rationale behind this PR. The main difference between EPOLLLT and EPOLLET is that ET relies entirely on the wakeup-callback mechanism to add events to the ready list and removes them after the poll, while LT does not rely solely on the wakeup callback but always adds events back to the ready list after the poll. This means that programs using EPOLLET get notified only when a new ready event occurs: a wakeup callback associated with each epoll entry is triggered and ends up calling ep_poll_callback() to add the event to the ready list. If that event is ignored (in our case, we don't read the eventfd), they won't be notified again until the next event arrives (in our case, when we write new data to the eventfd), because unlike LT, the event is not added back to the ready list under ET mode.

I'll quote this paragraph from the man pages:

Since even with edge-triggered epoll, multiple events can be
generated upon receipt of multiple chunks of data, the caller has
the option to specify the EPOLLONESHOT flag, to tell epoll to
disable the associated file descriptor after the receipt of an
event with epoll_wait(2). When the EPOLLONESHOT flag is
specified, it is the caller's responsibility to rearm the file
descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.

Every time we write to the eventfd with EPOLLET, we get a wakeup event and are notified by epoll_wait(), but we're not obligated to read it because the event won't remain in the ready list after this wakeup:

		else if (!(epi->event.events & EPOLLET)) {
			/*
			 * If this file has been added with Level
			 * Trigger mode, we need to insert back inside
			 * the ready list, so that the next call to
			 * epoll_wait() will check again the events
			 * availability. At this point, no one can insert
			 * into ep->rdllist besides us. The epoll_ctl()
			 * callers are locked out by
			 * ep_send_events() holding "mtx" and the
			 * poll callback will queue them in ep->ovflist.
			 */
			list_add_tail(&epi->rdllink, &ep->rdllist);
			ep_pm_stay_awake(epi);
		}

panjf2000 (Contributor, Author)

Just in case, one more thing about EPOLLET and EAGAIN: one of the most common patterns when working with EPOLLET is to read until EAGAIN is returned. This often misleads people into thinking that they have to drain the underlying buffer in order to be notified when new data arrives, but that is a classic misunderstanding: even if we don't drain the buffer, we will still get notified whenever a new event occurs (it triggers the wakeup callback). The man pages only suggest this pattern to prevent programs from hanging and never getting a chance to consume the leftover data in the underlying buffer when the remote peer stops sending data for some reason.

Therefore, reading until EAGAIN is returned is not mandatory for all the use cases of EPOLLET.
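For reference, the classic "read until EAGAIN" drain loop being discussed looks roughly like this; a sketch for a non-blocking socket under EPOLLET, where sock_fd and on_data are hypothetical names:

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

extern void on_data(const char *buf, size_t len);  /* hypothetical consumer */

static void drain(int sock_fd) {
  char buf[4096];

  for (;;) {
    ssize_t n = read(sock_fd, buf, sizeof(buf));
    if (n > 0)
      on_data(buf, (size_t) n);   /* consume this chunk */
    else if (n == 0)
      break;                      /* peer closed the connection */
    else if (errno == EAGAIN || errno == EWOULDBLOCK)
      break;                      /* buffer drained; wait for the next event */
    else if (errno != EINTR)
      break;                      /* real error; handle it elsewhere */
  }
}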

saghul (Member) commented May 9, 2024

Just in case, one more thing about EPOLLET and EAGAIN: one of the most common patterns when working with EPOLLET is to read until EAGAIN is returned. This often misleads people into thinking that they have to drain the underlying buffer in order to be notified when new data arrives, but that is a classic misunderstanding: even if we don't drain the buffer, we will still get notified whenever a new event occurs (it triggers the wakeup callback). The man pages only suggest this pattern to prevent programs from hanging and never getting a chance to consume the leftover data in the underlying buffer when the remote peer stops sending data for some reason.

Therefore, reading until EAGAIN is returned is not mandatory for all the use cases of EPOLLET.

I think that's the common understanding when using an fd that points to a socket, for example, right? Is this different in the case of eventfd?

panjf2000 (Contributor, Author) commented May 9, 2024

I think that's the common understanding when using an fd that points to a socket, for example, right?

Yes, we often do that when working with sockets, and that's legit.

Is this different in the case of eventfd?

To clarify, sockets and eventfd share the same implementation of EPOLLET under the hood, and we should use the same pattern I described above for both of them when we care about the data in the underlying kernel buffer and need to read and use it. But when it comes to merely using eventfd as a notification mechanism (which is exactly the use case in this PR), we neither care about the data in the underlying buffer nor need to read and use it. That's why we can use the implementation in this PR to eliminate the read(2) system call on each wakeup event and libuv will continue to work correctly.

saghul (Member) commented May 9, 2024

Hum, I still don't get it. Let's say we have the loop thread plus 4 others.

  • Each of the 4 non-loop threads calls uv_async_send
  • Assuming the value they pass is 1, chances are we'd read a 4
  • The loop thread gets the wakeup, performs one read, gets the 4, and the counter is reset to 0

Where does EPOLLET come into play here? Is this an optimization for the case in which we read the value before all threads have incremented it? So say we read 2, then another thread performs the write and we'd be woken up again to read a 1 and so on?

panjf2000 (Contributor, Author) commented May 9, 2024

Hum, I still don't get it. Let's say we have the loop thread plus 4 others.

  • Each of the 4 non-loop threads calls uv_async_send
  • Assuming the value they pass is 1, chances are we'd read a 4
  • The loop thread gets the wakeup, performs one read, gets the 4, and the counter is reset to 0

Where does EPOLLET come into play here? Is this an optimization for the case in which we read the value before all threads have incremented it? So say we read 2, then another thread performs the write and we'd be woken up again to read a 1 and so on?

Sorry, I don't follow your example because I don't see how it has anything to do with EPOLLET. The improvement from this PR is that we don't need to perform a read(2) for each wakeup event. Other than that, I think the new implementation behaves the same way the old one did.

saghul (Member) commented May 9, 2024

Perhaps some sample code and the perf output would help here then. If N threads call uv_async_send there won't necessarily be N wakeups: the loop will be awakened as long as the counter is > 0, but the counter is then reset to 0, so the number of wakeups will be < N.

Granted, if one of the threads performs the write after the loop has read the value, a new wakeup will happen, and that is ok, because we don't know how much later that was.

What am I missing?

panjf2000 (Contributor, Author)

Perhaps some sample code and the perf output would help here then. If N threads call uv_async_send there won't necessarily be N wakeups: the loop will be awakened as long as the counter is > 0, but the counter is then reset to 0, so the number of wakeups will be < N.

If you were talking about N threads each calling uv_async_send to wake one specific thread t1, and the number of wakeups being less than N because t1 reads the eventfd and resets it to zero, then that is true regardless of whether we use the old implementation with LT or the new one with ET. If you want to trigger exactly N wakeups, you need to specify EFD_SEMAPHORE for the eventfd (see the sketch below).

Granted, if one of the threads performs the write after the loop has read the value, a new wakeup will happen, and that is ok, because we don't know how much later that was.

Again, this will happen with either LT or ET.
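To make the EFD_SEMAPHORE distinction concrete, a small hypothetical example (not related to the patch itself):

#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void) {
  uint64_t v;

  /* Default eventfd: three writes of 1 accumulate in a single counter,
   * and one read returns 3 and resets the counter to 0. */
  int efd = eventfd(0, 0);
  for (int i = 0; i < 3; i++) { v = 1; write(efd, &v, sizeof(v)); }
  read(efd, &v, sizeof(v));  /* v == 3 */

  /* EFD_SEMAPHORE: each read returns 1 and decrements the counter by 1,
   * so the fd stays readable (and keeps producing wakeups) until all
   * three notifications have been consumed. */
  int sfd = eventfd(0, EFD_SEMAPHORE);
  for (int i = 0; i < 3; i++) { v = 1; write(sfd, &v, sizeof(v)); }
  read(sfd, &v, sizeof(v));  /* v == 1, counter is now 2 */

  close(efd);
  close(sfd);
  return 0;
}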

saghul (Member) commented May 9, 2024

If you were talking about N threads each calling uv_async_send to wake one specific thread t1, and the number of wakeups being less than N because t1 reads the eventfd and resets it to zero, then that is true regardless of whether we use the old implementation with LT or the new one with ET.

Correct, that was the case I was making.

If you want to trigger exactly N wakeups, you need to specify EFD_SEMAPHORE for the eventfd.

Right, but uv_async is defined so it wakes up at least once, regardless of the number of times uv_async_send was called.

Again, this will happen with either LT or ET.

Ok.

So then, can you please elaborate on what case this PR helps with?

panjf2000 (Contributor, Author) commented May 9, 2024

To sum up: before this PR, we use the eventfd in LT mode and have to perform a read(2) for every wakeup event. That extra read(2) system call is mandatory and unavoidable, because if we don't read the eventfd to reset it to zero, the kernel will keep waking up epoll_wait(). After this PR, we use the eventfd in ET mode, which allows us to avoid calling read(2) to reset the eventfd on every wakeup event. This eliminates the system call overhead because, under ET mode, the kernel won't wake up epoll_wait() again, even if we don't read the eventfd to reset it to zero, until the next time uv_async_send is called from any thread.
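Put differently, the wakeup handler changes roughly like this (a sketch with made-up names, not libuv's actual code):

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

static void on_wakeup(int wakeup_fd, int edge_triggered) {
  if (!edge_triggered) {
    /* Level-triggered (old behavior): the fd stays ready until the counter
     * is drained, so a read(2) per wakeup is mandatory or epoll_wait()
     * would keep reporting the fd immediately. */
    uint64_t counter;
    if (read(wakeup_fd, &counter, sizeof(counter)) != sizeof(counter))
      abort();
  }
  /* Edge-triggered (this PR): skip the read entirely; the next
   * uv_async_send(), i.e. the next write to the eventfd, produces the
   * next wakeup. */

  /* ... dispatch the pending async handles ... */
}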

vtjnash (Member) commented May 9, 2024

Every time we write to the eventfd with EPOLLET, we get a wakeup event and are notified by epoll_wait

So, what I am pointing out is that this contradicts the documentation for epoll_wait, which states that if you do this, it will work most of the time, but not be reliable or robust:

• Do I need to continuously read/write a file descriptor until EAGAIN when using the EPOLLET flag (edge-triggered behavior)?

  Receiving an event from epoll_wait(2) should suggest to you that such file descriptor is ready for the requested I/O operation. You must consider it ready until the next (nonblocking) read/write yields EAGAIN. When and how you will use the file descriptor is entirely up to you.

  For packet/token-oriented files (e.g., datagram socket, terminal in canonical mode), the only way to detect the end of the read/write I/O space is to continue to read/write until EAGAIN.

  For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.)

What is strange is that it gives an exception for stream-oriented files, but then makes a claim that directly contradicts the documentation for read on a stream (it also may return a short read if the requested size is larger than the atomic size).

To be clear, I am content to assume that this is a kernel bug that we can (ab)use in our favor for eventfd, but it would be good to make sure we can actually rely on the behavior of epoll_wait not exactly following this part of the documentation for it. According to the documentation for epoll, it seems a new call to write while a read is already possible should not trigger a new event until after the old event has been read by the event loop and discarded.

saghul (Member) commented May 9, 2024

To sum up: before this PR, we use the eventfd in LT mode and have to perform a read(2) for every wakeup event. That extra read(2) system call is mandatory and unavoidable, because if we don't read the eventfd to reset it to zero, the kernel will keep waking up epoll_wait(). After this PR, we use the eventfd in ET mode, which allows us to avoid calling read(2) to reset the eventfd on every wakeup event. This eliminates the system call overhead because, under ET mode, the kernel won't wake up epoll_wait() again, even if we don't read the eventfd to reset it to zero, until the next time uv_async_send is called from any thread.

In what case would that manifest? When someone calls uv_async_send we'll get a wakeup and read from the fd, thus resetting it to zero; we cannot avoid that. So in what circumstance would we leave the counter non-zero until it's called again?

panjf2000 (Contributor, Author)

it seems a new call to write while a read is already possible should not trigger a new event until after the old event has been read by the event loop and discarded.

This would be true under one specific circumstance: the interval between two writes is extremely small. If you issue a new write right after the previous one without a gap, the eventfd in ET mode wakes epoll up only once. By contrast, if you issue a new write some time after the previous one, it wakes epoll up twice. This is because Linux coalesces multiple ready events on the same file descriptor that occur within a short time window and only triggers the wakeup callback once.

As for the confusing description in the man pages, it's just a common pattern for working with EPOLLET; we have to analyze real-world situations on a case-by-case basis, like I said before: #4400 (comment) and #4400 (comment). In our case of using eventfd, we only care about the wakeup event; whether or not we read the eventfd on each wakeup doesn't matter as long as we receive the event as expected.

And I don't think this is a kernel bug; based on the Linux source code, it's solid and it looks more like a deliberate feature to me.
@vtjnash

panjf2000 (Contributor, Author)

To sum up: before this PR, we use the eventfd in LT mode and have to perform a read(2) for every wakeup event. That extra read(2) system call is mandatory and unavoidable, because if we don't read the eventfd to reset it to zero, the kernel will keep waking up epoll_wait(). After this PR, we use the eventfd in ET mode, which allows us to avoid calling read(2) to reset the eventfd on every wakeup event. This eliminates the system call overhead because, under ET mode, the kernel won't wake up epoll_wait() again, even if we don't read the eventfd to reset it to zero, until the next time uv_async_send is called from any thread.

In what case would that manifest? When someone calls uv_async_send we'll get a wakeup and read from the fd, thus resetting it to zero; we cannot avoid that. So in what circumstance would we leave the counter non-zero until it's called again?

We can avoid that by using EPOLLET on the eventfd; that is what this PR does.

vtjnash (Member) commented May 9, 2024

It looks like the kernel may not care if the read actually ever happened, as it doesn't have a cheap way to filter this for whether the item in the queue is currently LT or ET, so it just sets it after every write? I am not entirely certain of how the kernel decides when to clear the bit (for LT)
https://github.com/torvalds/linux/blob/45db3ab70092637967967bfd8e6144017638563c/fs/eventfd.c#L273-L274

vtjnash (Member) commented May 9, 2024

like I said before

Sorry if it feels like I am ignoring that you made that comment before, but the documentation for this function does not seem to line up with your description, and instead repeatedly states that reading the file is required afterwards, while reiterating that this "common pattern of working with EPOLLET" that you are using should not be relied upon, as it may work in most cases, but is not guaranteed to work correctly all of the time or in all cases.

Do you know if there is any kernel documentation for eventfd that might clarify this?

panjf2000 (Contributor, Author)

What's your main concern here? I'm still not quite sure.

vtjnash (Member) commented May 9, 2024

The man page for epoll_wait specifically says this PR is not correct, but the kernel implementation may allow it. Do we trust that the linux kernel developers will never update the implementation to more closely follow the man page, given that the kernel documentation for this appears to be non-existent?

vtjnash (Member) commented May 9, 2024

There is actually also a second more complicated concern for it, which is that this syscall must have sequentially-consistent ordering. The read/write pair used to guarantee that, but does epoll_wait also guarantee that? (edit: we can probably fix this by adding explicit fences, if necessary)

panjf2000 (Contributor, Author) commented May 9, 2024

The man page for epoll_wait specifically says this PR is not correct, but the kernel implementation may allow it. Do we trust that the linux kernel developers will never update the implementation to more closely follow the man page, given that the kernel documentation for this appears to be non-existent?

I actually don't think this PR contradicts the man page.

Receiving an event from epoll_wait(2) should suggest to you
that such file descriptor is ready for the requested I/O
operation. You must consider it ready until the next
(nonblocking) read/write yields EAGAIN. When and how you will
use the file descriptor is entirely up to you.

When we receive a readable event from the eventfd, the man pages say that we must consider it ready until the next EAGAIN; they never say we can't get a new ready event again (and epoll in ET mode will always send a new ready event, which I don't think has remained by accident across all these kernel releases). If we want to read and use the data in the kernel buffer, we must read until EAGAIN is returned, but if we don't care about that data, then we don't read it and just wait for the next ready event.

panjf2000 (Contributor, Author)

Maybe this could make things more interesting: https://lwn.net/Articles/865400/

vtjnash (Member) commented May 9, 2024

Alright, yeah, that email from Linus stating that this PR relies on a kernel bug that probably won't get fixed (https://lwn.net/ml/linux-kernel/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com/) is probably compelling.

vtjnash (Member) commented May 9, 2024

Yes, we often do that when working with sockets, and that's legit.

Note that this stated usage is still wrong (per the documentation and implementation), but that this case is fundamentally different because it never reads from the fd and only ever cares about the occurrence of the write syscall itself (which Linus's email indicates is a kernel bug that probably won't get fixed), but never about the quantity of data written (which is not an existing kernel bug, but may be an existing reliability issue in some applications).

panjf2000 (Contributor, Author)

Alright, yeah, that email from Linus stating that this PR relies on a kernel bug that probably won't get fixed (lwn.net/ml/linux-kernel/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com) is probably compelling.

Well, it is shocking to me... Normally you don't expect something like epoll, whose design and source code are that sophisticated, to have been fundamentally broken when it was implemented. But thanks for the link; it's frustrating but useful.

panjf2000 (Contributor, Author)

Yes, we often do that when working with sockets, and that's legit.

Note that this stated usage is still wrong (per the documentation and implementation), but that this case is fundamentally different because it never reads from the fd and only ever cares about the occurrence of the write syscall itself (which Linus's email indicates is a kernel bug that probably won't get fixed), but never about the quantity of data written (which is not an existing kernel bug, but may be an existing reliability issue in some applications).

Sorry, why would it be wrong to read a socket fd in ET mode until EAGAIN is returned? Could you clarify that with more context?

vtjnash (Member) commented May 9, 2024

If you do read until EAGAIN, it is correct, but wastes a syscall every time you go to call read (as you could have used LT instead, and called epoll_wait, and gotten the same info simultaneously for every socket instead of just one)

panjf2000 (Contributor, Author) commented May 9, 2024

If you do read until EAGAIN, it is correct, but wastes a syscall every time you go to call read (as you could have used LT instead, and called epoll_wait, and gotten the same info simultaneously for every socket instead of just one)

Oh, I think we were just not on the same page about that. I was specifically talking about EAGAIN with ET mode; it's clear that we don't have to read until EAGAIN in LT mode. Now we are in sync.

panjf2000 (Contributor, Author)

So, what are we going to do with this PR? Does it still seem worth pursuing for libuv?

saghul (Member) commented May 9, 2024

I'd say some benchmarks would help decide.

vtjnash (Member) commented May 9, 2024

Great, thanks. I think we already have a benchmark for this as well, so just posting the comparison numbers for that may suffice.

panjf2000 (Contributor, Author) commented May 9, 2024

Which benchmark test should I run? million_async?

Environment

    OS : Ubuntu 22.04/x86_64
   CPU : 8 CPU cores, AMD EPYC 7K62 48-Core Processor
Memory : 16.0 GiB

Benchmark command

UV_USE_IO_URING=0 build/uv_run_benchmarks_a million_async

Result

/* libuv:v1.x */
ok 1 - million_async
# 4,835,403 async events in 5.0 seconds (967,080/s, 1,048,576 unique handles seen)

/* panjf2000:eventfd-et */
ok 1 - million_async
# 5,096,626 async events in 5.0 seconds (1,019,325/s, 1,048,576 unique handles seen)

Something like this?



bnoordhuis (Member) left a review comment


Apropos EPOLLET and reading until EAGAIN: that was the common wisdom for years (because it was true for a long time) but things changed during the 2.6 (2.4? 2.5?) window.

I think this PR should Just Work(TM) as it stands. A kernel regression would be horrendously hard to debug though.


One reason for not merging is the following (hopefully hypothetical) scenario: another thread or process accidentally polling the file descriptor, thereby consuming the edge event and stopping the libuv state machine in its tracks.

Level-triggered I/O is robust against such snafus because libuv keeps receiving events until it actually drains the eventfd.

Merging should therefore hinge on:

  1. Someone demonstrably needing peak uv_async_t performance, and

  2. The scenario above being sufficiently implausible that it's likely to stay a hypothetical

Node.js would likely benefit from (1) but node.js developers sometimes do incredibly stupid shit such that (2) is not completely impossible.

(E.g., I occasionally got bug reports along the lines of "app crashes after for (let fd = 0; fd < 1024; fd++) fs.closeSync(fd)" 🤦)

(Four review comments on src/unix/async.c, since marked outdated/resolved.)
panjf2000 (Contributor, Author)

Node.js would likely benefit from (1) but node.js developers sometimes do incredibly stupid shit such that (2) is not completely impossible.

True, and I understand your concern here. But on the other hand, I also think it's a little unfair to punish the node.js developers who are rigorous about their code for negligence by the inconsiderate ones.

On further reflection, do you think it would be pragmatic to make this eventfd-in-ET-mode behavior configurable?

panjf2000 requested a review from bnoordhuis on May 13, 2024, 11:51
vtjnash (Member) commented May 13, 2024

another thread or process accidentally polling the file descriptor

Seems pretty unlikely that another thread would call epoll_wait specifically on a random fd. We know that at least Electron does this correctly (even though it previously could have done otherwise and used epoll_wait directly on the uv_backend_fd here without ever having run into any issues with that):
https://github.com/electron/electron/blob/4c27b0c28270280de3b5949fa4a93114452d509f/shell/common/node_bindings_linux.cc#L15-L19

Apropos of using ET in more places, it looks like the pipe.c code in the kernel at least now documents and respects that a partial read/write (after a variety of aborted attempts to break it in 2019) does trigger a new event, regardless of what the documentation warned about. Cf. torvalds/linux@fe67f4d (specifically, note how pending signals are not checked until after the buffer is fully drained or filled). So it could be possible for libuv's pipe code to rely on ET being valid and on a partial read being a guarantee that the buffer is empty.

panjf2000 (Contributor, Author)

Any updates?

panjf2000 (Contributor, Author)

Hi, folks! Any new thoughts on this? @vtjnash @bnoordhuis @saghul

vtjnash (Member) commented May 22, 2024

Would you be willing to submit a documentation update to include this promise in the external documentation for the kernel, mentioning the performance advantages and the previous commits and LKML discussions that seemed to indicate this behavior is expected to continue working in the future @ https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man/man7/epoll.7

panjf2000 (Contributor, Author) commented May 22, 2024

Would you be willing to submit a documentation update to include this promise in the external documentation for the kernel, mentioning the performance advantages and the previous commits and LKML discussions that seemed to indicate this behavior is expected to continue working in the future @ git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man/man7/epoll.7

I would if I could. But after doing some research on the whole EPOLLET mechanism, I believe that won't be necessary, because there is sufficient evidence to endorse this PR. First of all, I went through this thread -- Regression: epoll edge-triggered (EPOLLET) for pipes/FIFOs -- which described a regression in how EPOLLET works with pipes, caused by torvalds/linux@1b6b26a: the behavior of EPOLLET on pipes went from issuing a new wakeup whenever new data arrives to issuing a new wakeup only when an empty buffer becomes non-empty (the real edge-triggered implementation). But that commit got "reverted" by torvalds/linux@3a34b13 (see also torvalds/linux@3b84482), which tweaked EPOLLET back to the way it used to be.

This quote from Linus:

Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong.

makes it clear that Linux won't intentionally break existing applications with a new patch.

Another reference from the man pages of eventfd:

Applications can use an eventfd file descriptor instead of a pipe
(see pipe(2)) in all cases where a pipe is used simply to signal
events. The kernel overhead of an eventfd file descriptor is
much lower than that of a pipe, and only one file descriptor is
required (versus the two required for a pipe).

makes me believe that if we can count on a pipe (with or without EPOLLET) as a simple notification mechanism, we can count on eventfd (with or without EPOLLET) as a substitute that is officially advocated by the kernel.

Secondly, I also tried to seek out similar uses of eventfd with EPOLLET in other renowned networking frameworks, and I did find some: libevent, gnet, netty, and mio. Well, I guess the first two might not be so cogent, considering they were submitted by me... 😅 But I think the latter two should be compelling. I mean, if we merge this PR and libuv later breaks because Linux breaks the backward compatibility of EPOLLET, at least we won't be alone in protesting against that breaking patch. 😂

@vtjnash


Updated:

Another application of eventfd with EPOLLET: https://github.com/nginx/nginx/blob/efc6a217b92985a1ee211b6bb7337cd2f62deb90/src/event/modules/ngx_epoll_module.c#L386-L457

nginx doesn't do it exactly the same way as the networking frameworks mentioned above: it doesn't read the eventfd on each wakeup, but performs one read on the eventfd every 0xFFFFFFFF (the max value of uint32) wakeups. But I think that is only because its counter is defined as unsigned long.
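A rough sketch of that nginx-style variant (hypothetical names, not nginx's actual code): skip the read on ordinary wakeups and drain the eventfd only once every 0xFFFFFFFF wakeups, so the 64-bit counter can never reach its ceiling.

#include <stdint.h>
#include <unistd.h>

static void on_wakeup_periodic_drain(int wakeup_fd) {
  static uint32_t wakeups;
  uint64_t counter;

  if (++wakeups == 0xFFFFFFFF) {
    (void) read(wakeup_fd, &counter, sizeof(counter));  /* occasional drain */
    wakeups = 0;
  }
  /* ... handle the notification; no read(2) on the common path ... */
}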

panjf2000 (Contributor, Author)

Furthermore, I think the man pages already state, to some extent, that multiple events can be triggered in EPOLLET mode:

Since even with edge-triggered epoll, multiple events can be
generated upon receipt of multiple chunks of data, the caller has
the option to specify the EPOLLONESHOT flag, to tell epoll to
disable the associated file descriptor after the receipt of an
event with epoll_wait(2). When the EPOLLONESHOT flag is
specified, it is the caller's responsibility to rearm the file
descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.

panjf2000 (Contributor, Author)

Benchmark updated: upgrade to Ubuntu 24.04 (Noble Numbat)

Environment

    OS : Ubuntu 24.04/x86_64
   CPU : 8 CPU cores, AMD EPYC 7K62 48-Core Processor
Memory : 16.0 GiB
/* libuv:v1.x */
ok 1 - million_async
# 5,483,795 async events in 5.0 seconds (1,096,759/s, 1,048,576 unique handles seen)

/* panjf2000:eventfd-et */
ok 1 - million_async
# 6,118,112 async events in 5.0 seconds (1,223,622/s, 1,048,576 unique handles seen)

panjf2000 changed the title from "linux: eliminate syscall read(2) on eventfd per event-loop wakeup" to "linux: exploit eventfd in EPOLLET mode to avoid readding per wakeup" on May 24, 2024
panjf2000 (Contributor, Author)

Ping @bnoordhuis @vtjnash @saghul
