Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: emit diagnostics_channel events upon routing request #5252

Merged
merged 3 commits into from
May 31, 2024

Conversation

tlhunter
Copy link
Contributor

@tlhunter tlhunter commented Jan 3, 2024

This adds support for emitting diagnostics channel events for a request. It's an extension of #5105.

The dc-polyfill package provides a polyfill of the diagnostics_channel module for older versions of Node.js. This is why the existing diagnostics channel call has been removed from a conditional in fastify.js. Note that dc-polyfill falls back to the existing diagnostics channel if present. That's why the tests are requiring diagnostics channel directly instead of the polyfill to show that it works as expected.

Here's some things I'm wondering about:

  • extracting payload from reply.send()
  • rebased on main
  • tests
  • docs

Benchmark results

main

┌─────────┬──────┬──────┬───────┬───────┬────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%   │ Avg    │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼───────┼────────┼─────────┼────────┤
│ Latency │ 4 ms │ 9 ms │ 13 ms │ 13 ms │ 8.3 ms │ 4.19 ms │ 365 ms │
└─────────┴──────┴──────┴───────┴───────┴────────┴─────────┴────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬────────────┬──────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg        │ Stdev    │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼────────────┼──────────┼─────────┤
│ Req/Sec   │ 104,639 │ 104,639 │ 114,495 │ 116,735 │ 113,796.27 │ 2,451.61 │ 104,582 │
├───────────┼─────────┼─────────┼─────────┼─────────┼────────────┼──────────┼─────────┤
│ Bytes/Sec │ 19.7 MB │ 19.7 MB │ 21.5 MB │ 22 MB   │ 21.4 MB    │ 461 kB   │ 19.7 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴────────────┴──────────┴─────────┘

this change

┌─────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%   │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼────────┤
│ Latency │ 4 ms │ 9 ms │ 13 ms │ 13 ms │ 8.21 ms │ 4.23 ms │ 364 ms │
└─────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬───────────┬──────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg       │ Stdev    │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼───────────┼──────────┼─────────┤
│ Req/Sec   │ 107,455 │ 107,455 │ 115,455 │ 117,503 │ 114,822.4 │ 2,261.46 │ 107,414 │
├───────────┼─────────┼─────────┼─────────┼─────────┼───────────┼──────────┼─────────┤
│ Bytes/Sec │ 20.2 MB │ 20.2 MB │ 21.7 MB │ 22.1 MB │ 21.6 MB   │ 426 kB   │ 20.2 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴───────────┴──────────┴─────────┘

Checklist

@tlhunter
Copy link
Contributor Author

tlhunter commented Jan 3, 2024

/cc @Qard

Copy link

@Qard Qard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few suggestions. Basically try/finally to more reliably ensure end events happen. Should probably also just put a catch in there too to be sure any unexpected errors get reported too.

lib/handleRequest.js Outdated Show resolved Hide resolved
reply[kReplyIsError] = true
reply.send(err)
channels.end.publish(store)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than putting the end in all the branches you could use try/finally. That also ensures the event still fires if something throws.

@@ -28,6 +33,8 @@ function wrapThenable (thenable, reply) {
reply.send(err)
}
}

if (channels) channels.asyncEnd.publish(store)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should do try/finally here too.

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given dc-polyfill, I think we can land this on main if it does not introduce overhead when disabled.

fastify.js Outdated
} catch (e) {
// This only happens if `diagnostics_channel` isn't available, i.e. earlier
// versions of Node.js. In that event, we don't care, so ignore the error.
const dc = require('dc-polyfill')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to the top of the file

Copy link
Member

@jsumners jsumners Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'd rather it be named at least diagnostics instead of dc in order to avoid having to lookup what "dc" means.

lib/wrapThenable.js Show resolved Hide resolved
@mcollina
Copy link
Member

mcollina commented Jan 4, 2024

Is there a way to get the payload sent via reply.send()?

inside reply.send() or when executing the onSend hook.

@Uzlopak
Copy link
Contributor

Uzlopak commented Jan 4, 2024

What about the diagnostics_channel plugin?

docs/Reference/Hooks.md Outdated Show resolved Hide resolved
docs/Reference/Hooks.md Outdated Show resolved Hide resolved
docs/Reference/Hooks.md Outdated Show resolved Hide resolved
@Uzlopak
Copy link
Contributor

Uzlopak commented Jan 4, 2024

Clarifying my last comment. How does this PR relate to

https://github.com/fastify/fastify-diagnostics-channel

?

Seems like this PR makes the plugin partially redundant?

@Qard
Copy link

Qard commented Jan 4, 2024

Definitely overlaps with the plugin, though this is more limited to just producing a trace around a request whereas the plugin gives a bit more control. This might benefit from a few more events to cover the other use cases provided by the plugin.

The main benefit I see for this existing rather than just the plugin is that if it lives directly in the library it enables automatic instrumentation tracers to listen without producing side effects while adding the plugin could have user-visible behaviour. There's also probably performance advantages to going direct rather than going through the plugin.

@jsumners
Copy link
Member

jsumners commented Jan 4, 2024

I think instrumentation hooks such as these should be in core. They provide a way for other instrumentations to do things via a specific plugin or wrapping package. If we were trying to add such hooks via some third party library, like dtrace, then my opinion would be different. But the API used in this PR is meant to be a core API.

@tlhunter
Copy link
Contributor Author

tlhunter commented Jan 4, 2024

What about the diagnostics_channel plugin?

@Uzlopak I think you're asking about the removal of the diagnostics_channel npm package?

If so, the dc-polyfill package makes the diagnostics_channel package obsolete. The way diagnostics_channel worked is that it was a snapshot of the internal diagnostics_channel module that was made into an npm package of the same name. On ancient Node.js versions where the internal diagnostics_channel module was missing a call to require('diagnostics_channel') would pickup the npm package and that same call on newer versions of Node.js simply retrieves the internal module. However there were shortcomings with the diagnostics_channel package, like if multiple versions of the diagnostics_channel package existed in the a node_modules/ dependency tree then they wouldn't cross communicate. Also the diagnostics_channel package hasn't been maintained and is missing newer features that this PR depends on. dc-polyfill alleviates those issues and many more.

If anyone is curious please checkout the dc-polyfill docs for more info. Ultimately dc-polyfill is a drop-in replacement for both the diagnostics_channel npm package and internal module

@Qard
Copy link

Qard commented Jan 4, 2024

I think instrumentation hooks such as these should be in core. They provide a way for other instrumentations to do things via a specific plugin or wrapping package. If we were trying to add such hooks via some third party library, like dtrace, then my opinion would be different. But the API used in this PR is meant to be a core API.

It is a core API. The dc-polyfill API just enables supporting the newer features beyond the limited backporting window of Node.js core. With dc-polyfill we had it tested and working all the way back to Node.js 12. It uses the core API where available though, so it should always be a drop-in replacement for using the core API as if using the current version of Node.js.

As for the plugin versus embedding in the library, I much favour embedding in the library as the diagnostics_channel API (and dc-polyfill on top of it) are aimed at very high performance and as close to zero cost as possible when nothing is listening. Abstracting that with a plugin in the middle adds a bunch of unnecessary overhead as fastify then loses the context of what is ignorable as the plugin barrier isn't designed to forward that information.

I created diagnostics_channel and it is quite specifically my intent that it be used directly in libraries in this way as it gets us as close to the source as possible feed of diagnostics data, eliminating a ton of instrumentation cost in the process and making gathering instrumentation insights much, much easier in the process. The existing model of monkey-patching everything is highly unreliable and very unapproachable to an average developer to be able to use as a way to gain insight about their applications. I intended for diagnostics_channel to encourage library developers to start thinking about exposing diagnostics data directly and thinking about the performance profile of their code more.

@tlhunter
Copy link
Contributor Author

tlhunter commented Jan 5, 2024

inside reply.send() or when executing the onSend hook.

@mcollina I'm now grabbing the payload within Reply instances and setting it on itself using a new symbol: 92f77e4

That said I'm not sure if there are expectations on how long Reply instances stick around and if there may be memory concerns. Let me know if this approach isn't ideal and I'll try something else.

@Uzlopak
Copy link
Contributor

Uzlopak commented Jan 5, 2024

I wonder if we should integrate then the whole diagnostics-channel stuff from the plugin into core.

lib/reply.js Outdated
@@ -66,6 +67,7 @@ function Reply (res, request, log) {
this[kReplyTrailers] = null
this[kReplyHasStatusCode] = false
this[kReplyStartTime] = undefined
this[kReplyPayload] = undefined
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only concern is when the payload is stream.
How can the subscriber safely consume the stream for logging?
The same problem exist when I implement the Trailer feature and it is not resolved until now.
Reference #4373

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see so many ways this can end badly long term. Can you avoid storing the payload inside reply?

Copy link
Contributor Author

@tlhunter tlhunter Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit no longer attaches the payload value to the Reply instance: 325ba54

@mcollina
Copy link
Member

mcollina commented Jan 5, 2024

I think you can target main for this.

@mcollina
Copy link
Member

mcollina commented Jan 5, 2024

I wonder if we should integrate then the whole diagnostics-channel stuff from the plugin into core.

Agreed

@tlhunter tlhunter force-pushed the tlhunter/diagnostics-channel branch from 4b3a0cb to 1b00702 Compare January 5, 2024 19:48
@tlhunter tlhunter changed the base branch from next to main January 5, 2024 19:49
@@ -158,7 +163,6 @@ function preHandlerCallback (err, request, reply) {
if (result !== null && typeof result.then === 'function') {
wrapThenable(result, reply, channels, store)
} else {
store.result = result
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line becomes redundant as the kReplyOnSend callback sets the value.

@tlhunter
Copy link
Contributor Author

tlhunter commented Jan 5, 2024

I think you can target main for this.

This PR is now based on main. Ended up squashing due to conflicts.

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! I think a few edge cases are missing:

  1. reply.callNotFound() // this will execute the not found handler
  2. Fastify error handlers (nested)
  3. routes with setImmediate(() => reply.send('hello world))
  4. async routes with setImmediate(() => reply.send('hello world)); return reply

The reason why I'm asking is that it will definitely pass through reply.send() multiple times, and both synchronously and asynchronously.

@@ -5,28 +5,37 @@ const {
kReplyHijacked
} = require('./symbols')

function wrapThenable (thenable, reply) {
function wrapThenable (thenable, reply, channels, store) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this function should be modified, I would recommend to keep the change between preHandlerCallback and reply.send().
Specifically, I would emit asyncStart and asyncEnd within reply.send(), as there are multiple ways to call that function (synchronous and asynchronous)

@tlhunter
Copy link
Contributor Author

@mcollina So far it seems that by returning a simple value from a sync request handler that value is used as the response. Or, if a request handler returns a promise, the resolved value from that promise is used as the response. Or, if at any time reply.send() is used, then that value is used as the response. Is it safe to assume that those are the canonical ways that Fastify determines if a response is finished?

@mcollina
Copy link
Member

Yes. The error flow resets all this for the error handler.

@Uzlopak
Copy link
Contributor

Uzlopak commented Jan 11, 2024

@mcollina

Should this PR also integrate the plugins behaviour? Or should this be a follow up?

@mcollina
Copy link
Member

@mcollina

Should this PR also integrate the plugins behaviour? Or should this be a follow up?

Can you clarify?

@tlhunter
Copy link
Contributor Author

@mcollina I think @Uzlopak is asking if I should copy all of the code from the existing fastify-diagnostics-channel package and incorporate it into my PR. I'm of the opinion that it shouldn't be included in this PR. While that package is semi-related in scope, it uses a different pattern, and this PR is just attempting to implement the TracingChannel "spec". Plus I'm going to write some blog posts that link to this PR to share with the community and I'd like to keep the PR as succinct and isolated as possible to avoid confusing the audience.

@mcollina
Copy link
Member

A follow up would be good, this would be massive already after the fixes I mentioned.

@tlhunter
Copy link
Contributor Author

@mcollina Just so I know if I'm on the right track, do you think I need to refactor a bunch of Fastify to achieve those changes?

@mcollina
Copy link
Member

No, but possibly change the approach you have taken with this PR.

@tlhunter tlhunter force-pushed the tlhunter/diagnostics-channel branch from 8d204d6 to 69a421d Compare May 7, 2024 19:15
@tlhunter
Copy link
Contributor Author

tlhunter commented May 8, 2024

I had a few questions for @Qard but he just went on vacation so I'll try to figure things out without him. Here's a paper trail of what's going on:

  1. @Qard landed a commit in Node.js versions 22.0.0 and 20.13.0 which does two things:
  2. I now have a PR open to add just the hasSubscribers method to dc-polyfill while we figure out the early-exit part
    • I'll run it by my team since @Qard is out

If I can merge the dc-polyfill PR today then I can update this Fastify PR and AFAICT it'll be ready to land.

@tlhunter
Copy link
Contributor Author

tlhunter commented May 9, 2024

@mcollina @jsumners, The backported functionality has landed in dc-polyfill as of v0.1.5 and this PR has been updated to make use of the new helper method. I believe this PR is good to go!

Here's an updated benchmark:

This PR

┌─────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬───────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%   │ Avg     │ Stdev   │ Max   │
├─────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼───────┤
│ Latency │ 4 ms │ 8 ms │ 12 ms │ 12 ms │ 7.62 ms │ 2.63 ms │ 56 ms │
└─────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴───────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬────────────┬──────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg        │ Stdev    │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼────────────┼──────────┼─────────┤
│ Req/Sec   │ 110,207 │ 110,207 │ 123,711 │ 129,087 │ 123,140.27 │ 3,707.13 │ 110,198 │
├───────────┼─────────┼─────────┼─────────┼─────────┼────────────┼──────────┼─────────┤
│ Bytes/Sec │ 20.7 MB │ 20.7 MB │ 23.3 MB │ 24.3 MB │ 23.2 MB    │ 696 kB   │ 20.7 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴────────────┴──────────┴─────────┘

Main

┌─────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬───────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%   │ Avg     │ Stdev   │ Max   │
├─────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼───────┤
│ Latency │ 4 ms │ 9 ms │ 12 ms │ 13 ms │ 7.68 ms │ 2.73 ms │ 61 ms │
└─────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴───────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬────────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg        │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼────────────┼─────────┼─────────┤
│ Req/Sec   │ 111,039 │ 111,039 │ 123,071 │ 127,935 │ 122,178.14 │ 3,587.8 │ 111,036 │
├───────────┼─────────┼─────────┼─────────┼─────────┼────────────┼─────────┼─────────┤
│ Bytes/Sec │ 20.9 MB │ 20.9 MB │ 23.1 MB │ 24.1 MB │ 23 MB      │ 674 kB  │ 20.9 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴────────────┴─────────┴─────────┘

@tlhunter tlhunter force-pushed the tlhunter/diagnostics-channel branch from 646b799 to c5d68d4 Compare May 9, 2024 17:39
@tlhunter
Copy link
Contributor Author

@mcollina do you think this will be able to land in the v5 release?

@mcollina
Copy link
Member

can you rebase and make CI green? I've restarted it

@mcollina
Copy link
Member

This can land I think

Copy link
Member

@jsumners jsumners left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@tlhunter tlhunter force-pushed the tlhunter/diagnostics-channel branch from c5d68d4 to a21666d Compare May 16, 2024 14:47
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 16, 2024
@jsumners
Copy link
Member

https://github.com/fastify/fastify/actions/runs/9114429260/job/25058583913?pr=5252#step:5:17959

There is a test timing out.

@tlhunter
Copy link
Contributor Author

tlhunter commented May 16, 2024

I'm able to reproduce this locally but only if I run via c8:

PASS: ./node_modules/.bin/tap
FAIL: ./node_modules/.bin/c8 ./node_modules/.bin/tap 
PASS: node ./test/wrapThenable.test.js
HANG: ./node_modules/.bin/c8 node ./test/wrapThenable.test.js
PASS: ./node_modules/.bin/c8 node ./test/listen.3.test.js # random other test

Notably the individual tests pass but something about c8 hangs indefinitely while using 100% CPU. It appears to be an occurance of this c8 issue. I'm still looking into it.

If the wrapThenable.test.js file only contains these two lines it will still hang:

require('tap')
require('../lib/reply')

However this only happens in this branch not in main.

After bisecting Node.js versions I discovered that this hang appears in Node.js v18.19.0 but isn't present in v18.18.2.

This might be related to tracing channel getting backported to Node.js v18.19.0. I did notice a small issue in dc-polyfill where we would patch TracingChannel in v8.19.0+ when we didn't need to. However fixing this bug didn't fix c8. Plus c8 doesn't seem to use it anyway.

Could this be related to nodejs/node#49344 (NODE_V8_COVERAGE + FinalizationRegistry)? Doesn't seem like it since adding the following to the c8 entry file or to the test file doesn't help:

globalThis.FinalizationRegistry = function(){};
globalThis.FinalizationRegistry.prototype.register = function(){};

@jsumners
Copy link
Member

Is it only happening on 18? Does it pass on 20+?

@tlhunter
Copy link
Contributor Author

Looks like it also happens on the oldest and newest 20, and not the last 16.

@jsumners
Copy link
Member

Drat! If it were only on 18, we had an easy solution.

@tlhunter tlhunter force-pushed the tlhunter/diagnostics-channel branch from e38ab6c to 33e79ee Compare May 22, 2024 02:04
@mcollina
Copy link
Member

Hopefully the problem with the v18 FinalizationRegistry timeouts will be fixed with the upcoming release, the fix has finally been backported.

@tlhunter tlhunter force-pushed the tlhunter/diagnostics-channel branch from 33e79ee to 0c30876 Compare May 30, 2024 19:56
@tlhunter
Copy link
Contributor Author

I rebased and kicked off the CI jobs again. It appears that fewer are failing:

2024-05-30_170408

I did some more testing locally:

v20.11.1 HANG
v20.12.0 OK 2024-03-26

v18.19.1 HANG
v18.20.0 HANG
v18.20.1 HANG
v18.20.2 HANG
v18.20.3 OK 2024-05-21

Currently it looks like CI is running the final failing tests against v18.20.2. Hopefully it'll disappear from the cache soon and the PR will turn green?

2024-05-30_171202

@tlhunter
Copy link
Contributor Author

tlhunter commented May 31, 2024

@jsumners @mcollina either someone cleared the cache or I had checked an older job... Looks like everything is green now!

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mcollina mcollina merged commit 7b11fc3 into fastify:main May 31, 2024
28 checks passed
@mcollina mcollina added this to the v5.0.0 milestone May 31, 2024
@mcollina mcollina added the v5.x Issue or pr related to Fastify v5 label May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation v5.x Issue or pr related to Fastify v5
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants