ISUPPORT UTF8ONLY is not backwards-compatible. #456

keaston · 2021-06-02T01:38:24Z

One of the guiding principles of IRCv3 appears to be backwards-compatibility - from the FAQ:

> We intend for all the specs we put out to be backwards-compatible. In other words, if an old client connects to a server that supports IRCv3 extensions, that old client should work without an issue.

This is not usefully the case for the current design of the ISUPPORT UTF8ONLY specification, since clients that do not support the specification will happily send non-UTF8 and be disconnected for violating the protocol.

To be backwards-compatible, this should be opt-in with a CAP exchange. Once a client has ACK'd UTF8ONLY, it is reasonable to expect it not to send anything that violates the UTF8ONLY specification.

SadieCat · 2021-06-02T01:45:20Z

The UTF8ONLY token only exists to let clients detect that the server is UTF-8 only. It is backwards compatible with the existing situation where servers that require UTF-8 silently break with clients which are not configured to use UTF-8.
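As an illustration of the detection described here, the following is a minimal sketch of how a client might look for the token in an RPL_ISUPPORT (005) line. The function name `isupport_tokens`, the server name, and the sample tokens are hypothetical and chosen only for the example; negated tokens and other edge cases are not handled.

```python
from typing import Dict, Optional


def isupport_tokens(line: str) -> Dict[str, Optional[str]]:
    """Parse one RPL_ISUPPORT (005) line into a token -> value mapping."""
    # Drop the human-readable trailing parameter, e.g. ":are supported by this server".
    head = line.split(" :", 1)[0]
    parts = head.split()
    # Skip an optional ":source" prefix.
    if parts and parts[0].startswith(":"):
        parts = parts[1:]
    # Expect the "005" numeric followed by the client's nick.
    if len(parts) < 2 or parts[0] != "005":
        return {}
    tokens: Dict[str, Optional[str]] = {}
    for tok in parts[2:]:
        name, sep, value = tok.partition("=")
        # Tokens without "=" (such as UTF8ONLY) carry no value.
        tokens[name] = value if sep else None
    return tokens


# Hypothetical sample line; real servers advertise many more tokens.
line = ":irc.example.org 005 nick CASEMAPPING=ascii UTF8ONLY NETWORK=Example :are supported by this server"
tokens = isupport_tokens(line)
utf8_only = "UTF8ONLY" in tokens
```

A client that finds the token could then switch its outgoing encoding to UTF-8 regardless of user configuration; a client that does not know the token simply ignores it.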

The spec does not specify any required method for handling clients that send non-UTF-8. It's entirely legal under the spec for implementations to transcode any non-UTF-8 to UTF-8 if they want.
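To make the range of permitted behaviours concrete, here is a rough sketch of two ways a server might treat a non-UTF-8 payload: transcoding it or rejecting it. The function `handle_payload`, the `policy` argument, the choice of Latin-1 as the legacy encoding, and the exact reply text are all assumptions made for illustration; the `INVALID_UTF8` code is assumed to be the standard-replies code associated with this spec.

```python
from typing import Optional, Tuple


def handle_payload(raw: bytes, policy: str = "reject") -> Tuple[Optional[str], Optional[str]]:
    """Return (text, reply): the payload to relay and an optional reply to the sender."""
    try:
        # Valid UTF-8 passes through untouched.
        return raw.decode("utf-8"), None
    except UnicodeDecodeError:
        pass
    if policy == "transcode":
        # Assumption: legacy input is Latin-1, which maps every byte to a code point.
        text = raw.decode("latin-1")
        return text, "WARN PRIVMSG INVALID_UTF8 :Message was transcoded to UTF-8"
    # Otherwise drop this one message but keep the connection open.
    return None, "FAIL PRIVMSG INVALID_UTF8 :Message rejected, it contained non-UTF-8 data"
```

Neither branch disconnects the client: the worst case is that a single message is dropped and the sender is told why.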

keaston · 2021-06-02T02:42:45Z

> The UTF8ONLY token only exists to let clients detect that the server is UTF-8 only. It is backwards compatible with the existing situation where servers that require UTF-8 silently break with clients which are not configured to use UTF-8.

Such servers aren't really following the spirit of the backwards-compatibility principle, so it seems harmful to endorse that approach in IRCv3. As written, it looks like a desired and encouraged part of the specification. Ideally it would at least say that servers SHOULD NOT drop the client for sending non-UTF8, though they may ignore individual protocol messages.

DanielOaks · 2021-06-02T03:18:07Z

> since clients that do not support the specification will happily send non-UTF8 and be disconnected for violating the protocol.

Ideally such servers would always handle these cases without disconnecting the client. However, given the amount of discussion that'd likely result from trying to specify one specific way of handling these cases, I thought it'd be best to just let the servers handle it in whatever way they find appropriate.

> To be backwards-compatible, this should be opt-in with a CAP exchange. Once a client has ACK'd UTF8ONLY, it is reasonable to expect it not to send anything that violates the UTF8ONLY specification.

Unfortunately we can't make this opt-in with a CAP, since servers that only accept UTF-8 traffic already exist and they need to transcode, reject, or in some other way handle non-UTF-8 traffic from clients in line with the definition written in the spec anyway.

> ideally it would at least say that servers SHOULD NOT drop the client for sending non-UTF8, though they may ignore individual protocol messages

Definitely makes sense to discourage disconnecting the client outright. I'll play with the language there and try to PR some alternative language that encourages that only as a last resort. Thanks for the note, much appreciated!

slingamn · 2021-06-02T04:10:54Z

> Such servers aren't really following the spirit of the backwards-compatibility principle, so it seems harmful to endorse that approach in IRCv3.

It's a tricky issue, yeah. I think a compatibility break is inherent in the intent of the specification --- if a server implements the spec, it's never really going to interoperate acceptably with clients that use non-UTF8 encodings (even if you can robustly transcode input, the server will only emit UTF8, likely violating client expectations that the output encoding will agree with the input encoding).
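The output-side mismatch can be shown with a tiny sketch. It assumes a client that decodes everything it receives as Latin-1; the sample word is arbitrary.

```python
# What a UTF-8-only server puts on the wire for the word "café".
text = "caf\u00e9"
wire = text.encode("utf-8")

# What a client configured for Latin-1 shows when it decodes those bytes.
seen = wire.decode("latin-1")

print(repr(wire))  # the two-byte UTF-8 sequence for "é" is visible here
print(seen)        # the accented letter is displayed as two unrelated characters
```

Even if the server transcodes everything this client sends, the client still misreads every non-ASCII character it receives.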

I agree with the suggestion that disconnecting the client altogether is unnecessarily aggressive and should probably be deprecated. (From the comment history on #432, it sounds like we were exploring it as the best way to get the end user's attention.)

vanosg · 2022-07-02T23:21:57Z

I'll bump this issue a year later: I agree that disconnecting a client over UTF8 seems heavy-handed, and it appears to be an option suggested in the UTF8ONLY spec. I would love to see this language removed.

DanielOaks · 2022-07-03T03:44:41Z

I think this change gives a more accurate explanation of why this spec exists, and also removes the disconnection language entirely. Please let me know watcha think: https://gist.github.com/DanielOaks/02a60498e4be4ecb7d6be387eecb642a/revisions#diff-014869833613b58c7e37f5208548f4e64d8d0deb465a47d1db21da761158f143=

vanosg · 2022-07-03T04:47:39Z

I think the changes improve the document, and appreciate the removal of the language referencing disconnection as a server option.

slingamn · 2022-07-03T05:01:43Z

I'm OK with removing the disconnection language, but I don't like the other changes.

> Only allowing this encoding breaks compatibility with the IRC protocol as written

Is this true? I've always thought of UTF8ONLY as being an example of a server's ability to impose a content moderation policy. In this case, non-UTF8 "payloads" (final parameters to PRIVMSG, NOTICE, USER, TOPIC, etc.) are being disallowed.
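Read this way, the check applies only to the final parameter of a message, not to the protocol framing. A rough sketch of that idea follows; the helper names `final_parameter` and `is_utf8` are hypothetical, and the parsing is deliberately simplified (it does not validate tags, sources, or parameter counts).

```python
from typing import Optional


def final_parameter(line: bytes) -> Optional[bytes]:
    """Return the last parameter of a raw IRC line, or None if it has no parameters."""
    line = line.rstrip(b"\r\n")
    # Drop optional message tags ("@...") and the source prefix (":...").
    while line.startswith((b"@", b":")):
        _, _, line = line.partition(b" ")
        line = line.lstrip(b" ")
    # A trailing parameter is introduced by " :" and may contain spaces.
    head, sep, trailing = line.partition(b" :")
    if sep:
        return trailing
    parts = head.split()
    # parts[0] is the command; anything after it is a parameter.
    return parts[-1] if len(parts) > 1 else None


def is_utf8(data: bytes) -> bool:
    """Report whether the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, `final_parameter(b"PRIVMSG #chan :h\xe9llo\r\n")` yields the bytes after the colon, and `is_utf8` reports them as invalid, so a server could refuse that payload while leaving the connection and the rest of the protocol untouched.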

DanielOaks · 2022-07-03T05:12:48Z

> > Only allowing this encoding breaks compatibility with the IRC protocol as written
>
> Is this true? I've always thought of UTF8ONLY as being an example of a server's ability to impose a content moderation policy. In this case, non-UTF8 "payloads" (final parameters to PRIVMSG, NOTICE, USER, TOPIC, etc.) are being disallowed.

Depends on your view of the protocol I guess. Some do see disallowing that as a protocol break, some responses to non-UTF-8 content (e.g. disconnecting the client) would prolly classify as a protocol break, and some don't see it as a protocol break.

I guess in my view of that sentence, I'm kind of conflating the 'decode everything as UTF-8' approach that some software takes with not following the 'traditional' treat-everything-as-octets-and-bytes direction, but I guess the token and stdreplies code themselves don't necessarily mean that 🤷

slingamn · 2022-07-03T07:44:54Z

> some don't see it as a protocol break.

Put me in this camp :-)

I found a better way to phrase my objection: the current spec language implies that non-UTF8 is legacy and UTF8 is preferred. I like this implication and I want to keep it.

slingamn mentioned this issue Jul 3, 2022

UTF8ONLY: remove suggestion to disconnect the client #502

Merged
