ISUPPORT UTF8ONLY is not backwards-compatible. #456

Open
keaston opened this issue Jun 2, 2021 · 10 comments

Comments

@keaston

keaston commented Jun 2, 2021

One of the guiding principles of IRCv3 appears to be backwards-compatibility - from the FAQ:

> We intend for all the specs we put out to be backwards-compatible. In other words, if an old client connects to a server that supports IRCv3 extensions, that old client should work without an issue.

This is not the case in practice for the current design of the ISUPPORT UTF8ONLY specification, since clients that do not support the specification will happily send non-UTF8 and be disconnected for violating the protocol.

To be backwards-compatible, this should be opt-in with a CAP exchange. Once a client has ACK'd UTF8ONLY, it is reasonable to expect it not to send anything that violates the UTF8ONLY specification.

@SadieCat
Contributor

SadieCat commented Jun 2, 2021

The UTF8ONLY token only exists to let clients detect that the server is UTF-8 only. It is backwards compatible with the existing situation where servers that require UTF-8 silently break with clients which are not configured to use UTF-8.
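As a rough illustration of the detection described here, a client could look for the token in the `005` (RPL_ISUPPORT) numeric. This is only a sketch: the function names `parse_isupport` and `server_is_utf8_only` and the sample server line are made up for the example, and message tags and token negation are not handled.

```python
from typing import Dict, Optional


def parse_isupport(line: str) -> Dict[str, Optional[str]]:
    """Parse one RPL_ISUPPORT (005) line into a token -> value mapping.

    Tokens without a value (such as UTF8ONLY) map to None.
    Lines that are not a 005 numeric yield an empty dict.
    """
    # Drop the optional ":server.name" source prefix.
    if line.startswith(":"):
        line = line.partition(" ")[2]
    # Everything after " :" is human-readable trailing text, not tokens.
    head = line.split(" :", 1)[0]
    parts = head.split()
    if len(parts) < 2 or parts[0] != "005":
        return {}
    tokens: Dict[str, Optional[str]] = {}
    # parts[1] is the client's nickname; the rest are ISUPPORT tokens.
    for tok in parts[2:]:
        name, sep, value = tok.partition("=")
        tokens[name] = value if sep else None
    return tokens


def server_is_utf8_only(tokens: Dict[str, Optional[str]]) -> bool:
    """True if the server advertised the UTF8ONLY token."""
    return "UTF8ONLY" in tokens


# Hypothetical server line, for illustration only.
sample = ":irc.example.org 005 alice UTF8ONLY CHANTYPES=# :are supported by this server"
if server_is_utf8_only(parse_isupport(sample)):
    print("Server only accepts UTF-8; switching outgoing encoding to UTF-8")
```

A client that sees the token can then force its outgoing encoding to UTF-8 regardless of user configuration; a client that never looks for it behaves exactly as it did before.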

The spec does not specify any required method for handling clients that send non-UTF-8. It's entirely legal under the spec for implementations to transcode any non-UTF-8 to UTF-8 if they want.
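One way a server might do the transcoding mentioned above is sketched below. The helper name `to_utf8` and the choice of Windows-1252 as the fallback are assumptions for the example; a server cannot know which legacy encoding a client really used, so any fallback is a guess.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode it.

    Invalid input is decoded with the (assumed) fallback encoding and
    re-encoded as UTF-8. Bytes the fallback cannot map become U+FFFD.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, pass through untouched
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace").encode("utf-8")


# b"caf\xe9" is "café" in Latin-1/Windows-1252 and is not valid UTF-8.
print(to_utf8(b"caf\xe9").decode("utf-8"))
```

Valid UTF-8 passes through byte for byte, so clients that already comply are unaffected by this step.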

@keaston
Author

keaston commented Jun 2, 2021

> The UTF8ONLY token only exists to let clients detect that the server is UTF-8 only. It is backwards compatible with the existing situation where servers that require UTF-8 silently break with clients which are not configured to use UTF-8.

Such servers aren't really following the spirit of the backwards-compatibility principle, so it seems harmful to endorse that approach in IRCv3. As written, it looks like a desired and encouraged part of the specification. Ideally it would at least say that servers SHOULD NOT drop the client for sending non-UTF8, though they may ignore individual protocol messages.

@DanielOaks
Member

DanielOaks commented Jun 2, 2021

> since clients that do not support the specification will happily send non-UTF8 and be disconnected for violating the protocol.

Ideally such servers would always handle these cases without disconnecting the client. However, given the amount of discussion that'd likely result from trying to specify one specific way of handling these cases, I thought it'd be best to just let the servers handle it in whatever way they find appropriate.

> To be backwards-compatible, this should be opt-in with a CAP exchange. Once a client has ACK'd UTF8ONLY, it is reasonable to expect it not to send anything that violates the UTF8ONLY specification.

Unfortunately we can't make this opt-in with a CAP, since servers that only accept UTF-8 traffic already exist and they need to transcode, reject, or in some other way handle non-UTF-8 traffic from clients in line with the definition written in the spec anyway.

> ideally it would at least say that servers SHOULD NOT drop the client for sending non-UTF8, though they may ignore individual protocol messages

Definitely makes sense to discourage disconnecting the client outright. I'll play with the language there and try to PR some alternative language that encourages that only as a last resort. Thanks for the note, much appreciated!

@slingamn
Contributor

slingamn commented Jun 2, 2021

> Such servers aren't really following the spirit of the backwards-compatibility principle, so it seems harmful to endorse that approach in IRCv3.

It's a tricky issue, yeah. I think a compatibility break is inherent in the intent of the specification: if a server implements the spec, it's never really going to interoperate acceptably with clients that use non-UTF8 encodings. Even if you can robustly transcode input, the server will only emit UTF8, likely violating client expectations that the output encoding will agree with the input encoding.

I agree with the suggestion that disconnecting the client altogether is unnecessarily aggressive and should probably be deprecated. (From the comment history on #432, it sounds like we were exploring it as the best way to get the end user's attention.)

@vanosg
Contributor

vanosg commented Jul 2, 2022

I'll bump this issue a year later. I agree: disconnecting a client over UTF8 seems heavy-handed, yet it appears as a suggested option in the UTF8ONLY spec. I would love to see this language removed.

@DanielOaks
Member

I think this change gives a more accurate explanation of why this spec exists, and also removes the disconnection language entirely. Please let me know what you think: https://gist.github.com/DanielOaks/02a60498e4be4ecb7d6be387eecb642a/revisions#diff-014869833613b58c7e37f5208548f4e64d8d0deb465a47d1db21da761158f143=

@vanosg
Contributor

vanosg commented Jul 3, 2022

I think the changes improve the document, and appreciate the removal of the language referencing disconnection as a server option.

@slingamn
Contributor

slingamn commented Jul 3, 2022

I'm OK with removing the disconnection language, but I don't like the other changes.

> Only allowing this encoding breaks compatibility with the IRC protocol as written

Is this true? I've always thought of UTF8ONLY as being an example of a server's ability to impose a content moderation policy. In this case, non-UTF8 "payloads" (final parameters to PRIVMSG, NOTICE, USER, TOPIC, etc.) are being disallowed.
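To make the "payload policy" reading concrete, here is a minimal sketch of a server-side check that rejects a single message instead of dropping the connection. The function name `check_line`, the command set, and the human-readable text are all made up for the example. The `FAIL ... INVALID_UTF8` reply follows my recollection of the standard-replies code associated with UTF8ONLY, so treat its exact form as an assumption. Message tags and final parameters sent without a leading `:` are not handled.

```python
from typing import Optional

# Commands whose final parameter is free-form text (illustrative subset).
PAYLOAD_COMMANDS = {b"PRIVMSG", b"NOTICE", b"USER", b"TOPIC"}


def check_line(raw: bytes) -> Optional[str]:
    """Return a FAIL reply if the line's payload is not UTF-8, else None.

    The connection stays open either way; only the offending
    message is rejected.
    """
    line = raw.rstrip(b"\r\n")
    if line.startswith(b":"):
        # Skip the optional source prefix.
        line = line.partition(b" ")[2]
    head, sep, trailing = line.partition(b" :")
    command = head.split(b" ", 1)[0].upper()
    if not sep or command not in PAYLOAD_COMMANDS:
        return None  # no trailing payload, or not a command we police
    try:
        trailing.decode("utf-8")
    except UnicodeDecodeError:
        # Reply format is an assumption based on the standard-replies style.
        return (
            f"FAIL {command.decode('ascii')} INVALID_UTF8 "
            ":Message rejected, UTF-8 is required on this network"
        )
    return None


print(check_line(b"PRIVMSG #chan :caf\xe9\r\n"))
```

Under this reading the wire protocol itself is untouched: the server parses the line as bytes as usual and only applies a policy to the final parameter.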

@DanielOaks
Member

> > Only allowing this encoding breaks compatibility with the IRC protocol as written
>
> Is this true? I've always thought of UTF8ONLY as being an example of a server's ability to impose a content moderation policy. In this case, non-UTF8 "payloads" (final parameters to PRIVMSG, NOTICE, USER, TOPIC, etc.) are being disallowed.

Depends on your view of the protocol, I guess. Some people see disallowing that content as a protocol break in itself, some responses to non-UTF-8 content (e.g. disconnecting the client) would probably count as a protocol break, and some people don't see it as a protocol break at all.

In my reading of that sentence, I'm treating the 'decode everything as UTF-8' approach that some software takes as a departure from the 'traditional' treat-everything-as-octets-and-bytes approach, but I guess the token and the stdreplies code themselves don't necessarily imply that 🤷

@slingamn
Contributor

slingamn commented Jul 3, 2022

> some don't see it as a protocol break.

Put me in this camp :-)

I found a better way to phrase my objection: the current spec language implies that non-UTF8 is legacy and UTF8 is preferred. I like this implication and I want to keep it.

5 participants