Add *consistent* wildcard support #270

Open: HansBrende wants to merge 7 commits into master

@HansBrende HansBrende commented Feb 1, 2022

I realize that there is already a PR to add wildcard support. However, that PR is going to add some pretty big inconsistencies between named and anonymous wildcard groups, in a very confusing way. See: whatwg/urlpattern#163

The response over there so far has been "too bad, so sad, who cares."

So I'm taking the initiative to create this PR to at least demonstrate that there is a way forward with wildcard support that actually makes sense.

Why do I care? Because once URLPattern comes out of experimental status, we will be stuck with the choices we make today, FOREVER. If the path-to-regexp library itself chooses the right way forward, then maybe URLPattern will follow suit, since it seems to care a lot about compatibility with this library.

How did I go about making the two behaviors compatible with each other? Simply by not disallowing empty path segments in named wildcard groups, and not requiring a trailing slash in anonymous wildcard groups.

Hopefully my efforts here at least give some people food for thought.

P.S. the pre-commit hook that added 600 lines to this PR is really annoying.

P.P.S. Fixes #214. Also fixes or re-fixes the previously closed issues #228, #212, #196, #194 (for the case where it makes sense), #103, #87, and #37.

Detailed breakdown:

Issue Pattern Expectation This PR WICG PR v6.2 v2 - v5 v0.1.7
#214 /files/:path*\.:ext* /files/my/photo.jpg/gif 💀
#228 /:everything* /
#228 abc/:everything* abc
#212 #/* #/ 💀
#196 * /any 💀
#194 /foo/:bar* /foo/test1//test2
#103 /entity/:id/* /entity/foo 💀
#103 /entity/:id/* /entity/foo/ 💀
#87 /test/* /test 💀
#37 /some-path/* /some-path/anything 💀

(✅ = match; ❌ = no match; 💀 = parse error; all tests run in strict mode)

…ards to match paths with empty path segments
@blakeembrey
Member

Let me fix the codebase, it's my bad for merging a dependabot PR (which I also had to disable due to sheer volume of noise) which bumped prettier without the concurrent re-formatting. I'll fix that now and you can rebase the changes.

Aside from that, I'm less sure about the actual behavior of /files/:path*\.:ext* and /:everything* passing. Especially the first will be a regression for some users who might be using this in non-path positions. Same with the ones allowing //: the implication there is that we're allowing empty segments. Does that mean /:test should work with just /?

@blakeembrey
Member

blakeembrey commented Feb 11, 2022

The other thing to keep in mind with * support is backward compatibility with existing scripts. The behavior of /test/* is actually incompatible with old path-to-regexp and incompatible with people migrating from other related syntaxes, such as older specs on the web that might just treat * as a simple wildcard. Making it behave as a segment makes it harder for customers to properly migrate. An example in hostname search would be http://*.example.com which clearly does subdomain matching today, but in a world where it acts like a segment the subdomain becomes optional (currently path-to-regexp is only prefix segments, but the example still works if you do foo.*.example.com instead).

Overall the most unfortunate decision in consistent behavior has been the inclusion of magic prefixes which makes all this behavior harder to follow. If it was explicit the confusion would not exist, e.g. foo{.*}.example.com would be an optional domain segment while foo.*.example.com is explicit. The same concept for paths too, e.g. foo/* vs foo{/*}.

@blakeembrey
Member

Also forgot to mention in my initial review but the summary table you created is really awesome regardless of the actual output of the PR 💯

@HansBrende
Author

HansBrende commented Feb 11, 2022

@blakeembrey thanks for the responses and follow-up questions.

Does that mean /:test should work with just /?

No, I would say definitely not. In the absence of a wildcard character, a variable should always have a non-empty value, IMO. The reason it makes sense to me to allow empty path segments when a wildcard is specified is that in this case, the intention is clear: devs want to match for anything. So any valid path should match... regardless of whether or not it contains empty path segments. (This is particularly useful, fyi, in routing libraries. See, e.g., react-router or wouter for examples of how they are using the same syntax. It is quite unexpected behavior to specify a 404 for anything that doesn't match the previous routes, only to have a blank page instead of your 404 page show up because the actual url typed in by the user happened to contain an empty path segment.)

Especially the first [/files/:path*.:ext*] will be a regression for some users

But in this case my PR is consistent with v2 thru v5, right? So wouldn't that be fixing a regression?

...in a world where it acts like a segment the subdomain becomes optional (currently path-to-regexp is only prefix segments, but the example still works if you do foo.*.example.com instead)

True, however... the behavior of the WICG PR will do a similar thing by allowing foo..example.com to match, which isn't even a valid hostname. If users really want correct behavior in this case they should probably be doing foo.(.+).example.com, or just specify that the only prefix in the config should be "/" and not both "." and "/". Alternatively, I could also simply amend this PR to only do this for forward slashes and not dots, since that's really the use case anyone cares about, and paths & hostnames obviously have different semantics. I agree that abc.* makes slightly less sense to have an optional trailing dot than abc/* with the optional trailing forward slash.

EDIT: and honestly I'm not sure the subdomain issue is much of a problem in the first place, because even though foo.*.example.com "matches" foo.example.com with this PR, the regexp exec result[1] will be undefined rather than empty string. So if that's really important to the developer to have it not match a url missing the subdomain portion (which... I would argue... is probably less likely than the alternative use-case), they could instead check if the result of the match is not undefined; i.e., they could simply switch the logic from pattern.exec(path) != null to pattern.exec(path)?.[1] != null.
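To make the difference between the two checks concrete, here is a minimal sketch. The regular expression is a hand-written stand-in for what foo.*.example.com might compile to when the "." prefix and the wildcard are optional together; it is an assumption for illustration, not actual output of this PR, and the function names are hypothetical.

```typescript
// Hand-written approximation (an assumption, not real library output):
// the "." prefix and the wildcard group are optional as a unit.
const hostPattern: RegExp = /^foo(?:\.(.*))?\.example\.com$/;

// Loose check: any match counts, even when the subdomain part is missing.
function matchesLoosely(host: string): boolean {
  return hostPattern.exec(host) != null;
}

// Stricter check: the wildcard group must have participated in the match.
// A group that did not participate is undefined, not an empty string.
function matchesWithSubdomain(host: string): boolean {
  return hostPattern.exec(host)?.[1] != null;
}

console.log(matchesLoosely("foo.example.com"));           // true
console.log(matchesWithSubdomain("foo.example.com"));     // false
console.log(matchesWithSubdomain("foo.bar.example.com")); // true
```

Note that with this stand-in, foo..example.com still passes the stricter check, because the group participates and captures an empty string.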

EDIT #2: OR simply do foo.{*}.example.com (which basically aligns with existing syntax/behavior anyways).

Thoughts?

@blakeembrey
Member

The reason it makes sense to me to allow empty path segments when a wildcard is specified is that in this case, the intention is clear: devs want to match for anything.

Isn’t that why you’d want the asterisk or (.*) group to make it clear? I don’t completely disagree, but it does make for some changes in other places. E.g. does path matching that makes arrays today leave in the empty holes or remove them? Does the optional delimiter also allow empty?

@blakeembrey
Member

Also something worth keeping in mind is whether we actually care about all these edge cases. In theory if you’re allowing repeated delimiters in one place shouldn’t we just allow it everywhere? But that makes for a more complex regex. And I’ve been of the opinion that the simple solution there is to fix it by normalizing paths at the router level before path matching. Arguably even the trailing slash should be at the router level to normalize it first.

@HansBrende
Author

HansBrende commented Feb 11, 2022

@blakeembrey

Isn’t that why you’d want the asterisk or (.*) group to make it clear?

This could be done; however, there are a few reasons why it would pose a major burden:

(1) Most front-end routing libraries (e.g. wouter, react-router) only expose a subset of the path-to-regexp syntax, and that subset does not include custom regexps at all! 99.999% of the routing architecture use-case can be accomplished without custom regexps, so it's a bit awkward to require one solely for the most common use-case in routing.
(2) Subrouting/404-page handling is such a common use-case that I'd guess probably 50% of patterns being matched for in the routing use-case want the equivalent of a trailing wildcard. The cost of making one specify a regexp for this use-case every. single. time. is going to add up.
(3) The wildcard syntax already exists... so devs WILL use it and prefer it over a manual regexp. And they will do so expecting it to mean one thing ("give me everything else!"). And in 99.999% of cases, it will mean that thing, so they won't change it. Except for the one rare edge case where the user mistypes an extra slash and the entire application breaks. In short: the wildcard syntax is SO handy that it would be a tremendous shame for it to fail on 1 odd-ball edge case in literally THE most common use-case for it.

does path matching that makes arrays today leave in the empty holes or remove them?

This is the part I don't really care about too much since I'm doing my own splitting logic when necessary... and only using the path-to-regexp part of the library (usually splitting is not necessary at all since the remaining path represented by the wildcard is either delegated to a subrouter, or sent to a 404 page). I do believe that the question of "how regexp match results are split" should be an entirely separate & independent question from the standardization of the actual generated regexp. That being said, it might make sense to leave in the holes here. Or take them out, either way. My gut tells me to leave them in, that way the original path can be reconstructed by joining the match with the "/" delimiter, rather than being lossy. Since the dev did not specify an overriding pattern, we're simply making the default pattern be (.*) in this case to unify behavior with the unnamed wildcard. Which allows empty string. So empty strings should be valid in the array. (I would think.)
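The "lossy" point about splitting can be sketched with plain string operations. This is only an illustration of the two options being discussed, not the library's decoding logic; the variable names are made up for the example.

```typescript
// A wildcard capture that happens to contain an empty path segment.
const captured: string = "test1//test2";

// Option A: keep the holes. The array round-trips back to the capture.
const withHoles: string[] = captured.split("/");

// Option B: drop the holes. The empty segment cannot be recovered afterwards.
const withoutHoles: string[] = withHoles.filter((segment) => segment !== "");

console.log(withHoles.join("/") === captured);    // true
console.log(withoutHoles.join("/") === captured); // false
```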

Does the optional delimiter also allow empty?

No. The use-case for the optional quantifier is pretty much 100% across the board either an optional path segment or an optional affix of a single path segment. Therefore, the optional quantifier should not imply a different default pattern than that of a mandatory path segment variable. The use-cases where a dev would want to match a variable or empty string in the former case are basically non-existent, since nobody writes their routes that way (for a very good reason). And if they did, they could simply use 2 patterns: 1 for the non-empty case (e.g. foo/:var/bar), and one for the empty case (e.g. foo//bar). (Alternatively: foo/:var([^/]*)/bar if they didn't want 2 patterns). That way it's explicit that they are expecting a very weird use-case to be handled. With the wildcard, however, the main use-case is that you really don't care about the internal structure of the match at all, if you haven't specified a pattern. Which means that the default pattern for wildcards should be "anything".
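As a rough illustration of the explicit alternative, the two regular expressions below are hand-written approximations of foo/:var/bar and foo/:var([^/]*)/bar. They are assumptions for the sake of the example, not output generated by the library.

```typescript
// Assumed default: a plain parameter must be non-empty.
const defaultParam: RegExp = /^foo\/([^\/]+)\/bar$/;

// Explicit override: the custom pattern [^/]* also allows an empty value.
const explicitEmptyOk: RegExp = /^foo\/([^\/]*)\/bar$/;

console.log(defaultParam.test("foo/x/bar"));    // true
console.log(defaultParam.test("foo//bar"));     // false
console.log(explicitEmptyOk.test("foo/x/bar")); // true
console.log(explicitEmptyOk.test("foo//bar"));  // true
```

The point of the override is that the unusual case is opted into visibly, rather than being the default for every parameter.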

And I’ve been of the opinion that the simple solution there is to fix it by normalizing paths at the router level before path matching.

That's not a robust solution though, because sometimes empty path segments actually mean something. (E.g. https://en.wikipedia.org/wiki///). If you normalize prior to putting the URL through the router, then you are arbitrarily deleting information from the URL that might indeed be used by subrouters. (If only to simply display the correct error message to the user, e.g. "404 path suffix foo////bar was not found in the directory!") This poses a particular problem for generic routing libraries, who cannot make the assumption that the user's routes can be normalized without changing the semantics.
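A small sketch of why router-level normalization loses information. The collapseSlashes helper is hypothetical (it is not part of path-to-regexp or any router mentioned here) and stands in for whatever normalization a router might apply before matching.

```typescript
// Hypothetical normalizer: collapse every run of slashes into a single slash.
function collapseSlashes(path: string): string {
  return path.replace(/\/{2,}/g, "/");
}

// The empty segments in these paths are gone after normalization, so a
// subrouter can no longer tell what the user actually typed.
console.log(collapseSlashes("/wiki///"));          // "/wiki/"
console.log(collapseSlashes("/files/foo////bar")); // "/files/foo/bar"
```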

@HansBrende
Author

@blakeembrey P.S. -- if you get a chance and haven't already, please do check out the original issue I filed in URLPattern: whatwg/urlpattern#163.

In that issue, I go a lot more in-depth into all the pros & cons that I see here, as well as listing out tables of possible different behaviors. (The behavior I've implemented in this PR is equivalent to "Proposed Behavior 1" in that issue.)

blakeembrey added a commit that referenced this pull request May 28, 2024
@blakeembrey
Member

I have added the tests from this PR description to the library in cb27d37 to be released in v7. Unfortunately I haven't supported * standalone behavior with an optional / prefix in the match due to difficulty maintaining compatibility with the match and compile parts of the library. I'm probably missing something obvious but I can't get the existing tests passing while also supporting standalone *, and I suspect I need to think more about it; it's definitely been giving me a headache.

The approach I took to get as close as possible was to allow a set of characters to be repeated indefinitely. In this case just / by default. This maintains existing behavior and enables some failing cases above.

@HansBrende
Author

@blakeembrey Cool stuff! I made sure all the tests passed when I wrote this PR (if I recall correctly). Which test(s) was failing that you wanted to succeed with the standalone *?

@blakeembrey
Member

Which test(s) was failing that you wanted to succeed with the standalone *?

It's mostly the decoding of repeat parameters, I had to make it a regex match to get each sub-section due to the split now being regex-like, and .* results in empty parameter parts. It's not impossible to fix, but not ideal either.

I'm more in the camp of treating * as a separate behavior (such as .*) after playing around with it for the past few days. It's hard to make these two things reconcile and feels like a losing battle.

On the other hand, I'm trying to recall conversations with @wanderview on this and IIRC the * was for backward compatibility with existing patterns in the web platform. But I'm unsure if the existing platforms expect .* or something else, since I'm kind of guessing it was for patterns like *.example.com which matches glob behavior more than it does .* behavior.

Finally, on that note, I'm also wondering out loud whether trying to make these two things reconcile is a waste and instead I should just support the basics expected of glob matching such as *, **, etc. A lot to unpack here, I don't expect any answers, but certainly appreciate perspective 😄

@HansBrende
Author

HansBrende commented May 28, 2024

@blakeembrey gotcha! Am I correct in stating then, that the current behavior you've implemented should nearly* match my "Proposed Behavior 2"?

If that's the case, obviously that works for me since I proposed it as the minimal necessary change to ensure basic consistency.

My perspective would be, it makes perfect sense to go with "Proposed Behavior 2" if you are planning on adding glob-matching-consistent support for *. Otherwise, "Proposed Behavior 1" (implemented in this PR) seems more ideal.

(In either direction you go, I believe .* is supported only inside parentheses, so there is no possible overlap in semantics with that particular syntax.)

* Proposed Behavior 2 with one tweak /foo /foo/ /foo/bar/ /foo//bar
/foo/:rest*
/foo/*

@blakeembrey
Member

that the current behavior you've implemented should nearly match my whatwg/urlpattern#163?

So currently all I've done is add an option to make / become /+, which resolves most of the concerns around (.*) (since that's similar to [^/]+(/+[^/]+)*). The one gotcha is the standalone *.
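To see what changing the delimiter from / to /+ buys, and where it still differs from (.*), here is a minimal comparison using hand-written regular expressions. They mirror the shapes quoted in this comment but are anchored for testing; they are not the library's generated output.

```typescript
// Delimiter is exactly one "/": repeated slashes inside the value do not match.
const singleSlash: RegExp = /^[^\/]+(?:\/[^\/]+)*$/;
// Delimiter is "/+": runs of slashes between segments are accepted.
const repeatedSlash: RegExp = /^[^\/]+(?:\/+[^\/]+)*$/;
// Plain wildcard: accepts anything, including the empty string.
const anything: RegExp = /^(.*)$/;

const inputs: string[] = ["test1/test2", "test1//test2", ""];
for (const input of inputs) {
  console.log(
    JSON.stringify(input),
    singleSlash.test(input),
    repeatedSlash.test(input),
    anything.test(input),
  );
}
```

The first input matches all three, the second fails only the single-slash form, and the empty string matches only (.*), which is one place where the two approaches are similar but not identical.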

So I believe it currently matches "Proposed Behavior 1" as a result of that.

I do want to make sure encoding/decoding works with whatever I do, so having * behave like [^/]*(/+[^/]*)* is pretty close to :path* behavior, but not the same. But I'd be in favor of making that change and just having * behave differently after this exercise.

But if they are going to behave differently, I am wondering if it should just match glob semantics that people already know, allowing for it to be used as *.example.com or **.example.com, and /*/file.png or /**/file.png.

@HansBrende
Author

HansBrende commented May 29, 2024

@blakeembrey ahhh, after running through all my tests with the new behavior, I see what you are saying.

First of all, amazing work!!! Except for that one test case (* and "/any"), it seems that all the weirdness is fixed with the new behavior! sigh of relief

Second thing: it's worth noting that, for that one failing test case, the results in the current behavior are consistent between named and anonymous wildcard groups.

I.e.:
* "/any": NO MATCH
:rest* "/any": NO MATCH

However, I think it would be easy enough to fix this in both cases (should you want to) simply by prefixing that component of the regex by something like the following:

(?:^\/+)?

i.e., throwing away any preceding delimiters if they occur at the beginning of the string (in which case they are not handled by the previous path segment's regexp simply because there is no previous path segment regexp to strip them away). Thoughts?

EDIT: Or, on further inspection:

I took a look at the generated output for */*, which is as follows (after replacing (?:[^\/]+?)(?:\/+(?:[^\/]+?))* with "RELPATH_REGEX"):

^(?:(RELPATH_REGEX))?(?:\/+(RELPATH_REGEX))?(?:\/+)?$

This is very similar to the regex for * alone:

^(?:(RELPATH_REGEX))?(?:\/+)?$

So, it seems it would suffice to simply change the first capturing group in the list to be just like the others, except with \/* instead of \/+, i.e:

^(?:\/*(RELPATH_REGEX))?(?:\/+)?$,

or, even just preceding the whole expression with an optional slash, similar to how the expression is suffixed with an optional slash, i.e:

^(?:\/+)?(?:(RELPATH_REGEX))?(?:\/+)?$

which can be simplified, by the way, to:

^(?:\/*)(?:(RELPATH_REGEX))?(?:\/*)$
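The effect of the suggested prefix can be checked directly by assembling the expressions quoted above. This is a sketch for experimentation only: the variable names are made up, and the slashes are left unescaped because the RegExp constructor does not require escaping them.

```typescript
// The sub-expression abbreviated as RELPATH_REGEX above.
const RELPATH: string = "(?:[^/]+?)(?:/+(?:[^/]+?))*";

// Quoted output for "*" alone: no allowance for leading slashes.
const currentStar: RegExp = new RegExp("^(?:(" + RELPATH + "))?(?:/+)?$");

// Suggested simplified form: optional slashes before and after.
const prefixedStar: RegExp = new RegExp("^(?:/*)(?:(" + RELPATH + "))?(?:/*)$");

console.log(currentStar.exec("/any"));                  // null
console.log(currentStar.exec("any")?.[1]);              // "any"
console.log(prefixedStar.exec("/any")?.[1]);            // "any"
console.log(prefixedStar.exec("//any//more/")?.[1]);    // "any//more"
```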

@blakeembrey
Member

throwing away any preceding delimiters if they occur at the beginning of the string (in which case they are not handled by the previous path segment's regexp simply because there is no previous path segment regexp to strip them away).

I considered this but it's a little odd to me. No strong opinions, but it does tend to feel like (.*) is simpler and still split on the / anyway when decoding 🤷 Although it's consistent, I suspect people might be surprised when using * to find it's not just (.*). Which sucks to say, but I'm leaning toward URLPattern having made a reasonable decision on this one. Trying not to be biased by the fact they've already made the change, but it is tricky for existing user expectations. Especially because I'd be adding this back for Express.js v5 which has * mapping to (.*) already.

To you, is there any difference at the point of the above between .* and /+?[^/]+...? I'd split it similarly and the behavior should add up the same way.

@HansBrende
Author

HansBrende commented May 29, 2024

@blakeembrey I edited my above comment to add a few more suggestions.

I nearly agree with you: * should match /any, on that we are agreed. And it doesn't matter to me whether it matches (.*) or /+?(RELPATH). The former I think is even preferable. The only question in my head is:

Shouldn't :rest* also match /any and give consistent results with *?

My strong opinion is that if * matches, then :rest* matches, and vice versa. Any reason for this not to be the case?

My weaker opinion is that they should both give consistent matching results. But if in this one case, :rest* strips off the leading slash and * does not, I probably wouldn't mind.

(Note: I think this choice would also apply to all paths of the form */rest/of/path and :rest*/rest/of/path).

@blakeembrey
Member

Any reason for this not to be the case?

No good reason. It just feels kind of janky for that to behave differently to /:path? and /:path+. Shouldn't they all allow that optional / at the beginning? So far no one has made a real issue about :path* not matching / at the beginning, so I believe it's been acceptable and understandable, while I do believe the * by itself will cause a bunch of gotchas.

@blakeembrey
Member

I think I'm seeing * in those cases more as a modifier to make repeatable patterns in matching, vs trying to make a named wildcard. I'm seeing * by itself as just a wildcard. And I believe the /+ fix makes each more reasonable now and fixes the caveats I've seen mentioned.

@HansBrende
Author

HansBrende commented May 29, 2024

No good reason. It just feels kind of janky for that to behave differently to /:path? and /:path+. Shouldn't they all allow that optional / at the beginning?

@blakeembrey Those two, unlike :path*, are prefixed with /. To compare apples to apples, we'd need to consider :path? and :path+.

  1. Should * match /any? Both agree yes
  2. Should :rest* match /any? (I would argue yes to bring consistency with * to 100%... just seems a shame to stop at 99.9% consistent!)
  3. Should :rest+ match /any? (I would argue yes for consistency with :rest*)
  4. Should :item and :item? match /any? (I would again say yes, given that HTTP URL paths are always guaranteed to begin with a forward slash, so the only possible semantics the user could have intended by putting :item/... at the beginning of the path is if it was meant to match the first path component, regardless if the root-level forward slash were still present or stripped off). This would be a convenient way to match the first path component, regardless if it were relative or absolute.

Having said all that, I do not have a strong opinion on this point after all, since I realized that this distinction will never come up for me anyways in real life... given that all URL paths always start with a forward slash (and therefore I can always pre-normalize path patterns to include a preceding forward-slash if they are missing one prior to generating the regexp).
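The pre-normalization mentioned here could look something like the following. The ensureLeadingSlash helper is hypothetical and operates on the pattern string only, before it is handed to whatever compiles it to a regexp.

```typescript
// Hypothetical helper: make sure a path pattern starts with "/" so that it
// lines up with URL paths, which always begin with a forward slash.
function ensureLeadingSlash(pattern: string): string {
  return pattern.charAt(0) === "/" ? pattern : "/" + pattern;
}

console.log(ensureLeadingSlash(":rest*")); // "/:rest*"
console.log(ensureLeadingSlash("*"));      // "/*"
console.log(ensureLeadingSlash("/foo/*")); // "/foo/*"
```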

Making the corresponding root-level forward slash optional in the path pattern would be convenient since I could then omit all the first forward slashes from patterns and they'd still match the right paths, and it would boost that consistency between * and :rest* to 100% which would be neat.

(Actually, probably the biggest tangible bonus I can think of to doing that would be to simplify the documentation: you could just say that * is exactly the same as :named_wild* without using the word "except" 😄 .)

But I think either way would work as both solve all of the actual, tangible use-cases here.

Development

Successfully merging this pull request may close these issues.

Multiple parameters suffixed with an asterisk does not match - works with v5/v3
3 participants