Redirect management overhaul #89

smashwilson · 2016-01-21T14:21:48Z

We're about to inherit some thousands of redirects, so the "giant hand-edited JSON file" approach to redirect configuration is quickly going to become untenable. Let's investigate some alternatives to make it a bit more manageable.

One thing to keep in mind: we already have an open issue to draft a new control repository layout (rackerlabs/nexus-control#271). It would be wise to design this in a way that's compatible with our thoughts there, so that we don't have to accept three kinds of redirect configuration!

/cc @j12y, @steveortiz

chesshacker · 2016-01-21T14:34:46Z

Possible feature: pass "aliases" to content in the front-matter yaml of each piece of content. When the content service loads it, it would see the aliases in the envelope data, and it could setup the redirects then... the only issue I see with this, is if we remove an alias from the front-matter yaml, how would the content service know to remove the redirect? Thoughts?

smashwilson · 2016-01-21T15:08:14Z

Possible feature: pass "aliases" to content in the front-matter yaml of each piece of content. When the content service loads it, it would see the aliases in the envelope data, and it could setup the redirects then... the only issue I see with this, is if we remove an alias from the front-matter yaml, how would the content service know to remove the redirect? Thoughts?

I do like the idea of managing redirects there, but there are lots of other difficulties with the implementation:

The information flow between the content service and the presenter is strictly one-way. We cheat a little to sync the control repository SHA updates - the presenter actually queries the content service for updates on each request and takes action if the hash has changed. We'd need to do something similar to sync the state of the current redirect map.
How do we handle cases where the source files don't map one-to-one to output pages? This is the case for almost all of our content repositories right now (that use the Sphinx single-page builder).
What about files that specify redirect aliases for content that already exists? What if a piece of content specifies an alias, and then later content is uploaded that maps to that path? Keep in mind that the content service knows nothing of the mapping, and that the mapping can change without notice to the content service.

Essentially, that would be a different, much harder thing to implement than what I'm talking about here. It might still be worth doing -- we'd talked about trying to infer redirects somehow, and I think this is more promising than that -- but it escalates the work required to the "months" range. We would also still need a way to configure redirects in the control repository, because otherwise we'd have no way to redirect off-cluster. So I'd rather improve the presenter config story first and get you something that helps you ship within a few weeks.

smashwilson · 2016-01-21T15:11:22Z

Can you open an issue on deconst/deconst-docs for that? I can try to brainstorm how to best fit that in there.

smashwilson · 2016-01-21T15:30:51Z

As for improvements on the redirect configuration format, some quick ideas:

Multiple files. This is the easiest one; we have a filename pattern we accept and just accumulate the rules from each file that matches. It'd let us introduce some semblance of organization.
Nesting. Would it be helpful to be able to nest rules somehow? Match on one pattern and only apply subrules if the match is successful. I think this would add some expressiveness to the rulesets and might let us have fewer rules. It would be harder for maintainers to grok, though.
Use something other than JSON? JSON makes regular expressions awkward because of the escaping that you need to do. On the other hand, it's incredibly convenient to parse and it's got an easily understandable syntax. I'm not sure what I'd propose in its place, though.

ktbartholomew · 2016-01-21T16:45:38Z

👍 for multiple files. It doesn't add a significant amount of complexity, but makes it easier to logically separate sets of redirects that have different purposes. If we're going to set that precedent, we should also allow multiple files for route and content configuration, too. Something like this:

example.com/
    content.d/
    rewrites.d/
    routes.d/

I believe "nesting" is already possible in a way, by setting "rewrite": true for a given redirect. This should allow you to alter the path, then process other rewrites using the altered path. This is what it's supposed to do, anyways.

I was going to weigh in on the JSON thing, but there's active discussion going on in Slack, so i'm gonna bite my tongue for a bit.

kenperkins · 2016-01-21T16:49:09Z

👍 on multiple files. I believe this would make our strategy a little easier to manage, while preserving the fundamental way we're doing redirects now.

smashwilson · 2016-01-22T15:32:08Z

@steveortiz: Any feedback on this? My thoughts right now:

Early next week, I can look into supporting a rewrites.d directory. That should alleviate the immediate pain of having a single, monolithic JSON blob sitting in our control repo, by giving us several JSON blobs in our control repo. This is fairly low effort and shouldn't derail me from the hosted-build work I'm trying to focus on too much.
Someone can write a script to convert the existing redirects from whatever format they're in now to JSON rules, and add it to the control repo.
Once hosted builds and the pull request builder are shipped, we can revisit and find a more maintainable, cleaner solution to the redirect management problem.

ktbartholomew · 2016-01-22T15:33:15Z

dibs on rewrites.d work.

smashwilson · 2016-01-22T15:33:29Z

👍 ✨

chesshacker · 2016-01-26T14:46:31Z

Sorry for dropping off the discussion... still heads down getting ready for our release tomorrow. But I did catch up on the previous conversation.

To answer @kenperkins's question about whether the current setup is adequate. Well, the rewrites config file in nexus-control would become unwieldy, but... yes, it would work in a pinch, if you don't mind my adding 5,000 lines to it. Longer-term, splitting that out helps some, but it is still disconnected from the content and the authors do not have an easy way to add to it.

Say for instance, that the product name changes, so all the articles with that in the product name need to change... ideally, you should also go back and update all other redirects that pointed to this article as well, so you don't end up with a long chain of redirects. What a mess!

If instead, in the content file, you had a list of "aliases"... you'd just add one old name to the aliases and rename the file. The preparer would see the new alias and handle it (discussed below). This is a better experience from the content creator's perspective... and with as many articles as we have in the KC, adding redirects is a frequent occurrence.

The problems @smashwilson mentioned. If the content is no longer in nexus... possible solution: rather you could add the keyword "redirect" to the front-matter... and the content store would not create an entry for content, but an entry for the redirect to whatever location was specified in the front-matter.

If there is a conflict, both an alias and content with the same path, the content should win. You could implement this by checking for an alias or redirect only if the content is not found. The case where you have the same alias specified by two different pieces of content is trickier. I guess you could create and possibly replace the alias whenever the request comes in, but only remove the alias if it includes the expected location.

I'm not saying the presenter shouldn't have the ability to redirect from the nexus-control. I think that is helpful. However, I think the authors should have easy access to redirects, so we should put something in the front-matter that gives them this capability.

ktbartholomew · 2016-01-26T15:15:31Z

The idea of aliases in the content files is problematic, because the presenter fetches content files based on the URL currently being requested. So if a visitor requests a.html, the presenter will fetch the content file for a.html. If, however, that file has been renamed to b.html and subsequently c.html, with aliases being added to its metadata, the presenter will still try to load a.html because the URL the visitor is using hasn't changed. It would have no way of knowing about the aliases of c.html without actually loading that file.

I think this points towards the need for a single source of truth regarding redirects. While we have this in our current static files, maybe there's a need for a "Redirect service" of some sort that is pinged by the content service when it receives content containing redirects, and is consulted by the presenter when routing requests.

smashwilson · 2016-01-26T16:26:13Z

The idea of aliases in the content files is problematic, because the presenter fetches content files based on the URL currently being requested. So if a visitor requests a.html, the presenter will fetch the content file for a.html. If, however, that file has been renamed to b.html and subsequently c.html, with aliases being added to its metadata, the presenter will still try to load a.html because the URL the visitor is using hasn't changed. It would have no way of knowing about the aliases of c.html without actually loading that file.

I've been brainstorming ways we could make this work and I think it would be doable if we store redirect data as a separate map of URL patterns 👉 content ID rather than as actual metadata attached to each envelope. Then we can change the way requests are served to something like:

Presenter queries its content map to discover the content ID mapped to the presented URL. If there is a mapped content ID, hit the content service to retrieve its content, or find out that it isn't there. (Say /a.html is mapped to some-content-id/a, but there's no content there.)
Presenter queries ${REDIRECT_SOURCE_OF_TRUTH} with the presented URL to discover if this URL matches any redirect patterns. If so: generate the target content ID and do a reverse lookup on the content map to find its URL. (Say the latest redirect for /a.html is some-content-id/c, which is mapped as /c.html.)
If there is a redirect but no mapping or no content, issue the resolved redirect. (Here we'd issue a 302 for /c.html.)
If there is a mapping and content, perform the render as we do anyway.

Rewrites are slightly more complicated but similar: I think we can just use the transformed presented URL, redo the content ID derivation from the content map, do both content lookups in parallel, and use the envelope from the transformed content ID if the original presented URL is unmapped or has no content.

It is more traffic per request but I'm willing to bet we'd still be bound by Cloud Files throughput anyway.

I think this points towards the need for a single source of truth regarding redirects. While we have this in our current static files, maybe there's a need for a "Redirect service" of some sort that is pinged by the content service when it receives content containing redirects, and is consulted by the presenter when routing requests.

Hah. @steveortiz actually originally suggested a dedicated service for redirects and I resisted the idea. Collapsing the mapping and template services into the presenter simplified everything so much, I've been avoiding introducing more services since then as a result. We'd also have to duplicate the API key validation logic, etc.

Honestly I was just thinking of adding a new set of /redirect API endpoints to the content service. I don't think it's too much of an abuse of the content service domain of responsibility because it's kind of "content-adjacent" information, much like the full-text search indexing we already manage there.

chesshacker · 2016-01-26T16:26:24Z

@ktbartholomew -- I didn't explain my idea well. What I am suggesting is that we create entries in cloud files for each piece of content AND each alias/redirect. For your example, where it was called a.html and then was changed to b.html with an alias of a.html. The only time it would get a body from the content service is when the presenter requests b.html. For requests to a.html... it would find a very short response saying redirect to b.html in the envelope.
To do this, there would have to be something in the content service at b.html. Either the preparer or the content service could manage this... when someone creates or updates content with aliases in the envelope data... it actually creates the content in one place and a redirect for each alias.
To answer the question of what to do if there is content and a redirect, I suggested previously we modify the name of redirects slightly... say, for this example, the entry is actually b.html_redirect, so the presenter would try to get b.html and if that failed, it would try for b.html__redirect... I'm not sure this is the right plan... I think it might be better to hit the content service once, and either get back a piece of content or a redirect. But, it's one possible solution for what to do when there's a conflict with content and aliases.
The content repositories are the ultimate source of truth for the content... I'd like if they were also the source of truth for redirects, and I think the closer we can keep the redirects to the content, the better.
That said, if it's impractical to do this... I'm always open to other suggestions.

chesshacker · 2016-01-26T16:30:47Z

@smashwilson -- I like your ideas on how to store redirects... either in a separate service in the content store with a different endpoint (a much better idea than my hypothetical of tacking on __redirect). What do you think about which service should create them?

smashwilson · 2016-01-26T16:40:56Z

Yeah, I try to avoid using magic names, because you know it's only a matter of time until someone tries to make an envelope that actually has a content ID ending in _redirect 😉 The _faq thing is only there until I can work on deconst/deconst-docs#133 and overhaul the way content gets submitted. Fun fact: there's also no way to delete content right now without manually deleting things from Cloud Files.

What do you think about which service should create them?

The nice part about making it a first-class concept in the API is that you can manage them with any client that has a valid API key. In practice, I'd expect redirects created from page metadata (Jekyll frontmatter or Sphinx per-page metadata) to be submitted by the content preparer, and redirects created from nexus-control files to be submitted by the increasingly-inaccurately-named asset preparer.

chesshacker · 2016-01-26T17:02:40Z

and redirects created from nexus-control files to be submitted by the increasingly-inaccurately-named asset preparer.

LOL! Sounds like a good plan to me.

ktbartholomew mentioned this issue Jan 22, 2016

modify JSON parsing function to read, parse, and merge many files #90

Merged

smashwilson changed the title ~~Improve redirect configuration syntax~~ Redirect management overhaul Jan 26, 2016

smashwilson added the epic label May 6, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redirect management overhaul #89

Redirect management overhaul #89

smashwilson commented Jan 21, 2016

chesshacker commented Jan 21, 2016

smashwilson commented Jan 21, 2016

smashwilson commented Jan 21, 2016

smashwilson commented Jan 21, 2016

ktbartholomew commented Jan 21, 2016

kenperkins commented Jan 21, 2016

smashwilson commented Jan 22, 2016

ktbartholomew commented Jan 22, 2016

smashwilson commented Jan 22, 2016

chesshacker commented Jan 26, 2016

ktbartholomew commented Jan 26, 2016

smashwilson commented Jan 26, 2016

chesshacker commented Jan 26, 2016

chesshacker commented Jan 26, 2016

smashwilson commented Jan 26, 2016

chesshacker commented Jan 26, 2016

Redirect management overhaul #89

Redirect management overhaul #89

Comments

smashwilson commented Jan 21, 2016

chesshacker commented Jan 21, 2016

smashwilson commented Jan 21, 2016

smashwilson commented Jan 21, 2016

smashwilson commented Jan 21, 2016

ktbartholomew commented Jan 21, 2016

kenperkins commented Jan 21, 2016

smashwilson commented Jan 22, 2016

ktbartholomew commented Jan 22, 2016

smashwilson commented Jan 22, 2016

chesshacker commented Jan 26, 2016

ktbartholomew commented Jan 26, 2016

smashwilson commented Jan 26, 2016

chesshacker commented Jan 26, 2016

chesshacker commented Jan 26, 2016

smashwilson commented Jan 26, 2016

chesshacker commented Jan 26, 2016