
Latest allowed_domains behavior breaks middlewares that rewrite urls #6366

Open
rvandam opened this issue May 15, 2024 · 4 comments · May be fixed by #6151

Comments

rvandam commented May 15, 2024

Description

The change to the behavior of Spider.allowed_domains in 2.11.2 broke several of our crawls because it does not play well with downloader middlewares that replace the original request in order to modify its URL.

This is because modifying the URL (for example, for proxies that only expose an API and have no transparent proxy mode) requires request.replace(). That produces a "new" request, which Scrapy reschedules and runs through all middlewares again, including the new offsite downloader middleware, so the proxied request gets filtered out (unless the proxy domain is also added to allowed_domains).
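
For illustration, here is a minimal sketch of the kind of downloader middleware I mean (the class name and proxy endpoint are made-up placeholders, not the actual middleware linked below):

```python
# Sketch only: a URL-rewriting downloader middleware with a made-up proxy endpoint.
from urllib.parse import quote


class ApiProxyMiddleware:
    """Rewrite every outgoing request to go through a proxy API endpoint."""

    API_URL = "https://api.example-proxy.com/?url={target}"  # placeholder

    def process_request(self, request, spider):
        if request.url.startswith("https://api.example-proxy.com/"):
            return None  # already rewritten, let it continue through the chain
        proxied_url = self.API_URL.format(target=quote(request.url, safe=""))
        # replace() returns a *new* request pointing at the proxy domain.
        # Returning it from process_request makes Scrapy reschedule it, so it
        # passes through every middleware again -- including the offsite
        # downloader middleware added in 2.11.2, which drops it because
        # api.example-proxy.com is not in allowed_domains.
        return request.replace(url=proxied_url)
```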

The documented workaround of using dont_filter is a non-starter: it won't fix published middlewares that you don't own, and it has other side effects (e.g. disabling the duplicate filter) that may be undesirable.

This is a breaking change in a minor version, which really shouldn't happen.

Steps to Reproduce

  1. Set allowed_domains to the desired domain, e.g. books.toscrape.com
  2. Use any middleware that rewrites the url to a different domain, e.g. https://github.com/aleroot/scrapy-scraperapi-proxy/blob/master/scrapy_scraperapi/scraperapi.py
  3. Run your spider on any reasonable URL of your domain, e.g. https://books.toscrape.com/catalogue/category/books/mystery_3/index.html

Expected behavior:
The same behavior as prior to 2.11.2: the desired domain is still crawled.

Actual behavior:
Nothing gets crawled

Reproduces how often:
100% in the given scenario
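
A minimal spider matching the steps above might look like this (sketch only; the middleware path, priority, and selector are placeholders, using the ApiProxyMiddleware sketch from the description):

```python
# Sketch only: minimal spider for the reproduction steps above.
import scrapy


class MysterySpider(scrapy.Spider):
    name = "mystery"
    # Step 1: restrict the crawl to the target domain.
    allowed_domains = ["books.toscrape.com"]
    start_urls = [
        "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    ]
    # Step 2: enable a URL-rewriting proxy middleware (placeholder path/priority).
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.ApiProxyMiddleware": 350,
        },
    }

    def parse(self, response):
        # Step 3: with Scrapy 2.11.2, this callback is never reached because the
        # rewritten (proxied) requests are filtered as offsite.
        for href in response.css("article.product_pod h3 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```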

Additional context

I can think of the following workarounds but all of them have major drawbacks:

  • stop using allowed_domains (but then you lose its intended benefits)
  • disable the offsite middleware (ditto)
  • add every domain your middlewares might use to allowed_domains (this breaks encapsulation, makes it harder to switch proxies, and can become a very long list for large rotating proxies)
  • modify the middleware to always set dont_filter=True (but you can't do that with a third-party middleware like the one above, and you also lose duplicate filtering)
  • convince every proxy provider to offer a transparent proxy mode rather than an API (unlikely)
  • convince every proxy provider to write their own custom download handler, as Zyte API does, to avoid this problem (good luck)

Suggested fixes

Perhaps the new behavior could be off by default and toggled with a new setting, and the old behavior could then go through a deprecation cycle in future major versions. At least then the breaking change would be expected.

It might also help to separate out a dont_offsite_filter flag so that a well-designed middleware could intentionally avoid this dilemma without also triggering the rest of the dont_filter behavior.

wRAR (Member) commented May 16, 2024

As a quick workaround you can disable the new downloader middleware and enable the old spider middleware.
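
In settings.py that would look roughly like this (the middleware paths are the built-in ones; the spider middleware priority shown is assumed to be its usual default, adjust for your version):

```python
# settings.py -- rough sketch of the workaround
DOWNLOADER_MIDDLEWARES = {
    # Disable the new offsite downloader middleware introduced in 2.11.2.
    "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
}

SPIDER_MIDDLEWARES = {
    # Keep/re-enable the pre-2.11.2 offsite spider middleware.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
}
```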

Gallaecio (Member) commented:
> It might also help to separate out a dont_offsite_filter flag so that a well-designed middleware could intentionally avoid this dilemma without also triggering the rest of the dont_filter behavior.

This sounds good to me: allowing "allow_offsite": True to be set in Request.meta to let a specific request bypass the middleware.
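
With something like that in place, a URL-rewriting middleware could opt its replacement request out of offsite filtering without touching dont_filter, roughly like this (sketch of the proposal; build_proxy_url is a hypothetical helper):

```python
def process_request(self, request, spider):
    proxied_url = build_proxy_url(request.url)  # hypothetical helper
    new_request = request.replace(url=proxied_url)
    # Bypass only the offsite filtering for this request, without the
    # broader side effects of dont_filter=True (e.g. losing dupe filtering).
    new_request.meta["allow_offsite"] = True
    return new_request
```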

> This is a breaking change in a minor version, which really shouldn't happen.

For the record, this is a breaking change in a patch version, which is something we are willing to do for security issues like this one. We have done it in the past, and we will probably do it again in the future. Python does the same.

That said, had we realized this could be an issue, we would have provided a workaround like the one you suggest as part of the fix. We did not think of scenarios like yours, even though it sounds very fair. We’ll try to address this in 2.11.3.

wRAR (Member) commented May 16, 2024

Related: #6151

Gallaecio linked a pull request May 16, 2024 that will close this issue
wRAR (Member) commented May 16, 2024

This will still require updating middlewares that reschedule requests. I think we need to add a couple of suggestions about this problem to the docs, probably to the 2.11.2 release notes.
