
Latest allowed_domains behavior breaks middlewares that rewrite urls #6366

Open
rvandam opened this issue May 15, 2024 · 4 comments · May be fixed by #6151

Comments

rvandam commented May 15, 2024

Description

The change to the behavior of Spider.allowed_domains in 2.11.2 broke several of our crawls because it does not play well with downloader middlewares that replace the original request in order to modify its URL.

This is because modifying the URL (for example, for proxies that only expose an API and have no transparent proxy mode) requires request.replace(). That produces a "new" request, which Scrapy reschedules and runs through all middlewares again, including the new offsite downloader middleware, so the proxied request gets filtered out (unless the proxy domain is also added to allowed_domains).
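
For illustration, here is a minimal sketch of the kind of downloader middleware I mean (the class name and proxy endpoint are made-up placeholders, not the actual middleware linked below):

```python
# Sketch only: a URL-rewriting downloader middleware with a made-up proxy endpoint.
from urllib.parse import quote


class ApiProxyMiddleware:
    """Rewrite every outgoing request to go through a proxy API endpoint."""

    API_URL = "https://api.example-proxy.com/?url={target}"  # placeholder

    def process_request(self, request, spider):
        if request.url.startswith("https://api.example-proxy.com/"):
            return None  # already rewritten, let it continue through the chain
        proxied_url = self.API_URL.format(target=quote(request.url, safe=""))
        # replace() returns a *new* request pointing at the proxy domain.
        # Returning it from process_request makes Scrapy reschedule it, so it
        # passes through every middleware again -- including the offsite
        # downloader middleware added in 2.11.2, which drops it because
        # api.example-proxy.com is not in allowed_domains.
        return request.replace(url=proxied_url)
```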

The documented workaround of using dont_filter is a non-starter: it won't fix published middlewares that you don't own, and it has other side effects (e.g. disabling the duplicate filter) that may be undesirable.

This is a breaking change in a minor version, which really shouldn't happen.

Steps to Reproduce

  1. Set allowed_domains to the desired domain, e.g. books.toscrape.com
  2. Use any middleware that rewrites the url to a different domain, e.g. https://github.com/aleroot/scrapy-scraperapi-proxy/blob/master/scrapy_scraperapi/scraperapi.py
  3. Run your spider on any reasonable URL of your domain, e.g. https://books.toscrape.com/catalogue/category/books/mystery_3/index.html

Expected behavior:
The same behavior as prior to 2.11.2: the desired domain is still crawled.

Actual behavior:
Nothing gets crawled

Reproduces how often:
100% in the given scenario
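
A minimal spider matching the steps above might look like this (sketch only; the middleware path, priority, and selector are placeholders, using the ApiProxyMiddleware sketch from the description):

```python
# Sketch only: minimal spider for the reproduction steps above.
import scrapy


class MysterySpider(scrapy.Spider):
    name = "mystery"
    # Step 1: restrict the crawl to the target domain.
    allowed_domains = ["books.toscrape.com"]
    start_urls = [
        "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
    ]
    # Step 2: enable a URL-rewriting proxy middleware (placeholder path/priority).
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.ApiProxyMiddleware": 350,
        },
    }

    def parse(self, response):
        # Step 3: with Scrapy 2.11.2, this callback is never reached because the
        # rewritten (proxied) requests are filtered as offsite.
        for href in response.css("article.product_pod h3 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```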

Additional context

I can think of the following workarounds but all of them have major drawbacks:

  • stop using allowed_domains (but then you lose its intended benefits)
  • disable the offsite middleware (ditto)
  • add every domain your middlewares might use to allowed_domains (this breaks encapsulation, makes it harder to switch proxies, and can become a very long list for large rotating proxies)
  • modify the middleware to always set dont_filter=True (but you can't do that with a third-party middleware like the one above, and you also lose duplicate filtering)
  • convince every proxy provider to offer a transparent proxy mode rather than an API (unlikely)
  • convince every proxy provider to write their own custom download handler, as Zyte API does, to avoid this problem (good luck)

Suggested fixes

Perhaps the new behavior could be off by default and toggled with a new setting, and the old behavior could then go through a deprecation cycle in future major versions. At least then the breaking change would be expected.

It might also help to separate out a dont_offsite_filter flag so that a well-designed middleware could intentionally avoid this dilemma without also triggering the rest of the dont_filter behavior.

wRAR (Member) commented May 16, 2024

As a quick workaround you can disable the new downloader middleware and enable the old spider middleware.
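
In settings.py that would look roughly like this (the middleware paths are the built-in ones; the spider middleware priority shown is assumed to be its usual default, adjust for your version):

```python
# settings.py -- rough sketch of the workaround
DOWNLOADER_MIDDLEWARES = {
    # Disable the new offsite downloader middleware introduced in 2.11.2.
    "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
}

SPIDER_MIDDLEWARES = {
    # Keep/re-enable the pre-2.11.2 offsite spider middleware.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
}
```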

Gallaecio (Member) commented:
> It might also help to separate out a dont_offsite_filter flag so that a well-designed middleware could intentionally avoid this dilemma without also triggering the rest of the dont_filter behavior.

This sounds good to me: allowing "allow_offsite": True to be set in Request.meta to let a specific request bypass the middleware.
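
With something like that in place, a URL-rewriting middleware could opt its replacement request out of offsite filtering without touching dont_filter, roughly like this (sketch of the proposal; build_proxy_url is a hypothetical helper):

```python
def process_request(self, request, spider):
    proxied_url = build_proxy_url(request.url)  # hypothetical helper
    new_request = request.replace(url=proxied_url)
    # Bypass only the offsite filtering for this request, without the
    # broader side effects of dont_filter=True (e.g. losing dupe filtering).
    new_request.meta["allow_offsite"] = True
    return new_request
```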

> This is a breaking change in a minor version, which really shouldn't happen.

For the record, this is a breaking change in a patch version, which is something we are willing to do for security issues like this one. We have done it in the past, and we will probably do it again in the future. Python does the same.

That said, had we realized this could be an issue, we would have provided a workaround like the one you suggest as part of the fix. We did not think of scenarios like yours, even though it sounds very fair. We’ll try to address this in 2.11.3.

wRAR (Member) commented May 16, 2024

Related: #6151

Gallaecio linked a pull request May 16, 2024 that will close this issue
wRAR (Member) commented May 16, 2024

This will still require updating middlewares that reschedule requests. I think we need to add a couple of suggestions about this problem to the docs, probably to the 2.11.2 release notes.
