_Url to inherit from str #187

@BurnzZ

This was discussed previously in one of the PRs.

I'm reopening this for tracking because this type check in w3lib.util.to_unicode raises a TypeError: https://github.com/scrapy/w3lib/blob/master/w3lib/util.py#L46-L49

In particular, the following fails:

from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor()
link_extractor.extract_links(response) 

where response is a web_poet.page_inputs.http.HttpResponse instance rather than a scrapy.http.Response.

The full stack trace is:

  File "/usr/local/lib/python3.10/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
    base_url = get_base_url(response)
  File "/usr/local/lib/python3.10/site-packages/scrapy/utils/response.py", line 27, in get_base_url
    _baseurl_cache[response] = html.get_base_url(
  File "/usr/local/lib/python3.10/site-packages/w3lib/html.py", line 323, in get_base_url
    return safe_url_string(baseurl)
  File "/usr/local/lib/python3.10/site-packages/w3lib/url.py", line 141, in safe_url_string
    decoded = to_unicode(url, encoding=encoding, errors="percentencode")
  File "/usr/local/lib/python3.10/site-packages/w3lib/util.py", line 47, in to_unicode
    raise TypeError(
TypeError: to_unicode must receive bytes or str, got ResponseUrl
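
The failure can be illustrated without Scrapy or web-poet installed. The sketch below is only an approximation: to_unicode_sketch is a simplified stand-in for w3lib.util.to_unicode (not the library's actual code), and WrappedUrl is a hypothetical class standing in for a URL type that holds a string but does not inherit from str.

```python
def to_unicode_sketch(text, encoding="utf-8", errors="strict"):
    """Simplified stand-in for w3lib.util.to_unicode (an assumption, not the real code)."""
    if isinstance(text, str):
        return text
    if not isinstance(text, bytes):
        # Anything that is neither str nor bytes is rejected outright.
        raise TypeError(
            f"to_unicode must receive bytes or str, got {type(text).__name__}"
        )
    return text.decode(encoding, errors)


class WrappedUrl:
    """Hypothetical URL wrapper that holds a str but does not subclass it."""

    def __init__(self, url):
        self._url = str(url)

    def __str__(self):
        return self._url


url = WrappedUrl("https://example.com/page")

try:
    to_unicode_sketch(url)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Under these assumptions the wrapper is rejected purely because of the isinstance check, even though it can be converted to a string with str().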

An alternative would be to adjust the Scrapy code to cast str(response.url) at every use.
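
The two options can be compared with a small sketch. The class names WrappedUrl and StrUrl are hypothetical and are not web-poet's actual classes; urljoin from the standard library stands in for any string-expecting consumer of a URL.

```python
from urllib.parse import urljoin


class WrappedUrl:
    """Hypothetical URL type that wraps a str without inheriting from it."""

    def __init__(self, url):
        self._url = str(url)

    def __str__(self):
        return self._url


class StrUrl(str):
    """Hypothetical URL type that inherits from str."""


wrapped = WrappedUrl("https://example.com/a/b")
inherited = StrUrl("https://example.com/a/b")

# Option 1: inherit from str. The object passes isinstance(..., str) checks,
# so string-expecting code accepts it without changes on the caller's side.
inherited_is_str = isinstance(inherited, str)
joined_inherited = urljoin(inherited, "c")

# Option 2: keep the wrapper and cast explicitly at every call site.
wrapped_is_str = isinstance(wrapped, str)
joined_cast = urljoin(str(wrapped), "c")

print(inherited_is_str, joined_inherited)
print(wrapped_is_str, joined_cast)
```

Both routes produce the same joined URL here. The difference is where the burden falls: inheriting from str changes the URL class once, while casting requires touching every call site that passes the URL to string-expecting code.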
