
feat: add ImpitHttpClient http-client client using the impit library #1151

Merged: Pijukatel merged 24 commits into apify:master on Jul 15, 2025

Mantisus
Collaborator

Description

  • Add ImpitHttpClient, an HTTP client built on the impit library.

Issues

Testing

Added tests for ImpitHttpClient. ImpitHttpClient is also enabled for all existing tests that use an HTTP client.

@Mantisus
Collaborator Author

For now, I suggest adding impit as an additional dependency, as it still needs some tweaking before it's ready to replace httpx.

Awaiting a decision - apify/impit#123

@Mantisus Mantisus requested a review from Copilot April 14, 2025 15:03
Contributor

@Copilot Copilot AI left a comment

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

pyproject.toml:63

  • The version requirement for impit in the main dependencies (>=0.1.0) differs from the one in the adaptive-crawler section (>=0.2.0), which may lead to dependency conflicts. Consider aligning them to a consistent version.
"impit>=0.1.0",

@Mantisus Mantisus requested a review from Copilot April 14, 2025 15:29
Contributor

@Copilot Copilot AI left a comment

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

@Mantisus Mantisus self-assigned this Apr 15, 2025
@Mantisus
Collaborator Author

Mantisus commented Jul 7, 2025

The Python binding for impit now has all the basic functionality needed for integration into Crawlee.

The _get_client method is implemented based on ImpitHttpClient. However, this looks inefficient, especially when working without a proxy but with a SessionPool larger than 1, because a new client is created for each request. I think we should improve this on the impit side. @barjin, maybe you have some ideas.

I propose replacing httpx with impit as the main client in a separate PR.
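
To make the per-request client creation concern above concrete, here is a minimal sketch of one possible mitigation on the Crawlee side: a small LRU cache that reuses a client per (proxy, session) pair. This is not the code from this PR and not the impit API. `FakeClient`, `ClientCache`, and the `session_id` key are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class FakeClient:
    """Hypothetical stand-in for an HTTP client; NOT the impit API."""

    proxy_url: str | None
    cookies: dict[str, str] = field(default_factory=dict)


class ClientCache:
    """Small LRU cache of clients keyed by (proxy_url, session_id)."""

    def __init__(self, max_size: int = 10) -> None:
        self._max_size = max_size
        self._clients: OrderedDict[tuple[str | None, str | None], FakeClient] = OrderedDict()
        self.created = 0  # number of clients constructed, kept for illustration

    def get(self, proxy_url: str | None = None, session_id: str | None = None) -> FakeClient:
        key = (proxy_url, session_id)
        client = self._clients.get(key)
        if client is not None:
            # Cache hit: mark the entry as most recently used and reuse it.
            self._clients.move_to_end(key)
            return client

        # Cache miss: build a new client and remember it.
        client = FakeClient(proxy_url=proxy_url)
        self.created += 1
        self._clients[key] = client

        # Evict the least recently used entry once the cache is over capacity.
        if len(self._clients) > self._max_size:
            self._clients.popitem(last=False)
        return client
```

With a cache like this, repeated requests from the same session reuse one client, and only sessions that have been idle the longest pay the construction cost again. Whether the reuse belongs in Crawlee or in impit itself is the open question raised in the comment.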

@Mantisus Mantisus marked this pull request as ready for review July 7, 2025 22:52
Collaborator

I see you are following the established pattern of adding a new test file for each new client, but maybe now is the time to refactor the tests into a single test file for all clients and parametrize every test by client.

For example:

@pytest.mark.parametrize("http_client", [
    CurlImpersonateHttpClient(http_version=CurlHttpVersion.V1_1),
    ImpitHttpClient(),
    HttpxHttpClient(http2=False)])
async def test_http_1(http_client: HttpClient, server_url: URL) -> None:
    response = await http_client.send_request(str(server_url))
    assert response.http_version == 'HTTP/1.1'

Maybe we would need three client factories instead and parametrize by those. Regardless, I think it would be great to reduce code duplication and ensure that we run exactly the same tests for all clients and that they all behave in exactly the same way.

But this is just a suggestion. It could also be done in a separate PR, so that the new implementation is not mixed with pure refactoring.
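
As a rough illustration of the factory variant mentioned above, the sketch below keeps one zero-argument factory per client, so every test builds a fresh instance instead of sharing one created at collection time. `StubClient`, `StubResponse`, `CLIENT_FACTORIES`, and `check_http_1` are hypothetical stand-ins, not Crawlee classes. In the real suite the factories would be passed to `pytest.mark.parametrize` (for example with `ids=list(CLIENT_FACTORIES)`); a plain loop stands in here so the sketch runs with the standard library alone.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Callable


@dataclass
class StubResponse:
    http_version: str


class StubClient:
    """Hypothetical stand-in for an HttpClient implementation."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def send_request(self, url: str) -> StubResponse:
        # A real client would perform network I/O here.
        return StubResponse(http_version="HTTP/1.1")


# One factory per client; the keys double as readable test ids.
CLIENT_FACTORIES: dict[str, Callable[[], StubClient]] = {
    "curl": lambda: StubClient("curl"),
    "impit": lambda: StubClient("impit"),
    "httpx": lambda: StubClient("httpx"),
}


async def check_http_1(make_client: Callable[[], StubClient], url: str) -> str:
    """Shared test body: identical assertions for every client."""
    client = make_client()  # fresh instance per test
    response = await client.send_request(url)
    assert response.http_version == "HTTP/1.1"
    return client.name


def run_all(url: str = "http://localhost/") -> list[str]:
    """Run the shared test body once per factory, in declaration order."""
    return [asyncio.run(check_http_1(factory, url)) for factory in CLIENT_FACTORIES.values()]
```

Parametrizing by factory rather than by instance avoids constructing real clients at import time and keeps state from leaking between tests.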

Collaborator Author

That's a great idea. But yes, I think it should be done in a separate PR.

We could also take the same approach for the beautifulsoup and parsel crawlers, since their tests are completely duplicated as well.

Collaborator

@Pijukatel Pijukatel Jul 11, 2025

I will wait for this PR to be merged first, and then we can refactor: #1299

Collaborator

@vdusek vdusek left a comment

LGTM

@Mantisus
Collaborator Author

Mantisus commented Jul 8, 2025

Not merging this PR until we resolve the test issue

Collaborator

@vdusek vdusek left a comment

Test issue

[gw1] node down: Not properly terminated
[gw1] [ 98%] FAILED tests/unit/test_service_locator.py::test_storage_client_conflict 
replacing crashed worker gw1
tests/unit/crawlers/_parsel/test_parsel_crawler.py::test_enqueue_links_selector[curl] 

@Mantisus Mantisus requested a review from vdusek July 15, 2025 08:41
Collaborator

@vdusek vdusek left a comment

LGTM

@Pijukatel Pijukatel merged commit 0d0d268 into apify:master Jul 15, 2025
19 checks passed