
Conversation

Pijukatel (Collaborator)

Description

Add guide about creating and using WARC files with Crawlee


@Pijukatel added the documentation and t-tooling labels on Jun 26, 2025
@github-actions bot added this to the 117th sprint - Tooling team milestone on Jun 26, 2025
@Pijukatel requested review from barjin, vdusek and Mantisus on June 26, 2025 at 14:03
@Pijukatel marked this pull request as ready for review on June 26, 2025 at 14:04
@Mantisus (Collaborator) left a comment

Great research and guide!

@barjin (Contributor) left a comment

Looks great, thank you!

I'd advise running the guide through Grammarly before merging (all my suggestions are from there too, don't worry 😅). I did not run it on every paragraph, though, so merging only my suggestions doesn't cover it all.

Pijukatel and others added 2 commits June 30, 2025 09:05
Co-authored-by: Jindřich Bär <jindrichbar@gmail.com>
@Pijukatel requested a review from barjin on July 1, 2025 at 12:26
@barjin (Contributor) left a comment

Looks great, thank you! 🚀

@vdusek (Collaborator) left a comment

Nice!

- Make the crawler use the proxy server.
- Deal with the [pywb Certificate Authority](https://pywb.readthedocs.io/en/latest/manual/configuring.html#https-proxy-and-pywb-certificate-authority).

For example, in `PlaywrightCrawler`, this is the simplest setup, which takes the shortcut and ignores the CA-related errors:
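
For illustration, here is a minimal sketch of what such a setup could look like in Crawlee for Python, assuming the pywb proxy is running on `localhost:8080`; the `browser_new_context_options` parameter and the `ignore_https_errors` option are assumptions here, not necessarily the guide's exact code.

```python
from crawlee.crawlers import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

crawler = PlaywrightCrawler(
    # Route all traffic through the locally running pywb server in proxy mode.
    proxy_configuration=ProxyConfiguration(proxy_urls=['http://localhost:8080/']),
    # Shortcut: ignore certificate errors instead of installing the pywb CA.
    browser_new_context_options={'ignore_https_errors': True},
)
```
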
Collaborator: Maybe `PlaywrightCrawler` could be an API link here.

Collaborator (Author): Added

```python
# it is needed to scroll the page to load all content.
# It slows down the crawling, but ensures that all content is loaded.
await context.infinite_scroll()
await context.enqueue_links()
```

Collaborator: Maybe specify explicitly that we want to enqueue links only from the same domain?

Collaborator (Author): Added
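
As a hedged sketch of what that addition might look like, the call below restricts `enqueue_links` to the current domain; the string value of the `strategy` argument is an assumption based on Crawlee's enqueue strategies, not the guide's exact code.

```python
# Follow only links that stay on the same domain as the archived page
# (illustrative; the guide's actual call may differ).
await context.enqueue_links(strategy='same-domain')
```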

```python
async def request_handler(context: ParselCrawlingContext) -> None:
    context.log.info(f'Archiving {context.request.url} ...')
    archive_response(context=context, writer=writer)
    await context.enqueue_links()
```

Collaborator: Maybe specify explicitly that we want to enqueue links only from the same domain?

Collaborator (Author): Added
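
To give an idea of what the `archive_response` helper quoted above might do, here is a minimal, hypothetical sketch based on the `warcio` library; the function body, the header handling, and the `writer` type are assumptions, not the guide's actual implementation.

```python
from io import BytesIO

from crawlee.crawlers import ParselCrawlingContext
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter


def archive_response(context: ParselCrawlingContext, writer: WARCWriter) -> None:
    """Write the crawled HTTP response as a WARC 'response' record (illustrative sketch)."""
    # Rebuild the HTTP headers in the format warcio expects.
    http_headers = StatusAndHeaders(
        statusline=str(context.http_response.status_code),
        headers=list(context.http_response.headers.items()),
        protocol='HTTP/1.1',
    )
    # Create and write a 'response' record for the current URL.
    # (Depending on the Crawlee version, read() may need to be awaited.)
    record = writer.create_warc_record(
        uri=context.request.url,
        record_type='response',
        payload=BytesIO(context.http_response.read()),
        http_headers=http_headers,
    )
    writer.write_record(record)
```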

```python
# it is needed to scroll the page to load all content.
# It slows down the crawling, but ensures that all content is loaded.
await context.infinite_scroll()
await context.enqueue_links()
```

Collaborator: Maybe specify explicitly that we want to enqueue links only from the same domain?

Collaborator (Author): Added


The previous commands start the wayback server, which allows crawler requests to be served from the archived pages in the `example-collection` instead of being sent to the real website. This is again the [proxy mode of the wayback server](https://pywb.readthedocs.io/en/latest/manual/usage.html#http-s-proxy-mode-access), but without the recording capability. Now you need to [configure your crawler](#configure-the-crawler) to use this proxy server, as described above. Once everything is set up, you can simply run your crawler, and it will crawl the offline archived version of the website from your WARC file.

You can also manually browse the archived pages by going to the locally hosted wayback server and entering the collection and the URL of the archived page, for example: `http://localhost:8080/example-collection/https://crawlee.dev/`. The wayback server will serve the page from the WARC file if it exists, or return a 404 error if it does not. For more details about the server, please refer to the [pywb documentation](https://pywb.readthedocs.io/en/latest/manual/usage.html#getting-started).
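
As a rough, hypothetical illustration of the replay setup described above, the sketch below reuses the same proxy configuration as during recording and points a `ParselCrawler` at the wayback server on port 8080; the parameter names and the handling of TLS are assumptions rather than the guide's exact code.

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # The wayback server in proxy mode listens on localhost:8080 and serves
    # responses from the WARC collection instead of the live website.
    # Note: dealing with the pywb Certificate Authority is omitted here.
    crawler = ParselCrawler(
        proxy_configuration=ProxyConfiguration(proxy_urls=['http://localhost:8080/']),
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Served from the archive: {context.request.url}')
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```
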
Collaborator: Please add a conclusion at the end with a summary and a call to action to join Discord and check out our GitHub. See https://crawlee.dev/python/docs/guides/storages#conclusion for an example.

The final sentence could be the same:

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!

Collaborator (Author): Added

@vdusek (Collaborator) left a comment

A few comments...

Pijukatel and others added 3 commits July 2, 2025 16:07
@Pijukatel requested a review from vdusek on July 2, 2025 at 14:29
@vdusek (Collaborator) left a comment

LGTM

@Pijukatel merged commit abe0f52 into master on Jul 3, 2025
19 checks passed
@Pijukatel deleted the warc-files branch on July 3, 2025 at 14:49
Development

Successfully merging this pull request may close these issues.

Add capability to Crawlers to archive pages to WARC files
4 participants