
Conversation

Pijukatel (Collaborator)

Description

Add guide about creating and using WARC files with Crawlee


@Pijukatel added the documentation and t-tooling labels on Jun 26, 2025
@github-actions bot added this to the 117th sprint - Tooling team milestone on Jun 26, 2025
@Pijukatel requested review from barjin, vdusek and Mantisus on June 26, 2025 at 14:03
@Pijukatel marked this pull request as ready for review on June 26, 2025 at 14:04
@Mantisus (Collaborator) left a comment

Great research and guide!

@barjin (Contributor) left a comment

Looks great, thank you!

I'd advise running the guide through Grammarly before merging (all my suggestions are from there too, don't worry 😅). I did not run it on every paragraph, though, so merging only my suggestions doesn't cover it all.

Pijukatel and others added 2 commits June 30, 2025 09:05
Co-authored-by: Jindřich Bär <jindrichbar@gmail.com>
@Pijukatel requested a review from barjin on July 1, 2025 at 12:26
@barjin (Contributor) left a comment

Looks great, thank you! 🚀

@vdusek (Collaborator) left a comment

Nice!

- Make the crawler use the proxy server.
- Deal with the [pywb Certificate Authority](https://pywb.readthedocs.io/en/latest/manual/configuring.html#https-proxy-and-pywb-certificate-authority).

For example, in `PlaywrightCrawler`, this is the simplest setup, which takes the shortcut and ignores the CA-related errors:
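
For illustration, here is a minimal sketch of what such a setup could look like in Crawlee for Python, assuming the pywb proxy is running on `localhost:8080`; the `browser_new_context_options` parameter and the `ignore_https_errors` option are assumptions here, not necessarily the guide's exact code.

```python
from crawlee.crawlers import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

crawler = PlaywrightCrawler(
    # Route all traffic through the locally running pywb server in proxy mode.
    proxy_configuration=ProxyConfiguration(proxy_urls=['http://localhost:8080/']),
    # Shortcut: ignore certificate errors instead of installing the pywb CA.
    browser_new_context_options={'ignore_https_errors': True},
)
```
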
Collaborator: Maybe `PlaywrightCrawler` could be an API link here.

Collaborator (Author): Added

```python
# it is needed to scroll the page to load all content.
# It slows down the crawling, but ensures that all content is loaded.
await context.infinite_scroll()
await context.enqueue_links()
```

Collaborator: Maybe specify explicitly that we want to enqueue links only from the same domain?

Collaborator (Author): Added
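
As a hedged sketch of what that addition might look like, the call below restricts `enqueue_links` to the current domain; the string value of the `strategy` argument is an assumption based on Crawlee's enqueue strategies, not the guide's exact code.

```python
# Follow only links that stay on the same domain as the archived page
# (illustrative; the guide's actual call may differ).
await context.enqueue_links(strategy='same-domain')
```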

```python
async def request_handler(context: ParselCrawlingContext) -> None:
    context.log.info(f'Archiving {context.request.url} ...')
    archive_response(context=context, writer=writer)
    await context.enqueue_links()
```

Collaborator: Maybe specify explicitly that we want to enqueue links only from the same domain?

Collaborator (Author): Added
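
To give an idea of what the `archive_response` helper quoted above might do, here is a minimal, hypothetical sketch based on the `warcio` library; the function body, the header handling, and the `writer` type are assumptions, not the guide's actual implementation.

```python
from io import BytesIO

from crawlee.crawlers import ParselCrawlingContext
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter


def archive_response(context: ParselCrawlingContext, writer: WARCWriter) -> None:
    """Write the crawled HTTP response as a WARC 'response' record (illustrative sketch)."""
    # Rebuild the HTTP headers in the format warcio expects.
    http_headers = StatusAndHeaders(
        statusline=str(context.http_response.status_code),
        headers=list(context.http_response.headers.items()),
        protocol='HTTP/1.1',
    )
    # Create and write a 'response' record for the current URL.
    # (Depending on the Crawlee version, read() may need to be awaited.)
    record = writer.create_warc_record(
        uri=context.request.url,
        record_type='response',
        payload=BytesIO(context.http_response.read()),
        http_headers=http_headers,
    )
    writer.write_record(record)
```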

```python
# it is needed to scroll the page to load all content.
# It slows down the crawling, but ensures that all content is loaded.
await context.infinite_scroll()
await context.enqueue_links()
```

Collaborator: Maybe specify explicitly that we want to enqueue links only from the same domain?

Collaborator (Author): Added


The previous commands start the wayback server, which allows crawler requests to be served from the archived pages in the `example-collection` instead of being sent to the real website. This is again the [proxy mode of the wayback server](https://pywb.readthedocs.io/en/latest/manual/usage.html#http-s-proxy-mode-access), but without the recording capability. Now you need to [configure your crawler](#configure-the-crawler) to use this proxy server, as described above. Once everything is set up, you can simply run your crawler, and it will crawl the offline archived version of the website from your WARC file.

You can also manually browse the archived pages by going to the locally hosted wayback server and entering the collection and the URL of the archived page, for example: `http://localhost:8080/example-collection/https://crawlee.dev/`. The wayback server will serve the page from the WARC file if it exists, or return a 404 error if it does not. For more details about the server, please refer to the [pywb documentation](https://pywb.readthedocs.io/en/latest/manual/usage.html#getting-started).
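
As a rough, hypothetical illustration of the replay setup described above, the sketch below reuses the same proxy configuration as during recording and points a `ParselCrawler` at the wayback server on port 8080; the parameter names and the handling of TLS are assumptions rather than the guide's exact code.

```python
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    # The wayback server in proxy mode listens on localhost:8080 and serves
    # responses from the WARC collection instead of the live website.
    # Note: dealing with the pywb Certificate Authority is omitted here.
    crawler = ParselCrawler(
        proxy_configuration=ProxyConfiguration(proxy_urls=['http://localhost:8080/']),
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Served from the archive: {context.request.url}')
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```
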
Collaborator: Please add a conclusion at the end with a summary and a call to action to join Discord and check out our GitHub. See https://crawlee.dev/python/docs/guides/storages#conclusion for an example.

The final sentence could be the same:

If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!

Collaborator (Author): Added

@vdusek (Collaborator) left a comment

A few comments...

Pijukatel and others added 3 commits July 2, 2025 16:07
@Pijukatel requested a review from vdusek on July 2, 2025 at 14:29
@vdusek (Collaborator) left a comment

LGTM

@Pijukatel merged commit abe0f52 into master on Jul 3, 2025
19 checks passed
@Pijukatel deleted the warc-files branch on July 3, 2025 at 14:49
Development

Successfully merging this pull request may close these issues.

Add capability to Crawlers to archive pages to WARC files
4 participants