docs: Add guide about creating and using WARC files with Crawlee
#1273
Conversation
Great research and guide!
Looks great, thank you!
I'd advise running the guide through Grammarly before merging (all my suggestions are from there too, don't worry 😅. I did not run it on every paragraph, though, so only merging my suggestions doesn't cover it all).
docs/guides/code_examples/creating_web_archive/manual_archiving_playwright_crawler.py
Outdated
Co-authored-by: Jindřich Bär <jindrichbar@gmail.com>
docs/guides/code_examples/creating_web_archive/manual_archiving_playwright_crawler.py
Looks great, thank you! 🚀
Nice!
docs/guides/creating_web_archive.mdx
Outdated
- Make the crawler use the proxy server.
- Deal with the [pywb Certificate Authority](https://pywb.readthedocs.io/en/latest/manual/configuring.html#https-proxy-and-pywb-certificate-authority).

For example, in `PlaywrightCrawler`, this is the simplest setup, which takes a shortcut and ignores the CA-related errors:
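As a library-independent sketch of what that setup amounts to (routing traffic through a proxy and skipping certificate verification so the pywb CA is not needed), here is a standard-library version; the `localhost:8080` address is an assumption matching pywb's default, not part of the guide under review:

```python
import ssl
import urllib.request

# Assumed pywb proxy address; adjust to your local setup.
PROXY_URL = 'http://localhost:8080'

# Shortcut: disable certificate verification so the pywb CA does not have
# to be installed. Acceptable for local archiving experiments only.
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

# An opener that sends both HTTP and HTTPS requests through the proxy.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({'http': PROXY_URL, 'https': PROXY_URL}),
    urllib.request.HTTPSHandler(context=ssl_context),
)
# opener.open('https://crawlee.dev/') would now be routed through the proxy.
```

The equivalent crawler setup passes the proxy and the ignore-certificate-errors flag through the crawler's own configuration options, as shown in the guide's code examples.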
Maybe `PlaywrightCrawler` could be an API link here.
Added
# It is needed to scroll the page to load all content.
# It slows down the crawling, but ensures that all content is loaded.
await context.infinite_scroll()
await context.enqueue_links()
Maybe specify explicitly that we want to enqueue links only from the same domain?
Added
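For illustration, the same-domain restriction suggested in this thread can be sketched with a stdlib-only filter. The two-label domain comparison below is a simplification (it mishandles suffixes like `.co.uk`); a real crawler should rely on the framework's built-in link strategy or a public-suffix-aware library:

```python
from urllib.parse import urlparse

def same_domain(link: str, base: str) -> bool:
    """Return True if `link` shares the (naive) registrable domain of `base`.

    Simplified: compares the last two host labels, so `docs.crawlee.dev`
    matches `crawlee.dev`, but `example.com` does not.
    """
    def domain(url: str) -> str:
        host = urlparse(url).hostname or ''
        return '.'.join(host.split('.')[-2:])

    return domain(link) == domain(base)

links = [
    'https://crawlee.dev/python/docs',
    'https://docs.crawlee.dev/guide',
    'https://example.com/page',
]
# Keep only links on the same domain as the start URL.
internal = [link for link in links if same_domain(link, 'https://crawlee.dev/')]
```

In the guide itself, the restriction is expressed directly through the `enqueue_links` call rather than manual filtering.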
async def request_handler(context: ParselCrawlingContext) -> None:
    context.log.info(f'Archiving {context.request.url} ...')
    archive_response(context=context, writer=writer)
    await context.enqueue_links()
Maybe specify explicitly that we want to enqueue links only from the same domain?
Added
# It is needed to scroll the page to load all content.
# It slows down the crawling, but ensures that all content is loaded.
await context.infinite_scroll()
await context.enqueue_links()
Maybe specify explicitly that we want to enqueue links only from the same domain?
Added
The previous commands start the wayback server, which allows crawler requests to be served from the archived pages in the `example-collection` instead of being sent to the real website. This is again the [proxy mode of the wayback server](https://pywb.readthedocs.io/en/latest/manual/usage.html#http-s-proxy-mode-access), but without the recording capability. Now you need to [configure your crawler](#configure-the-crawler) to use this proxy server, as described above. Once everything is set up, you can simply run your crawler, and it will crawl the offline, archived version of the website from your WARC file.
You can also browse the archived pages manually in the wayback server by opening the locally hosted server and entering the collection and URL of the archived page, for example: `http://localhost:8080/example-collection/https:/crawlee.dev/`. The wayback server will serve the page from the WARC file if it exists, or return a 404 error if it does not. For more details about the server, please refer to the [pywb documentation](https://pywb.readthedocs.io/en/latest/manual/usage.html#getting-started).
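As a small sketch, the replay URL described above follows a `server/collection/page-url` pattern; the `localhost:8080` host and port are pywb's defaults and an assumption here, and the helper name is hypothetical:

```python
def replay_url(collection: str, page_url: str,
               server: str = 'http://localhost:8080') -> str:
    """Build a wayback replay URL of the form <server>/<collection>/<page-url>."""
    return f'{server}/{collection}/{page_url}'

url = replay_url('example-collection', 'https://crawlee.dev/')
```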
Please add a conclusion at the end with a summary and a call to action to join Discord and check out our GitHub. See https://crawlee.dev/python/docs/guides/storages#conclusion for an example.
The final sentence could be the same:
If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!
Added
A few comments...
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
LGTM
Description
Add guide about creating and using WARC files with Crawlee.
Issues