feat: add utility to load and parse Sitemap, and SitemapRequestLoader (#1169)
### Description
- Add `SitemapRequestLoader` for convenient work with `Sitemap` and easy integration into the framework
- Add a utility for working with `Sitemap`: loading and streaming parsing
### Issues
- Closes: #1161
### Testing
- Add tests for `SitemapRequestLoader`
- Add new endpoints to the uvicorn test server for sitemap tests
@@ -23,9 +24,10 @@ The [`request_loaders`](https://github.com/apify/crawlee-python/tree/master/src/
 - <ApiLink to="class/RequestManager">`RequestManager`</ApiLink>: Extends `RequestLoader` with write capabilities.
 - <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink>: Combines a read-only `RequestLoader` with a writable `RequestManager`.

-And one specific request loader:
+And specific request loaders:

 - <ApiLink to="class/RequestList">`RequestList`</ApiLink>: A lightweight implementation of request loader for managing a static list of URLs.
+- <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink>: A request loader that reads URLs from XML sitemaps with filtering capabilities.

 Below is a class diagram that illustrates the relationships between these components and the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:
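To make the loaders listed in the hunk above concrete, here is a minimal sketch of draining a `RequestList`, the simplest of the specific loaders. It assumes the constructor accepts a plain list of URL strings and that the `RequestLoader` interface exposes `fetch_next_request()` and `mark_request_as_handled()`; check the API reference for the exact signatures.

```python
import asyncio

from crawlee.request_loaders import RequestList


async def main() -> None:
    # A static, read-only request loader backed by an in-memory list of URLs.
    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])

    # Drain the loader through the generic RequestLoader interface.
    while request := await request_list.fetch_next_request():
        print(f'Processing {request.url} ...')
        await request_list.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
```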
@@ -112,6 +120,14 @@ Here is a basic example of working with the <ApiLink to="class/RequestList">`Req
   {RlBasicExample}
 </RunnableCodeBlock>

+## Sitemap request loader
+
+The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> is a specialized request loader that reads URLs from XML sitemaps. It is particularly useful when you want to crawl a website systematically by following its sitemap structure. The loader supports filtering URLs with glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> processes sitemaps as a stream, which keeps memory usage low because the entire sitemap is never loaded into memory at once.
+
 The <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> extends `RequestLoader` with write capabilities. In addition to reading requests, a request manager can add or reclaim them. This is important for dynamic crawling projects, where new URLs may emerge during the crawl process, or when certain requests fail and need to be retried. For more details, refer to the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> API reference.
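For reviewers, a rough sketch of how the new loader could be driven, based on the paragraph added above. The constructor parameters shown here (`sitemap_urls`, `http_client`, `include`) and the example sitemap URL are assumptions about the final API; the generated API reference is authoritative.

```python
import asyncio

from crawlee import Glob
from crawlee.http_clients import HttpxHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # The sitemap is fetched and parsed as a stream, so URLs become available
    # without the whole XML document being held in memory.
    sitemap_loader = SitemapRequestLoader(
        sitemap_urls=['https://crawlee.dev/sitemap.xml'],  # illustrative sitemap URL
        http_client=HttpxHttpClient(),
        # Keep only documentation pages; exclude/regexp filters work the same way.
        include=[Glob('https://crawlee.dev/docs/**')],
    )

    while request := await sitemap_loader.fetch_next_request():
        print(f'Found {request.url}')
        await sitemap_loader.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
```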
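And a small sketch of the write side described in the previous paragraph, using `RequestQueue`, the framework's standard `RequestManager` implementation. Method names follow the existing storage API, but treat this as an illustration rather than the documented example.

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # RequestQueue implements RequestManager, so it can be both read and written.
    rq = await RequestQueue.open()

    # Write capability: new URLs can be added while a crawl is already running.
    await rq.add_request('https://crawlee.dev')

    request = await rq.fetch_next_request()
    if request is not None:
        # A failed request could instead be reclaimed with `reclaim_request()`
        # so that it gets retried later.
        await rq.mark_request_as_handled(request)

    # Clean up the queue when it is no longer needed.
    await rq.drop()


if __name__ == '__main__':
    asyncio.run(main())
```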
@@ -139,4 +155,4 @@ This section describes the combination of the <ApiLink to="class/RequestList">`
 ## Conclusion

-This guide explained the `request_loaders` sub-package, which extends the functionality of the `RequestQueue` with additional tools for managing URLs. You learned about the `RequestLoader`, `RequestManager`, and `RequestManagerTandem` classes, as well as the `RequestList` class. You also saw examples of how to work with these classes in practice. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+This guide explained the `request_loaders` sub-package, which extends the functionality of the `RequestQueue` with additional tools for managing URLs. You learned about the `RequestLoader`, `RequestManager`, and `RequestManagerTandem` classes, as well as the `RequestList` and `SitemapRequestLoader` classes. You also saw examples of how to work with these classes in practice. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!