Commit 66599f8

feat: add utility for load and parse Sitemap and SitemapRequestLoader (#1169)
### Description

- Add `SitemapRequestLoader` for convenient work with sitemaps and easy integration into the framework.
- Add a utility for working with `Sitemap`: loading and stream parsing.

### Issues

- Closes: #1161

### Testing

- Add tests for `SitemapRequestLoader`.
- Add new endpoints to the unicorn server for sitemap tests.
1 parent cf604c2 commit 66599f8

File tree: 10 files changed, +1139 −69 lines
Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@
+import asyncio
+import re
+
+from crawlee.http_clients import HttpxHttpClient
+from crawlee.request_loaders import SitemapRequestLoader
+
+
+async def main() -> None:
+    # Create an HTTP client for fetching sitemaps
+    async with HttpxHttpClient() as http_client:
+        # Create a sitemap request loader with URL filtering
+        sitemap_loader = SitemapRequestLoader(
+            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
+            http_client=http_client,
+            # Exclude all URLs that do not contain 'blog'
+            exclude=[re.compile(r'^((?!blog).)*$')],
+            max_buffer_size=500,  # Buffer up to 500 URLs in memory
+        )
+
+        while request := await sitemap_loader.fetch_next_request():
+            # Do something with it...
+
+            # And mark it as handled.
+            await sitemap_loader.mark_request_as_handled(request)
+
+
+if __name__ == '__main__':
+    asyncio.run(main())
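The `exclude` pattern in the example above, `r'^((?!blog).)*$'`, relies on a negative lookahead: it matches a URL only when the substring 'blog' appears nowhere in it, and every matching URL is excluded, so only blog URLs remain. Below is a minimal sketch of the pattern in isolation; the sample URLs are made up for illustration, and how `SitemapRequestLoader` applies the pattern internally is not shown here.

```python
import re

# Same pattern as in the example above: matches only strings that
# contain no occurrence of 'blog'.
exclude_non_blog = re.compile(r'^((?!blog).)*$')

sample_urls = [
    'https://crawlee.dev/blog/some-post',    # no match -> would be kept
    'https://crawlee.dev/docs/quick-start',  # match -> would be excluded
]

for url in sample_urls:
    matched = exclude_non_blog.match(url) is not None
    print(url, '-> excluded' if matched else '-> kept')
```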

docs/guides/request_loaders.mdx

Lines changed: 18 additions & 2 deletions

@@ -10,6 +10,7 @@ import TabItem from '@theme/TabItem';
 import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
 
 import RlBasicExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/rl_basic_example.py';
+import SitemapExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/sitemap_example.py';
 import TandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/tandem_example.py';
 import ExplicitTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/tandem_example_explicit.py';
 
@@ -23,9 +24,10 @@ The [`request_loaders`](https://github.com/apify/crawlee-python/tree/master/src/
 - <ApiLink to="class/RequestManager">`RequestManager`</ApiLink>: Extends `RequestLoader` with write capabilities.
 - <ApiLink to="class/RequestManagerTandem">`RequestManagerTandem`</ApiLink>: Combines a read-only `RequestLoader` with a writable `RequestManager`.
 
-And one specific request loader:
+And specific request loaders:
 
 - <ApiLink to="class/RequestList">`RequestList`</ApiLink>: A lightweight implementation of request loader for managing a static list of URLs.
+- <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink>: A request loader that reads URLs from XML sitemaps with filtering capabilities.
 
 Below is a class diagram that illustrates the relationships between these components and the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:
 
@@ -83,6 +85,11 @@ class RequestList {
     _methods_()
 }
 
+class SitemapRequestLoader {
+    _attributes_
+    _methods_()
+}
+
 class RequestManagerTandem {
     _attributes_
     _methods_()
@@ -97,6 +104,7 @@ RequestManager <|-- RequestQueue
 
 RequestLoader <|-- RequestManager
 RequestLoader <|-- RequestList
+RequestLoader <|-- SitemapRequestLoader
 RequestManager <|-- RequestManagerTandem
 ```
 
@@ -112,6 +120,14 @@ Here is a basic example of working with the <ApiLink to="class/RequestList">`Req
 {RlBasicExample}
 </RunnableCodeBlock>
 
+## Sitemap request loader
+
+The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> is a specialized request loader that reads URLs from XML sitemaps. It is particularly useful when you want to crawl a website systematically by following its sitemap structure. The loader supports filtering URLs with glob patterns and regular expressions, allowing you to include or exclude specific types of URLs. The <ApiLink to="class/SitemapRequestLoader">`SitemapRequestLoader`</ApiLink> processes sitemaps as a stream, which keeps memory usage low because the entire sitemap never has to be held in memory at once.
+
+<RunnableCodeBlock className="language-python" language="python">
+{SitemapExample}
+</RunnableCodeBlock>
+
 ## Request manager
 
 The <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> extends `RequestLoader` with write capabilities. In addition to reading requests, a request manager can add or reclaim them. This is important for dynamic crawling projects, where new URLs may emerge during the crawl process, or when certain requests fail and need to be retried. For more details, refer to the <ApiLink to="class/RequestManager">`RequestManager`</ApiLink> API reference.
@@ -139,4 +155,4 @@ This section describes the combination of the <ApiLink to="class/RequestList">`
 
 ## Conclusion
 
-This guide explained the `request_loaders` sub-package, which extends the functionality of the `RequestQueue` with additional tools for managing URLs. You learned about the `RequestLoader`, `RequestManager`, and `RequestManagerTandem` classes, as well as the `RequestList` class. You also saw examples of how to work with these classes in practice. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+This guide explained the `request_loaders` sub-package, which extends the functionality of the `RequestQueue` with additional tools for managing URLs. You learned about the `RequestLoader`, `RequestManager`, and `RequestManagerTandem` classes, as well as the `RequestList` and `SitemapRequestLoader` classes. You also saw examples of how to work with these classes in practice. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
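Because `SitemapRequestLoader` is a read-only `RequestLoader`, it can also feed a crawler through the tandem mechanism the guide describes. The following is a minimal sketch only: it assumes the `to_tandem()` helper on request loaders and a `ParselCrawler` as used in the guide's tandem examples, which are not part of this commit, so treat the exact wiring as illustrative rather than canonical.

```python
import asyncio
import re

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    async with HttpxHttpClient() as http_client:
        sitemap_loader = SitemapRequestLoader(
            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
            http_client=http_client,
            exclude=[re.compile(r'^((?!blog).)*$')],
        )

        # Wrap the read-only loader in a tandem with a writable request manager,
        # so the crawler can also enqueue new requests found during the crawl.
        request_manager = await sitemap_loader.to_tandem()

        crawler = ParselCrawler(request_manager=request_manager)

        @crawler.router.default_handler
        async def handler(context: ParselCrawlingContext) -> None:
            context.log.info(f'Processing {context.request.url}')

        await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
```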

src/crawlee/_utils/robots.py

Lines changed: 20 additions & 2 deletions

@@ -5,6 +5,7 @@
 from protego import Protego
 from yarl import URL
 
+from crawlee._utils.sitemap import Sitemap
 from crawlee._utils.web import is_status_code_client_error
 
 if TYPE_CHECKING:
@@ -15,9 +16,13 @@
 
 
 class RobotsTxtFile:
-    def __init__(self, url: str, robots: Protego) -> None:
+    def __init__(
+        self, url: str, robots: Protego, http_client: HttpClient | None = None, proxy_info: ProxyInfo | None = None
+    ) -> None:
         self._robots = robots
         self._original_url = URL(url).origin()
+        self._http_client = http_client
+        self._proxy_info = proxy_info
 
     @classmethod
     async def from_content(cls, url: str, content: str) -> Self:
@@ -56,7 +61,7 @@ async def load(cls, url: str, http_client: HttpClient, proxy_info: ProxyInfo | N
 
         robots = Protego.parse(body.decode('utf-8'))
 
-        return cls(url, robots)
+        return cls(url, robots, http_client=http_client, proxy_info=proxy_info)
 
     def is_allowed(self, url: str, user_agent: str = '*') -> bool:
         """Check if the given URL is allowed for the given user agent.
@@ -83,3 +88,16 @@ def get_crawl_delay(self, user_agent: str = '*') -> int | None:
         """
         crawl_delay = self._robots.crawl_delay(user_agent)
         return int(crawl_delay) if crawl_delay is not None else None
+
+    async def parse_sitemaps(self) -> Sitemap:
+        """Parse the sitemaps from the robots.txt file and return a `Sitemap` instance."""
+        sitemaps = self.get_sitemaps()
+        if not self._http_client:
+            raise ValueError('HTTP client is required to parse sitemaps.')
+
+        return await Sitemap.load(sitemaps, self._http_client, self._proxy_info)
+
+    async def parse_urls_from_sitemaps(self) -> list[str]:
+        """Parse the sitemaps in the robots.txt file and return a list of URLs."""
+        sitemap = await self.parse_sitemaps()
+        return sitemap.urls
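The new methods make it easy to go from a site's robots.txt to the URLs advertised in its sitemaps. A minimal sketch of how this might be used follows; note that `RobotsTxtFile` lives in the private `crawlee._utils.robots` module, so the import path and this usage are illustrative rather than part of the public API.

```python
import asyncio

from crawlee._utils.robots import RobotsTxtFile
from crawlee.http_clients import HttpxHttpClient


async def main() -> None:
    async with HttpxHttpClient() as http_client:
        # Load and parse robots.txt; the HTTP client is stored so that
        # sitemaps referenced by the file can be fetched later.
        robots = await RobotsTxtFile.load('https://crawlee.dev/robots.txt', http_client)

        # Sitemap URLs declared in robots.txt.
        print(robots.get_sitemaps())

        # Fetch and stream-parse those sitemaps, collecting all URLs they list.
        urls = await robots.parse_urls_from_sitemaps()
        print(f'Found {len(urls)} URLs')


if __name__ == '__main__':
    asyncio.run(main())
```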
