Skip to content

Distinguishing prefixes from objects in iterdir() results #193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
oir opened this issue Apr 19, 2025 · 8 comments
Open

Distinguishing prefixes from objects in iterdir() results #193

oir opened this issue Apr 19, 2025 · 8 comments

Comments

@oir
Copy link

oir commented Apr 19, 2025

Hi! Apologies if I missed this in the docs. In the readme, it looks like the surface form string should have a trailing slash for "folders" (i.e. common prefixes):

>>> bucket_path = S3Path('/pypi-proxy/')
>>> [path for path in bucket_path.iterdir() if path.is_dir()]
[S3Path('/pypi-proxy/requests/'),
 S3Path('/pypi-proxy/boto3/'),
 S3Path('/pypi-proxy/botocore/')]

However when I run something similar, I don't observe the trailing slashes, so it looks something like:

>>> bucket_path = S3Path('/pypi-proxy/')
>>> [path for path in bucket_path.iterdir() if path.is_dir()]
[S3Path('/pypi-proxy/requests'),
 S3Path('/pypi-proxy/boto3'),
 S3Path('/pypi-proxy/botocore')]

This means after any iterdir(), I have to make an extra call to is_file() or is_dir() to determine which ones among the returned are "files" and which ones are "folders", which become expensive. Is this expected? Or am I missing a configuration?

I am using s3path 0.6.1 with a custom s3-compatible object storage and custom configuration with register_configuration_parameter() if it makes a difference.

@liormizr
Copy link
Owner

Hi @oir
Great question!
We are sticking to pathlib API, so this behavior is coming from there

>>> from pathlib import Path
>>> Path('s3path/')
PosixPath('s3path')

It's a good point that we need to fix the doc's

About optimization,
I can think for a way to "cache" is it's a full key or just a prefix (folder)
I also have an open issue (#186) to implement the walk method, I think that with this method you will be able to get only the folders in a optimized way (need to think about it...)

@oir
Copy link
Author

oir commented Apr 27, 2025

Sounds great, I think I would benefit from either of the suggested solutions. Thanks for the heads up.

@liormizr
Copy link
Owner

liormizr commented May 7, 2025

Fixed and deployed in version 0.6.2

@oir
Copy link
Author

oir commented May 7, 2025

Will take a look asap! 🙏

@cooperlees
Copy link

I tried updating to s3path 0.6.2 and noticed my testing mkdir() now fails - Is this now due to this caching my first .is_dir()? Is there any way I can disable it for the test? For now I've just removed the first assert in is_dir() on line 30 and its passes again.

https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/tests/plugins/test_storage_plugin_s3.py#L28

Failure: https://github.com/pypa/bandersnatch/actions/runs/14916487772/job/41903180127?pr=1944

Summary:

    def test_path_mkdir(s3_mock: S3Path) -> None:
        new_folder = s3.S3Path(f"/{s3_mock.bucket}/test_folder")
        assert not new_folder.is_dir()
        new_folder.mkdir()
>       assert new_folder.is_dir()
E       AssertionError: assert False
E        +  where False = is_dir()
E        +    where is_dir = S3Path('/test-bucket/test_folder').is_dir

src/bandersnatch/tests/plugins/test_storage_plugin_s3.py:32: AssertionError

I've commented here cause I saw in the code you plan to make this more robust - Is this including maybe a "no_cahce=True" param to do the expensive operation?

@cooperlees
Copy link

Ahhh - Creating a new S3Path object does the trick ...

diff --git a/src/bandersnatch/tests/plugins/test_storage_plugin_s3.py b/src/bandersnatch/tests/plugins/test_storage_plugin_s3.py
index cfa7d69..ab3b006 100644
--- a/src/bandersnatch/tests/plugins/test_storage_plugin_s3.py
+++ b/src/bandersnatch/tests/plugins/test_storage_plugin_s3.py
@@ -27,8 +27,10 @@ def test_update_safe(s3_mock: S3Path) -> None:

 def test_path_mkdir(s3_mock: S3Path) -> None:
     new_folder = s3.S3Path(f"/{s3_mock.bucket}/test_folder")
+    assert not new_folder.is_dir()
     new_folder.mkdir()
-    assert new_folder.is_dir()
+    new_folder2 = s3.S3Path(f"/{s3_mock.bucket}/test_folder")
+    assert new_folder2.is_dir()

If this is how you envision how people should use this should maybe it should be documented.

@liormizr
Copy link
Owner

@cooperlees Tx for the debugging
This is a big issue.
I’ll create a new version that removes the cache or resets it whenever a local “modification” happens.
Although it’s risky—another process might make a change that we won’t detect because of the caching system.
Now I remember why I wasn’t a fan of adding cache to the library 😊

@cooperlees
Copy link

"There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton (Netscape)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants