BUG: Fix Series.str.contains with compiled regex on Arrow string dtype (#61942) #61946
Hi @mroeschke, I'd appreciate it if you could take a look and share your feedback. Thanks!
This fix should go into _str_contains of ArrowExtensionArray.
Thank you for the feedback!
Additionally, if this is something that pyarrow does not implement, we should not raise a NotImplementedError but fall back on the Python object implementation (you can see a similar pattern in some other str methods).
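The fallback pattern described in this comment can be illustrated with a small standalone sketch. This is not the pandas implementation: fast_contains, object_contains and str_contains are hypothetical stand-ins for a fast backend path, the generic Python-object path, and the dispatching method. It assumes the fast path only accepts plain string patterns without flags.

```python
import re


def fast_contains(values, pat):
    """Hypothetical fast backend: accepts only plain string patterns."""
    if not isinstance(pat, str):
        raise TypeError("fast path needs a str pattern")
    return [None if v is None else re.search(pat, v) is not None for v in values]


def object_contains(values, pat, flags=0):
    """Hypothetical generic path: handles compiled patterns and flags."""
    # A compiled pattern already carries its own flags, so it is used as-is.
    compiled = pat if isinstance(pat, re.Pattern) else re.compile(pat, flags)
    return [None if v is None else compiled.search(v) is not None for v in values]


def str_contains(values, pat, flags=0):
    # One combined check covers every input the fast path cannot handle,
    # so the caller gets a result instead of a NotImplementedError.
    if flags or isinstance(pat, re.Pattern):
        return object_contains(values, pat, flags=flags)
    return fast_contains(values, pat)


data = ["apple", "banana", None]
print(str_contains(data, re.compile("an")))
print(str_contains(data, "AN", flags=re.IGNORECASE))
```

Routing the unsupported inputs through a single condition keeps the fast path untouched for the common case of a plain string pattern.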
@jorisvandenbossche Thank you for the feedback! I will update the PR accordingly. Would you mind letting me know the reason behind the one failing check (pre-commit.ci)?
ruff is failing, which is used for auto formatting. I would recommend installing pre-commit locally so that this does not fail on CI: https://pandas.pydata.org/docs/dev/development/contributing_codebase.html#pre-commit
Hi @jorisvandenbossche, thank you!
Could you add a whatsnew note in v2.3.2 and tests for this?
pandas/core/arrays/string_arrow.py
Outdated
@@ -344,6 +344,9 @@ def _str_contains(
        na=lib.no_default,
        regex: bool = True,
    ):
        if isinstance(pat, re.Pattern) and regex:
Can you combine this case with the one below?
pandas-dev#61942) and add whatsnew note
doc/source/whatsnew/v2.3.2
Outdated
This file already exists but as v2.3.2.rst. So you can move the item to that file.
The file exists on main, see https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v2.3.2.rst. So if you don't have it locally, that means you have to fetch the latest upstream repo and merge in your branch. See https://pandas.pydata.org/docs/development/contributing.html#updating-your-pull-request
pandas/tests/strings/test_strings.py
Outdated
@pytest.mark.parametrize("dtype", ["string[pyarrow]"])
def test_str_contains_compiled_regex_arrow_dtype(dtype):
- @pytest.mark.parametrize("dtype", ["string[pyarrow]"])
- def test_str_contains_compiled_regex_arrow_dtype(dtype):
+ def test_str_contains_compiled_regex_arrow_dtype(any_string_dtype):
By using this fixture, we test it with all different string-like dtypes and ensure it behaves consistently (you might just have to define the expected boolean dtype depending on the exact dtype; you can see how that is done in the other .str.contains tests).
Thank you for the suggestion. I have applied the changes and updated the PR.
if any_string_dtype == "string[pyarrow]":
    pytest.importorskip("pyarrow")
- if any_string_dtype == "string[pyarrow]":
-     pytest.importorskip("pyarrow")
That should already happen via the fixture, I think?
"str": bool,
}.get(any_string_dtype, object)
expected = Series([False, True, True], dtype=expected_dtype)
assert str(result.dtype) == str(expected.dtype)
- assert str(result.dtype) == str(expected.dtype)
This check will also be done by the assert_series_equal called on the next line.
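To illustrate the point about the redundant dtype assertion, here is a minimal sketch using the public pandas.testing.assert_series_equal, which compares dtypes by default (check_dtype=True). The two Series are made-up sample data, not the values from the PR's test, and the sketch assumes pandas is installed.

```python
import pandas as pd
import pandas.testing as tm

# Same values, different dtypes (sample data for illustration only).
result = pd.Series([False, True, True], dtype=bool)
expected = pd.Series([False, True, True], dtype=object)

try:
    tm.assert_series_equal(result, expected)
    dtype_mismatch_detected = False
except AssertionError:
    # The dtype difference alone is enough to make the comparison fail.
    dtype_mismatch_detected = True

print(dtype_mismatch_detected)

# With matching dtypes the same call passes silently.
tm.assert_series_equal(result, pd.Series([False, True, True], dtype=bool))
```

Because the comparison already fails on a dtype mismatch, a separate assertion on str(result.dtype) adds no extra coverage.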
For now, I have applied the changes as per your suggestion.
Can you try to run the test you added locally? Then you can make sure it works correctly. Right now it is still failing according to CI.
Sure, I will try to run the tests locally and update this PR.
closes #61942
This PR fixes an issue in Series.str.contains() where passing a compiled regex object failed when the underlying string data is backed by PyArrow. Please let me know if my approach is not correct; I would love to improve it and contribute to this.