Skip to content

BUG: Fix Series.str.contains with compiled regex on Arrow string dtype (#61942) #61946

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

Aniketsy
Copy link

@Aniketsy Aniketsy commented Jul 25, 2025

closes #61942

This PR fixes an issue in Series.str.contains() where passing a compiled regex object failed when the underlying string data is backed by PyArrow.

Please, provide feedback if my approach is not correct , I would love to improve and contribute in this.

@Aniketsy Aniketsy force-pushed the fix-arrow-contains-regex-v2 branch from 0b16375 to 838b1c5 Compare July 25, 2025 13:15
@Aniketsy
Copy link
Author

Hi @mroeschke
I've worked on the issue
BUG: Fix Series.str.contains with compiled regex on Arrow string dtype ([#61942])
and have opened a pull request for it.

I'd appreciate it if you could take a look and share your feedback.
Please let me know if anything needs to be improved or clarified.

Thanks!

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix should go into _str_contains of ArrowExtensionArray

@Aniketsy
Copy link
Author

Thankyou for the feedback!
I will update that.

@jorisvandenbossche
Copy link
Member

Additionally, if this is something that is not implemented by pyarrow, we should not raise a NotImplementedError, but fall back on the python object implementation (you can see a similar pattern in some other str methods, like ArrowStringArray._str_replace)

@jorisvandenbossche jorisvandenbossche added Bug Strings String extension data type and string data Arrow pyarrow functionality labels Jul 26, 2025
@jorisvandenbossche jorisvandenbossche added this to the 2.3.2 milestone Jul 26, 2025
@Aniketsy
Copy link
Author

@jorisvandenbossche Thank you for the feedback! I will update the PR accordingly.

Would you mind letting me know the reason behind the one failing check (pre-commit.ci)?
Thanks again!

@jorisvandenbossche
Copy link
Member

Would you mind letting me know the reason behind the one failing check (pre-commit.ci)?

ruff is failing, which is used for auto formatting. I would recommend to install the pre-commit locally to avoid having this fail on CI: https://pandas.pydata.org/docs/dev/development/contributing_codebase.html#pre-commit

@Aniketsy
Copy link
Author

hi @jorisvandenbossche
Please review this PR, and if area needs changes please suggest.
Also I want to know if i would need to write unit test for this .

Thankyou!

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a whatsnew note in v2.3.2 and tests for this?

@@ -344,6 +344,9 @@ def _str_contains(
na=lib.no_default,
regex: bool = True,
):
if isinstance(pat, re.Pattern) and regex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you combine this case with the one below?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file already exists but as v2.3.2.rst. So you can move the item to that file

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github

I’ve deleted the file I previously created. However, I couldn't find v2.3.2.rst in the repository.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file exists on main, see https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v2.3.2.rst. So if you don't have it locally, that means you have to fetch the latest upstream repo and merge in your branch. See https://pandas.pydata.org/docs/development/contributing.html#updating-your-pull-request

Comment on lines 180 to 181
@pytest.mark.parametrize("dtype", ["string[pyarrow]"])
def test_str_contains_compiled_regex_arrow_dtype(dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@pytest.mark.parametrize("dtype", ["string[pyarrow]"])
def test_str_contains_compiled_regex_arrow_dtype(dtype):
def test_str_contains_compiled_regex_arrow_dtype(any_string_dtype):

By using this fixture, we test it with all different string-like dtypes and ensure it behaves consistently (you might just have to define the expected boolean dtype depending on the exact dtype, you can see how that is done in the other .str.contains tests)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thankyou for suggesting .
I have applied changes and updated PR

Comment on lines 286 to 287
if any_string_dtype == "string[pyarrow]":
pytest.importorskip("pyarrow")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if any_string_dtype == "string[pyarrow]":
pytest.importorskip("pyarrow")

That should already happen via the fixture, I think?

"str": bool,
}.get(any_string_dtype, object)
expected = Series([False, True, True], dtype=expected_dtype)
assert str(result.dtype) == str(expected.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert str(result.dtype) == str(expected.dtype)

This check will also be done by the assert_series_equal called on the next line

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now , I have applied changes as per your suggestion .

@jorisvandenbossche
Copy link
Member

Can you try to run the test you added locally? Then you can make sure to get it working correctly. Right now it is still failing according to CI

@Aniketsy
Copy link
Author

Sure, I will try to run tests locally and update this PR .
Thankyou !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Bug Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Using Series.str.contains() with a compiled regex and arrow strings fails
3 participants