This repository was archived by the owner on Apr 18, 2025. It is now read-only.
add parameter to coerce non-numeric values to NaN during validation #4
Open
diegoquintanav wants to merge 1 commit into multimeric:master from diegoquintanav:coercenumeric
Contributor
Author
@TMiguelT what do you think about this?
Owner
This seems reasonable. Could you add a test that currently breaks, i.e. a column containing non-numeric data?
Contributor
Author
Sure, I will during the week, if that's okay with you.
Contributor
Author
This is really old, but I ran into this again:

In [85]: df["my_column"].unique()
Out[85]:
array(['nan', '2008', '2016', '2015', '2014', '2013', '2012', '2010',
       '2011', '2009', '2017'], dtype=object)

Say we have a simple dictionary:

dictionary = ps.Schema(
    [
        ps.Column('my_column', [ps.validations.InRangeValidation(1900, 3000)]),
    ])

In [86]: errors = dictionary.validate(df, columns=["my_column"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "nan"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-86-2a1c78e8916b> in <module>()
----> 1 errors = dictionary.validate(df, columns=["my_column"])
~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/schema.py in validate(self, df, columns)
83 # Iterate over each pair of schema columns and data frame series and run validations
84 for series, column in column_pairs:
---> 85 errors += column.validate(series)
86
87 return sorted(errors, key=lambda e: e.row)
~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/column.py in validate(self, series)
25 :return: An iterable of ValidationError instances generated by the validation
26 """
---> 27 return [error for validation in self.validations for error in validation.get_errors(series, self)]
~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/column.py in <listcomp>(.0)
25 :return: An iterable of ValidationError instances generated by the validation
26 """
---> 27 return [error for validation in self.validations for error in validation.get_errors(series, self)]
~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/validation.py in get_errors(self, series, column)
82 # Calculate which columns are valid using the child class's validate function, skipping empty entries if the
83 # column specifies to do so
---> 84 simple_validation = ~self.validate(series)
85 if column.allow_empty:
86 # Failing results are those that are not empty, and fail the validation
~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas_schema/validation.py in validate(self, series)
205
206 def validate(self, series: pd.Series) -> pd.Series:
--> 207 series = pd.to_numeric(series)
208 return (series >= self.min) & (series < self.max)
209
~/Code/lap-etl-project/venv/lib/python3.7/site-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
131 coerce_numeric = False if errors in ('ignore', 'raise') else True
132 values = lib.maybe_convert_numeric(values, set(),
--> 133 coerce_numeric=coerce_numeric)
134
135 except Exception:
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "nan" at position 0

I believe line 207 is responsible, as it raises an error if it can't convert a value to a number. That is what happens with the NaN values here.
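
The failure can probably be reproduced outside pandas_schema. The following is a minimal sketch, not code from the PR: the sample values are made up, and it uses the string "unknown" rather than "nan" because whether the literal "nan" parses may depend on the pandas version. It contrasts the default `errors="raise"` behaviour of `pd.to_numeric` with `errors="coerce"`.

```python
import pandas as pd

# Hypothetical sample data: one clearly non-numeric string among year strings.
series = pd.Series(["unknown", "2008", "2016"], dtype=object)

# Default behaviour (errors="raise"): an unparseable string aborts the conversion.
try:
    pd.to_numeric(series)
    raised = False
except ValueError:
    raised = True

# With errors="coerce", unparseable entries become NaN instead of raising.
coerced = pd.to_numeric(series, errors="coerce")
```

With coercion, the first entry ends up as NaN while the remaining entries are converted to numbers.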
Owner
Please just make a test case out of your example and I'll be happy to accept the PR
Maybe this parameter should be exposed, but I set it to coerce by default. All non-numeric values are converted into np.NaN elements. Without this setting, validation raises an error if a string is found in a column of ints or floats.
Please let me know what you think.
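
As a rough illustration of what exposing the parameter could look like, here is a standalone sketch. The function `in_range` and its `coerce` argument are hypothetical names for this example only, not the pandas_schema API and not necessarily what the commit does; the comparison mirrors the two lines shown at `validation.py` 207-208 in the traceback above.

```python
import pandas as pd


def in_range(series: pd.Series, minimum: float, maximum: float,
             coerce: bool = True) -> pd.Series:
    """Hypothetical range check: True where minimum <= value < maximum."""
    # coerce=True turns non-numeric entries into NaN; coerce=False keeps
    # the current behaviour of raising on the first unparseable value.
    errors = "coerce" if coerce else "raise"
    numeric = pd.to_numeric(series, errors=errors)
    # Comparisons against NaN evaluate to False, so coerced entries are
    # reported as out of range rather than aborting the validation.
    return (numeric >= minimum) & (numeric < maximum)


# Made-up sample column with one non-numeric and one out-of-range value.
years = pd.Series(["unknown", "2008", "3500"], dtype=object)
mask = in_range(years, 1900, 3000)
```

One consequence of this design is that a coerced value is indistinguishable from a genuinely out-of-range one in the resulting mask, which may or may not be what a caller wants.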