Anticipate different year values between Athena and Socrata assets #894
query_dict = {
    year: f"{query} WHERE year = {year}"
    for year in years
    if not pd.isna(year)
}

if any(pd.isna(years)):
    query_dict.update({"nan": f"{query} WHERE year IS NULL"})
Making sure we don't exclude data on Socrata that has null year values.
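As a rough, standalone illustration of the null handling in this hunk, the sketch below builds one query per year plus a single `IS NULL` query when any year is missing. The names `is_missing` and `build_year_queries` and the sample query string are hypothetical, and `is_missing` is a stdlib stand-in for `pd.isna` on scalars, not the code in this PR.

```python
import math


def is_missing(value):
    """True for None or float NaN (a stdlib stand-in for pd.isna on scalars)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_year_queries(query, years):
    """Map each non-null year to its own query; add one IS NULL query if needed."""
    query_dict = {
        year: f"{query} WHERE year = {year}"
        for year in years
        if not is_missing(year)
    }
    if any(is_missing(year) for year in years):
        # In standard SQL an equality test against NULL never matches,
        # so rows with a null year need their own IS NULL query.
        query_dict["nan"] = f"{query} WHERE year IS NULL"
    return query_dict


# Hypothetical base query and year values, for illustration only.
queries = build_year_queries("SELECT * FROM asset", [2024, 2025, float("nan")])
for key, sql in queries.items():
    print(key, "->", sql)
```

Keeping the null case under its own `"nan"` key means the rest of the script can loop over the dictionary the same way for every year.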
input_data[":deleted"] = None
input_data.loc[input_data["_merge"] == "right_only", ":deleted"] = True
input_data = input_data.drop(columns=["_merge"])
if len(socrata_rows) > 0:
When we get new years in Athena, there won't be any comparable Socrata data to gather, and this was causing an issue for the permits asset.
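To make the failure mode concrete, here is a minimal sketch of flagging Socrata-only rows for deletion while skipping the comparison when Socrata returns nothing for a year. The function name `flag_deleted`, the frame `athena_rows`, and the sample data are assumptions for illustration; the actual script may structure the merge differently.

```python
import pandas as pd


def flag_deleted(athena_rows, socrata_rows):
    """Return the Athena rows plus any Socrata-only rows flagged for deletion."""
    input_data = athena_rows.copy()
    input_data[":deleted"] = None
    # A brand-new year has no rows on Socrata. The empty frame then has no
    # row_id column, so selecting it below would raise a KeyError.
    if len(socrata_rows) > 0:
        merged = athena_rows.merge(
            socrata_rows[["row_id"]], on="row_id", how="outer", indicator=True
        )
        merged[":deleted"] = None
        # right_only rows exist on Socrata but no longer exist in Athena.
        merged.loc[merged["_merge"] == "right_only", ":deleted"] = True
        input_data = merged.drop(columns=["_merge"])
    return input_data


# Hypothetical sample data.
athena = pd.DataFrame({"row_id": ["a", "b"], "year": [2025, 2025]})
socrata = pd.DataFrame({"row_id": ["b", "c"]})

result = flag_deleted(athena, socrata)
empty_result = flag_deleted(athena, pd.DataFrame())
print(result)
```

With this sample data, row `c` is the only one flagged, and the empty Socrata frame simply passes the Athena rows through.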
.split(",")
)
# row id won't show up here since it's hidden on the open data portal assets
asset_columns += ["row_id"]
This is no longer needed since we exposed `row_id` on the open data assets.
Addressing an issue where we get new years for data in Athena which don't already exist in Socrata, or years are deleted from Athena and the corresponding rows need to be removed from Socrata. Before, this would error out trying to find a `row_id` column in an empty dataframe if the year value wasn't present on Socrata, or would leave rows undeleted on Socrata if they had a year value that had been removed from Athena.

This hasn't caused us an issue yet since we've almost exclusively been using the script to update the present year (2025) of data for each asset, and when I tested the script updating all years in the past, not enough time had passed between runs for there to be differences in year values.
This run of the workflow successfully handled `year = 205`, which was causing the error here (205 had been added to Athena but wasn't present in Socrata), but ended up failing when Athena got nuked.

Here's a run I initialized after solving the "data is on Athena but not on Socrata" issue. There was still a one-row difference between the Athena and Socrata assets due to a year value that no longer existed in Athena. The run found the row on Socrata and removed it.
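The two mismatch cases described above can be pictured as a set comparison over year values. This is only a sketch with a hypothetical helper, `compare_years`, and made-up years, assuming non-null integer years; it is not how the script itself is organized.

```python
def compare_years(athena_years, socrata_years):
    """Split year values by where they appear (assumes non-null years)."""
    athena, socrata = set(athena_years), set(socrata_years)
    return {
        # New in Athena: nothing on Socrata to compare against, upload only.
        "upload_only": sorted(athena - socrata),
        # Gone from Athena: the matching Socrata rows need deleting.
        "delete_only": sorted(socrata - athena),
        # Present on both sides: compare rows as usual.
        "in_both": sorted(athena & socrata),
    }


# Hypothetical year values.
plan = compare_years([2023, 2024, 2025], [2022, 2023, 2024])
print(plan)
```

Checking both set differences covers new years in Athena as well as years that were removed from it.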