
Rewrite builders to extract relevant info from netCDF files, rather than regex matching on filenames #378


Open
charles-turner-1 opened this issue Apr 1, 2025 · 16 comments · May be fixed by #432
Labels: enhancement (New feature or request)

@charles-turner-1 (Collaborator)

Is your feature request related to a problem? Please describe.

The current regex-matching approach, used to generate file IDs and then separate an experiment into datasets, needs to be updated: the catalog already contains several datasets for which this approach fails, and we're liable to create more. Additional datasets such as those in #349 will eventually turn the regex approach into a Frankenstein's monster of string processing as we introduce more layers of workarounds.

Describe the feature you'd like

When we open the datasets to build the datastore here, we should pull out the extra information necessary to disambiguate the experiment outputs/file structure into datasets.

@dougiesquire (Collaborator)

Related: #67

@marc-white (Collaborator)

I've been looking at this today. There are a couple of easy wins we can make in progressing this, and then one major sticking point.

Wins

  • I've already removed timestamp_filename from the _NCFileInfo object. AFAIK, it's no longer used for anything, so it was effectively dead code.
  • I think I can remove the filename_frequency from the file parser fairly easily, and simply leave the frequency as the STATIC_FREQUENCY (fx) default when the frequency can't be calculated from the file data (a minimal sketch of this fallback follows this list). I can't remember where, but I think we've discussed this as a reasonable solution in the past, as it ultimately represents a data issue.
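
For illustration, a hedged sketch of that fallback, assuming the frequency is inferred from the time coordinate (the function name and the exact inference call are illustrative, not the repository's code):

import xarray as xr

STATIC_FREQUENCY = "fx"  # the static-data default mentioned above

def infer_frequency(path):
    """Infer output frequency from the time coordinate, else fall back to fx."""
    ds = xr.open_dataset(path, use_cftime=True)
    time = ds.get("time")
    if time is None or time.size < 3:
        return STATIC_FREQUENCY  # too few samples to infer anything
    try:
        freq = xr.infer_freq(time)  # returns None for irregular spacing
    except (TypeError, ValueError):
        return STATIC_FREQUENCY
    return freq or STATIC_FREQUENCY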

Sticking point

... is the realm information. At the moment, this is parsed from the filename, or, in the case of, say, ROMSBuilder, hard-coded to a single value. Neither approach will work under this ticket:

  • Obviously, we're trying to move away from using filenames for anything data-related;
  • Several models have multiple possible realms, so we can't just hard-code a solution.

I've been looking through just the test data contained in the repository. Some models (e.g. ACCESS-OM2) seem to contain useful data in the X.attrs dict (where X is the xarray.Dataset returned by xarray.open_dataset); we could try parsing that. However, some of the other test data (e.g. ACCESS-CM2) seems to contain no information that could explicitly be used to compute the realm.

The only solution I've been able to think of so far is to hold a mapping of variables to realms for each model, and use the variable(s) present in each file to divine which realm the file belongs to. However, this has some potential landmines:

  • How many variables are unique to a given realm in each model?
  • Can we guarantee one of these unique 'marker' variables will be available in every file?
  • I can imagine this mapping would get very large, very quickly.

Happy to hear if anyone else has any thoughts on how to address the realm conundrum.
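
For illustration, a minimal sketch of that mapping idea (the marker variables and realm names below are made up, and a real mapping would need to be per-model):

import xarray as xr

REALM_MARKERS = {
    "seaIce": {"aice", "hi", "hs"},     # hypothetical sea-ice markers
    "ocean": {"salt", "mld", "eta_t"},  # hypothetical ocean markers
    "atmos": {"tas", "pr", "psl"},      # hypothetical atmosphere markers
}

def guess_realm(path):
    ds = xr.open_dataset(path)
    present = set(ds.data_vars)
    for realm, markers in REALM_MARKERS.items():
        if present & markers:  # any one marker variable is enough
            return realm
    return None  # no marker found: exactly the failure mode flagged above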

marc-white self-assigned this Apr 3, 2025
@charles-turner-1 (Collaborator, Author)

charles-turner-1 commented Apr 3, 2025

I wonder if using coordinate variables, rather than data variables, might let us enumerate the space of realms more straightforwardly. I'd be very surprised if, e.g. (and I'm making these examples up off the top of my head), an atmospheric model used a depth coordinate, or, vice versa, an ocean model used altitude/hPa etc.

It seems to me that coordinate variables would be more likely to be consistent than data variables?

I'm just spitballing here though - so I could be talking nonsense.
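
A rough sketch of that coordinate-based heuristic (the coordinate names here are guesses in the same spirit as the examples above):

import xarray as xr

def guess_realm_from_coords(path):
    ds = xr.open_dataset(path)
    coords = set(ds.coords)
    if coords & {"st_ocean", "depth", "lev"}:  # depth-like coordinate
        return "ocean"
    if coords & {"height", "pressure", "model_level_number"}:  # height/pressure-like
        return "atmos"
    return None  # fall through to some other method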

@dougiesquire (Collaborator)

I'm not sure I'd worry too much about realm at the moment. I think, for the most part, the parsing of realms can be relied on (or at least, the other approaches I can think of off the top of my head aren't any better). Getting rid of file_id (and the regex madness it entails) in the definition of dataset keys is what I think is the highest priority.

@marc-white (Collaborator)

@dougiesquire don't we need to split the data by realm in order to reliably build the datastores? I think this is effectively what we're doing at the moment by using filename-based file_id keys (which typically contain the realm information by design/happy accident).

@charles-turner-1 (Collaborator, Author)

charles-turner-1 commented Apr 3, 2025

I think file_id is built from the filename, as is the realm, but we can get the realm from the filename without having to do any of the other parsing needed to create a file_id, and create the file_id (or similar) from information within the file instead?

IIRC, the thinking was that if we could extract a combination of frequency and dimensions (x, y, z), encoding missing dimensions as size zero, we should be able to get back to something that uniquely identifies the same datasets as before without having to redact strings.

@dougiesquire (Collaborator)

dougiesquire commented Apr 3, 2025

That's right @charles-turner-1. My vote would be to get rid of file_id altogether and start by trying to replace the current dataset keys, which are file_id.frequency, with realm.frequency.nx.ny.nz, where realm is determined the same way it currently is (at least for now).

I'm not sure how well this will work across all the different experiments in the catalog and there may be performance implications for some experiments, but we have to start somewhere.
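
A minimal sketch of building such a key, assuming realm and frequency are already known (the dimension names used here are illustrative; a real Builder would need to map model-specific names onto nx/ny/nz):

import xarray as xr

def dataset_key(path, realm, frequency):
    ds = xr.open_dataset(path)
    # Missing dimensions encode as size 0, per the suggestion above
    nx = ds.sizes.get("ni", 0)
    ny = ds.sizes.get("nj", 0)
    nz = ds.sizes.get("nkice", 0)
    return f"{realm}.{frequency}.{nx}.{ny}.{nz}"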

@dougiesquire (Collaborator)

dougiesquire commented Apr 3, 2025

Note: the dataset keys are defined by groupby_attrs in each Builder, e.g. here.
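
In rough terms, the change proposed above would amount to something like the following (both lines are illustrative, not the repository's actual values):

groupby_attrs = ["file_id", "frequency"]                  # current scheme, approximately
groupby_attrs = ["realm", "frequency", "nx", "ny", "nz"]  # the proposal above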

@marc-white (Collaborator)

@dougiesquire do you have any insight into the best way to get the coordinates into the new dataset keys?

For example, I'm looking at a file from the request for a new reader (#168), and there are a ton of variables in there, all with different shapes:

import netCDF4

ds = netCDF4.Dataset("/g/data/zv30/non-cmip/ACCESS-CM3/cm3-run-29-01-2025-exp-runoff-smoothing-rmax-500-efold-1000/archive/2100/ice/access-cm3.cice.h.2100-01.nc")

# Print each variable's dimension tuple
for k in ds.variables.keys():
    print(f"{k}: {ds[k].dimensions}")

time: ('time',)
time_bounds: ('time', 'nbnd')
TLON: ('nj', 'ni')
TLAT: ('nj', 'ni')
ULON: ('nj', 'ni')
ULAT: ('nj', 'ni')
NLON: ('nj', 'ni')
NLAT: ('nj', 'ni')
ELON: ('nj', 'ni')
ELAT: ('nj', 'ni')
NCAT: ('nc',)
VGRDi: ('nkice',)
VGRDs: ('nksnow',)
VGRDb: ('nkbio',)
VGRDa: ('nkaer',)
tmask: ('nj', 'ni')
umask: ('nj', 'ni')
nmask: ('nj', 'ni')
emask: ('nj', 'ni')
tarea: ('nj', 'ni')
uarea: ('nj', 'ni')
narea: ('nj', 'ni')
earea: ('nj', 'ni')
dxt: ('nj', 'ni')
dyt: ('nj', 'ni')
dxu: ('nj', 'ni')
dyu: ('nj', 'ni')
dxn: ('nj', 'ni')
dyn: ('nj', 'ni')
dxe: ('nj', 'ni')
dye: ('nj', 'ni')
HTN: ('nj', 'ni')
HTE: ('nj', 'ni')
ANGLE: ('nj', 'ni')
ANGLET: ('nj', 'ni')
lont_bounds: ('nj', 'ni', 'nvertices')
latt_bounds: ('nj', 'ni', 'nvertices')
lonu_bounds: ('nj', 'ni', 'nvertices')
latu_bounds: ('nj', 'ni', 'nvertices')
lonn_bounds: ('nj', 'ni', 'nvertices')
latn_bounds: ('nj', 'ni', 'nvertices')
lone_bounds: ('nj', 'ni', 'nvertices')
late_bounds: ('nj', 'ni', 'nvertices')
hi: ('time', 'nj', 'ni')
hs: ('time', 'nj', 'ni')
snowfrac: ('time', 'nj', 'ni')
Tsfc: ('time', 'nj', 'ni')
aice: ('time', 'nj', 'ni')
uvel: ('time', 'nj', 'ni')
vvel: ('time', 'nj', 'ni')
icespd: ('time', 'nj', 'ni')
icedir: ('time', 'nj', 'ni')
uatm: ('time', 'nj', 'ni')
vatm: ('time', 'nj', 'ni')
atmspd: ('time', 'nj', 'ni')
atmdir: ('time', 'nj', 'ni')
fswup: ('time', 'nj', 'ni')
fswdn: ('time', 'nj', 'ni')
flwdn: ('time', 'nj', 'ni')
snow: ('time', 'nj', 'ni')
snow_ai: ('time', 'nj', 'ni')
rain: ('time', 'nj', 'ni')
rain_ai: ('time', 'nj', 'ni')
sst: ('time', 'nj', 'ni')
ocnspd: ('time', 'nj', 'ni')
ocndir: ('time', 'nj', 'ni')
frzmlt: ('time', 'nj', 'ni')
scale_factor: ('time', 'nj', 'ni')
fswint_ai: ('time', 'nj', 'ni')
fswabs: ('time', 'nj', 'ni')
fswabs_ai: ('time', 'nj', 'ni')
albsni: ('time', 'nj', 'ni')
alvdr_ai: ('time', 'nj', 'ni')
alidr_ai: ('time', 'nj', 'ni')
alvdf_ai: ('time', 'nj', 'ni')
alidf_ai: ('time', 'nj', 'ni')
flat: ('time', 'nj', 'ni')
flat_ai: ('time', 'nj', 'ni')
fsens: ('time', 'nj', 'ni')
fsens_ai: ('time', 'nj', 'ni')
flwup: ('time', 'nj', 'ni')
flwup_ai: ('time', 'nj', 'ni')
evap: ('time', 'nj', 'ni')
evap_ai: ('time', 'nj', 'ni')
congel: ('time', 'nj', 'ni')
frazil: ('time', 'nj', 'ni')
snoice: ('time', 'nj', 'ni')
meltt: ('time', 'nj', 'ni')
melts: ('time', 'nj', 'ni')
meltb: ('time', 'nj', 'ni')
meltl: ('time', 'nj', 'ni')
fresh: ('time', 'nj', 'ni')
fresh_ai: ('time', 'nj', 'ni')
fsalt: ('time', 'nj', 'ni')
fsalt_ai: ('time', 'nj', 'ni')
fbot: ('time', 'nj', 'ni')
fhocn: ('time', 'nj', 'ni')
fhocn_ai: ('time', 'nj', 'ni')
fswthru: ('time', 'nj', 'ni')
fswthru_ai: ('time', 'nj', 'ni')
strairx: ('time', 'nj', 'ni')
strairy: ('time', 'nj', 'ni')
strtltx: ('time', 'nj', 'ni')
strtlty: ('time', 'nj', 'ni')
strcorx: ('time', 'nj', 'ni')
strcory: ('time', 'nj', 'ni')
strocnx: ('time', 'nj', 'ni')
strocny: ('time', 'nj', 'ni')
strintx: ('time', 'nj', 'ni')
strinty: ('time', 'nj', 'ni')
taubx: ('time', 'nj', 'ni')
tauby: ('time', 'nj', 'ni')
strength: ('time', 'nj', 'ni')
divu: ('time', 'nj', 'ni')
shear: ('time', 'nj', 'ni')
sig1: ('time', 'nj', 'ni')
sig2: ('time', 'nj', 'ni')
sigP: ('time', 'nj', 'ni')
dvidtt: ('time', 'nj', 'ni')
dvidtd: ('time', 'nj', 'ni')
daidtt: ('time', 'nj', 'ni')
daidtd: ('time', 'nj', 'ni')
ice_present: ('time', 'nj', 'ni')
fsurf_ai: ('time', 'nj', 'ni')
fcondtop_ai: ('time', 'nj', 'ni')
fmeltt_ai: ('time', 'nj', 'ni')
opening: ('time', 'nj', 'ni')
apond: ('time', 'nj', 'ni')
apond_ai: ('time', 'nj', 'ni')
hpond: ('time', 'nj', 'ni')
hpond_ai: ('time', 'nj', 'ni')
ipond: ('time', 'nj', 'ni')
ipond_ai: ('time', 'nj', 'ni')
apeff: ('time', 'nj', 'ni')
apeff_ai: ('time', 'nj', 'ni')
aicen: ('time', 'nc', 'nj', 'ni')
vicen: ('time', 'nc', 'nj', 'ni')
vsnon: ('time', 'nc', 'nj', 'ni')
fsurfn_ai: ('time', 'nc', 'nj', 'ni')
fcondtopn_ai: ('time', 'nc', 'nj', 'ni')
fmelttn_ai: ('time', 'nc', 'nj', 'ni')
flatn_ai: ('time', 'nc', 'nj', 'ni')
fsensn_ai: ('time', 'nc', 'nj', 'ni')
apondn: ('time', 'nc', 'nj', 'ni')
hpondn: ('time', 'nc', 'nj', 'ni')
apeffn: ('time', 'nc', 'nj', 'ni')

How can we discern which coordinates are the ones we want to build a key from? I've tried things like eliminating the keys that are also in ds.dimensions, but that doesn't get me very far.
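
One heuristic would be to treat only classic netCDF "coordinate variables" (1-D variables named after their own dimension) as coordinates; a sketch, reusing the path from the snippet above:

import netCDF4

ds = netCDF4.Dataset(path)  # `path` as in the snippet above
coord_vars = [
    name for name, var in ds.variables.items()
    if var.dimensions == (name,)  # 1-D and named after its dimension
]
data_vars = [name for name in ds.variables if name not in coord_vars]

For this particular file only time would qualify, since TLON, TLAT etc. are 2-D auxiliary coordinates, which may be part of why dimension-based filtering doesn't get very far here.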

@charles-turner-1 (Collaborator, Author)

My brain is very woolly today, but I think what you're looking for is either the indexes the dataset contains, or something very closely related to them.

For example, take the last two:

hpondn: ('time', 'nc', 'nj', 'ni')
apeffn: ('time', 'nc', 'nj', 'ni')

hpondn and apeffn are both indexed by 'time', 'nc', 'nj', and 'ni', which I think means they're going to be in the same dataset/use the same key (I think using the same key is equivalent to being in the same dataset, unless there are subtleties I'm forgetting).

I think this should also neatly handle any potential 2D coordinate complexities.

Disclaimer: this is mostly based on reading about how netCDF files internally store information during airport layovers over the past couple of weeks, so I could be mixing up terms.
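
A quick way to see what xarray actually treats as indexes for the file above (a sketch; the exact output depends on the file's CF metadata):

import xarray as xr

ds = xr.open_dataset(path)  # the same CICE file as above
print(dict(ds.indexes))  # likely just the 'time' index for this file
print(list(ds.coords))   # may also pick up TLON/TLAT etc. as non-index coordinates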

@dougiesquire (Collaborator)

Yeah, I think this will unfortunately be an iterative process to find what works best.

A starting point could be to define a column entry (say, shape) to use in the keys that is based off the dimensions and size of a variable, excluding some (Builder-specific) dims. E.g. here we could exclude time, nbnd and nvertices, so that for hpondn, for example, shape = "nc:10.nj:20.ni:30" (I don't actually know the sizes of nc etc). Then the dataset keys could be realm.frequency.shape. A better approach would be if we could identify and encode the grid somehow, as discussed in #112.

Coordinate variables will require special consideration as they should just be included in the same dataset (i.e. have the same key) as the variable they are relevant to, where possible.

It's hard to know how well something like my suggestion will work without just trying it. There are lots of edge cases that are hard to imagine. I'd suggest trying with one of the large OM2 experiments.
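
A minimal sketch of the suggested shape entry (the exclusion set is Builder-specific; here it's just the example dims above):

import xarray as xr

EXCLUDE_DIMS = {"time", "nbnd", "nvertices"}

def shape_key(var):
    """Encode a variable's remaining dims/sizes, e.g. 'nc:10.nj:20.ni:30'."""
    return ".".join(
        f"{dim}:{size}" for dim, size in var.sizes.items() if dim not in EXCLUDE_DIMS
    )

where var is an xarray.DataArray, e.g. xr.open_dataset(path)["hpondn"].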

@marc-white (Collaborator)

The part that's confusing me at the moment is what happens when we have multiple data variables inside the same file. Do we just take the highest-order coordinates as the 'matching' shape (in the example I gave, the 4D coordinates, which drop back to 3D once time is excluded), and then assume anything of lower dimensionality can be worked out by the Builder?

@dougiesquire (Collaborator)

I think we'll need to define the shape per variable. That will mean the variables in a given file could be split into different Intake-ESM datasets (as in your example case). This will decrease performance in cases where people want multiple variables of different shapes that exist in the same file, but I think it could be a worthwhile trade for robustness.

@marc-white (Collaborator)

@dougiesquire is that strictly necessary? The build process has managed fine so far by grouping the files by (redacted) file name, and then letting the Builder figure out how to tie the various variables together in time.

I'm starting to wonder if what we should be doing is confirming that all the files we want to group together contain the same variables, and that those variables all have the same dimensions (see the sketch below). I think this is what we were effectively assuming with the filename grouping, although it will end up making a ridiculously long shape string for matching...
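
A sketch of that check (the helper name is made up):

import xarray as xr

def file_signature(path):
    """The set of (variable name, dimensions) pairs in a file."""
    ds = xr.open_dataset(path)
    return frozenset((name, var.dims) for name, var in ds.data_vars.items())

# Two files would be safe to group when file_signature(a) == file_signature(b).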

@marc-white (Collaborator)

Discussed this with @charles-turner-1 at our meeting today: going to try to use the indices recorded in the file to define the file shape, rather than work through the various coordinate combinations of each variable. Might need to open the file in xarray rather than netCDF4 to make this happen, but we'll see.

@marc-white (Collaborator)

[Image attached]

Well, that didn't go exactly how I'd hoped...
