
Rewrite builders to extract relevant info from netCDF files, rather than regex matching on filenames #378


Open
charles-turner-1 opened this issue Apr 1, 2025 · 16 comments · May be fixed by #432
Labels: enhancement (New feature or request)

@charles-turner-1 (Collaborator)

Is your feature request related to a problem? Please describe.

The current regex-matching approach, used to generate file IDs and then separate an experiment into datasets, needs to be updated: the catalog already contains several datasets for which this approach fails, and we're liable to create more. Additional datasets such as those in #349 will eventually turn the regex approach into a Frankenstein's monster of string processing as we introduce more layers of workarounds.

Describe the feature you'd like

When we open the datasets to build the datastore here, we should pull out the extra information necessary to disambiguate the experiment outputs/file structure into datasets.

@dougiesquire (Collaborator)

Related: #67

@marc-white (Collaborator)

I've been looking at this today. There are a couple of easy wins we can make in progressing this, and then one major sticking point.

Wins

  • I've already removed timestamp_filename from the _NCFileInfo object. AFAIK, it's no longer used for anything, so it was effectively dead code.
  • I think I can remove the filename_frequency from the file parser fairly easily, and simply leave the frequency as the STATIC_FREQUENCY (fx) default when the frequency can't be calculated from the file data (a minimal sketch of this fallback follows this list). I can't remember where, but I think we've discussed this as a reasonable solution in the past, as it ultimately represents a data issue.
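
For illustration, a hedged sketch of that fallback, assuming the frequency is inferred from the time coordinate (the function name and the exact inference call are illustrative, not the repository's code):

import xarray as xr

STATIC_FREQUENCY = "fx"  # the static-data default mentioned above

def infer_frequency(path):
    """Infer output frequency from the time coordinate, else fall back to fx."""
    ds = xr.open_dataset(path, use_cftime=True)
    time = ds.get("time")
    if time is None or time.size < 3:
        return STATIC_FREQUENCY  # too few samples to infer anything
    try:
        freq = xr.infer_freq(time)  # returns None for irregular spacing
    except (TypeError, ValueError):
        return STATIC_FREQUENCY
    return freq or STATIC_FREQUENCY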

Sticking point

... is the realm information. At the moment, this is parsed from the filename, or, in the case of, say, ROMSBuilder, hard-coded to a single value. Neither approach will work under this ticket:

  • Obviously, we're trying to move away from using filenames for anything data-related;
  • Several models have multiple possible realms, so we can't just hard-code a solution.

I've been looking through just the test data contained in the repository. Some models (e.g. ACCESS-OM2) seem to contain useful data in the X.attrs dict (where X is the xarray.Dataset returned by xarray.open_dataset); we could try parsing that. However, some of the other test data (e.g. ACCESS-CM2) seems to contain no information that could explicitly be used to compute the realm.

The only solution I've been able to think of so far is to hold a mapping of variables to realms for each model, and use the variable(s) present in each file to divine which realm the file belongs to. However, this has some potential landmines:

  • How many variables are unique to a given realm in each model?
  • Can we guarantee one of these unique 'marker' variables will be available in every file?
  • I can imagine this mapping would get very large, very quickly.

Happy to hear if anyone else has any thoughts on how to address the realm conundrum.
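
For illustration, a minimal sketch of that mapping idea (the marker variables and realm names below are made up, and a real mapping would need to be per-model):

import xarray as xr

REALM_MARKERS = {
    "seaIce": {"aice", "hi", "hs"},     # hypothetical sea-ice markers
    "ocean": {"salt", "mld", "eta_t"},  # hypothetical ocean markers
    "atmos": {"tas", "pr", "psl"},      # hypothetical atmosphere markers
}

def guess_realm(path):
    ds = xr.open_dataset(path)
    present = set(ds.data_vars)
    for realm, markers in REALM_MARKERS.items():
        if present & markers:  # any one marker variable is enough
            return realm
    return None  # no marker found: exactly the failure mode flagged above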

marc-white self-assigned this Apr 3, 2025
@charles-turner-1 (Collaborator, Author)

charles-turner-1 commented Apr 3, 2025

I wonder if using coordinate variables, rather than data variables, might let us enumerate the space of realms more straightforwardly. I'd be very surprised if, e.g. (and I'm making these examples up off the top of my head), an atmospheric model used a depth coordinate, or, vice versa, an ocean model used altitude/hPa etc.

It seems to me that coordinate variables would be more likely to be consistent than data variables?

I'm just spitballing here though - so I could be talking nonsense.
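
A rough sketch of that coordinate-based heuristic (the coordinate names here are guesses in the same spirit as the examples above):

import xarray as xr

def guess_realm_from_coords(path):
    ds = xr.open_dataset(path)
    coords = set(ds.coords)
    if coords & {"st_ocean", "depth", "lev"}:  # depth-like coordinate
        return "ocean"
    if coords & {"height", "pressure", "model_level_number"}:  # height/pressure-like
        return "atmos"
    return None  # fall through to some other method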

@dougiesquire (Collaborator)

I'm not sure I'd worry too much about realm at the moment. I think, for the most part, the parsing of realms can be relied on (or at least, the other approaches I can think of off the top of my head aren't any better). Getting rid of file_id (and the regex madness it entails) in the definition of dataset keys is what I think is the highest priority.

@marc-white (Collaborator)

@dougiesquire don't we need to split the data by realm in order to reliably build the datastores? I think this is effectively what we're doing at the moment by using filename-based file_id keys (which typically contain the realm information by design/happy accident).

@charles-turner-1 (Collaborator, Author)

charles-turner-1 commented Apr 3, 2025

I think file_id is built from the filename, as is the realm, but we can get the realm from the filename without having to do any of the other parsing needed to create a file_id, and create the file_id (or similar) from information within the file instead?

IIRC, the thinking was that if we could extract a combination of frequency and dimensions (x, y, z), encoding missing dimensions as size zero, we should be able to get back to something that uniquely identifies the same datasets as before without having to redact strings.

@dougiesquire (Collaborator)

dougiesquire commented Apr 3, 2025

That's right @charles-turner-1. My vote would be to get rid of file_id altogether and start by trying to replace the current dataset keys, which are file_id.frequency, with realm.frequency.nx.ny.nz, where realm is determined the same way it currently is (at least for now).

I'm not sure how well this will work across all the different experiments in the catalog and there may be performance implications for some experiments, but we have to start somewhere.
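
A minimal sketch of building such a key, assuming realm and frequency are already known (the dimension names used here are illustrative; a real Builder would need to map model-specific names onto nx/ny/nz):

import xarray as xr

def dataset_key(path, realm, frequency):
    ds = xr.open_dataset(path)
    # Missing dimensions encode as size 0, per the suggestion above
    nx = ds.sizes.get("ni", 0)
    ny = ds.sizes.get("nj", 0)
    nz = ds.sizes.get("nkice", 0)
    return f"{realm}.{frequency}.{nx}.{ny}.{nz}"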

@dougiesquire (Collaborator)

dougiesquire commented Apr 3, 2025

Note: the dataset keys are defined by groupby_attrs in each Builder, e.g. here.
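
In rough terms, the change proposed above would amount to something like the following (both lines are illustrative, not the repository's actual values):

groupby_attrs = ["file_id", "frequency"]                  # current scheme, approximately
groupby_attrs = ["realm", "frequency", "nx", "ny", "nz"]  # the proposal above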

@marc-white (Collaborator)

@dougiesquire do you have any insight into the best way to get the coordinates into the new dataset keys?

For example, I'm looking at a file from the request for a new reader (#168), and there are a ton of variables in there, all with different shapes:

import netCDF4

ds = netCDF4.Dataset("/g/data/zv30/non-cmip/ACCESS-CM3/cm3-run-29-01-2025-exp-runoff-smoothing-rmax-500-efold-1000/archive/2100/ice/access-cm3.cice.h.2100-01.nc")

# Print each variable's dimension tuple
for k in ds.variables.keys():
    print(f"{k}: {ds[k].dimensions}")

time: ('time',)
time_bounds: ('time', 'nbnd')
TLON: ('nj', 'ni')
TLAT: ('nj', 'ni')
ULON: ('nj', 'ni')
ULAT: ('nj', 'ni')
NLON: ('nj', 'ni')
NLAT: ('nj', 'ni')
ELON: ('nj', 'ni')
ELAT: ('nj', 'ni')
NCAT: ('nc',)
VGRDi: ('nkice',)
VGRDs: ('nksnow',)
VGRDb: ('nkbio',)
VGRDa: ('nkaer',)
tmask: ('nj', 'ni')
umask: ('nj', 'ni')
nmask: ('nj', 'ni')
emask: ('nj', 'ni')
tarea: ('nj', 'ni')
uarea: ('nj', 'ni')
narea: ('nj', 'ni')
earea: ('nj', 'ni')
dxt: ('nj', 'ni')
dyt: ('nj', 'ni')
dxu: ('nj', 'ni')
dyu: ('nj', 'ni')
dxn: ('nj', 'ni')
dyn: ('nj', 'ni')
dxe: ('nj', 'ni')
dye: ('nj', 'ni')
HTN: ('nj', 'ni')
HTE: ('nj', 'ni')
ANGLE: ('nj', 'ni')
ANGLET: ('nj', 'ni')
lont_bounds: ('nj', 'ni', 'nvertices')
latt_bounds: ('nj', 'ni', 'nvertices')
lonu_bounds: ('nj', 'ni', 'nvertices')
latu_bounds: ('nj', 'ni', 'nvertices')
lonn_bounds: ('nj', 'ni', 'nvertices')
latn_bounds: ('nj', 'ni', 'nvertices')
lone_bounds: ('nj', 'ni', 'nvertices')
late_bounds: ('nj', 'ni', 'nvertices')
hi: ('time', 'nj', 'ni')
hs: ('time', 'nj', 'ni')
snowfrac: ('time', 'nj', 'ni')
Tsfc: ('time', 'nj', 'ni')
aice: ('time', 'nj', 'ni')
uvel: ('time', 'nj', 'ni')
vvel: ('time', 'nj', 'ni')
icespd: ('time', 'nj', 'ni')
icedir: ('time', 'nj', 'ni')
uatm: ('time', 'nj', 'ni')
vatm: ('time', 'nj', 'ni')
atmspd: ('time', 'nj', 'ni')
atmdir: ('time', 'nj', 'ni')
fswup: ('time', 'nj', 'ni')
fswdn: ('time', 'nj', 'ni')
flwdn: ('time', 'nj', 'ni')
snow: ('time', 'nj', 'ni')
snow_ai: ('time', 'nj', 'ni')
rain: ('time', 'nj', 'ni')
rain_ai: ('time', 'nj', 'ni')
sst: ('time', 'nj', 'ni')
ocnspd: ('time', 'nj', 'ni')
ocndir: ('time', 'nj', 'ni')
frzmlt: ('time', 'nj', 'ni')
scale_factor: ('time', 'nj', 'ni')
fswint_ai: ('time', 'nj', 'ni')
fswabs: ('time', 'nj', 'ni')
fswabs_ai: ('time', 'nj', 'ni')
albsni: ('time', 'nj', 'ni')
alvdr_ai: ('time', 'nj', 'ni')
alidr_ai: ('time', 'nj', 'ni')
alvdf_ai: ('time', 'nj', 'ni')
alidf_ai: ('time', 'nj', 'ni')
flat: ('time', 'nj', 'ni')
flat_ai: ('time', 'nj', 'ni')
fsens: ('time', 'nj', 'ni')
fsens_ai: ('time', 'nj', 'ni')
flwup: ('time', 'nj', 'ni')
flwup_ai: ('time', 'nj', 'ni')
evap: ('time', 'nj', 'ni')
evap_ai: ('time', 'nj', 'ni')
congel: ('time', 'nj', 'ni')
frazil: ('time', 'nj', 'ni')
snoice: ('time', 'nj', 'ni')
meltt: ('time', 'nj', 'ni')
melts: ('time', 'nj', 'ni')
meltb: ('time', 'nj', 'ni')
meltl: ('time', 'nj', 'ni')
fresh: ('time', 'nj', 'ni')
fresh_ai: ('time', 'nj', 'ni')
fsalt: ('time', 'nj', 'ni')
fsalt_ai: ('time', 'nj', 'ni')
fbot: ('time', 'nj', 'ni')
fhocn: ('time', 'nj', 'ni')
fhocn_ai: ('time', 'nj', 'ni')
fswthru: ('time', 'nj', 'ni')
fswthru_ai: ('time', 'nj', 'ni')
strairx: ('time', 'nj', 'ni')
strairy: ('time', 'nj', 'ni')
strtltx: ('time', 'nj', 'ni')
strtlty: ('time', 'nj', 'ni')
strcorx: ('time', 'nj', 'ni')
strcory: ('time', 'nj', 'ni')
strocnx: ('time', 'nj', 'ni')
strocny: ('time', 'nj', 'ni')
strintx: ('time', 'nj', 'ni')
strinty: ('time', 'nj', 'ni')
taubx: ('time', 'nj', 'ni')
tauby: ('time', 'nj', 'ni')
strength: ('time', 'nj', 'ni')
divu: ('time', 'nj', 'ni')
shear: ('time', 'nj', 'ni')
sig1: ('time', 'nj', 'ni')
sig2: ('time', 'nj', 'ni')
sigP: ('time', 'nj', 'ni')
dvidtt: ('time', 'nj', 'ni')
dvidtd: ('time', 'nj', 'ni')
daidtt: ('time', 'nj', 'ni')
daidtd: ('time', 'nj', 'ni')
ice_present: ('time', 'nj', 'ni')
fsurf_ai: ('time', 'nj', 'ni')
fcondtop_ai: ('time', 'nj', 'ni')
fmeltt_ai: ('time', 'nj', 'ni')
opening: ('time', 'nj', 'ni')
apond: ('time', 'nj', 'ni')
apond_ai: ('time', 'nj', 'ni')
hpond: ('time', 'nj', 'ni')
hpond_ai: ('time', 'nj', 'ni')
ipond: ('time', 'nj', 'ni')
ipond_ai: ('time', 'nj', 'ni')
apeff: ('time', 'nj', 'ni')
apeff_ai: ('time', 'nj', 'ni')
aicen: ('time', 'nc', 'nj', 'ni')
vicen: ('time', 'nc', 'nj', 'ni')
vsnon: ('time', 'nc', 'nj', 'ni')
fsurfn_ai: ('time', 'nc', 'nj', 'ni')
fcondtopn_ai: ('time', 'nc', 'nj', 'ni')
fmelttn_ai: ('time', 'nc', 'nj', 'ni')
flatn_ai: ('time', 'nc', 'nj', 'ni')
fsensn_ai: ('time', 'nc', 'nj', 'ni')
apondn: ('time', 'nc', 'nj', 'ni')
hpondn: ('time', 'nc', 'nj', 'ni')
apeffn: ('time', 'nc', 'nj', 'ni')

How can we discern which coordinates are the ones we want to build a key from? I've tried things like eliminating the keys that are also in ds.dimensions, but that doesn't get me very far.
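
One heuristic would be to treat only classic netCDF "coordinate variables" (1-D variables named after their own dimension) as coordinates; a sketch, reusing the path from the snippet above:

import netCDF4

ds = netCDF4.Dataset(path)  # `path` as in the snippet above
coord_vars = [
    name for name, var in ds.variables.items()
    if var.dimensions == (name,)  # 1-D and named after its dimension
]
data_vars = [name for name in ds.variables if name not in coord_vars]

For this particular file only time would qualify, since TLON, TLAT etc. are 2-D auxiliary coordinates, which may be part of why dimension-based filtering doesn't get very far here.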

@charles-turner-1 (Collaborator, Author)

My brain is very woolly today, but I think what you're looking for is either the indexes the dataset contains, or something very closely related to them.

For example, take the last two:

hpondn: ('time', 'nc', 'nj', 'ni')
apeffn: ('time', 'nc', 'nj', 'ni')

hpondn and apeffn are both indexed by 'time', 'nc', 'nj', and 'ni', which I think means they're going to be in the same dataset/use the same key (I think using the same key is equivalent to being in the same dataset, unless there are subtleties I'm forgetting).

I think this should also neatly handle any potential 2D coordinate complexities.

Disclaimer: this is mostly based on reading about how netCDF files internally store information during airport layovers over the past couple of weeks, so I could be mixing up terms.
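
A quick way to see what xarray actually treats as indexes for the file above (a sketch; the exact output depends on the file's CF metadata):

import xarray as xr

ds = xr.open_dataset(path)  # the same CICE file as above
print(dict(ds.indexes))  # likely just the 'time' index for this file
print(list(ds.coords))   # may also pick up TLON/TLAT etc. as non-index coordinates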

@dougiesquire (Collaborator)

Yeah, I think this will unfortunately be an iterative process to find what works best.

A starting point could be to define a column entry (say, shape) to use in the keys that is based off the dimensions and size of a variable, excluding some (Builder-specific) dims. E.g. here we could exclude time, nbnd and nvertices, so that for hpondn, for example, shape = "nc:10.nj:20.ni:30" (I don't actually know the sizes of nc etc). Then the dataset keys could be realm.frequency.shape. A better approach would be if we could identify and encode the grid somehow, as discussed in #112.

Coordinate variables will require special consideration as they should just be included in the same dataset (i.e. have the same key) as the variable they are relevant to, where possible.

It's hard to know how well something like my suggestion will work without just trying it. There are lots of edge cases that are hard to imagine. I'd suggest trying with one of the large OM2 experiments.
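
A minimal sketch of the suggested shape entry (the exclusion set is Builder-specific; here it's just the example dims above):

import xarray as xr

EXCLUDE_DIMS = {"time", "nbnd", "nvertices"}

def shape_key(var):
    """Encode a variable's remaining dims/sizes, e.g. 'nc:10.nj:20.ni:30'."""
    return ".".join(
        f"{dim}:{size}" for dim, size in var.sizes.items() if dim not in EXCLUDE_DIMS
    )

where var is an xarray.DataArray, e.g. xr.open_dataset(path)["hpondn"].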

@marc-white (Collaborator)

The part that's confusing me at the moment is what happens when we have multiple data variables inside the same file. Do we just take the highest-order coordinates as the 'matching' shape (in the example I gave, the 4D coordinates, which drop back to 3D once time is excluded), and then assume anything of lower dimensionality can be worked out by the Builder?

@dougiesquire (Collaborator)

I think we'll need to define the shape per variable. That will mean the variables in a given file could be split into different Intake-ESM datasets (as in your example case). This will decrease performance in cases where people want multiple variables of different shapes that exist in the same file, but I think it could be a worthwhile trade for robustness.

@marc-white (Collaborator)

@dougiesquire is that strictly necessary? The build process has managed fine so far by grouping the files by (redacted) file name, and then letting the Builder figure out how to tie the various variables together in time.

I'm starting to wonder if what we should be doing is confirming that all the files we want to group together contain the same variables, and that those variables all have the same dimensions (see the sketch below). I think this is what we were effectively assuming with the filename grouping, although it will end up making a ridiculously long shape string for matching...
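
A sketch of that check (the helper name is made up):

import xarray as xr

def file_signature(path):
    """The set of (variable name, dimensions) pairs in a file."""
    ds = xr.open_dataset(path)
    return frozenset((name, var.dims) for name, var in ds.data_vars.items())

# Two files would be safe to group when file_signature(a) == file_signature(b).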

@marc-white (Collaborator)

Discussed this with @charles-turner-1 at our meeting today: going to try to use the indices recorded in the file to define the file shape, rather than work through the various coordinate combinations of each variable. Might need to open the file in xarray rather than netCDF4 to make this happen, but we'll see.

@marc-white (Collaborator)

[Image attached]

Well, that didn't go exactly how I'd hoped...
