Rewrite builders to extract relevant info from netCDF files, rather than regex matching on filenames #378
Related: #67
I've been looking at this today. There are a couple of easy wins we can make in progressing this, and then one major sticking point.

Wins: …
Sticking point… is the realm information. At the moment, this is being parsed from the filename, or, in the case of, say, …
I've been looking through just the test data contained in the repository. Some models (e.g., …) …

The only solution I've been able to think of so far is to hold a mapping of variables vs. realm for each model, and use the variable(s) present in each file to divine which realm the file belongs to. However, this has some potential landmines: …
Happy to hear if anyone else has any thoughts on how to address the realm conundrum.
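A minimal sketch of the mapping idea above, assuming a hand-maintained per-model lookup table. The model name, variable names, and realm assignments below are invented for illustration, not taken from the catalog:

```python
import netCDF4

# Hypothetical per-model lookup of variable name -> realm; the entries
# here are made up for illustration only.
REALM_BY_VARIABLE = {
    "ACCESS-OM2": {"temp": "ocean", "aice": "seaIce", "tas": "atmos"},
}

def infer_realm(path: str, model: str) -> str | None:
    """Guess the realm of a file from the variables it contains."""
    lookup = REALM_BY_VARIABLE.get(model, {})
    with netCDF4.Dataset(path) as ds:
        realms = {lookup[v] for v in ds.variables if v in lookup}
    # Files matching zero or multiple realms are exactly the landmines
    # mentioned above, so bail out rather than guess.
    return realms.pop() if len(realms) == 1 else None
```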
I wonder if using coordinate variables, rather than data variables, might let us enumerate the space of realms more straightforwardly. I'd be very surprised if, e.g. (and I'm making these examples up off the top of my head), an atmospheric model used a …

It seems to me that coordinate variables would be more likely to be consistent than data variables? I'm just spitballing here though, so I could be talking nonsense.
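For what it's worth, the coordinate-variable heuristic could look something like the following. The coordinate names and their realm assignments here are guesses, not a vetted list:

```python
import xarray as xr

# Hypothetical coordinate-name -> realm hints; invented for illustration.
COORD_HINTS = {
    "st_ocean": "ocean",
    "depth": "ocean",
    "plev": "atmos",
    "model_level_number": "atmos",
}

def realm_from_coords(path: str) -> str | None:
    """Guess the realm of a file from its coordinate variable names."""
    with xr.open_dataset(path, decode_times=False) as ds:
        realms = {COORD_HINTS[c] for c in ds.coords if c in COORD_HINTS}
    return realms.pop() if len(realms) == 1 else None
```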
I'm not sure I'd worry too much about realm at the moment. I think, for the most part, the parsing of those can be relied on (or at least, other approaches I can think of off the top of my head aren't any better). It's getting rid of …
@dougiesquire don't we need to split the data by realm in order to reliably build the datastores? I think this is what we are doing in effect at the moment by using filename-based …
IIRC, the thinking was that if we could extract a combination of frequency and dimensions (x, y, z), where we encode missing dimensions as being size zero, we should be able to get back to something that uniquely identifies the same datasets as before without having to redact strings.
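As a sketch, assuming the frequency value comes from the existing parsing, and that the spatial dimensions really are named x, y, and z in the files (both assumptions):

```python
import xarray as xr

def dataset_key(path: str, frequency: str) -> tuple:
    """Build a key from frequency plus dimension sizes."""
    with xr.open_dataset(path, decode_times=False) as ds:
        sizes = ds.sizes  # mapping of dimension name -> length
        # Dimensions absent from the file are encoded as size zero.
        return (frequency, sizes.get("x", 0), sizes.get("y", 0), sizes.get("z", 0))
```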
That's right @charles-turner-1. My vote would be we get rid of …

I'm not sure how well this will work across all the different experiments in the catalog, and there may be performance implications for some experiments, but we have to start somewhere.
Note, the dataset keys are defined by …
@dougiesquire do you have any insight into the best way to get the coordinates into the new dataset keys? For example, I'm looking at a file from the request for a new reader (#168), and there's a ton of variables in there, all with different shapes: …
How can we discern which coordinates are the ones we want to build a key from? I've tried things like eliminating the keys that are also in …
My brain is very woolly today, but I think what you're looking for is either the indexes the dataset contains, or something very closely related to them. For example, for the last two …
I think this should also neatly handle any potential 2D coordinate complexities.

Disclaimer: this is mostly based on reading about how netCDF files internally store information during airport layovers over the past couple of weeks, so I could be mixing up terms.
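For anyone following along, xarray exposes one index per dimension coordinate, so enumerating them is cheap. The file path and the printed output here are placeholders:

```python
import xarray as xr

# Open a hypothetical output file and list its indexes.
with xr.open_dataset("output000/ocean/ocean_month.nc", decode_times=False) as ds:
    for name, index in ds.indexes.items():
        print(name, len(index))  # e.g. "time 12", "st_ocean 50", ...
```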
Yeah, I think this will unfortunately be an iterative process to find what works best. A starting point could be to define a column entry (say, …).

Coordinate variables will require special consideration, as they should just be included in the same dataset (i.e. have the same key) as the variable they are relevant to, where possible.

It's hard to know how well something like my suggestion will work without just trying it. There are lots of edge cases that are hard to imagine. I'd suggest trying with one of the large OM2 experiments.
The part that is confusing me at the moment is what happens in the case where we have multiple data variables inside the same file. Do we just take the highest-order coordinates as the 'matching' shape (in the example I gave, the 4D coordinates, which goes back to 3D once …)?
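One possible answer, sketched under the same dimension-name caveats as before: take the union of dimensions over all data variables, so the highest-order variable effectively sets the shape:

```python
import xarray as xr

def file_dims(path: str) -> dict[str, int]:
    """Union of dimension sizes over all data variables in a file."""
    dims: dict[str, int] = {}
    with xr.open_dataset(path, decode_times=False) as ds:
        for var in ds.data_vars.values():
            dims.update(var.sizes)  # lower-order variables add nothing new
    return dims
```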
I think we'll need to define the …
@dougiesquire is that strictly necessary? The build process has been managing fine so far by grouping the files by (redacted) file name, and then letting the Builder figure out how to tie the various variables together in time.

I'm starting to wonder if what we should be doing is confirming that all the files we want to group together contain the same variables, and that those variables all have the same dimensions. I think this is what we were effectively assuming with the filename grouping, although it will end up making a ridiculously long …
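A rough sketch of that consistency check, assuming we're happy to open every file in the group once:

```python
import xarray as xr

def group_is_consistent(paths: list[str]) -> bool:
    """True if every file contains the same variables with the same dims."""
    signatures = set()
    for path in paths:
        with xr.open_dataset(path, decode_times=False) as ds:
            signatures.add(tuple(sorted((v, ds[v].dims) for v in ds.data_vars)))
    return len(signatures) == 1
```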
Discussed this with @charles-turner-1 at our meeting today - going to try and use the indices recorded in the file to define the file shape, rather than work through the various coordinate combinations of each variable. Might need to open the file in xarray rather than netCDF4 to make this happen, but will see. |
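A sketch of what that index-based file shape might look like with xarray; whether `decode_times=False` is appropriate here is an open question:

```python
import xarray as xr

def file_shape(path: str) -> tuple:
    """Hashable 'shape' built from the indexes recorded in the file."""
    with xr.open_dataset(path, decode_times=False) as ds:
        return tuple(sorted((name, ds.sizes[name]) for name in ds.indexes))
```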
Is your feature request related to a problem? Please describe.
The current regex matching approach used to generate file IDs & then separate the experiment into datasets needs to be updated, as we have several datasets in the catalog for which this approach currently fails & we're liable to create more. Additional datasets such as those on #349 will eventually lead to the regex approach becoming a Frankenstein's monster of string processing as we introduce more layers of workarounds.
Describe the feature you'd like
When we open up the datasets to build the datastore here, we should pull out the extra info necessary to disambiguate the experiment outputs/file structure into datasets.
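To make the contrast concrete, a minimal sketch: the regex pattern and the content-derived ID below are both invented, purely to illustrate the move from string surgery on filenames to information read from the opened file:

```python
import re
import xarray as xr

def file_id_from_name(path: str) -> str:
    # Current style: redact date-like fragments from the filename.
    return re.sub(r"\d{4}_\d{2}", "XXXX_XX", path)  # made-up pattern

def file_id_from_contents(path: str) -> tuple:
    # Proposed style: derive the ID from what the file actually contains.
    with xr.open_dataset(path, decode_times=False) as ds:
        return (tuple(sorted(ds.data_vars)), tuple(sorted(ds.sizes.items())))
```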