Skip to content

Breaking: Materialize DimArray or DimStack From a Table #739

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 64 commits into from
Aug 10, 2025

Conversation

JoshuaBillson
Copy link
Contributor

Description:

resolves #335

This PR aims to let users construct either a DimStack or DimArray from a table with one or more coordinate columns.

Unlike the existing contructor, rows may be out of order or even missing altogether.

Performance:

The algorithm is O(n), requiring two forward passes for each dimension to determine the correct order of rows.

1000x1000: 0.005 Seconds

2000x2000: 0.025 Seconds

4000x4000: 0.108 Seconds

8000x8000: 0.376 Seconds

Example:

julia> r = DimArray(rand(UInt8, 1000, 1000), (X, Y));

julia> t = r |> DataFrame |> Random.shuffle!;

julia> restored = DimArray(t, dims(r));

julia> all(r .== restored)
true

Next Steps:

  1. Finalize method signatures.
  2. Decide what to do with missing rows. Currently, users may choose a missing value to fill in (missing by default).
  3. Add support for a :geometry column.
  4. Write test cases.
  5. Update docs.

@JoshuaBillson
Copy link
Contributor Author

There's still a few questions that need resolving:

  1. What do we do with missing rows? This can be handled explicitly in Rasters.jl, but DimArrays have no concept of missing values. We could let users choose a missing value to write, or we could encode it explicitly with missing. We could also disallow the existence of missing rows, but I don't think this is an ideal solution.
  2. How should we handle a :geometry column? I know that several packages in the Julia ecosystem are using this convention, but I don't have much experience with them. Can we assume :geometry will contain tuples of coordinates, or could they also be some sort of geometry like Meshes.Point?

@asinghvi17
Copy link
Collaborator

asinghvi17 commented Jun 20, 2024

I think it's best to test that the geometry column's elements have GI.PointTrait - then you can always extract e.g. GI.x(point) and the same for y, z, and m. You can also interrogate the dimension via GI.is3d, GI.ismeasured.

Going forward to get the geometry column it's probably best to have first(GI.geometrycolumns(table)) - the fallback implementation will give you (:geometry,), but to handle tables with other geometry columns which may be indicated by e.g. metadata, it's better to use this.

@rafaqz
Copy link
Owner

rafaqz commented Jun 20, 2024

DimensionalData.jl does not depend on GeoInterface

@JoshuaBillson
Copy link
Contributor Author

DimensionalData.jl does not depend on GeoInterface

Being able to interact with geometries in a generic fashion would help us interop with other packages. GeoInterface's only dependency is Extents, which we already depend on. After importing DimensionalData, GeoInterface loads in 0.008 seconds.

I also see that Rasters already depends on GeoInterface. We could simply ignore the :geometry column in DimensionalData, then implement the additional functionality in Rasters. However, I think the problem is general enough to be implemented here.

@rafaqz
Copy link
Owner

rafaqz commented Jun 20, 2024

It's not about deps or timing, it's about clean feature scope.

The promise here is that Rasters (and YAX) have all the geo deps and features, so non-geo people don't have to worry about them. Some of the biggest contributors here use DD for unrelated fields.

I would ignore point columns here entirely and instead write the code so it's easy for Rasters to handle them. Taking the underscores off a few functions and documenting them as an real interface will help that.

@JoshuaBillson
Copy link
Contributor Author

I've updated the docstrings for both the DimArray and DimStack constructors. By default, materializers will now use the Contains selector for irregular and non-numeric coordinates, with the option to specify an alternative when desired. We're exporting the restore_array and coords_to_index methods to be used by downstream packages like Rasters.jl. I've also written several test cases to show that we can handle tables with out of order and missing rows. If everything looks good, I think we can go ahead and merge this PR.

Copy link
Owner

@rafaqz rafaqz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, but a few minor changes are needed for it to be correct. Constructed selectors should be allowed so At can have atol. We can just construct selector types at the outer level (they're just filled with nothing.

Then there is a bit more dispatch needed to make the fast paths correct for At/Near/Contains with their standard behaviour.

JoshuaBillson and others added 8 commits August 7, 2024 23:55
Co-authored-by: Rafael Schouten <rafaelschouten@gmail.com>
Co-authored-by: Rafael Schouten <rafaelschouten@gmail.com>
Co-authored-by: Rafael Schouten <rafaelschouten@gmail.com>
Co-authored-by: Rafael Schouten <rafaelschouten@gmail.com>
Co-authored-by: Rafael Schouten <rafaelschouten@gmail.com>
Copy link

codecov bot commented Jun 28, 2025

Codecov Report

❌ Patch coverage is 84.76821% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.18%. Comparing base (886a2b2) to head (55f8017).
⚠️ Report is 1 commits behind head on breaking.

Files with missing lines Patch % Lines
src/table_ops.jl 81.44% 18 Missing ⚠️
src/stack/stack.jl 84.61% 4 Missing ⚠️
src/tree/tree.jl 75.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           breaking     #739      +/-   ##
============================================
- Coverage     84.62%   84.18%   -0.45%     
============================================
  Files            53       54       +1     
  Lines          5307     5456     +149     
============================================
+ Hits           4491     4593     +102     
- Misses          816      863      +47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tiemvanderdeure
Copy link
Collaborator

tiemvanderdeure commented Jun 28, 2025

Okay docs fails because of this:

 Error: 5 docstrings not included in the manual:

    DimensionalData.restore_array :: Tuple{AbstractVector, AbstractVector, Tuple, Any}
    DimensionalData.coords_to_indices :: Tuple{Any, Tuple}
    DimensionalData.guess_dims :: Tuple{Any}
    DimensionalData.data_col_names :: Tuple{Any, Tuple}
    DimensionalData.get_column :: Tuple{Any, Type{<:Dimension}}

But anyway I think all of these functions should be internal. Should I just remove the docstrings - or make them into comments or something?

(Never seen this before so I'm not 100% about what this is testing)

@rafaqz
Copy link
Owner

rafaqz commented Jun 29, 2025

Yeah we force docstrings to be in the docs. I think these should just be comments not docstrings

@tiemvanderdeure
Copy link
Collaborator

tiemvanderdeure commented Jun 30, 2025

All right let's see if this passes now. I think before merging we should add some documentation, though.

Another thing is that this will only work with exactly DimArray now, not other abstract dimarrays or abstractdimstacks. We probably want this functionality with rasters and maybe yaxarrays as well, without creating dispatch mayhem.

I've tried to solve this now by adding a type argument to dimarray_from_table, so you could call dimarray_from_table(Raster, table, dims; kw...) and the only lines you would have to copy are these:

https://github.com/JoshuaBillson/DimensionalData.jl/blob/aec86a9600d6ceb3af0aaa7dadc0f051c638a87d/src/array/array.jl#L519-L544

@rafaqz
Copy link
Owner

rafaqz commented Jul 1, 2025

Yes perfect that's why I removed the underscores, those a_from_b methods can become dev api

@rafaqz
Copy link
Owner

rafaqz commented Aug 10, 2025

@tiemvanderdeure this looks good to merge ?

@tiemvanderdeure
Copy link
Collaborator

Yeah I think it is! I don't have more, if you've reviewed it and found nothing then do merge :)

@rafaqz rafaqz merged commit cc4928e into rafaqz:breaking Aug 10, 2025
7 of 8 checks passed
@tiemvanderdeure
Copy link
Collaborator

I think this got merged into main and released. Was that on purpose? It is very slightly breaking.

@rafaqz
Copy link
Owner

rafaqz commented Aug 12, 2025

It got merged into breaking! which I do want to merge now but it will get a breaking bump

@tiemvanderdeure
Copy link
Collaborator

Ah I got confused because it is listed in the tag https://github.com/rafaqz/DimensionalData.jl/releases/tag/v0.29.21

Didn't know that it also listed PRs into other branches than main

@rafaqz
Copy link
Owner

rafaqz commented Aug 13, 2025

Oh I didn't know that either.. thats actually bad

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a Tables.materializer method
4 participants