
CI: add rpm build workflow #1244


Draft
wants to merge 13 commits into
base: master

Conversation

junghans
Contributor

For #1241

@aprokop
Contributor

aprokop commented Apr 24, 2025

@junghans It does not seem to work, unless I'm missing something.

@junghans
Contributor Author

2025-04-24T23:54:55.8680985Z 10/12 Test #10: ArborX_Test_SpecializedTraversals ........***Failed    0.01 sec
2025-04-24T23:54:55.8681171Z Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
2025-04-24T23:54:55.8681411Z   In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
2025-04-24T23:54:55.8681531Z   For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
2025-04-24T23:54:55.8681613Z   For unit testing set OMP_PROC_BIND=false
2025-04-24T23:54:55.8681673Z Running 10 test cases...
2025-04-24T23:54:55.8683253Z /builddir/build/BUILD/ArborX-2.0-build/ArborX-2.0/test/tstNeighborList.cpp(177): error: in "find_neighbor_list_compare_filtered_tree_traversal<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check Test::buildHalfNeighborListAndExpandToFull(exec_space, points, radius) == Test::compute_reference<MemorySpace>(exec_space, points, radius) has failed
2025-04-24T23:54:55.8683433Z   - mismatch at position 0: [( 2 7 8 24 38 41 46 53 60 63 64 91 ) == ( 2 7 8 24 38 41 53 60 63 64 91 )] is false
2025-04-24T23:54:55.8683550Z   - mismatch at position 3: [( 8 46 53 88 ) == ( 8 53 )] is false
2025-04-24T23:54:55.8683718Z   - mismatch at position 6: [( 14 20 35 36 42 48 50 68 84 94 96 ) == ( 14 20 35 36 42 48 50 84 94 96 )] is false
2025-04-24T23:54:55.8683940Z   - mismatch at position 8: [( 0 2 3 7 24 40 41 46 53 63 64 66 88 91 ) == ( 0 2 3 7 24 40 41 53 63 64 66 91 )] is false
2025-04-24T23:54:55.8684084Z   - mismatch at position 14: [( 6 35 36 37 78 80 98 ) == ( 6 35 36 37 78 98 )] is false
2025-04-24T23:54:55.8684238Z   - mismatch at position 17: [( 5 25 26 27 38 41 51 52 60 ) == ( 5 25 26 27 38 41 51 60 )] is false
2025-04-24T23:54:55.8684373Z   - mismatch at position 22: [( 31 33 55 62 67 74 ) == ( 33 55 67 )] is false
2025-04-24T23:54:55.8684528Z   - mismatch at position 25: [( 5 17 26 27 38 41 51 52 60 ) == ( 5 17 26 27 38 41 51 60 )] is false
2025-04-24T23:54:55.8684769Z   - mismatch at position 26: [( 5 17 25 27 38 41 51 52 60 73 ) == ( 5 17 25 27 38 41 51 60 73 )] is false
2025-04-24T23:54:55.8684904Z   - mismatch at position 27: [( 5 17 25 26 51 52 ) == ( 5 17 25 26 51 )] is false
2025-04-24T23:54:55.8685049Z   - mismatch at position 31: [( 22 32 33 55 61 62 67 74 ) == ( 33 55 61 67 )] is false
2025-04-24T23:54:55.8685191Z   - mismatch at position 32: [( 31 33 46 55 61 67 74 83 90 ) == ( 33 55 61 67 )] is false
2025-04-24T23:54:55.8685300Z   - mismatch at position 34: [( 61 83 90 ) == ( 61 )] is false
2025-04-24T23:54:55.8685458Z   - mismatch at position 36: [( 6 14 35 48 50 68 80 89 98 ) == ( 6 14 35 48 50 89 98 )] is false
2025-04-24T23:54:55.8685611Z   - mismatch at position 40: [( 8 24 47 53 66 88 91 94 97 ) == ( 8 24 47 53 66 91 94 97 )] is false
2025-04-24T23:54:55.8685780Z   - mismatch at position 41: [( 0 8 17 24 25 26 52 60 64 66 73 91 ) == ( 0 8 17 24 25 26 60 64 66 73 91 )] is false
2025-04-24T23:54:55.8686091Z   - mismatch at position 42: [( 2 6 7 16 35 48 53 63 72 78 80 89 94 ) == ( 2 6 7 16 35 48 53 63 72 78 89 94 )] is false
2025-04-24T23:54:55.8686226Z   - mismatch at position 46: [( 0 3 8 32 53 63 89 ) == ( 53 63 89 )] is false
2025-04-24T23:54:55.8686446Z   - mismatch at position 48: [( 6 35 36 42 50 68 80 89 ) == ( 6 35 36 42 50 89 )] is false
2025-04-24T23:54:55.8686579Z   - mismatch at position 50: [( 6 36 48 68 80 ) == ( 6 36 48 )] is false
2025-04-24T23:54:55.8686702Z   - mismatch at position 52: [( 17 25 26 27 41 60 ) == ( 60 )] is false
2025-04-24T23:54:55.8686918Z   - mismatch at position 53: [( 0 2 3 7 8 35 40 42 46 63 88 89 91 94 ) == ( 0 2 3 7 8 35 40 42 46 63 89 91 94 )] is false
2025-04-24T23:54:55.8687080Z   - mismatch at position 55: [( 22 31 32 33 61 67 74 78 99 ) == ( 22 31 32 33 61 67 78 99 )] is false
2025-04-24T23:54:55.8687239Z   - mismatch at position 61: [( 31 32 33 34 55 62 67 74 83 90 99 ) == ( 31 32 33 34 55 67 99 )] is false
2025-04-24T23:54:55.8687390Z   - mismatch at position 62: [( 22 31 61 67 74 ) == ( 67 )] is false
2025-04-24T23:54:55.8687539Z   - mismatch at position 64: [( 0 8 24 41 66 73 75 91 ) == ( 0 8 24 41 66 73 91 )] is false
2025-04-24T23:54:55.8687689Z   - mismatch at position 66: [( 8 24 40 41 64 75 91 97 ) == ( 8 24 40 41 64 91 97 )] is false
2025-04-24T23:54:55.8687837Z   - mismatch at position 67: [( 22 31 32 33 55 61 62 74 ) == ( 22 31 32 33 55 61 62 )] is false
2025-04-24T23:54:55.8687967Z   - mismatch at position 68: [( 6 36 48 50 80 89 98 ) == ( 89 98 )] is false
2025-04-24T23:54:55.8688104Z   - mismatch at position 74: [( 22 31 32 55 61 62 67 83 90 ) == ( )] is false
2025-04-24T23:54:55.8688198Z   - mismatch at position 75: [( 64 66 ) == ( )] is false
2025-04-24T23:54:55.8688332Z   - mismatch at position 80: [( 14 36 42 48 50 68 89 98 ) == ( 89 98 )] is false
2025-04-24T23:54:55.8688451Z   - mismatch at position 83: [( 32 34 61 74 90 ) == ( )] is false
2025-04-24T23:54:55.8688568Z   - mismatch at position 88: [( 3 8 40 53 91 ) == ( 91 )] is false
2025-04-24T23:54:55.8688832Z   - mismatch at position 90: [( 32 34 61 74 83 ) == ( )] is false
2025-04-24T23:54:55.8689034Z *** 1 failure is detected in the test module "Master Test Suite"

@aprokop
Contributor

aprokop commented Apr 25, 2025

@junghans Is it possible to run the CI based on the current branch, so that if I push here, it runs the change and not 2.0? Also, is there a way to speed it up? It seems to take 2+ hours.

@junghans
Contributor Author

> @junghans Is it possible to run the CI based on the current branch, so that if I push here, it runs the change and not 2.0? Also, is there a way to speed it up? It seems to take 2+ hours.

@aprokop It makes a tarball out of the current checkout; the tarball is just always named ArborX-2.0.tar.gz.
I think we could make it faster by not building the MPI versions, as the failure only happens in the serial build.

But maybe the easiest would be to just try building it with the same flags and CMake options:

2025-04-24T23:54:54.8665106Z + CFLAGS='-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer '
2025-04-24T23:54:54.8667973Z + export CFLAGS
2025-04-24T23:54:54.8669993Z + CXXFLAGS='-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer '
2025-04-24T23:54:54.8671711Z + export CXXFLAGS
2025-04-24T23:54:54.8683068Z + LDFLAGS='-Wl,-z,relro -Wl,--as-needed  -Wl,-z,pack-relative-relocs -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-hardened-ld-errors -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -Wl,--build-id=sha1 -specs=/usr/lib/rpm/redhat/redhat-package-notes '
2025-04-24T23:54:54.8684300Z + export LDFLAGS
2025-04-24T23:54:54.8684489Z + LT_SYS_LIBRARY_PATH=/usr/lib64:
2025-04-24T23:54:54.8684706Z + export LT_SYS_LIBRARY_PATH
2025-04-24T23:54:54.8684880Z + CC=gcc
2025-04-24T23:54:54.8685017Z + export CC
2025-04-24T23:54:54.8685151Z + CXX=g++
2025-04-24T23:54:54.8685280Z + export CXX
2025-04-24T23:54:54.8688017Z + /usr/bin/cmake -S . -B aarch64-redhat-linux-gnu-serial -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_INSTALL_DO_STRIP:BOOL=OFF -DCMAKE_INSTALL_PREFIX:PATH=/usr -DCMAKE_INSTALL_FULL_SBINDIR:PATH=/usr/bin -DCMAKE_INSTALL_SBINDIR:PATH=bin -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 -DBUILD_SHARED_LIBS:BOOL=ON -DARBORX_ENABLE_TESTS=ON -DARBORX_ENABLE_EXAMPLES=OFF -DARBORX_ENABLE_BENCHMARKS=OFF -DARBORX_ENABLE_MPI=OFF -DCMAKE_INSTALL_DATADIR=/usr/share -DCMAKE_INSTALL_INCLUDEDIR=/usr/include

@aprokop
Contributor

aprokop commented Apr 25, 2025

So it seems that the following change makes it pass:

--- a/src/spatial/detail/ArborX_ExpandHalfToFull.hpp
+++ b/src/spatial/detail/ArborX_ExpandHalfToFull.hpp
@@ -50,19 +50,13 @@ void expandHalfToFull(ExecutionSpace const &space, Offsets &offsets,
                                  "ArborX::Experimental::HalfToFull::counts");
   Kokkos::parallel_for(
       "ArborX::Experimental::HalfToFull::rewrite",
-      Kokkos::TeamPolicy(space, n, Kokkos::AUTO, 1),
-      KOKKOS_LAMBDA(
-          typename Kokkos::TeamPolicy<ExecutionSpace>::member_type const
-              &member) {
-        auto const i = member.league_rank();
-        auto const first = offsets_orig(i);
-        auto const last = offsets_orig(i + 1);
-        Kokkos::parallel_for(
-            Kokkos::TeamVectorRange(member, last - first), [&](int j) {
-              int const k = indices_orig(first + j);
-              indices(Kokkos::atomic_fetch_inc(&counts(i))) = k;
-              indices(Kokkos::atomic_fetch_inc(&counts(k))) = i;
-            });
+      Kokkos::RangePolicy(space, 0, n), KOKKOS_LAMBDA(int i) {
+        for (int j = offsets_orig(i); j < offsets_orig(i + 1); ++j)
+        {
+          int const k = indices_orig(j);
+          indices(Kokkos::atomic_fetch_inc(&counts(i))) = k;
+          indices(Kokkos::atomic_fetch_inc(&counts(k))) = i;
+        }
       });
   Kokkos::Profiling::popRegion();
 }

I don't understand why. Both versions of the code seem valid to me. It seems to only affect aarch64. Macs use aarch64, but the native Mac toolchain does not support OpenMP, so I never ran it there, and it passes in Serial.
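
For readers who do not have the surrounding ArborX source at hand, the pattern in the diff above can be sketched in plain C++ with `std::thread` and `std::atomic` instead of Kokkos. This is only an illustration, not the ArborX implementation: `NeighborList`, `expand_half_to_full`, and `cursor` are hypothetical names, and the sketch assumes that the write cursors (the role played by `counts` in the diff) start at each row's offset in the full list, which the diff itself does not show.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical CSR-style neighbor list:
// row i owns indices[offsets[i] .. offsets[i + 1]).
struct NeighborList
{
  std::vector<int> offsets;
  std::vector<int> indices;
};

// Expand a half list (each pair stored once, under one endpoint) into a
// full list (each pair stored under both endpoints).
NeighborList expand_half_to_full(NeighborList const &half, int num_threads = 4)
{
  int const n = static_cast<int>(half.offsets.size()) - 1;

  // Pass 1 (serial): count how many entries each row has in the full list.
  std::vector<int> row_size(n, 0);
  for (int i = 0; i < n; ++i)
    for (int j = half.offsets[i]; j < half.offsets[i + 1]; ++j)
    {
      ++row_size[i];
      ++row_size[half.indices[j]];
    }

  NeighborList full;
  full.offsets.assign(n + 1, 0);
  for (int i = 0; i < n; ++i)
    full.offsets[i + 1] = full.offsets[i] + row_size[i];
  full.indices.assign(full.offsets[n], -1);

  // Per-row write cursors, starting at each row's offset.
  // Assumption: this mirrors what `counts` holds in the diff.
  std::vector<std::atomic<int>> cursor(n);
  for (int i = 0; i < n; ++i)
    cursor[i].store(full.offsets[i]);

  // Pass 2 (parallel): every pair (i, k) is written into row i and row k.
  // fetch_add hands out a unique slot even when several threads hit a row.
  auto worker = [&](int t) {
    for (int i = t; i < n; i += num_threads)
      for (int j = half.offsets[i]; j < half.offsets[i + 1]; ++j)
      {
        int const k = half.indices[j];
        full.indices[cursor[i].fetch_add(1)] = k;
        full.indices[cursor[k].fetch_add(1)] = i;
      }
  };
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t)
    threads.emplace_back(worker, t);
  for (auto &th : threads)
    th.join();

  // The order inside a row depends on thread scheduling, so sort each row
  // to get a canonical form before comparing against a reference.
  for (int i = 0; i < n; ++i)
    std::sort(full.indices.begin() + full.offsets[i],
              full.indices.begin() + full.offsets[i + 1]);
  return full;
}
```

In this sketch, whether the inner loop over `j` runs sequentially (as in the `RangePolicy` version) or is itself distributed (as in the `TeamPolicy` version) should not change the sorted result, because every slot is claimed atomically. That is the sense in which both versions look valid.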

@junghans
Contributor Author

I am not sure either, but maybe @dalg24 knows....

@junghans
Contributor Author

Either way, I patched that in and rebuilt: https://koji.fedoraproject.org/koji/taskinfo?taskID=131982908

@aprokop
Contributor

aprokop commented Apr 26, 2025

Hmm, the latest patch failed in a different place:


/builddir/build/BUILD/ArborX-2.0-build/ArborX-2.0/test/tstDBSCAN.cpp(185):
error: in "DBSCAN/dbscan<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>":
check verifyDBSCAN( space, hidden_points, r - (Coordinate)0.1, 2, dbscan(space, hidden_points, r - (Coordinate)0.1, 2, params)) has failed

So, it seems, the failures are intermittent. I really need to be able to run things in a loop to properly debug this.
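
A minimal sketch of what "running things in a loop" could look like, assuming a POSIX-like shell is available to `std::system`. The helper name `first_failing_run` is hypothetical and not part of ArborX or its CI; recent CTest versions also offer `ctest --repeat until-fail:<n>` for the same purpose.

```cpp
#include <cstdlib>
#include <string>

// Run `command` up to `max_runs` times through the system shell.
// Returns the 1-based index of the first run that exits with a non-zero
// status, or 0 if every run succeeded.
int first_failing_run(std::string const &command, int max_runs)
{
  for (int run = 1; run <= max_runs; ++run)
    if (std::system(command.c_str()) != 0)
      return run;
  return 0;
}
```

Calling it with the path to one of the failing test executables (the path depends on the local build directory) and a large `max_runs` would give a rough idea of how often an intermittent failure shows up.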

@junghans
Contributor Author

The CI2 is failing in 48min.

@@ -58,7 +58,7 @@ void expandHalfToFull(ExecutionSpace const &space, Offsets &offsets,
         auto const first = offsets_orig(i);
         auto const last = offsets_orig(i + 1);
         Kokkos::parallel_for(
-            Kokkos::TeamVectorRange(member, last - first), [&](int j) {
+            Kokkos::TeamThreadRange(member, last - first), [&](int j) {
Collaborator

TeamVectorRange looks more correct here (so that it would also work with vector_length > 1).

@junghans
Contributor Author

@aprokop let me know if you have a patch set I should test on Fedora again.

@junghans
Contributor Author

@aprokop Any update on this? Anything I can help with?

@aprokop
Contributor

aprokop commented May 13, 2025

@junghans I think I'm essentially stuck here. None of it makes any sense to me. I will try to add more printouts and see if I can track it down further. I wonder if it is an optimization issue again, and whether the problem would disappear with "-O0".

@aprokop
Contributor

aprokop commented May 14, 2025

Either the failure is intermittent (which it could be), or it is similar to #1186.

@junghans
Contributor Author

@dalg24 any ideas?
