CLOUDP-338084 - removing and refactoring agent matrix from pipeline.py and atomic_pipeline.py #346

nammn · 2025-08-14T09:32:17Z

Summary

Why we do this

Since we no longer do an agent matrix release, there is no need to release every agent listed in release.json. Instead, we should only release an agent when PCT adds a new one. That happens during OM and CM bumps; the new detection script should handle this and release the images.

What changes

adding a detection script that detects agent changes in release.json between the local checkout and origin/master and uses them as the basis for the release
streamline evergreen.yml and remove matrix builds/releases
streamline agent builds on pipeline.py
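
As a rough illustration of the detection idea, the sketch below diffs the agent/tools pairs found in two release.json-like dictionaries. It is not the script added in this PR: `extract_agent_pairs`, `detect_new_agents`, and the flat `"agents"` list layout are hypothetical and chosen only for brevity; the version strings are sample values borrowed from the logs in this description.

```python
from typing import Dict, List, Set, Tuple

# (agent_version, tools_version), the tuple shape seen in the pipeline logs
AgentPair = Tuple[str, str]


def extract_agent_pairs(release: Dict) -> Set[AgentPair]:
    """Collect every (agent, tools) pair from a release.json-like dict.

    ASSUMPTION: the flat "agents" list is a hypothetical layout; the real
    release.json is structured differently.
    """
    return {
        (entry["agent_version"], entry["tools_version"])
        for entry in release.get("agents", [])
    }


def detect_new_agents(base: Dict, local: Dict) -> List[AgentPair]:
    """Return the pairs present locally but absent from the base revision."""
    return sorted(extract_agent_pairs(local) - extract_agent_pairs(base))


base = {"agents": [{"agent_version": "13.30.0.9590-1", "tools_version": "100.12.2"}]}
local = {
    "agents": base["agents"]
    + [{"agent_version": "13.38.0.9654-1", "tools_version": "100.12.2"}]
}

print(detect_new_agents(base, local))  # [('13.38.0.9654-1', '100.12.2')]
print(detect_new_agents(base, base))   # []
```

Using a set difference means removed or unchanged agents never trigger a build; only newly added pairs do.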

Evergreen configuration cleanup and simplification:

Removed obsolete tasks and buildvariants related to agent image releases, such as release_agent_operator_release, release_agents_on_ecr_conditional, and init_release_agents_on_ecr, to streamline the release process.

Pipeline logic refactoring:

Refactored the build_agent_default_case function in pipeline.py to use the new detect_ops_manager_changes function for determining which agent versions to build, and eliminated the separate build_agent_on_agent_bump logic.
Simplified agent handling in pipeline.py to match atomic_pipeline.py
Updated the image builder function mapping so that both "agent" and "agent-pct" use the unified build_agent_default_case function.

Proof of Work

no agent needed to be released - patch: Link

[2025/08/14 10:56:13.642] === Detecting OM Mapping Changes (Local vs Base) ===
[2025/08/14 10:56:13.643] INFO     2025-08-14 08:56:13,642 [atomic_pipeline]  No changes detected, skipping agent build
[2025/08/14 10:56:13.725] Finished command 'subprocess.exec' in function 'pipeline' (step 3.5 of 3) in 4.209554597s.

manually changing release.json (the changeset; it's a manually rsynced patch) leads to an agent release/build. Note: the task is not passing because I am releasing a non-existent image - Link

[2025/08/14 17:42:50.558] INFO     2025-08-14 15:42:50,558 [atomic_pipeline]  ======= Agent versions to build [('13.30.0.9590-1', '100.12.2')] =======
[2025/08/14 17:42:50.558] INFO     2025-08-14 15:42:50,558 [atomic_pipeline]  ======= Building Agent ('13.30.0.9590-

CM bump worked with this changeset link + the related patch
- this caused the agent build in init_test_run to release the agent to ECR, since release.json has changed

[2025/08/14 13:59:35.068] === Detecting OM Mapping Changes (Local vs Base) ===
[2025/08/14 13:59:35.068] INFO     2025-08-14 11:59:35,067 [atomic_pipeline]  Building Agent versions: [('13.38.0.9654-1', '100.12.2')]
[2025/08/14 13:59:35.068] INFO     2025-08-14 11:59:35,068 [atomic_pipeline]  Running with factor of None
[2025/08/14 13:59:35.068] INFO     2025-08-14 11:59:35,068 [atomic_pipeline]  ======= Agent versions to build [('13.38.0.9654-1', '100.12.2')] =======
[2025/08/14 13:59:35.068] INFO     2025-08-14 11:59:35,068 [atomic_pipeline]  ======= Building Agent ('13.38.0.9654-1', '100.12.2') (0/1)

we have a dedicated variant that can also release all agents: link

Example Cases

A new OM/CM bump workflow

the publish_om/cm and release_agent variants are triggered
the detection script detects a change in release.json
the new agent is released

Checklist

Have you linked a jira ticket and/or is the ticket in the title?
Have you checked whether your jira ticket required DOCSP changes?
Have you added changelog file?
- use skip-changelog label if not needed
- refer to Changelog files and Release Notes section in CONTRIBUTING.md for more details

github-actions · 2025-08-14T09:33:00Z

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

Bug Fixes

This change fixes the current complex and difficult-to-maintain architecture for stateful set containers, which relies on an "agent matrix" to map operator and agent versions and led to a large number of images.
We solve this by shifting to a 3-container setup. This new design eliminates the need for the operator-version/agent-version matrix by adding one additional container containing all required binaries. This architecture maps to what we already do with the mongodb-database container.
Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.

Other Changes

Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm it is possible to disable permissions for PersistentVolumeClaim resources by setting operator.enablePVCResize value to false (true by default). When enabled, previously these permissions were part of the primary operator role. With this change, permissions have a separate role.
subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, making the operator roles always have subresource permissions. This setting was introduced as a temporary solution for this OpenShift issue. The issue has since been resolved and the setting is no longer needed.

nammn · 2025-08-14T12:01:37Z

.evergreen.yml

      - name: build_readiness_probe_image
        variant: init_test_run
      - name: build_upgrade_hook_image
        variant: init_test_run
      - name: build_mco_test_image
        variant: init_test_run
+      - name: build_agent_images_ubi


We still run this on every patch; the script checks whether a build is required and skips it if there are no changes.

Why still run it?
On CM and OM bump PRs we still need the agent in ECR first. Running it on every patch ensures the agent is built and pushed to ECR before anything depends on it.
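
The run-always-but-skip behaviour described here can be sketched as an early return. This is only an illustration: `plan_agent_build` is a hypothetical helper, and the log messages are copied from the task output quoted in the PR description.

```python
import logging
from typing import List, Sequence, Tuple

logger = logging.getLogger("atomic_pipeline")

AgentPair = Tuple[str, str]  # (agent_version, tools_version)


def plan_agent_build(detected: Sequence[AgentPair]) -> List[AgentPair]:
    """Return the agent versions to build; an empty list means skip the task."""
    if not detected:
        # The task still runs on every patch, but it exits early here.
        logger.info("No changes detected, skipping agent build")
        return []
    logger.info("Building Agent versions: %s", list(detected))
    return list(detected)
```

Keeping the task in every patch and making the skip decision inside the script means a bump PR needs no special Evergreen wiring to get its agent into ECR.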

nammn · 2025-08-14T13:44:59Z

scripts/release/pipeline_main.py

@@ -251,6 +252,11 @@ def main():
        type=int,
        help="Number of agent builds to run in parallel, defaults to number of cores",
    )
+    parser.add_argument(


defaults to false
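
For readers unfamiliar with the pattern, a boolean flag that defaults to false is usually declared with `store_true`. The sketch below is an assumption about the shape of the parser, not the code from this diff: the flag name `--all-agents` is taken from the later comment on .evergreen-functions.yml, while the positional `image` argument and the boolean form of `--parallel` are guesses based on the command line shown there.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, trimmed-down parser; argument names are assumptions."""
    parser = argparse.ArgumentParser(description="pipeline_main sketch")
    parser.add_argument("image", help="Name of the image to build, e.g. 'agent'")
    parser.add_argument("--parallel", action="store_true", help="Build in parallel")
    parser.add_argument(
        "--all-agents",
        action="store_true",  # absent flag -> False, present flag -> True
        default=False,
        help="Build every agent instead of only the newly detected ones",
    )
    return parser


# An empty expansion contributes no argument, so the flag stays False.
print(build_parser().parse_args(["--parallel", "agent"]).all_agents)
# Passing the flag explicitly switches it on.
print(build_parser().parse_args(["--parallel", "agent", "--all-agents"]).all_agents)
```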

nammn · 2025-08-14T13:59:31Z

.evergreen-functions.yml

@@ -540,7 +540,7 @@ functions:
        shell: bash
        <<: *e2e_include_expansions_in_env
        working_dir: src/github.com/mongodb/mongodb-kubernetes
-        binary: scripts/dev/run_python.sh scripts/release/pipeline_main.py --parallel ${image_name}
+        binary: scripts/dev/run_python.sh scripts/release/pipeline_main.py --parallel ${image_name} ${all_agents}


The all_agents expansion is empty by default; in the manual "release agents on ECR" variant it is set to --all-agents.

MaciejKaras · 2025-08-14T14:33:04Z

scripts/test_detect_ops_manager_changes.py

this should be in scripts/release/tests otherwise it is not run in CI. @mircea-cosbuc knows more about it

maybe we should move scripts/release/tests to scripts/tests/release?

so this test would go to scripts/tests

it is though 😕
https://spruce.mongodb.com/task/mongodb_kubernetes_unit_tests_unit_tests_python_patch_f0df6f42547f5602c5c4ab120d1d7e3d0db05797_689deba0420e560007c79dab_25_08_14_13_58_58/logs?execution=0

MaciejKaras · 2025-08-14T14:38:14Z

pipeline.py

-    # We only need [latest agents (for each OM major version and for CM) x patch ID] for patches
-    else:
-        agent_versions_to_build = gather_latest_agent_versions(release_json, build_configuration.agent_to_build)
+def build_agent(build_configuration: BuildConfiguration):


are we still using the legacy pipeline anywhere for agents?

maybe - right now i want things to be consistent

I think this still needs answering. Since you're changing the image build for the agent there should be a decision on whether moving to atomic pipeline is happening. The title and the description of the PR imply that we're getting rid of pipeline.py.

Yes, I was unclear. What I meant is: we should move to atomic_pipeline so that we are not changing two files at the same time (one for the release, one for patches).

For this reason I've made the agent handling in pipeline.py and atomic_pipeline.py identical, to avoid the edge case where we overlook something and still call pipeline.py

we plan to migrate to atomic for releases: #344

but that PR does not touch agent images at all.

If we are in fact moving away from agent image building in the legacy pipeline code, we should either not make changes to it or remove the redundant code completely

the problem now is that we have duplicated code in pipeline and atomic.

One is for releasing and one is for patches, but for agent releases both contain the same duplicated code. I think it's better to just use atomic with the release scenario (937e953). If you strongly disagree I can move back to pipeline and keep them duplicated

mircea-cosbuc · 2025-08-14T15:40:36Z

pipeline.py

-    # We only need [latest agents (for each OM major version and for CM) x patch ID] for patches
-    else:
-        agent_versions_to_build = gather_latest_agent_versions(release_json, build_configuration.agent_to_build)
+def build_agent(build_configuration: BuildConfiguration):


I think this still needs answering. Since you're changing the image build for the agent there should be a decision on whether moving to atomic pipeline is happening. The title and the description of the PR imply that we're getting rid of pipeline.py.

scripts/release/atomic_pipeline.py

nammn · 2025-08-14T16:17:18Z

.evergreen-functions.yml

@@ -517,30 +517,36 @@ functions:
          # docker buildx needs the moby/buildkit image when setting up a builder so we pull it from our mirror
          docker buildx create --driver=docker-container --driver-opt=image=268558157000.dkr.ecr.eu-west-1.amazonaws.com/docker-hub-mirrors/moby/buildkit:buildx-stable-1 --use
          docker buildx inspect --bootstrap
-    - command: ec2.assume_role


This is cherry-picked from https://github.com/mongodb/mongodb-kubernetes/pull/344/files#diff-ad8722e626fc7bc08be6765b8268550446b1fb934c1a7eb6a5766d6446f92ad1 and I need it for the agent. We can merge either PR first and I can fix the merge conflict; this ensures we get the correct result either way

nammn · 2025-08-14T16:31:59Z

.evergreen-functions.yml

-        binary: scripts/dev/run_python.sh scripts/release/pipeline_main.py --parallel ${image_name}
+        env:
+          git_tag: ${triggered_by_git_tag}
+        binary: scripts/dev/run_python.sh scripts/release/pipeline_main.py ${image_name} --build-scenario release ${git_tag|--version ${git_tag}}


My change here makes the version optional, but this is hacky; we should do it properly for OM and the agent in a next step: https://jira.mongodb.org/browse/CLOUDP-338152

anandsyncs · 2025-08-14T22:31:12Z

scripts/release/detect_ops_manager_changes.py

+
+    current_release = load_current_release_json()
+    if not current_release:
+        print("ERROR: Could not load current local release.json")


Why don't we use a logger in this file?
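
One way the suggestion could look is sketched below: a module-level logger replaces `print`, so the message carries a level and goes through whatever handlers the pipeline already configures. The helper `load_release_json` and its `path` parameter are hypothetical and are not the PR's `load_current_release_json`.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def load_release_json(path: Path) -> Optional[dict]:
    """Hypothetical loader: return the parsed file, or None if it cannot be read."""
    try:
        with path.open() as handle:
            return json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        # logger.error instead of print("ERROR: ...") keeps level and source module
        logger.error("Could not load %s: %s", path, exc)
        return None
```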

anandsyncs · 2025-08-14T22:32:04Z

scripts/release/detect_ops_manager_changes.py

+            ["git", "show", f"{commit}:{file_path}"], capture_output=True, text=True, check=True, timeout=30
+        )
+        return result.stdout
+    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):


Would it make sense to properly log these exceptions?
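
A possible shape for this, assuming the function keeps returning `None` on failure: each exception type gets its own log line with enough context to debug a failed lookup. `read_file_at_commit` is a hypothetical name, and catching `OSError` (for example when git is not installed) is an addition that is not in the diff above.

```python
import logging
import subprocess
from typing import Optional

logger = logging.getLogger(__name__)


def read_file_at_commit(commit: str, file_path: str) -> Optional[str]:
    """Return the content of file_path at commit, or None if git show fails."""
    try:
        result = subprocess.run(
            ["git", "show", f"{commit}:{file_path}"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
        return result.stdout
    except subprocess.CalledProcessError as exc:
        # stderr is captured as text, so it can be logged directly
        logger.warning(
            "git show %s:%s failed with exit code %s: %s",
            commit, file_path, exc.returncode, (exc.stderr or "").strip(),
        )
    except subprocess.TimeoutExpired:
        logger.warning("git show %s:%s timed out after 30s", commit, file_path)
    except OSError as exc:
        # ASSUMPTION: not handled in the original diff; covers a missing git binary
        logger.warning("Could not run git: %s", exc)
    return None
```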

anandsyncs · 2025-08-14T22:34:07Z

LGTM!
No blocking issues from my side.
Only some comments about logging.

removing pipeline and atomic matrix

d95b296

nammn added the skip-changelog label Aug 14, 2025

nammn changed the title ~~removing pipeline and atomic matrix~~ CLOUDP-338084 - removing pipeline and atomic matrix Aug 14, 2025

removing pipeline and atomic matrix

3357e31

nammn marked this pull request as ready for review August 14, 2025 09:36

nammn requested a review from a team as a code owner August 14, 2025 09:36

nammn requested review from anandsyncs and lucian-tosa August 14, 2025 09:36

nammn added 4 commits August 14, 2025 12:03

fix bc

a2bb281

fix python tests

832739c

push agent

c581be0

push agent

115b072

nammn commented Aug 14, 2025

nammn requested review from mircea-cosbuc, viveksinghggits and Julien-Ben August 14, 2025 12:42

nammn added 2 commits August 14, 2025 15:43

add a way to push all agents

7d78b09

add a way to push all agents

18e2687

nammn commented Aug 14, 2025

nammn added 3 commits August 14, 2025 15:45

Merge branch 'master' into remove-static-pipeline

a08ab76

push agent

3477571

Merge branch 'remove-static-pipeline' of github.com:mongodb/mongodb-kubernetes into remove-static-pipeline

8f71280

nammn commented Aug 14, 2025

nammn requested a review from MaciejKaras August 14, 2025 14:00

MaciejKaras reviewed Aug 14, 2025

nammn added 2 commits August 14, 2025 17:02

change release folder

931a774

linter

5ad2a77

mircea-cosbuc reviewed Aug 14, 2025

nammn changed the title ~~CLOUDP-338084 - removing pipeline and atomic matrix~~ CLOUDP-338084 - removing and refactoring agent matrix from pipeline.py and atomic_pipeline.py Aug 14, 2025

linter

3ad71b7

mircea-cosbuc approved these changes Aug 14, 2025

add release for agent

c3dd864

nammn commented Aug 14, 2025

add release for agent

937e953

nammn commented Aug 14, 2025

anandsyncs reviewed Aug 14, 2025

anandsyncs approved these changes Aug 14, 2025
