CLOUDP-338084 - removing and refactoring agent matrix from pipeline.py and atomic_pipeline.py #346

Open
wants to merge 16 commits into
base: master

Conversation

nammn
Collaborator

@nammn nammn commented Aug 14, 2025

Summary

Why we do this

  • Since we no longer do an agent matrix release, there is no need to release all the agents listed in release.json. Instead, we should only release an agent when PCT adds a new one. That happens during OM and CM bumps; the new detection script should handle this and release the images.

What changes

  • adding a detection script that detects agent changes between local vs origin/master for release.json and uses that as a base to do the release
  • streamline evergreen.yml and remove matrix builds/releases
  • streamline agent builds on pipeline.py
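
A minimal sketch of the detection idea in the list above, not the PR's actual script. It assumes a simplified, hypothetical release.json layout in which each entry carries an agent version and a tools version; the real schema and the real detection function may look different. The names `agent_pairs` and `detect_new_agents` and the keys `mapping`, `agent_version` and `tools_version` are illustrative only.

```python
from typing import Dict, List, Set, Tuple

AgentPair = Tuple[str, str]  # (agent version, tools version)


def agent_pairs(release: Dict) -> Set[AgentPair]:
    """Collect all (agent, tools) pairs from a release.json-like dict.

    The "mapping" key and its entry fields are assumptions for this sketch.
    """
    entries = release.get("mapping", {}).values()
    return {(e["agent_version"], e["tools_version"]) for e in entries}


def detect_new_agents(base: Dict, local: Dict) -> List[AgentPair]:
    """Return pairs present in the local file but missing from the base file."""
    return sorted(agent_pairs(local) - agent_pairs(base))


# Base stands in for origin/master, local for the working tree.
base = {
    "mapping": {
        "om-8.0": {"agent_version": "13.30.0.9590-1", "tools_version": "100.12.2"},
    }
}
local = {
    "mapping": {
        "om-8.0": {"agent_version": "13.30.0.9590-1", "tools_version": "100.12.2"},
        "cm": {"agent_version": "13.38.0.9654-1", "tools_version": "100.12.2"},
    }
}

print(detect_new_agents(base, local))  # [('13.38.0.9654-1', '100.12.2')]
print(detect_new_agents(local, local))  # [] -> nothing to build
```

An empty result corresponds to the "no changes detected, skip the agent build" path; a non-empty result is the list of agent versions to build and release.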

Evergreen configuration cleanup and simplification:

  • Removed obsolete tasks and buildvariants related to agent image releases, such as release_agent_operator_release, release_agents_on_ecr_conditional, and init_release_agents_on_ecr, to streamline the release process.

Pipeline logic refactoring:

  • Refactored the build_agent_default_case function in pipeline.py to use the new detect_ops_manager_changes function for determining which agent versions to build, and eliminated the separate build_agent_on_agent_bump logic.
  • Simplified the agent handling in pipeline.py to match atomic_pipeline.py.
  • Updated the image builder function mapping so that both "agent" and "agent-pct" use the unified build_agent_default_case function.
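
The unified mapping from the last bullet could look roughly like the sketch below. Only the image names "agent" and "agent-pct" and the function name `build_agent_default_case` come from this description; the dictionary name `IMAGE_BUILDERS`, the `dict`-based configuration and the stand-in function body are assumptions made for illustration.

```python
from typing import Callable, Dict


def build_agent_default_case(build_configuration: dict) -> str:
    # Stand-in body: the real function builds agent images.
    return f"building agents for image {build_configuration['image']}"


# Hypothetical registry: both image names resolve to the same builder,
# so there is no separate code path for "agent-pct" any more.
IMAGE_BUILDERS: Dict[str, Callable[[dict], str]] = {
    "agent": build_agent_default_case,
    "agent-pct": build_agent_default_case,
}

for image_name in ("agent", "agent-pct"):
    builder = IMAGE_BUILDERS[image_name]
    print(builder({"image": image_name}))
```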

Proof of Work

  • no agent needed to be released - patch: Link
[2025/08/14 10:56:13.642] === Detecting OM Mapping Changes (Local vs Base) ===
[2025/08/14 10:56:13.643] INFO     2025-08-14 08:56:13,642 [atomic_pipeline]  No changes detected, skipping agent build
[2025/08/14 10:56:13.725] Finished command 'subprocess.exec' in function 'pipeline' (step 3.5 of 3) in 4.209554597s.
[2025/08/14 17:42:50.558] INFO     2025-08-14 15:42:50,558 [atomic_pipeline]  ======= Agent versions to build [('13.30.0.9590-1', '100.12.2')] =======
[2025/08/14 17:42:50.558] INFO     2025-08-14 15:42:50,558 [atomic_pipeline]  ======= Building Agent ('13.30.0.9590-
  • cm bump worked with this changeset link + the related patch
    • this caused the agent build in init_test_run to release the agent to ECR, because release.json had changed
[2025/08/14 13:59:35.068] === Detecting OM Mapping Changes (Local vs Base) ===
[2025/08/14 13:59:35.068] INFO     2025-08-14 11:59:35,067 [atomic_pipeline]  Building Agent versions: [('13.38.0.9654-1', '100.12.2')]
[2025/08/14 13:59:35.068] INFO     2025-08-14 11:59:35,068 [atomic_pipeline]  Running with factor of None
[2025/08/14 13:59:35.068] INFO     2025-08-14 11:59:35,068 [atomic_pipeline]  ======= Agent versions to build [('13.38.0.9654-1', '100.12.2')] =======
[2025/08/14 13:59:35.068] INFO     2025-08-14 11:59:35,068 [atomic_pipeline]  ======= Building Agent ('13.38.0.9654-1', '100.12.2') (0/1)
  • we have a dedicated variant that can also release all agents: link

Example Cases

A new OM/CM bump workflow

  • the publish_om/cm and release_agent variants are triggered
  • the detection script detects a change in release.json
  • the new agent is released

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

Bug Fixes

  • This change fixes the complex and difficult-to-maintain architecture for stateful set containers, which relied on an "agent matrix" to map operator and agent versions and led to a very large number of images.
  • We solve this by shifting to a 3-container setup. This new design eliminates the need for the operator-version/agent-version matrix by adding one additional container containing all required binaries. This architecture maps to what we already do with the mongodb-database container.
  • Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.

Other Changes

  • Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm it is possible to disable permissions for PersistentVolumeClaim resources by setting operator.enablePVCResize value to false (true by default). When enabled, previously these permissions were part of the primary operator role. With this change, permissions have a separate role.
  • subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, making the operator roles always have subresource permissions. This setting was introduced as a temporary solution for this OpenShift issue. The issue has since been resolved and the setting is no longer needed.

@nammn nammn added the skip-changelog label (Use this label in Pull Request to not require new changelog entry file) Aug 14, 2025
@nammn nammn changed the title from "removing pipeline and atomic matrix" to "CLOUDP-338084 - removing pipeline and atomic matrix" Aug 14, 2025
@nammn nammn marked this pull request as ready for review August 14, 2025 09:36
@nammn nammn requested a review from a team as a code owner August 14, 2025 09:36
- name: build_readiness_probe_image
  variant: init_test_run
- name: build_upgrade_hook_image
  variant: init_test_run
- name: build_mco_test_image
  variant: init_test_run
- name: build_agent_images_ubi
Collaborator Author

We still run this on every patch; the script just checks whether a build is required and skips it if there are no changes.

Why still run it?
On CM and OM bump PRs we still need the agent in ECR first. Running the task on every patch ensures the agent is built and pushed to ECR first.

@@ -251,6 +252,11 @@ def main():
type=int,
help="Number of agent builds to run in parallel, defaults to number of cores",
)
parser.add_argument(
Collaborator Author

defaults to false

@@ -540,7 +540,7 @@ functions:
shell: bash
<<: *e2e_include_expansions_in_env
working_dir: src/github.com/mongodb/mongodb-kubernetes
- binary: scripts/dev/run_python.sh scripts/release/pipeline_main.py --parallel ${image_name}
+ binary: scripts/dev/run_python.sh scripts/release/pipeline_main.py --parallel ${image_name} ${all_agents}
Collaborator Author

The all_agents expansion is empty by default, but in the manual "release agents on ecr" variant it will be set to --all-agents.
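
A small sketch of how an empty versus a set `all_agents` expansion would reach the argument parser. The `--all-agents` flag name and its false default come from the comments in this review; the positional `image` argument, treating `--parallel` as a boolean flag, the help texts and the `parse_cli` helper are assumptions for illustration, not the real `pipeline_main.py` interface.

```python
import argparse
import shlex


def parse_cli(all_agents_expansion: str) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Hypothetical arguments modelled on the command line in the diff above.
    parser.add_argument("image", help="name of the image to build")
    parser.add_argument("--parallel", action="store_true", help="build in parallel")
    parser.add_argument(
        "--all-agents",
        action="store_true",  # flag absent -> False
        help="build all agents instead of only newly detected ones",
    )
    # Model the shell: an empty expansion contributes no extra argument.
    argv = shlex.split(f"--parallel agent {all_agents_expansion}")
    return parser.parse_args(argv)


print(parse_cli("").all_agents)              # False
print(parse_cli("--all-agents").all_agents)  # True
```

With this shape, regular patches keep the detection-based behaviour, and only the manual variant opts in to building everything.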

@nammn nammn requested a review from MaciejKaras August 14, 2025 14:00
Collaborator

this should be in scripts/release/tests otherwise it is not run in CI. @mircea-cosbuc knows more about it

Collaborator

maybe we should move scripts/release/tests to scripts/tests/release?

so this test would go to scripts/tests

Collaborator Author

# We only need [latest agents (for each OM major version and for CM) x patch ID] for patches
else:
agent_versions_to_build = gather_latest_agent_versions(release_json, build_configuration.agent_to_build)
def build_agent(build_configuration: BuildConfiguration):
Collaborator

are we still using the legacy pipeline anywhere for agents?

Collaborator Author

maybe - right now i want things to be consistent

Member

I think this still needs answering. Since you're changing the image build for the agent, there should be a decision on whether the move to the atomic pipeline is happening. The title and the description of the PR imply that we're getting rid of pipeline.py.

Collaborator Author

Yes, I was unclear. What I meant is: we should move to atomic_pipeline so that we are not changing two files at the same time (for releases as well as for patches).

For this reason I've made the agent handling in pipeline.py and atomic_pipeline.py the same, to avoid the edge case where we overlook something and still call pipeline.py.

we plan to migrate to atomic for releases: #344

Collaborator

but that PR does not touch agent images at all.

If we are in fact moving away from agent image building in the legacy pipeline code, we should either not make changes to it or remove the redundant code completely.

Collaborator Author

@nammn nammn Aug 14, 2025

the problem now is that we have duplicated code in pipeline and atomic.

One is for releasing and one is for patches, but for agent releases both contain the same duplicated code. I think it's better to just use atomic with scenario release 937e953. If you strongly disagree I can move back to pipeline and have them duplicated.

@nammn nammn changed the title from "CLOUDP-338084 - removing pipeline and atomic matrix" to "CLOUDP-338084 - removing and refactoring agent matrix from pipeline.py and atomic_pipeline.py" Aug 14, 2025
@@ -517,30 +517,36 @@ functions:
# docker buildx needs the moby/buildkit image when setting up a builder so we pull it from our mirror
docker buildx create --driver=docker-container --driver-opt=image=268558157000.dkr.ecr.eu-west-1.amazonaws.com/docker-hub-mirrors/moby/buildkit:buildx-stable-1 --use
docker buildx inspect --bootstrap
- command: ec2.assume_role
Collaborator Author

this is cherry-picked from https://github.com/mongodb/mongodb-kubernetes/pull/344/files#diff-ad8722e626fc7bc08be6765b8268550446b1fb934c1a7eb6a5766d6446f92ad1 and i need to use this for the agent. We can merge either and I can fix the merge conflict - but this ensures we will get the correct merge either way

binary: scripts/dev/run_python.sh scripts/release/pipeline_main.py --parallel ${image_name}
env:
git_tag: ${triggered_by_git_tag}
binary: scripts/dev/run_python.sh scripts/release/pipeline_main.py ${image_name} --build-scenario release ${git_tag|--version ${git_tag}}
Collaborator Author

My change here is to make the version optional, but this is hacky and we should do it properly for OM and the agent in a next step: https://jira.mongodb.org/browse/CLOUDP-338152


current_release = load_current_release_json()
if not current_release:
    print("ERROR: Could not load current local release.json")
Contributor

@anandsyncs anandsyncs Aug 14, 2025

Why don't we use a logger in this file?

["git", "show", f"{commit}:{file_path}"], capture_output=True, text=True, check=True, timeout=30
)
return result.stdout
except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
Contributor

Would it make sense to properly log these exceptions?
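
One way the two logging suggestions could be addressed is sketched below. The `git show` arguments, the 30 second timeout and the two subprocess exception types mirror the snippet under review; the function name `read_file_at_commit`, the use of `subprocess.run`, the extra `OSError` handler and the log messages are assumptions, not the PR's code.

```python
import logging
import subprocess
from typing import Optional

logger = logging.getLogger(__name__)


def read_file_at_commit(commit: str, file_path: str) -> Optional[str]:
    """Return the content of file_path at commit, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "show", f"{commit}:{file_path}"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
        return result.stdout
    except subprocess.CalledProcessError as e:
        # Log git's own error output instead of silently swallowing it.
        logger.warning(
            "git show %s:%s failed with exit code %s: %s",
            commit, file_path, e.returncode, (e.stderr or "").strip(),
        )
    except subprocess.TimeoutExpired:
        logger.warning("git show %s:%s timed out", commit, file_path)
    except OSError as e:
        # Covers the case where the git binary itself cannot be started.
        logger.warning("could not run git: %s", e)
    return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    content = read_file_at_commit("origin/master", "release.json")
    logger.info("loaded base release.json: %s", content is not None)
```

Callers still get `None` on failure, as in the original control flow, but the reason ends up in the task log.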

@anandsyncs
Contributor

LGTM!
No blocking issues from my side.
Only some comments about logging.

Labels
skip-changelog Use this label in Pull Request to not require new changelog entry file
4 participants