CLOUDP-337356 - static support #333

Draft
wants to merge 33 commits into
base: multi-arch-pipeline-combined
from

Conversation

nammn
Collaborator

@nammn nammn commented Aug 11, 2025

Summary

Proof of Work

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?

nammn and others added 30 commits August 11, 2025 17:19
# Summary
**Add Not-Ready Handling for Ongoing Auth Transitions**:
This patch refines our readiness logic to correctly reflect the state of
authentication transitions. Previously, we treated
LastGoalVersionAchieved == GoalVersion as a signal that the cluster was
"Running", but this assumption breaks down when auth transitions are
still in progress.
This happened because we returned "ready" during a wait step
(WaitAuthCanUpdate) — and [we generally return ready for all wait
steps](https://github.com/mongodb/mongodb-kubernetes/blob/f0050b8942545701e8cb9e42d54d14f0cb58ee6a/mongodb-community-operator/cmd/readiness/main.go#L139),
regardless of whether auth is fully transitioned. Example status:
```
{
  "step": "WaitAuthUpdate",
  "stepDoc": "Wait to update Auth",
  "isWaitStep": true,
  "started": "2025-08-07T14:59:40.213178437Z",
  "attempts": 512,
  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
  "completed": null,
  "result": "wait"
}
```
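
The refined check described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the types `stepStatus` and `processStatus`, the function names, and the way auth wait steps are recognised (by step name, passed in as a set) are all assumptions made for this example. Only the field names are taken from the example status.

```go
package main

import "fmt"

// stepStatus mirrors the fields of the example status above.
// The Go type itself is hypothetical.
type stepStatus struct {
	Step       string
	IsWaitStep bool
	Completed  *string // nil while the step has not completed
	Result     string
}

// processStatus is a hypothetical per-process view of the automation status.
type processStatus struct {
	LastGoalVersionAchieved int
	CurrentStep             *stepStatus // nil when no plan step is running
}

// authTransitionPending reports whether a process is still sitting in an
// auth-related wait step. Assumption: such steps can be recognised by name.
func authTransitionPending(p processStatus, authWaitSteps map[string]bool) bool {
	s := p.CurrentStep
	return s != nil && s.Completed == nil && authWaitSteps[s.Step]
}

// clusterRunning treats the cluster as "Running" only if every process has
// reached the goal version AND no process is still waiting on an auth step.
func clusterRunning(goalVersion int, procs []processStatus, authWaitSteps map[string]bool) bool {
	for _, p := range procs {
		if p.LastGoalVersionAchieved != goalVersion {
			return false
		}
		if authTransitionPending(p, authWaitSteps) {
			return false
		}
	}
	return true
}

func main() {
	authWaitSteps := map[string]bool{"WaitAuthUpdate": true}

	// Goal version reached, but the process is still waiting to update auth.
	waiting := processStatus{
		LastGoalVersionAchieved: 5,
		CurrentStep:             &stepStatus{Step: "WaitAuthUpdate", IsWaitStep: true, Result: "wait"},
	}
	// Goal version reached and no step in flight.
	done := processStatus{LastGoalVersionAchieved: 5}

	fmt.Println(clusterRunning(5, []processStatus{done, waiting}, authWaitSteps)) // false
	fmt.Println(clusterRunning(5, []processStatus{done, done}, authWaitSteps))    // true
}
```

With the old rule (goal version only), the first call would have reported the cluster as running.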


**Why implemented in the operator and not readinessProbe**:
I fixed this in the operator rather than in the readinessProbe:
* if the readinessProbe blocks, new nodes do not come up
* we want new nodes to come up
* but we also want to block new configurations from being applied, which the
automation_status check in the operator does

**The core idea:**
* Configuration applied ≠ transition fully complete.

**What happened in our tests**:
* we update auth via CR x509 -> scram
* `node-0` completed its auth transition (now uses scram, instead of
x509)
* `Config server` hasn't finished its auth transition yet
* We hit a race condition where clusters were marked as "Running" too
early and thus continued the rolling restart of `node-0`
* `node-0` restarted with the old X509 config (see below comment from
the agent code)
* The X509 process couldn’t access the SCRAM automation user
* Leads to Error: "process...doesn't have the automation user"

- the mms-automation code also contains a comment indicating that it
handles the edge case of an unsuccessful auth transition by starting
the process with the old auth to "finish" it. But this is exactly what
causes our race condition
```
	// If a process went down unexpectedly in the middle of an auth transition,
	// we want to restart it with the old auth args.
	// Otherwise, it could be upgraded to the new auth state too soon,
	// and not be able to communicate with other shard members.
```

tl;dr: first `node-0` moved to new auth, `config` not yet, `node-0`
restarted and during the restart `config` transitioned to the new auth
while `node-0` is again running old auth
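
The sequence above can be modelled with a small toy example. It is a deliberate simplification, not the agent's implementation: `restartAuth` and `canUseAutomationUser` are hypothetical helpers, and the rule that a process can only use the automation user when its auth mode matches the rest of the cluster is an assumption made for illustration.

```go
package main

import "fmt"

type authMode string

const (
	x509  authMode = "x509"
	scram authMode = "scram"
)

// restartAuth models the agent rule quoted above: a process that goes down
// while the auth transition is still in progress comes back with the old
// auth args; otherwise it starts with the new ones.
func restartAuth(transitionInProgress bool, oldAuth, newAuth authMode) authMode {
	if transitionInProgress {
		return oldAuth
	}
	return newAuth
}

// canUseAutomationUser is a hypothetical stand-in for the failing check:
// assume the automation user is only reachable when the auth modes match.
func canUseAutomationUser(process, cluster authMode) bool {
	return process == cluster
}

func main() {
	// The config server finishes its transition while node-0 is restarting.
	configServer := scram

	// Racy case: node-0 is restarted while the transition is still in progress.
	node0 := restartAuth(true, x509, scram)
	fmt.Println(node0, canUseAutomationUser(node0, configServer)) // x509 false

	// Gated case: the restart only happens once the transition is complete.
	node0 = restartAuth(false, x509, scram)
	fmt.Println(node0, canUseAutomationUser(node0, configServer)) // scram true
}
```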

## Proof of Work

- auth change tests are passing multiple times in a row:
[Link](http://spruce.mongodb.com/version/6894b98218a2e90007437e99/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC)
- the most flaky auth tests +
[Link2](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_static_mdb_kind_ubi_cloudqa_e2e_sharded_cluster_x509_to_scram_transition_patch_b29fb4ace63eec7102f8f034fd6c553b5d75c1a1_6894c0785c119f0007a58f3c_25_08_07_15_04_26/logs?execution=0)
- from the patch


## Checklist

- [ ] Have you linked a jira ticket and/or is the ticket in the title?
- [x] Have you checked whether your jira ticket required DOCSP changes?
- [x] Have you added changelog file?
    - use `skip-changelog` label if not needed
- refer to [Changelog files and Release
Notes](https://github.com/mongodb/mongodb-kubernetes/blob/master/CONTRIBUTING.md#changelog-files-and-release-notes)
section in CONTRIBUTING.md for more details
…341)

# Summary

This patch adds a separate task in the `init_test_run` variant for
building the operator image with the race-checker enabled.

This patch also moves the tests under `scripts/release` according to
[standard pytest best
practices](https://doc.pytest.org/en/latest/explanation/goodpractices.html#tests-as-part-of-application-code),
because our Python test targets were not capturing all the tests.

## Proof of Work

[EVG Job](https://spruce.mongodb.com/version/689b8a097e5e41000778b82e)

## Checklist

- [x] Have you linked a jira ticket and/or is the ticket in the title?
- [x] Have you checked whether your jira ticket required DOCSP changes?
- [x] Have you added changelog file?
    - use `skip-changelog` label if not needed
- refer to [Changelog files and Release
Notes](https://github.com/mongodb/mongodb-kubernetes/blob/master/CONTRIBUTING.md#changelog-files-and-release-notes)
section in CONTRIBUTING.md for more details
# Summary

- while refactoring and removing the daily builds, we also removed parts
of the teardown by accident

## Proof of Work
- the periodic build is getting triggered:
https://spruce.mongodb.com/version/689dc0de0e8148000791e683/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

## Checklist

- [ ] Have you linked a jira ticket and/or is the ticket in the title?
- [ ] Have you checked whether your jira ticket required DOCSP changes?
- [ ] Have you added changelog file?
    - use `skip-changelog` label if not needed
- refer to [Changelog files and Release
Notes](https://github.com/mongodb/mongodb-kubernetes/blob/master/CONTRIBUTING.md#changelog-files-and-release-notes)
section in CONTRIBUTING.md for more details
Labels
None yet
Projects
None yet
2 participants