fix: eip race condition when updating ASG #1314
base: main
Conversation
Hey @blackillzone! 👋 Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process. Make sure that this PR clearly explains:
By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE. The following ChatOps commands are supported:
Simply add a comment with the command in the first line. If you need to pass more information, separate it from the command with a blank line. This message was generated automatically. You are welcome to improve it.
Zero downtime should be the top priority, and this is definitely the easiest solution, but it adds $3.60 per IP and machine to the monthly bill. What about removing the EIP from Terraform, ordering it during startup, and releasing it during shutdown? To be safe, you'd always have to check for dangling EIPs (due to failures) and release them too.
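In CLI terms, the startup side of this proposal would be roughly the following (a sketch, not project code; the `gitlab-runner-eip` tag is invented for illustration):

```bash
#!/usr/bin/env bash
# Sketch: allocate an EIP at instance startup and attach it to this instance.
# Assumes the AWS CLI with instance-profile credentials and IMDSv2.
set -euo pipefail

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

# Allocate a fresh EIP, tagged so dangling ones can be found and released later.
ALLOCATION_ID=$(aws ec2 allocate-address --domain vpc \
  --tag-specifications 'ResourceType=elastic-ip,Tags=[{Key=Name,Value=gitlab-runner-eip}]' \
  --query 'AllocationId' --output text)

aws ec2 associate-address --instance-id "$INSTANCE_ID" \
  --allocation-id "$ALLOCATION_ID"
```

The shutdown side would mirror this with `disassociate-address` and `release-address`.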
Seems possible. If I'm correct, the workflow would be:
This way, each agent manages its own EIP lifecycle, we don't have race condition issues, and no EIP is deployed for nothing. As for the dangling EIPs, I don't know where we should manage this to do it properly. Maybe in the Lambda function, as it's already used to clean up what's left after a runner agent shutdown?
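For illustration, the cleanup could be as simple as releasing every tagged address that has no association left (a sketch assuming EIPs are tagged at allocation time; the tag name is made up):

```bash
#!/usr/bin/env bash
# Sketch: release dangling runner EIPs, i.e. tagged addresses that are no
# longer associated with any instance. This could run in the cleanup Lambda.
set -euo pipefail

for ALLOCATION_ID in $(aws ec2 describe-addresses \
  --filters "Name=tag:Name,Values=gitlab-runner-eip" \
  --query 'Addresses[?AssociationId==`null`].AllocationId' --output text); do
  aws ec2 release-address --allocation-id "$ALLOCATION_ID"
done
```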
I've updated the PR, and tested all this.

At startup phase:
At terminate phase (see the sketch after this list):
Lambda fallback:
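The terminate-phase release could look roughly like this (a hedged sketch, not the exact code in this PR):

```bash
#!/usr/bin/env bash
# Sketch: on shutdown, find the EIP attached to this instance, then detach
# and release it so it never dangles. Assumes the AWS CLI and IMDSv2.
set -euo pipefail

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

# Look up the address currently associated with this instance, if any.
read -r ASSOCIATION_ID ALLOCATION_ID < <(aws ec2 describe-addresses \
  --filters "Name=instance-id,Values=$INSTANCE_ID" \
  --query 'Addresses[0].[AssociationId,AllocationId]' --output text)

if [ -n "${ALLOCATION_ID:-}" ] && [ "$ALLOCATION_ID" != "None" ]; then
  aws ec2 disassociate-address --association-id "$ASSOCIATION_ID"
  aws ec2 release-address --allocation-id "$ALLOCATION_ID"
fi
```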
As I was testing in autoscaler mode, I've also included a fix for this: #1316. It's ready for review; let me know if you think we can enhance anything in the current workflow for the EIP. Note: I haven't worked on backward compatibility with the old behavior, but the transition should be easy.
Hello @kayman-mk, did you have time to review this PR? I've been running it for a month with EIP active, and it's working for my use case at least, but I'd gladly take external feedback on it. Also, let me know if it's okay; I can work on the rebase 👍 Thank you 🙏
Description
Add a second EIP when we decide to use an EIP for the manager. This way, when we update the ASG, a free EIP is available for the `aws-ec2-assign-elastic-ip` tool to pick. This fixes the rolling update and allows for a smoother transition, instead of trying to work around the removal of the EIP association and breaking the availability of the runner.
The caveat is that we require two EIPs for this use case, but I guess it's okay, as long as the user is aware of it.
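Conceptually, all the second EIP does is guarantee that an unassociated address exists at the moment the replacement instance boots; the pick itself then boils down to something like this (a rough CLI equivalent, not the tool's actual implementation):

```bash
#!/usr/bin/env bash
# Sketch of the "pick a free EIP" step performed by aws-ec2-assign-elastic-ip:
# take the first address without an association and attach it. The instance id
# is passed as an argument here for simplicity.
set -euo pipefail

INSTANCE_ID=$1

FREE_ALLOCATION_ID=$(aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`] | [0].AllocationId' --output text)

aws ec2 associate-address --instance-id "$INSTANCE_ID" \
  --allocation-id "$FREE_ALLOCATION_ID"
```

With two EIPs, the old instance keeps its address until it is terminated while the new one picks the free second address, so the rolling update never races over a single EIP.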
Migrations required
No
Verification
Running a deployment with a runner instance using an EIP, then applying a configuration change (like changing the instance type, or an AMI update).
The second runner starts correctly and gets an EIP associated.
Alternative
To avoid using two EIP resources, we could add some more logic in `template/eip.tftpl` to check the EIP availability and do a switchover if it is already associated with an existing running instance. The issue is that the runner would still be unavailable in the meantime on the GitLab side, because it would still be running the rest of the cloud-init script and wouldn't be able to keep communicating. I'm open to discussion about this, but having two EIPs is more reliable, I think.
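For reference, the switchover would essentially come down to a forced reassociation (a hypothetical sketch; `EIP_ALLOCATION_ID` would have to be templated into the script, and the name is made up):

```bash
#!/usr/bin/env bash
# Sketch of the single-EIP switchover alternative: steal the address from the
# old instance with --allow-reassociation. The downtime window discussed above
# is exactly this step: the old runner loses its IP while the new one is
# still finishing cloud-init.
set -euo pipefail

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

# EIP_ALLOCATION_ID is assumed to be injected by the template rendering.
aws ec2 associate-address --instance-id "$INSTANCE_ID" \
  --allocation-id "$EIP_ALLOCATION_ID" --allow-reassociation
```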