Zero-downtime when restarting zitadel pods is not possible as of now
Zitadel Helm chart version: 8.1.0
Zitadel image version: v2.55.0
Kubernetes: v1.29.5
DB: PostgreSQL 15.6
Ingress: nginx v1.10.1
PROBLEM: When a zitadel pod is restarted or stopped, kubernetes sends a SIGTERM signal to the main container, and zitadel immediately "gracefully" exits. This happens almost instantly, while the endpoints controller is still updating the Endpoints object, so the ingress controller (before it detects the change in endpoints) can for a short time still send traffic to the already-terminated container (as the container exited immediately, without any delay). And as a side effect, you will see a few "bad gateway" errors when trying to reach the zitadel endpoints at that time. So no complete ZERO-downtime is possible.
HOW TO REPRODUCE: start k6 performance tests (VUs > 15) and at the same time restart the whole zitadel deployment (the pods are restarted one by one) - you will see a few "bad gateway" errors due to the process described above.
POSSIBLE SOLUTION: Zitadel should receive the SIGTERM signal and then delay its exit a bit, so the endpoints are updated before the process actually stops. The delay could be made configurable, for example.
What do you think of that issue?
16 Replies
Hm, that is an interesting topic.
In what scenario do you see this happen?
To my understanding, if you deploy a new version, a new replica set gets created and traffic is shifted towards it before the old one is terminated.
What strategy are you using here?
We do use the rolling update strategy, and it is set on the zitadel deployment:
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
And the issue still happens in this case. I was suspecting the immediate exit of the zitadel pod as the main reason. The modification of the corresponding Endpoints resource (removal of the exiting container's IP) and the container shutdown happen at the same time, driven by separate control loops, so this misalignment can sometimes happen.
Most of our other applications do wait a bit after SIGTERM is received, doing some cleanup work, and therefore the IP in the Endpoints resource is guaranteed to be removed BEFORE the container actually exits - this is what I mean.
The scenarios where it happens are:
1. update of the zitadel pod to newer version
2. just the restart of the deployment (without update).
3. just scaling down the replicas on the zitadel deployment
Ok, this helps me understand. How long would we need to wait on SIGTERM?
CC @Elio have you seen something like this already?
I would say, ideally, zitadel should get SIGTERM, wait until all currently executing HTTP connections are completed, and only then shut down, if that really helps solve the issue.
In addition to that - we could also make the delay explicitly configurable (instead of waiting until all connections are completed, just wait an exact number of seconds) as an ENV variable 🤔
But I would still try to reproduce the issue on your end (execute the performance tests while restarting or scaling the zitadel deployment). If it is also an issue for you, try my suggested options and see if they resolve it 🤔 As I am only guessing there.
Hehe we will surely look into a reproduction.
I asked because I have not seen problems around this with our customers, so something must be different for you to see that behavior
We wouldn't have seen/caught this issue either if we didn't run performance tests (with VUs > 15) and try to update the application at the same time (while checking the performance test logs at that particular time) 🙂 So maybe that is the reason, not sure 🤔 Not sure how rare/common disaster recovery scenario tests are these days 😅
Would you mind testing a workaround with a preStop hook?
https://stackoverflow.com/questions/40545581/do-kubernetes-pods-still-receive-requests-after-receiving-sigterm
If that resolves your case, we could add a configurable wait time for the sigterm trap in Zitadel
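Something along these lines, just as a sketch (untested on our side, and it assumes the container image actually ships a sleep binary - the 5 seconds is a placeholder):
lifecycle:
  preStop:
    exec:
      # runs before the kubelet sends SIGTERM, so the container keeps
      # serving traffic while the Endpoints object and the ingress catch up
      command: ["sleep", "5"]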
@FFO as per your/Stack Overflow's suggestion: we have re-packed the original zitadel container image to include a "sleep" binary (zitadel is built from a "scratch" image and doesn't contain anything except zitadel itself), modified the original Helm chart to include a preStop hook, set the "sleep" delay to 5s - and it completely eliminated the issue with the "bad gateway" errors.
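Roughly, the rendered container spec now looks like this (a sketch with placeholder names - the re-packed image just adds a static sleep binary, e.g. copied from busybox, on top of the otherwise scratch-based upstream image):
# Relevant part of the zitadel Deployment's pod template after our modification
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: zitadel
          # re-packed image: upstream zitadel + a static `sleep` binary (placeholder name)
          image: our-registry.example.com/zitadel-with-sleep:v2.55.0
          lifecycle:
            preStop:
              exec:
                # keep the container alive and serving for 5s before SIGTERM is delivered
                command: ["sleep", "5"]
(As far as I know, newer Kubernetes versions also add a native sleep action for preStop hooks, which would avoid re-packing the image, but on 1.29 that is still behind a feature gate.)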
The pod lifecycle controller (responsible for managing pod creation, termination, etc.) and the endpoints controller (which updates a Service's endpoints with the healthy IPs, which the ingress uses) are two separate control loops. Since these two controllers operate independently, a slight lag between them is possible, which creates the original problem. So I guess this issue will also be reproducible in your testing lab.
So to conclude - implementing a configurable exit delay on the application side would be really valuable for others as well 😊 I am pretty sure others are also facing this issue, they just don't know about it.
"If that resolves your case, we could add a configurable wait time for the sigterm trap in Zitadel"
@FFO @Elio Will you implement the fix (configurable wait time) as originally planned?
I think we should but ATM our capacity is a little tight (it will improve since we hired multiple new engineers :D)
Would you mind creating an improvement request on GitHub?
@FFO The issue we described is about zitadel's zero-downtime behavior, which, as we have figured out, doesn't work correctly. Likely a large number of zitadel users are experiencing the same, so I would say (imo) it is quite critical and everyone would be interested in getting it fixed 😊
In the meantime, what is the correct way of creating an improvement in Github? Should we just create an "Issue" in the https://github.com/zitadel/zitadel/issues ?
Or do you mean something else? 😊
You best create this one here https://github.com/zitadel/zitadel/issues/new?template=improvement.yaml
This helps us track the implementation 😄
Hello again @FFO. I can see that this bug/feature was removed from product management in https://github.com/zitadel/zitadel-charts/issues/282 Could I know the reason for that? Have you decided that it is not critical?
no this has nothing to do with the criticality, it just falls into the operations team, and they handle the issues on another board
sorry for the confusion
Oh, got it, thanks for the clarification ☺