ZITADEL · 2y ago
RAleksejuk

Zero-downtime when restarting zitadel pods is not possible as of now

Zitadel Helm chart version: 8.1.0
Zitadel image version: v2.55.0
Kubernetes: v1.29.5
DB: PostgreSQL 15.6
Ingress: nginx v1.10.1

PROBLEM: When a zitadel pod is restarted or stopped, Kubernetes sends SIGTERM to the main container, and zitadel immediately exits "gracefully". Because this happens almost instantly, the endpoints controller is still updating the endpoints at that moment, and the ingress controller (before it detects the change in endpoints) can keep sending traffic for a short time to a container that no longer exists, since the container exited without any delay. As a side effect, you will see a few "bad gateway" errors when trying to reach the zitadel endpoints during that window. So complete zero downtime is not possible.
HOW TO REPRODUCE: Start k6 performance tests (VUs > 15) and at the same time restart the whole zitadel deployment (pods are restarted one by one). You will see a few "bad gateway" errors caused by the process described above.
POSSIBLE SOLUTION: After receiving SIGTERM, zitadel should delay its exit a bit, so that the endpoints are updated before it stops serving. The delay could be configurable, for example.

What do you think of that issue?