K8s deployment DB issues
Today we noticed some error logs in our k8s deployment of Zitadel related to DB access, and we can't figure out what's wrong. We have two separate k8s deployments: one with in-cluster deployed Postgres and another with a managed Postgres solution, and both deployments produce the same errors. This is a sample of the errors:
We tried a Python script that creates many pg connections, but it didn't produce any connection issues; Postgres is also barely utilized at all. Our application stack, which runs on the same k8s cluster as Zitadel, uses Postgres events directly (pg_notify) to track the Zitadel event store, but I guess this should not affect anything. The local Docker stack doesn't have these issues.
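Roughly what that test script does, for reference (a minimal sketch with a placeholder DSN and placeholder counts, assuming psycopg2; not the exact script we ran):

```python
# Minimal sketch of the connection-churn test (placeholder DSN and counts).
import concurrent.futures

import psycopg2

DSN = "host=postgres.example.svc dbname=zitadel user=test password=test"  # placeholder


def open_and_query(_: int) -> None:
    # Open a fresh connection, run a trivial query, then close it again.
    conn = psycopg2.connect(DSN, connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
    finally:
        conn.close()


# Hammer the DB with many short-lived parallel connections.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(open_and_query, range(500)))

print("no connection errors reproduced")
```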
Any suggestions where we should look further?
11 Replies
Just to summarize what we have tried/checked so far:
- Zalando postgres operator logs don't show anything suspicious at the time the errors occur
- Max connection count at the DB server level is not reached, not even close (see the query sketch after this list)
- The Zitadel pod's CPU/memory utilization has plenty of headroom
- The k8s test job that checks DB connectivity and DNS resolution was left running for the whole night and hasn't failed even once (50 parallel executions every second)
- The same issue is present in totally different environments, with a different network stack, datacenter, k8s provider and Kubernetes version
- The issue is pretty much random and doesn't show up consistently (e.g. after a Zitadel restart there were no issues for 20 minutes or so)
- The issue is present on different Zitadel versions
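For the connection-count point above, this is roughly how we check it (a sketch with a placeholder DSN, assuming psycopg2):

```python
# Sketch of the connection-count check against pg_stat_activity (placeholder DSN).
import psycopg2

DSN = "host=postgres.example.svc dbname=zitadel user=monitor password=monitor"  # placeholder

conn = psycopg2.connect(DSN)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW max_connections")
        max_conns = int(cur.fetchone()[0])
        cur.execute("SELECT count(*) FROM pg_stat_activity")
        used = cur.fetchone()[0]
    print(f"{used}/{max_conns} connections in use")
finally:
    conn.close()
```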
We increased `Projections.TransactionDuration` from 500ms to 1000ms in the Zitadel config. As of now there are no i/o or DNS resolution errors; we'll continue to monitor the situation...
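For context, the change is just this one key in the Zitadel runtime config (a YAML sketch with everything else omitted; when configuring via environment variables the equivalent should be ZITADEL_PROJECTIONS_TRANSACTIONDURATION):

```yaml
# Relevant excerpt of the Zitadel config (all other keys omitted).
Projections:
  # Default is 500ms; raised to 1s (1000ms) to see whether the deadline-exceeded errors stop.
  TransactionDuration: 1s
```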
After some time the same issues are reappearing 😦 Any tips here on what could be wrong?
Any measures we should take based on those errors, or should we just ignore them for now (as Zitadel seems to be unaffected by this issue)?
Hey, just saw this thread, give me some time to check.
I have so far not seen the error you have here.
This really looks like a DNS issue 😅 but that would be weird.
Can you share more details about your infra and what versions (zitadel/db) you operate? The zitadel config would also help.
Here are some tech details:
We tried a Python script specifically to test for a possible flaw in ingress/DNS, but could not reproduce such errors... we tried to load test more than this comment suggests...
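For reference, that probe looked roughly like this (a sketch with a placeholder service name and port; many of these ran in parallel):

```python
# Sketch of the DNS/connectivity probe (placeholder hostname and port).
import socket
import time

DB_HOST = "postgres.zitadel.svc.cluster.local"  # placeholder service name
DB_PORT = 5432

while True:
    try:
        # Resolve the DB host via the cluster DNS (whatever /etc/resolv.conf points to)...
        sockaddr = socket.getaddrinfo(DB_HOST, DB_PORT, proto=socket.IPPROTO_TCP)[0][4]
        # ...then open a plain TCP connection to rule out network-level flakiness.
        with socket.create_connection(sockaddr[:2], timeout=5):
            pass
    except OSError as err:
        print(f"probe failed: {err}")
    time.sleep(1)
```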
@FFO hey, any tips on how to find the culprit of those errors?
Can you share your config/infra setup?
You still see the DNS error or is it just the connection error now?
@FFO hey, currently we are using Zitadel v2.55.0 and we are getting these errors:
Generic deadline exceeded timeout (this is the most frequent one):
Generic failure to connect to the DB cluster (rarely):
Failure to DNS-resolve the DB server via the local k8s DNS resolver (172.16.0.3:53) (also rarely):
Seems this problem somehow went away; we are rarely getting those errors anymore. We have reported it to our k8s provider, and they probably did some work on the infrastructure and changed things, but we are not sure... We'll keep monitoring the situation and I'll open a new post in case of problems.
OK, happy to assist again!
Since I have not seen this on a broad scale, it might really be an infra-related issue.
Maybe @adlerhurst can lend a hand here.
What version of Zitadel are we talking about here, and with what DB?