TomasP•13mo ago

K8s deployment DB issues

Today we noticed some error logs related to DB access in our k8s deployment of Zitadel, and we can't figure out what's wrong. We have two separate k8s deployments: one with an in-cluster Postgres and another with a managed Postgres solution, and both produce the same errors. This is a sample of the errors:

time="2024-09-05T04:48:51Z" 
    level=info 
    msg="process events failed" 
    caller="/home/runner/work/zitadel/zitadel/internal/eventstore/handler/v2/handler.go:413" 
    error="context deadline exceeded" 
    projection=projections.project_grant_members4

time="2024-09-05T04:48:53Z" 
    level=info 
    msg="process events failed" 
    caller="/home/runner/work/zitadel/zitadel/internal/eventstore/handler/v2/handler.go:413" 
    error="failed to connect to `host=auth-domain.postgres-system.svc.cluster.local user=auth_user database=auth_zitadel_prod`: 
        hostname resolving error (lookup auth-domain.postgres-system.svc.cluster.local on 172.16.0.3:53: dial udp 172.16.0.3:53: i/o timeout)" 
    projection=projections.secret_generators2

time="2024-09-05T04:48:53Z" 
    level=info 
    msg="process events failed" 
    caller="/home/runner/work/zitadel/zitadel/internal/eventstore/handler/v2/handler.go:413" 
    error="failed to connect to `host=auth-domain.postgres-system.svc.cluster.local user=auth_user database=auth_zitadel_prod`: 
        hostname resolving error (lookup auth-domain.postgres-system.svc.cluster.local on 172.16.0.3:53: dial udp 172.16.0.3:53: i/o timeout)" 
    projection=projections.project_grant_members4

We tried a Python script that creates many pg connections, but it didn't produce any connection issues; Postgres is also barely utilized. Our application stack, which runs on the same k8s cluster as Zitadel, uses Postgres events directly (pg_notify) to track the Zitadel event store, but I guess this should not affect anything. The local Docker stack doesn't have issues. Any suggestions on where we should look further?

11 Replies

TomasPOP•13mo ago

Just to summarize what we have tried/checked so far:
- The Zalando postgres operator logs don't show anything suspicious at the time the error occurs
- The max connection count at the DB server level is not reached, not even close
- The Zitadel pod CPU/memory utilization has plenty of room
- The testing script (a k8s job), which was supposed to test DB connectivity and DNS resolution, was left running for the whole night and hasn't failed even once (50 parallel executions every second)
- The same issue is present on totally different environments, with different network stacks, datacenters, k8s providers and Kubernetes versions
- The issue is pretty much random and doesn't show consistently (e.g. after a Zitadel restart there were no issues for 20 minutes or so)
- The issue is present on different Zitadel versions

We increased Projections.TransactionDuration from 500ms to 1000ms in the Zitadel config; as of now there are no i/o or DNS resolution errors. We'll continue to monitor the situation...

After some time the same issues are reappearing 😦 Any tips on what could be wrong? Are there any measures we should take based on those errors, or should we just ignore them for now (as Zitadel seems to be unaffected by this issue)?
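A DNS probe along the lines described above might look like the following minimal sketch. It is not the script used in this thread: the function names (`resolve_once`, `probe`), the port 5432 and the attempt/worker counts are assumptions for illustration. It resolves through the OS resolver via `socket.getaddrinfo`, which may behave differently from the resolver inside the Zitadel binary, so a clean run does not necessarily rule out DNS trouble on the Zitadel side.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def resolve_once(host: str, port: int = 5432):
    """Resolve `host` once and return (seconds_elapsed, error_or_None)."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        return time.monotonic() - start, None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return time.monotonic() - start, exc


def probe(host: str, attempts: int = 50, workers: int = 50):
    """Run `attempts` lookups in parallel and collect one result per lookup."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: resolve_once(host), range(attempts)))


if __name__ == "__main__":
    # Host name taken from the log sample above; replace with your own service.
    host = "auth-domain.postgres-system.svc.cluster.local"
    while True:
        results = probe(host)
        failures = [err for _, err in results if err is not None]
        slowest = max(latency for latency, _ in results)
        print(f"failures={len(failures)} slowest={slowest:.3f}s")
        time.sleep(1)
```

Logging the slowest lookup per round, not only the failures, helps spot lookups that succeed but come close to the timeout.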

FFO•13mo ago

Hey, just saw this thread, give me some time to check. I have so far not seen the error you have here. This really looks like a DNS issue 😅 but that would be weird. Can you share more details about your infra and what versions (Zitadel/DB) you operate? The Zitadel config would also help.

TomasPOP•13mo ago

Here are some tech details:

message.txt

TomasPOP•13mo ago

# ZITADEL manages three database connection pools.
  # The *ConnRatio settings define the ratio of how many connections from
  # MaxOpenConns and MaxIdleConns are used to push events and spool projections.
  # Remaining connection are used for queries (search).
  # Values may not be negative and the sum of the ratios must always be less than 1.
  # For example this defaults define 15 MaxOpenConns overall.
  # - 15*0.2=3 connections are allocated to the event pusher;
  # - 15*0.135=2 connections are allocated to the projection spooler;
  # - 15-(3+2)=10 connections are remaining for queries;
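The arithmetic in the comment above can be reproduced with a short sketch. It assumes the product of `MaxOpenConns` and each ratio is rounded down to a whole connection, which matches the example (15*0.135=2.025 gives 2) but is not confirmed by this thread; `split_pool` and its parameter names are hypothetical and not part of Zitadel.

```python
import math


def split_pool(max_open_conns: int, pusher_ratio: float, projection_ratio: float) -> dict:
    """Split a connection budget into pusher, projection and query pools."""
    if pusher_ratio < 0 or projection_ratio < 0:
        raise ValueError("ratios may not be negative")
    if pusher_ratio + projection_ratio >= 1:
        raise ValueError("the sum of the ratios must be less than 1")
    # Assumption: fractional connections are rounded down.
    pusher = math.floor(max_open_conns * pusher_ratio)
    projection = math.floor(max_open_conns * projection_ratio)
    # Whatever is left over serves queries (search).
    query = max_open_conns - (pusher + projection)
    return {"pusher": pusher, "projection": projection, "query": query}


if __name__ == "__main__":
    print(split_pool(15, 0.2, 0.135))
```

With these numbers the projection spooler gets only two connections, so it can help to compare that figure with the number of projections that run concurrently.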

We tried a Python script specifically to test for a possible flaw in ingress/DNS, but could not reproduce such errors... we tried to load test more than this comment suggests... @FFO hey, any tips on how to find the culprit of those errors?

Unknown User•12mo ago

Message Not Public

FFO•12mo ago

Can you share your config/infra setup? Do you still see the DNS error, or is it just the connection error now?

TomasPOP•12mo ago

@FFO hey, currently we are using Zitadel v2.55.0 and we are getting these errors:

Generic deadline exceeded timeout (this one is the most frequent):

time="2024-09-13T06:25:24Z" level=info msg="process events failed" caller="/home/runner/work/zitadel/zitadel/internal/eventstore/handler/v2/handler.go:413" error="timeout: context deadline exceeded" projection=projections.custom_texts2

Generic failure to connect to the db cluster (rarely):

time="2024-09-13T09:13:17Z" level=info msg="process events failed" caller="/home/runner/work/zitadel/zitadel/internal/eventstore/handler/v2/handler.go:413" error="failed to connect to `host=se-prod-postgresql-common-domain.postgres-system.svc.cluster.local user=common_domain_user database=common_domain_zitadel_prod`: dial error (timeout: context deadline exceeded)" projection=projections.custom_texts2

Failure to resolve the DB server via the local k8s DNS resolver (172.16.0.3:53) (also rare):

time="2024-09-12T17:26:51Z" level=info msg="process events failed" caller="/home/runner/work/zitadel/zitadel/internal/eventstore/handler/v2/handler.go:413" error="failed to connect to `host=se-prod-postgresql-common-domain.postgres-system.svc.cluster.local user=common_domain_user database=common_domain_zitadel_prod`: hostname resolving error (lookup se-prod-postgresql-common-domain.postgres-system.svc.cluster.local on 172.16.0.3:53: dial udp 172.16.0.3:53: i/o timeout)" projection=projections.custom_texts2

It seems this problem somehow went away; we rarely get those errors anymore. We reported this to our k8s provider, and they probably did some work on the infrastructure and changed things, but we are not sure... We'll keep monitoring the situation and I'll open a new post in case of problems.
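For ongoing monitoring, the three error kinds listed above could be counted per projection with a small sketch like the one below. It assumes single-line log entries in the format of the samples in this message; the category labels and the `classify`/`summarize` helpers are made up for illustration.

```python
import re
from collections import Counter

# Captures the quoted error text and the projection name that follows it.
ERROR_RE = re.compile(r'error="(?P<error>.*)"\s+projection=(?P<projection>\S+)')


def classify(error: str) -> str:
    """Map an error message to one of the categories seen in this thread."""
    # Check the specific connection errors first: dial errors also contain
    # "context deadline exceeded".
    if "hostname resolving error" in error:
        return "dns"
    if "dial error" in error:
        return "connect"
    if "context deadline exceeded" in error:
        return "timeout"
    return "other"


def summarize(lines):
    """Count log lines per (category, projection); skip non-matching lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match is None:
            continue
        counts[(classify(match["error"]), match["projection"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: kubectl logs <zitadel-pod> | python summarize_errors.py
    for (kind, projection), count in summarize(sys.stdin).most_common():
        print(f"{count:6d}  {kind:8s}  {projection}")
```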

FFO•12mo ago

OK, happy to assist again! Since I have not seen this on a broad scale, it might really be an infra-related issue.

Unknown User•12mo ago

Message Not Public

FFO•12mo ago

Maybe @adlerhurst can lend a hand here. What version of zitadel are we talking here, with what DB?

Unknown User•12mo ago

Message Not Public
