craigzour
craigzour • 4mo ago

Notifier errors since upgraded to 3.0.4

Hello! I am looking for help to understand and debug an issue I have with my Zitadel service. I recently upgraded my self-hosted Zitadel instance from 2.63.4 to 3.0.4, and since then I am getting recurring errors related to a Notifier resource. Every 30 minutes I see the following two error logs:
level=ERROR msg="Notifier: Error from notification wait" err="unexpected EOF"
level=ERROR msg="Notifier: Error running listener (will attempt reconnect after backoff)" attempt=180 err="unexpected EOF" sleep_duration=8m30.106954557s
level=ERROR msg="Notifier: Error from notification wait" err="unexpected EOF"
level=ERROR msg="Notifier: Error running listener (will attempt reconnect after backoff)" attempt=180 err="unexpected EOF" sleep_duration=8m30.106954557s
(Small note: it feels like the retry backoff delay is not working properly, since the error happens every 30 minutes.) I tried multiple small configuration adjustments but none of them resolved it. Unfortunately, I also could not find anything on the internet that would point me in the right direction when it comes to fixing it. Thank you in advance 🙂
11 Replies
craigzour
craigzourOP • 4mo ago
(Bump)
Rajat Singh
Rajat Singh • 4mo ago
Hey @craigzour, thanks for the bump, looking into it.
Rajat
Rajat • 4mo ago
Hey @craigzour, the unexpected EOF error indicates that the Notifier component is experiencing an abrupt termination of its connection to the notification service or message broker. This could be due to multiple reasons, such as network interruptions or misconfiguration. I will take it to my team and let them have a look. Also, what exactly did you try when you say "I tried multiple small configuration adjustments but none of them resolved it"?
craigzour
craigzourOP • 4mo ago
Hello @Rajat. Thank you for spending some time looking into this. I tried to make some small Zitadel configuration adjustments, such as:
- Tweaking the database connection options (MaxOpenConns and MaxConnLifetime)
- Setting the Notifications option LegacyEnabled to either true or false (because I thought it was related to the word "Notifier" in the error message)
Also, just to make sure we are on the same page: we have not touched anything else outside of that Zitadel upgrade.
Rajat
Rajat • 4mo ago
Thanks for the update @craigzour. I will send this to my team internally and will get back to you.
adlerhurst
adlerhurst • 4mo ago
Hi there, 30 minutes is the default max connection lifetime to the database. Can you increase or decrease that value so we can figure out whether that is the reason?
Ah sorry, I skipped that message: changing the conn max lifetime didn't change the interval of the log?
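For readers following the thread, the value being discussed is the pooled connection lifetime under Database.postgres in the Zitadel runtime configuration. A minimal excerpt of such an override (the values simply mirror the ones tried in this thread, not a recommendation):

Database:
  postgres:
    MaxConnLifetime: "1h"  # was "30m" in this setup; changed to see whether the error interval follows it
    MaxConnIdleTime: "5m"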
craigzour
craigzourOP • 4mo ago
Hello! This is correct. Even though we changed MaxConnLifetime: "30m" to MaxConnLifetime: "1h", the Notifier error cadence stayed the same (every 30 mins or so). Here is our current config file in case it helps with debugging (we reverted that MaxConnLifetime change since we still had the issue):
# Default config is merged with the overrides in this file.
# https://zitadel.com/docs/self-hosting/manage/configure#runtime-configuration-file

Log:
  Level: "info"

Metrics:
  Type: none
Tracing:
  Type: none
Profiler:
  Type: none
Telemetry:
  Enabled: false

Port: 8080
ExternalPort: 443
ExternalSecure: false
TLS:
  Enabled: false

Database:
  postgres:
    Port: 5432
    User:
      SSL:
        Mode: "require"
    Admin:
      SSL:
        Mode: "require"
    MaxOpenConns: 20
    MaxIdleConns: 10
    MaxConnLifetime: "30m"
    MaxConnIdleTime: "5m"

DefaultInstance:
  LoginPolicy:
    AllowRegister: false
    ForceMFA: true
    HidePasswordReset: true
  OIDCSettings:
    AccessTokenLifetime: "0.5h"
    IdTokenLifetime: "0.5h"
    RefreshTokenIdleExpiration: "720h"
    RefreshTokenExpiration: "2160h"

OIDC:
  DefaultAccessTokenLifetime: "0.5h"
  DefaultIdTokenLifetime: "0.5h"
  DefaultRefreshTokenIdleExpiration: "720h"
  DefaultRefreshTokenExpiration: "2160h"

Notifications:
  # Notifications can be processed by either a sequential mode (legacy) or a new parallel mode.
  # The parallel mode is currently only recommended for Postgres databases.
  # If legacy mode is enabled, the worker config below is ignored.
  LegacyEnabled: false
  # The amount of workers processing the notification request events.
  # If set to 0, no notification request events will be handled. This can be useful when running in
  # multi binary / pod setup and allowing only certain executables to process the events.
  Workers: 1
  # The maximum duration a job can do it's work before it is considered as failed.
  TransactionDuration: 10s
  # Automatically cancel the notification after the amount of failed attempts
  MaxAttempts: 3
  # Automatically cancel the notification if it cannot be handled within a specific time
  MaxTtl: 5m
adlerhurst
adlerhurst • 3mo ago
Thanks for the info, we will try to reproduce it.
craigzour
craigzourOP • 3mo ago
Thank you for the update 🙂
adlerhurst
adlerhurst • 3mo ago
Hi there, I just created this issue for tracking: https://github.com/zitadel/zitadel/issues/10092
GitHub: Error running listener · Issue #10092 · zitadel/zitadel
craigzour
craigzourOP • 2w ago
Hello! Perfect! Thank you 🙂

Hello! I wanted to share some new information about this, as I spent some time investigating it recently. I tried to find what code was throwing that recurring error and found out it was not in Zitadel directly but in a package named Riverqueue, which was integrated into Zitadel about 6 months ago. It was not part of the Zitadel version we were using before the migration to 3.0.4 (2.63.4). After investigating and testing various things, I discovered that our AWS RDS Proxy has an idle timeout set to 30 minutes, which is exactly the frequency at which we get those errors. I increased that value to 8 hours (the maximum allowed) and noticed that we were then only getting one error every 8 hours. In the AWS RDS Proxy logs I also found that we were getting the following log just before the Notifier error:
2025-08-22T12:47:27.276Z [WARN] [proxyEndpoint=default] [clientConnection=369107045] The client session was pinned to the database connection [dbConnection=3076737155] for the remainder of the session. The proxy can't reuse this connection until the session ends. Reason: SQL changed session settings that the proxy doesn't track. Consider moving session configuration to the proxy's initialization query. Digest: "set search_path to $1; set application_name to $2".
So, all that to say that I think the issue sits between Riverqueue and our AWS RDS Proxy. I believe Riverqueue needs an open connection to the database that never ends, but our proxy terminates it because it does not see any activity. I don't know much about Riverqueue, but I am guessing that removing the AWS RDS Proxy would probably solve my problem. Since you folks probably know more about it, please let me know if there is a configuration that would allow me to keep that proxy 🙂 Maybe there is a way for Riverqueue to keep that connection alive or to work around that AWS RDS Proxy pinning behaviour. FYI: I did remove the AWS RDS Proxy and have not seen the issue come back since.
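To illustrate the suspected mechanism, here is a minimal, hypothetical sketch of the LISTEN/NOTIFY listener pattern that a River-style notifier relies on (this is not the actual Zitadel or River code; it assumes the pgx driver, and the DSN and channel name are invented). The listener parks a dedicated connection on WaitForNotification with essentially no traffic in between, so a proxy idle timeout of 30 minutes cuts it roughly every 30 minutes and surfaces as "unexpected EOF":

package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	for {
		// Hypothetical DSN pointing at the RDS Proxy endpoint.
		conn, err := pgx.Connect(ctx, "postgres://zitadel:secret@my-rds-proxy.example.com:5432/zitadel")
		if err != nil {
			log.Printf("connect failed: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}

		// Subscribe to a notification channel (the channel name is made up).
		// After this the connection mostly sits idle between NOTIFY events.
		if _, err := conn.Exec(ctx, "LISTEN job_events"); err != nil {
			log.Printf("LISTEN failed: %v", err)
			_ = conn.Close(ctx)
			time.Sleep(5 * time.Second)
			continue
		}

		for {
			// Blocks until a notification arrives or the connection dies.
			// If the proxy drops the idle connection, this returns an error
			// such as "unexpected EOF" and the outer loop has to reconnect.
			n, err := conn.WaitForNotification(ctx)
			if err != nil {
				log.Printf("notification wait failed: %v", err)
				break
			}
			log.Printf("notification on %q: %s", n.Channel, n.Payload)
		}
		_ = conn.Close(ctx)
	}
}

This also matches the pinning warning in the proxy log above: because the session sets search_path and application_name, RDS Proxy pins it to a single backend connection and cannot multiplex it, so the listener effectively holds one mostly idle, pinned connection that the idle client timeout eventually closes.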
