Monitoring Zitadel HTTP latencies via internal Prometheus metrics
In order to monitor and alert on the HTTP latencies of the Zitadel app/pods, we have configured a 95th-percentile alert on the http_server_duration_milliseconds_bucket internal Prometheus metric (which Zitadel itself exposes).
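The alert is based on a query along the following lines (the job selector, the per-pod grouping, and the 5-minute window here are illustrative placeholders, not our exact rule):

```promql
# p95 of Zitadel's own HTTP server latency over a 5-minute window, per pod.
# The job label, pod grouping, and window are placeholders for illustration.
histogram_quantile(
  0.95,
  sum by (le, pod) (
    rate(http_server_duration_milliseconds_bucket{job="zitadel"}[5m])
  )
)
```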
But the problem is that, from time to time, the alert fires (with the reported latency close to 10 seconds) without any apparent reason. If we look at the same request latency from the K8s Ingress perspective (the ingress that forwards the HTTP traffic to the Zitadel pods), we do not see such high latencies.
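For context, the comparison between the two vantage points looks roughly like this; the ingress-side metric name assumes the NGINX Ingress Controller and an ingress="zitadel" label, so adjust it to whatever controller and labels actually apply:

```promql
# Gap between Zitadel-reported p95 and ingress-reported p95, in milliseconds.
# The ingress metric assumes the NGINX Ingress Controller (which reports in
# seconds, hence the * 1000); both the metric name and labels are assumptions.
  histogram_quantile(0.95, sum by (le) (
    rate(http_server_duration_milliseconds_bucket{job="zitadel"}[5m])
  ))
-
  histogram_quantile(0.95, sum by (le) (
    rate(nginx_ingress_controller_request_duration_seconds_bucket{ingress="zitadel"}[5m])
  )) * 1000
```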
What could cause high latencies in just the http_server_duration_milliseconds_bucket metric while the K8s Ingress latencies stay much lower? Are there any concrete, known scenarios that could explain this?
What would be the best way to investigate and diagnose the cause? As of now we have the metrics and the URL endpoint, but no other details (e.g. source IP, the username used, etc.).
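So far the only extra detail we can pull out of the metric itself is a breakdown by its labels, roughly along these lines (the label names are assumptions based on the OpenTelemetry HTTP conventions and may differ between Zitadel versions, so they should be verified against the raw metric output first):

```promql
# p95 broken down by method and status code, to check whether the spikes are
# tied to a specific endpoint class. Label names follow the OpenTelemetry HTTP
# conventions and are assumptions; verify them against the exposed metrics.
histogram_quantile(
  0.95,
  sum by (le, http_method, http_status_code) (
    rate(http_server_duration_milliseconds_bucket{job="zitadel"}[5m])
  )
)
```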