Memory leak
Could someone give me some pointers as to why the ZITADEL container might leak memory? I'm running 2.51.0 with a custom OIDC login flow via v2beta/sessions.
Some additional info:
- each container runs co-located with CockroachDB on an instance with 32GB RAM; the ZITADEL container is restricted to 16GB
- sessions have no TTL
- auth tokens expire after 12h
- total RPS is around 4-5
- ~11 million events in eventstore.events2
- around 500 active users
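For reference, session creation in that flow is roughly the call sketched below; the payload shape follows the v2beta session service docs as I remember them and may differ slightly in 2.51.0, and the domain/PAT env vars are placeholders.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// Sketch of the session-create call used by the custom login flow.
// ZITADEL_DOMAIN and ZITADEL_PAT are placeholders; field names follow the
// v2beta session service docs and may differ slightly in 2.51.0.
func main() {
	payload := map[string]any{
		"checks": map[string]any{
			"user":     map[string]any{"loginName": "user@example.com"},
			"password": map[string]any{"password": "a-password"},
		},
	}
	body, _ := json.Marshal(payload)

	req, _ := http.NewRequest(http.MethodPost,
		os.Getenv("ZITADEL_DOMAIN")+"/v2beta/sessions", bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("ZITADEL_PAT"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out struct {
		SessionID    string `json:"sessionId"`
		SessionToken string `json:"sessionToken"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println("created session", out.SessionID)
}
```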
ZITADEL is actively allocating something on the heap that is not being collected by GC
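(Side note for anyone reproducing this: a quick way to tell real heap growth apart from the runtime merely holding on to freed pages is to watch the heap stats across GC cycles. The sketch below uses nothing beyond runtime.MemStats; ZITADEL's go_memstats_* Prometheus metrics expose the same numbers.)

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Throwaway heap sampler: if HeapObjects/HeapAlloc keep climbing across GC
// cycles, objects really are being kept alive (a leak); if HeapAlloc drops
// after GC while the process RSS stays high, the runtime is just holding on
// to freed pages rather than leaking.
func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("heap_objects=%d heap_alloc=%dMiB heap_released=%dMiB num_gc=%d\n",
			m.HeapObjects, m.HeapAlloc/(1<<20), m.HeapReleased/(1<<20), m.NumGC)
		time.Sleep(30 * time.Second)
	}
}
```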

Hm let me check our data
happy to provide additional information if need be
Haha, with a big enough timeframe we can see this as well, but our containers usually don't run long enough for it to turn into an issue.
@livio do you have a hunch or maybe @adlerhurst

Yeah, it's gradual, but leads to a crash eventually
That's awesome, thank you.
We've been experiencing performance problems for quite some time now and I thought high memory utilization could be part of the problem. Would be great to cross it off the list.
Can you share what perf. problems you see?
Usually memory is not the thing that we have/had issues with 😄
The most critical problem for us now is latency in session creation and patching. We are running a small service between the UI and ZITADEL, and this is what the ingress latency looks like.

The weirdest thing about it is that we have UAT, where we have the same setup, same resources, same data but percentiles look much better.

Uf that is a lot of latency....
Granted, UAT doesn't receive as much traffic, which is why I was thinking about resource utilization.
In our cloud we see a p99 of less than 350ms usually
Right, we used to see latency like that too, but then it slowly grew 😦
I'm in the middle of enabling tracing, will share more details
OK, I think some query might be having a bad time in CRDB
I would start looking at the most expensive queries there
oh..

Actually, it looks like a query for the Events page in the console; I think it's unrelated to sessions?
Hm that query is not executed often, right?
thankfully 🙂
Other than that I don't see anything suspicious

hm that is weird, do you mind sharing the query view sorted by cost?
This one?

does your zitadel go OOM from time to time or is it just not returning the memory?
what happens if you sort that by transaction time or cpu time?
does your zitadel go OOM from time to time or is it just not returning the memory?
It goes OOM
thanks
statements sorted by cpu or tx time look more or less the same. please disregard the first two statements, there is another ongoing investigation on why the project grant has been dropped :/

Maybe I can narrow it down for you somehow?
I wonder about the latency of these here. Esp. the 400ms

Are your DB nodes close together?
Supposed to be the same region, yes
Ok so we can assume <2ms I would guess
Let me check, just to be on a safer side
it's pretty good

oh well, yes 😄
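(If anyone wants to reproduce that check from the app host, a throwaway probe like the sketch below does the job; the DSN is a placeholder and it assumes the pgx stdlib driver, since CockroachDB speaks the Postgres wire protocol.)

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // CockroachDB uses the Postgres wire protocol
)

// Placeholder DSN; point it at the same node and database ZITADEL uses.
const dsn = "postgresql://zitadel@localhost:26257/zitadel?sslmode=disable"

func main() {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Time a handful of trivial round trips; anything consistently above a few
	// milliseconds points at network distance rather than query cost.
	for i := 0; i < 10; i++ {
		start := time.Now()
		var one int
		if err := db.QueryRowContext(context.Background(), "SELECT 1").Scan(&one); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("round trip %d: %s\n", i+1, time.Since(start))
	}
}
```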
So I've finally got to traces
It seems that the majority of our latency is coming from TriggerOrgProjection.
For example, when issuing a token via oauth/v2/token, TriggerOrgProjection is called twice, which adds more than 2 seconds (!!!) of latency.
I noticed that in 2.53.x OrgByID has an alternative way of fetching the org without triggering a re-projection, but I can't upgrade just now to test it, because there were other changes that are breaking our setup.
Is there a way to improve the performance of re-projection without upgrading to 2.53?
Also, what are the drawbacks of enabling ImprovedPerformanceTypeOrgByID?
Well yeah, we made multiple optimizations on that end, so an update should help at some point. Out of curiosity, what did we break?
https://github.com/zitadel/zitadel/issues/8207
we kinda relied on users being able to set up their own MFA and used session tokens for that. Now we need to create a machine user with an instance-wide Org Owner role and adjust our auth service.
GitHub: [Bug]: can't start the registration of TOTP generator · Issue #8207
No big deal though, if the performance optimization is working it's going to be worth it
there is another ongoing investigation on why the project grant has been dropped :/
Found another bug by the way, in the console this time 🙂. Will file it today. Any update on the memory leak?
Mine is similar: get auth request -> get user for checking state -> check existing sessions -> create session -> bind request
@Erik you should try enabling ImprovedPerformanceTypeOrgByID, it helped us tremendously
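(For anyone following along, a rough sketch of flipping that flag via the API. The v2beta feature-service path and the enum spelling are assumptions on my part; verify them against the feature service docs for your version. Domain and PAT are placeholders.)

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"os"
)

// Sketch of enabling the improved-performance flag through the feature service.
// NOTE: the endpoint path and the enum value below are assumptions from the
// v2beta feature API and should be checked against your version's docs.
func main() {
	body := []byte(`{"improvedPerformance": ["IMPROVED_PERFORMANCE_ORG_BY_ID"]}`)

	req, _ := http.NewRequest(http.MethodPut,
		os.Getenv("ZITADEL_DOMAIN")+"/v2beta/features/instance", bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("ZITADEL_PAT"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("feature update status:", resp.Status)
}
```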
@livio fwiw, I think running pprof on the test environment (or even locally) might be faster/easier.
Earlier I posted a graph with the go_memstats_heap_objects metric plotted over time, and it's clear that there are a lot of allocations happening. With pprof it's trivial to identify the offending function.
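(For reference, exposing pprof in a Go service you control is a one-liner; this is a generic sketch, not ZITADEL's own wiring.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

// Expose pprof on a side port, then grab a heap profile with e.g.
//   go tool pprof -inuse_objects http://localhost:6060/debug/pprof/heap
func main() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```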
hehe, you are right, but we wanted to introduce continuous profiling anyway in our cloud service to have statistical data available 😄
(GCP profiler is free to use, which is weird but nice)
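(For reference, a minimal continuous-profiling setup with the Cloud Profiler Go agent; service name and version are placeholders, and on GCP the project ID is usually picked up from the environment.)

```go
package main

import (
	"log"

	"cloud.google.com/go/profiler"
)

// Start the Cloud Profiler agent; it samples CPU and heap in the background
// and streams profiles to GCP, so nothing else needs to be wired up.
func main() {
	if err := profiler.Start(profiler.Config{
		Service:        "zitadel",  // placeholder service name
		ServiceVersion: "2.51.0",   // placeholder version
	}); err != nil {
		log.Fatalf("failed to start profiler: %v", err)
	}

	select {} // keep the process alive; a real service would continue its work here
}
```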
Hey, I was wondering whether you got anywhere with that problem, because on our side it hasn't gone anywhere. We had to throw more resources at it so it doesn't go down as often.
