sp132
sp132•10mo ago

Memory leak

Could someone give me some pointers as to why the ZITADEL container might leak memory? I'm running 2.51.0 with a custom OIDC login flow via v2beta/sessions. Some additional info:
- each container runs co-located with CockroachDB on an instance with 32GB RAM; the zitadel container is restricted to 16GB
- sessions have no TTL
- auth tokens expire after 12h
- total RPS is around 4-5
- ~11 million events in eventstore.events2
- around 500 active users
No description
46 Replies
sp132
sp132OP•10mo ago
ZITADEL is actively allocating something on the heap that is not being collected by GC
No description
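(For context, this is based on watching Go's runtime heap metrics over time. A minimal, generic sketch of that kind of check, not ZITADEL code, would be:)
```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Periodically print the live heap object count and heap size.
// If HeapObjects keeps climbing across GC cycles while traffic stays
// flat, objects are being retained somewhere rather than collected.
func main() {
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		fmt.Printf("heap_objects=%d heap_alloc_mb=%d num_gc=%d\n",
			m.HeapObjects, m.HeapAlloc/1024/1024, m.NumGC)
		time.Sleep(30 * time.Second)
	}
}
```
The go_memstats_heap_objects metric exposed by the standard Prometheus Go collector is the same counter, so a dashboard on that metric tells the same story.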
FFO
FFO•10mo ago
Hm let me check our data
sp132
sp132OP•10mo ago
happy to provide additional information if need be
FFO
FFO•10mo ago
Haha, with a big enough timeframe we can see this as well, but our containers usually don't run long enough for it to become an issue. @livio do you have a hunch, or maybe @adlerhurst?
No description
sp132
sp132OP•10mo ago
Yeah, it's gradual, but leads to a crash eventually
Unknown User
Unknown User•10mo ago
Message Not Public
sp132
sp132OP•10mo ago
That's awesome, thank you. We've been experiencing performance problems for quite some time now and I thought high memory utilization could be part of the problem. Would be great to rule it out.
FFO
FFO•10mo ago
Can you share what perf. problems you see? Usually memory is not the thing that we have/had issues with 😄
sp132
sp132OP•10mo ago
The most critical problem for us right now is latency in session creation and patching. We are running a small service between the UI and ZITADEL, and this is what the ingress latency looks like.
No description
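(The percentiles come from a histogram in that intermediary service. As an illustration only, a middleware that produces this kind of data with Prometheus client_golang could look like the sketch below; names and ports are placeholders, not our actual code.)
```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of end-to-end handler latency, labelled by route.
var ingressLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "ingress_request_duration_seconds",
	Help:    "Latency of requests proxied towards ZITADEL.",
	Buckets: prometheus.DefBuckets,
}, []string{"route"})

// observe wraps a handler and records how long each request takes.
func observe(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		ingressLatency.WithLabelValues(route).Observe(time.Since(start).Seconds())
	})
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.Handle("/sessions", observe("sessions", http.NotFoundHandler())) // placeholder handler
	http.ListenAndServe(":9090", nil)
}
```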
sp132
sp132OP•10mo ago
The weirdest thing about it is that we have a UAT environment with the same setup, same resources, and same data, but the percentiles there look much better.
No description
FFO
FFO•10mo ago
Uf that is a lot of latency....
sp132
sp132OP•10mo ago
Granted, UAT doesn't receive as much traffic, which is why I was thinking about resource utilization.
FFO
FFO•10mo ago
In our cloud we usually see a p99 of less than 350ms.
sp132
sp132OP•10mo ago
Right, we used to see latency like that too, but then it slowly grew 😦 I'm in the middle of enabling tracing, will share more details
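(ZITADEL's own tracing is switched on through its configuration; what I can sketch here is the generic way to instrument the service in between so its outgoing calls carry trace context and show up as client spans next to ZITADEL's. Endpoints and names below are placeholders.)
```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans to a local OTLP collector (endpoint/transport are assumptions).
	exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Wrap the HTTP client used to call ZITADEL so trace context is propagated
	// and every outgoing call gets its own client span.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
	_ = client // use this client for the v2beta/sessions calls
}
```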
FFO
FFO•10mo ago
OK, I think some query might be having a bad time in CRDB. I would start by looking at the most expensive queries there.
sp132
sp132OP•10mo ago
oh..
No description
sp132
sp132OP•10mo ago
Actually, it looks like a query for the Events page in the console. I think it's unrelated to sessions?
FFO
FFO•10mo ago
Hm that query is not executed often, right?
sp132
sp132OP•10mo ago
thankfully 🙂
sp132
sp132OP•10mo ago
Other than that I don't see anything suspicious
No description
FFO
FFO•10mo ago
hm that is weird, do you mind sharing the query view sorted by cost?
sp132
sp132OP•10mo ago
This one?
No description
FFO
FFO•10mo ago
does your zitadel go OOM from time to time or is it just not returning the memory? what happens if you sort that by transaction time or cpu time?
sp132
sp132OP•10mo ago
does your zitadel go OOM from time to time or is it just not returning the memory?
It goes OOM
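(Not a fix for a genuine leak, but one mitigation worth noting for readers: since Go 1.19 a soft memory limit makes the GC work harder before the kernel OOM-kills the container. In a ZITADEL deployment you would set the GOMEMLIMIT environment variable on the container; the in-code equivalent, shown only as an illustration, is:)
```go
package main

import (
	"os"
	"runtime/debug"
)

// Sketch: cap the Go heap below the container limit so the GC runs more
// aggressively instead of letting the kernel OOM-kill the process.
// Equivalent to exporting GOMEMLIMIT=14GiB on a 16GiB container.
func init() {
	if os.Getenv("GOMEMLIMIT") == "" {
		debug.SetMemoryLimit(14 << 30) // ~14 GiB soft limit, leaves headroom
	}
}

func main() {}
```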
FFO
FFO•10mo ago
thanks
sp132
sp132OP•10mo ago
Statements sorted by CPU or transaction time look more or less the same. Please disregard the first two statements, there is another ongoing investigation on why the project grant has been dropped :/
No description
sp132
sp132OP•10mo ago
Maybe I can narrow it down for you somehow?
FFO
FFO•10mo ago
I wonder about the latency of these here. Esp. the 400ms
No description
FFO
FFO•10mo ago
Are your DB nodes close together?
sp132
sp132OP•10mo ago
Supposed to be the same region, yes
FFO
FFO•10mo ago
Ok so we can assume <2ms I would guess
sp132
sp132OP•10mo ago
Let me check, just to be on the safe side
sp132
sp132OP•10mo ago
it's pretty good
No description
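(For reference, a quick way to sanity-check the app-to-CockroachDB round trip independently of ZITADEL is a trivial query loop; the DSN below is a placeholder.)
```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // CockroachDB speaks the Postgres wire protocol
)

// Measure application-to-CRDB round-trip time with a trivial query.
func main() {
	db, err := sql.Open("pgx", "postgresql://zitadel@localhost:26257/zitadel?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	for i := 0; i < 5; i++ {
		start := time.Now()
		if _, err := db.ExecContext(context.Background(), "SELECT 1"); err != nil {
			panic(err)
		}
		fmt.Printf("round trip %d: %v\n", i+1, time.Since(start))
	}
}
```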
FFO
FFO•10mo ago
oh well, yes 😄
sp132
sp132OP•10mo ago
So I've finally got to traces
sp132
sp132OP•10mo ago
It seems that the majority of our latency is coming from TriggerOrgProjection. For example, when issuing a token via oauth/v2/token, TriggerOrgProjection is called twice, which adds more than 2 seconds (!!!) of latency.
No description
sp132
sp132OP•10mo ago
I noticed that in 2.53.x OrgByID has an alternative way of fetching the org without triggering a re-projection, but I can't upgrade just now to test it, because there are other changes that break our setup. Is there a way to improve the performance of re-projection without upgrading to 2.53? Also, what are the drawbacks of enabling ImprovedPerformanceTypeOrgByID?
FFO
FFO•10mo ago
Well yeah, we made multiple optimizations on that end, so an update should help at some point. Out of curiosity, what did we break?
sp132
sp132OP•10mo ago
https://github.com/zitadel/zitadel/issues/8207 We kinda relied on the user being able to set up its own MFA and used session tokens for that. Now we need to create a machine user with an instance-wide Org Owner role and adjust our auth service.
GitHub
[Bug]: can't start the registration of TOTP generator · Issue #8207...
Preflight Checklist I could not find a solution in the documentation, the existing issues or discussions I have joined the ZITADEL chat Environment Self-hosted Version 2.54.4 Database CockroachDB D...
sp132
sp132OP•10mo ago
No big deal though; if the performance optimization works, it's going to be worth it.
there is another ongoing investigation on why the project grant has been dropped :/
Found another bug by the way, in the console this time 🙂. Will file it today. Any update on the memory leak?
Unknown User
Unknown User•10mo ago
Message Not Public
sp132
sp132OP•10mo ago
Mine is similar: get auth request -> get user for checking state -> check existing sessions -> create session -> bind request. @Erik you should try enabling ImprovedPerformanceTypeOrgByID, it helped us tremendously.
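(The flag is toggled through ZITADEL's instance feature API. A rough sketch of such a call follows; the exact path, field name, and enum value are assumptions on my part and should be checked against the feature service docs for your version.)
```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// Sketch only: enable the OrgByID performance improvement via the instance
// feature API. The endpoint path and JSON field/enum names below are
// assumptions and should be verified against the ZITADEL API docs.
func main() {
	body := []byte(`{"improvedPerformance": ["IMPROVED_PERFORMANCE_ORG_BY_ID"]}`)
	req, err := http.NewRequest(http.MethodPut,
		"https://zitadel.example.com/v2beta/features/instance", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer <service-user-pat>") // placeholder token
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```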
sp132
sp132OP•10mo ago
No description
sp132
sp132OP•10mo ago
@livio fwiw, I think running pprof on the test environment (or even locally) might be faster/easier. Earlier I posted a graph with the go_memstats_heap_objects metric plotted over time, and it's clear that there are a lot of allocations happening. With pprof it's trivial to identify the offending function.
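(For readers who haven't used it: the standard pattern is to expose net/http/pprof in a test build and diff two heap snapshots taken a few minutes apart. A minimal sketch:)
```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

// Expose pprof on a private port; then compare two heap snapshots taken a
// few minutes apart to see which call sites keep accumulating objects:
//
//	go tool pprof -base heap1.pb.gz http://localhost:6060/debug/pprof/heap
func main() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```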
FFO
FFO•10mo ago
Hehe, you are right, but we wanted to introduce continuous profiling in our cloud service anyway to have statistical data available 😄 (GCP profiler is free to use, which is weird but nice)
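(The Go agent for Cloud Profiler is essentially a one-liner at startup; a minimal sketch with placeholder service name and version:)
```go
package main

import (
	"log"

	"cloud.google.com/go/profiler"
)

// Start continuous CPU/heap profiling with Google Cloud Profiler.
// Service name and version are placeholders; on GCE/GKE the project
// is picked up from the metadata server.
func main() {
	if err := profiler.Start(profiler.Config{
		Service:        "zitadel",
		ServiceVersion: "2.51.0",
	}); err != nil {
		log.Fatalf("failed to start profiler: %v", err)
	}

	select {} // the rest of the application would run here
}
```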
sp132
sp132OP•6mo ago
Hey, I was wondering whether you ever got to the bottom of that problem, because on our side it didn't go anywhere. We had to throw more resources at it so it doesn't go down as often.
No description
