sp132
sp132•10mo ago

Memory leak

Could someone give me some pointers as to why the ZITADEL container might leak memory? I'm running 2.51.0 with a custom OIDC login flow via v2beta/sessions. Some additional info:
- each container runs co-located with CockroachDB on an instance with 32GB RAM; the zitadel container is restricted to 16GB
- sessions have no TTL
- auth tokens expire after 12h
- total RPS is around 4-5
- ~11 million events in eventstore.events2
- around 500 active users
No description
46 Replies
sp132
sp132OP•10mo ago
ZITADEL is actively allocating something on the heap that is not being collected by GC
No description
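(For context, this is based on watching Go's runtime heap metrics over time. A minimal, generic sketch of that kind of check, not ZITADEL code, would be:)
```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Periodically print the live heap object count and heap size.
// If HeapObjects keeps climbing across GC cycles while traffic stays
// flat, objects are being retained somewhere rather than collected.
func main() {
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		fmt.Printf("heap_objects=%d heap_alloc_mb=%d num_gc=%d\n",
			m.HeapObjects, m.HeapAlloc/1024/1024, m.NumGC)
		time.Sleep(30 * time.Second)
	}
}
```
The go_memstats_heap_objects metric exposed by the standard Prometheus Go collector is the same counter, so a dashboard on that metric tells the same story.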
FFO
FFO•10mo ago
Hm let me check our data
sp132
sp132OP•10mo ago
happy to provide additional information if need be
FFO
FFO•10mo ago
Haha, with a big enough timeframe we can see this as well, but our containers usually don't run long enough for it to become an issue. @livio do you have a hunch, or maybe @adlerhurst?
No description
sp132
sp132OP•10mo ago
Yeah, it's gradual, but leads to a crash eventually
Unknown User
Unknown User•10mo ago
Message Not Public
sp132
sp132OP•10mo ago
That's awesome, thank you. We've been experiencing performance problems for quite some time now and I thought high memory utilization could be part of the problem. Would be great to rule it out.
FFO
FFO•10mo ago
Can you share what perf. problems you see? Usually memory is not the thing that we have/had issues with 😄
sp132
sp132OP•10mo ago
The most critical problem for us right now is latency in session creation and patching. We are running a small service between the UI and ZITADEL, and this is what the ingress latency looks like.
No description
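(The percentiles come from a histogram in that intermediary service. As an illustration only, a middleware that produces this kind of data with Prometheus client_golang could look like the sketch below; names and ports are placeholders, not our actual code.)
```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of end-to-end handler latency, labelled by route.
var ingressLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "ingress_request_duration_seconds",
	Help:    "Latency of requests proxied towards ZITADEL.",
	Buckets: prometheus.DefBuckets,
}, []string{"route"})

// observe wraps a handler and records how long each request takes.
func observe(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		ingressLatency.WithLabelValues(route).Observe(time.Since(start).Seconds())
	})
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.Handle("/sessions", observe("sessions", http.NotFoundHandler())) // placeholder handler
	http.ListenAndServe(":9090", nil)
}
```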
sp132
sp132OP•10mo ago
The weirdest thing about it is that we have a UAT environment with the same setup, same resources, and same data, but the percentiles there look much better.
No description
FFO
FFO•10mo ago
Uf that is a lot of latency....
sp132
sp132OP•10mo ago
Granted, UAT doesn't receive as much traffic, which is why I was thinking about resource utilization.
FFO
FFO•10mo ago
In our cloud we usually see a p99 of less than 350ms.
sp132
sp132OP•10mo ago
Right, we used to see latency like that too, but then it slowly grew 😦 I'm in the middle of enabling tracing, will share more details
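(ZITADEL's own tracing is switched on through its configuration; what I can sketch here is the generic way to instrument the service in between so its outgoing calls carry trace context and show up as client spans next to ZITADEL's. Endpoints and names below are placeholders.)
```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans to a local OTLP collector (endpoint/transport are assumptions).
	exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Wrap the HTTP client used to call ZITADEL so trace context is propagated
	// and every outgoing call gets its own client span.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
	_ = client // use this client for the v2beta/sessions calls
}
```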
FFO
FFO•10mo ago
OK, I think some query might be having a bad time in CRDB. I would start by looking at the most expensive queries there.
sp132
sp132OP•10mo ago
oh..
No description
sp132
sp132OP•10mo ago
Actually, it looks like a query for the Events page in the console. I think it's unrelated to sessions?
FFO
FFO•10mo ago
Hm that query is not executed often, right?
sp132
sp132OP•10mo ago
thankfully 🙂
sp132
sp132OP•10mo ago
Other than that I don't see anything suspicious
No description
FFO
FFO•10mo ago
hm that is weird, do you mind sharing the query view sorted by cost?
sp132
sp132OP•10mo ago
This one?
No description
FFO
FFO•10mo ago
does your zitadel go OOM from time to time or is it just not returning the memory? what happens if you sort that by transaction time or cpu time?
sp132
sp132OP•10mo ago
does your zitadel go OOM from time to time or is it just not returning the memory?
It goes OOM
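(Not a fix for a genuine leak, but one mitigation worth noting for readers: since Go 1.19 a soft memory limit makes the GC work harder before the kernel OOM-kills the container. In a ZITADEL deployment you would set the GOMEMLIMIT environment variable on the container; the in-code equivalent, shown only as an illustration, is:)
```go
package main

import (
	"os"
	"runtime/debug"
)

// Sketch: cap the Go heap below the container limit so the GC runs more
// aggressively instead of letting the kernel OOM-kill the process.
// Equivalent to exporting GOMEMLIMIT=14GiB on a 16GiB container.
func init() {
	if os.Getenv("GOMEMLIMIT") == "" {
		debug.SetMemoryLimit(14 << 30) // ~14 GiB soft limit, leaves headroom
	}
}

func main() {}
```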
FFO
FFO•10mo ago
thanks
sp132
sp132OP•10mo ago
Statements sorted by CPU or transaction time look more or less the same. Please disregard the first two statements, there is another ongoing investigation on why the project grant has been dropped :/
No description
sp132
sp132OP•10mo ago
Maybe I can narrow it down for you somehow?
FFO
FFO•10mo ago
I wonder about the latency of these here. Esp. the 400ms
No description
FFO
FFO•10mo ago
Are your DB nodes close together?
sp132
sp132OP•10mo ago
Supposed to be the same region, yes
FFO
FFO•10mo ago
Ok so we can assume <2ms I would guess
sp132
sp132OP•10mo ago
Let me check, just to be on the safe side
sp132
sp132OP•10mo ago
it's pretty good
No description
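(For reference, a quick way to sanity-check the app-to-CockroachDB round trip independently of ZITADEL is a trivial query loop; the DSN below is a placeholder.)
```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // CockroachDB speaks the Postgres wire protocol
)

// Measure application-to-CRDB round-trip time with a trivial query.
func main() {
	db, err := sql.Open("pgx", "postgresql://zitadel@localhost:26257/zitadel?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	for i := 0; i < 5; i++ {
		start := time.Now()
		if _, err := db.ExecContext(context.Background(), "SELECT 1"); err != nil {
			panic(err)
		}
		fmt.Printf("round trip %d: %v\n", i+1, time.Since(start))
	}
}
```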
FFO
FFO•10mo ago
oh well, yes 😄
sp132
sp132OP•10mo ago
So I've finally got to traces
sp132
sp132OP•10mo ago
It seems that the majority of our latency is coming from TriggerOrgProjection. For example, when issuing a token via oauth/v2/token, TriggerOrgProjection is called twice, which adds more than 2 seconds (!!!) of latency.
No description
sp132
sp132OP•10mo ago
I noticed that in 2.53.x OrgByID has an alternative way of fetching the org without triggering a re-projection, but I can't upgrade just now to test it, because there are other changes that break our setup. Is there a way to improve the performance of re-projection without upgrading to 2.53? Also, what are the drawbacks of enabling ImprovedPerformanceTypeOrgByID?
FFO
FFO•10mo ago
Well yeah, we made multiple optimizations on that end, so an update should help at some point. Out of curiosity, what did we break?
sp132
sp132OP•10mo ago
https://github.com/zitadel/zitadel/issues/8207 We kinda relied on the user being able to set up its own MFA and used session tokens for that. Now we need to create a machine user with an instance-wide Org Owner role and adjust our auth service.
GitHub
[Bug]: can't start the registration of TOTP generator · Issue #8207...
Preflight Checklist I could not find a solution in the documentation, the existing issues or discussions I have joined the ZITADEL chat Environment Self-hosted Version 2.54.4 Database CockroachDB D...
sp132
sp132OP•10mo ago
No big deal though; if the performance optimization works, it's going to be worth it.
there is another ongoing investigation on why the project grant has been dropped :/
Found another bug by the way, in the console this time 🙂. Will file it today. Any update on the memory leak?
Unknown User
Unknown User•10mo ago
Message Not Public
sp132
sp132OP•10mo ago
Mine is similar: get auth request -> get user for checking state -> check existing sessions -> create session -> bind request. @Erik you should try enabling ImprovedPerformanceTypeOrgByID, it helped us tremendously.
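(The flag is toggled through ZITADEL's instance feature API. A rough sketch of such a call follows; the exact path, field name, and enum value are assumptions on my part and should be checked against the feature service docs for your version.)
```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// Sketch only: enable the OrgByID performance improvement via the instance
// feature API. The endpoint path and JSON field/enum names below are
// assumptions and should be verified against the ZITADEL API docs.
func main() {
	body := []byte(`{"improvedPerformance": ["IMPROVED_PERFORMANCE_ORG_BY_ID"]}`)
	req, err := http.NewRequest(http.MethodPut,
		"https://zitadel.example.com/v2beta/features/instance", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer <service-user-pat>") // placeholder token
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```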
sp132
sp132OP•10mo ago
No description
sp132
sp132OP•10mo ago
@livio fwiw, I think running pprof on the test environment (or even locally) might be faster/easier. Earlier I posted a graph with the go_memstats_heap_objects metric plotted over time, and it's clear that there are a lot of allocations happening. With pprof it's trivial to identify the offending function.
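(For readers who haven't used it: the standard pattern is to expose net/http/pprof in a test build and diff two heap snapshots taken a few minutes apart. A minimal sketch:)
```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

// Expose pprof on a private port; then compare two heap snapshots taken a
// few minutes apart to see which call sites keep accumulating objects:
//
//	go tool pprof -base heap1.pb.gz http://localhost:6060/debug/pprof/heap
func main() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```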
FFO
FFO•10mo ago
Hehe, you are right, but we wanted to introduce continuous profiling in our cloud service anyway to have statistical data available 😄 (GCP profiler is free to use, which is weird but nice)
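(The Go agent for Cloud Profiler is essentially a one-liner at startup; a minimal sketch with placeholder service name and version:)
```go
package main

import (
	"log"

	"cloud.google.com/go/profiler"
)

// Start continuous CPU/heap profiling with Google Cloud Profiler.
// Service name and version are placeholders; on GCE/GKE the project
// is picked up from the metadata server.
func main() {
	if err := profiler.Start(profiler.Config{
		Service:        "zitadel",
		ServiceVersion: "2.51.0",
	}); err != nil {
		log.Fatalf("failed to start profiler: %v", err)
	}

	select {} // the rest of the application would run here
}
```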
sp132
sp132OP•6mo ago
Hey, I was wondering whether you ever got to the bottom of that problem, because on our side it didn't go anywhere. We had to throw more resources at it so it doesn't go down as often.
No description
