Slow migration from v2.66.18 to v3.3.5 (1.5M users)
Hi! We are using Zitadel in production on Kubernetes, installed via the Helm chart. We are currently on version v2.66.18 but are planning to upgrade to v3.3.5 because we are experiencing database performance issues (slow responses on projections.users13 queries) that are currently "solved" with a large AWS RDS Postgres instance (db.m6g.8xlarge). We also have our own custom login frontend.
We have an organization with 1.5M users (and growing). Yes, it's really big.
One of the problems we faced with the upgrade is the slow migration time (approximately 24 hours), and we realized that it's caused by a slow query (~12 s). Every 12 seconds, the migration adds only a few users to projections.users14. We are conducting pre-migration tests with a dump of the production database in another environment, where the RDS instance type is db.m6g.4xlarge. We don't see high CPU or I/O usage in the database during the migration (nor in the Setup Job pod): CPU usage is below 10% and IOPS are quiet, but in AWS Performance Insights we see this slow query:
SELECT created_at, event_type, "sequence", "position", payload, creator, "owner",
       instance_id, aggregate_type, aggregate_id, revision
FROM eventstore.events2
WHERE instance_id = $1
  AND aggregate_type = ANY($2)
  AND event_type = ANY($3)
  AND "position" >= $4
  AND "position" < (
        SELECT COALESCE(EXTRACT(EPOCH FROM min(xact_start)), EXTRACT(EPOCH FROM now()))
        FROM pg_stat_activity
        WHERE datname = current_database()
          AND application_name = ANY($5)
          AND state <> 'idle')
ORDER BY "position", in_tx_order
LIMIT $6 OFFSET $7
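As a side note, one way to check on the test dump whether this query is using an index or falling back to a sort is to run it through EXPLAIN. The sketch below substitutes example values for the bind parameters (the instance ID, event types, position, and limit are all placeholders, not the real parameters the Setup Job binds) and drops the pg_stat_activity subselect for simplicity:
EXPLAIN (ANALYZE, BUFFERS)
SELECT created_at, event_type, "sequence", "position", payload, creator, "owner",
       instance_id, aggregate_type, aggregate_id, revision
FROM eventstore.events2
WHERE instance_id = '<YOUR-INSTANCE-ID>'           -- placeholder
  AND aggregate_type = ANY(ARRAY['user'])          -- placeholder
  AND event_type = ANY(ARRAY['user.human.added'])  -- placeholder
  AND "position" >= 0                              -- placeholder
ORDER BY "position", in_tx_order
LIMIT 200 OFFSET 0;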
Our eventstore.events2 table is really large:
=> select count(*) from eventstore.events2 ;
count
----------
96100444
How can we upgrade faster? Can we add any index to this table? Does increasing Projections.BulkLimit raise the LIMIT of that query so it processes more users at once?
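For the index question, a sketch of an index matching that query's filter and sort columns is shown below. This is only an illustration with a hypothetical index name; events2 already ships with indexes, so it's worth checking pg_indexes for the table first:
-- Hypothetical index, built CONCURRENTLY to avoid blocking writes
CREATE INDEX CONCURRENTLY IF NOT EXISTS events2_projection_catchup_idx
    ON eventstore.events2 (instance_id, aggregate_type, event_type, "position", in_tx_order);
-- List the indexes that already exist on the table
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'eventstore' AND tablename = 'events2';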
Thanks!
These are our settings for the DB:
Database:
  Postgres:
    Host: <REDACTED>
    Port: 5432
    Database: zitadel_migration_test
    MaxOpenConns: 100
    MaxIdleConns: 100
    MaxConnLifetime: 30m
    MaxConnIdleTime: 5m
Hi @darioef
That's indeed an interesting case.
I assume that there are many users with many audit events (user.token.v2.added, user.machine.secret.check.succeeded, user.machine.secret.check.failed), and if you are not interested in them we can remove them.
You can check the amount of events with the following query:
SELECT aggregate_id AS user_id, count(*)
FROM eventstore.events2
WHERE instance_id = '<YOUR-INSTANCE-ID>'
  AND aggregate_type = 'user'
  AND event_type IN ('user.token.v2.added', 'user.machine.secret.check.succeeded', 'user.machine.secret.check.failed')
GROUP BY aggregate_id;
I will provide a runbook for deleting them if we can delete them.
Hi @adlerhurst, thanks for your reply. I executed the query and it seems there are a lot of users with those events.
If I count all these events it gives me:
So these events account for about 10% of the eventstore.events2 table.
Thanks for your help.
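For reference, the total for those three event types can be checked with a variant of the earlier query without the per-user grouping (the actual figure from the screenshot is not reproduced here):
SELECT count(*)
FROM eventstore.events2
WHERE instance_id = '<YOUR-INSTANCE-ID>'
  AND aggregate_type = 'user'
  AND event_type IN ('user.token.v2.added', 'user.machine.secret.check.succeeded', 'user.machine.secret.check.failed');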
Okay, as this is a dump of the database, you could try to remove them and see if the performance gets better.
Do you have any query insights for the database that show which query is causing the load?
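For the test dump, removing them could look roughly like the sketch below. This is not the official runbook mentioned above, so only run something like it against a disposable copy of the database, never production:
-- Sketch only: drop the three audit event types from the test dump
DELETE FROM eventstore.events2
WHERE instance_id = '<YOUR-INSTANCE-ID>'
  AND aggregate_type = 'user'
  AND event_type IN ('user.token.v2.added', 'user.machine.secret.check.succeeded', 'user.machine.secret.check.failed');
-- Reclaim space and refresh statistics afterwards
VACUUM (ANALYZE) eventstore.events2;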
Thanks. During the migration, the only query that seems to load the DB is the one I put in the first post.
I’ll try removing these records and run another migration.
Is there any way to tell the Setup Job to build the users14 table faster by increasing the batch of users it takes from events2?
I think you can set the bulk limit for each projection with this parameter https://github.com/zitadel/zitadel/blob/main/cmd/defaults.yaml#L382
https://github.com/zitadel/zitadel/blob/main/cmd/defaults.yaml#L390
Not sure what the exact name is though 😄
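For reference, assuming the runtime config mirrors the defaults.yaml structure linked above, the setting would sit roughly here (the value is only an example):
Projections:
  # Maximum number of events a projection handler fetches per iteration; example value only
  BulkLimit: 2000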
Yes, I did some tests increasing Projections.BulkLimit to larger values (1000, 10000, 100000, etc.), but it didn't help much.
I dropped those records but the migration is still slow.
CPU and IOPS on database are low.



Hm that is weird, that latency is a little high.
Can you share the whole query with me?
Sorry for the delay, sure, this is the query:
Can you please share the result of the following query?
Yes! Sorry again for the delay:
@darioef -- I hate to bug you, but it's relevant. I see all your user IDs appear to be UUIDv4. How do you enforce this?
- UI registration appears to assign a random Zitadel ID and not a UUID.
- If I disable registration but allow Google connections, users can still link an external account and effectively create an account, which would break a UUID schema for users.
We currently have UUIDs in one system and are migrating to Zitadel, but I'm worried about NEW users created via the "auto-linking" feature when using Google or a different IdP. If they get an ID that isn't a UUID, it would break our existing system.
Any suggestions/feedback on how you have handled this? -- We are about to import 1.2 million users from MySQL into a fresh v4 setup, and I'm not sure about best practices for user IDs, but I know indexing a TEXT field as a primary key is a bad idea.