darioef
darioef4w ago

Slow migration from v2.66.18 to v3.3.5 (1.5M users)

Hi! We are using Zitadel in production on Kubernetes, installed via Helm Chart. We are currently on version v2.66.18 but are planning to upgrade to v3.3.5 because we are experiencing database performance issues (slow response on projection.users13 queries) that are currently "solved" with a large AWS RDS Postgres database (db.m6g.8xlarge). We also have our custom frontend for login. We have an organization with 1.5M users (and growing). Yes, it's really big. One of the problems we faced with the upgrade is the slow migration time (approximately 24 hours), and we realized that it's caused by a slow query (~12s). Every 12 seconds, the migration adds only a few users to projections.users14. We are conducting pre-migration tests with a dump of the production database in another environment, and the RDS instance type we are using there is db.m6g.4xlarge. We don't see high CPU or IO usage in the database during the migration (nor in the Setup Job Pod). CPU usage is less than 10% and IOPS are quiet, but in AWS Performance Insights we see this slow query: SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision FROM eventstore.events2 WHERE instance_id = $1 AND aggregate_type = ANY($2) AND event_type = ANY($3) AND "position" >= $4 AND "position" < (SELECT COALESCE(EXTRACT(EPOCH FROM min(xact_start)), EXTRACT(EPOCH FROM now())) FROM pg_stat_activity WHERE datname = current_database() AND application_name = ANY($5) AND state <> 'idle') ORDER BY "position", in_tx_order LIMIT $6 OFFSET $7 Our eventstore.events2 table is really large: => select count(*) from eventstore.events2 ; count ---------- 96100444 How can we upgrade faster? Can we add any index to this table? Does Projections.BulkLimit help with the limit of that query to process more users at once? Thanks!
13 Replies
darioef
darioefOP4w ago
These are our settings for the DB: Database: Postgres: Host: <REDACTED> Port: 5432 Database: zitadel_migration_test MaxOpenConns: 100 MaxIdleConns: 100 MaxConnLifetime: 30m MaxConnIdleTime: 5m
adlerhurst
adlerhurst4w ago
Hi @darioef Thats indeed an interesting case. I assume that there are many users with many audit events (user.token.v2.added, user.machine.secret.check.succeeded, user.machine.secret.check.failed) and if you are not interested in them we can remove them. You can check the amount of events with the following query: SELECT aggregate_id AS user_id, count(*) FROM eventstore.events2 WHERE instance_id = '<YOUR-INSTANCE-ID>' AND aggregate_type = 'user' AND event_type IN ('user.token.v2.added', 'user.machine.secret.check.succeeded', 'user.machine.secret.check.failed') GROUP BY aggregate_id; I will provide a runbook for deleting them if we can delete them
darioef
darioefOP4w ago
Hi @adlerhurst , thanks for your reply. I executed the query and seems that there are too many users with that events.
SELECT aggregate_id AS user_id, count(*)
FROM eventstore.events2
WHERE instance_id = '298583134000167036'
AND aggregate_type = 'user'
AND event_type IN ('user.token.v2.added', 'user.machine.secret.check.succeeded', 'user.machine.secret.check.failed')
GROUP BY aggregate_id
ORDER BY count(*) DESC;

=> SELECT aggregate_id AS user_id, count(*)
user_id | count
--------------------------------------+-------
303663173871918675 | 67432
7b92653e-5a53-46ed-868e-a594affd687e | 481
311035497096956262 | 348
573548ef-c7e0-4e59-854e-fe91514de544 | 310
fa56cb73-bcde-405c-9bf4-1235d643731c | 292
789b8fd7-6775-4fc8-adf5-6cf6ef34f89b | 262
2834c6c7-540f-44a3-8093-425a3c2566cf | 257
03357734-6a85-4394-8c03-7dd14e90855f | 245
32793218-9fbd-4672-bca7-fbaa512c9e1f | 237
306978675738326019 | 232
c39df126-313d-44b8-b1a0-dba604888567 | 222
7e3d25e7-7df4-4e43-a28e-c3afcae14b8f | 213
589e9ada-733f-4f6a-958d-343f71f716b2 | 210
788f20ca-7936-4b12-8aad-bb49db82b0af | 207
(output truncated)
SELECT aggregate_id AS user_id, count(*)
FROM eventstore.events2
WHERE instance_id = '298583134000167036'
AND aggregate_type = 'user'
AND event_type IN ('user.token.v2.added', 'user.machine.secret.check.succeeded', 'user.machine.secret.check.failed')
GROUP BY aggregate_id
ORDER BY count(*) DESC;

=> SELECT aggregate_id AS user_id, count(*)
user_id | count
--------------------------------------+-------
303663173871918675 | 67432
7b92653e-5a53-46ed-868e-a594affd687e | 481
311035497096956262 | 348
573548ef-c7e0-4e59-854e-fe91514de544 | 310
fa56cb73-bcde-405c-9bf4-1235d643731c | 292
789b8fd7-6775-4fc8-adf5-6cf6ef34f89b | 262
2834c6c7-540f-44a3-8093-425a3c2566cf | 257
03357734-6a85-4394-8c03-7dd14e90855f | 245
32793218-9fbd-4672-bca7-fbaa512c9e1f | 237
306978675738326019 | 232
c39df126-313d-44b8-b1a0-dba604888567 | 222
7e3d25e7-7df4-4e43-a28e-c3afcae14b8f | 213
589e9ada-733f-4f6a-958d-343f71f716b2 | 210
788f20ca-7936-4b12-8aad-bb49db82b0af | 207
(output truncated)
If I count all these events it gives me:
=> SELECT count(*)
FROM eventstore.events2
WHERE instance_id = '298583134000167036'
AND aggregate_type = 'user'
AND event_type IN ('user.token.v2.added', 'user.machine.secret.check.succeeded', 'user.machine.secret.check.failed');
count
---------
9159057
(1 row)
=> SELECT count(*)
FROM eventstore.events2
WHERE instance_id = '298583134000167036'
AND aggregate_type = 'user'
AND event_type IN ('user.token.v2.added', 'user.machine.secret.check.succeeded', 'user.machine.secret.check.failed');
count
---------
9159057
(1 row)
So these events mean the 10% of the eventstore.events2 table. Thanks for your help.
adlerhurst
adlerhurst4w ago
Okay as this is a dump of the database you could try to remove them and see if the performance is getting better. Do you have any query insights of the database which shows which query causes the load?
darioef
darioefOP4w ago
Thanks, during the migration the only query that seems to load the db is the one I put in the first post. I’ll try removing these records and run another migration. Is there any way to tell the Setup Job to build the users14 table faster increasing the bulk of users it takes from events2?
FFO
FFO4w ago
I think you can set the bulk limit for each projection with this parameter https://github.com/zitadel/zitadel/blob/main/cmd/defaults.yaml#L382 https://github.com/zitadel/zitadel/blob/main/cmd/defaults.yaml#L390 Not sur ewhat the exat name is though 😄
darioef
darioefOP4w ago
Yes, I did some tests increasing Projections.BulkLimit to bigger numbers (1000, 10000, 100000, etc) but it didn’t help too much.
darioef
darioefOP4w ago
I dropped those records but the migration is still slow. CPU and IOPS on database are low.
No description
No description
No description
FFO
FFO4w ago
Hm that is weird, that latency is a little high. Can you share the whole query with me?
darioef
darioefOP4w ago
Sorry for the delay, sure, this is the query:
SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision FROM eventstore.events2 WHERE instance_id = $1 AND aggregate_type = ANY($2) AND event_type = ANY($3) AND "position" >= $4 AND "position" < (SELECT COALESCE(EXTRACT(EPOCH FROM min(xact_start)), EXTRACT(EPOCH FROM now())) FROM pg_stat_activity WHERE datname = current_database() AND application_name = ANY($5) AND state <> 'idle') ORDER BY "position", in_tx_order LIMIT $6 OFFSET $7
SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision FROM eventstore.events2 WHERE instance_id = $1 AND aggregate_type = ANY($2) AND event_type = ANY($3) AND "position" >= $4 AND "position" < (SELECT COALESCE(EXTRACT(EPOCH FROM min(xact_start)), EXTRACT(EPOCH FROM now())) FROM pg_stat_activity WHERE datname = current_database() AND application_name = ANY($5) AND state <> 'idle') ORDER BY "position", in_tx_order LIMIT $6 OFFSET $7
adlerhurst
adlerhurst3w ago
Can you please share the result of the following query?
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,COSTS,TIMING)
SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision
FROM eventstore.events2
WHERE instance_id = '298583134000167036'
AND aggregate_type = ANY(ARRAY['user', 'org', 'instance'])
AND event_type = ANY(ARRAY['user.added','user.human.added','user.selfregistered','user.human.selfregistered','user.human.initialization.code.added','user.initialization.code.added','user.human.initialization.check.succeeded','user.initialization.check.succeeded','user.locked','user.unlocked','user.deactivated','user.reactivated','user.removed','user.username.changed','user.domain.claimed','user.human.profile.changed','user.profile.changed','user.human.phone.changed','user.phone.changed','user.human.phone.removed','user.phone.removed','user.human.phone.verified','user.phone.verified','user.human.email.changed','user.email.changed','user.human.email.verified','user.email.verified','user.human.avatar.added','user.human.avatar.removed','user.machine.added'])
AND "position" >= (SELECT position FROM projections.current_states WHERE projection_name = 'projections.users14' and instance_id = '298583134000167036')
AND "position" < (SELECT COALESCE(EXTRACT(EPOCH FROM min(xact_start)), EXTRACT(EPOCH FROM now())) FROM pg_stat_activity WHERE datname = current_database() AND application_name = ANY(ARRAY['zitadel_es_pusher_298583134000167036']) AND state <> 'idle')
ORDER BY "position", in_tx_order
LIMIT 200
OFFSET (SELECT filter_offset FROM projections.current_states WHERE projection_name = 'projections.users14' and instance_id = '298583134000167036');
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,COSTS,TIMING)
SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision
FROM eventstore.events2
WHERE instance_id = '298583134000167036'
AND aggregate_type = ANY(ARRAY['user', 'org', 'instance'])
AND event_type = ANY(ARRAY['user.added','user.human.added','user.selfregistered','user.human.selfregistered','user.human.initialization.code.added','user.initialization.code.added','user.human.initialization.check.succeeded','user.initialization.check.succeeded','user.locked','user.unlocked','user.deactivated','user.reactivated','user.removed','user.username.changed','user.domain.claimed','user.human.profile.changed','user.profile.changed','user.human.phone.changed','user.phone.changed','user.human.phone.removed','user.phone.removed','user.human.phone.verified','user.phone.verified','user.human.email.changed','user.email.changed','user.human.email.verified','user.email.verified','user.human.avatar.added','user.human.avatar.removed','user.machine.added'])
AND "position" >= (SELECT position FROM projections.current_states WHERE projection_name = 'projections.users14' and instance_id = '298583134000167036')
AND "position" < (SELECT COALESCE(EXTRACT(EPOCH FROM min(xact_start)), EXTRACT(EPOCH FROM now())) FROM pg_stat_activity WHERE datname = current_database() AND application_name = ANY(ARRAY['zitadel_es_pusher_298583134000167036']) AND state <> 'idle')
ORDER BY "position", in_tx_order
LIMIT 200
OFFSET (SELECT filter_offset FROM projections.current_states WHERE projection_name = 'projections.users14' and instance_id = '298583134000167036');
darioef
darioefOP3w ago
Yes! Sorry again for the delay:
Larzous
Larzous3w ago
@darioef -- I hate to bug you but its relivent. I see all your user IDs appear to be UUID4. How do you enforce this? - UI registration appears to give a random zitadel ID and not a UUID. - if I disalbe registration but allow google connections, users can still link external account and literally create an account which would break a UUID schema for users. We currently have UUID in one system, and are migrating to Zitadel, but I'm woried about NEW users creating via the "auto-linking" feature when using Google or a different IdP. If they get an ID that isn't a UUID it would break our existing system. Suggestions/feedback on how you have handled this? -- We are about to import 1.2 Million users from mysql into a fresh V4 setup and I'm not sure best practies and things like that for user IDs, but I know indexing a TEXT field as a primary key is a bad idea.

Did you find this page helpful?