Zero-downtime migration tooling; widen File.file_size to bigint (expand stage)#5986
Zero-downtime migration tooling; widen File.file_size to bigint (expand stage)#5986rtibbles wants to merge 3 commits into
Conversation
bjester
left a comment
There was a problem hiding this comment.
Overall, the code looks good and makes sense. I have lots of comments and a few questions for deliberation. After that, I'll look at testing it
| ) | ||
| parser.add_argument("--start-id", default=None, help="resume from this pk") | ||
| parser.add_argument( | ||
| "--verify", |
There was a problem hiding this comment.
nitpick: --verify doesn't verify backfilled columns, it only checks progress. IMO --check or something like that aligns better with what it does.
There was a problem hiding this comment.
--progress-check maybe to be completely explict?
|
|
||
| total = 0 | ||
| while lo is not None: | ||
| # hi = batch_size-th pk at/after lo (None on the final short batch). |
There was a problem hiding this comment.
hi = batch_size-th pk at/after lo -- this a level of terse LLM writing that makes this comment difficult to understand. IDK what it's saying. I understand the code more than the comment which is telling.
There was a problem hiding this comment.
Yes - explanatory code would be better than comments here.
| # Throttle between batches so WAL generation / replication lag and autovacuum | ||
| # can keep up on large tables. Pass --sleep 0 to disable. | ||
| if throttle: | ||
| time.sleep(throttle) |
There was a problem hiding this comment.
This may not provide a meaningful benefit, especially for large tables like the file table. I don't think there is any worry about the WAL, but autovacuum would be triggered. I just don't know that the sleep time would necessarily align with its completion (likely much much longer). If there is concern, I think a more explicit approach to managing it would be beneficial:
- turn it off while this running, and
- execute vacuum manually at the end
There was a problem hiding this comment.
Yes, I also argued with Claude quite a bit about this, and I relented because it was insistent. I think I mostly gave in out of fear because File is such a large table - but maybe we just assume we don't need the throttle at all, test it out on hotfixes and see whether that's a bad idea?
| models.Index( | ||
| fields=["checksum", "file_size_bigint"], | ||
| name=FILE_DISTINCT_BIGINT_INDEX_NAME, | ||
| ), |
There was a problem hiding this comment.
A constraint on file_size_bigint being not null may speed up this creation of this new index.
| ## Reclaiming the physical name (usually skip) | ||
|
|
||
| `db_column` makes the `file_size_bigint` physical name invisible to all ORM/app code; only raw SQL sees it. Matching the physical name to `file_size` requires a second expand/contract cycle — needed only for raw-SQL or external consumers. |
There was a problem hiding this comment.
As a raw SQL writer, IMO it's tech debt if it's skipped. I think this should be expected after we have propagated to all environments and generated backups-- we don't want to leave remnants of migrations gone by for no real reason.
There was a problem hiding this comment.
I did wonder if you would have an opinion about this - I think perfectly fine to do this, just noting that we'll essentially have to do the reverse operation (but without a field type change) to achieve this with zero downtime - but it's only one more release because we can drop the int column and add the new bigint column in the same release, then start backfilling.
There was a problem hiding this comment.
A column rename should be quick with postgres, at least, that's what I recall with these types of migrations. And yeah we don't need to rush to do it afterwards, but like I said, just not forget about it!
There was a problem hiding this comment.
Yeah, it's quick, but it's not absolutely zero downtime - it could, in theory lead to failed writes on non-upgraded pods. I feel OK accepting that small switchover drop, but I didn't want to advertise this as a "zero downtime" migration when there would be some switchover cost. But, going through the full cycle again to be truly zero downtime doesn't seem like a good use of effort!
|
|
||
| ## Guards (already configured) | ||
|
|
||
| - **Safe-DDL backend** — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings. |
There was a problem hiding this comment.
See the
ZERO_DOWNTIME_MIGRATIONS_*settings.
Could link to docs?
| ## Guards (already configured) | ||
|
|
||
| - **Safe-DDL backend** — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings. | ||
| - **`django-migration-linter`** (CI, `pythontest.yml`) — flags backward-incompatible schema (drops, renames, NOT NULL adds) the old pods would break on. The backend doesn't judge this axis. |
There was a problem hiding this comment.
The backend doesn't judge this axis.
wut
There was a problem hiding this comment.
I swear I read these docs multiple times - but this still got through. Will do another edit pass here.
| ] | ||
| ``` | ||
|
|
||
| Raw SQL because cutover detached both objects from Django state: the column lost its field, and `RemoveTrigger` was applied state-only to keep the physical trigger alive. `RemoveTrigger.database_forwards` and `RemoveField` look their target up in migration state, so with nothing there they have nothing to drop. Copy the trigger `pgid` from release 1's `AddTrigger` operation. |
| parser.add_argument("--model", required=True, help="app_label.ModelName") | ||
| parser.add_argument("--source-field", required=True) | ||
| parser.add_argument("--target-field", required=True) | ||
| parser.add_argument("--batch-size", type=int, default=1000) |
There was a problem hiding this comment.
Batch size could probably be pushed to 10k or a little more since this operates entirely within the DB
There was a problem hiding this comment.
And I think it also our default elsewhere?
| # studio#5974 cutover (next release, after backfill completes): | ||
| # - drop the @mirror_field decorator and the file_size_bigint field below | ||
| # - file_size = models.BigIntegerField(blank=True, null=True, db_column="file_size_bigint") | ||
| # - FILE_DISTINCT_BIGINT_INDEX_NAME -> fields=["checksum", "file_size"] |
There was a problem hiding this comment.
Not really apart of this step, but how confident are we in Django handling this particular action without significant flux?
There was a problem hiding this comment.
That's why we need to hand write the migrations with the "separate database state and operations" wrapper for this next step - otherwise it will almost certainly do something we definitely do not want.
- Safe-DDL backend: lock timeout, fail-fast, runtime unsafe-op guard - CI linting of new migrations on pull requests - Declarative dual-write trigger decorator (mirror_field) - Reusable batched-backfill command (idempotent, resumable, throttled) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
Expand stage of the zero-downtime int->bigint widening: - Add nullable file_size_bigint shadow column and its index (built CONCURRENTLY). - Mirror file_size into it via the change-guarded @mirror_field trigger. - Wire the online backfill as a commented deploy-migrate step. - Stage the cutover and contract steps as comments on the File model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi
4f79103 to
84f402d
Compare
Summary
Add tooling and playbook for zero downtime migrations for changing field types, e.g. int to bigint, or char to uuid - implement start of migration for file size on File object
Create reusable tooling for migration of file_size and other fields (like our CharField UUIDs to proper UUIDFields) - allow upload of files bigger than 2.1GB > int max value
Implemented with clear tooling to help ensure we implement with guardrails and clear repeatable steps.
References
Fixes #5973
Fixes #5974
Reviewer guidance
settings.py,production_settings.py): DB backend now routes throughdjango-pg-zero-downtime-migrations.0167: nullable shadow column + concurrent index + mirror trigger — all safe DDL (the backend forcesatomic=False, so the in-migrationCREATE INDEX CONCURRENTLYis fine).deploy-migrateruns the backfill this release. New writes dual-write via the trigger.AI usage
Used Claude Code to implement the tooling (safe DB backend wiring, the
mirror_fielddual-write trigger, thebackfill_columncommand) and draft the runbook. I reviewed the output across several local review rounds for correctness, tightened it to Studio's conventions, and ran the test suite.