Zero-downtime migration tooling; widen File.file_size to bigint (expand stage) by rtibbles · Pull Request #5986 · learningequality/studio

rtibbles · 2026-06-24T01:44:47Z

Summary

Add tooling and playbook for zero downtime migrations for changing field types, e.g. int to bigint, or char to uuid - implement start of migration for file size on File object
Create reusable tooling for migration of file_size and other fields (like our CharField UUIDs to proper UUIDFields) - allow upload of files bigger than 2.1GB > int max value
Implemented with clear tooling to help ensure we implement with guardrails and clear repeatable steps.

References

Fixes #5973
Fixes #5974

Reviewer guidance

DB engine swap (settings.py, production_settings.py): DB backend now routes through django-pg-zero-downtime-migrations.
Migration 0167: nullable shadow column + concurrent index + mirror trigger — all safe DDL (the backend forces atomic=False, so the in-migration CREATE INDEX CONCURRENTLY is fine).
deploy-migrate runs the backfill this release. New writes dual-write via the trigger.
New CI step lints PR migrations against the merge-base.

AI usage

Used Claude Code to implement the tooling (safe DB backend wiring, the mirror_field dual-write trigger, the backfill_column command) and draft the runbook. I reviewed the output across several local review rounds for correctness, tightened it to Studio's conventions, and ran the test suite.

bjester

Overall, the code looks good and makes sense. I have lots of comments and a few questions for deliberation. After that, I'll look at testing it

bjester · 2026-06-29T18:28:01Z

+        )
+        parser.add_argument("--start-id", default=None, help="resume from this pk")
+        parser.add_argument(
+            "--verify",


nitpick: --verify doesn't verify backfilled columns, it only checks progress. IMO --check or something like that aligns better with what it does.

--progress-check maybe to be completely explict?

Yeah that sounds good

bjester · 2026-06-29T18:36:11Z

+
+        total = 0
+        while lo is not None:
+            # hi = batch_size-th pk at/after lo (None on the final short batch).


hi = batch_size-th pk at/after lo -- this a level of terse LLM writing that makes this comment difficult to understand. IDK what it's saying. I understand the code more than the comment which is telling.

Yes - explanatory code would be better than comments here.

bjester · 2026-06-29T18:47:14Z

+            # Throttle between batches so WAL generation / replication lag and autovacuum
+            # can keep up on large tables. Pass --sleep 0 to disable.
+            if throttle:
+                time.sleep(throttle)


This may not provide a meaningful benefit, especially for large tables like the file table. I don't think there is any worry about the WAL, but autovacuum would be triggered. I just don't know that the sleep time would necessarily align with its completion (likely much much longer). If there is concern, I think a more explicit approach to managing it would be beneficial:

turn it off while this running, and

execute vacuum manually at the end

Yes, I also argued with Claude quite a bit about this, and I relented because it was insistent. I think I mostly gave in out of fear because File is such a large table - but maybe we just assume we don't need the throttle at all, test it out on hotfixes and see whether that's a bad idea?

bjester · 2026-06-29T18:54:36Z

+            models.Index(
+                fields=["checksum", "file_size_bigint"],
+                name=FILE_DISTINCT_BIGINT_INDEX_NAME,
+            ),


A constraint on file_size_bigint being not null may speed up this creation of this new index.

bjester · 2026-06-29T18:59:48Z

+## Reclaiming the physical name (usually skip)
+
+`db_column` makes the `file_size_bigint` physical name invisible to all ORM/app code; only raw SQL sees it. Matching the physical name to `file_size` requires a second expand/contract cycle — needed only for raw-SQL or external consumers.


As a raw SQL writer, IMO it's tech debt if it's skipped. I think this should be expected after we have propagated to all environments and generated backups-- we don't want to leave remnants of migrations gone by for no real reason.

I did wonder if you would have an opinion about this - I think perfectly fine to do this, just noting that we'll essentially have to do the reverse operation (but without a field type change) to achieve this with zero downtime - but it's only one more release because we can drop the int column and add the new bigint column in the same release, then start backfilling.

A column rename should be quick with postgres, at least, that's what I recall with these types of migrations. And yeah we don't need to rush to do it afterwards, but like I said, just not forget about it!

Yeah, it's quick, but it's not absolutely zero downtime - it could, in theory lead to failed writes on non-upgraded pods. I feel OK accepting that small switchover drop, but I didn't want to advertise this as a "zero downtime" migration when there would be some switchover cost. But, going through the full cycle again to be truly zero downtime doesn't seem like a good use of effort!

bjester · 2026-06-29T19:11:13Z

+
+## Guards (already configured)
+
+- **Safe-DDL backend** — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings.


See the ZERO_DOWNTIME_MIGRATIONS_* settings.

Could link to docs?

bjester · 2026-06-29T19:11:42Z

+## Guards (already configured)
+
+- **Safe-DDL backend** — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings.
+- **`django-migration-linter`** (CI, `pythontest.yml`) — flags backward-incompatible schema (drops, renames, NOT NULL adds) the old pods would break on. The backend doesn't judge this axis.


The backend doesn't judge this axis.

wut

I swear I read these docs multiple times - but this still got through. Will do another edit pass here.

bjester · 2026-06-29T19:15:06Z

+]
+```
+
+Raw SQL because cutover detached both objects from Django state: the column lost its field, and `RemoveTrigger` was applied state-only to keep the physical trigger alive. `RemoveTrigger.database_forwards` and `RemoveField` look their target up in migration state, so with nothing there they have nothing to drop. Copy the trigger `pgid` from release 1's `AddTrigger` operation.


"Raw SQL" -> "Raw SQL is used"

bjester · 2026-06-29T19:17:50Z

+        parser.add_argument("--model", required=True, help="app_label.ModelName")
+        parser.add_argument("--source-field", required=True)
+        parser.add_argument("--target-field", required=True)
+        parser.add_argument("--batch-size", type=int, default=1000)


Batch size could probably be pushed to 10k or a little more since this operates entirely within the DB

And I think it also our default elsewhere?

bjester · 2026-06-29T19:19:38Z

+# studio#5974 cutover (next release, after backfill completes):
+# - drop the @mirror_field decorator and the file_size_bigint field below
+# - file_size = models.BigIntegerField(blank=True, null=True, db_column="file_size_bigint")
+# - FILE_DISTINCT_BIGINT_INDEX_NAME -> fields=["checksum", "file_size"]


Not really apart of this step, but how confident are we in Django handling this particular action without significant flux?

That's why we need to hand write the migrations with the "separate database state and operations" wrapper for this next step - otherwise it will almost certainly do something we definitely do not want.

- Safe-DDL backend: lock timeout, fail-fast, runtime unsafe-op guard - CI linting of new migrations on pull requests - Declarative dual-write trigger decorator (mirror_field) - Reusable batched-backfill command (idempotent, resumable, throttled) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi

Expand stage of the zero-downtime int->bigint widening: - Add nullable file_size_bigint shadow column and its index (built CONCURRENTLY). - Mirror file_size into it via the change-guarded @mirror_field trigger. - Wire the online backfill as a commented deploy-migrate step. - Stage the cutover and contract steps as comments on the File model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi

rtibbles marked this pull request as ready for review June 24, 2026 02:43

bjester self-assigned this Jun 26, 2026

bjester reviewed Jun 29, 2026

View reviewed changes

rtibbles and others added 3 commits June 30, 2026 14:35

docs(migrations): add expand/contract zero-downtime runbook

b631c25

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LfZvkigk8hdsKdEif3hzBi

rtibbles force-pushed the widen_file_size_bigint branch from 4f79103 to 84f402d Compare June 30, 2026 21:37

		## Reclaiming the physical name (usually skip)

		`db_column` makes the `file_size_bigint` physical name invisible to all ORM/app code; only raw SQL sees it. Matching the physical name to `file_size` requires a second expand/contract cycle — needed only for raw-SQL or external consumers.


		## Guards (already configured)

		- Safe-DDL backend — `DATABASES["default"]["ENGINE"]` in `settings.py` (prod adds Prometheus via `contentcuration.db.backends.zero_downtime_prometheus`). Lock/statement timeouts and `RAISE_FOR_UNSAFE` make rewriting or exclusively-locking DDL fail at migrate time. See the `ZERO_DOWNTIME_MIGRATIONS_*` settings.

Uh oh!

Uh oh!

Conversation

rtibbles commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

References

Reviewer guidance

AI usage

Uh oh!

bjester left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rtibbles Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rtibbles commented Jun 24, 2026 •

edited

Loading

rtibbles Jun 29, 2026 •

edited

Loading