Core memory leak fixes: Cython deallocation & callback hardening#181

Merged

lmoresi merged 16 commits into development from fix/memory-leak-comprehensive on May 14, 2026

Conversation

@bknight1
Member

This PR addresses several memory-related issues and stability concerns identified during OOM investigations on HPC clusters.

Key changes:

  • Cython Memory Management: Implemented __dealloc__ in the PtrContainer class to ensure that malloc'd function pointer arrays are properly freed when Python objects are garbage collected. Added re-allocation guards to prevent leaks during container reuse (see the sketch after this list).
  • Array Callback Hardening: Updated mesh and variable array callbacks to verify array.owner exists before execution. This prevents AttributeError during complex object teardown (e.g., at application exit or during mesh replacement).
  • Recursion Guards: Integrated explicit recursion guards and mesh_update_lock checks in MeshVariable callbacks to prevent PETSc synchronization during sensitive coordinate deformations.
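For reference, the deallocation pattern in miniature. This is a sketch only: the real PtrContainer in petsc_types.pyx holds PETSc function-pointer arrays, and the field names and allocate() signature here are illustrative.

    from libc.stdlib cimport malloc, free

    cdef class PtrContainer:
        cdef void **fns          # malloc'd function-pointer array (illustrative name)
        cdef Py_ssize_t n

        def __cinit__(self):
            self.fns = NULL       # start NULL so __dealloc__ is always safe to run
            self.n = 0

        def allocate(self, Py_ssize_t n):
            # re-allocation guard: free the old block before reuse, no double-free
            if self.fns != NULL:
                free(self.fns)
                self.fns = NULL
            self.fns = <void **> malloc(n * sizeof(void *))
            if self.fns == NULL:
                raise MemoryError()
            self.n = n

        def __dealloc__(self):
            # paired with __cinit__: frees the array when the Python object dies
            if self.fns != NULL:
                free(self.fns)
                self.fns = NULL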

lmoresi and others added 6 commits May 10, 2026 12:05
Swarm.advection() ended with self.migrate(delete_lost_points=True,
max_its=1). max_its=1 only consults the closest domain centroid for each
unclaimed particle, so any boundary particle whose owner happened to be
the 2nd-or-3rd-closest rank failed points_in_domain on its sole try and
was silently deleted.

The override looks like leftover debugging — there is no reason for the
end-of-advection migrate to truncate the kdtree retry. Drop it so the
default max_its=10 is used, restoring the boundary-claim retry path.
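The fix itself is a one-line change in Swarm.advection (shape per this commit message):

    # before: a single nearest-centroid attempt; boundary particles whose
    # owner is the 2nd/3rd-closest rank fail points_in_domain and are deleted
    self.migrate(delete_lost_points=True, max_its=1)

    # after: rely on the default max_its=10 so unclaimed particles retry
    # against progressively farther domain centroids before deletion
    self.migrate(delete_lost_points=True)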

Adds a parallel regression test (tests/parallel/test_0765) adapted from
@bknight1's reproducer in the bug report. With the fix in place: 4 ranks,
no particles lost. Without it: 30/726 lost on order=1 advection,
66/726 on order=2 rotation.

Reported and diagnosed by @bknight1.

Underworld development team with AI support from Claude Code
Long parallel runs occasionally OOM on HPC even when each step looks
small. memprobe gives a way to "light up" memory tracking on demand,
sample at regular intervals, and pin which subsystem is growing.

Components:
- src/underworld3/utilities/memprobe.py — snapshot/diff/probe/instrument
  API. Snapshots capture process RSS (resource.getrusage), KDTree live +
  total-constructed counts, and (with full=True) per-class Python
  instance counts via gc.get_objects. Diffs filter out unchanged keys
  and sort py_classes by absolute change so dominant suspects surface
  first.
- ckdtree.pyx — module-level live-instance counter, accurate via
  Cython's deterministic __cinit__/__dealloc__. Exposed as
  uw.kdtree.live_count() and uw.kdtree.total_constructed().
- Stokes.solve() and NavierStokes.solve() decorated with
  @memprobe.instrument(...). The decorator fast-returns when ENABLED
  is False (one bool check, sub-microsecond), so it's safe on hot paths.
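A plausible shape for that decorator, sketched under assumptions: memprobe's real internals may differ, and the RSS helper below is a stand-in for the module's snapshot machinery.

    import functools
    import resource

    ENABLED = False   # flipped by UW_MEMPROBE=1 at import, or enable()/disable()

    def _rss_kb():
        # stand-in sampler for illustration; real memprobe snapshots carry more
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    def instrument(label):
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                if not ENABLED:                 # the one-bool fast path
                    return fn(*args, **kwargs)
                before = _rss_kb()
                try:
                    return fn(*args, **kwargs)
                finally:
                    print(f"[memprobe:{label}] rss delta {_rss_kb() - before} kB")
            return inner
        return wrap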

Activation:
- UW_MEMPROBE=1 env var → enables instrumentation hooks at import time.
- memprobe.enable() / disable() for runtime toggling.
- with memprobe.probe("label"): … for ad-hoc blocks.
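Putting the activation paths together, a usage sketch. It assumes an already-constructed Stokes solver named stokes; the memprobe and kdtree calls are the ones named in this commit message.

    # shell: UW_MEMPROBE=1 mpirun -n 4 python model.py   (hooks on at import)

    import underworld3 as uw
    from underworld3.utilities import memprobe

    memprobe.enable()                     # runtime toggle; memprobe.disable() to stop

    with memprobe.probe("stokes-loop"):   # ad-hoc block: snapshot/diff around it
        for step in range(100):
            stokes.solve()                # instrumented via @memprobe.instrument

    print(uw.kdtree.live_count(), uw.kdtree.total_constructed())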

Skipped: parsing PETSc.Log object tables from Python. The runtime
flags -log_view and -malloc_dump give the same information more
reliably; documented in the guide and exposed via
memprobe.dump_petsc_leaks_at_finalize().

Adds tests/test_0780_memprobe.py (9 smoke tests) and
docs/developer/guides/memory-diagnostics.md with debugging recipes.

Does not fix issue #176 (Ben's OOM on HPC) — provides the
instrumentation he needs to bisect it.

Underworld development team with AI support from Claude Code
- _rss_mb(): switch from resource.ru_maxrss (peak/high-water RSS) to
  psutil.Process().memory_info().rss (current RSS), with a /proc/self/statm
  Linux fallback and ru_maxrss kept only as a last resort. Current RSS
  is preferred so freed memory shows as a negative delta; see the sketch
  after this list. (Copilot #1, #6)

- diff(): apply 0.01 MiB threshold before including rss_mb in the delta.
  Smaller noise prints as "+0.00 MiB" and defeats the "no change" fast
  path. (Copilot #2)

- dump_petsc_leaks_at_finalize(): drop leading "-" from PETSc option
  keys (codebase convention is no dash; "-malloc_dump" was a no-op),
  and align docstring with what the function actually sets. (Copilot #3)

- Remove unused `os` import from test module. (Copilot #4)

- Soften ckdtree.pyx comment claiming Cython "guarantees deterministic
  destruction" — that's CPython refcounting, which can lag for objects
  trapped in reference cycles until cyclic GC runs. (Copilot #5)

- Test: assert KDTree count deltas relative to immediate before/after,
  never to absolute baselines. Earlier tests in the same pytest session
  can leave KDTree refs alive that the cyclic GC may collect at any
  time, shifting the absolute count. CI surfaced this with
  "assert 332 == 338" — six trees from earlier tests collected during
  our gc.collect().
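The fallback chain from the first bullet, sketched. Behavior follows the commit message; the exact structure is assumed, and unit handling is platform-dependent as noted inline.

    import resource

    def _rss_mb():
        # 1) psutil: current RSS, so freed memory appears as a negative delta
        try:
            import psutil
            return psutil.Process().memory_info().rss / 2**20
        except ImportError:
            pass
        # 2) Linux fallback: /proc/self/statm, second field is resident page count
        try:
            with open("/proc/self/statm") as f:
                pages = int(f.read().split()[1])
            return pages * resource.getpagesize() / 2**20
        except (OSError, ValueError, IndexError):
            pass
        # 3) last resort: peak RSS high-water mark (never decreases);
        #    Linux reports ru_maxrss in KiB, macOS in bytes
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024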

Underworld development team with AI support from Claude Code
CI on PR #179 surfaced a flaky failure in test_bc_accepts_raw_numbers:

  any(k[0] == name for k in UWexpression._ephemeral_expr_names)
  RuntimeError: dictionary changed size during iteration

The dict is mutated asynchronously by weakref finalizers running during
cyclic GC, so iterating it directly races against those callbacks. The
fix is to snapshot the keys with list(...) before iterating — at most a
few hundred entries, negligible cost.
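The pattern in general form (registry and name stand in for UWexpression._ephemeral_expr_names and the tested key):

    # racy: a weakref finalizer firing during cyclic GC can resize the
    # dict while the generator is still iterating it
    found = any(k[0] == name for k in registry)

    # safe: list(...) snapshots the keys up front; finalizers may then
    # mutate the live dict freely without tripping RuntimeError
    found = any(k[0] == name for k in list(registry))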

Pre-existing bug — surfaces depending on test ordering, GC pressure,
and timing. The memprobe PR's added tests happen to shift teardown
state enough to trigger it consistently. Worth landing here so #179
isn't blocked.

Underworld development team with AI support from Claude Code
Implement __dealloc__ to ensure malloc'd function pointer arrays are
freed when the object is garbage collected. Added re-allocation guards
in allocate() to prevent leaks when a container is reused.

This resolves heap growth observed when solvers are repeatedly
instantiated and destroyed during parameter sweeps or time-looping.
Updated mesh and variable update callbacks to verify 'array.owner'
before processing. This prevents AttributeErrors during complex
object teardowns (e.g. at exit or during mesh replacement).

Integrated explicit recursion guards and mesh_update_lock checks
in MeshVariable callbacks to ensure PETSc synchronization doesn't
occur during sensitive coordinate deformations, preventing state
corruption.
Copilot AI review requested due to automatic review settings May 11, 2026 06:59
@bknight1 bknight1 requested a review from lmoresi as a code owner May 11, 2026 06:59
lmoresi added 2 commits May 11, 2026 17:03
Add memprobe diagnostic module for memory-growth tracking (#176)
…grate

Fix swarm particle loss across rank boundaries during advection (#175)
Contributor

Copilot AI left a comment


Pull request overview

This PR targets stability and memory behavior in Underworld3’s PETSc/NumPy interoperability paths, addressing leaks from Cython-managed allocations and hardening array callbacks that can fire during teardown or mesh/variable rebuilds.

Changes:

  • Hardened SwarmVariable/MeshVariable/mesh coordinate array callbacks by resolving the owning object via array.owner and exiting early when the owner is gone.
  • Added recursion/mesh-update locking guards in MeshVariable callbacks to reduce unsafe PETSc synchronization during mesh deformation.
  • Added Cython-side cleanup for PtrContainer by introducing __dealloc__ and freeing/reallocating function-pointer arrays on reuse.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File — Description
src/underworld3/swarm.py — Uses array.owner in swarm variable array callbacks to avoid failures during teardown and to reference the correct owning variable.
src/underworld3/discretisation/discretisation_mesh.py — Hardens mesh coordinate callbacks by dereferencing the mesh via array.owner before deforming and versioning.
src/underworld3/discretisation/discretisation_mesh_variables.py — Updates MeshVariable callbacks to use array.owner, adds recursion/mesh-update lock guards, and adjusts canonical callback logic.
src/underworld3/cython/petsc_types.pyx — Adds __cinit__/__dealloc__ and explicit freeing on allocate() reuse to prevent leaks of malloc'd function pointer arrays.


Comment on lines 2545 to 2549:

    from underworld3.utilities import NDArray_With_Callback

    array_obj = NDArray_With_Callback(flat_petsc_data)

    # Single canonical callback for PETSc synchronization

Comment on lines 601 to 605:

    print(f"Mesh update callback - mesh deform")
    coords = array.reshape(-1, array.owner.cdim)
    self._deform_mesh(coords, verbose=True)
    coords = array.reshape(-1, mesh.cdim)
    mesh._deform_mesh(coords, verbose=True)

    # Increment mesh version to notify registered swarms of coordinate changes
@lmoresi
Member

lmoresi commented May 11, 2026

Thanks @bknight1 — at a glance this looks orthogonal to the other in-flight work (#179 touched ckdtree.pyx + solvers.py; this PR is in petsc_types.pyx + discretisation_mesh*.py + a small swarm.py change). No conflicts expected.

To check before merge

  1. PtrContainer.__dealloc__ re-allocation guard — confirm the guard handles the case where a PtrContainer is reused with a smaller array (free old before allocating new, no double-free).

  2. array.owner existence check in callbacks — please add a one-line test or comment indicating which teardown path triggers the missing-owner case. That's the kind of guard that's invisible until something breaks, and a test or pointer-comment lets future-us understand why it's there.

  3. mesh_update_lock — is this a new attribute introduced in this PR, or pre-existing? If new, a one-line docstring on it (where set, where cleared, what operations honour it) would help. Recursion guards are easy to leave half-wired.

  4. Cross-validation with #179 (Add memprobe diagnostic module for memory-growth tracking, #176): once merged, a UW_MEMPROBE=1 smoke run should show a clean drop in steady-state RSS after this PR vs. before — that's a nice fingerprint for the PtrContainer leak specifically.

Otherwise this looks ready. CI passing + the above confirmations and I'd say merge.

Underworld development team with AI support from Claude Code

@lmoresi
Member

lmoresi commented May 11, 2026

@bknight1 — CI on this PR is failing, two things to address:

1. Rebase onto current development (picks up a race fix you don't have)

This PR's base is 848b8b9 (the #178 merge commit). Yesterday, after that landed, three more PRs were merged on top, including the ephemeral-expression iteration race fix.

Your CI shows the race fix matters here:

FAILED tests/test_0810_bc_unit_conversion.py::test_bc_accepts_raw_numbers
  - RuntimeError: dictionary changed size during iteration

Rebase:

git fetch origin
git rebase origin/development
# resolve any conflicts (likely small — your changes are in different files)
git push --force-with-lease

2. Investigate the integral / swarm-statistics regression

CI also reports:

FAILED tests/test_0501_integrals.py::test_integrate_meshvar           - assert 0.6366197... < 0.001
FAILED tests/test_0501_integrals.py::test_integrate_swarmvar_deriv_O1 - assert 2.0 < 0.001
FAILED tests/test_0501_integrals.py::test_integrate_swarmvar_deriv_O3 - assert 2.0 < 0.001
FAILED tests/test_0502_boundary_integrals.py::test_bd_integral_meshvar - Expected 0.6366..., got 0.0
FAILED tests/test_0852_swarm_integration_statistics.py::...           - assert 0.0 > 0
FAILED tests/test_0852_swarm_integration_statistics.py::...           - assert 273.0 < 0.0

These look like real regressions caused by this PR, not pre-existing flakes:

  • test_bd_integral_meshvar returning 0.0 instead of 0.6366... is severe — a boundary integral computing zero is the kind of thing you only see if data isn't reaching the integration context at all.
  • test_integrate_meshvar returning 0.6366 where the test expects < 0.001 (an error-norm assertion) is the same shape — the raw value leaking through where the residual should be.

My guess: the array.owner existence guard in your MeshVariable callbacks is silently early-returning when it shouldn't (e.g. during a teardown-and-rebuild of an integration work-variable, where owner is briefly absent but the data sync is still required). Or the mesh_update_lock is gating a path that integration relies on.

Worth running the failing tests locally with UW_MEMPROBE=1 and a print/breakpoint in the new callback guard to see which path is being short-circuited.

After the rebase + fix, push and CI should pass — at which point we can merge.

bknight1 added 8 commits May 12, 2026 08:08
- Added safety comments to NDArray callbacks in MeshVariable and
  SwarmVariable explaining owner-existence checks for object teardown.
- Documented _mesh_update_lock in Mesh and ensured _deform_mesh is
  properly wrapped to prevent PETSc sync conflicts during coordinate
  changes.
- Verified test stability for integrals and swarm statistics.
Documented the ownership check guards in NDArray callbacks to explain
that they handle teardown race conditions during Python garbage
collection (e.g. at application exit or mesh rebuilds).
Use global_sum and global_size for arithmetic mean calculation to ensure
rank-independent assertions when comparing with global integration results.
- Restore default max_its in Swarm.migrate to fix advection particle loss (issue #175)
- Ensure canonical data array has owner for MeshVariable to enable PETSc sync
- Confirmed with parallel advection and integral tests
@bknight1
Member Author

I have addressed the CI failures and the feedback regarding the mesh variable synchronization regression.

Fixes and Verification:

  1. Resolved MeshVariable Synchronization Regression:

    • Root Cause: In the initial PR commit, MeshVariable._create_canonical_data_array was missing the owner=self argument in the NDArray_With_Callback constructor. This caused the PETSc synchronization callback to early-return because array.owner was None, preventing updates to the underlying PETSc data from the Python .data property.
    • Impact: Integration and boundary integral tests were returning 0.0 or residuals that didn't match expectations because the solver was operating on uninitialized PETSc vectors.
    • Fix: Added owner=self to ensure the callback has access to the variable for pack_raw_data_to_petsc calls (see the sketch after this list).
    • Verification: Verified that test_0501_integrals.py, test_0502_boundary_integrals.py, and test_0852_swarm_integration_statistics.py now pass.
  2. Resolved Parallel Swarm Advection (issue #175, "[BUG] swarm particles being lost across processors resulting in 'gaps'"):

    • Root Cause: A regression in Swarm.advection (likely from a previous merge or manual override) set max_its=1 in the final migrate call. This prevented the KD-tree from retrying neighboring ranks for particles at process boundaries, leading to silent deletion.
    • Fix: Restored the default max_its=10 to enable the retry path.
    • Verification: Verified with tests/parallel/test_0765_swarm_advection_no_loss.py on 2 and 4 MPI ranks.
  3. Documentation and Safety Hardening:

    • Updated comments in MeshVariable and SwarmVariable callbacks to clarify that the owner is None check is a safety guard for Python garbage collection race conditions during object teardown (e.g., at exit or during mesh replacement).
    • Documented _mesh_update_lock in Mesh to clarify its role in preventing PETSc synchronization conflicts during coordinate deformations.
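The construction-plus-guard pairing behind item 1, sketched. NDArray_With_Callback and pack_raw_data_to_petsc are the names used in this thread; the callback body is illustrative, and the construction runs inside a MeshVariable method, hence self.

    from underworld3.utilities import NDArray_With_Callback

    # the fix: owner=self is what the sync callback keys off
    array_obj = NDArray_With_Callback(flat_petsc_data, owner=self)

    def _sync_callback(array):
        owner = getattr(array, "owner", None)
        if owner is None:
            # teardown guard: the owner was garbage-collected (exit, mesh
            # replacement). Skipping is correct only then; a live variable
            # built without owner=self also lands here and silently loses
            # all PETSc sync, which is exactly the integral regression above.
            return
        owner.pack_raw_data_to_petsc(array)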

Rebase:

I have rebased the branch onto the latest development to pick up recent race condition fixes and resolve conflicts with the overlapping advection fix.

All reported regressions are now resolved and verified locally.

@lmoresi
Member

lmoresi commented May 14, 2026

Review pass — ready to merge

@bknight1 — the four pre-merge check items from 2026-05-11 are all addressed, and the regression root-cause + fix is well-explained. Going through the verification:

Check items resolved

  1. PtrContainer.__dealloc__ reuse guard — petsc_types.pyx now has paired __cinit__/__dealloc__ plus an explicit free-before-reallocate in allocate(). Clean.
  2. array.owner is None guard — purpose now documented inline as "Python GC race during object teardown (exit / mesh replacement)". The reason the integral tests failed in the first CI run is the exact teardown path that's now guarded — perfect demonstration of why the comment matters.
  3. _mesh_update_lock docstring — added; explains it's held by mesh_update_callback during _deform_mesh() and checked by the variable-side callbacks via var.mesh._mesh_update_lock.acquire(blocking=False). Reentrant RLock is the right primitive (see the sketch after this list).
  4. UW_MEMPROBE=1 smoke + CI green — passing per checks.
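Items 2 and 3 together amount to roughly this guard wiring. A sketch only: _mesh_update_lock and pack_raw_data_to_petsc are taken from the thread, everything else is an illustrative name, and the boolean flag stands in for the explicit recursion guard mentioned in the PR description.

    import threading

    class Mesh:
        def __init__(self):
            self._mesh_update_lock = threading.RLock()  # reentrant, per item 3
            self._updating = False                      # explicit recursion guard

        def mesh_update_callback(self, coords):
            if self._updating:                 # same-thread re-entry: bail out
                return
            self._updating = True
            try:
                with self._mesh_update_lock:   # held across the deformation
                    self._deform_mesh(coords)
            finally:
                self._updating = False

        def _deform_mesh(self, coords):
            pass   # coordinate update and PETSc plumbing elided

    def variable_sync_callback(var, array):
        # a deformation in flight holds the lock: skip PETSc sync, don't race it
        if not var.mesh._mesh_update_lock.acquire(blocking=False):
            return
        try:
            var.pack_raw_data_to_petsc(array)
        finally:
            var.mesh._mesh_update_lock.release()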

Bug found, root cause documented, fix verified

The owner=self regression on _create_canonical_data_array is exactly the kind of failure mode the array.owner guard is meant to catch — when the callback is plumbed correctly the guard is invisible; the moment you forget to pass owner, the guard silently turns off all sync and integrals start returning zero. That's a useful failure mode (loud test failures rather than quiet data corruption later) and worth keeping a note about — perhaps a one-line assertion in _create_canonical_data_array that array_obj.owner is self post-construction, so any future caller that forgets fails immediately rather than via an integral regression three layers up.
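Concretely, that suggested assertion (constructor call per the Copilot snippet earlier; one line after construction):

    array_obj = NDArray_With_Callback(flat_petsc_data, owner=self)
    # fail loudly at construction rather than three layers up in an integral test
    assert array_obj.owner is self, "canonical data array lost its owner; PETSc sync disabled"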

Small leftover from Copilot's review

Copilot flagged unconditional print() calls in mesh_update_callback (no line pinned). Worth checking whether any print(...) statements in the callback path remain unguarded — these fire on every coordinate write and run on all MPI ranks. If any survived the diff, gate behind uw.pprint(0, ...) or a verbose flag before merging. Not blocking if they were already removed or gated.

Cross-validation with #182

Once both #181 and #182 land, the leak-free fingerprint on a tight solver loop should be a flat steady-state RSS, with no per-solve growth in the memprobe diffs.
Worth a quick UW_MEMPROBE=1 smoke on the Gadi-class reproducer from #176 to confirm before declaring that issue closed.

LGTM modulo the print() gating check.

Underworld development team with AI support from Claude Code

Member

@lmoresi lmoresi left a comment


Approving per review pass — all four pre-merge check items addressed, regression root-cause clearly explained, CI green. Closes #175 and provides the Cython/callback foundation for #176.

@lmoresi lmoresi merged commit c914576 into development May 14, 2026
1 check passed