[FLINK-39924] Fix jemalloc narenas configuration by using actual container CPU allowance by leekeiabstraction · Pull Request #266 · apache/flink-docker

leekeiabstraction · 2026-06-13T11:07:24Z

What changes were proposed in this pull request?

Set the jemalloc arena count (narenas) from an ncpus value derived from the container's cgroup CPU quota, using the same formula as jemalloc's default: 4 × ncpus, or 1 when ncpus == 1.

Note that jemalloc's default is not container-aware and uses the host machine's CPU count: https://jemalloc.net/jemalloc.3.html#opt.narenas
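The arena formula can be written as a short Python sketch. This is not the PR's code. The function name `narenas_for` is hypothetical and exists only to show the rule from the jemalloc manual: four arenas per CPU, or one arena on a single-CPU system.

```python
def narenas_for(ncpus: int) -> int:
    """Arena count jemalloc's default formula picks for ncpus CPUs.

    Four arenas per CPU, except that a single CPU gets exactly one arena.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be >= 1")
    return 1 if ncpus == 1 else 4 * ncpus


# A container limited to 1 CPU should get a single arena...
print(narenas_for(1))   # 1
# ...but if jemalloc sees all 14 cores of the host, it creates 56.
print(narenas_for(14))  # 56
```

The only difference the PR makes is which value is passed in as ncpus: the cgroup-derived CPU allowance instead of the host CPU count.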

Why are the changes needed?

A large number of arenas leads to infrequently used arenas, and infrequently used arenas hold dirty pages for dirty_decay_ms before releasing that memory to the OS. We observed excessive memory fragmentation in production. Using malloc_stats, we identified the most extreme case of fragmentation at 3.91 GB (10.01 GB Resident - 6.1 GB Active). This was significant because the pod has a 16 GB memory limit. The cause was that jemalloc's default arena count was higher than expected, because it is derived from the host CPU count.
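Here, fragmentation is the Resident figure minus the Active figure reported by malloc_stats. The Python check below reproduces that arithmetic and shows what share of the 16 GB pod limit it represents. The helper name is illustrative only.

```python
def fragmentation_gb(resident_gb: float, active_gb: float) -> float:
    """Fragmentation estimated as resident minus active memory (GB)."""
    return resident_gb - active_gb


frag = fragmentation_gb(10.01, 6.1)
pod_limit_gb = 16.0
print(f"fragmentation: {frag:.2f} GB")                   # fragmentation: 3.91 GB
print(f"share of pod limit: {frag / pod_limit_gb:.1%}")  # share of pod limit: 24.4%
```

In other words, roughly a quarter of the pod's memory budget was resident but not backing live allocations.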

Excessive memory fragmentation contributes to OOMKills.
Excessive memory fragmentation also shrinks the memory available to the OS page cache, which degrades the performance of disk reads and writes.

Verifying this change

Reproduced on Docker Desktop running on a 14-core MacBook Pro. Each cluster ran 6 TaskManagers, each configured with a 2 GB process size, 1 CPU, and the RocksDB state backend.

| Metric | Flink 2.2.1 | Flink 2.2.1 image with fix | Change |
|---|---|---|---|
| Peak anonymous resident set size (RSS) | 1655.9 MB | 1472.7 MB | −183.2 MB (−11.1 %) |
| Average anonymous RSS | 1477.2 MB | 1273.8 MB | −203.4 MB (−13.8 %) |
| Lowest source throughput | 197666 rec | 202947 rec | +2.7 % |
| Average source throughput | 208187 rec | 209098 rec | +0.4 % |
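The Change column is the relative difference between the Flink 2.2.1 baseline and the patched image. The short Python sketch below recomputes it from the measured values. The dictionary keys are just labels.

```python
# (baseline Flink 2.2.1, patched image) values from the measurements
rows = {
    "peak anon RSS (MB)": (1655.9, 1472.7),
    "avg anon RSS (MB)": (1477.2, 1273.8),
    "lowest source throughput (rec)": (197666, 202947),
    "average source throughput (rec)": (208187, 209098),
}


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


for name, (before, after) in rows.items():
    print(f"{name}: {after - before:+.1f} ({pct_change(before, after):+.1f} %)")
```

For these inputs, the relative changes are -11.1 %, -13.8 %, +2.7 % and +0.4 %.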

See the following branch for the reproduction and verification steps (the 4narena image was replaced with the patched image): https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation

Does this PR introduce any user-facing change?

None. The existing behaviour of manually overriding narenas is preserved.

Apache Flink containers load jemalloc via LD_PRELOAD but don't configure narenas. jemalloc's default is 4 * ncpus, where ncpus is read from /proc/cpuinfo. That value reflects the host CPU count rather than the container's CPU limit. In CPU-limited pods on large hosts, this over-provisions arenas and causes RSS fragmentation, because each arena holds dirty pages for dirty_decay_ms before releasing them to the OS. Determine the effective CPU count directly from the cgroup CPU quota (cpu.max for v2, cpu.cfs_quota_us / cpu.cfs_period_us for v1), because nproc honors cpuset but not CPU quotas. Fall back to nproc when no quota is set. Skip the override entirely when the user has supplied narenas in MALLOC_CONF. Otherwise, append narenas to any user-supplied MALLOC_CONF value.
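The detection and merge steps described above can be sketched in Python. This is not the PR's code, which is not shown here. The function names, the cgroup v1 mount path /sys/fs/cgroup/cpu/, and rounding a fractional quota up to a whole CPU are all assumptions made for illustration.

```python
import math
import os
from typing import Optional

# Assumed in-container mount points; actual paths depend on the runtime.
CGROUP_V2_CPU_MAX = "/sys/fs/cgroup/cpu.max"
CGROUP_V1_QUOTA = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"
CGROUP_V1_PERIOD = "/sys/fs/cgroup/cpu/cpu.cfs_period_us"


def cpus_from_quota(quota: int, period: int) -> Optional[int]:
    """Turn a CFS quota/period pair into a whole CPU count (rounded up, at least 1).

    Returns None when no quota applies (cgroup v1 reports -1 for "unlimited").
    """
    if quota <= 0 or period <= 0:
        return None
    return max(1, math.ceil(quota / period))


def parse_cpu_max(text: str) -> Optional[int]:
    """Parse cgroup v2 cpu.max, e.g. "max 100000" (no limit) or "150000 100000"."""
    quota, period = text.split()
    if quota == "max":
        return None
    return cpus_from_quota(int(quota), int(period))


def read_file(path: str) -> Optional[str]:
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def effective_cpus() -> int:
    """Try the cgroup v2 quota, then the v1 quota, then an nproc-like fallback."""
    cpu_max = read_file(CGROUP_V2_CPU_MAX)
    if cpu_max:
        cpus = parse_cpu_max(cpu_max)
        if cpus is not None:
            return cpus
    quota = read_file(CGROUP_V1_QUOTA)
    period = read_file(CGROUP_V1_PERIOD)
    if quota and period:
        cpus = cpus_from_quota(int(quota), int(period))
        if cpus is not None:
            return cpus
    # No quota: like nproc, count the CPUs this process may run on (honors cpuset).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def merge_malloc_conf(existing: str, narenas: int) -> str:
    """Append narenas to MALLOC_CONF unless the user already set it."""
    options = [opt.strip() for opt in existing.split(",") if opt.strip()]
    if any(opt.split(":", 1)[0] == "narenas" for opt in options):
        return existing  # keep the user's explicit override untouched
    return ",".join(options + [f"narenas:{narenas}"])


if __name__ == "__main__":
    ncpus = effective_cpus()
    narenas = 1 if ncpus == 1 else 4 * ncpus  # same formula as jemalloc's default
    print("MALLOC_CONF=" + merge_malloc_conf(os.environ.get("MALLOC_CONF", ""), narenas))
```

Checking for the narenas key before appending keeps an explicit user setting authoritative. Appending, rather than replacing, preserves other options such as dirty_decay_ms.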

leekeiabstraction force-pushed the flink-39924 branch from f27f132 to dbdb96c on June 13, 2026 12:23

leekeiabstraction changed the title from "[FLINK-39924] Size jemalloc narenas from container CPU allowance" to "[FLINK-39924] Fix jemalloc narenas configuration by using actual container CPU allowance" on Jun 14, 2026

leekeiabstraction wants to merge 1 commit into apache:dev-master from leekeiabstraction:flink-39924