[FLINK-39924] Fix jemalloc narenas configuration by using actual container CPU allowance #266

Open
leekeiabstraction wants to merge 1 commit into apache:dev-master from leekeiabstraction:flink-39924

leekeiabstraction commented Jun 13, 2026

What changes were proposed in this pull request?

Set the jemalloc arena count (narenas) from ncpus derived from the container's cgroup CPU quota, using the same formula as jemalloc's default (4 × ncpus, or 1 when ncpus == 1).

Note that jemalloc's default is not container-aware and uses the host machine's CPU count: https://jemalloc.net/jemalloc.3.html#opt.narenas
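
As an illustration of the formula above, here is a minimal Python sketch. It is not the code in this PR; the function name `default_narenas` is hypothetical and exists only to show how the arena count differs between a 14-core host and a CPU-limited container.

```python
def default_narenas(ncpus: int) -> int:
    """Arena count per the formula in this description:
    1 when ncpus == 1, otherwise 4 * ncpus."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return 1 if ncpus == 1 else 4 * ncpus


# A container limited to 1 CPU on a 14-core host:
host_arenas = default_narenas(14)       # what jemalloc picks from the host count
container_arenas = default_narenas(1)   # what the CPU quota would suggest
```

Under these assumptions the host-derived count is 56 arenas, while the quota-derived count is a single arena.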

Why are the changes needed?

A large number of arenas leaves many of them infrequently used, and an infrequently used arena holds its dirty pages for dirty_decay_ms before releasing the memory to the OS. We observed excessive memory fragmentation in production. Using malloc_stats, we identified the most extreme case at 3.91 GB of fragmentation (10.01 GB Resident - 6.1 GB Active), which is significant for a pod with a 16 GB limit. The cause was a jemalloc arena count higher than intended, because the default is derived from the host CPU count.

  • Excessive memory fragmentation contributes to OOMKills.
  • Excessive memory fragmentation also shrinks the OS page cache, which hurts the performance of operations involving disk reads and writes.
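
The fragmentation figure quoted above is simply resident minus active memory. The short sketch below reproduces that arithmetic and relates it to the pod limit; the helper name `fragmentation_gb` is hypothetical, and the inputs are the values reported in this description.

```python
def fragmentation_gb(resident_gb: float, active_gb: float) -> float:
    """Memory held by the allocator but not in active use, in GB."""
    return resident_gb - active_gb


frag = fragmentation_gb(10.01, 6.1)   # values reported by malloc_stats
share_of_limit = frag / 16 * 100      # pod memory limit is 16 GB

print(f"{frag:.2f} GB fragmented, {share_of_limit:.1f} % of the pod limit")
```

In the worst observed case, roughly a quarter of the pod's memory limit was held by the allocator without being in active use.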

Verifying this change

Reproduced on Docker Desktop running on a 14-core MacBook Pro, with 6 TaskManagers per cluster, each configured with a 2 GB process size and 1 CPU, using the RocksDB state backend.

| Metric | Flink 2.2.1 | Flink 2.2.1 image with fix | Change |
| --- | --- | --- | --- |
| Peak anon resident set size (RSS) | 1655.9 MB | 1472.7 MB | -183.2 MB (-11.1 %) |
| Avg anon RSS | 1477.2 MB | 1273.8 MB | -203.4 MB (-13.7 %) |
| Lowest source throughput | 197666 rec | 202947 rec | +2.7 % |
| Average source throughput | 208187 rec | 209098 rec | +0.4 % |

See here for the reproduction/verification steps (the 4narena image was replaced with the patched image): https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation

Does this PR introduce any user-facing change?

None. The existing behaviour of manually overriding narenas is preserved.

Apache Flink containers load jemalloc via LD_PRELOAD but don't configure
narenas. jemalloc's default is 4 * ncpus, where ncpus is read from
/proc/cpuinfo, the host CPU count, not the container's CPU limit. In
CPU-limited pods on large hosts this over-provisions arenas and causes
RSS fragmentation, since each arena holds dirty pages for dirty_decay_ms
before releasing them to the OS.

Determine the effective CPU count from the cgroup CPU quota directly
(cpu.max for v2, cpu.cfs_quota_us / cpu.cfs_period_us for v1), since
nproc honors cpuset but not CPU quotas. Fall back to nproc when no quota
is set. Skip the override entirely when the user has supplied narenas
in MALLOC_CONF, and append narenas to any other user-supplied MALLOC_CONF
value.
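
The logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the change in this PR: all function names are hypothetical, rounding a fractional quota up to a whole CPU is an assumption, and the caller is assumed to supply the file contents and the nproc fallback value.

```python
import math
from typing import Optional


def cpus_from_cgroup_v2(cpu_max: str) -> Optional[int]:
    """Parse cgroup v2 cpu.max content: "<quota> <period>" or "max <period>"."""
    parts = cpu_max.split()
    if len(parts) != 2 or parts[0] == "max":
        return None  # no quota set
    quota, period = int(parts[0]), int(parts[1])
    if quota <= 0 or period <= 0:
        return None
    # Assumption: round a fractional allowance up, with a floor of 1 CPU.
    return max(1, math.ceil(quota / period))


def cpus_from_cgroup_v1(quota_us: int, period_us: int) -> Optional[int]:
    """Derive CPUs from cpu.cfs_quota_us / cpu.cfs_period_us (-1 means no quota)."""
    if quota_us <= 0 or period_us <= 0:
        return None
    return max(1, math.ceil(quota_us / period_us))


def effective_cpus(quota_cpus: Optional[int], nproc_fallback: int) -> int:
    """Use the quota-derived count, falling back to nproc when no quota is set."""
    return quota_cpus if quota_cpus is not None else nproc_fallback


def merge_malloc_conf(user_conf: str, narenas: int) -> str:
    """Leave MALLOC_CONF untouched if it already sets narenas, else append it."""
    options = [opt for opt in user_conf.split(",") if opt]
    if any(opt.split(":", 1)[0].strip() == "narenas" for opt in options):
        return user_conf  # the user's override wins
    return ",".join(options + [f"narenas:{narenas}"])
```

For example, a cpu.max of "100000 100000" yields 1 CPU, "max 100000" falls through to the nproc value, and a user-supplied `MALLOC_CONF` of `background_thread:true` becomes `background_thread:true,narenas:<n>`.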

leekeiabstraction changed the title from "[FLINK-39924] Size jemalloc narenas from container CPU allowance" to "[FLINK-39924] Fix jemalloc narenas configuration by using actual container CPU allowance" on Jun 14, 2026