[FLINK-39924] Fix jemalloc narenas configuration by using actual container CPU allowance #266
Open
leekeiabstraction wants to merge 1 commit into
Apache Flink containers load jemalloc via LD_PRELOAD but don't configure narenas. jemalloc's default is 4 * ncpus, where ncpus is read from /proc/cpuinfo and therefore reflects the host CPU count, not the container's CPU limit. In CPU-limited pods on large hosts this over-provisions arenas and causes RSS fragmentation, since each arena holds dirty pages for dirty_decay_ms before releasing them to the OS.
Determine the effective CPU count from the cgroup CPU quota directly (cpu.max for v2, cpu.cfs_quota_us / cpu.cfs_period_us for v1), since nproc honors cpuset but not CPU quotas. Fall back to nproc when no quota is set. Skip the override entirely when the user has supplied narenas in MALLOC_CONF, and append narenas to any other user-supplied MALLOC_CONF value.
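The quota lookup described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name effective_cpus, the cgroup_root parameter, the v1 path layout under cpu/, and the choice to round fractional quotas up are all assumptions.

```shell
# Hypothetical helper (not the PR's actual code).
# Prints the CPU count implied by the cgroup CPU quota, or the output of
# nproc when no quota is set. The cgroup root is a parameter so the
# function can be exercised against a scratch directory.
effective_cpus() {
    local cgroup_root="${1:-/sys/fs/cgroup}"
    local quota="" period=""

    if [ -r "$cgroup_root/cpu.max" ]; then
        # cgroup v2: "<quota> <period>", where quota is "max" if unlimited.
        read -r quota period < "$cgroup_root/cpu.max"
    elif [ -r "$cgroup_root/cpu/cpu.cfs_quota_us" ] &&
         [ -r "$cgroup_root/cpu/cpu.cfs_period_us" ]; then
        # cgroup v1 (assumed mount layout): quota is -1 if unlimited.
        read -r quota < "$cgroup_root/cpu/cpu.cfs_quota_us"
        read -r period < "$cgroup_root/cpu/cpu.cfs_period_us"
    fi

    # Anything that is not a plain positive number means "no usable quota".
    case "$quota" in
        ''|*[!0-9]*) nproc; return ;;
    esac
    case "$period" in
        ''|0|*[!0-9]*) nproc; return ;;
    esac

    # Assumption: round up, so a quota of 1.5 CPUs counts as 2.
    local cpus=$(( (quota + period - 1) / period ))
    if [ "$cpus" -lt 1 ]; then
        cpus=1
    fi
    echo "$cpus"
}
```

Called with no argument, the sketch reads from /sys/fs/cgroup. Whether the real change rounds up or down for fractional quotas is not stated in the description, so treat that detail as a placeholder.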
f27f132 to dbdb96c
What changes were proposed in this pull request?
Set the jemalloc arena count using ncpus derived from the container's cgroup CPU quota, with the same formula as jemalloc's default (4 × ncpus, or 1 when ncpus == 1).
Note that jemalloc's default is not container-aware and uses the host machine's CPU count: https://jemalloc.net/jemalloc.3.html#opt.narenas
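As a small illustration of the formula, the arena count for a given CPU count might be computed like this. The function name narenas_for is hypothetical and the snippet is a sketch, not the code in this PR.

```shell
# Hypothetical helper (not the PR's actual code).
# Mirrors jemalloc's documented default for opt.narenas:
# four times the number of CPUs, or one if there is a single CPU.
narenas_for() {
    local ncpus="$1"
    if [ "$ncpus" -le 1 ]; then
        echo 1
    else
        echo $(( 4 * ncpus ))
    fi
}
```

For a container limited to 1 CPU on a 14-core host, this gives 1 arena when fed the container's allowance, versus 56 when fed the host core count.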
Why are the changes needed?
A large number of arenas leaves many of them infrequently used, and infrequently used arenas hold dirty pages for dirty_decay_ms before releasing memory to the OS. We observed excessive memory fragmentation in production. Using malloc_stats, we identified the most extreme case of fragmentation at 3.91 GB (10.01 GB resident - 6.1 GB active), which is significant because the pod has a limit of 16 GB. This was caused by the jemalloc arena count defaulting to a higher value than expected, because the default is derived from the host CPU count.
Verifying this change
Reproduced on Docker Desktop running on a MacBook Pro with 14 cores, with 6 TaskManagers per cluster, each configured with a 2 GB process size and 1 CPU, using the RocksDB state backend.
See the following for the reproduction/verification steps (the 4narena image was replaced with the patched image): https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation
Does this PR introduce any user-facing change?
None. The existing behaviour of manually overriding narenas is preserved.
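One way the override rule could look in shell is sketched below. The function name merge_malloc_conf is hypothetical and the substring match on "narenas:" is an assumption about how a user-supplied value is detected; the actual entrypoint logic may differ.

```shell
# Hypothetical helper (not the PR's actual code).
# Prints the MALLOC_CONF value to export, given the current value and the
# computed arena count.
merge_malloc_conf() {
    local current="$1" narenas="$2"
    case "$current" in
        *narenas:*)
            # The user already chose narenas: leave the value untouched.
            echo "$current" ;;
        '')
            # Nothing set: narenas is the only option.
            echo "narenas:$narenas" ;;
        *)
            # Other options present: append narenas as another key:value pair.
            echo "$current,narenas:$narenas" ;;
    esac
}
```

A call such as `export MALLOC_CONF="$(merge_malloc_conf "${MALLOC_CONF:-}" 4)"` would then keep a user-supplied narenas as-is and only add the computed value when none is present.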