Perf: OpenMP cache blocking and SIMD for PW_Basis FFT transform copy routines #7439
MiniYuanBot wants to merge 3 commits into
Label: project_learning
Qianruipku left a comment:
LGTM. Could you include a performance comparison with the original implementation?
Here is the benchmark on a 256³ grid, 20 repeats, GCC 13.3.0,
Block size rationale
Note on absolute numbers
Thanks for the suggestion!
What's changed
This PR optimizes the memory-bound copy loops in PW_Basis::real2recip and PW_Basis::recip2real (source/module_pw/pw_transform.cpp) using cache blocking and SIMD vectorization, while maintaining full numerical compatibility with the original implementation.
Key Changes
Cache blocking (tiling)
Introduced a unified block size pw_transform_cache_block = 1024 and a helper block_end(). All long copy loops are rewritten in a two-level structure: an outer loop over blocks and an inner loop over the elements within each block. This keeps the working set in L1/L2 cache and mitigates false sharing across OpenMP threads.
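A minimal sketch of what such a two-level loop can look like, offered as an illustration and not as the code from this PR. The names sketch_cache_block, sketch_block_end, and blocked_real_to_complex are hypothetical stand-ins; the actual definition and signature of block_end() may differ. The inner loop carries the #pragma omp simd described under "SIMD vectorization", and the sketch assumes a real-to-complex copy with a zero imaginary part. With 1024 elements per block, one block of std::complex<double> occupies 16 KiB.

```cpp
#include <algorithm>
#include <complex>
#include <vector>

// Hypothetical stand-in for pw_transform_cache_block (value taken from the PR text).
constexpr int sketch_cache_block = 1024;

// Hypothetical stand-in for block_end(): returns the exclusive end index of the
// block that starts at `begin`, clamped to the total length `n`.
inline int sketch_block_end(const int begin, const int n)
{
    return std::min(begin + sketch_cache_block, n);
}

// Copies a real array into a complex buffer, setting the imaginary part to zero.
void blocked_real_to_complex(const double* in, std::complex<double>* out, const int n)
{
    // Outer level: whole blocks are distributed over OpenMP threads, so each
    // thread writes contiguous chunks of at least one block (except the tail).
#pragma omp parallel for schedule(static)
    for (int ib = 0; ib < n; ib += sketch_cache_block)
    {
        const int ie = sketch_block_end(ib, n);
        // Inner level: stride-1 loop over one block, marked for vectorization.
#pragma omp simd
        for (int i = ib; i < ie; ++i)
        {
            out[i] = std::complex<double>(in[i], 0.0);
        }
    }
}
```

Compiled with GCC and -fopenmp, the outer loop runs in parallel. Without that flag the pragmas are ignored and the function produces the same result serially. The clamp in sketch_block_end handles lengths that are not a multiple of the block size.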
SIMD vectorization
Added #pragma omp simd to the inner stride-1 loops (contiguous copy, zeroing, and accumulation). This helps the compiler emit SIMD instructions (AVX2/AVX-512) with contiguous loads and stores for std::complex<FPTYPE> and real-valued buffers.
Alias analysis & pointer caching
Cached frequently accessed member variables (nrxx, npw, nxyz, ig2isz) and FFT buffer pointers (auxr, auxg, rspace) as local const variables. This reduces repeated this-> indirection and helps the compiler's alias analysis.
Finer-grained timers
Added sub-timers (real2recip_copy_r, real2recip_copy_g, recip2real_copy_r, recip2real_copy_g) to isolate memory-copy overhead from FFT library time, aiding future profiling.
Performance (256^3 grid, ecut=50, 20 repeats, WSL2 GCC 13.3.0)
Files Changed
source/module_pw/pw_transform.cpp: optimized copy loops and timers