Perf: parallelize count_pw_st with OpenMP collapse(2) #7438
    int liy_local = 0, riy_local = 0;

    #ifdef _OPENMP
    #pragma omp parallel for collapse(2) \
Did you compare the performance with collapse(1)? In this kind of loop nest, collapse(1) is often faster than collapse(2) when using the same level of parallelism.
Besides, could you compare the single-thread performance with and without OpenMP? I think collapse(2) might still be much slower even with one thread.
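For reference, the two variants under discussion differ only in the pragma. The following is a minimal sketch, assuming a hypothetical rectangular workload: the function names and the summed expression are illustrative and are not taken from `count_pw_st`.

```cpp
#include <cassert>

// Hypothetical workload: sum a per-cell value over an nx-by-ny grid.
// Variant 1: only the outer ix loop is divided among threads.
inline long long sum_collapse1(int nx, int ny) {
    assert(nx >= 0 && ny >= 0);
    long long total = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : total)
#endif
    for (int ix = 0; ix < nx; ++ix) {
        for (int iy = 0; iy < ny; ++iy) {
            total += static_cast<long long>(ix) * ny + iy;
        }
    }
    return total;
}

// Variant 2: the (ix, iy) nest is flattened into one iteration space
// of nx * ny iterations before being divided among threads.
inline long long sum_collapse2(int nx, int ny) {
    assert(nx >= 0 && ny >= 0);
    long long total = 0;
#ifdef _OPENMP
#pragma omp parallel for collapse(2) reduction(+ : total)
#endif
    for (int ix = 0; ix < nx; ++ix) {
        for (int iy = 0; iy < ny; ++iy) {
            total += static_cast<long long>(ix) * ny + iy;
        }
    }
    return total;
}
```

Both variants return the same result. With `collapse(2)` the loop indices typically have to be recovered from a single flattened index, while `collapse(1)` leaves the inner loop untouched. Which one is faster depends on the loop body and the outer trip count, so timing both is the reliable check.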
Here is the benchmark with `-O3` and `OMP_PROC_BIND=close`.
| Threads | collapse=1 (ms) | collapse=2 (ms) | No Pragma (ms) | Diff (2 vs 1) |
|---|---|---|---|---|
| 1 | 1605.65 | 1613.87 | 1606.42 | +0.5% |
| 4 | 404.674 | 406.729 | 1606.07 | +0.5% |
| 8 | 256.21 | 255.527 | 1606.35 | −0.3% |
| 12 | 232.543 | 233.994 | 1607.89 | +0.6% |
It seems like collapse(2) shows no measurable advantage over collapse(1).
So I've switched to collapse(1). Thanks for the suggestion!
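The derived figures in the table can be recomputed from the raw timings. This is a small sketch with hypothetical helper names; it assumes the "Diff (2 vs 1)" column is the collapse=2 time relative to the collapse=1 time, and that the 1-thread time is the baseline for speedup.

```cpp
#include <cassert>

// Relative difference of `candidate_ms` against `baseline_ms`, in percent.
// Assumed to match the "Diff (2 vs 1)" column: baseline = collapse=1.
inline double relative_diff_percent(double baseline_ms, double candidate_ms) {
    assert(baseline_ms > 0.0);
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0;
}

// Speedup of a parallel run over the single-thread run.
inline double speedup(double serial_ms, double parallel_ms) {
    assert(parallel_ms > 0.0);
    return serial_ms / parallel_ms;
}

// Parallel efficiency: speedup divided by the number of threads used.
inline double parallel_efficiency(double serial_ms, double parallel_ms, int threads) {
    assert(threads > 0);
    return speedup(serial_ms, parallel_ms) / threads;
}
```

For the 4-thread row, `relative_diff_percent(404.674, 406.729)` gives about +0.5, and `parallel_efficiency(1605.65, 404.674, 4)` comes out at roughly 0.99.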
What's changed?
`source/source_basis/module_pw/pw_distributeg.cpp` (`count_pw_st`):
- `parallel for collapse(2)` applied to the `(ix, iy)` double loop for plane-wave stick enumeration
- `reduction(+: npwtot_local, nstot_local)` for accumulation of total plane-wave and stick counts
- `reduction(min/max: ...)` for boundary coordinate tracking (`lix`, `rix`, `liy`, `riy`)

Performance impact (tested on Intel Core i7, GCC 13.3.0, `-O3 -fopenmp`, grid=256×256×256, repeats=10):

`count_pw_st` is the hotspot in PW initialization for large grids.
- Near-linear scaling up to 4 threads (efficiency >95%)
- 8-thread efficiency drops to ~73% due to memory bandwidth saturation
- 12-thread marginal gain diminishes, consistent with SMT overhead on consumer-grade platforms
Behavior changes: None. The serial code path is preserved when `_OPENMP` is undefined. All existing `MODULE_PW_*` unit tests (12/12) continue to pass.
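The reduction pattern listed under "What's changed?" can be sketched as follows. This is not the code from `pw_distributeg.cpp`: the struct, the `pw_on_stick` helper, and the spherical cutoff test are hypothetical stand-ins, and only the variable names and the clause layout follow the description. The sketch uses the default single-level parallelization of the outer loop.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type; field names mirror the PR description.
struct StickCounts {
    long npwtot;   // total plane waves
    int nstot;     // total sticks
    int lix, rix;  // min / max ix that holds a stick
    int liy, riy;  // min / max iy that holds a stick
};

// Hypothetical stand-in: plane waves on stick (ix, iy) inside a sphere
// of squared radius r2 (grid units), scanning iz in [-nz, nz].
inline int pw_on_stick(int ix, int iy, int nz, double r2) {
    int n = 0;
    for (int iz = -nz; iz <= nz; ++iz) {
        const double g2 = double(ix) * ix + double(iy) * iy + double(iz) * iz;
        if (g2 <= r2) ++n;
    }
    return n;
}

inline StickCounts count_sticks_sketch(int nx, int ny, int nz, double r2) {
    assert(nx >= 0 && ny >= 0 && nz >= 0 && r2 >= 0.0);
    long npwtot_local = 0;
    int nstot_local = 0;
    // Starting from 0 is safe here because the stick at (0, 0) always
    // contains the origin when r2 >= 0.
    int lix_local = 0, rix_local = 0;
    int liy_local = 0, riy_local = 0;

#ifdef _OPENMP
#pragma omp parallel for \
    reduction(+ : npwtot_local, nstot_local) \
    reduction(min : lix_local, liy_local) \
    reduction(max : rix_local, riy_local)
#endif
    for (int ix = -nx; ix <= nx; ++ix) {
        for (int iy = -ny; iy <= ny; ++iy) {
            const int n = pw_on_stick(ix, iy, nz, r2);
            if (n == 0) continue;  // no stick at this (ix, iy)
            npwtot_local += n;
            ++nstot_local;
            lix_local = std::min(lix_local, ix);
            rix_local = std::max(rix_local, ix);
            liy_local = std::min(liy_local, iy);
            riy_local = std::max(riy_local, iy);
        }
    }
    return StickCounts{npwtot_local, nstot_local,
                       lix_local, rix_local, liy_local, riy_local};
}
```

When compiled without OpenMP the `#ifdef _OPENMP` guard drops the pragma and the loop runs serially, which matches the "serial code path is preserved" note above.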