Refactor/merge openmp by lijianing-sudo · Pull Request #7446 · deepmodeling/abacus-develop

lijianing-sudo · 2026-06-06T17:00:51Z

PR: OpenMP Parallel Optimization for ABACUS MD Module and ML Potential Interfaces (NEP/DPMD/LJ)

Reminder

Have you linked an issue with this pull request?
Have you added adequate unit tests and/or case tests for your pull request?
Have you noticed possible changes of behavior below or in the linked issue?
Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

A unit test is added for each new feature or bug fix.

Existing Unit Tests Pass:

MODULE_MD_LJ_pot (6 tests)
MODULE_MD_func (7 tests)
MODULE_MD_fire
MODULE_MD_verlet
MODULE_MD_nhc
MODULE_MD_msst
MODULE_MD_lgv

Test Infrastructure:

Introduced shared MD test fixtures (source/source_md/test/md_test_fixture.h) to eliminate duplicated SetUp/TearDown across 6 test files.

Microbenchmark Verification:

Independent C++ microbenchmarks were written for each optimized kernel (see Test/openmp_nep_basic_benchmark.cpp and companion scripts).
Each kernel was run on 2 million atoms, repeated 5 times, at 1/2/4/8/16 threads on an Intel Xeon Platinum 8163.
All per-atom write loops produce bitwise-identical results (max_abs_diff = 0).
Reduction loops show floating-point differences at the 1e-10 to 1e-8 level because the summation order changes; this is expected and acceptable for MD trajectories.
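The gap between the two loop classes follows from floating-point addition not being associative. The sketch below emulates per-thread partial sums serially, so it needs no OpenMP runtime; `chunked_sum` and the sample values in the note after it are illustrative assumptions, not code or data from the benchmark sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum `values` as `nchunks` contiguous partial sums that are then combined,
// mimicking how a static-schedule reduction groups additions per thread.
// Hypothetical helper, not part of the ABACUS benchmark code.
double chunked_sum(const std::vector<double>& values, std::size_t nchunks)
{
    assert(nchunks > 0);
    const std::size_t n = values.size();
    double total = 0.0;
    for (std::size_t c = 0; c < nchunks; ++c)
    {
        const std::size_t begin = c * n / nchunks;
        const std::size_t end = (c + 1) * n / nchunks;
        double partial = 0.0;  // plays the role of one thread's private sum
        for (std::size_t i = begin; i < end; ++i)
        {
            partial += values[i];
        }
        total += partial;
    }
    return total;
}
```

For the input `{1.0, 1e-16, 1e-16, 1e-16, 1e-16}`, one chunk and two chunks give sums that differ in the last bit. A per-atom write loop never combines values across iterations, which is why it can stay bitwise identical under any thread count.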

What's changed?

This PR integrates OpenMP parallelization from three feature branches (refactor/md-factory, refactor/parallel-optimize, refactor/md-openmp-remainder) into the ABACUS MD module and ML potential interfaces. 22 parallel loops or worksharing regions are added across 12 source files (+3934/−342 lines total).

1. MD Base Loops (`source/source_md/`)

| Function | File | Strategy |
| --- | --- | --- |
| `MD_base::update_pos()` | `md_base.cpp` | `#pragma omp parallel for schedule(static)` |
| `MD_base::update_vel()` | `md_base.cpp` | `#pragma omp parallel for schedule(static)` |
| `kinetic_energy()` | `md_func.cpp` | `reduction(+:ke)` |
| `force_virial()` force copy | `md_func.cpp` | Parallel per-atom copy |
| `temp_vector()` | `md_func.cpp` | 9 scalar reductions instead of shared-matrix accumulation |
| `rescale_vel()` | `md_func.cpp` | `schedule(static)` |

All loops carry an `if (natom >= 256)` clause, so small systems run serially and skip the parallel overhead.
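As a minimal sketch of the two loop shapes in this section, the code below shows a per-atom write loop and a scalar reduction, both gated by the size threshold. `Vec3`, `update_pos_sketch`, and `kinetic_energy_sketch` are hypothetical stand-ins; the actual signatures in `md_base.cpp` and `md_func.cpp` are not reproduced here.

```cpp
#include <cassert>
#include <vector>

// Stand-in 3-vector; the real code uses ABACUS's own vector type.
struct Vec3
{
    double x, y, z;
};

// Per-atom write loop: iteration i only touches pos[i], so threads never
// share an output element and the result matches the serial loop exactly.
void update_pos_sketch(std::vector<Vec3>& pos, const std::vector<Vec3>& vel, double dt)
{
    assert(pos.size() == vel.size());
    const int natom = static_cast<int>(pos.size());
#pragma omp parallel for schedule(static) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}

// Reduction loop: every thread sums into a private copy of `ke`, and OpenMP
// adds the private copies together when the loop ends.
double kinetic_energy_sketch(const std::vector<Vec3>& vel, const std::vector<double>& mass)
{
    assert(vel.size() == mass.size());
    const int natom = static_cast<int>(vel.size());
    double ke = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : ke) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        const Vec3& v = vel[i];
        ke += 0.5 * mass[i] * (v.x * v.x + v.y * v.y + v.z * v.z);
    }
    return ke;
}
```

Built without OpenMP support, the pragmas are ignored and both functions run as ordinary serial loops.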

2. NEP Interface (`source/source_esolver/esolver_nep.cpp/.h`)

Added atom_type_index / atom_local_index index caches for flat iat-based parallel loops.
Parallelized: coordinate buffer fill, per-atom energy reduction, force copy-back with unit conversion, and 9-component per-atom virial reduction.
NEP virial: reorganized from 9 separate full-array scans into a single per-atom scan with 9 scalar reductions, so the 14.24× speedup at 8 threads combines algorithmic and parallel gains.
nep.compute() external library call remains serial.
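The virial reorganization can be sketched as follows. The component-major buffer layout (`per_atom[k * natom + i]`) and the function name are assumptions made for illustration; the real layout is whatever the NEP library returns.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a per-atom virial buffer into 9 totals in one pass over the atoms.
// Layout assumption: component k of atom i is stored at per_atom[k * natom + i].
std::array<double, 9> virial_sum_sketch(const std::vector<double>& per_atom, int natom)
{
    assert(natom >= 0);
    assert(per_atom.size() == 9 * static_cast<std::size_t>(natom));
    const double* v = per_atom.data();
    const std::size_t n = static_cast<std::size_t>(natom);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
    double s5 = 0.0, s6 = 0.0, s7 = 0.0, s8 = 0.0;
    // Nine scalars in one reduction clause: one parallel region instead of nine.
#pragma omp parallel for schedule(static) \
    reduction(+ : s0, s1, s2, s3, s4, s5, s6, s7, s8) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        const std::size_t a = static_cast<std::size_t>(i);
        s0 += v[0 * n + a];
        s1 += v[1 * n + a];
        s2 += v[2 * n + a];
        s3 += v[3 * n + a];
        s4 += v[4 * n + a];
        s5 += v[5 * n + a];
        s6 += v[6 * n + a];
        s7 += v[7 * n + a];
        s8 += v[8 * n + a];
    }
    return std::array<double, 9>{{s0, s1, s2, s3, s4, s5, s6, s7, s8}};
}
```

Scalar reduction variables are used here because they work with older OpenMP versions; array-section reductions would be an alternative on compilers that support them.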

3. DPMD Interface (`source/source_esolver/esolver_dp.cpp/.h`)

Added iat → (it, ia) index caches.
Parallelized: coordinate buffer fill and model force copy-back with unit conversion.
Introduced persistent member buffers (dp_cell, dp_coord, dp_model_force, dp_model_virial) to avoid repeated allocations.
dp.compute() external library call and 3×3 virial copy-back remain serial.
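A rough sketch of the index cache and the coordinate fill is given below. `IndexCache`, `build_index_cache`, `fill_coord_buffer`, and the nested position container are hypothetical; they only illustrate how a flat `iat` loop replaces the nested type/atom loops and how a reused buffer avoids reallocation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Flat atom index iat -> (type index it, index within type ia).
struct IndexCache
{
    std::vector<int> type_of;
    std::vector<int> local_of;
};

// Built once, serially, from the number of atoms of each type.
IndexCache build_index_cache(const std::vector<int>& atoms_per_type)
{
    IndexCache cache;
    for (std::size_t it = 0; it < atoms_per_type.size(); ++it)
    {
        for (int ia = 0; ia < atoms_per_type[it]; ++ia)
        {
            cache.type_of.push_back(static_cast<int>(it));
            cache.local_of.push_back(ia);
        }
    }
    return cache;
}

// Fill a flat coordinate buffer (x0, y0, z0, x1, ...) with a unit conversion.
// `coord` is meant to be a long-lived buffer: resize() is a no-op once it
// already has the right size, so repeated calls do not reallocate.
void fill_coord_buffer(const std::vector<std::vector<std::array<double, 3>>>& pos_by_type,
                       const IndexCache& cache, double unit_scale, std::vector<double>& coord)
{
    assert(cache.type_of.size() == cache.local_of.size());
    const int natom = static_cast<int>(cache.type_of.size());
    coord.resize(3 * static_cast<std::size_t>(natom));
#pragma omp parallel for schedule(static) if (natom >= 256)
    for (int iat = 0; iat < natom; ++iat)
    {
        const std::array<double, 3>& r = pos_by_type[cache.type_of[iat]][cache.local_of[iat]];
        for (int d = 0; d < 3; ++d)
        {
            coord[3 * static_cast<std::size_t>(iat) + d] = r[d] * unit_scale;
        }
    }
}
```

The cache turns the loop bounds into a single known trip count, which is what lets a plain `parallel for` divide the work evenly.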

4. Thermostat and Barostat (`source/source_md/`)

| Class | Method | File |
| --- | --- | --- |
| `Verlet` | `thermalize()` velocity rescaling | `verlet.cpp` |
| `MSST` | `rescale()` shock-direction velocity scaling | `msst.cpp` |
| `MSST` | `vel_sum()` velocity norm reduction | `msst.cpp` |
| `MSST` | `propagate_vel()` per-atom velocity propagation | `msst.cpp` |
| `NoseHoover` | `particle_thermo()` final velocity scaling | `nhchain.cpp` |
| `NoseHoover` | `vel_baro()` barostat velocity update | `nhchain.cpp` |

Thermostat chain recurrence integration and cell dilation remain serial.

5. FIRE Algorithm (`source/source_md/fire.cpp`)

FIRE::check_fire() parallelized in three phases:

Three-scalar reduction for P, sumforce, normvel
Parallel velocity-force mixing
Parallel velocity zeroing (in P <= 0 branch)

Scalar state updates (alpha, negative_count, dt) remain serial.
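The three phases can be sketched on flat arrays as below. The mixing formula is the standard FIRE velocity update and is an assumption here; the exact expressions and data types in `fire.cpp` may differ, and `check_fire_sketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One FIRE velocity check on flat arrays of length 3 * natom.
// Returns the power P = sum(v . F). The scalar state (alpha, negative_count,
// dt) is left to the caller and stays serial.
double check_fire_sketch(std::vector<double>& vel, const std::vector<double>& force, double alpha)
{
    assert(vel.size() == force.size() && vel.size() % 3 == 0);
    const int n = static_cast<int>(vel.size());
    const int natom = n / 3;

    // Phase 1: three scalar reductions in a single pass.
    double P = 0.0, sumforce = 0.0, normvel = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : P, sumforce, normvel) if (natom >= 256)
    for (int i = 0; i < n; ++i)
    {
        P += vel[i] * force[i];
        sumforce += force[i] * force[i];
        normvel += vel[i] * vel[i];
    }
    sumforce = std::sqrt(sumforce);
    normvel = std::sqrt(normvel);

    // Phase 2: mix the velocity with the force direction (skipped when the
    // total force is exactly zero to avoid dividing by zero).
    if (sumforce > 0.0)
    {
#pragma omp parallel for schedule(static) if (natom >= 256)
        for (int i = 0; i < n; ++i)
        {
            vel[i] = (1.0 - alpha) * vel[i] + alpha * force[i] / sumforce * normvel;
        }
    }

    // Phase 3: moving against the force, so stop all atoms.
    if (P <= 0.0)
    {
#pragma omp parallel for schedule(static) if (natom >= 256)
        for (int i = 0; i < n; ++i)
        {
            vel[i] = 0.0;
        }
    }
    return P;
}
```

Phases 2 and 3 depend on the reduced scalars from phase 1, so the three loops cannot be fused into one without an extra synchronization point.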

6. LJ Interface (`source/source_esolver/esolver_lj.cpp/.h`)

Added global atom index cache.
Restructured nested type-iteration loops into a flat iat-based loop.
Uses schedule(dynamic, 32) to handle neighbor-count imbalance.
Thread-local potential and virial arrays are merged at thread exit with an atomic update (energy) and a critical section (virial); no per-neighbor locks are taken.
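The accumulation pattern described above can be sketched with a simplified all-pairs Lennard-Jones loop. This is an illustration under stated assumptions: there is no neighbor list, cutoff, or periodic boundary, the virial is left as an unnormalized pair sum, and `LJResult` and `lj_all_pairs_sketch` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct LJResult
{
    double energy = 0.0;
    std::array<double, 9> virial{};  // sum over pairs of d_a * F_b, unnormalized
};

LJResult lj_all_pairs_sketch(const std::vector<std::array<double, 3>>& pos,
                             double epsilon, double sigma)
{
    assert(sigma > 0.0);
    const int natom = static_cast<int>(pos.size());
    double energy = 0.0;
    std::array<double, 9> virial{};

#pragma omp parallel if (natom >= 256)
    {
        double e_local = 0.0;              // private to this thread
        std::array<double, 9> w_local{};   // private to this thread

#pragma omp for schedule(dynamic, 32)
        for (int i = 0; i < natom; ++i)
        {
            for (int j = 0; j < natom; ++j)
            {
                if (j == i) { continue; }
                const double d[3] = {pos[i][0] - pos[j][0], pos[i][1] - pos[j][1],
                                     pos[i][2] - pos[j][2]};
                const double r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
                const double sr2 = sigma * sigma / r2;
                const double sr6 = sr2 * sr2 * sr2;
                // Each pair is visited twice, as (i,j) and (j,i), hence the 0.5.
                e_local += 0.5 * 4.0 * epsilon * (sr6 * sr6 - sr6);
                const double fscale = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2;
                for (int a = 0; a < 3; ++a)
                {
                    for (int b = 0; b < 3; ++b)
                    {
                        w_local[3 * a + b] += 0.5 * d[a] * fscale * d[b];
                    }
                }
            }
        }

        // One merge per thread, not one lock per neighbor.
#pragma omp atomic
        energy += e_local;

#pragma omp critical
        {
            for (int k = 0; k < 9; ++k) { virial[k] += w_local[k]; }
        }
    }

    LJResult out;
    out.energy = energy;
    out.virial = virial;
    return out;
}
```

Because the merge order depends on which thread finishes first, totals from this pattern can differ between runs at rounding level, in line with the reduction-loop behavior noted in the microbenchmark section.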

7. Code Quality Refactors

Extracted MD statistics helpers: calc_kinetic_state() / calc_stress_state() (md_func.h, md_statistics.h).
MD runner factory function: new/delete → std::unique_ptr (run_md.cpp).
Shared test fixture base classes to reduce duplication across 6 MD test files.
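The ownership change in the runner factory can be illustrated with a small sketch. The class names, the factory name, and the `md_type` strings below are all hypothetical and do not mirror the actual `run_md.cpp` interface; C++11 syntax is used so no `std::make_unique` is required.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins for the MD runner hierarchy.
class MDRunnerSketch
{
  public:
    virtual ~MDRunnerSketch() = default;
    virtual std::string name() const = 0;
};

class VerletSketch : public MDRunnerSketch
{
  public:
    std::string name() const override { return "verlet"; }
};

class FireSketch : public MDRunnerSketch
{
  public:
    std::string name() const override { return "fire"; }
};

// Factory: the caller receives ownership through unique_ptr, so no matching
// delete is needed on any return path. Returns nullptr for unknown types.
std::unique_ptr<MDRunnerSketch> make_md_runner_sketch(const std::string& md_type)
{
    assert(!md_type.empty());
    if (md_type == "fire")
    {
        return std::unique_ptr<MDRunnerSketch>(new FireSketch());
    }
    if (md_type == "verlet")
    {
        return std::unique_ptr<MDRunnerSketch>(new VerletSketch());
    }
    return nullptr;
}
```

The virtual destructor on the base class is what makes destruction through the base pointer well defined.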

Performance Summary (Microbenchmark, 8 threads, 2M atoms, Xeon Platinum 8163)

| Category | Kernel | Speedup | Efficiency |
| --- | --- | --- | --- |
| MD Base | `update_pos` | 7.36× | 92.0% |
| MD Base | `update_vel` | 7.19× | 89.9% |
| MD Base | `kinetic_energy` | 6.98× | 87.2% |
| MD Base | `temp_vector` | 7.38× | 92.2% |
| NEP | `coord_fill` | 7.15× | 89.4% |
| NEP | `energy_sum` | 8.03× | 100.4% |
| NEP | `force_fill` | 6.99× | 87.4% |
| NEP | `virial_sum` | 14.24× | 177.9%* |
| DPMD | `coord_fill` | 5.50× | 68.8% |
| DPMD | `force_copy` | 7.19× | 89.9% |
| Verlet | `thermalize` | 7.80× | 97.5% |
| MSST | `rescale` | 7.28× | 91.1% |
| MSST | `propagate_vel` | 7.18× | 89.7% |
| NHC | `particle_thermo` | 7.21× | 90.2% |
| FIRE | `check_fire` (mix) | 7.64× | 95.6% |
| LJ | `runner` core loop | 6.96× | 87.0% |

*NEP virial 14.24× includes loop reorganization benefits beyond pure 8-thread scaling.
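The Efficiency column appears to be speedup divided by thread count. The table does not state the definition, so the formula below is an assumption, with $T_1$ and $T_p$ denoting the 1-thread and $p$-thread run times:

```latex
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p},
\qquad
E_{\texttt{update\_pos}}(8) = \frac{7.36}{8} = 0.920 = 92.0\%
```

Under this definition, any value above 100% means the kernel ran faster than ideal 8-way scaling of its baseline would allow.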

Known Limitations & Future Work

End-to-end tests: NEP and DPMD optimizations lack end-to-end tests with real external model libraries (__NEP, deepmd).
LJ parallel path: existing LJ unit tests use 4 atoms (< 256 threshold), covering only the serial path.
MPI + OpenMP hybrid: microbenchmarks are single-process; oversubscription risks under mixed MPI/OpenMP have not been characterized.
Thread threshold: nat >= 256 is an empirical uniform threshold; per-kernel tuning (64/128/256/512) is recommended.
LJ scheduling: schedule(dynamic, 32) vs static and optimal chunk size have not been systematically benchmarked across different neighbor distributions.
Microbenchmark results ≠ end-to-end wall-time: excluded overheads include MPI communication, neighbor-list construction, file I/O, and external model computation.

Any changes of core modules? (ignore if not applicable)

The MD ESolver interface layer (esolver_nep.cpp, esolver_dp.cpp, esolver_lj.cpp) is modified to add index caches and parallel worksharing constructs. No changes to the ESolver base class virtual function signatures. All external library calls (nep.compute(), dp.compute()) remain serial and their calling convention is unchanged.


…T, NHC, FIRE, LJ)

Cover 6 remaining hot-path per-atom loops that were not parallelized in the prior merge-openmp branch:

- md_func.cpp: rescale_vel(), which applies the velocity rescaling factor
- msst.cpp: vel_sum(), a norm2 reduction, and propagate_vel(), an exp-based velocity propagation (highest compute density among the uncovered loops)
- nhchain.cpp: vel_baro(), the NPT per-atom velocity scaling
- fire.cpp: check_fire(), with a triple reduction, velocity mixing, and velocity zeroing
- esolver_lj.cpp: runner(), the N² neighbor pair computation, with schedule(dynamic) for load balancing and per-thread virial accumulation

All optimizations use schedule(static) with a nat>=256 threshold (LJ uses dynamic,32 for neighbor-count imbalance). No data dependencies changed: all loops are per-atom independent. No conflict with the prior merge-openmp branch.


Audrey-777 and others added 13 commits May 30, 2026 21:04

refactor: prepare MD runner and ML buffers

7512416

refactor: extract MD statistics state helpers

1a608fd

refactor: introduce MD test fixtures

a099505

docs: record MD pre-parallel refactor

6b545c7

optimize: add OpenMP to MD base loops and NEP interface

ca3dcb1

docs: record OpenMP NEP and MD base changes

19117e0

docs: add MD OpenMP planning and benchmark results

5f60d47

optimize: add OpenMP to Verlet thermalize, MSST rescale, and NHC particle_thermo velocity scaling

e7bfb0c

optimize: add OpenMP to DPMD interface - coord fill & force copy back

79cf84a

docs: add optimization records for Verlet, MSST, NHC, and DPMD OpenMP changes

6c8551f

merge: integrate refactor/parallel-optimize (DPMD+thermostat OpenMP) into md-factory

e999653

fix: move 'if' clause from '#pragma omp for' to '#pragma omp parallel'

56538dc

The 'if' clause is valid on '#pragma omp parallel' (and on the combined '#pragma omp parallel for'), but not on a standalone '#pragma omp for' inside an explicit parallel region. This caused a compile error: 'if' is not valid for '#pragma omp for'.
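A minimal sketch of the corrected clause placement is shown below. `scaled_copy_sketch` is a hypothetical function chosen for illustration; it is not taken from the commit.

```cpp
#include <cassert>
#include <vector>

// Rejected form, shown as a comment because compilers reject it when OpenMP
// is enabled:
//   #pragma omp parallel
//   {
//   #pragma omp for schedule(static) if (n >= 256)
//       for (...) ...
//   }

// Accepted form: the threshold sits on the enclosing parallel construct.
void scaled_copy_sketch(const std::vector<double>& src, double factor, std::vector<double>& dst)
{
    assert(src.size() == dst.size());
    const int n = static_cast<int>(src.size());
#pragma omp parallel if (n >= 256)
    {
#pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
        {
            dst[i] = src[i] * factor;
        }
    }
}
```

When the condition is false the region still executes, but with a team of one thread, so the worksharing loop inside it runs serially.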

mohanchen added the Refactor (Refactor ABACUS codes), MD & LAM (MD and Large Atomic Models), and project_learning labels and removed the Refactor (Refactor ABACUS codes) label on Jun 7, 2026

lijianing-sudo wants to merge 13 commits into deepmodeling:develop from Audrey-777:refactor/merge-openmp


4 participants
