Refactor/merge openmp #7446

Open
lijianing-sudo wants to merge 13 commits into deepmodeling:develop from Audrey-777:refactor/merge-openmp

@lijianing-sudo

PR: OpenMP Parallel Optimization for ABACUS MD Module and ML Potential Interfaces (NEP/DPMD/LJ)

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

  • A unit test is added for each new feature or bug fix.

Existing Unit Tests Pass:

  • MODULE_MD_LJ_pot (6 tests)
  • MODULE_MD_func (7 tests)
  • MODULE_MD_fire
  • MODULE_MD_verlet
  • MODULE_MD_nhc
  • MODULE_MD_msst
  • MODULE_MD_lgv

Test Infrastructure:

  • Introduced shared MD test fixtures (source/source_md/test/md_test_fixture.h) to eliminate duplicated SetUp/TearDown across 6 test files.

Microbenchmark Verification:

  • Independent C++ microbenchmarks were written for each optimized kernel (see Test/openmp_nep_basic_benchmark.cpp and companion scripts).
  • Each kernel was run on 2 million atoms, repeated 5 times, at 1/2/4/8/16 threads on an Intel Xeon Platinum 8163.
  • All per-atom write loops produce bitwise-identical results (max_abs_diff = 0).
  • Reduction loops show floating-point differences at the 1e-10 to 1e-8 level due to summation order changes — expected and acceptable for MD trajectories.
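The summation-order effect described above can be reproduced without any threads. The sketch below is illustrative only and is not taken from the PR's benchmark code: `serial_sum` and `chunked_sum` are hypothetical names, and the chunked version mimics one possible order in which a static-scheduled reduction could combine per-thread partial sums (OpenMP itself does not fix that order).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain left-to-right sum: the order a serial loop uses.
double serial_sum(const std::vector<double>& x)
{
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        total += x[i];
    }
    return total;
}

// Sum `nchunks` contiguous blocks separately, then add the partial sums.
// This imitates one order a parallel reduction with `nchunks` threads
// might produce; it is a model of the effect, not OpenMP itself.
double chunked_sum(const std::vector<double>& x, std::size_t nchunks)
{
    assert(nchunks > 0);
    const std::size_t n = x.size();
    double total = 0.0;
    for (std::size_t c = 0; c < nchunks; ++c)
    {
        const std::size_t begin = c * n / nchunks;
        const std::size_t end = (c + 1) * n / nchunks;
        double partial = 0.0;
        for (std::size_t i = begin; i < end; ++i)
        {
            partial += x[i];
        }
        total += partial;
    }
    return total;
}
```

Comparing `serial_sum(x)` with `chunked_sum(x, 8)` on a large array typically shows a difference at rounding level, while a loop that only writes `out[i]` per atom has no such reordering and stays bitwise identical.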

What's changed?

This PR integrates OpenMP parallelization from three feature branches (refactor/md-factory, refactor/parallel-optimize, refactor/md-openmp-remainder) into the ABACUS MD module and ML potential interfaces. 22 parallel loops or worksharing regions are added across 12 source files (+3934/−342 lines total).

1. MD Base Loops (source/source_md/)

| Function | File | Strategy |
|---|---|---|
| MD_base::update_pos() | md_base.cpp | #pragma omp parallel for schedule(static) |
| MD_base::update_vel() | md_base.cpp | #pragma omp parallel for schedule(static) |
| kinetic_energy() | md_func.cpp | reduction(+:ke) |
| force_virial() force copy | md_func.cpp | Parallel per-atom copy |
| temp_vector() | md_func.cpp | 9 scalar reductions instead of shared-matrix accumulation |
| rescale_vel() | md_func.cpp | schedule(static) |

All loops carry an OpenMP if clause, if (natom >= 256), so that small systems run serially and skip the parallel-region overhead.
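A minimal sketch of the two loop shapes listed in this section, assuming a hypothetical `Vec3` type and free functions rather than the actual `MD_base` members. It shows a per-atom write loop and a scalar reduction, both guarded by the 256-atom threshold; compiled without OpenMP, the pragmas are ignored and the code runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-atom vector type used in the MD module.
struct Vec3
{
    double x, y, z;
};

// Uniform threshold quoted in the PR description.
constexpr int kParallelThreshold = 256;

// Position update: iteration i writes only pos[i], so the loop is race-free.
void update_pos(std::vector<Vec3>& pos, const std::vector<Vec3>& vel, double dt)
{
    assert(pos.size() == vel.size());
    const int natom = static_cast<int>(pos.size());
#pragma omp parallel for schedule(static) if (natom >= kParallelThreshold)
    for (int i = 0; i < natom; ++i)
    {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}

// Kinetic energy: every thread accumulates a private copy of ke,
// and OpenMP combines the copies when the loop ends.
double kinetic_energy(const std::vector<Vec3>& vel, const std::vector<double>& mass)
{
    assert(vel.size() == mass.size());
    const int natom = static_cast<int>(vel.size());
    double ke = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : ke) if (natom >= kParallelThreshold)
    for (int i = 0; i < natom; ++i)
    {
        const Vec3& v = vel[i];
        ke += 0.5 * mass[i] * (v.x * v.x + v.y * v.y + v.z * v.z);
    }
    return ke;
}
```

Static scheduling fits here because every atom costs the same amount of work, so equal contiguous chunks are already balanced.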

2. NEP Interface (source/source_esolver/esolver_nep.cpp/.h)

  • Added atom_type_index / atom_local_index index caches for flat iat-based parallel loops.
  • Parallelized: coordinate buffer fill, per-atom energy reduction, force copy-back with unit conversion, and 9-component per-atom virial reduction.
  • NEP virial: reorganized from 9 separate full-array scans into a single per-atom scan with 9 scalar reductions — algorithmic + parallel gains combined (14.24× speedup at 8 threads).
  • nep.compute() external library call remains serial.
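The virial reorganization can be sketched as follows. This is an illustration under assumptions, not the code in esolver_nep.cpp: the function name `sum_virial` is hypothetical, and the per-atom virial buffer is assumed to be component-major (`buf[c * natom + i]` for component `c`). Instead of scanning the array once per component, a single loop over atoms feeds nine scalar reduction variables.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a per-atom virial buffer into a 3x3 total (row-major, 9 entries).
// Assumed layout: per_atom_virial[c * natom + i], c = 0..8.
std::array<double, 9> sum_virial(const std::vector<double>& per_atom_virial, int natom)
{
    assert(natom >= 0);
    const std::size_t n = static_cast<std::size_t>(natom);
    assert(per_atom_virial.size() == 9 * n);
    const double* p = per_atom_virial.data();

    double xx = 0.0, xy = 0.0, xz = 0.0;
    double yx = 0.0, yy = 0.0, yz = 0.0;
    double zx = 0.0, zy = 0.0, zz = 0.0;

    // One pass over atoms; each thread keeps nine private accumulators.
#pragma omp parallel for schedule(static) reduction(+ : xx, xy, xz, yx, yy, yz, zx, zy, zz) if (natom >= 256)
    for (int i = 0; i < natom; ++i)
    {
        const std::size_t k = static_cast<std::size_t>(i);
        xx += p[0 * n + k];
        xy += p[1 * n + k];
        xz += p[2 * n + k];
        yx += p[3 * n + k];
        yy += p[4 * n + k];
        yz += p[5 * n + k];
        zx += p[6 * n + k];
        zy += p[7 * n + k];
        zz += p[8 * n + k];
    }
    return std::array<double, 9>{{xx, xy, xz, yx, yy, yz, zx, zy, zz}};
}
```

Scalar reduction variables avoid both a shared 3×3 matrix (which would need synchronization) and nine separate parallel regions.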

3. DPMD Interface (source/source_esolver/esolver_dp.cpp/.h)

  • Added iat → (it, ia) index caches.
  • Parallelized: coordinate buffer fill and model force copy-back with unit conversion.
  • Introduced persistent member buffers (dp_cell, dp_coord, dp_model_force, dp_model_virial) to avoid repeated allocations.
  • dp.compute() external library call and 3×3 virial copy-back remain serial.
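A sketch of the iat → (it, ia) cache idea, under stated assumptions: `AtomIndexCache`, `build_index_cache`, and `fill_coords` are hypothetical names, positions are assumed to be stored per type as `tau[it][ia]`, and `scale` stands in for whatever unit conversion the interface applies. The cache turns the nested type/atom loops into one flat loop that OpenMP can split evenly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3
{
    double x, y, z;
};

// Hypothetical cache: for each flat atom index iat, the type index `it`
// and the index within that type `ia`.
struct AtomIndexCache
{
    std::vector<int> type_index;
    std::vector<int> local_index;
};

// Build the cache once; it only changes when the atom counts change.
AtomIndexCache build_index_cache(const std::vector<int>& atoms_per_type)
{
    AtomIndexCache cache;
    for (std::size_t it = 0; it < atoms_per_type.size(); ++it)
    {
        for (int ia = 0; ia < atoms_per_type[it]; ++ia)
        {
            cache.type_index.push_back(static_cast<int>(it));
            cache.local_index.push_back(ia);
        }
    }
    return cache;
}

// Fill a flat xyz buffer from per-type storage using one flat loop.
// `coord` is meant to be a long-lived buffer: after the first call,
// resize() to the same size does not reallocate.
void fill_coords(const std::vector<std::vector<Vec3>>& tau,
                 const AtomIndexCache& cache,
                 double scale,
                 std::vector<double>& coord)
{
    assert(cache.type_index.size() == cache.local_index.size());
    const int nat = static_cast<int>(cache.type_index.size());
    coord.resize(3 * static_cast<std::size_t>(nat));
#pragma omp parallel for schedule(static) if (nat >= 256)
    for (int iat = 0; iat < nat; ++iat)
    {
        const Vec3& r = tau[cache.type_index[iat]][cache.local_index[iat]];
        coord[3 * iat + 0] = r.x * scale;
        coord[3 * iat + 1] = r.y * scale;
        coord[3 * iat + 2] = r.z * scale;
    }
}
```

The force copy-back would run the same flat loop in the opposite direction, reading the model buffer and writing per-atom forces.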

4. Thermostat and Barostat (source/source_md/)

| Class | Method | File |
|---|---|---|
| Verlet | thermalize() velocity rescaling | verlet.cpp |
| MSST | rescale() shock-direction velocity scaling | msst.cpp |
| MSST | vel_sum() velocity norm reduction | msst.cpp |
| MSST | propagate_vel() per-atom velocity propagation | msst.cpp |
| NoseHoover | particle_thermo() final velocity scaling | nhchain.cpp |
| NoseHoover | vel_baro() barostat velocity update | nhchain.cpp |

Thermostat chain recurrence integration and cell dilation remain serial.

5. FIRE Algorithm (source/source_md/fire.cpp)

FIRE::check_fire() parallelized in three phases:

  1. Three-scalar reduction for P, sumforce, normvel
  2. Parallel velocity-force mixing
  3. Parallel velocity zeroing (in P <= 0 branch)

Scalar state updates (alpha, negative_count, dt) remain serial.
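The three phases can be sketched as below. This is a simplified illustration, not the code in fire.cpp: the function name `fire_mix` is hypothetical, velocities and forces are assumed to be flat arrays of length 3·natom, and the mixing step uses the standard FIRE formula v ← (1 − α)v + α|v|F/|F|, which the PR description does not spell out. The serial scalar updates (alpha, negative_count, dt) are left out.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns the power P = F . v so the caller can do the serial bookkeeping.
double fire_mix(std::vector<double>& vel, const std::vector<double>& force, double alpha)
{
    assert(vel.size() == force.size());
    const int n = static_cast<int>(vel.size());
    const int natom = n / 3;

    // Phase 1: three scalar reductions in one pass.
    double P = 0.0, sumforce = 0.0, normvel = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : P, sumforce, normvel) if (natom >= 256)
    for (int k = 0; k < n; ++k)
    {
        P += vel[k] * force[k];
        sumforce += force[k] * force[k];
        normvel += vel[k] * vel[k];
    }

    if (P > 0.0)
    {
        // Phase 2: mix velocity with the force direction.
        // P > 0 implies a nonzero force, so the division is safe.
        const double scale = alpha * std::sqrt(normvel) / std::sqrt(sumforce);
#pragma omp parallel for schedule(static) if (natom >= 256)
        for (int k = 0; k < n; ++k)
        {
            vel[k] = (1.0 - alpha) * vel[k] + scale * force[k];
        }
    }
    else
    {
        // Phase 3: moving uphill, so stop all atoms.
#pragma omp parallel for schedule(static) if (natom >= 256)
        for (int k = 0; k < n; ++k)
        {
            vel[k] = 0.0;
        }
    }
    return P;
}
```

Phases 2 and 3 depend on the reduced scalars from phase 1, which is why they are separate loops rather than one fused loop.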

6. LJ Interface (source/source_esolver/esolver_lj.cpp/.h)

  • Added global atom index cache.
  • Restructured nested type-iteration loops into flat iat-based loop.
  • schedule(dynamic, 32) to handle neighbor-count imbalance.
  • Thread-local potential and virial arrays with atomic (energy) and critical (virial) reduction at thread exit — no per-neighbor locks.
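The accumulation pattern in the last bullet can be sketched like this. It is a toy version under assumptions: `lj_energy_virial` and `LJResult` are hypothetical names, the neighbor search is a brute-force loop over all atoms rather than a neighbor list, periodic images are ignored, and the virial is returned without the sign or volume normalization the real code may apply. Each pair is visited from both atoms, so energy and virial carry a factor of 0.5.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct LJResult
{
    double potential;
    std::array<double, 9> virial;
};

LJResult lj_energy_virial(const std::vector<std::array<double, 3>>& pos,
                          std::vector<std::array<double, 3>>& force,
                          double epsilon, double sigma, double rcut)
{
    assert(force.size() == pos.size());
    const int nat = static_cast<int>(pos.size());
    double potential = 0.0;
    std::array<double, 9> virial{};

#pragma omp parallel if (nat >= 256)
    {
        // Thread-local accumulators: no locking inside the pair loop.
        double local_pot = 0.0;
        std::array<double, 9> local_vir{};

        // Dynamic chunks of 32 atoms even out differing neighbor counts.
#pragma omp for schedule(dynamic, 32)
        for (int i = 0; i < nat; ++i)
        {
            std::array<double, 3> fi{};
            for (int j = 0; j < nat; ++j)
            {
                if (j == i) { continue; }
                const double d[3] = {pos[i][0] - pos[j][0],
                                     pos[i][1] - pos[j][1],
                                     pos[i][2] - pos[j][2]};
                const double r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
                if (r2 >= rcut * rcut) { continue; }
                const double sr2 = sigma * sigma / r2;
                const double sr6 = sr2 * sr2 * sr2;
                local_pot += 0.5 * 4.0 * epsilon * (sr6 * sr6 - sr6);
                // Force on atom i is fscale * d.
                const double fscale = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2;
                for (int a = 0; a < 3; ++a)
                {
                    fi[a] += fscale * d[a];
                    for (int b = 0; b < 3; ++b)
                    {
                        local_vir[3 * a + b] += 0.5 * d[a] * fscale * d[b];
                    }
                }
            }
            force[i] = fi; // only thread owning i writes force[i]
        }

        // One synchronized merge per thread, after its share of the loop.
#pragma omp atomic
        potential += local_pot;
#pragma omp critical
        {
            for (int c = 0; c < 9; ++c) { virial[c] += local_vir[c]; }
        }
    }
    return LJResult{potential, virial};
}
```

The cost of `atomic` and `critical` is paid once per thread rather than once per neighbor pair, which is the point of the thread-local arrays.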

7. Code Quality Refactors

  • Extracted MD statistics helpers: calc_kinetic_state() / calc_stress_state() (md_func.h, md_statistics.h).
  • MD runner factory function: new/delete → std::unique_ptr (run_md.cpp).
  • Shared test fixture base classes to reduce duplication across 6 MD test files.
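For the factory change, a minimal sketch of the ownership pattern, assuming a hypothetical `MDRunner` hierarchy and a hypothetical `make_md_runner` function; the real class names and selection logic in run_md.cpp may differ. Returning `std::unique_ptr` means the runner is destroyed automatically on every exit path, so no matching `delete` is needed.

```cpp
#include <memory>
#include <string>

// Hypothetical minimal runner hierarchy, for illustration only.
class MDRunner
{
  public:
    virtual ~MDRunner() = default;
    virtual std::string name() const = 0;
};

class VerletRunner : public MDRunner
{
  public:
    std::string name() const override { return "verlet"; }
};

class FireRunner : public MDRunner
{
  public:
    std::string name() const override { return "fire"; }
};

// Factory: the caller receives sole ownership of the runner.
// An unknown type yields nullptr; the caller decides how to report it.
std::unique_ptr<MDRunner> make_md_runner(const std::string& md_type)
{
    if (md_type == "verlet")
    {
        return std::unique_ptr<MDRunner>(new VerletRunner());
    }
    if (md_type == "fire")
    {
        return std::unique_ptr<MDRunner>(new FireRunner());
    }
    return nullptr;
}
```

The sketch uses `std::unique_ptr<T>(new T())` rather than `std::make_unique` so that it also compiles as C++11.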

Performance Summary (Microbenchmark, 8 threads, 2M atoms, Xeon Platinum 8163)

| Category | Kernel | Speedup | Efficiency |
|---|---|---|---|
| MD Base | update_pos | 7.36× | 92.0% |
| MD Base | update_vel | 7.19× | 89.9% |
| MD Base | kinetic_energy | 6.98× | 87.2% |
| MD Base | temp_vector | 7.38× | 92.2% |
| NEP | coord_fill | 7.15× | 89.4% |
| NEP | energy_sum | 8.03× | 100.4% |
| NEP | force_fill | 6.99× | 87.4% |
| NEP | virial_sum | 14.24× | 177.9%* |
| DPMD | coord_fill | 5.50× | 68.8% |
| DPMD | force_copy | 7.19× | 89.9% |
| Verlet | thermalize | 7.80× | 97.5% |
| MSST | rescale | 7.28× | 91.1% |
| MSST | propagate_vel | 7.18× | 89.7% |
| NHC | particle_thermo | 7.21× | 90.2% |
| FIRE | check_fire (mix) | 7.64× | 95.6% |
| LJ | runner core loop | 6.96× | 87.0% |

*NEP virial 14.24× includes loop reorganization benefits beyond pure 8-thread scaling.
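The Efficiency column appears to follow the usual definition of parallel efficiency, speedup divided by thread count; this reading is an assumption, though it matches the table to within rounding. Worked for update_pos at p = 8 threads:

```latex
S_p = \frac{T_1}{T_p}, \qquad
E_p = \frac{S_p}{p}
\quad\Rightarrow\quad
E_8(\texttt{update\_pos}) = \frac{7.36}{8} = 0.920 = 92.0\%
```

Under this definition any value above 100% means the parallel version also does less work per atom than the serial baseline, which is the case the footnote describes for the NEP virial.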

Known Limitations & Future Work

  • End-to-end tests: NEP and DPMD optimizations lack end-to-end tests with real external model libraries (__NEP, deepmd).
  • LJ parallel path: existing LJ unit tests use 4 atoms (< 256 threshold), covering only the serial path.
  • MPI + OpenMP hybrid: microbenchmarks are single-process; oversubscription risks under mixed MPI/OpenMP have not been characterized.
  • Thread threshold: nat >= 256 is an empirical uniform threshold; per-kernel tuning (64/128/256/512) is recommended.
  • LJ scheduling: schedule(dynamic, 32) vs static and optimal chunk size have not been systematically benchmarked across different neighbor distributions.
  • Microbenchmark results ≠ end-to-end wall-time: excluded overheads include MPI communication, neighbor-list construction, file I/O, and external model computation.

Any changes of core modules? (ignore if not applicable)

The MD ESolver interface layer (esolver_nep.cpp, esolver_dp.cpp, esolver_lj.cpp) is modified to add index caches and parallel worksharing constructs. No changes to the ESolver base class virtual function signatures. All external library calls (nep.compute(), dp.compute()) remain serial and their calling convention is unchanged.

Audrey-777 and others added 13 commits May 30, 2026 21:04
…T, NHC, FIRE, LJ)

Cover 6 remaining hot-path per-atom loops that were not parallelized
in the prior merge-openmp branch:

- md_func.cpp: rescale_vel() — velocity rescaling factor apply
- msst.cpp: vel_sum() — norm2 reduction, propagate_vel() — exp-based
  velocity propagation (highest compute density among uncovered loops)
- nhchain.cpp: vel_baro() — NPT per-atom velocity scaling
- fire.cpp: check_fire() — triple reduction + velocity mixing + zero
- esolver_lj.cpp: runner() — N² neighbor pair computation with
  schedule(dynamic) for load balancing, per-thread virial accumulation

All optimizations use schedule(static) with nat>=256 threshold
(LJ uses dynamic,32 for neighbor-count imbalance).
No data dependencies changed — all loops are per-atom independent.
No conflict with prior merge-openmp branch.
The 'if' clause is only valid on '#pragma omp parallel', not on
'#pragma omp for' when used inside an explicit parallel region.
This caused a compile error: 'if' is not valid for '#pragma omp for'.
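The clause placement described in this commit message can be illustrated with two equivalent, valid forms. The function names are hypothetical and the loop body is a placeholder; what the sketch shows is where the `if` clause is allowed to sit.

```cpp
#include <vector>

// Valid: on the combined construct, `if` applies to the parallel part.
void scale_combined(std::vector<double>& v, double factor)
{
    const int n = static_cast<int>(v.size());
#pragma omp parallel for schedule(static) if (n >= 256)
    for (int i = 0; i < n; ++i)
    {
        v[i] *= factor;
    }
}

// Valid: with an explicit region, `if` goes on `parallel`;
// the inner `for` carries only worksharing clauses such as schedule.
void scale_region(std::vector<double>& v, double factor)
{
    const int n = static_cast<int>(v.size());
#pragma omp parallel if (n >= 256)
    {
#pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
        {
            v[i] *= factor;
        }
    }
}

// Invalid, rejected at compile time when OpenMP is enabled:
//   #pragma omp for schedule(static) if (n >= 256)
```

When the `if` condition is false, the region still executes, but with a team of one thread, so both functions give the same result as a serial loop.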
@mohanchen added the Refactor (Refactor ABACUS codes), MD & LAM (MD and Large Atomic Models), and project_learning labels and removed the Refactor (Refactor ABACUS codes) label on Jun 7, 2026
Labels

MD & LAM (MD and Large Atomic Models), project_learning


4 participants