feat(pt_expt): multi-rank LAMMPS support for GNN models (DPA3 / DPA2 / spin)#5430

Open
wanghan-iapcm wants to merge 29 commits into deepmodeling:master from wanghan-iapcm:feat-pt-expt-gnn-mpi

@wanghan-iapcm wanghan-iapcm commented May 1, 2026

Summary

Adds multi-rank LAMMPS support to the pt_expt (.pt2 / AOTInductor) backend for GNN models — DPA3 (repflows) and DPA2 (repformers), plus spin GNN — at parity with the existing pt (.pth / torch.jit) backend. Without this, multi-rank LAMMPS users with GNN .pt2 models are limited to single-rank runs, and the C++ side crashes on the first ghost exchange when given a non-use_loc_mapping GNN .pt2.

The mechanism mirrors the pt backend's per-layer ghost-atom MPI exchange: each repflow/repformer block exchanges g1 across ranks via border_op so each rank sees up-to-date ghost embeddings. To survive torch.export + AOTInductor packaging, border_op is wrapped as an opaque torch.library.custom_op (deepmd_export::border_op) with a separate border_op_backward C++ symbol for autograd.
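A torch-free toy of the exchange semantics may help fix intuition (two "ranks" simulated in one process with plain lists; the real op is a C++ MPI kernel, and every name below is illustrative, not the actual border_op signature):

```python
# Toy model of the per-layer ghost exchange: each rank owns local
# embedding rows followed by ghost slots. The forward pass fills each
# rank's ghost slots from the peer rank's local rows (send_list); the
# backward pass accumulates ghost-slot gradients back into the owning
# rank's local rows — the additive, symmetric exchange described for
# border_op_backward above.

def border_forward(g1, send_list, recv_offset):
    """Copy peer-local embeddings into this rank's ghost slots."""
    for dst_rank, src_rank in ((0, 1), (1, 0)):
        for k, src_idx in enumerate(send_list[src_rank]):
            g1[dst_rank][recv_offset[dst_rank] + k] = g1[src_rank][src_idx]
    return g1

def border_backward(grad, send_list, recv_offset):
    """Accumulate ghost-slot gradients into the owning rank's local rows."""
    for dst_rank, src_rank in ((0, 1), (1, 0)):
        for k, src_idx in enumerate(send_list[src_rank]):
            grad[src_rank][src_idx] += grad[dst_rank][recv_offset[dst_rank] + k]
    return grad
```

The send-side indices are always local (below recv_offset), so forward writes into ghost slots never alias forward reads.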

Design

  • Phase 1 — dpmodel plumbing: thread comm_dict: dict | None = None through make_model, base_atomic_model, descriptor wrappers (dpa1/dpa2/dpa3/hybrid/se_*), and the repflows/repformers blocks. Lift the per-layer node_ebd_ext construction into a _exchange_ghosts method (default array-api impl ignores comm_dict).
  • Phase 2 — pt_expt opaque op + block overrides:
    • deepmd::border_op_backward C++ op (additive accumulation into local atom slots — symmetric exchange used by autograd backward).
    • deepmd_export::border_op Python custom_op wrapper with register_fake and register_autograd so the op is opaque to torch.export.
    • pt_expt/descriptor/repflows.py and repformers.py block subclasses with _exchange_ghosts overrides that call the opaque op (with the spin real/virtual split + concat_switch_virtual when comm_dict["has_spin"] is set).
  • Phase 3 — dual-artifact AOTInductor export: when _has_message_passing(model) is true, compile two artifacts into the .pt2 ZIP:
    • forward_lower_no_comm.pt2 — current behavior (single-rank, mapping-based gather).
    • forward_lower_with_comm.pt2 — adds positional comm tensors (send_list, send_proc, recv_proc, send_num, recv_num, communicator, nlocal, nghost) to the trace input list, plus has_spin=tensor([1]) baked in for spin GNN.
    • Metadata: has_message_passing + has_comm_artifact flags so the C++ loader picks the right artifact.
  • Phase 4 — C++ dispatch: DeepPotPTExpt::compute and DeepSpinPTExpt::compute route to the with-comm artifact when lmp_list.nswap > 0. commPTExpt adds build_comm_tensors_positional and build_comm_tensors_positional_with_virtual_atoms (the latter remaps sendlists via fwd_map when NULL atoms drop out of the model's view).
  • Phase 5 — LAMMPS tests: end-to-end multi-rank tests (DPA3, DPA2, spin DPA3) covering the basic case + the structural edge cases (NULL-type atoms, empty subdomain, all-NULL rank, isolated NULL, nlist rebuild, N>2 decomposition).
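The hook pattern at the center of Phases 1-2 can be sketched torch-free (class and method names follow the description above, but the bodies are illustrative stubs, not the real implementation):

```python
# Sketch of the _exchange_ghosts override point: the dpmodel block
# keeps the original mapping-based gather as the default, and a
# pt_expt subclass overrides the hook to route through the opaque
# comm op whenever comm_dict is supplied.

class RepBlockBase:
    def _exchange_ghosts(self, node_ebd, mapping, comm_dict=None):
        # default array-api impl: ignore comm_dict, gather via mapping
        return [node_ebd[i] for i in mapping]

    def call(self, node_ebd, mapping, comm_dict=None):
        # each layer re-extends node_ebd so ghost embeddings stay fresh
        return self._exchange_ghosts(node_ebd, mapping, comm_dict)

class RepBlockWithComm(RepBlockBase):
    def _exchange_ghosts(self, node_ebd, mapping, comm_dict=None):
        if comm_dict is None:
            return super()._exchange_ghosts(node_ebd, mapping)
        # stand-in for the opaque deepmd_export::border_op call
        return comm_dict["exchange"](node_ebd)
```

Because the default ignores comm_dict, single-rank behaviour of every other backend stays byte-identical.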

Coverage matrix

| | DPA3 | DPA2 | spin DPA3 |
| --- | --- | --- | --- |
| basic mpi-2 | ✓ | ✓ | ✓ |
| empty subdomain (rank 1 empty) | ✓ | — | ✓ |
| all-NULL rank | ✓ | — | — |
| isolated NULL | ✓ | — | — |
| NULL atoms straddling boundary | ✓ | — | ✓ |
| NULL across nlist rebuild | ✓ | — | — |
| N>2 (2x2x1, 4x1x1, 2x2x2) | ✓ | — | — |
| cached mapping_tensor (ago>0) | ✓ | — | — |

Tests compare mpi-N vs same-archive mpi-1 for force / force_mag / virial / energy (atol 1e-8); no hardcoded numerical references.
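The comparison can be sketched as follows (a hypothetical helper, not the actual test code; the real tests gather id-sorted per-atom arrays from LAMMPS before comparing):

```python
import math

# Parity check between an mpirun -n N run and the same-archive
# mpirun -n 1 reference: every compared quantity must agree to an
# absolute tolerance, with no hardcoded numerical references.
def assert_mpi_parity(multi, single, atol=1e-8):
    for key in ("energy", "force", "virial"):
        for a, b in zip(multi[key], single[key]):
            assert math.isclose(a, b, rel_tol=0.0, abs_tol=atol), key
```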

Plus a unit test (test_has_message_passing.py) pinning the _has_message_passing schema-drift contract.

Co-existence with #5407

This branch was rebased onto upstream master after PR #5407 (.pt2 perf) merged. The merge required:

  • forward_common_lower_exportable_with_comm (spin and non-spin variants) now applies the _pad_nlist_for_export + need_sorted_nlist_for_lower=True workarounds matching #5407's regular variant — keeps the with-comm trace's nnei axis dynamic.
  • CUDA realize_opcount_threshold=0 workaround applied around BOTH artifact compiles.
  • do_atomic_virial=True is used for all multi-rank fixtures to avoid AOTI compile-time changes from #5407's default =False. (Multi-rank with do_atomic_virial=False is a known coverage gap — see Limitations.)

Known limitations

  • Hybrid + GNN + multi-rank: _has_message_passing doesn't recurse into hybrid children → no with-comm artifact produced for hybrid-of-GNN. Multi-rank LAMMPS with such a model would fall back to single-artifact and crash on first ghost exchange. Out of scope for this PR; if a user hits this, they get the same crash they had before.
  • do_atomic_virial=False under multi-rank: production default. All multi-rank tests use =True (matches #5407 fixture conventions). Not exercised end-to-end yet.
  • CUDA: the TORCH_LIBRARY_IMPL(deepmd_export, CUDA, m) registration exists and the symbol is callable, but no GPU end-to-end test runs in this PR (CPU-only build environment locally).
  • DPA2 (repformers) edge cases: only basic mpi-2 is tested for DPA2. NULL / empty-subdomain / nlist-rebuild covered for DPA3 only — these paths are descriptor-agnostic by construction but not exercised end-to-end for DPA2.
  • Spin DPA3 edge cases: empty subdomain and NULL straddling boundary covered. all-NULL-rank, isolated-NULL, nlist-rebuild not.
  • Multiple spin-active types (use_spin=[True, True, ...]): only [True, False] tested.
  • Frame batching nb>1 with comm_dict: _exchange_ghosts uses .squeeze(0) / .unsqueeze(0) (mirrors pt). LAMMPS feeds nb=1 — fine in practice; breaks if reused outside LAMMPS.
  • Float16/bfloat16: comm_dict path was developed in float64 only.
  • Cross-backend (jax/paddle/numpy) comm_dict=None neutrality: the dpmodel default _exchange_ghosts is the original code lifted into a method, behaviorally equivalent. Not separately re-verified by running the consistent-test suite with comm_dict=None explicitly threaded through.
  • Old .pt2 files lacking has_comm_artifact metadata: C++ defaults to single-artifact when key is missing. Not negative-tested.
  • AOTI compile failure cleanup: if the with-comm compile fails after the regular artifact is written, the resulting .pt2 is half-formed. RAII / SIGKILL leakage of TempFile in /tmp also pre-exists.

A full catalog of touched-but-untested paths is maintained at memory/gnn_mpi_untested_paths.md (local to the author).

Test plan

  • pytest source/tests/pt_expt/ — eager-parity, export round-trip, schema-drift unit tests
  • pytest source/tests/consistent/descriptor/test_dpa3.py source/tests/consistent/descriptor/test_dpa2.py — non-regression for the single-rank path
  • pytest source/lmp/tests/test_lammps_dpa3_pt2.py source/lmp/tests/test_lammps_dpa2_pt2.py source/lmp/tests/test_lammps_spin_dpa3_pt2.py — multi-rank end-to-end (requires mpirun + mpi4py)
  • ctest -R PtExpt — C++ tests (single-rank, since multi-rank is exercised via LAMMPS)
  • CI runs across CPU + CUDA matrix

Summary by CodeRabbit

  • New Features

    • Optional comms/MPI metadata support across inference paths enabling ghost-exchange; dual “with-comm” export/run artifact and tracing entrypoints for message-passing GNNs and spin models, plus improved export/tracing support for comm-enabled ops.
  • Bug Fixes

    • Fixed unsafe neighbor-list memory handling.
  • Tests

    • Added extensive MPI parity, export-with-comm, repflow/repformer parallel-path, spin integration tests and CLI MPI runner scripts.

Han Wang added 23 commits April 25, 2026 23:55
Lifts the per-layer node_ebd_ext gather inside DescrptBlockRepflows.call
and DescrptBlockRepformers.call into a new _exchange_ghosts(...) method
so subclasses can override it. Default behaviour is byte-identical to
before for non-parallel inference (comm_dict is None).

Threads an optional comm_dict kwarg through:
  - make_model.call_common_lower / forward_common_atomic
  - {base,dp,linear,pairtab}_atomic_model
  - dpa1/dpa2/dpa3/hybrid/se_* descriptors
  - repflows/repformers blocks

Non-GNN descriptors accept and ignore comm_dict (noqa-marked unused).
DPA2 routes around its pre-block gather when comm_dict is supplied so
the repformers' per-layer override drives ghost exchange instead.

This is the dpmodel-side groundwork for pt_expt multi-rank LAMMPS
support; default behaviour unchanged.
Refactors Border::backward into a free function take/return interface
(positional comm tensors + grad_g1, returns grad_in) and registers it as
``torch.ops.deepmd.border_op_backward``.  The autograd Function's
backward delegates to the new symbol so existing pt-backend behaviour is
unchanged; the new symbol is what pt_expt's opaque op wrapper
(``deepmd_export::border_op``) dispatches to from its
``register_autograd`` callback.

The standalone op is needed because the ``custom_op`` API requires the
backward to be expressible as a registered op (it cannot reference the
autograd Function directly), and AOTInductor must serialise the call
into the compiled .pt2.
…errides

Three changes that together let GNN models drive MPI ghost-atom
exchange through the pt_expt forward pass:

1. ``deepmd/pt_expt/utils/comm.py`` registers a NEW torch op
   ``deepmd_export::border_op`` via ``torch.library.custom_op``. The
   wrapper:
     - Forwards to the existing ``torch.ops.deepmd.border_op`` (clones
       the in-place output to satisfy custom_op aliasing rules).
     - Has a ``register_fake`` impl returning ``empty_like(g1)`` so
       ``torch.export`` / ``make_fx`` can trace through it.
     - Has a ``register_autograd`` callback that dispatches to
       ``torch.ops.deepmd.border_op_backward`` (the standalone op
       added in the previous commit).

   The existing ``deepmd::border_op`` is registered as
   ``CompositeImplicitAutograd`` and therefore tries to decompose
   into primitive aten ops during export — which fails because the
   C++ kernel calls ``data_ptr()`` on FakeTensors. The new opaque
   wrapper sidesteps this by being registered as an opaque op that
   ``torch.export`` records as a single black-box call.

2. ``deepmd/pt_expt/descriptor/{repflows,repformers}.py`` add pt_expt
   subclasses of ``DescrptBlockRepflows`` / ``DescrptBlockRepformers``
   that override ``_exchange_ghosts``. When ``comm_dict is None`` the
   override defers to the dpmodel default; otherwise it pads
   ``node_ebd`` to nall and calls the opaque wrapper. Includes the
   spin-aware ``has_spin`` path (split real/virtual + concat_switch
   _virtual) ported from pt's repflows.

3. ``forward_common_lower_exportable_with_comm`` is added on the
   pt_expt CM (and SpinModel) classes. Same as the existing
   ``forward_common_lower_exportable`` but accepts the 8 comm tensors
   as additional positional inputs and reconstructs ``comm_dict``
   inside the traced function (spin variant injects ``has_spin`` so
   the override takes the spin branch). This becomes the new traced
   entry point for the with-comm AOTI artifact (next commit).

Existing pt_expt descriptor wrappers (dpa1, dpa2, se_*) and the
``CM.forward_common_atomic`` override get an extra ``comm_dict`` kwarg
that is plumbed straight through to the dpmodel call — no behavioural
change for ``comm_dict is None``.

Phase 0 de-risk experiment (scratch/derisk_border_op.py) verified that
the opaque wrapper survives ``torch.export.export`` +
``aoti_compile_and_package`` + ``aoti_load_package`` round-trips for
both forward and backward.
Three small follow-ups uncovered by the spin export-with-comm test:

1. ``dpmodel/model/spin_model.py::call_common_lower`` was missing
   the ``comm_dict`` kwarg added by the Phase 1 plumbing. Added it
   and forward to ``backbone_model.call_common_lower`` so spin GNN
   models can drive parallel inference.

2. ``pt_expt/descriptor/repflows.py`` raises a clear ``RuntimeError``
   when ``use_loc_mapping=True`` is combined with a non-None
   ``comm_dict``. The local-mapping codepath skips per-layer ghost
   exchange entirely so combining it with ``comm_dict`` would
   silently drop the parallel behaviour.

3. ``pt_expt/utils/comm.py`` ``_check_underlying_ops_loaded`` is
   called on first wrapper invocation; surfaces a clearer error
   when libdeepmd_op_pt.so is unloaded ("rebuild the pt custom-op
   library") rather than the cryptic "torch.ops.deepmd has no
   attribute 'border_op'" from torch's dispatcher.
Adds a ``with_comm_dict: bool`` flag to ``_trace_and_export`` and
``_make_sample_inputs``/``_build_dynamic_shapes``.  When True, the
trace runs through ``forward_common_lower_exportable_with_comm``
(which threads 8 comm tensors as positional inputs and reconstructs
``comm_dict`` inside the traced function), and the resulting export
accepts comm tensors as additional positional inputs.

Constraints enforced for the with-comm trace:
  * ``nframes=1`` static (the pt-parity override uses
    squeeze(0)/unsqueeze(0) which only works for nb=1; LAMMPS always
    drives one frame anyway).  Avoids the regular variant's
    ``nframes=2`` collision-avoidance bump (irrelevant when
    nframes is static — duck-sizing only unifies dynamic dims).
  * ``nswap`` static at the trace value.  ``nswap`` is fixed once at
    LAMMPS init (it depends on the processor grid, which doesn't
    change at runtime), so the dim doesn't need to be dynamic.

For GNN models, ``_deserialize_to_file_pt2`` now compiles BOTH the
regular and with-comm artifacts and packs the latter inside the .pt2
ZIP at ``extra/forward_lower_with_comm.pt2``.  Metadata gains:
  * ``has_message_passing`` (true if the descriptor has GNN block).
  * ``has_comm_artifact`` (true iff a with-comm artifact was packed).
Old .pt2 files lack these keys; the C++ loader (Phase 4) must default
to False when the field is missing.

The non-GNN path is unchanged: a single regular artifact + the
existing metadata layout, so existing .pt2 readers keep working.
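The backward-compatibility rule above reduces to a missing-key default; a minimal Python mirror of what the (C++) loader must do might look like this (function name is illustrative):

```python
import json

# Old .pt2 archives carry no has_comm_artifact key, so a reader must
# treat a missing field as False and fall back to the single regular
# artifact rather than erroring out.
def wants_comm_artifact(metadata_json: str) -> bool:
    meta = json.loads(metadata_json)
    return bool(meta.get("has_comm_artifact", False))
```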
Five new test files covering the GNN MPI plumbing:

* test_repflow_parallel.py / test_repformer_parallel.py
  Eager parity for DescrptBlockRepflows / DescrptBlockRepformers
  override.  Single-rank self-exchange via ctypes pointer-array
  sendlist; verifies override output equals dpmodel default for both
  with-mapping and none-mapping variants.  Includes a structural
  test for the spin branch and a guard test that
  use_loc_mapping=True + comm_dict raises RuntimeError.

* test_border_op_backward.py
  Direct unit tests for torch.ops.deepmd.border_op_backward (float32
  + float64) and the autograd path through deepmd_export::border_op.

* test_export_with_comm.py
  Phase 3 round-trip for the dual-artifact .pt2 layout: GNN models
  produce both regular and forward_lower_with_comm artifacts; both
  load via aoti_load_package; outputs match for self-exchange.
  Plus three coverage tests for previously-untested branches:
  zero-nghost clamp in _make_comm_sample_inputs, hybrid-with-GNN
  detection in _has_message_passing, .pte with-comm trace round-trip.

* test_spin_export_with_comm.py
  Spin model trace machinery (smoke test on se_e2_a) and end-to-end
  eager value parity for spin DPA3 models running through
  SpinModel.call_common_lower with comm_dict.
Without TORCH_LIBRARIES on the test binary, the
``__has_include(<torch/csrc/inductor/aoti_package/model_package_loader.h>)``
check in DeepPotPTExpt.h evaluates to false and the test files compile
with BUILD_PT_EXPT=0, causing every pt_expt test case to silently
GTEST_SKIP("PyTorch support is not enabled").  The bug was masked by
ctest reporting a green run with all skips counted as success.

Adding ``target_link_libraries(runUnitTests_cc "${TORCH_LIBRARIES}")``
under the existing ``ENABLE_PYTORCH`` branch makes the AOTI header
visible to the test compilation. After this fix, the 148 pt_expt
tests actually run instead of being silently skipped.
Phase 4 of the GNN MPI plumbing.  When a .pt2 archive carries a
nested forward_lower_with_comm.pt2 (added by Phase 3 for GNN models),
the C++ inference path now optionally extracts and loads it as a
second AOTInductor module.  Each compute() call dispatches between
the regular and with-comm artifacts based on lmp_list.nswap: LAMMPS
sets nswap=0 in single-rank mode and >0 in multi-rank, so single-rank
inference keeps using the regular artifact (mapping-tensor gather)
and multi-rank routes to the with-comm artifact (MPI ghost exchange).
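The dispatch rule is a single comparison; an illustrative Python rendering (the real dispatch lives in the C++ compute() path, and the artifact names follow the layout described above):

```python
# nswap == 0 in single-rank LAMMPS runs, > 0 in multi-rank, so the
# choice of AOTInductor artifact reduces to one branch. Falls back to
# the regular artifact when no with-comm artifact was packed.
def pick_artifact(nswap: int, has_comm_artifact: bool) -> str:
    if nswap > 0 and has_comm_artifact:
        return "extra/forward_lower_with_comm.pt2"
    return "forward_lower_no_comm.pt2"
```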

Three additions:

1. commonPTExpt.h adds:
   - TempFile RAII handle for the extracted nested artifact (mkstemp,
     unlinked at destruction).
   - TempFile::from_zip_entry reads a ZIP entry from the outer .pt2
     and writes it to a temp file (atomic, 0600).
   - build_comm_tensors_positional packs the 8 comm tensors in
     canonical positional order (send_list, send_proc, recv_proc,
     send_num, recv_num, communicator, nlocal, nghost) for the
     with-comm AOTI module input vector.

2. DeepPotPTExpt:
   - Reads has_comm_artifact from metadata.json (defaults false for
     old .pt2 files lacking the field).
   - When true, extracts extra/forward_lower_with_comm.pt2 to a
     TempFile and loads it as with_comm_loader.
   - run_model_with_comm appends the 8 comm tensors to the base
     inputs and dispatches to with_comm_loader->run.
   - compute() chooses regular vs with-comm based on nswap.

3. DeepSpinPTExpt:
   - Same pattern; the Phase 3 export injects has_spin=1 into the
     traced graph comm_dict, so the C++ side passes the same 8 comm
     tensors as the non-spin case.  nlocal/nghost carry the real-atom
     counts (the spin override halves them internally to get the
     atom-doubled counts).

All 148 existing pt_expt C++ tests pass — the with-comm path is
gated behind nswap > 0, so single-rank tests don't exercise it (that
coverage arrives with the Phase 5 multi-rank LAMMPS tests).
Phase 5 — final integration after Phases 1-4 land the dpmodel
plumbing, opaque op wrappers, two-mode AOTI export and C++ dispatch.
Three pieces had to fall into place to make multi-rank LAMMPS
actually run a GNN .pt2:

1. Move deepmd_export op schema declarations to C++.
   torch.library.custom_op only registers the op in the Python
   process, but a LAMMPS run loads the .pt2 in pure C++ (no Python
   interpreter). Add TORCH_LIBRARY_FRAGMENT(deepmd_export, m) +
   TORCH_LIBRARY_IMPL blocks under explicit CPU/CUDA dispatch
   keys in source/op/pt/comm.cc; the C++ impls clone the underlying
   deepmd::* op outputs to satisfy AOTI no-aliasing. Python comm.py
   now layers register_fake + register_autograd on top of the
   C++-defined ops instead of defining new ones.

2. Call deepmd::load_op_library at DeepPot/SpinPTExpt init so
   libdeepmd_op_pt.so loads before AOTIModelPackageLoader; the LAMMPS
   plugin doesn't pre-load it. Without this, a multi-rank GNN .pt2
   aborts at pair_style time with a missing-schema error.

3. Gate dual-artifact production on use_loc_mapping=False.
   _has_message_passing now walks into the GNN block to inspect
   use_loc_mapping; if True, only the regular artifact is produced
   (the override would raise on parallel mode anyway). gen_dpa3.py
   produces a second deeppot_dpa3_mpi.pt2 with use_loc_mapping=False
   so the new mpirun test has a real dual-artifact .pt2 to load.

Plus the multi-rank test itself:
- run_mpi_pair_deepmd_dpa3_pt2.py: subprocess driver. Uses
  PyLammps + processors 2 1 1 so nswap > 0 on every rank,
  forcing the C++ side to dispatch to the with-comm artifact.
  Forces are gathered via lammps.lmp.gather_atoms (rank-local
  atoms[i] doesn't see other ranks); pe via lammps.eval on rank 0.
- test_pair_deepmd_mpi_dpa3 in test_lammps_dpa3_pt2.py: invokes
  the driver under mpirun -n 2, asserts energy + per-atom forces
  match the single-rank reference within atol=1e-8.

Also: register_fake for the backward op too. Without it, make_fx
tracing autograd.grad inside forward_common_lower_exportable hits
the same FakeTensor data_ptr error we solved for forward in Phase 0.

All 31 pt_expt LAMMPS tests pass.
…gaps

Three fixes targeting the limitations from the previous Phase 5 commit:

1. NULL-type atoms (build_comm_dict_with_virtual_atoms equivalent).
   When ``select_real_atoms_coord`` filters atoms with atype < 0,
   the LAMMPS-supplied sendlist still indexes the original atom
   array. ``DeepPotPTExpt::compute`` (and Spin) now check
   ``has_null_atoms = (nall_real < nall)`` and route to the new
   ``build_comm_tensors_positional_with_virtual_atoms`` helper in
   commonPTExpt.h, which calls ``remap_comm_sendlist`` to translate
   indices through ``fwd_map`` (mirrors what
   ``commonPT.h::build_comm_dict_with_virtual_atoms`` does for the
   torch.jit pt-backend). Untested numerically (no test fixture
   produces NULL-type atoms in multi-rank); code path is structurally
   identical to the validated pt-backend equivalent.

2. nlist-rebuild test (test_pair_deepmd_mpi_dpa3_nlist_rebuild).
   Runs 50 MD steps under mpirun -n 2 with neigh_modify every=10,
   forcing >=5 neighbor-list rebuilds. Validates the with-comm
   dispatch path stays consistent across rebuilds (the comm tensors
   are reconstructed when ``ago == 0`` triggers). Asserts forces
   stay finite and bounded; no exact-value comparison since round-
   off accumulates over the trajectory and cross-rank ordering can
   shift the LSBs.

3. Spin multi-rank dispatch wiring (DeepSpinPTExpt::compute).
   Same has_null_atoms branch as DeepPotPTExpt. Code path
   structurally identical to the validated DeepPotPTExpt path; no
   spin-specific multi-rank test yet (would need a spin DPA3 .pt2
   with use_loc_mapping=False to exercise it end-to-end).

Note: virial check via LAMMPS compute pressure NULL virial caused
PyLammps multi-rank deadlock; deferred to a follow-up. Forces ARE
the autograd output of energy through the with-comm graph, so
force parity already validates the with-comm backward path.

All 26 pt_expt LAMMPS tests pass (including the new multi-rank
ones); 9 model_devi_pt2 tests confirm DeepPotModelDevi delegates
correctly through the dispatch.
- run_mpi_pair_deepmd_dpa3_pt2.py: gather atom ids alongside forces and
  sort by id explicitly. Output ordering is now robust to subdomain
  layout, empty subdomains, or future LAMMPS gather_atoms changes.
  Add atom_modify map yes so single-rank dispatch on the dual-artifact
  .pt2 (uses mapping) works; expose --processors so the runner can
  produce a same-archive single-rank reference.

- test_pair_deepmd_mpi_dpa3_nlist_rebuild: replace the finite/bounded
  sanity check with a value comparison against a single-rank reference
  of the same trajectory (mpirun -n 1, processors "1 1 1"). 25 MD steps
  cross two nlist rebuilds, atol=1e-6 forces / rel=1e-8 energy. This
  catches a wrong-but-finite force from a dispatch bug that the
  previous assertion would have missed.
…with virial

- common.cc: NeighborListData::copy_from_nlist used &ilist[0] /
  &jlist[ii][0] for the memcpy destination, which is OOB on an empty
  vector (libstdc++ debug-mode assertion) and undefined behaviour in
  general. Switch to .data() and skip the copy when the count is zero.
  Surfaced by the new empty-subdomain MPI test where rank 1 owns
  nloc=0 atoms; the same latent bug also applied to atoms with no
  neighbours.

- run_mpi_pair_deepmd_dpa3_pt2.py: also gather the per-atom virial
  via ``compute centroid/stress/atom NULL pair`` and
  ``lmp.gather("c_virial", 1, 9)``. Output rows are now (3 force) +
  (9 virial) per atom, id-sorted.

- test_lammps_dpa3_pt2.py:
  * test_pair_deepmd_mpi_dpa3 now asserts virial against expected_v
    (with the same column permutation as test_pair_deepmd_virial),
    closing the previous "virial multi-rank deferred" gap.
  * test_pair_deepmd_mpi_dpa3_nlist_rebuild now also compares virial
    between the multi-rank and single-rank reference runs.
  * New test_pair_deepmd_mpi_dpa3_empty_subdomain: 30 x 13 x 13 box
    with all atoms in x in [0.25, 12.83]; under processors "2 1 1"
    rank 1 owns zero local atoms. Compares forces + virial + energy
    against a same-archive single-rank reference.
Brings in PR deepmodeling#5407 (perf(pt2): optimize .pt2 C++ inference path) and
unrelated upstream changes. Conflict resolution:

- source/api_cc/src/common.cc: kept .data() form (strictly safer than
  upstream's &vec[0]) for the empty-subdomain / zero-neighbour guards
  added on this branch. Both branches added the same guard concept.

- source/api_cc/src/commonPTExpt.h: deleted the obsolete
  buildTypeSortedNlist (PR 5407 replaced it with createNlistTensor in
  commonPT.h, which the C++ side now uses). Kept this branch's TempFile
  RAII + build_comm_tensors_positional[_with_virtual_atoms] for the
  Phase 4 with-comm dispatch.

- source/api_cc/tests/CMakeLists.txt: kept the TORCH_LIBRARIES link
  with the explanatory comment from this branch (BUILD_PT_EXPT
  detection requires torch headers in the test binary).

- deepmd/pt_expt/utils/serialization.py: combined both signatures.
  _trace_and_export now takes both with_comm_dict (this branch) and
  do_atomic_virial (PR 5407). _build_dynamic_shapes accepts both
  with_comm_dict and model_nnei. Apply PR 5407's CUDA
  realize_opcount_threshold=0 workaround around BOTH artifact compiles.
  Switch the .pte path's _trace_and_export call to kwargs so the new
  with_comm_dict positional doesn't shift the bind.

- deepmd/pt_expt/model/{make_model,spin_model}.py: align
  forward_common_lower_exportable_with_comm with PR 5407's regular
  variant — apply _pad_nlist_for_export and need_sorted_nlist_for_lower=
  True so the with-comm trace keeps its nnei axis dynamic. Without these,
  the with-comm trace generated a "nnei <= sum(sel)" upper-bound guard
  that conflicted with PR 5407's "nnei >= sum(sel)" lower bound.

- source/tests/infer/gen_dpa3.py: regular pt2 keeps PR 5407's
  do_atomic_virial=True; the multi-rank pt2 also gets do_atomic_virial=
  True so the new MPI virial test has reference values to compare against.

Verified post-merge:
- All 33 LAMMPS PT2 tests pass (including 9 DPA3 tests with 3 MPI tests).
- All 3 ctest targets (lib + cc + c) pass.
- source/tests/pt_expt/ pytest suite passes.
- source/lmp/tests/test_lammps_dpa2_pt2.py (NEW): runs DPA2 .pt2 under
  mpirun -n 2 with the with-comm artifact and asserts pe + per-atom
  forces + per-atom virial match a same-archive single-rank reference.
  Closes the recorded gap "DPA2 multi-rank dispatch never exercised
  end-to-end" (gnn_mpi_untested_paths.md). The runner script
  (run_mpi_pair_deepmd_dpa3_pt2.py) is descriptor-agnostic so no new
  driver is needed.

- source/tests/infer/gen_dpa2.py: drop dead config_mpi block accidentally
  added during planning. DPA2's repformer has no use_loc_mapping knob
  (unlike DPA3), so the single deeppot_dpa2.pt2 already carries the
  dual-artifact layout — _has_message_passing returns True for any
  DPA2 model.

- source/tests/pt_expt/conftest.py: ``import deepmd.pt`` at conftest
  evaluation time so libdeepmd_op_pt.so is loaded and
  ``deepmd_export::{border_op, border_op_backward}`` are registered
  before any pt_expt test module imports ``deepmd.pt_expt.utils``
  (which transitively imports ``comm.py`` and its
  ``_check_underlying_ops_loaded()`` runtime check). Previously this
  worked only when the test was collected alongside earlier modules
  that happened to import deepmd.pt first; running the spin/export
  tests in isolation crashed at collection.

- source/tests/pt_expt/model/test_spin_export_with_comm.py: fix
  pre-existing test data bug — model has sel=[20,20,20] (sum=60) but
  the trace test was passing nlist with width 6, tripping the
  _format_nlist post-condition assertion. Now uses the correct
  sum(sel) width. Surfaced once the conftest fix above made the test
  reliably runnable in isolation.
…it test

Closes two gaps from the GNN-MPI untested-paths catalog:

- ``test_pair_deepmd_mpi_dpa3_decomposition`` (parametrized): runs DPA3
  .pt2 under three additional processor grids — ``4@2x2x1`` (2D),
  ``4@4x1x1`` (1D-deep chain), and ``8@2x2x2`` (3D). All three must
  match the gen_dpa3.py reference for energy / per-atom force /
  per-atom virial within atol=1e-8. The 2x2x2 split leaves several
  subdomains empty, so this also exercises the
  ``copy_from_nlist`` empty-subdomain guard in a 3D layout.

- ``source/tests/pt_expt/utils/test_has_message_passing.py``: pins
  ``_has_message_passing`` against schema drift. The detection chain
  (``model.atomic_model.descriptor`` ->
  ``descriptor.has_message_passing()`` -> ``block.use_loc_mapping``)
  is brittle to attribute renames in the dpmodel descriptor layer; a
  silent regression would disable the with-comm artifact and break
  multi-rank LAMMPS for GNN users with no test failure to flag it.
  The test asserts the documented value for 5 baseline configs
  (se_e2_a, dpa1, dpa3 use_loc_mapping=True/False, dpa2) plus two
  stub-model defensive cases.

The runner helper ``_run_mpi_subprocess`` gains an optional
``processors`` arg so the new parametrized test can dictate the
LAMMPS ``processors`` grid; existing tests keep their previous
defaults (``2 1 1`` for nprocs=2, ``1 1 1`` for nprocs=1).
Closes the recorded gap "NULL-type atoms under mpirun" — until now
``build_comm_tensors_positional_with_virtual_atoms`` and the
``fwd_map``-based comm-tensor remap had never been exercised in a
multi-rank LAMMPS run despite being reachable any time a user runs
a model on a system with atom types outside its ``type_map``.

Fixture (``data_dpa3_pt2_null_type.lmp``): the canonical 6 real
atoms (types 1, 2) plus 2 LAMMPS type-3 atoms placed at (5.5, 6, 6)
and (7.5, 7, 7) — straddling the x=6.5 rank boundary under
``processors 2 1 1`` and within rcut (=6) of multiple real atoms.
The pair_coeff ``* * O H NULL`` maps LAMMPS type 3 to deepmd
atype=-1, so ``select_real_atoms_coord`` filters them and
``DeepPotPTExpt::compute`` takes the
``build_comm_tensors_positional_with_virtual_atoms`` branch.

The NULL atoms appear in cross-rank sendlists because both sit in
the boundary's rcut window, so the remap must:
  - drop the -1 fwd_map slots from each swap's sendlist;
  - decrement sendnum/recvnum by the number dropped;
  - translate surviving indices into real-atom space.
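Those three remap rules can be sketched without torch (helper name mirrors remap_comm_sendlist, but the implementation here is illustrative; assume fwd_map[i] is the real-atom index of original atom i, or -1 for a NULL atom):

```python
# Per-swap sendlist remap: drop slots whose fwd_map entry is -1
# (NULL atoms), shrink the send count by the number dropped, and
# translate surviving indices into real-atom space.
def remap_sendlist(sendlist, fwd_map):
    new_lists, new_counts = [], []
    for swap in sendlist:
        kept = [fwd_map[i] for i in swap if fwd_map[i] >= 0]
        new_lists.append(kept)
        new_counts.append(len(kept))  # sendnum decremented by drops
    return new_lists, new_counts
```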

Test asserts:
  - forces on the 6 real atoms match the no-NULL baseline
    ``expected_f`` (atol 1e-8);
  - NULL atom forces are zero (atol 1e-12) — deepmd is the only
    pair_style and skips them;
  - total potential energy matches ``expected_e``;
  - per-atom virial on real atoms matches ``expected_v``.

Runner script (``run_mpi_pair_deepmd_dpa3_pt2.py``) gains two
optional flags: ``--pair-coeff`` (override the default ``"* *"``)
and ``--mass3`` (mass for a third LAMMPS atom type). Existing tests
keep their previous defaults unchanged.

The ``_run_mpi_subprocess`` helper gains a ``runner_args`` kwarg
to forward arbitrary flags to the runner; existing call sites are
unaffected.
…ist-rebuild)

Closes three more entries from the GNN-MPI untested-paths catalog,
all variations on the multi-rank NULL-type filter path:

- ``test_pair_deepmd_mpi_dpa3_null_isolated``: large box (30 x 13 x 13)
  puts a NULL atom at x=7.5, in rank 0's subdomain interior. With
  rcut=6 the boundary rcut-windows on rank 0 are x in [0, 6] (PBC of
  the right wall via x=30) and [9, 15] (rank 1's left wall); atoms
  in (6, 9) are local but never appear in any sendlist. Exercises
  ``has_null_atoms == True`` with a no-op remap (sendlists contain
  no NULL entries to drop) — complementary to
  ``test_pair_deepmd_mpi_dpa3_null_type`` which exercises the
  remap-with-NULLs case.

- ``test_pair_deepmd_mpi_dpa3_all_null_rank``: rank 1 owns ONLY NULL
  atoms (intersection of empty-subdomain and NULL-type paths). After
  ``select_real_atoms_coord`` rank 1 has nloc_real=0, so the
  ``copy_from_nlist`` empty-subdomain guard must fire AND the
  ``_with_virtual_atoms`` remap must handle a sendlist whose entire
  local section was NULL.

- ``test_pair_deepmd_mpi_dpa3_null_type_nlist_rebuild``: rebuilds the
  nlist 3 times in 3 MD steps using ``neigh_modify every 1``. NULL
  atoms drift across the boundary so sendlist composition changes
  per rebuild — validates that the remap re-runs correctly on every
  ago=0 trigger and stays consistent with the cached
  ``mapping_tensor`` / ``firstneigh_tensor`` in
  ``DeepPotPTExpt::compute``.

Also speeds up ``test_pair_deepmd_mpi_dpa3_nlist_rebuild`` (existing
non-NULL test) by switching from ``every 10`` + 25 steps to
``every 1`` + 3 steps — same 3 rebuilds, ~1/3 the wall time.

Runner script gains a ``--neigh-every`` flag (default 10). All
three new tests compare mpirun -n 2 against an mpirun -n 1
reference on the same fixture.
Two related changes that strengthen the NULL-type rebuild test and
trim the decomposition variant set:

- ``test_pair_deepmd_mpi_dpa3_null_type_nlist_rebuild`` now sets a
  high initial velocity (v_x = 2000 A/ps) on LAMMPS type-3 atoms via
  the runner's new ``--null-vx`` flag and a per-type ``velocity``
  command. With timestep 0.0005 ps each NULL atom moves 1.0 A per
  step — enough to physically cross the x=6.5 rank boundary in
  step 1 (NULL @ 5.5 -> 6.5 -> 7.5 -> 8.5). NULL atoms therefore
  migrate ranks across rebuilds, exercising the case where a NULL's
  fwd_map index moves between the local-section and ghost-section
  of per-rank sendlists.

  Real atoms keep v=0 so their dynamics are stable; the deepmd model
  never sees NULL atoms (filtered by ``select_real_atoms_coord``)
  so unphysical NULL velocity is harmless. mpi-2 vs mpi-1 reference
  match within atol=1e-6 / rel=1e-8.
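The step-1 boundary crossing follows directly from the velocity and timestep; a quick arithmetic check:

```python
v_x = 2000.0        # A/ps, set via --null-vx
dt = 0.0005         # ps, the LAMMPS timestep
dx = v_x * dt       # displacement per step: 1.0 A

boundary = 6.5      # x position of the rank 0 / rank 1 split
xs = [5.5 + n * dx for n in range(4)]   # NULL atom trajectory
# xs traces 5.5 -> 6.5 -> 7.5 -> 8.5: the atom reaches the boundary at
# step 1 and is clearly inside rank 1's domain by step 2.
```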

- ``test_pair_deepmd_mpi_dpa3_decomposition``: drop the ``[4-2 2 1]``
  variant. Its 2D coverage is fully subsumed by ``[8-2 2 2]``
  (which is 3D, so 2D face exchange is a strict subset). The two
  remaining variants — ``[4-4 1 1]`` for 1D-deep sendlist chains
  and ``[8-2 2 2]`` for 3D borders — are complementary and not
  subsumable. Saves ~5.5s of suite wall time.

Runner script gains a ``--null-vx`` flag (no-op when not passed,
so existing tests are unaffected).
Strengthens ``test_pair_deepmd_mpi_dpa3_null_type_nlist_rebuild`` so
the rebuilds see non-trivial sendlist composition changes:

- NULL atoms now move in OPPOSITE directions via the new
  ``--null-vx-split`` flag. NULL id=7 (at x=5.5) gets v_x=-2000 A/ps
  -> drifts left (and via PBC into rank 1's far ghost region). NULL
  id=8 (at x=7.5) gets v_x=+2000 A/ps -> drifts right (deeper into
  rank 1's domain). The +/- split means each rebuild sees one NULL
  entering rank 0's sendlist while the other leaves it.

- Real atoms get thermal velocities at T=10000 K via the new
  ``--real-temp`` flag (LAMMPS ``velocity realgroup create T seed``).
  Each real atom gets a different random direction, so the sendlist
  composition is also perturbed by real-atom motion (small but
  detectable under ``every 1`` rebuilds).

NULL atoms still don't contribute to the deepmd model (filtered by
``select_real_atoms_coord``), so their unphysical velocity is
harmless. Real-atom thermal motion at T=10000 K corresponds to
RMS speed ~3-9 A/ps (mass-weighted) -> ~0.005-0.015 A motion per
step; small enough that NVE stays stable but enough to perturb
sendlists.

Both new flags are no-ops when not passed; existing tests are
unaffected.
Extends ``test_pair_deepmd_mpi_dpa3_empty_subdomain`` to run 5 MD
steps with ``neigh_modify every 100`` instead of a single
``lammps.run(0)``. This forces:

  step 0  -> ago=0 (full rebuild; mapping_tensor + firstneigh_tensor
             populated for the first time on the empty-subdomain rank)
  step 1  -> ago=1 (cache HIT — mapping_tensor and firstneigh_tensor
             reused)
  step 2  -> ago=2 (cache hit)
  step 3  -> ago=3 (cache hit)
  step 4  -> ago=4 (cache hit)
  step 5  -> ago=5 (cache hit)
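The cache rule being exercised can be sketched as a hypothetical Python mirror of the C++ branch (the real logic lives in `DeepPotPTExpt::compute`):

```python
def use_cache(ago: int) -> bool:
    # Sketch: mapping_tensor / firstneigh_tensor are rebuilt only when
    # LAMMPS reports ago == 0, i.e. a freshly rebuilt neighbor list;
    # every other step reuses the cached tensors.
    return ago != 0

# Five MD steps after the initial build: only step 0 rebuilds.
decisions = [use_cache(ago) for ago in range(6)]
# decisions == [False, True, True, True, True, True]
```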

Closes the catalog gap "Empty subdomain under PR 5407's
mapping_tensor cache". Previously only step 0 was tested, which is
always ago=0; the cache-hit branch in DeepPotPTExpt::compute on a
rank with nloc=0 was unexercised.

Compares mpi-2 vs mpi-1 trajectory with the same atol=1e-6 / rel=1e-8
tolerances as the other rebuild tests.
Adds the smallest reachable test that exercises the full spin GNN
multi-rank dispatch chain (Tier-1 #2 in gnn_mpi_untested_paths).

- gen_spin.py: also produce deeppot_dpa3_spin_mpi.pt2 (DPA3 +
  use_loc_mapping=False + use_spin=[True, False]) so a dual-artifact
  spin GNN .pt2 exists for testing.
- run_mpi_pair_deepmd_spin_dpa3_pt2.py: MPI runner that drives the
  spin pair_style and gathers force / force_mag / virial across ranks.
  fm goes via 'compute property/atom fmx fmy fmz' since the legacy
  extract/gather_atoms registry doesn't expose 'fm'.
- test_lammps_spin_dpa3_pt2.py: mpirun -n 2 vs same-archive mpirun -n 1
  reference for energy / force / force_mag / virial (atol 1e-8). A
  divergence is necessarily a problem in DeepSpinPTExpt multi-rank
  dispatch, the spin branch of _exchange_ghosts, the C++
  deepmd_export::border_op invocation, or the comm-tensor builder.
- _build_dynamic_shapes: bump nall_dim min from 1 to 4 when has_spin.
  Without this, torch.export raises CONSTRAINT_VIOLATION on the
  pre-doubling nall axis when tracing GNN spin with with_comm_dict
  (the suggested fix in the error matches min=4).
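The nall_dim bump in the last bullet can be sketched as follows (assumed helper name; the real code feeds this minimum into torch.export's `Dim` for the pre-doubling nall axis):

```python
def nall_dim_min(has_spin: bool) -> int:
    # Sketch of the constraint fix: spin models double nall with virtual
    # atoms, and torch.export's solver rejects a pre-doubling axis whose
    # minimum is too small (CONSTRAINT_VIOLATION), so clamp the Dim min.
    return 4 if has_spin else 1
```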

Eager parity (test_spin_dpa3_eager_parity) and trace-only validation
already existed; this PR closes the gap by adding AOTI compile +
LAMMPS load + real MPI exchange.

Known limitations:
- Single configuration tested (4 atoms, 2 ranks, type_map ["Ni", "O"],
  use_spin=[True, False]).  No NULL-type, empty-subdomain, nlist-rebuild
  variants for spin yet -- the non-spin DPA3 path covers those code
  branches and the spin override differs only in the real/virtual
  split, which the new test exercises.
- do_atomic_virial=True only (matches all current multi-rank tests;
  Tier-1 #3 still open).
- N=2 only; no decomposition/N>2 spin variant.
- CPU only.
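The real/virtual split mentioned above hinges on `concat_switch_virtual`; a sketch of its assumed layout semantics, on plain lists for clarity:

```python
def concat_switch_virtual(ext_real, ext_virtual, nloc_real):
    """Sketch of the spin layout switch (assumed semantics): both inputs
    are laid out [local..., ghost...]; the output interleaves them as
    [real_loc, virtual_loc, real_ghost, virtual_ghost], the layout the
    spin descriptor expects after the ghost exchange."""
    return (ext_real[:nloc_real] + ext_virtual[:nloc_real]
            + ext_real[nloc_real:] + ext_virtual[nloc_real:])

real = ["r0", "r1", "rg"]        # 2 local real atoms + 1 real ghost
virt = ["v0", "v1", "vg"]        # matching virtual (spin) atoms
merged = concat_switch_virtual(real, virt, nloc_real=2)
# merged == ["r0", "r1", "v0", "v1", "rg", "vg"]
```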
Resolves the two spin-specific gaps left open by the previous commit:

- test_pair_deepmd_mpi_dpa3_spin_empty_subdomain: elongated 30 A box +
  processors '2 1 1' leaves rank 1 with nloc=0. Exercises the
  copy_from_nlist empty-rank guard for the spin path (the with-comm
  artifact still runs on rank 1 with nloc_real=0).

- test_pair_deepmd_mpi_dpa3_spin_null_type: 2 NULL (LAMMPS type-3,
  deepmd atype=-1) atoms straddling the x=6.5 rank boundary, within
  rcut of real atoms on both sides. Goes through DeepSpinPTExpt with
  nall_real < nall, triggering the has_null_atoms branch that calls
  build_comm_tensors_positional_with_virtual_atoms (fwd_map-based
  sendlist remap) for spin. Asserts NULL atoms get zero forces from
  the deepmd model and real-atom values match the mpi-1 reference.

Both compare mpi-2 vs same-archive mpi-1 (atol 1e-8) so any divergence
is necessarily in the multi-rank dispatch, not in tracing precision.

Runner generalised with --pair-coeff and --mass3 flags (mirrors the
non-spin DPA3 runner).
gen_spin.py rebuilds the yaml from the deterministic config + seed in
_build_dpa3_mpi_yaml when missing, and the multi-rank test compares
mpi-2 vs same-archive mpi-1 (no hardcoded numerical references depend
on these weights). Unlike deeppot_dpa_spin.yaml -- whose committed
weights are pinned because C++ tests hardcode reference values against
them -- nothing requires deeppot_dpa3_spin_mpi.yaml to be checkpointed
in git.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 695b7368a8


# Register opaque deepmd_export::border_op wrapper (used by GNN MPI
# parallel inference; see comm.py module docstring).
# Register fake tensor implementations for custom tabulate ops
from deepmd.pt_expt.utils import comm # noqa: F401

P1: Defer comm registration until op library is loaded

Importing deepmd.pt_expt.utils now unconditionally imports comm, which immediately executes _check_underlying_ops_loaded() and raises if deepmd_export::border_op is not already registered. In practice this makes pt_expt import order-sensitive: any code path that imports pt_expt modules before deepmd.pt (including normal pt_expt entrypoints) can fail at import time even when not using multi-rank GNN comm ops. This is a regression from prior lazy/optional registration behavior and can break startup for single-rank workflows.


@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough

Threads an optional comm_dict (MPI communication metadata) through model, descriptor, and atomic APIs; factors ghost-exchange into exchange hooks; implements pt_expt overrides that call an opaque border-op for multi-rank exchanges; adds export of a dual-artifact .pt2 (with-comm); and integrates C++ runtime support and MPI LAMMPS tests.

Changes

API threading: comm_dict through model/atomic/descriptors

| Layer / File(s) | Summary |
| --- | --- |
| Public signatures: `deepmd/dpmodel/model/make_model.py`, `deepmd/dpmodel/model/spin_model.py`, `deepmd/dpmodel/atomic_model/*`, `deepmd/pt_expt/model/*`, `deepmd/jax/...` | Added optional `comm_dict: dict \| None = None` parameter. |
| Descriptor call signatures: `deepmd/dpmodel/descriptor/{dpa1,dpa2,dpa3,hybrid,se_*.py}`, `deepmd/pt_expt/descriptor/{dpa1,dpa2,dpa3,hybrid,se_*.py}` | Extended call(...) APIs to accept comm_dict and pass it into children where applicable. |

Repflow / Repformer refactor (dpmodel)

| Layer / File(s) | Summary |
| --- | --- |
| Data/shape precompute: `deepmd/dpmodel/descriptor/{repformers.py,repflows.py}` | Compute nall/nloc; conditionally tile mapping only if present. |
| Core implementation: `deepmd/dpmodel/descriptor/{repformers.py,repflows.py}` | Introduce `_exchange_ghosts` helper; delegate per-layer ghost-extension to it and thread comm_dict and size params. |
| API surface: `deepmd/dpmodel/descriptor/{repformers.py,repflows.py}` | Expose comm_dict in descriptor call signatures. |

pt_expt MPI implementations, utils and export

| Layer / File(s) | Summary |
| --- | --- |
| pt_expt overrides: `deepmd/pt_expt/descriptor/{repformers.py,repflows.py}` | Added pt_expt subclasses overriding `_exchange_ghosts` to call `torch.ops.deepmd_export.border_op` when comm_dict is present; handle padding and virtual-atom splitting/concat for spin; register the dpmodel→pt_expt mapping. |
| Comm utils / tracing support: `deepmd/pt_expt/utils/comm.py`, `deepmd/pt_expt/utils/__init__.py` | New module registers fake border-op implementations for export/make_fx, wires autograd for the opaque op, and validates underlying op registration on import. Side-effect import added to package init. |
| Serialization/export: `deepmd/pt_expt/utils/serialization.py` | Detects message passing, offers a with_comm_dict tracing path, constructs comm-sample inputs, and emits a second AOTInductor artifact `forward_lower_with_comm.pt2` with metadata `has_comm_artifact`. |
| Tracing entrypoints: `deepmd/pt_expt/model/make_model.py`, `deepmd/pt_expt/model/spin_model.py` | Add `forward_common_lower_exportable_with_comm` tracing entrypoints that accept comm tensors and assemble comm_dict for export-time tracing. |

C++ runtime, opaque op registration, and helpers

| Layer / File(s) | Summary |
| --- | --- |
| Opaque op & backward refactor: `source/op/pt/comm.cc` | Refactor backward dispatch into `border_op_backward` and register `deepmd_export::border_op` / `border_op_backward` wrappers with explicit CPU/CUDA dispatch keys for AOT/export. |
| TempFile & comm builders: `source/api_cc/src/commonPTExpt.h` | Add a TempFile helper and functions to build the 8 positional comm tensors (including the virtual-atom remap variant). |
| PTExpt loader / run_with_comm: `source/api_cc/include/{DeepPotPTExpt.h,DeepSpinPTExpt.h}`, `source/api_cc/src/{DeepPotPTExpt.cc,DeepSpinPTExpt.cc}` | Add `has_comm_artifact` state, `with_comm_loader`, and `run_model_with_comm(...)` to load/execute the with-comm artifact; dispatch to it from `compute()` when running multi-rank and the artifact is present. |
| Utilities / bugfix: `source/api_cc/src/common.cc` | Fix NeighborListData memcpy to use `.data()` instead of `&vec[0]` to avoid UB on empty vectors. |

Export generators, examples and tests

| Layer / File(s) | Summary |
| --- | --- |
| Exporter yml / generators: `source/tests/infer/{gen_dpa2.py,gen_dpa3.py,gen_spin.py}` | Add generation of MPI-configured model artifacts / yaml and clarifying comments about dual-artifact behavior. |
| Tests, runners and infra: `source/lmp/tests/*`, `source/tests/pt_expt/*`, `source/tests/pt_expt/model/*`, `source/tests/pt_expt/utils/*` | Add MPI runner scripts; end-to-end MPI parity tests (DPA2/DPA3/spin); pt_expt export-with-comm tests; repflow/repformer parity tests; border-op backward/autograd tests; message-passing detection tests; update CMake tests and conftest eager import to ensure op registration. |
| Build config: `source/api_cc/tests/CMakeLists.txt`, `source/tests/pt_expt/conftest.py` | Link `${TORCH_LIBRARIES}` in tests when PyTorch is enabled; eagerly import `deepmd.pt` in conftest to register ops early. |

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Model as EnergyModel
    participant Descriptor
    participant GhostExch as RepFlow/Repformer
    participant BorderOp as border_op (C++)
    
    Client->>Model: call_common_lower(inputs, comm_dict)
    activate Model
    Model->>Descriptor: call(inputs, comm_dict)
    activate Descriptor
    Descriptor->>GhostExch: _exchange_ghosts(g1, mapping, comm_dict, nall, nloc)
    activate GhostExch
    alt comm_dict is None
        GhostExch->>GhostExch: g1_ext = gather(g1, mapping)
        GhostExch-->>Descriptor: g1_ext
    else comm_dict present
        GhostExch->>BorderOp: border_op(g1, send_list, send_proc, recv_proc, send_num, recv_num, communicator, nlocal, nghost)
        activate BorderOp
        BorderOp-->>GhostExch: g1_ext
        deactivate BorderOp
        GhostExch-->>Descriptor: g1_ext
    end
    deactivate GhostExch
    Descriptor-->>Model: descriptors
    deactivate Descriptor
    Model-->>Client: energy/forces/virial
    deactivate Model
sequenceDiagram
    actor Exporter
    participant Serializer
    participant HasMP as has_message_passing?
    participant Tracer as torch.export
    participant Archive as .pt2 ZIP
    
    Exporter->>Serializer: serialize_model()
    activate Serializer
    Serializer->>HasMP: _has_message_passing(model)
    HasMP-->>Serializer: boolean
    Serializer->>Tracer: _trace_and_export(with_comm_dict=False)
    Tracer-->>Archive: base/forward_lower.pt2
    alt has_message_passing == true
        Serializer->>Tracer: _trace_and_export(with_comm_dict=True)
        Tracer-->>Archive: extra/forward_lower_with_comm.pt2
        Serializer->>Archive: metadata.has_comm_artifact = true
    end
    Serializer-->>Exporter: .pt2
    deactivate Serializer

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Suggested labels

new feature

Suggested reviewers

  • iProzd
  • anyangml
  • njzjz
🚥 Pre-merge checks (4 passed, 1 warning)

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.78%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding multi-rank LAMMPS support for GNN models (DPA3, DPA2, spin) in the pt_expt backend. |
| Linked Issues check | ✅ Passed | Check skipped; no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped; no linked issues were found for this pull request. |



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
deepmd/pt_expt/descriptor/dpa2.py (1)

228-249: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject comm_dict on the compressed path until ghost exchange is supported.

Lines 248-249 say the compressed path is local-only, but the method still accepts comm_dict and proceeds. In multi-rank inference that silently skips ghost exchange and can feed stale g1_ext into the repformers.

Suggested fix
     def call(
         self,
         coord_ext: torch.Tensor,
         atype_ext: torch.Tensor,
         nlist: torch.Tensor,
         mapping: torch.Tensor | None = None,
         fparam: torch.Tensor | None = None,
         comm_dict: dict | None = None,
     ) -> Any:
+        if self.compress and comm_dict is not None:
+            raise NotImplementedError(
+                "Compressed DPA2 descriptors do not support comm_dict yet."
+            )
         if not self.compress:
             return DescrptDPA2DP.call.__wrapped__(
                 self,
                 coord_ext,
                 atype_ext,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/descriptor/dpa2.py` around lines 228 - 249, In the call
method, when self.compress is True you must reject any provided comm_dict to
avoid silent multi-rank misuse; add a guard at the start of the compressed-path
branch in call (the method that currently returns self._call_compressed(...))
that checks if comm_dict is not None and raises a clear exception (e.g.,
ValueError) with a message like "Compressed path is local-only; comm_dict/ghost
exchange not supported" before calling _call_compressed; keep the rest of the
behavior unchanged and reference the call method and _call_compressed when
making this change.
source/op/pt/comm.cc (1)

174-193: ⚠️ Potential issue | 🔴 Critical

Return exactly one gradient per forward input.

The forward() method (line 34+) accepts 9 tensor inputs, but the backward() method (lines 190-193) returns 10 gradient tensors. torch::autograd::Function enforces this arity contract, so the backward pass will fail at runtime when autograd executes this operation.

Suggested fix
-    return {torch::Tensor(), torch::Tensor(), torch::Tensor(),
-            torch::Tensor(), torch::Tensor(), d_in,
-            torch::Tensor(), torch::Tensor(), torch::Tensor(),
-            torch::Tensor()};
+    return {torch::Tensor(), torch::Tensor(), torch::Tensor(),
+            torch::Tensor(), torch::Tensor(), d_in,
+            torch::Tensor(), torch::Tensor(), torch::Tensor()};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/op/pt/comm.cc` around lines 174 - 193, The backward() currently
returns 10 gradient tensors but forward() accepts 9 inputs, breaking autograd's
arity contract; update backward() (the torch::autograd::Function::backward
implementation) to return exactly nine torch::Tensor() entries matching
forward's inputs, keeping d_in (the result of border_op_backward_dispatch)
positioned to correspond to the same forward input it gradients; use
ctx->get_saved_variables() and variable_list semantics to produce a
variable_list of length 9 (e.g., replace the current 10-entry return with a
9-entry list containing torch::Tensor() placeholders and d_in in the correct
slot).
deepmd/pt_expt/model/make_model.py (1)

273-323: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate comm_dict through the Hessian recompute path.

forward_common_atomic() now uses comm_dict for the main atomic_model.forward_common_atomic(...) call, but the r_hessian branch still rebuilds the graph through _cal_hessian_ext(...), which re-invokes atomic_model.forward_common_atomic(...) without comm_dict. Any Hessian-enabled output on a message-passing model will therefore be differentiated through the wrong graph.

Possible fix
 def _cal_hessian_ext(
     model: Any,
     kk: str,
     vdef: OutputVariableDef,
     extended_coord: torch.Tensor,
     extended_atype: torch.Tensor,
     nlist: torch.Tensor,
     mapping: torch.Tensor | None,
     fparam: torch.Tensor | None,
     aparam: torch.Tensor | None,
+    comm_dict: dict | None = None,
     create_graph: bool = False,
 ) -> torch.Tensor:
 ...
             wrapper = _WrapperForwardEnergy(
                 model,
                 kk,
                 ci,
                 nall,
                 extended_atype[ii],
                 nlist[ii],
                 mapping[ii] if mapping is not None else None,
                 fparam[ii] if fparam is not None else None,
                 aparam[ii] if aparam is not None else None,
+                comm_dict,
             )
 class _WrapperForwardEnergy:
     def __init__(
         self,
         model: Any,
         kk: str,
         ci: int,
         nall: int,
         atype: torch.Tensor,
         nlist: torch.Tensor,
         mapping: torch.Tensor | None,
         fparam: torch.Tensor | None,
         aparam: torch.Tensor | None,
+        comm_dict: dict | None,
     ) -> None:
 ...
+        self.comm_dict = comm_dict

     def __call__(self, coord_flat: torch.Tensor) -> torch.Tensor:
 ...
         atomic_ret = self.model.atomic_model.forward_common_atomic(
             cc_3d,
             self.atype.unsqueeze(0),
             self.nlist.unsqueeze(0),
             mapping=self.mapping.unsqueeze(0) if self.mapping is not None else None,
             fparam=self.fparam.unsqueeze(0) if self.fparam is not None else None,
             aparam=self.aparam.unsqueeze(0) if self.aparam is not None else None,
+            comm_dict=self.comm_dict,
         )
                     model_ret[kk_hess] = _cal_hessian_ext(
                         self,
                         kk,
                         vdef,
                         extended_coord,
                         extended_atype,
                         nlist,
                         mapping,
                         fparam,
                         aparam,
+                        comm_dict,
                         create_graph=self.training,
                     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/model/make_model.py` around lines 273 - 323, The Hessian
recompute path in forward_common_atomic is not passing comm_dict into the
recomputation, so _cal_hessian_ext ends up differentiating a graph that lacks
message-passing state; update the call site inside the r_hessian loop to pass
the comm_dict argument through to _cal_hessian_ext (propagate the same comm_dict
that was passed into forward_common_atomic), ensuring _cal_hessian_ext and any
subsequent atomic_model.forward_common_atomic it invokes receive comm_dict as
well.
deepmd/pt_expt/utils/serialization.py (1)

842-913: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Write the final .pt2 atomically after both artifacts succeed.

model_file is populated with the regular AOTI package before the with-comm compile and before any of the extra/*.json entries are appended. If the second compile or the later ZIP write fails, callers are left with a top-level .pt2 that looks real on disk but is missing the metadata/model payloads needed by the loader. Staging both artifacts in a temp location and only replacing model_file once packaging is complete would avoid that corruption window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/utils/serialization.py` around lines 842 - 913, The current
flow writes the regular AOTI package directly to model_file before building the
with-comm artifact and before adding extra/metadata.json, which can leave a
corrupted .pt2 if the second compile or ZIP append fails; change serialize logic
so aoti_compile_and_package writes the primary artifact into a temporary file
(use tempfile.TemporaryDirectory and a temp path for the first
aoti_compile_and_package call with exported), also produce the with-comm
artifact into another temp file (exported_wc -> wc_path) and collect
with_comm_bytes, then open a new temp ZIP (or copy the primary temp .pt2 into a
temp output ZIP), add extra/metadata.json, extra/model.json,
extra/model_def_script.json and optionally extra/forward_lower_with_comm.pt2,
and finally atomically replace the original model_file with
os.replace(temp_output_path, model_file); keep the existing inductor config
save/restore around each aoti_compile_and_package call (symbols:
aoti_compile_and_package, exported, exported_wc, with_comm_bytes, metadata,
model_file, _inductor_config.realize_opcount_threshold) so behavior is unchanged
but the final .pt2 is only moved into place on success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/dpmodel/descriptor/hybrid.py`:
- Around line 336-338: DescrptHybrid.call currently ignores its fparam when
invoking child descriptors and unpacks unused outputs g2, h2, sw (causing
RUF059); modify the call site inside DescrptHybrid.call where it invokes
descrpt(...) so that it forwards fparam (e.g., descrpt(coord_ext, atype_ext, nl,
mapping, comm_dict=comm_dict, fparam=fparam)) and stop unpacking unused returns
by either assigning them to _ placeholders or only unpacking the values you need
(e.g., odescriptor, gr = descrpt(...) or odescriptor, gr, *_ = descrpt(...));
update all similar calls in hybrid.py to follow this pattern so fparam is
propagated to child descriptors and the unused variable lint error is resolved.

In `@deepmd/dpmodel/descriptor/repflows.py`:
- Around line 509-528: In _exchange_ghosts, ensure we fail fast when
mapping_tiled is missing: if self.use_loc_mapping is False and mapping_tiled is
None, raise a clear ValueError stating that mapping_tiled (or mapping) is
required for the default ghost-exchange path; otherwise proceed to call
xp_take_along_axis(node_ebd, mapping_tiled, axis=1). Reference the
_exchange_ghosts method and the mapping_tiled parameter when adding this guard.

In `@deepmd/pt_expt/descriptor/repflows.py`:
- Around line 68-107: The code assumes a single-frame batch when using comm_dict
and squeezes the batch axis (node_ebd.squeeze(0)) before calling
torch.ops.deepmd_export.border_op, which produces wrong-rank tensors if
node_ebd.shape[0] != 1; add an explicit guard at the start of the comm_dict
branch (where has_spin is computed and before any squeeze/padding/border_op
calls) that checks node_ebd.shape[0] == 1 (or nf == 1) and raises a clear
ValueError mentioning comm_dict/border_op/frame-batching, so the function fails
fast rather than producing malformed tensors (ensure the check applies to both
the has_spin and non-spin paths and keep concat_switch_virtual unchanged).

In `@deepmd/pt_expt/descriptor/repformers.py`:
- Around line 48-82: The code assumes a non-batched input (nf==1) but calls
squeeze(0) which is a no-op for batched inputs; add an explicit guard at the
start of this block to reject batched tensors by checking g1.shape[0] (or nf) ==
1 and raise a clear ValueError if not, so border_op always receives a 2-D tensor
as expected; apply this check before any use of squeeze(0) or the has_spin
branch (symbols to update: g1, has_spin, border_op, concat_switch_virtual,
real_nloc).

In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 82-120: The helper _has_message_passing currently bails out if
model.atomic_model.descriptor itself lacks has_message_passing, missing hybrid
wrappers; update it to walk wrapper chains before returning False: start from
model.atomic_model.descriptor and repeatedly follow likely wrapper fields (e.g.
"descriptor", "wrapped", "inner", "module") until you find an object that either
implements has_message_passing() or contains repflows/repformers; then run the
existing checks (call has_message_passing(), handle
AttributeError/NotImplementedError, and inspect block.use_loc_mapping on
repflows/repformers). If traversal finds nothing, then return False as before.
Use the existing symbol names (_has_message_passing,
model.atomic_model.descriptor, has_message_passing, repflows, repformers,
use_loc_mapping) so the change is easy to locate.

In `@source/api_cc/src/DeepPotPTExpt.cc`:
- Around line 165-182: When init() discovers has_comm_artifact_ is true, wrap
the extraction and loader construction for with_comm_tempfile_ and
with_comm_loader (the TempFile::from_zip_entry and AOTIModelPackageLoader
creation) in a try/catch; on any exception or failure clear has_comm_artifact_
(so init succeeds) and release/reset with_comm_tempfile_ and with_comm_loader,
leaving run_model_with_comm() as the only place that returns a hard error when
the comm artifact is actually required; apply the same defensive pattern in
DeepSpinPTExpt::init() to mirror behavior.

In `@source/lmp/tests/test_lammps_dpa3_pt2.py`:
- Line 479: Add a per-line Ruff suppression for the false-positive S603 on the
MPI subprocess invocation: update the call sp.check_call(argv) (in the
test_lammps_dpa3_pt2 test) to include a noqa S603 comment (e.g.,
sp.check_call(argv)  # noqa: S603) so Ruff won’t flag this trusted,
test-controlled subprocess usage.

In `@source/lmp/tests/test_lammps_spin_dpa3_pt2.py`:
- Line 176: Add an inline Ruff suppression on the mpirun invocation to silence
the false-positive S603 by annotating the sp.check_call(argv) call with a noqa
comment and brief rationale; locate the sp.check_call(argv) call in
test_lammps_spin_dpa3_pt2.py and append a comment such as "noqa: S603  # argv
built from fixed test paths and test-owned flags" so the linter knows this is
not untrusted input.

In `@source/tests/infer/gen_dpa3.py`:
- Around line 95-109: config_mpi is passed directly into get_model() which may
mutate it and thus makes data_mpi["model_def_script"] incorrect; fix by passing
a deep copy of the MPI config into get_model (i.e., call
get_model(copy.deepcopy(config_mpi)) or otherwise clone config_mpi before
invoking get_model) so that model_mpi is created from a copy while
data_mpi["model_def_script"] retains the original intended MPI config used later
by pt_expt_deserialize_to_file; update references around config_mpi, model_mpi,
data_mpi and the get_model call accordingly.

In `@source/tests/pt_expt/model/test_export_with_comm.py`:
- Around line 214-223: The unpacking of comm_inputs binds an unused name
"communicator", which Ruff flags; update the tuple assignment so the unused
binding is renamed to "_communicator" (or "_" if preferred) — i.e., change the
unpack targets (send_list, send_proc, recv_proc, send_num, recv_num,
communicator, nlocal, nghost_t) to (send_list, send_proc, recv_proc, send_num,
recv_num, _communicator, nlocal, nghost_t) to silence the lint warning while
keeping behavior identical.
- Around line 83-91: The test currently passes send_num/recv_num==0 and an
address for an empty sendlist when _make_sample_inputs() yields nghost==0;
mirror the clamp used in _build_self_comm_inputs() by clamping nghost to at
least 1 when constructing sendlist_indices, send_list, send_num and recv_num in
this snippet: if nghost == 0 replace sendlist_indices with a small 1-element
contiguous array (e.g. [0]) and set send_num and recv_num to torch.tensor([1],
dtype=torch.int32) (and construct send_list from the address of that clamped
array) so the helper never passes zero counts or an invalid address while
keeping the rest of the test logic unchanged.
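A minimal sketch of that clamp, using plain Python lists in place of the real torch tensors (the helper name is illustrative, not an existing deepmd symbol):

```python
def build_clamped_comm_counts(nghost: int):
    """When the sample has no ghost atoms, substitute a 1-element
    contiguous sendlist so the comm op never sees a zero count or a
    dangling buffer address."""
    n = max(nghost, 1)
    sendlist_indices = list(range(n))  # [0] when nghost == 0
    send_num = [n]
    recv_num = [n]
    return sendlist_indices, send_num, recv_num
```

The real helper would wrap these counts in `torch.tensor(..., dtype=torch.int32)` and take the address of the clamped array for `send_list`; the clamp logic itself is the only point of this sketch.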

---

Outside diff comments:
In `@deepmd/pt_expt/descriptor/dpa2.py`:
- Around line 228-249: In the call method, when self.compress is True you must
reject any provided comm_dict to avoid silent multi-rank misuse; add a guard at
the start of the compressed-path branch in call (the method that currently
returns self._call_compressed(...)) that checks if comm_dict is not None and
raises a clear exception (e.g., ValueError) with a message like "Compressed path
is local-only; comm_dict/ghost exchange not supported" before calling
_call_compressed; keep the rest of the behavior unchanged and reference the call
method and _call_compressed when making this change.

In `@deepmd/pt_expt/model/make_model.py`:
- Around line 273-323: The Hessian recompute path in forward_common_atomic is
not passing comm_dict into the recomputation, so _cal_hessian_ext ends up
differentiating a graph that lacks message-passing state; update the call site
inside the r_hessian loop to pass the comm_dict argument through to
_cal_hessian_ext (propagate the same comm_dict that was passed into
forward_common_atomic), ensuring _cal_hessian_ext and any subsequent
atomic_model.forward_common_atomic it invokes receive comm_dict as well.

In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 842-913: The current flow writes the regular AOTI package directly
to model_file before building the with-comm artifact and before adding
extra/metadata.json, which can leave a corrupted .pt2 if the second compile or
ZIP append fails; change serialize logic so aoti_compile_and_package writes the
primary artifact into a temporary file (use tempfile.TemporaryDirectory and a
temp path for the first aoti_compile_and_package call with exported), also
produce the with-comm artifact into another temp file (exported_wc -> wc_path)
and collect with_comm_bytes, then open a new temp ZIP (or copy the primary temp
.pt2 into a temp output ZIP), add extra/metadata.json, extra/model.json,
extra/model_def_script.json and optionally extra/forward_lower_with_comm.pt2,
and finally atomically replace the original model_file with
os.replace(temp_output_path, model_file); keep the existing inductor config
save/restore around each aoti_compile_and_package call (symbols:
aoti_compile_and_package, exported, exported_wc, with_comm_bytes, metadata,
model_file, _inductor_config.realize_opcount_threshold) so behavior is unchanged
but the final .pt2 is only moved into place on success.
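The stage-then-replace flow described above can be sketched as follows (names are illustrative; `compile_fn` stands in for `aoti_compile_and_package`, and the extra entries stand in for the `extra/*.json` payloads):

```python
import os
import tempfile
import zipfile


def package_pt2_atomically(model_file, compile_fn, extra_entries):
    # Stage everything in a temp dir next to the destination so the final
    # os.replace stays on one filesystem (rename is only atomic then).
    dest_dir = os.path.dirname(os.path.abspath(model_file)) or "."
    with tempfile.TemporaryDirectory(dir=dest_dir) as td:
        staged = os.path.join(td, "staged.pt2")
        compile_fn(staged)  # primary artifact; may raise without harm
        with zipfile.ZipFile(staged, "a") as zf:
            for name, payload in extra_entries.items():
                zf.writestr(name, payload)  # e.g. extra/metadata.json
        os.replace(staged, model_file)  # only now does model_file change
```

If `compile_fn` or the ZIP append raises, the destination path is never touched, so a previously good export survives a failed re-export.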

In `@source/op/pt/comm.cc`:
- Around line 174-193: The backward() currently returns 10 gradient tensors but
forward() accepts 9 inputs, breaking autograd's arity contract; update
backward() (the torch::autograd::Function::backward implementation) to return
exactly nine torch::Tensor() entries matching forward's inputs, keeping d_in
(the result of border_op_backward_dispatch) positioned to correspond to the same
forward input it gradients; use ctx->get_saved_variables() and variable_list
semantics to produce a variable_list of length 9 (e.g., replace the current
10-entry return with a 9-entry list containing torch::Tensor() placeholders and
d_in in the correct slot).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 479363c0-3e99-4a89-9d61-d599f6d2fc2c

📥 Commits

Reviewing files that changed from the base of the PR and between a094624 and 695b736.

📒 Files selected for processing (53)
  • deepmd/dpmodel/atomic_model/base_atomic_model.py
  • deepmd/dpmodel/atomic_model/dp_atomic_model.py
  • deepmd/dpmodel/atomic_model/linear_atomic_model.py
  • deepmd/dpmodel/atomic_model/pairtab_atomic_model.py
  • deepmd/dpmodel/descriptor/dpa1.py
  • deepmd/dpmodel/descriptor/dpa2.py
  • deepmd/dpmodel/descriptor/dpa3.py
  • deepmd/dpmodel/descriptor/hybrid.py
  • deepmd/dpmodel/descriptor/repflows.py
  • deepmd/dpmodel/descriptor/repformers.py
  • deepmd/dpmodel/descriptor/se_e2_a.py
  • deepmd/dpmodel/descriptor/se_r.py
  • deepmd/dpmodel/descriptor/se_t.py
  • deepmd/dpmodel/descriptor/se_t_tebd.py
  • deepmd/dpmodel/model/make_model.py
  • deepmd/dpmodel/model/spin_model.py
  • deepmd/pt_expt/descriptor/__init__.py
  • deepmd/pt_expt/descriptor/dpa1.py
  • deepmd/pt_expt/descriptor/dpa2.py
  • deepmd/pt_expt/descriptor/repflows.py
  • deepmd/pt_expt/descriptor/repformers.py
  • deepmd/pt_expt/descriptor/se_e2_a.py
  • deepmd/pt_expt/descriptor/se_r.py
  • deepmd/pt_expt/descriptor/se_t.py
  • deepmd/pt_expt/descriptor/se_t_tebd.py
  • deepmd/pt_expt/model/make_model.py
  • deepmd/pt_expt/model/spin_model.py
  • deepmd/pt_expt/utils/__init__.py
  • deepmd/pt_expt/utils/comm.py
  • deepmd/pt_expt/utils/serialization.py
  • source/api_cc/include/DeepPotPTExpt.h
  • source/api_cc/include/DeepSpinPTExpt.h
  • source/api_cc/src/DeepPotPTExpt.cc
  • source/api_cc/src/DeepSpinPTExpt.cc
  • source/api_cc/src/common.cc
  • source/api_cc/src/commonPTExpt.h
  • source/api_cc/tests/CMakeLists.txt
  • source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py
  • source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py
  • source/lmp/tests/test_lammps_dpa2_pt2.py
  • source/lmp/tests/test_lammps_dpa3_pt2.py
  • source/lmp/tests/test_lammps_spin_dpa3_pt2.py
  • source/op/pt/comm.cc
  • source/tests/infer/gen_dpa2.py
  • source/tests/infer/gen_dpa3.py
  • source/tests/infer/gen_spin.py
  • source/tests/pt_expt/conftest.py
  • source/tests/pt_expt/descriptor/test_repflow_parallel.py
  • source/tests/pt_expt/descriptor/test_repformer_parallel.py
  • source/tests/pt_expt/model/test_export_with_comm.py
  • source/tests/pt_expt/model/test_spin_export_with_comm.py
  • source/tests/pt_expt/utils/test_border_op_backward.py
  • source/tests/pt_expt/utils/test_has_message_passing.py

Comment thread deepmd/dpmodel/descriptor/hybrid.py Outdated
Comment thread deepmd/dpmodel/descriptor/repflows.py
Comment thread deepmd/pt_expt/descriptor/repflows.py
Comment thread deepmd/pt_expt/descriptor/repformers.py
Comment thread deepmd/pt_expt/utils/serialization.py Outdated
Comment on lines +82 to +120
def _has_message_passing(model: torch.nn.Module) -> bool:
    """Detect whether a model's descriptor uses GNN-style message passing.

    GNN descriptors (DPA2 with repformers, DPA3 with repflows) require
    a per-layer ghost-atom MPI exchange when running multi-rank LAMMPS,
    which means a separate ``with-comm`` AOTInductor artifact must be
    compiled. Non-GNN descriptors (se_e2_a, se_r, se_t, se_t_tebd,
    DPA1, hybrid-of-non-GNN) need only the regular artifact.

    Additional gate: ``use_loc_mapping=True`` GNN models (the default
    for DPA3) keep nlist in local-only indexing, so per-layer ghost
    exchange is meaningless — these get only the regular artifact.
    Multi-rank LAMMPS for GNN requires use_loc_mapping=False.

    Returns False if the descriptor's ``has_message_passing()`` query
    cannot be answered (e.g. linear/zbl/frozen models without a single
    descriptor) — those are assumed local.
    """
    try:
        descriptor = model.atomic_model.descriptor
    except AttributeError:
        return False
    if not hasattr(descriptor, "has_message_passing"):
        return False
    try:
        if not descriptor.has_message_passing():
            return False
    except (AttributeError, NotImplementedError):
        return False
    # Walk into the GNN block (repflows / repformers) to inspect
    # ``use_loc_mapping``. The attribute lives on the block, not on the
    # top-level descriptor wrapper.
    for attr in ("repflows", "repformers"):
        block = getattr(descriptor, attr, None)
        if block is None:
            continue
        if getattr(block, "use_loc_mapping", False):
            return False
    return True
Contributor

@coderabbitai coderabbitai Bot May 1, 2026


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Handle hybrid wrappers before declaring the model “local”.

This helper only looks at model.atomic_model.descriptor and returns False as soon as that direct probe fails. Hybrid descriptors that contain a repflows/repformers sub-block under another wrapper will therefore never get extra/forward_lower_with_comm.pt2, so multi-rank LAMMPS silently falls back to the regular artifact on those models.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/utils/serialization.py` around lines 82 - 120, The helper
_has_message_passing currently bails out if model.atomic_model.descriptor itself
lacks has_message_passing, missing hybrid wrappers; update it to walk wrapper
chains before returning False: start from model.atomic_model.descriptor and
repeatedly follow likely wrapper fields (e.g. "descriptor", "wrapped", "inner",
"module") until you find an object that either implements has_message_passing()
or contains repflows/repformers; then run the existing checks (call
has_message_passing(), handle AttributeError/NotImplementedError, and inspect
block.use_loc_mapping on repflows/repformers). If traversal finds nothing, then
return False as before. Use the existing symbol names (_has_message_passing,
model.atomic_model.descriptor, has_message_passing, repflows, repformers,
use_loc_mapping) so the change is easy to locate.

Collaborator Author


Valid but out of scope for this PR. This is Tier-1 #1 in the project's gnn_mpi_untested_paths.md catalog of known limitations. Recursing _has_message_passing into hybrid wrappers requires a hybrid+GNN multi-rank fixture (so we can verify the with-comm artifact is actually correct for hybrid models, not just produced) plus end-to-end LAMMPS multi-rank testing of hybrid+GNN. Both are substantial work and will land in a separate PR.

For now the existing behaviour (silent fall-back to single-artifact) preserves the pre-PR status quo: hybrid descriptors with GNN children continue to work in single-rank mode just as they did before; multi-rank with this configuration was never supported and will fail loudly at the first ghost exchange via border_op if attempted.

Contributor


@wanghan-iapcm, understood — the cataloged limitation and the pre-PR status-quo fall-back behaviour are a reasonable boundary for this PR. The failure mode (a loud border_op error rather than silently wrong results) also makes the gap safe to defer.

Would you like me to open a GitHub issue to track the hybrid+GNN multi-rank support work (adding the fixture, the LAMMPS end-to-end tests, and the _has_message_passing recursion)?


✏️ Learnings added
Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5430
File: deepmd/pt_expt/utils/serialization.py:101-139
Timestamp: 2026-05-02T12:55:27.822Z
Learning: In `deepmd/pt_expt/utils/serialization.py`, the helper `_has_message_passing` intentionally does NOT recurse into hybrid descriptor wrappers to detect GNN sub-blocks. This is a known, cataloged limitation (Tier-1 `#1` in `gnn_mpi_untested_paths.md`). Hybrid+GNN models silently fall back to the single-artifact path (pre-PR status quo); multi-rank LAMMPS for hybrid+GNN descriptors was never supported and will fail loudly at the first `border_op` ghost exchange if attempted. Support requires a dedicated hybrid+GNN multi-rank fixture plus end-to-end LAMMPS tests, deferred to a separate PR.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5428
File: source/tests/infer/gen_model_devi.py:144-147
Timestamp: 2026-05-01T13:46:12.190Z
Learning: Do not report or fail code review findings for Ruff `RUF059` (`unused-unpacked-variable`) in Python files. This rule is explicitly ignored via `tool.ruff.lint.ignore` in `pyproject.toml` (deepmodeling/deepmd-kit, referenced line 399), and since CI does not enforce it, unused variables created via tuple/unpack assignments (e.g., `e0, f0, v0, ae0, av0 = dp.eval(...)` where some of the targets are unused) should not be flagged for this reason.

argv.extend(extra_args)
if runner_args:
    argv.extend(runner_args)
sp.check_call(argv)
Contributor

@coderabbitai coderabbitai Bot May 1, 2026


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Suppress Ruff's false-positive S603 on the MPI subprocess call.

This command is built from fixed repo paths and test-controlled arguments, so the security rule is a false positive here, but ruff check will still fail until it is annotated.

Suggested fix
-        sp.check_call(argv)
+        sp.check_call(argv)  # noqa: S603 - test-controlled argv, no shell, fixed repo paths

As per coding guidelines, **/*.py: Install linter and run ruff check . before committing changes or the CI will fail.

🧰 Tools
🪛 Ruff (0.15.12)

[error] 479-479: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/lmp/tests/test_lammps_dpa3_pt2.py` at line 479, Add a per-line Ruff
suppression for the false-positive S603 on the MPI subprocess invocation: update
the call sp.check_call(argv) (in the test_lammps_dpa3_pt2 test) to include a
noqa S603 comment (e.g., sp.check_call(argv)  # noqa: S603) so Ruff won’t flag
this trusted, test-controlled subprocess usage.

Collaborator Author


Not applicable to this project. pyproject.toml does not enable any S (flake8-bandit) rules — the lint config selects E,W,F,D,UP,C4,RUF,NPY,TID*,T20,B904,N804,N805,DTZ,TCH,PYI,ANN,B905. ruff check passes cleanly without the annotation. Adding # noqa: S603 would be a no-op comment for a rule the project does not run.

Contributor


@wanghan-iapcm, you're absolutely right — I apologize for the noise. Since the project's ruff configuration doesn't select any S (flake8-bandit) rules, S603 is never checked and the annotation would be a meaningless no-op. I'll record this for future reviews.


✏️ Learnings added
Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5430
File: source/lmp/tests/test_lammps_dpa3_pt2.py:479-479
Timestamp: 2026-05-02T12:55:24.164Z
Learning: In deepmodeling/deepmd-kit, the project's `pyproject.toml` Ruff lint configuration selects only `E,W,F,D,UP,C4,RUF,NPY,TID*,T20,B904,N804,N805,DTZ,TCH,PYI,ANN,B905` rule sets. The `S` (flake8-bandit) rules are NOT enabled, so `S`-prefixed findings (e.g., S603 subprocess calls) should never be flagged in code reviews for this project.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5370
File: source/lmp/tests/test_lammps_spin_nopbc_pt2.py:22-23
Timestamp: 2026-04-04T17:03:26.124Z
Learning: For deepmd-kit LAMMPS Python test files under `source/lmp/tests/`, follow the established convention of using module-level fixed `Path` constants (e.g., `data.lmp`, `md.out`) that are managed via `setup_module`/`teardown_module`. In code review, do NOT flag these fixed module-level paths as a file-collision or parallel-run risk as long as the test uses the repo’s `setup_module`/`teardown_module` pattern; changing them would break consistency with the rest of the test suite.

Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5428
File: source/tests/infer/gen_model_devi.py:144-147
Timestamp: 2026-05-01T13:46:12.190Z
Learning: Do not report or fail code review findings for Ruff `RUF059` (`unused-unpacked-variable`) in Python files. This rule is explicitly ignored via `tool.ruff.lint.ignore` in `pyproject.toml` (deepmodeling/deepmd-kit, referenced line 399), and since CI does not enforce it, unused variables created via tuple/unpack assignments (e.g., `e0, f0, v0, ae0, av0 = dp.eval(...)` where some of the targets are unused) should not be flagged for this reason.

Comment thread source/lmp/tests/test_lammps_spin_dpa3_pt2.py
Comment thread source/tests/infer/gen_dpa3.py
Comment thread source/tests/pt_expt/model/test_export_with_comm.py Outdated
Comment on lines +214 to +223
    (
        send_list,
        send_proc,
        recv_proc,
        send_num,
        recv_num,
        communicator,
        nlocal,
        nghost_t,
    ) = comm_inputs
Contributor

@coderabbitai coderabbitai Bot May 1, 2026


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename the unused communicator binding to keep Ruff clean.

Ruff is already flagging this unpack target as unused. Renaming it to _communicator (or _) avoids a preventable lint failure/noise source.

Suggested fix
     (
         send_list,
         send_proc,
         recv_proc,
         send_num,
         recv_num,
-        communicator,
+        _communicator,
         nlocal,
         nghost_t,
     ) = comm_inputs

As per coding guidelines, **/*.py: Install linter and run ruff check . before committing changes or the CI will fail.

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 220-220: Unpacked variable communicator is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/pt_expt/model/test_export_with_comm.py` around lines 214 - 223,
The unpacking of comm_inputs binds an unused name "communicator", which Ruff
flags; update the tuple assignment so the unused binding is renamed to
"_communicator" (or "_" if preferred) — i.e., change the unpack targets
(send_list, send_proc, recv_proc, send_num, recv_num, communicator, nlocal,
nghost_t) to (send_list, send_proc, recv_proc, send_num, recv_num,
_communicator, nlocal, nghost_t) to silence the lint warning while keeping
behavior identical.

Collaborator Author


Not applicable to this project. pyproject.toml line 399 explicitly ignores RUF059 (# unused-unpacked-variable), so the communicator binding here does not violate the project's lint policy and ruff check passes. Renaming to _communicator would be cosmetic only.

Contributor


@wanghan-iapcm, you're absolutely right — I apologize for the noise. As noted in my retrieved learnings, RUF059 (unused-unpacked-variable) is explicitly suppressed in pyproject.toml line 399, so ruff check passes as-is and the communicator binding is perfectly fine. No change needed.


🧠 Learnings used
Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5391
File: source/tests/pt_expt/infer/test_deep_eval.py:1667-1691
Timestamp: 2026-04-11T08:01:08.364Z
Learning: In deepmd-kit, `get_dp_atomic_model()` is a model-level API exposed on the full model hierarchy (e.g., `make_model` CM class, `FrozenModel`, `SpinModel`) and is not available on atomic models (e.g., `DPZBLLinearEnergyAtomicModel`, `DPAtomicModel`). When writing/validating tests for ZBL or other atomic-model behavior, do not call `get_dp_atomic_model()` on atomic model instances; instead, use `isinstance` checks (or equivalent type assertions) on the atomic model.

Learnt from: wanghan-iapcm
Repo: deepmodeling/deepmd-kit PR: 5428
File: source/tests/infer/gen_model_devi.py:144-147
Timestamp: 2026-05-01T13:46:12.190Z
Learning: Do not report or fail code review findings for Ruff `RUF059` (`unused-unpacked-variable`) in Python files. This rule is explicitly ignored via `tool.ruff.lint.ignore` in `pyproject.toml` (deepmodeling/deepmd-kit, referenced line 399), and since CI does not enforce it, unused variables created via tuple/unpack assignments (e.g., `e0, f0, v0, ae0, av0 = dp.eval(...)` where some of the targets are unused) should not be flagged for this reason.

@wanghan-iapcm wanghan-iapcm force-pushed the feat-pt-expt-gnn-mpi branch from 695b736 to 47f0c29 Compare May 1, 2026 13:38
# Conflicts:
#	deepmd/pt_expt/utils/serialization.py
Comment thread source/tests/pt_expt/conftest.py Fixed
import torch

# Trigger registration of the deepmd_export::border_op opaque wrapper.
import deepmd.pt_expt.utils.comm # noqa: F401
import torch

# Trigger registration of the deepmd_export::border_op opaque wrapper.
import deepmd.pt_expt.utils.comm # noqa: F401

# Trigger registration of the deepmd_export::border_op opaque wrapper
# (needed by the with-comm artifact at runtime).
import deepmd.pt_expt.utils.comm # noqa: F401
import numpy as np
import torch

import deepmd.pt_expt.utils.comm # noqa: F401 - opaque op registration
Comment thread source/tests/pt_expt/utils/test_border_op_backward.py Fixed
# Ensure the new C++ symbol is loaded. pt_expt imports deepmd.pt for
# the custom-op .so.
import deepmd.pt
import deepmd.pt_expt.utils.comm # noqa: F401 - registers deepmd_export::border_op
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deepmd/pt_expt/utils/serialization.py (1)

877-960: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stage .pt2 packaging before touching the destination path.

The regular artifact is compiled straight into model_file, and only afterwards do we build the optional with-comm artifact and append model/extra/*. If that second phase fails, the caller is left with a half-written .pt2 at the final path, which can also overwrite a previously good export.

🧰 Minimal staging pattern
-    aoti_compile_and_package(exported, package_path=model_file)
+    with tempfile.TemporaryDirectory() as td:
+        staged_model = os.path.join(td, "forward_lower.pt2")
+        aoti_compile_and_package(exported, package_path=staged_model)
+        ...
+        with zipfile.ZipFile(staged_model, "a") as zf:
+            ...
+        os.replace(staged_model, model_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/utils/serialization.py` around lines 877 - 960, The current
flow compiles the regular artifact directly into model_file then later appends
the with-comm artifact and metadata, risking leaving a half-written .pt2 on
failure; change to a staging pattern: perform both _trace_and_export calls and
aoti_compile_and_package invocations into temporary files/paths (e.g. compile
exported -> temp_main_path, exported_wc -> temp_wc_path) and collect
with_comm_bytes and data_for_json there, then create the final ZIP (using
PT2_EXTRA_PREFIX entries for metadata/model.json/model_def_script and include
forward_lower_with_comm.pt2 if present) in a temporary output file and
atomically replace the destination model_file with os.replace(temp_output,
model_file); preserve the existing inductor config save/restore logic around
each aoti_compile_and_package call but target the temp paths instead of
model_file so failures never overwrite the final path.
♻️ Duplicate comments (2)
deepmd/pt_expt/utils/serialization.py (2)

171-183: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't lock the with-comm artifact to nswap == 1.

_make_comm_sample_inputs() traces the comm path with nswap = 1, and these None specs make the first five comm tensors static to that sample shape. Any runtime where lmp_list.nswap is greater than 1 will fail the exported input contract before deepmd_export::border_op runs, so the artifact is still tied to the trace-time processor topology.

Also applies to: 387-397

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/utils/serialization.py` around lines 171 - 183, The code
currently hardcodes nswap = 1 in _make_comm_sample_inputs which fixes the traced
comm tensor shapes to one-swap topology; change nswap to be dynamic (e.g. nswap
= max(1, lmp_list.nswap) or similar runtime value) and ensure all dependent
tensors (send_proc, recv_proc, send_num, recv_num, any send_count logic) use
that nswap so the traced inputs support nswap > 1; apply the same fix to the
analogous block around lines 387-397.

119-139: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Follow wrapper descriptors before concluding there is no message passing.

_has_message_passing() still returns False as soon as the top-level atomic_model.descriptor lacks has_message_passing() or direct repflows/repformers attrs. Hybrid/wrapper descriptors will therefore never set has_comm_artifact, so DeepPotPTExpt/DeepSpinPTExpt silently stay on the regular artifact in multi-rank LAMMPS.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/utils/serialization.py` around lines 119 - 139, The function
_has_message_passing currently bails out if the top-level
model.atomic_model.descriptor lacks has_message_passing or direct
repflows/repformers, so wrapper descriptors never get inspected; update the
logic to unwrap/walk wrapper descriptors before making decisions: starting from
model.atomic_model.descriptor, follow common wrapper links (e.g., attributes
named "descriptor", "wrapped", or "__wrapped__") until you reach the concrete
descriptor, then check for has_message_passing(), and only then inspect
repflows/repformers and their use_loc_mapping flag; keep the existing exception
handling (AttributeError/NotImplementedError) around has_message_passing() and
preserve the final True/False semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 877-960: The current flow compiles the regular artifact directly
into model_file then later appends the with-comm artifact and metadata, risking
leaving a half-written .pt2 on failure; change to a staging pattern: perform
both _trace_and_export calls and aoti_compile_and_package invocations into
temporary files/paths (e.g. compile exported -> temp_main_path, exported_wc ->
temp_wc_path) and collect with_comm_bytes and data_for_json there, then create
the final ZIP (using PT2_EXTRA_PREFIX entries for
metadata/model.json/model_def_script and include forward_lower_with_comm.pt2 if
present) in a temporary output file and atomically replace the destination
model_file with os.replace(temp_output, model_file); preserve the existing
inductor config save/restore logic around each aoti_compile_and_package call but
target the temp paths instead of model_file so failures never overwrite the
final path.

---

Duplicate comments:
In `@deepmd/pt_expt/utils/serialization.py`:
- Around line 171-183: The code currently hardcodes nswap = 1 in
_make_comm_sample_inputs which fixes the traced comm tensor shapes to one-swap
topology; change nswap to be dynamic (e.g. nswap = max(1, lmp_list.nswap) or
similar runtime value) and ensure all dependent tensors (send_proc, recv_proc,
send_num, recv_num, any send_count logic) use that nswap so the traced inputs
support nswap > 1; apply the same fix to the analogous block around lines
387-397.
- Around line 119-139: The function _has_message_passing currently bails out if
the top-level model.atomic_model.descriptor lacks has_message_passing or direct
repflows/repformers, so wrapper descriptors never get inspected; update the
logic to unwrap/walk wrapper descriptors before making decisions: starting from
model.atomic_model.descriptor, follow common wrapper links (e.g., attributes
named "descriptor", "wrapped", or "__wrapped__") until you reach the concrete
descriptor, then check for has_message_passing(), and only then inspect
repflows/repformers and their use_loc_mapping flag; keep the existing exception
handling (AttributeError/NotImplementedError) around has_message_passing() and
preserve the final True/False semantics.
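A torch-free sketch of the wrapper walk those prompts describe (the wrapper attribute names are the reviewer's guesses, not a settled deepmd API):

```python
def unwrap_descriptor(descriptor, max_depth=8):
    """Follow common wrapper attributes until an object exposing
    has_message_passing() is found; give up after max_depth hops."""
    for _ in range(max_depth):
        if hasattr(descriptor, "has_message_passing"):
            return descriptor
        for attr in ("descriptor", "wrapped", "inner", "module"):
            nxt = getattr(descriptor, attr, None)
            if nxt is not None:
                descriptor = nxt
                break
        else:
            return None  # no wrapper link to follow
    return None
```

`_has_message_passing` could then run its existing checks (the `has_message_passing()` call plus the `repflows`/`repformers` `use_loc_mapping` inspection) on the unwrapped object instead of only on `model.atomic_model.descriptor`.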

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cadcb480-96e6-4a09-a2f0-0f282cd2a6c5

📥 Commits

Reviewing files that changed from the base of the PR and between 47f0c29 and c6a38e6.

📒 Files selected for processing (1)
  • deepmd/pt_expt/utils/serialization.py

Comment thread source/op/pt/comm.cc
nghost_tensor);
}

TORCH_LIBRARY_FRAGMENT(deepmd, m) {
Comment thread source/op/pt/comm.cc
// ============================================================================

namespace {
torch::Tensor border_op_export(const torch::Tensor& sendlist_tensor,
Comment thread source/op/pt/comm.cc
return out.empty() ? torch::empty_like(g1_tensor) : out[0].clone();
}

torch::Tensor border_op_backward_export(
Comment thread source/op/pt/comm.cc
}
} // namespace

TORCH_LIBRARY_FRAGMENT(deepmd_export, m) {
Comment thread source/op/pt/comm.cc
// Register CPU + CUDA implementations under explicit dispatch keys so
// torch.export sees opaque external calls (vs CompositeImplicitAutograd
// which gets decomposed during trace).
TORCH_LIBRARY_IMPL(deepmd_export, CPU, m) {
Comment thread source/op/pt/comm.cc
m.impl("border_op_backward", border_op_backward_export);
}
#if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
TORCH_LIBRARY_IMPL(deepmd_export, CUDA, m) {
Han Wang added 2 commits May 2, 2026 07:51
The dpmodel layer threads a new ``comm_dict=None`` kwarg through
``forward_common_atomic`` (model and atomic-model levels) so the
pt_expt backend can wire MPI ghost-atom exchange for GNN multi-rank
LAMMPS. The JAX backend overrides ``forward_common_atomic`` with
explicit kwarg lists; without accepting ``comm_dict``, ``dp
convert-backend ... savedmodel`` fails at trace time:

    TypeError: jax_model.forward_common_atomic() got an unexpected
    keyword argument 'comm_dict'

Affected the entire CI matrix on PR deepmodeling#5430 (every Python shard goes
through the savedmodel build prep). Fix: add ``comm_dict: dict |
None = None`` to each JAX override and ``del comm_dict`` (the JAX
path has no MPI ghost exchange).

Files touched: dp_atomic_model, linear_atomic_model,
pairtab_atomic_model (atomic-model level), plus base_model,
dp_model, dp_zbl_model (model level). Paddle's
forward_common_atomic already accepts comm_dict and needs no change.
DDP-spawned worker subprocesses re-import modules from scratch and
never run the test conftest's ``import deepmd.pt``, so when
``pt_expt.utils.comm`` is imported the underlying
``deepmd_export::{border_op,border_op_backward}`` ops are not yet
registered and the import-time guard raises:

    RuntimeError: torch.ops.deepmd_export.{border_op,border_op_backward}
    are not registered. Build libdeepmd_op_pt.so and ensure deepmd.pt
    is imported before this module.

Repro: test_training_ddp.py::TestDDPRestart::test_ddp_restart on every
Python CI shard.

Fix: ``_check_underlying_ops_loaded`` now triggers ``import deepmd.pt``
as a side effect when the ops aren't yet registered. ``deepmd/pt/cxx_op.py``
loads ``libdeepmd_op_pt.so`` which registers the schemas. The original
RuntimeError stays as a fallback if ``import deepmd.pt`` itself fails.

Verified locally: importing ``deepmd.pt_expt.utils.comm`` in a fresh
process (without explicit ``import deepmd.pt`` first) now succeeds and
``torch.ops.deepmd_export.border_op`` is available.
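
The guard's generic shape, reduced to a sketch. Names and the error message are paraphrased; in the real code the check inspects `torch.ops.deepmd_export` and the bootstrap is `import deepmd.pt`:

```python
def check_underlying_ops_loaded(ops_registered, bootstrap):
    """Ensure opaque ops are registered, bootstrapping their library once.

    ops_registered: () -> bool, True when the ops are visible.
    bootstrap: () -> None, side-effecting import that registers them
               (importing deepmd.pt loads libdeepmd_op_pt.so in the real code).
    """
    if ops_registered():
        return
    try:
        bootstrap()
    except Exception:
        # If the bootstrap import itself fails, fall through to the
        # clearer RuntimeError below.
        pass
    if not ops_registered():
        raise RuntimeError(
            "ops are not registered; build the op library and ensure "
            "its module is imported before this one"
        )


# Simulated DDP worker: ops absent until the bootstrap import runs.
state = {"loaded": False}
check_underlying_ops_loaded(
    lambda: state["loaded"],
    lambda: state.update(loaded=True),
)
assert state["loaded"]
```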
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
deepmd/pt_expt/utils/comm.py (1)

59-65: 💤 Low value

Consider logging the swallowed exception for debuggability.

The intent to fall through to a clearer RuntimeError is sound, but completely discarding the original exception loses diagnostic information when deepmd.pt fails to import for non-obvious reasons.

♻️ Optional: log the exception before passing
+import logging
+
+log = logging.getLogger(__name__)
+
 ...
         try:
             import deepmd.pt  # noqa: F401
-        except Exception:
+        except Exception as e:
             # If deepmd.pt itself fails to import, fall through to the
             # explicit RuntimeError below — clearer than re-raising a
             # potentially-unrelated import error.
+            log.debug("Failed to import deepmd.pt: %s", e)
             pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt_expt/utils/comm.py` around lines 59 - 65, The except block that
swallows import errors for "import deepmd.pt" should log the caught exception
before falling through to the intended RuntimeError; add a module logger
(logging.getLogger(__name__)) if one doesn’t exist and call
logger.exception(...) or logger.error(..., exc_info=True) inside the except in
comm.py so the original traceback is retained while preserving the current
behavior of raising the clearer RuntimeError later.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7c3534dd-c387-44d4-b633-46145c68bbfe

📥 Commits

Reviewing files that changed from the base of the PR and between 3c9ee65 and 87c9f3f.

📒 Files selected for processing (1)
  • deepmd/pt_expt/utils/comm.py

After 87c9f3f ``deepmd.pt_expt.utils.comm`` self-bootstraps
``libdeepmd_op_pt.so`` via ``_check_underlying_ops_loaded()``, so the
explicit ``import deepmd.pt`` preloads in conftest.py and
test_border_op_backward.py are no longer needed.

Closes 2 of the 13 GitHub Advanced Security CodeQL "unused import"
alerts on the PR. The remaining 5 Python alerts (other tests'
``import deepmd.pt_expt.utils.comm`` for opaque-op registration) and
6 C++ alerts (TORCH_LIBRARY_* / border_op_export reachable only
through macro-expanded static initialization) are CodeQL false
positives that need to be dismissed in the GitHub Security UI rather
than fixed in source.
@wanghan-iapcm wanghan-iapcm added the Test CUDA Trigger test CUDA workflow label May 2, 2026
@github-actions github-actions Bot removed the Test CUDA Trigger test CUDA workflow label May 2, 2026
Applies the substantive coderabbitai suggestions from the PR review.

Defensive guards (no behavioral change for existing callers):
- dpmodel/descriptor/{repflows,repformers}.py: raise ValueError when
  the default `_exchange_ghosts` is hit with `mapping_tiled=None` and
  `use_loc_mapping=False` instead of returning a cryptic
  array-backend error.
- pt_expt/descriptor/{repflows,repformers}.py: refuse `comm_dict` path
  when `nf != 1`. The squeeze(0)/unsqueeze(0) dance only works for a
  single frame; failing here surfaces the unsupported case loudly
  instead of producing a malformed border_op tensor.

Init robustness:
- api_cc/src/{DeepPotPTExpt,DeepSpinPTExpt}.cc: wrap the optional
  with-comm artifact load in try/catch. If `has_comm_artifact` is set
  in metadata but the nested artifact fails to extract or compile,
  log and fall back to single-rank-only dispatch instead of aborting
  init -- the hard error then surfaces only when multi-rank actually
  needs the missing artifact.

Code hygiene:
- dpmodel/descriptor/hybrid.py: rename unused unpacks (`g2/h2/sw` ->
  `_g2/_h2/_sw`) for ruff RUF059 cleanliness.
- tests/infer/gen_dpa3.py: deepcopy `config_mpi` before passing to
  `get_model()` so `data_mpi["model_def_script"]` retains the
  intended MPI export config even if the call mutates its argument.
- tests/pt_expt/model/test_export_with_comm.py: mirror the zero-ghost
  clamp from `serialization.py::_make_comm_sample_inputs` in the test
  helper, so no zero-length sendlist pointer is ever materialised.
  Also update `extra/...` -> `model/extra/...` archive paths to match
  PT2_EXTRA_PREFIX after the upstream/master merge.

Verified locally: pt_expt python (24/24), ctest (3/3, 498 tests
including 198 PtExpt), LAMMPS multi-rank GNN (19/19) all green.
@codecov

codecov Bot commented May 2, 2026

Codecov Report

❌ Patch coverage is 76.78161% with 101 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.48%. Comparing base (a094624) to head (a429fc9).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| source/api_cc/src/DeepSpinPTExpt.cc | 61.53% | 14 Missing and 6 partials ⚠️ |
| deepmd/pt_expt/utils/serialization.py | 77.38% | 19 Missing ⚠️ |
| source/api_cc/src/DeepPotPTExpt.cc | 61.22% | 13 Missing and 6 partials ⚠️ |
| source/api_cc/src/commonPTExpt.h | 67.24% | 14 Missing and 5 partials ⚠️ |
| deepmd/pt_expt/descriptor/repformers.py | 66.66% | 9 Missing ⚠️ |
| deepmd/pt_expt/utils/comm.py | 76.00% | 6 Missing ⚠️ |
| deepmd/dpmodel/descriptor/make_base_descriptor.py | 50.00% | 1 Missing ⚠️ |
| deepmd/dpmodel/descriptor/repflows.py | 90.90% | 1 Missing ⚠️ |
| deepmd/dpmodel/descriptor/repformers.py | 90.00% | 1 Missing ⚠️ |
| deepmd/dpmodel/descriptor/se_r.py | 50.00% | 1 Missing ⚠️ |

... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5430      +/-   ##
==========================================
+ Coverage   82.46%   82.48%   +0.02%     
==========================================
  Files         824      827       +3     
  Lines       87589    87981     +392     
  Branches     4197     4218      +21     
==========================================
+ Hits        72229    72574     +345     
- Misses      14085    14132      +47     
  Partials     1275     1275              

☔ View full report in Codecov by Sentry.

@wanghan-iapcm wanghan-iapcm added the Test CUDA Trigger test CUDA workflow label May 2, 2026
@github-actions github-actions Bot removed the Test CUDA Trigger test CUDA workflow label May 2, 2026
Removes the private-attribute fishing in
``deepmd/pt_expt/utils/serialization.py`` (which read
``descriptor.repflows.use_loc_mapping`` and friends) and replaces it
with a public method on ``BaseDescriptor``:
``has_message_passing_across_ranks()``.

Why
---
The old helper conflated two questions:

1. "Is this a GNN-style descriptor?" (existing ``has_message_passing()``)
2. "Do per-layer node embeddings need MPI exchange across rank
   boundaries to be correct under multi-rank LAMMPS?"

Only #2 governs whether to compile a with-comm AOTI artifact. The old
function answered #2 by special-casing the ``repflows``/``repformers``
attribute names and ``use_loc_mapping`` flag — silent breakage on any
rename and never recursing into hybrid wrappers (Tier-1 #1 in the
gnn_mpi_untested_paths catalog).

Note: every LAMMPS pair_style already exchanges ghost-atom *coords and
forces* via the standard pair-style comm topology — that's not GNN-
specific. The new method asks specifically about per-layer atomic
feature exchange (the ``node_ebd`` tensor that flows between message-
passing layers), which is the actual concern that gates the with-comm
artifact.

How
---
``BaseDescriptor.has_message_passing_across_ranks()`` returns ``False``
by default. GNN paths override:

- ``DescrptBlockRepflows``: ``not self.use_loc_mapping``
- ``DescrptBlockRepformers``: ``True`` (no ``use_loc_mapping`` opt-out
  exists)
- ``DescrptDPA3`` / ``DescrptDPA2``: delegate to their block
- ``DescrptHybrid``: ``any(child.has_message_passing_across_ranks() ...)``
  (closes the structural side of catalog Tier-1 #1)

Non-GNN dpmodel descriptors (``se_e2_a``, ``se_r``, ``se_t``,
``se_t_tebd``, ``dpa1``) get explicit ``return False`` overrides
pinning the contract; pt and pd backend descriptors inherit the
default (no edits needed there).

The serialization helper ``_has_message_passing`` is renamed to
``_needs_with_comm_artifact`` and just calls
``descriptor.has_message_passing_across_ranks()``. The metadata key
``has_message_passing`` is dropped from the .pt2 archive (C++ readers
only consume ``has_comm_artifact``).
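
The override structure reads like this in outline. These are stand-in classes mirroring the list above; the real defaults live on ``BaseDescriptor`` and the DPA blocks:

```python
class BaseDescriptorSketch:
    def has_message_passing_across_ranks(self) -> bool:
        return False  # concrete default: non-GNN descriptors inherit this


class RepflowsBlockSketch(BaseDescriptorSketch):
    def __init__(self, use_loc_mapping: bool):
        self.use_loc_mapping = use_loc_mapping

    def has_message_passing_across_ranks(self) -> bool:
        return not self.use_loc_mapping


class RepformersBlockSketch(BaseDescriptorSketch):
    def has_message_passing_across_ranks(self) -> bool:
        return True  # no use_loc_mapping opt-out exists


class DPA3Sketch(BaseDescriptorSketch):
    def __init__(self, repflows):
        self.repflows = repflows

    def has_message_passing_across_ranks(self) -> bool:
        return self.repflows.has_message_passing_across_ranks()  # delegate


class HybridSketch(BaseDescriptorSketch):
    def __init__(self, children):
        self.children = children

    def has_message_passing_across_ranks(self) -> bool:
        return any(c.has_message_passing_across_ranks() for c in self.children)


assert not BaseDescriptorSketch().has_message_passing_across_ranks()
assert not RepflowsBlockSketch(use_loc_mapping=True).has_message_passing_across_ranks()
assert HybridSketch(
    [BaseDescriptorSketch(), RepformersBlockSketch()]
).has_message_passing_across_ranks()
```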

Per-descriptor tests
--------------------
The standalone ``source/tests/pt_expt/utils/test_has_message_passing.py``
is deleted; per-descriptor coverage of *both* APIs is added to existing
descriptor test files at ``source/tests/pt_expt/descriptor/``:

| File         | has_message_passing | has_message_passing_across_ranks |
|--------------|---------------------|----------------------------------|
| se_e2_a      | False               | False                            |
| dpa1         | False               | False                            |
| dpa3         | True                | not use_loc_mapping              |
| dpa2         | True                | True                             |
| hybrid       | depends on child    | True if any child needs it       |

Bonus: also includes a CUDA segfault fix
----------------------------------------
While running the post-refactor verification, the CUDA-runner CI
exposed a latent bug in ``source/op/pt/comm.cc`` (forward + backward
kernels): when built with ``USE_MPI`` but invoked single-rank
(world_size==0), ``cuda_aware`` defaults to 0 and the CPU-fallback
``recv_g1_tensor.to(kCPU)`` block (guarded by ``world_size >= 1``) is
skipped — the tensor stays on CUDA. The inner self-send branch then
did host ``memcpy`` on what were still CUDA pointers and segfaulted.
Fix: gate the host-memcpy / CPU-copy-back paths on
``world_size >= 1 && cuda_aware == 0`` so single-rank deployments
correctly use ``gpuMemcpy DeviceToDevice``. Mirrored in three sites
(forward inner, forward post-loop, backward inner, backward post-loop).
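
The faulty versus fixed branch predicate, isolated as a sketch. The ``world_size`` and ``cuda_aware`` semantics are as described in the commit message above; the real code is C++ in source/op/pt/comm.cc:

```python
def uses_host_memcpy_buggy(world_size: int, cuda_aware: int) -> bool:
    # Old gating: single-rank (world_size == 0) with cuda_aware == 0 wrongly
    # took the host-memcpy path while the tensors were still on CUDA,
    # because the CPU-fallback .to(kCPU) copy (guarded by world_size >= 1)
    # never ran.
    return cuda_aware == 0


def uses_host_memcpy_fixed(world_size: int, cuda_aware: int) -> bool:
    # Fixed gating: take the host path only when the CPU-fallback copy
    # actually happened; otherwise stay on device-to-device copies.
    return world_size >= 1 and cuda_aware == 0


# Single-rank (world_size == 0), non-cuda-aware build:
assert uses_host_memcpy_buggy(0, 0)      # segfault path in the old code
assert not uses_host_memcpy_fixed(0, 0)  # stays on the device copy path
```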

Float32 multi-rank fixture + test
---------------------------------
Adds ``test_lammps_dpa3_pt2_fp32.py`` and a paired
``deeppot_dpa3_mpi_fp32.pt2`` fixture (gen_dpa3.py addition). Validates
that the comm_dict path is dtype-agnostic in practice (template
dispatch on ``g1.dtype()``, ``register_fake``'s ``empty_like(g1)``,
and ``MPI_FLOAT`` exchange) — not just by inspection. Compares mpi-2
vs same-archive mpi-1 with float32-appropriate tolerances (atol 1e-4 /
rel 1e-3 for force/virial; rel 1e-5 for energy).
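
The tolerance criterion, expressed as a scalar check. This is a pure-Python stand-in for the test's array comparison; the atol/rtol values come from the paragraph above:

```python
def close(a: float, b: float, atol: float = 0.0, rtol: float = 0.0) -> bool:
    """numpy.allclose-style elementwise criterion: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


# float32-appropriate tolerances from the fixture (illustrative values):
force_mpi2, force_mpi1 = 1.23456, 1.23461
assert close(force_mpi2, force_mpi1, atol=1e-4, rtol=1e-3)  # force/virial
energy_mpi2, energy_mpi1 = -100.000, -100.0005
assert close(energy_mpi2, energy_mpi1, rtol=1e-5)  # energy: relative 1e-5
```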

Verified locally (CPU build): pt_expt python 965 passed / 32 skipped,
ctest 3/3 (498 C++ tests), LAMMPS multi-rank 20/20 (DPA3 + DPA2 +
spin DPA3 + DPA3 fp32).

Trade-off note
--------------
The plan called for ``has_message_passing_across_ranks()`` to be
abstract on ``BaseDescriptor`` (mirroring ``has_message_passing``).
Implementing that requires touching all 49 subclasses across pt and
pd backends — well outside the scope of "GNN MPI for pt_expt". Kept
the method concrete with a ``return False`` default; pt and pd
backend descriptors inherit that. They can override later if they
grow a multi-rank GNN path of their own.
@pytest.mark.parametrize(
    "child_factory,expected_hmp,expected_hmp_ar",
    [
        (lambda: _se_e2_a_child(), False, False),
@wanghan-iapcm wanghan-iapcm added the Test CUDA Trigger test CUDA workflow label May 2, 2026
@github-actions github-actions Bot removed the Test CUDA Trigger test CUDA workflow label May 2, 2026
2 participants