fix: cover mixed types in stat sampling by njzjz-bot · Pull Request #5569 · deepmodeling/deepmd-kit

njzjz-bot · 2026-06-21T09:38:56Z

Summary

augment mixed-type statistics sampling so sampled frames cover every atom type present in the dataset
apply the coverage fix to both PyTorch stat packing and backend-agnostic/common stat packing used by TF-style paths
add regression tests for PT and common stat sampling helpers

Fixes #4424.

Tests

uvx ruff check deepmd/pt/utils/stat.py deepmd/utils/model_stat.py source/tests/pt/test_observed_type.py source/tests/common/test_model_stat.py
uvx ruff format deepmd/pt/utils/stat.py deepmd/utils/model_stat.py source/tests/pt/test_observed_type.py source/tests/common/test_model_stat.py --check
python3 -m py_compile deepmd/pt/utils/stat.py deepmd/utils/model_stat.py source/tests/pt/test_observed_type.py source/tests/common/test_model_stat.py
git diff --check

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Summary by CodeRabbit

Improvements
- Enhanced statistics collection for mixed-type atomic datasets to ensure all atom types present in the full dataset are represented in sampled batches, improving overall coverage and representativeness.
Tests
- Added comprehensive test coverage validating mixed-type dataset statistics sampling behavior and proper representation of all atom types.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

coderabbitai · 2026-06-21T09:43:18Z

📝 Walkthrough

Walkthrough

Two new helper functions (_mixed_type_coverage and _append_missing_type_frames) are added to both the PyTorch (deepmd/pt/utils/stat.py) and common (deepmd/utils/model_stat.py) stat-collection paths. They detect atom types present in mixed-type datasets but absent from sampled batches, then fetch and append representative frames until full type coverage is achieved. make_stat_input is refactored to use a new _append_stat_data helper for per-batch accumulation.

Changes

Mixed-type atom-type coverage in stat collection

Layer / File(s)	Summary
Entry-point refactoring and integration call sites `deepmd/pt/utils/stat.py`, `deepmd/utils/model_stat.py`	`make_stat_input` delegates per-batch accumulation to new `_append_stat_data` and calls `_append_missing_type_frames` before finalising each system's stat dict. `collect_batches` gains a single call to the common-backend helper before merging `sys_stat` into `all_stat`.
`_mixed_type_coverage` and `_append_missing_type_frames` (both backends) `deepmd/pt/utils/stat.py`, `deepmd/utils/model_stat.py`	`_mixed_type_coverage` loads `real_atom_types.npy` across dataset directories with optional `enforce_type_map` remapping and returns per-type counts plus the first representative frame index per type. `_append_missing_type_frames` iteratively appends extra frames from those indices until sampled counts cover all types or no further frames can be selected.
Unit tests for both backends `source/tests/common/test_model_stat.py`, `source/tests/pt/test_observed_type.py`	Backend-agnostic test exercises `make_stat_input` end-to-end with fake mixed-data objects and asserts positive per-type counts and correct energy array length. PyTorch test directly calls `_append_missing_type_frames` through a fake mixed-type data system and verifies `real_natoms_vec` growth and all-positive per-type counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Python

Suggested reviewers

wanghan-iapcm
njzjz

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: fixing mixed-type coverage in statistical sampling, which is the primary objective of this PR.
Linked Issues check	✅ Passed	The PR fully implements the requirements from issue `#4424`: it calculates per-type atom counts, caches representative frame indices, and supplements sampled frames with missing types during statistics computation.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to addressing mixed-type coverage in statistics sampling; no unrelated modifications were introduced.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

deepmd/utils/model_stat.py (1)

112-116: 💤 Low value

Redundant type cast on line 116.

Line 112 already casts natoms_vec to np.int32, so the conditional cast on line 116 is redundant.

♻️ Remove redundant cast

         extra_batch["natoms_vec"] = data.natoms_vec[sys_idx].astype(np.int32)
         extra_batch["default_mesh"] = data.default_mesh[sys_idx]
         for key, value in extra_batch.items():
-            if key == "natoms_vec":
-                value = value.astype(np.int32)
             if (
                 key not in {"natoms_vec", "default_mesh"}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/model_stat.py` around lines 112 - 116, The conditional block
within the loop that iterates through extra_batch.items() contains a redundant
type cast. Since natoms_vec is already cast to np.int32 when assigned to
extra_batch on line 112, the additional astype(np.int32) call in the conditional
block checking for key == "natoms_vec" is unnecessary. Remove the redundant
conditional cast block (the if statement checking for natoms_vec and the
associated value assignment) from the loop.

deepmd/pt/utils/stat.py (1)

147-159: 💤 Low value

Loop terminates early if first missing type's frame is already used.

When frame_idx in used_frames, the loop breaks immediately instead of trying another missing type. This is safe but could leave some types uncovered if a single frame covers multiple types. Since this is a conservative approach and the fix addresses the main issue (ensuring at least one representative frame per type is attempted), this is acceptable.

♻️ Alternative: continue to next missing type instead of breaking

     while len(missing_types) > 0:
         frame_idx = first_frame_for_type[int(missing_types[0])]
         if frame_idx < 0 or frame_idx in used_frames:
-            break
+            # This type's first frame is invalid or already fetched; remove from missing
+            # and continue. The frame may have covered multiple types, so recompute.
+            missing_types = missing_types[1:]
+            continue
         used_frames.add(frame_idx)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt/utils/stat.py` around lines 147 - 159, The loop that iterates over
missing_types breaks prematurely when the frame_idx for the first missing type
is already in used_frames, which prevents searching for representative frames
for other missing types. Instead of breaking when this condition occurs, the
loop should continue to the next iteration to attempt finding an unused frame
for other missing types. This ensures all type coverage is attempted rather than
giving up at the first already-used frame.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@deepmd/pt/utils/stat.py`:
- Around line 147-159: The loop that iterates over missing_types breaks
prematurely when the frame_idx for the first missing type is already in
used_frames, which prevents searching for representative frames for other
missing types. Instead of breaking when this condition occurs, the loop should
continue to the next iteration to attempt finding an unused frame for other
missing types. This ensures all type coverage is attempted rather than giving up
at the first already-used frame.

In `@deepmd/utils/model_stat.py`:
- Around line 112-116: The conditional block within the loop that iterates
through extra_batch.items() contains a redundant type cast. Since natoms_vec is
already cast to np.int32 when assigned to extra_batch on line 112, the
additional astype(np.int32) call in the conditional block checking for key ==
"natoms_vec" is unnecessary. Remove the redundant conditional cast block (the if
statement checking for natoms_vec and the associated value assignment) from the
loop.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5ac212f0-e4d6-4d6c-be5a-e2b78e3b8e5b

📥 Commits

Reviewing files that changed from the base of the PR and between c5c57d6 and 0b2c374.

📒 Files selected for processing (4)

deepmd/pt/utils/stat.py
deepmd/utils/model_stat.py
source/tests/common/test_model_stat.py
source/tests/pt/test_observed_type.py

codecov · 2026-06-21T11:08:37Z

Codecov Report

❌ Patch coverage is 84.07080% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.17%. Comparing base (c5c57d6) to head (0b2c374).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
deepmd/pt/utils/stat.py	77.77%	14 Missing ⚠️
deepmd/utils/model_stat.py	92.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5569      +/-   ##
==========================================
- Coverage   82.18%   82.17%   -0.02%     
==========================================
  Files         898      898              
  Lines      103576   103676     +100     
  Branches     4434     4434              
==========================================
+ Hits        85129    85191      +62     
- Misses      17056    17090      +34     
- Partials     1391     1395       +4

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

njzjz

LGTM

— Opus 4.8

fix: cover mixed types in stat sampling

0b2c374

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

dosubot Bot added the bug label Jun 21, 2026

coderabbitai Bot reviewed Jun 21, 2026

View reviewed changes

github-actions Bot added the Python label Jun 21, 2026

njzjz reviewed Jun 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: cover mixed types in stat sampling#5569

fix: cover mixed types in stat sampling#5569
njzjz-bot wants to merge 1 commit into
deepmodeling:masterfrom
njzjz-bot:fix-4424-bias-stat-sampling

njzjz-bot commented Jun 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 21, 2026

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented Jun 21, 2026 •

edited

Loading

Uh oh!

njzjz left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

njzjz-bot commented Jun 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 21, 2026

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

njzjz left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

njzjz-bot commented Jun 21, 2026 •

edited by coderabbitai Bot

Loading

codecov Bot commented Jun 21, 2026 •

edited

Loading