Skip to content

fix: cover mixed types in stat sampling#5569

Open
njzjz-bot wants to merge 1 commit into
deepmodeling:masterfrom
njzjz-bot:fix-4424-bias-stat-sampling
Open

fix: cover mixed types in stat sampling#5569
njzjz-bot wants to merge 1 commit into
deepmodeling:masterfrom
njzjz-bot:fix-4424-bias-stat-sampling

Conversation

@njzjz-bot

@njzjz-bot njzjz-bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

  • augment mixed-type statistics sampling so sampled frames cover every atom type present in the dataset
  • apply the coverage fix to both PyTorch stat packing and backend-agnostic/common stat packing used by TF-style paths
  • add regression tests for PT and common stat sampling helpers

Fixes #4424.

Tests

  • uvx ruff check deepmd/pt/utils/stat.py deepmd/utils/model_stat.py source/tests/pt/test_observed_type.py source/tests/common/test_model_stat.py
  • uvx ruff format deepmd/pt/utils/stat.py deepmd/utils/model_stat.py source/tests/pt/test_observed_type.py source/tests/common/test_model_stat.py --check
  • python3 -m py_compile deepmd/pt/utils/stat.py deepmd/utils/model_stat.py source/tests/pt/test_observed_type.py source/tests/common/test_model_stat.py
  • git diff --check

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Summary by CodeRabbit

  • Improvements

    • Enhanced statistics collection for mixed-type atomic datasets to ensure all atom types present in the full dataset are represented in sampled batches, improving overall coverage and representativeness.
  • Tests

    • Added comprehensive test coverage validating mixed-type dataset statistics sampling behavior and proper representation of all atom types.

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)
@dosubot dosubot Bot added the bug label Jun 21, 2026
@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Two new helper functions (_mixed_type_coverage and _append_missing_type_frames) are added to both the PyTorch (deepmd/pt/utils/stat.py) and common (deepmd/utils/model_stat.py) stat-collection paths. They detect atom types present in mixed-type datasets but absent from sampled batches, then fetch and append representative frames until full type coverage is achieved. make_stat_input is refactored to use a new _append_stat_data helper for per-batch accumulation.

Changes

Mixed-type atom-type coverage in stat collection

Layer / File(s) Summary
Entry-point refactoring and integration call sites
deepmd/pt/utils/stat.py, deepmd/utils/model_stat.py
make_stat_input delegates per-batch accumulation to new _append_stat_data and calls _append_missing_type_frames before finalising each system's stat dict. collect_batches gains a single call to the common-backend helper before merging sys_stat into all_stat.
_mixed_type_coverage and _append_missing_type_frames (both backends)
deepmd/pt/utils/stat.py, deepmd/utils/model_stat.py
_mixed_type_coverage loads real_atom_types.npy across dataset directories with optional enforce_type_map remapping and returns per-type counts plus the first representative frame index per type. _append_missing_type_frames iteratively appends extra frames from those indices until sampled counts cover all types or no further frames can be selected.
Unit tests for both backends
source/tests/common/test_model_stat.py, source/tests/pt/test_observed_type.py
Backend-agnostic test exercises make_stat_input end-to-end with fake mixed-data objects and asserts positive per-type counts and correct energy array length. PyTorch test directly calls _append_missing_type_frames through a fake mixed-type data system and verifies real_natoms_vec growth and all-positive per-type counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Python

Suggested reviewers

  • wanghan-iapcm
  • njzjz
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fixing mixed-type coverage in statistical sampling, which is the primary objective of this PR.
Linked Issues check ✅ Passed The PR fully implements the requirements from issue #4424: it calculates per-type atom counts, caches representative frame indices, and supplements sampled frames with missing types during statistics computation.
Out of Scope Changes check ✅ Passed All changes are directly scoped to addressing mixed-type coverage in statistics sampling; no unrelated modifications were introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
deepmd/utils/model_stat.py (1)

112-116: 💤 Low value

Redundant type cast on line 116.

Line 112 already casts natoms_vec to np.int32, so the conditional cast on line 116 is redundant.

♻️ Remove redundant cast
         extra_batch["natoms_vec"] = data.natoms_vec[sys_idx].astype(np.int32)
         extra_batch["default_mesh"] = data.default_mesh[sys_idx]
         for key, value in extra_batch.items():
-            if key == "natoms_vec":
-                value = value.astype(np.int32)
             if (
                 key not in {"natoms_vec", "default_mesh"}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/utils/model_stat.py` around lines 112 - 116, The conditional block
within the loop that iterates through extra_batch.items() contains a redundant
type cast. Since natoms_vec is already cast to np.int32 when assigned to
extra_batch on line 112, the additional astype(np.int32) call in the conditional
block checking for key == "natoms_vec" is unnecessary. Remove the redundant
conditional cast block (the if statement checking for natoms_vec and the
associated value assignment) from the loop.
deepmd/pt/utils/stat.py (1)

147-159: 💤 Low value

Loop terminates early if first missing type's frame is already used.

When frame_idx in used_frames, the loop breaks immediately instead of trying another missing type. This is safe but could leave some types uncovered if a single frame covers multiple types. Since this is a conservative approach and the fix addresses the main issue (ensuring at least one representative frame per type is attempted), this is acceptable.

♻️ Alternative: continue to next missing type instead of breaking
     while len(missing_types) > 0:
         frame_idx = first_frame_for_type[int(missing_types[0])]
         if frame_idx < 0 or frame_idx in used_frames:
-            break
+            # This type's first frame is invalid or already fetched; remove from missing
+            # and continue. The frame may have covered multiple types, so recompute.
+            missing_types = missing_types[1:]
+            continue
         used_frames.add(frame_idx)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pt/utils/stat.py` around lines 147 - 159, The loop that iterates over
missing_types breaks prematurely when the frame_idx for the first missing type
is already in used_frames, which prevents searching for representative frames
for other missing types. Instead of breaking when this condition occurs, the
loop should continue to the next iteration to attempt finding an unused frame
for other missing types. This ensures all type coverage is attempted rather than
giving up at the first already-used frame.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@deepmd/pt/utils/stat.py`:
- Around line 147-159: The loop that iterates over missing_types breaks
prematurely when the frame_idx for the first missing type is already in
used_frames, which prevents searching for representative frames for other
missing types. Instead of breaking when this condition occurs, the loop should
continue to the next iteration to attempt finding an unused frame for other
missing types. This ensures all type coverage is attempted rather than giving up
at the first already-used frame.

In `@deepmd/utils/model_stat.py`:
- Around line 112-116: The conditional block within the loop that iterates
through extra_batch.items() contains a redundant type cast. Since natoms_vec is
already cast to np.int32 when assigned to extra_batch on line 112, the
additional astype(np.int32) call in the conditional block checking for key ==
"natoms_vec" is unnecessary. Remove the redundant conditional cast block (the if
statement checking for natoms_vec and the associated value assignment) from the
loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5ac212f0-e4d6-4d6c-be5a-e2b78e3b8e5b

📥 Commits

Reviewing files that changed from the base of the PR and between c5c57d6 and 0b2c374.

📒 Files selected for processing (4)
  • deepmd/pt/utils/stat.py
  • deepmd/utils/model_stat.py
  • source/tests/common/test_model_stat.py
  • source/tests/pt/test_observed_type.py

@codecov

codecov Bot commented Jun 21, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.07080% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.17%. Comparing base (c5c57d6) to head (0b2c374).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
deepmd/pt/utils/stat.py 77.77% 14 Missing ⚠️
deepmd/utils/model_stat.py 92.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5569      +/-   ##
==========================================
- Coverage   82.18%   82.17%   -0.02%     
==========================================
  Files         898      898              
  Lines      103576   103676     +100     
  Branches     4434     4434              
==========================================
+ Hits        85129    85191      +62     
- Misses      17056    17090      +34     
- Partials     1391     1395       +4     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@njzjz njzjz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

— Opus 4.8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Incomplete and risky bias statistics

2 participants