[https://nvbugs/6329227][fix] Use pkgutil.extend_path to merge the two flash_attn distributions before…#15498
[https://nvbugs/6329227][fix] Use pkgutil.extend_path to merge the two flash_attn distributions before…#15498tensorrt-cicd wants to merge 2 commits into
Conversation
The integration test test_wan22_t2v_lpips_against_golden_tp[*] failed in post-merge CI with `ModuleNotFoundError: No module named 'tensorrt_llm.bindings'` raised inside a torch.multiprocessing.spawn worker. The worker reaches `tensorrt_llm/_torch/visual_gen/mapping.py: 246` (`from tensorrt_llm.bindings.internal.process_group import init_pg`) during VisualGenMapping.setup_communicators(), which the C++ allreduce path requires for tp_size > 1. mp.spawn launches a fresh interpreter that inherits env vars but does not replay the parent's runtime sys.path mutations. In the failing CI image, the compiled tensorrt_llm.bindings*.so is reachable from the parent through a sys.path entry that the child does not rebuild, so the import fails in the worker only. The matching non-TP variants (test_wan22_t2v_lpips_against_golden_multi_gpu[*]) skip setup_communicators() and therefore never hit this import. Propagate the parent's resolved sys.path through PYTHONPATH before mp.spawn so the spawn child sees the same import roots the parent used and init_pg resolves identically in both processes. Removes the three nvbugs/6329227 waivers added for these variants. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
…ualGen test relocation The TP variants of test_wan22_t2v_lpips_against_golden_tp execute the FA4 attention backend, which imports flash_attn.cute.* from the flash-attn-4 wheel. When the legacy flash-attn (v2) wheel is co-installed and ships its own __init__.py, Python pins flash_attn.__path__ to the v2 location and never sees the cute/ subpackage. Use pkgutil.extend_path so flash_attn behaves as a namespace-extending package and merges both installs. Additionally, commit d202244 moved test_visual_gen.py one directory deeper but did not update its REPO_ROOT computation, so the LPIPS eval script lookup pointed at /code/tensorrt_llm/tests instead of the repo root. Add the missing '..' so the eval script resolves correctly. Signed-off-by: Repair Bot <noreply@nvidia.com> Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
📝 WalkthroughWalkthroughAdds a ChangesVisual gen flash-attn namespace fix and test env repairs
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integration/defs/examples/visual_gen/test_visual_gen_multi_gpu.py`:
- Around line 112-117: The code modifies os.environ["PYTHONPATH"] on line 115 to
set up the environment for the spawned workers but never restores the original
value after mp.spawn() completes, which can cause subsequent tests to inherit
this modified state and become order-dependent. Save the original PYTHONPATH
value before setting it (handle the case where it may not exist), then restore
it after the mp.spawn() call finishes to ensure the environment is clean for
subsequent tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f1a13485-c43c-4694-83ef-fc78270d1d05
⛔ Files ignored due to path filters (1)
tests/integration/defs/examples/visual_gen/golden/visual_gen_lpips/visual_gen_lpips_golden_media.zipis excluded by!**/*.zip
📒 Files selected for processing (6)
tensorrt_llm/_torch/visual_gen/attention_backend/flash_attn4.pytensorrt_llm/_torch/visual_gen/attention_backend/parallel.pytensorrt_llm/_torch/visual_gen/attention_backend/utils.pytests/integration/defs/examples/visual_gen/test_visual_gen.pytests/integration/defs/examples/visual_gen/test_visual_gen_multi_gpu.pytests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
| # mp.spawn starts a fresh interpreter that does not inherit the parent's | ||
| # runtime sys.path mutations; without this, the child cannot import | ||
| # tensorrt_llm.bindings.internal.process_group (needed by the tp_size>1 | ||
| # C++ allreduce path). | ||
| os.environ["PYTHONPATH"] = os.pathsep.join(filter(None, sys.path)) | ||
| mp.spawn( |
There was a problem hiding this comment.
Restore PYTHONPATH after spawning workers.
Line 116 mutates process-global environment state and never restores it; later tests can inherit this synthetic value and become order-dependent/flaky.
🔧 Proposed fix
- os.environ["PYTHONPATH"] = os.pathsep.join(filter(None, sys.path))
- mp.spawn(
- _distributed_worker,
- args=(world_size, backend, test_fn, port, kwargs),
- nprocs=world_size,
- join=True,
- )
+ previous_pythonpath = os.environ.get("PYTHONPATH")
+ os.environ["PYTHONPATH"] = os.pathsep.join(map(str, filter(None, sys.path)))
+ try:
+ mp.spawn(
+ _distributed_worker,
+ args=(world_size, backend, test_fn, port, kwargs),
+ nprocs=world_size,
+ join=True,
+ )
+ finally:
+ if previous_pythonpath is None:
+ os.environ.pop("PYTHONPATH", None)
+ else:
+ os.environ["PYTHONPATH"] = previous_pythonpath🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/integration/defs/examples/visual_gen/test_visual_gen_multi_gpu.py`
around lines 112 - 117, The code modifies os.environ["PYTHONPATH"] on line 115
to set up the environment for the spawned workers but never restores the
original value after mp.spawn() completes, which can cause subsequent tests to
inherit this modified state and become order-dependent. Save the original
PYTHONPATH value before setting it (handle the case where it may not exist),
then restore it after the mp.spawn() call finishes to ensure the environment is
clean for subsequent tests.
Summary
Test plan
Links
Summary by CodeRabbit
Bug Fixes
Tests