Skip to content

[None][fix] Make NIXL port-lock path configurable via TRTLLM_NIXL_PORT_LOCK_PATH#15500

Open
CodersAcademy006 wants to merge 3 commits into
NVIDIA:mainfrom
CodersAcademy006:fix/nixl-port-lock-path
Open

[None][fix] Make NIXL port-lock path configurable via TRTLLM_NIXL_PORT_LOCK_PATH#15500
CodersAcademy006 wants to merge 3 commits into
NVIDIA:mainfrom
CodersAcademy006:fix/nixl-port-lock-path

Conversation

@CodersAcademy006

@CodersAcademy006 CodersAcademy006 commented Jun 19, 2026

Copy link
Copy Markdown

Description

NixlTransferAgent hardcodes the port-selection file lock at /tmp/trtllm_nixl_port.lock. On a shared /tmp (multi-tenant hosts, colocated containers), independent deployments contend on the same lock, serializing the startup of otherwise-unrelated agents.

This adds getEnvNixlPortLockPath(), overridable via the TRTLLM_NIXL_PORT_LOCK_PATH env var and defaulting to the existing /tmp/trtllm_nixl_port.lock so behavior is unchanged unless the var is set. The lock still serializes port selection among agents that share the path, preserving the coordination it exists for — operators just scope it per deployment to avoid cross-tenant contention.

An env var (rather than a config field) keeps parity with the existing NIXL env helpers (TRTLLM_NIXL_PORT, TRTLLM_NIXL_INTERFACE, …) and is a deployment-time ops knob, not a model parameter.

Fixes #13950

Changes

  • common/envUtils.{h,cpp}: add getEnvNixlPortLockPath() (same std::call_once pattern as the other NIXL env helpers), with a pure resolveNixlPortLockPath() underneath for testability — mirrors populateIterationIndexesImpl in cudaProfilerUtils.
  • nixl_utils/transferAgent.cpp: use it for the FileLock and the throw message (which now prints the resolved path).

Test Coverage

  • cpp/tests/unit_tests/common/envUtilsTest.cpp (new): resolveNixlPortLockPath returns the default on null and on set-but-empty, and the override verbatim when set.
  • Default path preserved → no behavior change without the env var; no functional CUDA path altered.

PR Checklist

  • Commits are DCO signed-off
  • Default behavior unchanged
  • Unit test added
  • clang-format (v16.0.0) clean

Summary by CodeRabbit

  • New Features

    • NIXL port lock path is now configurable through an environment variable, allowing custom lock file locations instead of using the default path.
  • Tests

    • Added unit tests validating lock path configuration and default fallback behavior.

NixlTransferAgent hardcodes the port-selection file lock at
/tmp/trtllm_nixl_port.lock. On a shared /tmp (multi-tenant hosts,
colocated containers) independent deployments contend on the same
lock, serializing unrelated agents' startup.

Add getEnvNixlPortLockPath(), overridable via TRTLLM_NIXL_PORT_LOCK_PATH,
defaulting to the existing path so behavior is unchanged unless set.
The lock still serializes port selection among agents that share the
path, preserving the coordination it exists for.

Fixes NVIDIA#13950

Signed-off-by: Srijan Upadhyay <srjnupadhyay@gmail.com>
@CodersAcademy006 CodersAcademy006 changed the title [fix] Make NIXL port-lock path configurable via TRTLLM_NIXL_PORT_LOCK_PATH [None][fix] Make NIXL port-lock path configurable via TRTLLM_NIXL_PORT_LOCK_PATH Jun 19, 2026
Split a pure resolveNixlPortLockPath() out of getEnvNixlPortLockPath()
so the resolution logic is testable without the once-cached env read,
mirroring the populateIterationIndexesImpl pattern in cudaProfilerUtils.

Covers the default (null), set-but-empty, and override cases.

Signed-off-by: Srijan Upadhyay <srjnupadhyay@gmail.com>
@CodersAcademy006 CodersAcademy006 marked this pull request as ready for review June 19, 2026 17:48
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds resolveNixlPortLockPath and getEnvNixlPortLockPath to envUtils, which read TRTLLM_NIXL_PORT_LOCK_PATH and fall back to /tmp/trtllm_nixl_port.lock when unset. NixlTransferAgent now calls getEnvNixlPortLockPath() instead of using the hardcoded string. Two unit tests validate the resolver's default and override behaviors.

Changes

Configurable NIXL Port Lock Path

Layer / File(s) Summary
Lock-path resolver: declaration, implementation, and call site
cpp/tensorrt_llm/common/envUtils.h, cpp/tensorrt_llm/common/envUtils.cpp, cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/transferAgent.cpp
Declares and implements resolveNixlPortLockPath (null/empty → default path, otherwise verbatim) and getEnvNixlPortLockPath (reads TRTLLM_NIXL_PORT_LOCK_PATH, caches result via std::once_flag). NixlTransferAgent's listen-thread path replaces the hardcoded lock-file string with common::getEnvNixlPortLockPath() and reports the dynamic path in the thrown error.
Unit tests for resolver
cpp/tests/unit_tests/common/CMakeLists.txt, cpp/tests/unit_tests/common/envUtilsTest.cpp
Registers envUtilsTest as a CMake GoogleTest target and adds two test cases: one for null/empty inputs resolving to the default lock path, one for a non-empty override being returned verbatim.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: making the NIXL port-lock path configurable via an environment variable, which aligns with the primary objective.
Description check ✅ Passed The PR description comprehensively addresses the problem, solution, test coverage, and aligns with the provided template structure with all critical sections completed.
Linked Issues check ✅ Passed The code changes directly address issue #13950 by implementing the suggested environment variable override for the lock path, maintaining backward compatibility while unblocking shared-/tmp scenarios.
Out of Scope Changes check ✅ Passed All changes are directly relevant to the stated objectives: new env-var helpers in envUtils, integration into transferAgent, and corresponding unit tests. No extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
cpp/tensorrt_llm/common/envUtils.cpp (1)

312-314: ⚡ Quick win

Extract the default lock path into a named constant.

Line 314 embeds the default path literal directly; please move it to a k-prefixed constant and reuse it in the resolver.

Proposed change
+namespace
+{
+char const* const kDefaultNixlPortLockPath = "/tmp/trtllm_nixl_port.lock";
+} // namespace
+
 std::string resolveNixlPortLockPath(char const* envValue)
 {
-    return (envValue && *envValue) ? std::string(envValue) : std::string("/tmp/trtllm_nixl_port.lock");
+    return (envValue && *envValue) ? std::string(envValue) : std::string(kDefaultNixlPortLockPath);
 }

As per coding guidelines, non-exempt literals should be extracted into named constants, and constants should use k-prefixed camelCase names.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/common/envUtils.cpp` around lines 312 - 314, The function
`resolveNixlPortLockPath` contains a hardcoded string literal
"/tmp/trtllm_nixl_port.lock" in its return statement on line 314. Extract this
literal into a module-level constant using the k-prefix naming convention (e.g.,
kDefaultNixlPortLockPath or similar) and replace the embedded string in the
function with a reference to this constant. Ensure the constant is defined
before the function and follows the k-prefixed camelCase naming pattern as per
coding guidelines.

Source: Coding guidelines

cpp/tests/unit_tests/common/envUtilsTest.cpp (1)

26-27: ⚡ Quick win

Use named constants for expected lock paths in assertions.

Please initialize the expected paths as k-prefixed constants and reference them in EXPECT_EQ calls instead of inline literals.

Proposed change
 TEST(EnvUtils, ResolveNixlPortLockPathDefault)
 {
+    std::string const kDefaultNixlPortLockPath = "/tmp/trtllm_nixl_port.lock";
     // Unset (nullptr) and set-but-empty both fall back to the default path.
-    EXPECT_EQ(tc::resolveNixlPortLockPath(nullptr), "/tmp/trtllm_nixl_port.lock");
-    EXPECT_EQ(tc::resolveNixlPortLockPath(""), "/tmp/trtllm_nixl_port.lock");
+    EXPECT_EQ(tc::resolveNixlPortLockPath(nullptr), kDefaultNixlPortLockPath);
+    EXPECT_EQ(tc::resolveNixlPortLockPath(""), kDefaultNixlPortLockPath);
 }
 
 TEST(EnvUtils, ResolveNixlPortLockPathOverride)
 {
+    std::string const kOverrideNixlPortLockPath = "/run/deployA/nixl_port.lock";
     // A non-empty value is used verbatim.
-    EXPECT_EQ(tc::resolveNixlPortLockPath("/run/deployA/nixl_port.lock"), "/run/deployA/nixl_port.lock");
+    EXPECT_EQ(tc::resolveNixlPortLockPath(kOverrideNixlPortLockPath.c_str()), kOverrideNixlPortLockPath);
 }

As per coding guidelines, non-exempt literals should be extracted into named constants, and constants should use k-prefixed camelCase names.

Also applies to: 33-33

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/unit_tests/common/envUtilsTest.cpp` around lines 26 - 27, The test
file contains hardcoded string literals "/tmp/trtllm_nixl_port.lock" repeated in
multiple EXPECT_EQ assertions when testing the resolveNixlPortLockPath function.
Extract this literal into a k-prefixed named constant (e.g.,
kExpectedNixlPortLockPath) at the top of the test file or test class, then
replace all inline string literals in the EXPECT_EQ calls with references to
this constant to follow the coding guidelines for non-exempt literals.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cpp/tensorrt_llm/common/envUtils.cpp`:
- Around line 312-314: The function `resolveNixlPortLockPath` contains a
hardcoded string literal "/tmp/trtllm_nixl_port.lock" in its return statement on
line 314. Extract this literal into a module-level constant using the k-prefix
naming convention (e.g., kDefaultNixlPortLockPath or similar) and replace the
embedded string in the function with a reference to this constant. Ensure the
constant is defined before the function and follows the k-prefixed camelCase
naming pattern as per coding guidelines.

In `@cpp/tests/unit_tests/common/envUtilsTest.cpp`:
- Around line 26-27: The test file contains hardcoded string literals
"/tmp/trtllm_nixl_port.lock" repeated in multiple EXPECT_EQ assertions when
testing the resolveNixlPortLockPath function. Extract this literal into a
k-prefixed named constant (e.g., kExpectedNixlPortLockPath) at the top of the
test file or test class, then replace all inline string literals in the
EXPECT_EQ calls with references to this constant to follow the coding guidelines
for non-exempt literals.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 194b5dc1-2adf-49eb-9a53-d556445c291a

📥 Commits

Reviewing files that changed from the base of the PR and between a76c818 and bf4d402.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/tensorrt_llm/common/envUtils.h
  • cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/transferAgent.cpp
  • cpp/tests/unit_tests/common/CMakeLists.txt
  • cpp/tests/unit_tests/common/envUtilsTest.cpp

Address review: replace the repeated /tmp/trtllm_nixl_port.lock literal
with kDefaultNixlPortLockPath (k-prefixed, per coding guidelines) in the
resolver, and use named constants in the test assertions.

Signed-off-by: Srijan Upadhyay <srjnupadhyay@gmail.com>
@CodersAcademy006

Copy link
Copy Markdown
Author

Thanks for the review. Addressed in 8b12101:

Nitpicks (both applied):

  • envUtils.cpp — extracted the default into kDefaultNixlPortLockPath (anonymous namespace, k-prefixed per the coding guideline) and reference it from resolveNixlPortLockPath.
  • envUtilsTest.cpp — replaced the inline literals with named kDefaultNixlPortLockPath / kOverrideNixlPortLockPath constants in the assertions.

Docstring Coverage check (warning): holding on this one. The 0% is the file-wide baseline — none of the existing envUtils helpers (getEnvNixlPort, getEnvNixlInterface, getEnvNixlBackend, etc.) carry docstrings, so adding Doxygen to only the two new functions would break local consistency. The new declarations already have explanatory header comments describing the env var, default, and rationale — more than the surrounding helpers. Happy to add formal docstrings if maintainers want them, but didn't want to introduce a one-off convention.

For context on the design: the FileLock ctor takes std::string const&, so passing the resolved std::string binds directly (no .c_str() needed), and the default path is now defined in a single place. clang-format (v16.0.0, matching the pinned pre-commit rev) is clean across all changed files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NixlTransferAgent hardcodes /tmp/trtllm_nixl_port.lock — breaks shared-/tmp setups (Docker bind-mounts, parallel CI)

1 participant