add support of torch profiling by STwangyingrui · Pull Request #1093 · ModelTC/LightX2V

STwangyingrui · 2026-05-25T12:42:10Z

No description provided.


gemini-code-assist

Code Review

This pull request introduces a PyTorch Trace Profiling module, featuring a kernel inspector tool, Docker bridge script, and detailed documentation. It also integrates magi_compiler support for NeoPP and Qwen Image models, optimizing inference through custom op registration and Dynamo configuration. Furthermore, a new audio_io utility is implemented to ensure consistent audio loading across various backends. Review feedback highlights several performance optimization opportunities, including specifying target devices for tensor creation to minimize host-to-device synchronization, improving audio loading efficiency by reading directly as float32, and refactoring the profiler initialization to reduce redundant environment variable parsing.

gemini-code-assist · 2026-05-25T12:44:27Z

+                cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32),
+                cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32),


Creating these tensors inside the per-layer loop without specifying a device can lead to unnecessary host-to-device synchronizations and overhead in eager mode. It is recommended to specify the device to ensure they are created on the same device as the input tensors.

Suggested change

-                cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32),
-                cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32),
+                cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32, device=query_states.device),
+                cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32, device=query_states.device),
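A minimal sketch of the idea behind this comment, not the PR's actual code: the cumulative-length tensor is created directly on the input's device and, assuming the sequence length does not change between layers, built once before the per-layer loop. The helper `build_cu_seqlens`, the tensor shape, and `num_layers` are hypothetical names and values chosen for illustration; the input lives on the CPU here only so the sketch runs without a GPU.

```python
import torch


def build_cu_seqlens(seq_len: int, device: torch.device) -> torch.Tensor:
    """Cumulative sequence lengths [0, seq_len] as int32, created on `device`."""
    return torch.tensor([0, seq_len], dtype=torch.int32, device=device)


# Stand-in for the attention input (CPU here so the sketch runs anywhere;
# in the PR this would be a GPU tensor).
query_states = torch.zeros(8, 4, 16)
seq_len_q = query_states.shape[0]

# Built once, before the per-layer loop, then reused by every layer.
cu_seqlens_q = build_cu_seqlens(seq_len_q, query_states.device)

num_layers = 3  # hypothetical layer count
for _ in range(num_layers):
    # A real layer would pass cu_seqlens_q to its attention kernel here.
    assert cu_seqlens_q.device == query_states.device
```

Hoisting the tensor out of the loop is only valid when `seq_len_q` is the same for every layer; otherwise the helper can still be called per layer with the explicit device.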

gemini-code-assist · 2026-05-25T12:44:27Z

+    query = xq.reshape(L, H * D).contiguous().clone()
+    key = xk.reshape(L, H * D).contiguous().clone()
+
+    positions = torch.arange(L, device="cpu", dtype=torch.long).to(xq.device, non_blocking=True)


torch.arange can be created directly on the target device to avoid an extra host-to-device transfer.

Suggested change

-    positions = torch.arange(L, device="cpu", dtype=torch.long).to(xq.device, non_blocking=True)
+    positions = torch.arange(L, device=xq.device, dtype=torch.long)

gemini-code-assist · 2026-05-25T12:44:27Z

+    data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop)
+    tensor = torch.from_numpy(data.T.copy()).float()
+    if not channels_first:
+        tensor = tensor.transpose(0, 1)
+    return tensor, sample_rate


The current logic always performs a transpose and copy, which is inefficient when channels_first is False. Additionally, reading as float32 directly from soundfile is more efficient than reading as float64 and converting later.

Suggested change

-    data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop)
-    tensor = torch.from_numpy(data.T.copy()).float()
-    if not channels_first:
-        tensor = tensor.transpose(0, 1)
-    return tensor, sample_rate
+    data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop, dtype='float32')
+    if channels_first:
+        tensor = torch.from_numpy(data.T.copy())
+    else:
+        tensor = torch.from_numpy(data)
+    return tensor, sample_rate
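The layout logic in this suggestion can be illustrated with NumPy alone. This is a sketch under stated assumptions: the array `data` stands in for what `sf.read(..., always_2d=True, dtype='float32')` returns, namely a `(frames, channels)` float32 array, and `to_layout` is a hypothetical helper that is not part of the PR.

```python
import numpy as np


def to_layout(data: np.ndarray, channels_first: bool) -> np.ndarray:
    """Arrange a (frames, channels) array as requested, copying only when needed."""
    if channels_first:
        # The transpose is a strided view; copy it into contiguous memory.
        return data.T.copy()
    # Already (frames, channels): return the buffer untouched.
    return data


# Stand-in for the decoded audio: 4 frames of stereo samples, already float32.
data = np.arange(8, dtype=np.float32).reshape(4, 2)

frames_first = to_layout(data, channels_first=False)
chans_first = to_layout(data, channels_first=True)
```

Only the `channels_first=True` branch pays for a copy. The other branch hands back the original buffer, and because the data is already float32 no later dtype conversion is needed.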

gemini-code-assist · 2026-05-25T12:44:27Z

        image_rotary_emb,
        modulate_index,
    ):
+        profiler = TorchTraceProfiler.from_env()


TorchTraceProfiler.from_env() parses environment variables on every call to infer_calculating. Since these variables are typically static for the duration of the process, it is more efficient to initialize the profiler once in the __init__ method and reuse it.
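A minimal sketch of the suggested refactor, using only the standard library. `EnvProfiler` is a hypothetical stand-in for the PR's `TorchTraceProfiler`, whose real constructor and fields are not shown on this page; `PROFILE_ENABLED` and `TransformerInfer` are likewise invented for illustration.

```python
import os


class EnvProfiler:
    """Hypothetical stand-in for the PR's TorchTraceProfiler."""

    parse_count = 0  # counts how often the environment is parsed

    def __init__(self, enabled: bool):
        self.enabled = enabled

    @classmethod
    def from_env(cls) -> "EnvProfiler":
        cls.parse_count += 1
        # PROFILE_ENABLED is an invented variable name for this sketch.
        return cls(enabled=os.environ.get("PROFILE_ENABLED", "0") == "1")


class TransformerInfer:
    """Hypothetical inference class with a hot-path method."""

    def __init__(self):
        # Parse the environment once, at construction time.
        self.profiler = EnvProfiler.from_env()

    def infer_calculating(self, x: int) -> int:
        # Reuse the cached profiler instead of calling from_env() per step.
        if self.profiler.enabled:
            pass  # a real implementation would record a trace here
        return x * 2


infer = TransformerInfer()
results = [infer.infer_calculating(i) for i in range(5)]
```

The trade-off is that a change to the environment variable after construction is no longer picked up, which matches the comment's premise that these variables stay fixed for the life of the process.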

STwangyingrui added 10 commits May 14, 2026 10:25

add magi_compiler support to neopp

89a60da

Merge branch 'main' into yr/magi_compiler

c84381a

add option of use_magi_compile

d62db69

optimize step1 time(magi compile time): use block graph instead of mo…

d51efde

…del graph

add magi-compiler support to qwen_image

ee2aed0

merge main and fix conflict

d6373d5

fix codec, compatible with legacy env of soundfile

4cb518d

merge main and fix conflict

4c941f8

fix as gemini code assist

5eba407

add support of torch profiling

39ad25f

gemini-code-assist Bot reviewed May 25, 2026

STwangyingrui wants to merge 10 commits into main from yr/torch_profiling
