Skip to content

add support of torch profiling#1093

Open
STwangyingrui wants to merge 10 commits into
mainfrom
yr/torch_profiling
Open

add support of torch profiling#1093
STwangyingrui wants to merge 10 commits into
mainfrom
yr/torch_profiling

Conversation

@STwangyingrui
Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a PyTorch Trace Profiling module, featuring a kernel inspector tool, Docker bridge script, and detailed documentation. It also integrates magi_compiler support for NeoPP and Qwen Image models, optimizing inference through custom op registration and Dynamo configuration. Furthermore, a new audio_io utility is implemented to ensure consistent audio loading across various backends. Review feedback highlights several performance optimization opportunities, including specifying target devices for tensor creation to minimize host-to-device synchronization, improving audio loading efficiency by reading directly as float32, and refactoring the profiler initialization to reduce redundant environment variable parsing.

Comment on lines +271 to +272
cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32),
cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating these tensors inside the per-layer loop without specifying a device can lead to unnecessary host-to-device synchronizations and overhead in eager mode. It is recommended to specify the device to ensure they are created on the same device as the input tensors.

Suggested change
cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32),
cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32),
cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32, device=query_states.device),
cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32, device=query_states.device),

query = xq.reshape(L, H * D).contiguous().clone()
key = xk.reshape(L, H * D).contiguous().clone()

positions = torch.arange(L, device="cpu", dtype=torch.long).to(xq.device, non_blocking=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

torch.arange can be created directly on the target device to avoid an extra host-to-device transfer.

Suggested change
positions = torch.arange(L, device="cpu", dtype=torch.long).to(xq.device, non_blocking=True)
positions = torch.arange(L, device=xq.device, dtype=torch.long)

Comment on lines +45 to +49
data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop)
tensor = torch.from_numpy(data.T.copy()).float()
if not channels_first:
tensor = tensor.transpose(0, 1)
return tensor, sample_rate
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current logic always performs a transpose and copy, which is inefficient when channels_first is False. Additionally, reading as float32 directly from soundfile is more efficient than reading as float64 and converting later.

Suggested change
data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop)
tensor = torch.from_numpy(data.T.copy()).float()
if not channels_first:
tensor = tensor.transpose(0, 1)
return tensor, sample_rate
data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop, dtype='float32')
if channels_first:
tensor = torch.from_numpy(data.T.copy())
else:
tensor = torch.from_numpy(data)
return tensor, sample_rate

image_rotary_emb,
modulate_index,
):
profiler = TorchTraceProfiler.from_env()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

TorchTraceProfiler.from_env() parses environment variables on every call to infer_calculating. Since these variables are typically static for the duration of the process, it is more efficient to initialize the profiler once in the __init__ method and reuse it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant