add support of torch profiling #1093
Code Review
This pull request introduces a PyTorch Trace Profiling module, featuring a kernel inspector tool, Docker bridge script, and detailed documentation. It also integrates magi_compiler support for NeoPP and Qwen Image models, optimizing inference through custom op registration and Dynamo configuration. Furthermore, a new audio_io utility is implemented to ensure consistent audio loading across various backends. Review feedback highlights several performance optimization opportunities, including specifying target devices for tensor creation to minimize host-to-device synchronization, improving audio loading efficiency by reading directly as float32, and refactoring the profiler initialization to reduce redundant environment variable parsing.
    cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32),
    cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32),
These tensors are created inside the per-layer loop without a device argument, so they are allocated on the CPU. That can lead to unnecessary host-to-device transfers and synchronization overhead in eager mode. Specify the device explicitly so that they are created on the same device as the input tensors.
-   cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32),
-   cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32),
+   cu_seqlens_q=torch.tensor([0, seq_len_q], dtype=torch.int32, device=query_states.device),
+   cu_seqlens_kv=torch.tensor([0, seq_len_k], dtype=torch.int32, device=query_states.device),
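One way to apply this suggestion, and to take it a step further, is to build the cumulative sequence-length tensors once on the target device and reuse them in every layer. The following is only a minimal sketch: the helper `make_cu_seqlens`, the tensor shapes, and the four-iteration loop are illustrative assumptions and are not code from this pull request.

```python
import torch


def make_cu_seqlens(seq_len_q: int, seq_len_k: int, device: torch.device):
    """Hypothetical helper: build both cu_seqlens tensors directly on `device`."""
    cu_q = torch.tensor([0, seq_len_q], dtype=torch.int32, device=device)
    cu_kv = torch.tensor([0, seq_len_k], dtype=torch.int32, device=device)
    return cu_q, cu_kv


# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
query_states = torch.zeros(16, 8, 64, device=device)  # stand-in for the real input

# Built once, outside the per-layer loop, on the same device as the input.
cu_q, cu_kv = make_cu_seqlens(16, 32, query_states.device)

for _layer in range(4):  # stand-in for the per-layer loop
    # Each layer would pass cu_q / cu_kv to its attention call here.
    assert cu_q.device == query_states.device
```

Whether the tensors can be hoisted out of the loop depends on the sequence lengths staying constant across layers, which is assumed here.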
    query = xq.reshape(L, H * D).contiguous().clone()
    key = xk.reshape(L, H * D).contiguous().clone()

    positions = torch.arange(L, device="cpu", dtype=torch.long).to(xq.device, non_blocking=True)
    data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop)
    tensor = torch.from_numpy(data.T.copy()).float()
    if not channels_first:
        tensor = tensor.transpose(0, 1)
    return tensor, sample_rate
The current logic always transposes and copies the data, which is wasted work when channels_first is False because the result is then transposed back. In addition, soundfile returns float64 by default, so requesting dtype='float32' in sf.read avoids the later conversion.
-   data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop)
-   tensor = torch.from_numpy(data.T.copy()).float()
-   if not channels_first:
-       tensor = tensor.transpose(0, 1)
-   return tensor, sample_rate
+   data, sample_rate = sf.read(uri, always_2d=True, start=start, stop=stop, dtype='float32')
+   if channels_first:
+       tensor = torch.from_numpy(data.T.copy())
+   else:
+       tensor = torch.from_numpy(data)
+   return tensor, sample_rate
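The layout logic in this suggestion can be checked without an audio file. The sketch below uses a synthetic NumPy array in place of the `(frames, channels)` float32 buffer that `sf.read(..., always_2d=True, dtype='float32')` returns. The helper name `arrange_channels` is hypothetical, and the final `torch.from_numpy` step is left out to keep the example dependency-light.

```python
import numpy as np


def arrange_channels(data: np.ndarray, channels_first: bool) -> np.ndarray:
    """Hypothetical helper; `data` has shape (frames, channels)."""
    if channels_first:
        # Transpose to (channels, frames) and copy only if the result
        # is not already C-contiguous.
        return np.ascontiguousarray(data.T)
    # Frames-first callers get the buffer as-is: no transpose, no copy.
    return data


# Synthetic stereo clip: 4 frames, 2 channels, already float32.
frames = np.arange(8, dtype=np.float32).reshape(4, 2)

channels_first_view = arrange_channels(frames, channels_first=True)
frames_first_view = arrange_channels(frames, channels_first=False)

print(channels_first_view.shape, channels_first_view.flags["C_CONTIGUOUS"])
print(frames_first_view.shape, np.shares_memory(frames_first_view, frames))
```

In the frames-first case the returned array shares memory with the input, which is the saving the review comment points at.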
        image_rotary_emb,
        modulate_index,
    ):
        profiler = TorchTraceProfiler.from_env()
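The review summary mentions reducing redundant environment variable parsing during profiler initialization. One possible approach, sketched below, is to parse the environment once and cache the result so that repeated `from_env()` calls in a hot path stay cheap. Everything here is hypothetical: `ExampleProfiler`, `ProfilerConfig`, and the `EXAMPLE_TORCH_PROFILE*` variable names are placeholders and do not reflect the actual `TorchTraceProfiler` implementation.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class ProfilerConfig:
    """Hypothetical parsed settings; the field names are illustrative."""
    enabled: bool
    trace_dir: str


@lru_cache(maxsize=1)
def load_profiler_config() -> ProfilerConfig:
    # Environment variables are read and parsed on the first call only;
    # later calls return the cached ProfilerConfig.
    enabled = os.environ.get("EXAMPLE_TORCH_PROFILE", "0") == "1"
    trace_dir = os.environ.get("EXAMPLE_TORCH_PROFILE_DIR", "./traces")
    return ProfilerConfig(enabled=enabled, trace_dir=trace_dir)


class ExampleProfiler:
    """Stand-in for a profiler class that is constructed from the environment."""

    def __init__(self, config: ProfilerConfig) -> None:
        self.config = config

    @classmethod
    def from_env(cls) -> "ExampleProfiler":
        return cls(load_profiler_config())


first = ExampleProfiler.from_env()
second = ExampleProfiler.from_env()  # reuses the cached config
```

The trade-off is that changes to the environment after the first call are ignored until `load_profiler_config.cache_clear()` is called.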