fix linear att cpu cache offload load speed by hiworldwzj · Pull Request #1317 · ModelTC/LightLLM

hiworldwzj · 2026-05-24T02:43:50Z

No description provided.

gemini-code-assist

Code Review

This pull request optimizes the Triton kernel for linear attention CPU cache copying by implementing 128-bit alignment using torch.uint64 and refining the parallelization strategy for better performance. It also updates stream synchronization to use non-blocking wait calls and refactors the multi-turn benchmark tool from an asynchronous architecture to a threaded implementation using requests and ThreadPoolExecutor. The reviewer feedback identifies several improvement opportunities, including the removal of dead commented-out code, the correction of an inaccurate comment in the new speed test script, and the restoration of robust error handling in the benchmark tool to prevent the entire process from crashing due to individual request failures.

gemini-code-assist · 2026-05-24T02:45:48Z

                        # TODO fa3 currently must use synchronous mode; remove this in the future
-                        g_infer_context.get_overlap_stream().synchronize()
+                        torch.cuda.current_stream().wait_stream(g_infer_context.get_overlap_stream())
+                        # g_infer_context.get_overlap_stream().synchronize()


The commented-out code should be removed to maintain a clean codebase.

gemini-code-assist · 2026-05-24T02:45:48Z

+            if response.status_code != 200:
+                err = response.text
+                raise RuntimeError(f"stream_one_turn failed: status={response.status_code}, body={err[:200]}")


Changing the error handling to raise a RuntimeError makes the benchmark tool fragile. A single request failure will now crash the entire benchmark process. It is better to log the error and return None to allow other concurrent sessions to continue, maintaining the previous robust behavior.

Suggested change

-            if response.status_code != 200:
-                err = response.text
-                raise RuntimeError(f"stream_one_turn failed: status={response.status_code}, body={err[:200]}")
+            if response.status_code != 200:
+                err = response.text
+                print(f"\n[turn failed] status={response.status_code} body={err[:200]}")
+                return None
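The trade-off described in this comment can be sketched as follows. This is a minimal, self-contained illustration and not the benchmark tool's actual code: `fake_stream_one_turn`, `run_sessions`, and the simulated status codes are hypothetical stand-ins, and no HTTP request is made. It shows a failed turn being logged and reported as `None` so that the other sessions in the `ThreadPoolExecutor` still complete.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_stream_one_turn(session_id):
    """Hypothetical stand-in for one request; odd session ids simulate an HTTP 500."""
    status_code = 500 if session_id % 2 else 200
    if status_code != 200:
        # Log the failure and signal it with None instead of raising.
        print(f"\n[turn failed] session={session_id} status={status_code}")
        return None
    return {"session_id": session_id, "ttft_s": 0.01}


def run_sessions(num_sessions, max_workers=4):
    """Run all sessions concurrently; return the successful results and a failure count."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and would re-raise a worker exception here.
        results = list(pool.map(fake_stream_one_turn, range(num_sessions)))
    succeeded = [r for r in results if r is not None]
    failed = len(results) - len(succeeded)
    return succeeded, failed


succeeded, failed = run_sessions(6)
print(f"succeeded={len(succeeded)} failed={failed}")
```

If raising inside the turn function is preferred for debugging, catching the exception at the session boundary would likely give a similar effect while keeping the failure visible.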

gemini-code-assist · 2026-05-24T02:45:48Z

    if first_token_time is None:
-        return None
+        raise RuntimeError("stream_one_turn failed: no token received from stream")


Raising a RuntimeError here will terminate the benchmark session abruptly. Consider returning None to allow the benchmark to continue with other sessions.

Suggested change

-    if first_token_time is None:
-        raise RuntimeError("stream_one_turn failed: no token received from stream")
+    if first_token_time is None:
+        print("\n[turn failed] no token received from stream")
+        return None
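One way the `None` return could be consumed downstream is sketched below. The helper names `measure_first_token` and `summarize` are hypothetical and the stream is modelled as a plain iterator of strings rather than an HTTP response, so this only illustrates the pattern: a turn that yields no token is logged, returns `None`, and is excluded from the latency average instead of aborting the run.

```python
import time


def measure_first_token(chunks):
    """Return seconds until the first non-empty chunk, or None if none arrives."""
    start = time.perf_counter()
    first_token_time = None
    for chunk in chunks:
        if chunk and first_token_time is None:
            first_token_time = time.perf_counter() - start
    if first_token_time is None:
        print("\n[turn failed] no token received from stream")
        return None
    return first_token_time


def summarize(ttfts):
    """Average only the turns that produced a token; count the ones that did not."""
    valid = [t for t in ttfts if t is not None]
    mean = sum(valid) / len(valid) if valid else None
    return {"turns": len(ttfts), "failed": len(ttfts) - len(valid), "mean_ttft_s": mean}


# One stream with tokens and one empty stream.
ttfts = [measure_first_token(iter(["a", "b"])), measure_first_token(iter([]))]
summary = summarize(ttfts)
print(summary)
```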

gemini-code-assist · 2026-05-24T02:45:48Z

+# ---------------------------------------------------------------------------
+# Step 2 – derive sizes from the config
+# ---------------------------------------------------------------------------
+big_page_token_num = _env_args["cpu_cache_token_page_size"]  # 512


The comment # 512 is incorrect. Based on the _env_args definition, cpu_cache_token_page_size is 2048 * 8, which equals 16384.

Suggested change

-big_page_token_num = _env_args["cpu_cache_token_page_size"]  # 512
+big_page_token_num = _env_args["cpu_cache_token_page_size"]  # 16384
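The arithmetic behind the corrected comment can be checked directly. In this sketch the `_env_args` dictionary is a hypothetical stand-in that models only the one key, using the `2048 * 8` definition quoted in the review comment; the real script's dictionary may contain more entries.

```python
# Hypothetical stand-in for the script's _env_args; only this key is modelled,
# using the 2048 * 8 definition quoted in the review comment.
_env_args = {"cpu_cache_token_page_size": 2048 * 8}

big_page_token_num = _env_args["cpu_cache_token_page_size"]

# Printing the derived value avoids relying on a hand-written comment going stale.
print(f"big_page_token_num = {big_page_token_num}")
```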

gemini-code-assist · 2026-05-24T02:45:49Z

+# conv_shape = linear_config.get_conv_state_shape()
+# cpu_kv_conv_state = torch.empty(
+#     (buffer_count, linear_config.linear_layer_num, *conv_shape),
+#     dtype=linear_config.conv_state_dtype, device="cuda",
+# )
+
+# ssm_shape = linear_config.get_ssm_state_shape()  # (num_linear_v_heads, head_linear_k_dim, head_linear_v_dim)
+# cpu_kv_ssm_state = torch.empty(
+#     (buffer_count, linear_config.linear_layer_num, *ssm_shape),
+#     dtype=linear_config.ssm_state_dtype, device="cuda",
+# )


Remove commented-out code blocks to maintain cleanliness.

hiworldwzj added 3 commits May 24, 2026 02:24

fix linear att cpu cache offload load speed

1348394

fix

f3519fa

fix

4171b5f

gemini-code-assist Bot reviewed May 24, 2026

hiworldwzj wants to merge 3 commits into main from wzj_fix
