Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

Adds YAML-driven sweep orchestration and bounded-concurrency execution; new run utilities and CLI sweep flags; introduces a HumanEval dataset and per-dataset category filtering; converts metrics to record-driven finalization (adds ThinkingAcceptance); standardizes model streaming outputs with a CumulativeTokenCollector; adds env dumping and tokenizer/chat defaults.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI
    participant RunUtils as run_utils (sweep)
    participant Runner as Runner
    participant Model as Model
    participant Metrics as Metrics
    CLI->>RunUtils: provide args + --sweep_config
    RunUtils->>RunUtils: load_sweep_entries(), expand_sweep_run_specs()
    loop per run_spec
        RunUtils->>RunUtils: build_dataset(), build_metrics()
        RunUtils->>Runner: start run with dataset & metrics
        loop per request
            Runner->>Model: run(prompt_ids, max_length, end_id, request_id, turn_id)
            Model->>Model: CumulativeTokenCollector collects chunks
            Model-->>Runner: build_model_output(output_ids, token_times, chunk_lengths)
            Runner->>Metrics: process_metrics_step(record with token_times, chunk_lengths, output_ids)
        end
        Runner->>Metrics: process_final(text_outputs, metric_records)
        Metrics->>RunUtils: write outputs/plots
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)
✅ Passed checks (1 passed)

✨ Finishing Touches: 🧪 Generate unit tests (beta)
689b4a0 was force-pushed to 64bfe8d
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/specdec_bench/specdec_bench/datasets/base.py (1)
21-29: ⚠️ Potential issue | 🟡 Minor

Reject requests that define both `turns` and `messages`.

This dataclass now allows two conversation representations at once, which makes invalid requests easy to construct and silently mis-handle downstream. Add a `__post_init__` guard or normalize one form into the other here.

Proposed fix

```diff
 @dataclass
 class Request:
     question_id: int | None = None
     category: str | None = None
     system_prompt: str | None = None
     turns: list[str] = field(default_factory=list)
     mm_content: Any | None = None  # TODO
     messages: list[dict] | None = None  # pre-built full conversation history
+
+    def __post_init__(self) -> None:
+        if self.messages is not None and self.turns:
+            raise ValueError("Request cannot define both 'turns' and 'messages'.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/datasets/base.py` around lines 21 - 29, The Request dataclass allows both turns and messages to be set simultaneously which is invalid; add a __post_init__ method on Request that checks if both self.turns (non-empty) and self.messages (non-None and non-empty) are provided and raise a ValueError explaining only one conversation representation is allowed; optionally, if you prefer normalization instead of rejecting, implement the same __post_init__ to populate self.messages from self.turns when messages is None (or populate turns from messages when turns is empty) — but ensure the check for both being present is enforced in the __post_init__ of Request.
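As an illustration of the guard this comment asks for, here is a standalone sketch. The field names mirror the `Request` dataclass quoted in the review; this is not the project's actual file, just a runnable example of the `__post_init__` rejection path:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Request:
    question_id: int | None = None
    category: str | None = None
    system_prompt: str | None = None
    turns: list[str] = field(default_factory=list)
    mm_content: Any | None = None
    messages: list[dict] | None = None  # pre-built full conversation history

    def __post_init__(self) -> None:
        # Reject ambiguous requests that carry both representations at once.
        if self.messages is not None and self.turns:
            raise ValueError("Request cannot define both 'turns' and 'messages'.")
```

Either representation alone constructs fine; supplying both raises immediately instead of silently mis-handling the request downstream.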
🧹 Nitpick comments (5)
examples/specdec_bench/specdec_bench/models/auto_deploy.py (1)
95-115: Consider exposing `mtp_eagle_one_model` as configurable.

The `EAGLE`, `EAGLE3`, and `MTP` speculative decoding configurations look correct. However, `mtp_eagle_one_model=True` on line 114 is hardcoded while similar parameters in other configs (e.g., `eagle3_one_model`) are configurable via kwargs.

Make `mtp_eagle_one_model` configurable

```diff
 elif speculative_algorithm == "MTP":
     specdec = MTPDecodingConfig(
         speculative_model=kwargs.get("draft_model_dir"),
         num_nextn_predict_layers=kwargs.get("speculative_num_steps", 1),
-        mtp_eagle_one_model=True,
+        mtp_eagle_one_model=kwargs.get("mtp_eagle_one_model", True),
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/models/auto_deploy.py` around lines 95 - 115, In the MTP branch where MTPDecodingConfig is constructed (the speculative_algorithm == "MTP" block), make the mtp_eagle_one_model flag configurable instead of hardcoded: replace the hardcoded mtp_eagle_one_model=True with mtp_eagle_one_model=kwargs.get("mtp_eagle_one_model", True) so callers can pass mtp_eagle_one_model via kwargs while preserving the existing default; update any related docs/comments if present.

examples/specdec_bench/README.md (1)

219-224: Minor: Missing trailing newline.

The file should end with a newline character per standard convention.

Add trailing newline

```diff
 use a more robust benchmarking client like NVIDIA AI Perf.
+
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/README.md` around lines 219 - 224, The README ends without a trailing newline; open the README.md and add a single newline character at the end of the file so the final line (e.g., the "## Notes" section text) is terminated by a newline per POSIX/text-file conventions.

examples/specdec_bench/specdec_bench/metrics/base.py (1)
62-70: Consider documenting the `[1]` fallback return value.

The `trim_beam_lengths` method returns `[1]` as a fallback when `beam_lengths` is empty or when trimming results in an invalid range. This sentinel value may have specific meaning for downstream consumers, but it's not immediately obvious why `[1]` was chosen over an empty list.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/metrics/base.py` around lines 62 - 70, The trim_beam_lengths function returns a sentinel [1] in edge cases but lacks documentation; update the trim_beam_lengths docstring to explain why [1] is returned (e.g., represents a minimum beam length/default sentinel for downstream metrics consumers), when it will be returned (empty input or invalid trimmed range), and any assumptions callers must make about this sentinel so maintainers and callers (who call trim_beam_lengths with beam_lengths) understand its semantics and won't misinterpret it.

examples/specdec_bench/specdec_bench/metrics/specbench.py (1)
189-191: Unused loop variable `i`.

The `enumerate` provides `i` which is not used in the loop body.

♻️ Suggested fix

```diff
-        for i, (bar, count) in enumerate(zip(bars2, length_counts)):
+        for bar, count in zip(bars2, length_counts):
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/metrics/specbench.py` around lines 189 - 191, The loop is enumerating but never uses the index variable `i`; change the loop that builds `bars2` to avoid the unused `i` by iterating over indices or zipped sequences directly (e.g., use range(len(length_bins)) or zip(length_bins, length_means, length_stds)) when calling `ax2.bar` so `bars2 = ax2.bar(range(len(length_bins)), length_means, yerr=length_stds, capsize=5, alpha=0.8)` has no leftover unused `i` from an `enumerate` elsewhere.

examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py (1)
61-69: Transition chunk is discarded when `thinking_end_token` is detected.

When the end token is found in a chunk (lines 62-65), the chunk's acceptance length is not counted in either `thinking_lengths` or `non_thinking_lengths` due to the `continue`. This may be intentional behavior for boundary handling, but it means the transition chunk's tokens are excluded from statistics. Consider documenting this behavior or counting it in one category.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py` around lines 61 - 69, The transition chunk is skipped by the continue when self._contains_sequence(output_id_iter, self.thinking_end_token) is true, so its acceptance_length isn't recorded; update the block in thinking handling (where in_thinking and self.thinking_end_token are checked) to record the current acceptance_length into the appropriate list (e.g., append acceptance_length to thinking_lengths) before flipping prompt_state[request_id][turn_id]["in_thinking"] = False and setting in_thinking = False, or explicitly document that transition chunks are excluded if you prefer; reference symbols: thinking_end_token, _contains_sequence, prompt_state, in_thinking, thinking_lengths, non_thinking_lengths, acceptance_length.
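To make the "count it in one category" option concrete, here is a simplified sketch of the bucketing logic. The real metric walks token chunks and `prompt_state`; in this hypothetical helper, per-chunk acceptance lengths and an end-token flag are precomputed, and the transition chunk is credited to the thinking bucket before the state flips:

```python
def split_acceptance_lengths(chunk_lengths, end_token_found):
    # chunk_lengths: acceptance length of each decoded chunk, in order.
    # end_token_found: per-chunk flag, True where the thinking end token appears.
    thinking_lengths, non_thinking_lengths = [], []
    in_thinking = True
    for length, has_end in zip(chunk_lengths, end_token_found):
        if in_thinking:
            # Record the chunk first, so the transition chunk (the one
            # containing the end token) is counted rather than dropped.
            thinking_lengths.append(length)
            if has_end:
                in_thinking = False
        else:
            non_thinking_lengths.append(length)
    return thinking_lengths, non_thinking_lengths
```

The alternative in the review, keeping the `continue` but documenting it, would simply leave the transition chunk out of both lists.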
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/specdec_bench/run.py`:
- Around line 405-409: The YAML loaded into args.runtime_params via
yaml.safe_load may be a scalar or list causing AttributeError later in
run_simple(); after loading (or setting {}), validate that args.runtime_params
is a mapping (dict) and if not raise a clear CLI error (e.g.,
argparse.ArgumentTypeError or print to stderr and exit) mentioning that
runtime_params must be a YAML mapping; reference args.runtime_params,
yaml.safe_load, and run_simple() when adding the validation.
- Around line 169-184: The code reuses one model instance (constructed via
model_class(...) and assigned to model) while mutating
model.sampling_kwargs["temperature"] per-run, but the executor's
allow_advanced_sampling (set in create_executor()) is computed only at model
init and cannot change; fix by either (A) detecting per-run temperature changes
against the initial sampling_kwargs temperature and recreating the model
(re-instantiate model via model_class with the new sampling_kwargs) before
executing that run, ensuring create_executor() sees the correct temperature, or
(B) rejecting per-run temperature overrides for engines that bake sampling
settings at init (validate run_specs and raise an error if any
run_spec.temperature differs from the initial sampling_kwargs temperature).
Update the sweep logic that mutates model.sampling_kwargs (and the code that
builds model from model_class) to implement one of these two behaviors and
reference create_executor, allow_advanced_sampling, model, model_class,
sampling_kwargs, and run_specs when making the change.
In `@examples/specdec_bench/specdec_bench/datasets/humaneval.py`:
- Around line 18-22: The module-level optional import for "datasets" is broken
because load_dataset can be undefined; move the import into the function that
needs it (e.g. _preprocess() or the class __init__ that calls it) and replace
the print fallback with a raised ImportError: try to import load_dataset inside
_preprocess() (or __init__), catch ImportError and raise a new ImportError with
a clear message instructing the user to install the optional extras (e.g. "[hf]"
or "[all]"); ensure no module-level variables like load_dataset remain undefined
after this change.
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 477-480: After filtering by category, clamp self.num_samples to
the filtered dataset length before calling Dataset.select to avoid IndexError:
compute a local count (e.g., n = min(self.num_samples, len(dataset) or
dataset.num_rows)) after the dataset.filter step and then call
dataset.select(range(n)) only when self.num_samples is not None and n > 0;
update the block that references dataset, category, and self.num_samples
accordingly so select() never receives out-of-bounds indices.
In `@examples/specdec_bench/specdec_bench/metrics/mtbench.py`:
- Around line 34-40: The code currently assumes exactly two turns by indexing
turns[0] and turns[1], which will raise errors for requests with fewer or more
turns; update the loop over prompt_ar so you handle an arbitrary-length turns
list: compute request_ar by aggregating all tokens/values across the turns
(e.g., sum of concatenated turns divided by total length) and set
self.out["Request_AL"][request_id] accordingly, and replace the two explicit
_get_lengths calls with a loop that calls self._get_lengths(turn, lengths) for
each turn in turns (or skip requests with no turns) to defensively handle
varying turn counts.
In `@examples/specdec_bench/specdec_bench/metrics/timing.py`:
- Around line 38-40: The min/max calls over timing can crash if request_records
is empty or any token_times list is empty; update the metric calculation (the
block that computes start_time, end_time and assigns self.out["Output TPS"]) to
first filter out empty token_times (e.g., filtered_timing = [t for t in timing
if t]) and check for an empty result, and if filtered_timing is empty (or timing
is empty) set self.out["Output TPS"] to 0 (or another safe sentinel) instead of
calling min/max; otherwise compute start_time = min(t[0] for t in
filtered_timing), end_time = max(t[-1] for t in filtered_timing) and proceed to
compute TPS using total_tokens.
In `@examples/specdec_bench/specdec_bench/models/base.py`:
- Around line 46-67: CumulativeTokenCollector currently records timestamps in a
single list self.token_times which becomes ambiguous when multiple beams update
interleaved; change the design so timing is stored per-beam: add a timestamp
container on the per-beam state (e.g. add a timestamps list to _BeamState) and
update add_update(beam_idx, cumulative_tokens, timestamp) to append timestamps
to that beam's timestamps instead of self.token_times, and adjust
chunk_lengths() and any consumers of token_times/output_ids to read per-beam
timestamps from each _BeamState (or alternatively enforce single-beam usage by
documenting/raising in CumulativeTokenCollector if more than one beam is used).
Ensure references: CumulativeTokenCollector, _BeamState, add_update,
token_times, chunk_lengths, output_ids are all updated consistently.
In `@examples/specdec_bench/specdec_bench/models/sglang.py`:
- Around line 68-80: Replace the fragile assert with explicit validation: check
sampling_config.get("beam_width", 1) and if it is not 1 raise a clear
RuntimeError or ValueError (e.g., "multi-beam generation not supported") so the
check remains in optimized runs; update the surrounding logic around
sampling_config, the call to self.model.async_generate, and collector.add_update
(which currently hardcodes beam index 0) to rely on this validation instead of
an assert.
- Around line 40-60: The sgl.Engine call currently hardcodes
trust_remote_code=True, which is unsafe; change the code to accept a
caller-configurable parameter (e.g., a function/class initializer argument named
trust_remote_code defaulting to False) and pass that value into sgl.Engine
instead of the literal True; locate the sgl.Engine instantiation and replace the
hardcoded trust_remote_code argument with the new parameter (or pull it from
kwargs with kwargs.pop("trust_remote_code", False)) so the default remains False
but callers can opt in.
In `@examples/specdec_bench/specdec_bench/models/vllm.py`:
- Around line 73-74: The kwargs.pop("parallel_drafting") call can raise KeyError
when the key is absent; change it to provide a default (e.g.,
kwargs.pop("parallel_drafting", False) or None) and then set
specdec["parallel_drafting"] = True only when the popped value is truthy; update
the constructor logic around the specdec handling (the block referencing
kwargs.pop("parallel_drafting") and specdec) so it safely handles missing keys
without throwing.
- Around line 76-88: The code hardcodes trust_remote_code=True when constructing
AsyncEngineArgs in the AsyncEngineArgs(...) call, which is a security risk;
change the API to accept a caller-configurable boolean (e.g., add a parameter
named trust_remote_code with default False to the function/class that
instantiates AsyncEngineArgs) and pass that parameter into
AsyncEngineArgs(trust_remote_code=trust_remote_code) instead of the hardcoded
True, ensuring callers can opt in explicitly; update any call sites that
construct this vllm engine to forward the new argument or rely on the default
False.
In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 23-24: The get_tokenizer function currently hardcodes
trust_remote_code=True which is a security risk; change get_tokenizer to accept
a caller-configurable parameter (e.g., trust_remote_code: bool = False) and pass
that variable into AutoTokenizer.from_pretrained(path,
trust_remote_code=trust_remote_code); update any call sites to rely on the
default False or explicitly pass True when the caller trusts the model source.
Ensure the function signature and its usage across the repo are updated so the
hardcoded True is removed.
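Several of the prompts above (sglang, vllm, utils) converge on the same pattern: read `trust_remote_code` and `parallel_drafting` from kwargs with safe defaults instead of hardcoding them. A generic sketch of that pattern, with a hypothetical `build_engine_kwargs` helper standing in for the various engine constructors:

```python
def build_engine_kwargs(model_dir, **kwargs):
    # trust_remote_code stays False unless the caller explicitly opts in,
    # avoiding silent execution of checkpoint-shipped code.
    trust_remote_code = kwargs.pop("trust_remote_code", False)
    # Pop with a default so a missing key never raises KeyError.
    parallel_drafting = kwargs.pop("parallel_drafting", False)

    engine_kwargs = {"model": model_dir, "trust_remote_code": trust_remote_code}
    if parallel_drafting:
        engine_kwargs["parallel_drafting"] = True
    engine_kwargs.update(kwargs)  # pass through any remaining options
    return engine_kwargs
```

The same two-line change (`kwargs.pop(..., default)` plus forwarding the variable) applies to `sgl.Engine`, `AsyncEngineArgs`, and `AutoTokenizer.from_pretrained` alike.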
---
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 3b5bc295-bd32-42e5-a20d-34a0edcfb946
📒 Files selected for processing (30)
- examples/specdec_bench/README.md
- examples/specdec_bench/SPECBENCH_PORTING.md
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/run.py
- examples/specdec_bench/specdec_bench/datasets/__init__.py
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/specdec_bench/datasets/humaneval.py
- examples/specdec_bench/specdec_bench/datasets/mtbench.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
- examples/specdec_bench/specdec_bench/datasets/speed.py
- examples/specdec_bench/specdec_bench/metrics/__init__.py
- examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
- examples/specdec_bench/specdec_bench/metrics/base.py
- examples/specdec_bench/specdec_bench/metrics/mtbench.py
- examples/specdec_bench/specdec_bench/metrics/specbench.py
- examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
- examples/specdec_bench/specdec_bench/metrics/timing.py
- examples/specdec_bench/specdec_bench/models/__init__.py
- examples/specdec_bench/specdec_bench/models/auto_deploy.py
- examples/specdec_bench/specdec_bench/models/base.py
- examples/specdec_bench/specdec_bench/models/sglang.py
- examples/specdec_bench/specdec_bench/models/specbench_medusa.py
- examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
- examples/specdec_bench/specdec_bench/models/vllm.py
- examples/specdec_bench/specdec_bench/run_utils.py
- examples/specdec_bench/specdec_bench/runners/__init__.py
- examples/specdec_bench/specdec_bench/runners/base.py
- examples/specdec_bench/specdec_bench/runners/simple.py
- examples/specdec_bench/specdec_bench/utils.py
- examples/specdec_bench/sweep_example.yaml
```diff
 if args.runtime_params is not None:
     with open(args.runtime_params) as f:
-        args.runtime_params = yaml.safe_load(f)
+        args.runtime_params = yaml.safe_load(f) or {}
 else:
     args.runtime_params = {}
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, locate the file and examine the code at the specified lines
find . -name "run.py" -path "*/specdec_bench/*" | head -5
```

Repository: NVIDIA/Model-Optimizer
Length of output: 97

🏁 Script executed:

```shell
cat -n examples/specdec_bench/run.py | sed -n '400,415p'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 702

🏁 Script executed:

```shell
rg -A 20 "def run_simple" examples/specdec_bench/run.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 971

🏁 Script executed:

```shell
rg "args\.runtime_params" examples/specdec_bench/run.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 550
Validate that runtime_params loads to a mapping.
`yaml.safe_load()` also accepts scalars and lists, so a malformed YAML file will make the first `.get(...)` call in `run_simple()` fail with `AttributeError` instead of a clear CLI error.
🧩 Proposed fix

```diff
 if args.runtime_params is not None:
     with open(args.runtime_params) as f:
         args.runtime_params = yaml.safe_load(f) or {}
+    if not isinstance(args.runtime_params, dict):
+        raise ValueError("--runtime_params must load to a mapping")
 else:
     args.runtime_params = {}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/run.py` around lines 405 - 409, The YAML loaded into
args.runtime_params via yaml.safe_load may be a scalar or list causing
AttributeError later in run_simple(); after loading (or setting {}), validate
that args.runtime_params is a mapping (dict) and if not raise a clear CLI error
(e.g., argparse.ArgumentTypeError or print to stderr and exit) mentioning that
runtime_params must be a YAML mapping; reference args.runtime_params,
yaml.safe_load, and run_simple() when adding the validation.
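A runnable sketch of that validation, written against a YAML string/stream rather than the CLI path handling in `run.py` (hypothetical helper; requires PyYAML):

```python
import yaml


def load_runtime_params(stream):
    # Empty YAML becomes {}, and anything that is not a mapping is
    # rejected with a clear error instead of failing later with
    # AttributeError on the first .get() call.
    params = yaml.safe_load(stream) or {}
    if not isinstance(params, dict):
        raise ValueError("--runtime_params must load to a YAML mapping")
    return params
```

A scalar or list document (e.g., `- 1`) now fails fast at load time with an actionable message.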
```diff
 for request_id, turns in prompt_ar.items():
     turn_1 = turns[0]
     turn_2 = turns[1]
-    q_id = request_id
-    mtbench_topic = MTBENCH_TOPICS[q_id // 10]
-    self.out["Request_AR"][request_id] = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
+    request_ar = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
+    self.out["Request_AL"][request_id] = request_ar
     self._get_lengths(turn_1, lengths)
     self._get_lengths(turn_2, lengths)
```
Hard-coded assumption of exactly 2 turns per request.
Accessing turns[0] and turns[1] directly will raise KeyError if a request has fewer than 2 turns. This may be intentional for MTBench's fixed format, but defensive handling would improve robustness.
🛡️ Suggested defensive check

```diff
 for request_id, turns in prompt_ar.items():
-    turn_1 = turns[0]
-    turn_2 = turns[1]
+    turn_1 = turns.get(0, [])
+    turn_2 = turns.get(1, [])
+    if not turn_1 or not turn_2:
+        continue
     request_ar = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for request_id, turns in prompt_ar.items():
    if len(turns) < 2:
        continue
    turn_1 = turns[0]
    turn_2 = turns[1]
    request_ar = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
    self.out["Request_AL"][request_id] = request_ar
    self._get_lengths(turn_1, lengths)
    self._get_lengths(turn_2, lengths)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/specdec_bench/metrics/mtbench.py` around lines 34 -
40, The code currently assumes exactly two turns by indexing turns[0] and
turns[1], which will raise errors for requests with fewer or more turns; update
the loop over prompt_ar so you handle an arbitrary-length turns list: compute
request_ar by aggregating all tokens/values across the turns (e.g., sum of
concatenated turns divided by total length) and set
self.out["Request_AL"][request_id] accordingly, and replace the two explicit
_get_lengths calls with a loop that calls self._get_lengths(turn, lengths) for
each turn in turns (or skip requests with no turns) to defensively handle
varying turn counts.
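The per-turn aggregation the prompt describes can be sketched independently of the metric class (hypothetical helper; the real code writes the result into `self.out["Request_AL"]`):

```python
def request_acceptance(turns):
    # Aggregate over however many turns a request has, instead of
    # indexing turns[0] and turns[1] directly.
    values = [v for turn in turns for v in turn]
    if not values:
        return None  # defensively skip requests with no turns
    return sum(values) / len(values)
```

This handles one turn, two turns, or more with the same code path, and the `None` return gives callers an explicit signal to skip empty requests.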
---
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/specdec_bench/specdec_bench/metrics/timing.py (1)
49-54: ⚠️ Potential issue | 🟡 Minor

`compute_statistics` can fail on empty input lists.

If all records have `len(times) <= 1`, then `e2e_time` and `ttft_time` will be empty, causing `compute_statistics` to fail when calling `np.min`/`np.max`/`np.mean` on an empty array. Similarly, `total_tokens` will be empty if `request_records` is empty.

🛡️ Suggested fix: guard compute_statistics calls

```diff
-        self.out["E2E Request Time"] = compute_statistics(e2e_time)
-        self.out["TTFT Time"] = compute_statistics(ttft_time)
+        if e2e_time:
+            self.out["E2E Request Time"] = compute_statistics(e2e_time)
+        if ttft_time:
+            self.out["TTFT Time"] = compute_statistics(ttft_time)
         if tpot_time:
             self.out["Request Generation Step Time"] = compute_statistics(tpot_time)
             self.out["Request Generation Tokens Per Second"] = compute_statistics(gen_tp_time)
-        self.out["Number of Output Tokens"] = compute_statistics(total_tokens)
+        if total_tokens:
+            self.out["Number of Output Tokens"] = compute_statistics(total_tokens)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/metrics/timing.py` around lines 49 - 54, Guard calls to compute_statistics for empty input arrays: before calling compute_statistics for e2e_time, ttft_time, tpot_time, gen_tp_time and total_tokens, check that the corresponding list/array is non-empty (and for tpot_time also keep existing conditional). If empty, assign a sensible fallback (e.g., None or an empty statistics dict) to self.out keys ("E2E Request Time", "TTFT Time", "Request Generation Step Time", "Request Generation Tokens Per Second", "Number of Output Tokens") instead of calling compute_statistics; otherwise call compute_statistics as before. This change touches the block that assigns into self.out and references variables e2e_time, ttft_time, tpot_time, gen_tp_time, and total_tokens.
♻️ Duplicate comments (1)
examples/specdec_bench/specdec_bench/models/vllm.py (1)
76-88: ⚠️ Potential issue | 🔴 Critical

CRITICAL: Hardcoded `trust_remote_code=True` violates security guidelines.

Per SECURITY.md and coding guidelines, `trust_remote_code=True` must not be hardcoded. This enables execution of arbitrary Python shipped with a checkpoint, creating an RCE vector if the model source is untrusted. The code should expose this as a caller-configurable parameter defaulting to `False`.

🔒 Proposed fix

```diff
 class VLLMModel(Model):
     def __init__(self, model_dir, max_concurrent_requests, sampling_kwargs, **kwargs):
+        trust_remote_code = kwargs.pop("trust_remote_code", False)
         specdec = None
         # ... existing speculative config code ...
         engine_args = AsyncEngineArgs(
             model=model_dir,
-            trust_remote_code=True,
+            trust_remote_code=trust_remote_code,
             tensor_parallel_size=kwargs.pop("tensor_parallel_size", 1),
```

As per coding guidelines: "Flag `trust_remote_code=True` hardcoded for transformers model or tokenizer loading as CRITICAL security issue. Code should expose it as a caller-configurable parameter defaulting to False, not hardcode True."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/models/vllm.py` around lines 76 - 88, The code currently hardcodes trust_remote_code=True when constructing AsyncEngineArgs; change this to accept a caller-configurable parameter defaulting to False by reading a new or existing argument (e.g., trust_remote_code = kwargs.pop("trust_remote_code", False)) and pass that variable into AsyncEngineArgs instead of the literal True; update any callers or documentation if needed and ensure kwargs.pop("trust_remote_code", False) is used so the default remains False and callers can opt in.
🧹 Nitpick comments (1)
examples/specdec_bench/specdec_bench/metrics/timing.py (1)
63-77: Consider adding empty-data guard to `compute_statistics`.

As a defensive measure, `compute_statistics` could return a sentinel or raise a clearer error when `data` is empty, rather than letting NumPy raise an opaque error.

♻️ Optional: add early return for empty data

```diff
 def compute_statistics(data, quantiles=[0.25, 0.5, 0.75]):
+    if len(data) == 0:
+        return {"min": "N/A", "max": "N/A", "mean": "N/A", "std": "N/A", "quantiles": {}}
     data = np.array(data)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/metrics/timing.py` around lines 63 - 77, compute_statistics currently lets NumPy raise an opaque error on empty input; add an explicit empty-data guard at the start of compute_statistics (check if data is empty via len(data) or data.size) and either raise a clear ValueError (e.g. "compute_statistics: input data is empty") or return a defined sentinel structure, so downstream code gets a predictable result before calling np.min/np.max/np.mean/np.std/np.percentile; update the function body around compute_statistics and the quantile logic to early-return or raise accordingly.
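A self-contained version with the guard applied (an illustrative reimplementation, not the module's exact code; the sentinel shape and `"N/A"` values are assumptions borrowed from the diff above):

```python
import numpy as np


def compute_statistics(data, quantiles=(0.25, 0.5, 0.75)):
    # Empty input returns a sentinel dict instead of letting NumPy
    # raise an opaque error from min/max/mean on an empty array.
    arr = np.asarray(data, dtype=float)
    if arr.size == 0:
        return {"min": "N/A", "max": "N/A", "mean": "N/A", "std": "N/A", "quantiles": {}}
    return {
        "min": float(arr.min()),
        "max": float(arr.max()),
        "mean": float(arr.mean()),
        "std": float(arr.std()),
        "quantiles": {q: float(np.quantile(arr, q)) for q in quantiles},
    }
```

Callers can then check for the sentinel (or for an empty `quantiles` dict) rather than wrapping every call in try/except.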
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@examples/specdec_bench/specdec_bench/metrics/timing.py`:
- Around line 49-54: Guard calls to compute_statistics for empty input arrays:
before calling compute_statistics for e2e_time, ttft_time, tpot_time,
gen_tp_time and total_tokens, check that the corresponding list/array is
non-empty (and for tpot_time also keep existing conditional). If empty, assign a
sensible fallback (e.g., None or an empty statistics dict) to self.out keys
("E2E Request Time", "TTFT Time", "Request Generation Step Time", "Request
Generation Tokens Per Second", "Number of Output Tokens") instead of calling
compute_statistics; otherwise call compute_statistics as before. This change
touches the block that assigns into self.out and references variables e2e_time,
ttft_time, tpot_time, gen_tp_time, and total_tokens.
---
Duplicate comments:
In `@examples/specdec_bench/specdec_bench/models/vllm.py`:
- Around line 76-88: The code currently hardcodes trust_remote_code=True when
constructing AsyncEngineArgs; change this to accept a caller-configurable
parameter defaulting to False by reading a new or existing argument (e.g.,
trust_remote_code = kwargs.pop("trust_remote_code", False)) and pass that
variable into AsyncEngineArgs instead of the literal True; update any callers or
documentation if needed and ensure kwargs.pop("trust_remote_code", False) is
used so the default remains False and callers can opt in.
---
Nitpick comments:
In `@examples/specdec_bench/specdec_bench/metrics/timing.py`:
- Around line 63-77: compute_statistics currently lets NumPy raise an opaque
error on empty input; add an explicit empty-data guard at the start of
compute_statistics (check if data is empty via len(data) or data.size) and
either raise a clear ValueError (e.g. "compute_statistics: input data is empty")
or return a defined sentinel structure, so downstream code gets a predictable
result before calling np.min/np.max/np.mean/np.std/np.percentile; update the
function body around compute_statistics and the quantile logic to early-return
or raise accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ab43911b-151a-45c5-9d88-f3af34de4f58
📒 Files selected for processing (30)
- examples/specdec_bench/README.md
- examples/specdec_bench/SPECBENCH_PORTING.md
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/run.py
- examples/specdec_bench/specdec_bench/datasets/__init__.py
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/specdec_bench/datasets/humaneval.py
- examples/specdec_bench/specdec_bench/datasets/mtbench.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
- examples/specdec_bench/specdec_bench/datasets/speed.py
- examples/specdec_bench/specdec_bench/metrics/__init__.py
- examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
- examples/specdec_bench/specdec_bench/metrics/base.py
- examples/specdec_bench/specdec_bench/metrics/mtbench.py
- examples/specdec_bench/specdec_bench/metrics/specbench.py
- examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
- examples/specdec_bench/specdec_bench/metrics/timing.py
- examples/specdec_bench/specdec_bench/models/__init__.py
- examples/specdec_bench/specdec_bench/models/auto_deploy.py
- examples/specdec_bench/specdec_bench/models/base.py
- examples/specdec_bench/specdec_bench/models/sglang.py
- examples/specdec_bench/specdec_bench/models/specbench_medusa.py
- examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
- examples/specdec_bench/specdec_bench/models/vllm.py
- examples/specdec_bench/specdec_bench/run_utils.py
- examples/specdec_bench/specdec_bench/runners/__init__.py
- examples/specdec_bench/specdec_bench/runners/base.py
- examples/specdec_bench/specdec_bench/runners/simple.py
- examples/specdec_bench/specdec_bench/utils.py
- examples/specdec_bench/sweep_example.yaml
✅ Files skipped from review due to trivial changes (6)
- examples/specdec_bench/specdec_bench/models/specbench_medusa.py
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/specdec_bench/datasets/__init__.py
- examples/specdec_bench/sweep_example.yaml
- examples/specdec_bench/specdec_bench/metrics/__init__.py
🚧 Files skipped from review as they are similar to previous changes (14)
- examples/specdec_bench/specdec_bench/runners/simple.py
- examples/specdec_bench/specdec_bench/runners/__init__.py
- examples/specdec_bench/SPECBENCH_PORTING.md
- examples/specdec_bench/specdec_bench/models/auto_deploy.py
- examples/specdec_bench/specdec_bench/models/__init__.py
- examples/specdec_bench/specdec_bench/datasets/mtbench.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
- examples/specdec_bench/specdec_bench/runners/base.py
- examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
- examples/specdec_bench/specdec_bench/metrics/mtbench.py
- examples/specdec_bench/specdec_bench/metrics/specbench.py
- examples/specdec_bench/specdec_bench/metrics/base.py
- examples/specdec_bench/specdec_bench/utils.py
- examples/specdec_bench/specdec_bench/models/sglang.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1272      +/-   ##
==========================================
+ Coverage   75.58%   77.04%   +1.46%
==========================================
  Files         459      459
  Lines       48613    48613
==========================================
+ Hits        36745    37455     +710
+ Misses      11868    11158     -710
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
64bfe8d to 0408d02
Actionable comments posted: 3
♻️ Duplicate comments (5)
examples/specdec_bench/specdec_bench/models/vllm.py (2)
`76-88`: ⚠️ Potential issue | 🔴 Critical

Do not hardcode `trust_remote_code=True` in engine args. Line 78 forces remote code execution for model loading. Make this caller-configurable and default to `False`.

🔒 Proposed fix

```diff
 class VLLMModel(Model):
     def __init__(self, model_dir, max_concurrent_requests, sampling_kwargs, **kwargs):
+        trust_remote_code = kwargs.pop("trust_remote_code", False)
         specdec = None
         speculative_algorithm = kwargs.pop("speculative_algorithm", None)
         ...
         engine_args = AsyncEngineArgs(
             model=model_dir,
-            trust_remote_code=True,
+            trust_remote_code=trust_remote_code,
             tensor_parallel_size=kwargs.pop("tensor_parallel_size", 1),
             ...
             **kwargs,
         )
```

As per coding guidelines: "Be especially careful with remote-code execution flags: do not hardcode trust_remote_code=True."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/models/vllm.py` around lines 76 - 88, The code hardcodes trust_remote_code=True in the AsyncEngineArgs instantiation (engine_args) which forces remote code execution; change it to be caller-configurable and default to False by reading a parameter or kwargs (e.g., use kwargs.pop("trust_remote_code", False) or add a function parameter trust_remote_code=False) when building AsyncEngineArgs in vllm.py so AsyncEngineArgs(..., trust_remote_code=trust_remote_code, ...); ensure callers that rely on the old behavior pass the flag explicitly if needed.
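Stripped of the vLLM specifics, the opt-in pattern the fix describes looks like this; `EngineConfig` and the model path are placeholders, not the actual vLLM API:

```python
class EngineConfig:
    """Illustrative stand-in for engine argument handling."""

    def __init__(self, model_dir, **kwargs):
        # Opt-in only: callers must pass trust_remote_code=True explicitly.
        self.trust_remote_code = kwargs.pop("trust_remote_code", False)
        self.model_dir = model_dir
        self.extra = kwargs  # remaining kwargs forwarded to the engine
```

Because the flag is popped, it is also kept out of the forwarded kwargs, so it cannot be applied twice downstream.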
`73-74`: ⚠️ Potential issue | 🟠 Major

`kwargs.pop("parallel_drafting")` can crash when key is absent. Line 73 can raise `KeyError` if `parallel_drafting` is not provided, which breaks initialization paths that omit this flag.

🐛 Proposed fix

```diff
-    if kwargs.pop("parallel_drafting") and specdec is not None:
+    if kwargs.pop("parallel_drafting", False) and specdec is not None:
         specdec["parallel_drafting"] = True
```
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/models/vllm.py` around lines 73 - 74, The code currently uses kwargs.pop("parallel_drafting") which raises KeyError when the key is absent; change the pop to provide a default (e.g., kwargs.pop("parallel_drafting", False)) or use kwargs.get("parallel_drafting", False) so initialization won't crash, then if that value is truthy and specdec is not None set specdec["parallel_drafting"] = True; update the statement around kwargs.pop("parallel_drafting") in the vllm model initialization to use the safe default.

examples/specdec_bench/specdec_bench/models/sglang.py (2)
`68-70`: ⚠️ Potential issue | 🟠 Major

Replace `assert` with explicit runtime validation. Line 70 uses `assert` for behavior gating. Under optimized Python runs, this check is removed and unsupported multi-beam paths can proceed silently.

♻️ Proposed fix

```diff
     if end_id != -1:
         sampling_config["stop_token_ids"] = [end_id]
-    assert sampling_config.get("beam_width", 1) == 1
+    if sampling_config.get("beam_width", 1) != 1:
+        raise ValueError("SGLANGModel only supports beam_width=1")
```
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/models/sglang.py` around lines 68 - 70, Replace the assert on beam width with an explicit runtime check: retrieve beam_width via sampling_config.get("beam_width", 1) in the same block (near end_id handling) and if it is not equal to 1 raise a clear exception (e.g., ValueError or RuntimeError) with a descriptive message explaining that multi-beam decoding is unsupported; reference the sampling_config and beam_width symbols and keep the check next to the existing end_id logic so unsupported configurations fail at runtime instead of being optimized away.
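The same check written to survive `python -O`, as a standalone sketch (the function name is illustrative; the real code performs this inline):

```python
def validate_sampling_config(sampling_config: dict) -> None:
    # assert statements are stripped under -O, so raise explicitly instead.
    beam_width = sampling_config.get("beam_width", 1)
    if beam_width != 1:
        raise ValueError(f"Only beam_width=1 is supported, got {beam_width}")
```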
`40-44`: ⚠️ Potential issue | 🔴 Critical

Do not hardcode `trust_remote_code=True` in `sgl.Engine(...)`. Line 43 enables remote code execution unconditionally. Make it caller-configurable with default `False`.

🔒 Proposed fix

```diff
 self.model = sgl.Engine(
     model_path=model_dir,
     skip_tokenizer_init=True,
-    trust_remote_code=True,
+    trust_remote_code=kwargs.pop("trust_remote_code", False),
     mem_fraction_static=0.8,
     disable_overlap_schedule=kwargs.pop("disable_overlap_schedule", False),
```

As per coding guidelines: "Be especially careful with remote-code execution flags: do not hardcode trust_remote_code=True."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/models/sglang.py` around lines 40 - 44, The code currently hardcodes trust_remote_code=True when instantiating sgl.Engine (see self.model = sgl.Engine(..., trust_remote_code=True, ...)); make this flag caller-configurable instead: add a parameter (e.g., trust_remote_code: bool = False) to the surrounding class or function signature and pass that parameter into sgl.Engine, defaulting to False, so callers can opt in to remote-code execution; ensure any code paths that construct self.model (using model_path, skip_tokenizer_init, mem_fraction_static) use the new parameter rather than the literal True.

examples/specdec_bench/specdec_bench/datasets/speed.py (1)
`477-480`: ⚠️ Potential issue | 🟠 Major

Clamp `num_samples` to avoid `IndexError` after category filtering. When `category` filtering reduces the dataset below `self.num_samples`, `dataset.select(range(self.num_samples))` will raise an `IndexError`. Clamp to the available rows.

Proposed fix

```diff
 if category is not None:
     dataset = dataset.filter(lambda example: example["category"] == category, desc=f"Filtering for category {category}")
 if self.num_samples is not None:
-    dataset = dataset.select(range(self.num_samples))
+    dataset = dataset.select(range(min(self.num_samples, len(dataset))))
 return dataset
```
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/datasets/speed.py` around lines 477 - 480, After filtering by category, clamp self.num_samples to the filtered dataset size before calling dataset.select to avoid IndexError; i.e., compute an effective_count from the filtered dataset (e.g., using its length or num_rows) and use range(min(self.num_samples, effective_count)) in the call to dataset.select rather than range(self.num_samples), updating the block that uses category, dataset.filter, and dataset.select accordingly.
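The clamping idea in isolation, with a plain list standing in for the `datasets.Dataset` object (whose `select` raises on out-of-range indices):

```python
def select_samples(rows, num_samples=None):
    # Clamp so a category filter that shrinks the dataset below
    # num_samples cannot trigger an out-of-range selection.
    if num_samples is None:
        return rows
    return rows[: min(num_samples, len(rows))]
```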
🧹 Nitpick comments (5)
examples/specdec_bench/README.md (2)
`224-224`: Add a trailing newline at end of file. Per markdownlint MD047, files should end with a single newline character.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/README.md` at line 224, The README.md in the examples/specdec_bench example is missing a trailing newline; open examples/specdec_bench/README.md and ensure the file ends with exactly one newline character (no extra blank lines), then save the file so the file terminates with a single '\n' to satisfy markdownlint MD047.
`209-215`: Add a language specifier to the fenced code block. The output directory structure example should have a language specifier (e.g., `text` or `plaintext`) for consistency and to satisfy linter rules.

````diff
-```
+```text
 my_sweep_results/
 000_speed_c32/ # first entry, concurrency=32
 001_speed_c1/ # second entry, concurrency=1
 001_speed_c2/ # second entry, concurrency=2
 ...
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/README.md` around lines 209 - 215, Add a language specifier to the fenced code block showing the output directory structure in the README: modify the triple-backtick fence that wraps the my_sweep_results/ example (the directory tree block) to include a language token such as "text" or "plaintext" (e.g., ```text) so the code block is explicitly labeled and satisfies the linter and formatting consistency.

examples/specdec_bench/specdec_bench/runners/base.py (1)

`23-25`: Consider removing the redundant `_ensure_metric_records` guard. Since `__init__` already initializes `self.metric_records = []` at line 21, the `hasattr` check is unnecessary unless subclasses are expected to skip calling `super().__init__()`. If that's intentional for defensive coding, consider adding a brief comment explaining why.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/runners/base.py` around lines 23 - 25, The _ensure_metric_records method redundantly checks hasattr(self, "metric_records") despite __init__ already setting self.metric_records = []; remove the guard (or remove the whole _ensure_metric_records method) so callers can rely on __init__'s initialization, or if subclasses may intentionally skip super().__init__() add a brief comment in the class or above _ensure_metric_records explaining that defensive check is required—refer to __init__ and _ensure_metric_records to locate the code.

examples/specdec_bench/specdec_bench/run_utils.py (1)

`379-391`: Minor: The second `asyncio.gather(*tasks)` after `queue.join()` may be redundant. Once `queue.join()` completes and all items are processed, the worker loops exit naturally (via `QueueEmpty`). The first `asyncio.gather(*tasks)` at line 384 should already await their completion. The cancellation logic in `finally` is appropriate for handling interruptions, but under normal flow the tasks will have already finished. This is not a bug—just a minor observation that the flow is slightly verbose.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/run_utils.py` around lines 379 - 391, Remove the redundant await asyncio.gather(*tasks) that immediately follows await queue.join(); since worker_loop tasks exit naturally once the queue is drained, waiting on queue.join() is sufficient to ensure items are processed—keep the tasks list and the finally cancellation/cleanup (including worker_loop, queue.join, and the finally block with task.cancel() and asyncio.gather(..., return_exceptions=True)) but delete the extra asyncio.gather(*tasks) after queue.join().

examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py (1)

`61-65`: Consider documenting or adjusting the boundary chunk handling. When the thinking end token is detected, the chunk containing it is skipped entirely (via `continue`) and not counted toward either thinking or non-thinking lengths. If this is intentional (treating the boundary as transitional), a brief comment would clarify the design. Otherwise, consider splitting the chunk's acceptance length proportionally.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py` around lines 61 - 65, The current branch in the block that checks self._contains_sequence(output_id_iter, self.thinking_end_token) simply continues and omits the chunk containing the end token from any length counts; either clarify intent by adding a short comment above that continue (mentioning that boundary chunks are intentionally treated as transitional and excluded), or implement boundary-splitting: detect the token position within the current output chunk via output_id_iter (use the index/offset returned by _contains_sequence or compute it), count the IDs before the token toward the current thinking length, set prompt_state[request_id][turn_id]["in_thinking"]=False and in_thinking=False, then process the IDs after the token as non-thinking (do not use continue so the remainder is counted). Ensure you update the same variables (output_id_iter, in_thinking, prompt_state entries) and preserve existing length-accumulation logic when splitting.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.
Inline comments:
In @examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py:
- Around line 29-40: The _contains_sequence method can fail when token_sequence
is a tuple because token_list slices produce lists; normalize types before
comparing by converting token_sequence to the same type as the slice (e.g.,
list(token_sequence)) or convert the slice to a tuple; update _contains_sequence
to coerce token_sequence when it is a sequence (but not an int) and then perform
the equality check using the normalized types so list slices vs tuple sequences
no longer mismatch.

In
@examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py:
- Around line 62-63: Change the direct indexing of sampling_kwargs to use .get()
with a default so missing keys don't raise KeyError: replace the use of
sampling_kwargs["temperature"] in the block that sets
kwargs["allow_advanced_sampling"] with sampling_kwargs.get("temperature", 0.0)
(this mirrors the defensive pattern used by check_sampling_config and other
model files) so that when temperature is absent it defaults to 0.0 and preserves
the intended behavior.

In
@examples/specdec_bench/specdec_bench/models/vllm.py:
- Around line 128-132: The stop() method currently suppresses all exceptions
which hides real shutdown errors and masks asyncio.run()'s RuntimeError when
called inside a running loop; change it to explicitly handle RuntimeError from
asyncio.run(self.model.shutdown()) by detecting a running loop and scheduling
the coroutine (e.g., via
asyncio.get_running_loop().create_task(self.model.shutdown()) or
asyncio.create_task) instead of calling asyncio.run, and for all other
exceptions catch and log (or re-raise) the error rather than silencing it;
update the stop() implementation around self.model.shutdown() and asyncio.run()
accordingly.
Duplicate comments:
In @examples/specdec_bench/specdec_bench/datasets/speed.py:
- Around line 477-480: After filtering by category, clamp self.num_samples to
the filtered dataset size before calling dataset.select to avoid IndexError;
i.e., compute an effective_count from the filtered dataset (e.g., using its
length or num_rows) and use range(min(self.num_samples, effective_count)) in the
call to dataset.select rather than range(self.num_samples), updating the block
that uses category, dataset.filter, and dataset.select accordingly.

In
@examples/specdec_bench/specdec_bench/models/sglang.py:
- Around line 68-70: Replace the assert on beam width with an explicit runtime
check: retrieve beam_width via sampling_config.get("beam_width", 1) in the same
block (near end_id handling) and if it is not equal to 1 raise a clear exception
(e.g., ValueError or RuntimeError) with a descriptive message explaining that
multi-beam decoding is unsupported; reference the sampling_config and beam_width
symbols and keep the check next to the existing end_id logic so unsupported
configurations fail at runtime instead of being optimized away.
- Around line 40-44: The code currently hardcodes trust_remote_code=True when
instantiating sgl.Engine (see self.model = sgl.Engine(...,
trust_remote_code=True, ...)); make this flag caller-configurable instead: add a
parameter (e.g., trust_remote_code: bool = False) to the surrounding class or
function signature and pass that parameter into sgl.Engine, defaulting to False,
so callers can opt in to remote-code execution; ensure any code paths that
construct self.model (using model_path, skip_tokenizer_init,
mem_fraction_static) use the new parameter rather than the literal True.In
@examples/specdec_bench/specdec_bench/models/vllm.py:
- Around line 76-88: The code hardcodes trust_remote_code=True in the
AsyncEngineArgs instantiation (engine_args) which forces remote code execution;
change it to be caller-configurable and default to False by reading a parameter
or kwargs (e.g., use kwargs.pop("trust_remote_code", False) or add a function
parameter trust_remote_code=False) when building AsyncEngineArgs in vllm.py so
AsyncEngineArgs(..., trust_remote_code=trust_remote_code, ...); ensure callers
that rely on the old behavior pass the flag explicitly if needed.
- Around line 73-74: The code currently uses kwargs.pop("parallel_drafting")
which raises KeyError when the key is absent; change the pop to provide a
default (e.g., kwargs.pop("parallel_drafting", False)) or use
kwargs.get("parallel_drafting", False) so initialization won't crash, then if
that value is truthy and specdec is not None set specdec["parallel_drafting"] =
True; update the statement around kwargs.pop("parallel_drafting") in the vllm
model initialization to use the safe default.
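In isolation, the safe-default behavior described above reduces to this sketch (the `configure` helper is hypothetical; the real code does this inline during model init):

```python
def configure(specdec=None, **kwargs):
    # Defaulting the pop to False means omitting the flag cannot raise KeyError.
    if kwargs.pop("parallel_drafting", False) and specdec is not None:
        specdec["parallel_drafting"] = True
    return specdec
```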
Nitpick comments:
In @examples/specdec_bench/README.md:
- Line 224: The README.md in the examples/specdec_bench example is missing a
trailing newline; open examples/specdec_bench/README.md and ensure the file ends
with exactly one newline character (no extra blank lines), then save the file so
the file terminates with a single '\n' to satisfy markdownlint MD047.
- Around line 209-215: Add a language specifier to the fenced code block showing
the output directory structure in the README: modify the triple-backtick fence
that wraps the my_sweep_results/ example (the directory tree block) to include a
language token such as "text" or "plaintext" (e.g., ```text) so the code block
is explicitly labeled and satisfies the linter and formatting consistency.

In
@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py:
- Around line 61-65: The current branch in the block that checks
self._contains_sequence(output_id_iter, self.thinking_end_token) simply
continues and omits the chunk containing the end token from any length counts;
either clarify intent by adding a short comment above that continue (mentioning
that boundary chunks are intentionally treated as transitional and excluded), or
implement boundary-splitting: detect the token position within the current
output chunk via output_id_iter (use the index/offset returned by
_contains_sequence or compute it), count the IDs before the token toward the
current thinking length, set
prompt_state[request_id][turn_id]["in_thinking"]=False and in_thinking=False,
then process the IDs after the token as non-thinking (do not use continue so the
remainder is counted). Ensure you update the same variables (output_id_iter,
in_thinking, prompt_state entries) and preserve existing length-accumulation
logic when splitting.In
@examples/specdec_bench/specdec_bench/run_utils.py:
- Around line 379-391: Remove the redundant await asyncio.gather(*tasks) that
immediately follows await queue.join(); since worker_loop tasks exit naturally
once the queue is drained, waiting on queue.join() is sufficient to ensure items
are processed—keep the tasks list and the finally cancellation/cleanup
(including worker_loop, queue.join, and the finally block with task.cancel() and
asyncio.gather(..., return_exceptions=True)) but delete the extra
asyncio.gather(*tasks) after queue.join().

In
@examples/specdec_bench/specdec_bench/runners/base.py:
- Around line 23-25: The _ensure_metric_records method redundantly checks
hasattr(self, "metric_records") despite __init__ already setting
self.metric_records = []; remove the guard (or remove the whole
_ensure_metric_records method) so callers can rely on __init__'s initialization,
or if subclasses may intentionally skip super().__init__() add a brief comment
in the class or above _ensure_metric_records explaining that defensive check is
required—refer to __init__ and _ensure_metric_records to locate the code.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 593d9630-5590-4507-b80c-a05eb1e75350

📥 Commits

Reviewing files that changed from the base of the PR and between 64bfe8d4390014c0d9303d3f0f3c5a641540e244 and 0408d02ef8caa2f950f69cb4a4e58261b7aa396e.

📒 Files selected for processing (31)

- examples/specdec_bench/README.md
- examples/specdec_bench/SPECBENCH_PORTING.md
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/requirements_speed.txt
- examples/specdec_bench/run.py
- examples/specdec_bench/specdec_bench/datasets/__init__.py
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/specdec_bench/datasets/humaneval.py
- examples/specdec_bench/specdec_bench/datasets/mtbench.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
- examples/specdec_bench/specdec_bench/datasets/speed.py
- examples/specdec_bench/specdec_bench/metrics/__init__.py
- examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
- examples/specdec_bench/specdec_bench/metrics/base.py
- examples/specdec_bench/specdec_bench/metrics/mtbench.py
- examples/specdec_bench/specdec_bench/metrics/specbench.py
- examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
- examples/specdec_bench/specdec_bench/metrics/timing.py
- examples/specdec_bench/specdec_bench/models/__init__.py
- examples/specdec_bench/specdec_bench/models/auto_deploy.py
- examples/specdec_bench/specdec_bench/models/base.py
- examples/specdec_bench/specdec_bench/models/sglang.py
- examples/specdec_bench/specdec_bench/models/specbench_medusa.py
- examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
- examples/specdec_bench/specdec_bench/models/vllm.py
- examples/specdec_bench/specdec_bench/run_utils.py
- examples/specdec_bench/specdec_bench/runners/__init__.py
- examples/specdec_bench/specdec_bench/runners/base.py
- examples/specdec_bench/specdec_bench/runners/simple.py
- examples/specdec_bench/specdec_bench/utils.py
- examples/specdec_bench/sweep_example.yaml

✅ Files skipped from review due to trivial changes (6)

- examples/specdec_bench/requirements_speed.txt
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/sweep_example.yaml
- examples/specdec_bench/specdec_bench/datasets/__init__.py
- examples/specdec_bench/specdec_bench/models/base.py

🚧 Files skipped from review as they are similar to previous changes (10)

- examples/specdec_bench/specdec_bench/models/specbench_medusa.py
- examples/specdec_bench/SPECBENCH_PORTING.md
- examples/specdec_bench/specdec_bench/runners/__init__.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
- examples/specdec_bench/specdec_bench/models/__init__.py
- examples/specdec_bench/specdec_bench/metrics/timing.py
- examples/specdec_bench/specdec_bench/utils.py
- examples/specdec_bench/specdec_bench/datasets/humaneval.py
- examples/specdec_bench/run.py
- examples/specdec_bench/specdec_bench/metrics/mtbench.py
```python
def _contains_sequence(self, token_list, token_sequence):
    if isinstance(token_sequence, int):
        return token_sequence in token_list

    seq_len = len(token_sequence)
    if seq_len == 0:
        return False

    return any(
        token_list[i : i + seq_len] == token_sequence
        for i in range(len(token_list) - seq_len + 1)
    )
```
Type comparison may fail for tuple sequences.
If thinking_end_token is passed as a tuple (e.g., (1, 2, 3)), the comparison token_list[i : i + seq_len] == token_sequence will fail because slicing a list returns a list, not a tuple. Consider converting to list or using a consistent type.
Proposed fix

```diff
 def _contains_sequence(self, token_list, token_sequence):
     if isinstance(token_sequence, int):
         return token_sequence in token_list
+    token_sequence = list(token_sequence)
     seq_len = len(token_sequence)
     if seq_len == 0:
         return False
     return any(
         token_list[i : i + seq_len] == token_sequence
         for i in range(len(token_list) - seq_len + 1)
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py` around
lines 29 - 40, The _contains_sequence method can fail when token_sequence is a
tuple because token_list slices produce lists; normalize types before comparing
by converting token_sequence to the same type as the slice (e.g.,
list(token_sequence)) or convert the slice to a tuple; update _contains_sequence
to coerce token_sequence when it is a sequence (but not an int) and then perform
the equality check using the normalized types so list slices vs tuple sequences
no longer mismatch.
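A normalized version of the helper, written as a free function so it can run standalone; coercing the needle to a list makes tuple inputs compare equal to list slices:

```python
def contains_sequence(token_list, token_sequence):
    if isinstance(token_sequence, int):
        return token_sequence in token_list
    # Normalize: list slices are lists, so a tuple needle would never match.
    token_sequence = list(token_sequence)
    seq_len = len(token_sequence)
    if seq_len == 0:
        return False
    return any(
        token_list[i : i + seq_len] == token_sequence
        for i in range(len(token_list) - seq_len + 1)
    )
```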
```diff
 def stop(self):
-    try:
-        self.loop.run_until_complete(self.model.shutdown())
-        self.loop.close()
-    except Exception:
-        pass
+    import contextlib
+
+    with contextlib.suppress(Exception):
+        asyncio.run(self.model.shutdown())
```
🧩 Analysis chain
🏁 Script executed:
# First, let's verify the file exists and read the relevant section
head -150 examples/specdec_bench/specdec_bench/models/vllm.py | tail -40

Repository: NVIDIA/Model-Optimizer
Length of output: 1624
🏁 Script executed:
# Also check the full context of the stop() method
cat -n examples/specdec_bench/specdec_bench/models/vllm.py | sed -n '120,145p'

Repository: NVIDIA/Model-Optimizer
Length of output: 615
🏁 Script executed:
# Search for usages of the stop() method to understand how it's called
rg "\.stop\(\)" examples/specdec_bench/ -A 2 -B 2
Repository: NVIDIA/Model-Optimizer
Length of output: 827
🏁 Script executed:
# Check if there are any event loop related patterns in this file
rg "asyncio|event.loop|get_running_loop" examples/specdec_bench/specdec_bench/models/vllm.py -B 2 -A 2
Repository: NVIDIA/Model-Optimizer
Length of output: 223
🏁 Script executed:
# Look at run.py to understand the full context
cat -n examples/specdec_bench/run.py | head -80
Repository: NVIDIA/Model-Optimizer
Length of output: 2986
🏁 Script executed:
# Check if there's an async version of stop or any other patterns
rg -A 5 -B 5 "def stop\(" examples/specdec_bench/specdec_bench/
Repository: NVIDIA/Model-Optimizer
Length of output: 4479
🏁 Script executed:
# Check the runner to understand how stop() is called
cat -n examples/specdec_bench/specdec_bench/runners/base.py | head -100
Repository: NVIDIA/Model-Optimizer
Length of output: 2448
🏁 Script executed:
# Let me check the broader execution flow to see if stop() could realistically be called from async context
rg "await.*stop\(\)|stop\(\)" examples/specdec_bench/ -B 3 -A 3
Repository: NVIDIA/Model-Optimizer
Length of output: 2097
🏁 Script executed:
# Check if there's any exception handling documentation or comments about why broad suppression is used
grep -r "contextlib.suppress" examples/specdec_bench/ -B 5 -A 5
Repository: NVIDIA/Model-Optimizer
Length of output: 1686
Clarify exception handling in stop() to avoid silently masking shutdown failures.
Line 132's broad contextlib.suppress(Exception) hides all exceptions, including legitimate errors in model.shutdown(). Additionally, asyncio.run() raises RuntimeError if called inside an already-running event loop, which the suppression masks. While the current usage (stop called after asyncio.run() completes in run.py) is safe, the code is fragile.
♻️ Proposed fix
def stop(self):
import contextlib
with contextlib.suppress(Exception):
- asyncio.run(self.model.shutdown())
+ try:
+ loop = asyncio.get_running_loop()
+ except RuntimeError:
+ asyncio.run(self.model.shutdown())
+ else:
+ loop.create_task(self.model.shutdown())🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/specdec_bench/models/vllm.py` around lines 128 - 132,
The stop() method currently suppresses all exceptions which hides real shutdown
errors and masks asyncio.run()'s RuntimeError when called inside a running loop;
change it to explicitly handle RuntimeError from
asyncio.run(self.model.shutdown()) by detecting a running loop and scheduling
the coroutine (e.g., via
asyncio.get_running_loop().create_task(self.model.shutdown()) or
asyncio.create_task) instead of calling asyncio.run, and for all other
exceptions catch and log (or re-raise) the error rather than silencing it;
update the stop() implementation around self.model.shutdown() and asyncio.run()
accordingly.
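The loop-aware shutdown pattern recommended above can be sketched in isolation; `ModelStub` and `Runner` here are hypothetical stand-ins, not the benchmark's actual classes:

```python
import asyncio


class ModelStub:
    """Hypothetical stand-in for the vLLM wrapper's model handle."""

    def __init__(self):
        self.shut_down = False

    async def shutdown(self):
        self.shut_down = True


class Runner:
    def __init__(self, model):
        self.model = model

    def stop(self):
        # If no loop is running, drive the coroutine to completion ourselves.
        # Inside a running loop, schedule it instead of calling asyncio.run(),
        # which would raise RuntimeError there.
        try:
            loop = asyncio.get_running_loop()
        except RuntimeError:
            asyncio.run(self.model.shutdown())
        else:
            loop.create_task(self.model.shutdown())


runner = Runner(ModelStub())
runner.stop()  # called outside any event loop → takes the asyncio.run path
print(runner.model.shut_down)  # → True
```

Note that in the running-loop branch the task is only scheduled, so callers inside a loop must await it (or the loop) before relying on shutdown having completed.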
0c60eb3 to ac7ed96 Compare
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/specdec_bench/README.md (2)
209-215: Add a language identifier to the fenced code block. The directory structure example is missing a language specifier. Use `text` or `plaintext` for non-code content to satisfy markdownlint MD040.
-```
+```text
 my_sweep_results/
   000_speed_c32/ # first entry, concurrency=32
   001_speed_c1/ # second entry, concurrency=1
   001_speed_c2/ # second entry, concurrency=2
   ...
+```
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/README.md` around lines 209 - 215, Update the fenced code block that shows the directory structure in README.md to include a language identifier (use "text" or "plaintext") so markdownlint MD040 is satisfied; locate the three-backtick block containing the my_sweep_results/ directory listing and change the opening fence from ``` to ```text and keep the closing ``` intact.
224-224: Add a trailing newline at end of file. Per markdownlint MD047, files should end with a single newline character.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/README.md` at line 224, The README file is missing a trailing newline character; open examples/specdec_bench/README.md and append a single newline at the end of the file so it ends with exactly one trailing newline character to satisfy markdownlint MD047.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/specdec_bench/specdec_bench/metrics/base.py`:
- Around line 27-31: AATiming defines a now-orphaned process_step method in
aa_timing.py; remove that method or move its logic into the new
process_final(self, text_outputs, request_records) implementation so the runner
will call it—update the AATiming class to implement process_final (matching the
base MetricsBase signature) and perform whatever aggregation/recording the old
process_step did using the provided text_outputs and request_records, and keep
clear(self) untouched; reference AATiming and process_final when making the
change.
---
Nitpick comments:
In `@examples/specdec_bench/README.md`:
- Around line 209-215: Update the fenced code block that shows the directory
structure in README.md to include a language identifier (use "text" or
"plaintext") so markdownlint MD040 is satisfied; locate the three-backtick block
containing the my_sweep_results/ directory listing and change the opening fence
from ``` to ```text and keep the closing ``` intact.
- Line 224: The README file is missing a trailing newline character; open
examples/specdec_bench/README.md and append a single newline at the end of the
file so it ends with exactly one trailing newline character to satisfy
markdownlint MD047.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: e6bb222f-16ee-4388-b148-64b972b16550
📒 Files selected for processing (31)
- examples/specdec_bench/README.md
- examples/specdec_bench/SPECBENCH_PORTING.md
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/requirements_speed.txt
- examples/specdec_bench/run.py
- examples/specdec_bench/specdec_bench/datasets/__init__.py
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/specdec_bench/datasets/humaneval.py
- examples/specdec_bench/specdec_bench/datasets/mtbench.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
- examples/specdec_bench/specdec_bench/datasets/speed.py
- examples/specdec_bench/specdec_bench/metrics/__init__.py
- examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
- examples/specdec_bench/specdec_bench/metrics/base.py
- examples/specdec_bench/specdec_bench/metrics/mtbench.py
- examples/specdec_bench/specdec_bench/metrics/specbench.py
- examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
- examples/specdec_bench/specdec_bench/metrics/timing.py
- examples/specdec_bench/specdec_bench/models/__init__.py
- examples/specdec_bench/specdec_bench/models/auto_deploy.py
- examples/specdec_bench/specdec_bench/models/base.py
- examples/specdec_bench/specdec_bench/models/sglang.py
- examples/specdec_bench/specdec_bench/models/specbench_medusa.py
- examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
- examples/specdec_bench/specdec_bench/models/vllm.py
- examples/specdec_bench/specdec_bench/run_utils.py
- examples/specdec_bench/specdec_bench/runners/__init__.py
- examples/specdec_bench/specdec_bench/runners/base.py
- examples/specdec_bench/specdec_bench/runners/simple.py
- examples/specdec_bench/specdec_bench/utils.py
- examples/specdec_bench/sweep_example.yaml
✅ Files skipped from review due to trivial changes (8)
- examples/specdec_bench/requirements_speed.txt
- examples/specdec_bench/specdec_bench/metrics/init.py
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/specdec_bench/datasets/init.py
- examples/specdec_bench/SPECBENCH_PORTING.md
- examples/specdec_bench/sweep_example.yaml
- examples/specdec_bench/specdec_bench/runners/simple.py
- examples/specdec_bench/specdec_bench/metrics/specbench.py
🚧 Files skipped from review as they are similar to previous changes (8)
- examples/specdec_bench/specdec_bench/models/specbench_medusa.py
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/specdec_bench/datasets/mtbench.py
- examples/specdec_bench/specdec_bench/models/init.py
- examples/specdec_bench/specdec_bench/utils.py
- examples/specdec_bench/specdec_bench/runners/init.py
- examples/specdec_bench/specdec_bench/runners/base.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
Actionable comments posted: 5
♻️ Duplicate comments (7)
examples/specdec_bench/specdec_bench/utils.py (1)
23-24: ⚠️ Potential issue | 🔴 Critical — Default
`trust_remote_code` must stay opt-in. Making this helper default to
`True` reopens the remote-code-execution path for every tokenizer load unless each caller remembers to override it. Please keep the default at `False` and require trusted call sites to opt in explicitly.
Proposed fix
-def get_tokenizer(path, trust_remote_code=True):
+def get_tokenizer(path, trust_remote_code=False):
     return AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code)
As per coding guidelines "Flag
`trust_remote_code=True` hardcoded for transformers model or tokenizer loading as CRITICAL security issue. Code should expose it as a caller-configurable parameter defaulting to False, not hardcode True".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/utils.py` around lines 23 - 24, The helper get_tokenizer currently defaults trust_remote_code=True which enables remote code execution by default; change the function signature get_tokenizer(path, trust_remote_code=True) to default trust_remote_code=False, and ensure the call to AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code) uses that parameter so callers must explicitly opt in to trust_remote_code; reference the get_tokenizer function and the AutoTokenizer.from_pretrained call when making this change.examples/specdec_bench/specdec_bench/datasets/humaneval.py (1)
18-22:⚠️ Potential issue | 🔴 Critical
`datasets` fallback is still broken and will crash on the success path. When the import succeeds,
`datasets` is never defined, so Line 34 raises `NameError` before `_preprocess()` runs. Please lazy-load `load_dataset` where it is used and raise a real `ImportError` instead of printing during module import.
Proposed fix
-try:
-    from datasets import load_dataset
-except ImportError:
-    print("datasets is not installed.")
-    datasets = None
-
-
 def format_prompt(prompt: str) -> str:
@@
 class HumanEval(Dataset):
     def __init__(self, path, num_samples=164, **kwargs):
-        if datasets is None:
-            raise ImportError("datasets is not installed.")
         self.data: list[Request] = []  # list of list of questions.
         self.num_samples = num_samples
         self._preprocess(path)

     def _preprocess(self, path: str):
-        dataset = load_dataset(path, split="test")
+        try:
+            from datasets import load_dataset
+        except ImportError as exc:
+            raise ImportError(
+                "HumanEval requires the HuggingFace `datasets` package. Install the appropriate extra first."
+            ) from exc
+        dataset = load_dataset(path, split="test")
         for item in dataset:
             self.data.append(Request(system_prompt=None, turns=[format_prompt(item["prompt"])]))
         self.data = self.data[: self.num_samples]
As per coding guidelines "Gate optional features by install extras ([onnx], [hf], [all]); avoid hard imports of optional dependencies at module level".
Also applies to: 33-41
[onnx],[hf],[all]); avoid hard imports of optional dependencies at module level".Also applies to: 33-41
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/datasets/humaneval.py` around lines 18 - 22, The module currently imports load_dataset at top-level and prints on ImportError, leaving `datasets` undefined on success and causing a NameError when `_preprocess()` or the code that expects `load_dataset` runs; change this by removing the top-level import fallback and instead perform a lazy import of `datasets.load_dataset` inside the function that calls it (reference the call site that uses load_dataset/_preprocess()), and if the import fails raise an ImportError with a clear message about the optional dependency (e.g., suggest installing the `[hf]` or relevant extra) rather than printing; also apply the same lazy-import and proper ImportError handling to the other affected block(s) around the existing load_dataset usage noted in the file (lines referenced in review).examples/specdec_bench/run.py (1)
412-416:⚠️ Potential issue | 🟡 MinorValidate
--runtime_paramsis a mapping right after loading.
yaml.safe_load()also returns scalars and lists. If that happens, the first.get(...)inrun_simple()fails withAttributeErrorinstead of a clear CLI error. Raise a targeted validation error immediately after Line 414.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/run.py` around lines 412 - 416, After loading args.runtime_params with yaml.safe_load in the run.py startup block, validate that args.runtime_params is a mapping (dict) and raise a clear CLI error if it's not; specifically, after yaml.safe_load(...) or the fallback {} ensure isinstance(args.runtime_params, dict) and if not, raise a ValueError or SystemExit with a message indicating "--runtime_params must be a mapping/dictionary" so that run_simple() (which expects a mapping) won't hit AttributeError later.examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py (1)
29-40: ⚠️ Potential issue | 🟡 Minor — Normalize sequence types before the boundary check.
If
`thinking_end_token` is passed as a tuple, Line 38 compares a list slice to a tuple and never matches, so the metric never exits thinking mode. Coerce non-scalar sequences to `list(...)` before scanning.
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py` around lines 29 - 40, In _contains_sequence, coerce non-scalar token_sequence to a list before performing the boundary check so list slices compare correctly to the sequence; specifically, inside the _contains_sequence method (and before computing seq_len), if token_sequence is not an int convert it via list(token_sequence) so tuple inputs (e.g. thinking_end_token) match list slices and the thinking-exit detection works.examples/specdec_bench/specdec_bench/metrics/timing.py (1)
27-40: ⚠️ Potential issue | 🟠 Major — Guard empty timing series before computing TPS and summary stats.
If
`request_records` is empty, or any record has no emitted tokens, Line 38/39 will raise on `min`/`max`, and Line 49/50/54 can still pass empty lists into `compute_statistics()`. With the new record-driven flow this is a real crash path for empty category filters or requests that terminate before generating output. Filter out empty `token_times` first, and return a safe zero/empty summary when nothing remains.
Also applies to: 49-54
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/metrics/timing.py` around lines 27 - 40, In process_final, guard against empty timing series by filtering request_records to only those with non-empty record["token_times"] before computing start_time/end_time (the min/max on timing) and before calling compute_statistics; if no records remain, set self.out to a safe empty summary (e.g., zero TPS and empty stats) and return early. Update the logic that builds timing, total_tokens, e2e_time, ttft_time, tpot_time, and gen_tp_time to operate on the filtered list, and ensure any downstream calls to compute_statistics receive non-empty lists or handle empty inputs safely.examples/specdec_bench/specdec_bench/models/base.py (1)
46-61:⚠️ Potential issue | 🟠 Major
`token_times` is still ambiguous across beams.
`chunk_lengths()` and `output_ids()` are tracked per beam, but `token_times` is a single flat list for the whole collector. Once more than one beam advances, there is no stable way to align timestamps with each beam's chunks, so downstream latency metrics become incorrect. Store timestamps on `_BeamState` or explicitly reject multi-beam usage here.
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/models/base.py` around lines 46 - 61, The collector's token_times is global and ambiguous across beams; modify CumulativeTokenCollector and _BeamState so timestamps are tracked per beam: add a token_times list attribute to _BeamState, in CumulativeTokenCollector.add_update append timestamp to beam_state.token_times instead of self.token_times, and remove or deprecate CumulativeTokenCollector.token_times; update any code that reads token_times to use the per-beam _BeamState.token_times (or alternatively, if multi-beam support is undesired, detect multiple beam_idx values in add_update and raise an error).examples/specdec_bench/specdec_bench/metrics/mtbench.py (1)
34-50: ⚠️ Potential issue | 🟠 Major — Don't assume both turns always contain acceptance data.
With the new record-driven path,
`turns[0]`/`turns[1]` can be missing or empty even when the request object itself has two turns, and Line 50 also divides by zero when no request metrics were produced. Aggregate whatever turn buckets are present and guard the empty case instead of indexing both turns unconditionally.
🧹 Nitpick comments (2)
examples/specdec_bench/specdec_bench/run_utils.py (2)
351-393: Worker exceptions are silently stored rather than propagated. The
`gather_limited` function stores exceptions as results (lines 372-373) instead of propagating them. This is a valid design choice for batch processing, but callers must explicitly check if any result is an `Exception` instance. Consider either:
- Adding a docstring noting this behavior, or
- Raising an aggregate error after all workers complete if any failed
📝 Optional: Add documentation about exception handling
 async def gather_limited(items, worker, concurrency, show_progress=False, progress_desc=None):
+    """Run worker on items with bounded concurrency, preserving order.
+
+    Returns:
+        List of results in input order. If a worker raises an exception,
+        the exception object is stored at that index instead of a result.
+        Callers should check `isinstance(result, Exception)` for each item.
+    """
     if concurrency <= 0:
         raise ValueError("concurrency must be > 0")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/run_utils.py` around lines 351 - 393, The gather_limited function currently stores exceptions in the results list instead of propagating them; either document this behavior in the gather_limited docstring or change the post-processing to raise an aggregate error if any worker failed: collect any Exception instances from results after queue.join(), and if any exist raise an ExceptionGroup (or a custom AggregateError) with those exceptions (reference gather_limited, worker_loop, and results) so callers don't silently miss failures; ensure progress cleanup and task cancellation behavior is preserved when raising.
230-237: Consider explicit type conversion error handling for `random_isl` fallback. Line 232 uses
`int(dataset_path)` to derive `random_isl` when a dataset path is provided for the `random` dataset. If `dataset_path` is not a valid integer string, this raises an unhelpful `ValueError`. Consider wrapping with a clearer error message:
♻️ Suggested improvement
 if random_isl is None and dataset_path is not None:
-    random_isl = int(dataset_path)
+    try:
+        random_isl = int(dataset_path)
+    except ValueError:
+        raise ValueError(
+            f"For 'random' dataset, dataset_path must be an integer (got '{dataset_path}')"
+        ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/specdec_bench/specdec_bench/run_utils.py` around lines 230 - 237, In the dataset_name == "random" branch, avoid allowing int(dataset_path) to raise a vague ValueError: wrap the conversion of dataset_path to int for random_isl in a try/except around the int(...) call (the variables to modify are random_isl and dataset_path in that block), and on failure raise a new ValueError that includes the invalid dataset_path and guidance like "invalid --random_isl or dataset_path; expected a positive integer"; keep the existing check that random_isl is not None and that int(random_isl) > 0 after conversion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/specdec_bench/README.md`:
- Around line 209-215: The fenced directory-tree code blocks in README.md (the
block showing "my_sweep_results/" and the other at the later location) are
missing a language tag and the file is missing a trailing newline; add a
language identifier (e.g., change the opening fence to ```text) for both
directory-tree fences and ensure the README.md ends with a newline so the MD040
and MD047 markdownlint rules are satisfied.
In `@examples/specdec_bench/specdec_bench/metrics/specbench.py`:
- Around line 49-58: The code assumes acceptance data exists for every request
and calls mean(list(chain(*turns))) which will raise on empty data; update the
block that reads prompt_ar and computes Request_AL to use
prompt_ar.get(request_id, {}) when accessing per-request acceptance records,
flatten turns and check if the resulting list is empty before calling mean, and
either set self.out["Request_AL"][request.question_id] = 0.0 (or skip
assignment) when no accepted lengths exist; keep the existing loop calling
_get_lengths(turn, lengths) for side-effects but ensure it only runs when turns
contains data.
- Around line 99-108: The table and other artifacts still use the label
"Acceptance Rate" but the data are acceptance lengths (keys like Category_AL,
Request_AL, Average_AL); update all human-facing labels and filenames that
contain "Acceptance Rate" to "Acceptance Length" (or "Acceptance Lengths") so
they match the data: change the table title argument currently "Acceptance Rate
Results" to "Acceptance Length Results", update any column/row labels (e.g., the
overall row label) to mention Acceptance Length/Average Acceptance Length, and
search the surrounding code (including where figure titles, axis labels, and
output filenames are built between the areas around the table and lines
~123-223) and replace "Acceptance Rate" occurrences with the new wording while
leaving keys like Category_AL, Request_AL, Average_AL unchanged.
In `@examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py`:
- Around line 64-115: The code silently treats speculative_algorithm "EAGLE" as
unsupported (specdec becomes None) though callers (e.g., create_executor())
expect EAGLE to be handled; fix by explicitly handling "EAGLE" in the
speculative_algorithm dispatch: either reintroduce the EAGLE branch that
constructs the appropriate decoding config (import and instantiate the correct
Eagle/Eagle3 decoding config such as Eagle3DecodingConfig and assign to specdec,
using the same kwargs pattern as the other branches) or raise a clear ValueError
when speculative_algorithm == "EAGLE" to fail fast. Update the logic around
speculative_algorithm/specdec in trtllm_torch_api.py (referencing
speculative_algorithm, specdec, and create_executor()) so EAGLE is never
silently ignored.
In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 92-103: The dump_env function currently persists vars(args) and
sys.argv verbatim which can leak secrets; update dump_env to sanitize sensitive
keys before writing: copy vars(args) into config as now, but remove or replace
values for keys matching patterns like "token", "api_key", "apikey", "password",
"secret", "key", "access_token" (case-insensitive) and any other configured
sensitive names in overrides, and do not write raw sys.argv — either omit argv
entirely or write a redacted argv where flag values are replaced with
"<REDACTED>" while keeping flag names; keep the rest of the behavior
(engine_version via _get_engine_version, gpu via _get_gpu_name, os.makedirs,
json.dump) unchanged and apply changes in the dump_env function and the local
variable config/argv handling.
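The key-pattern redaction described in this item can be sketched as a small helper. The pattern list below is illustrative, not the repo's actual configuration:

```python
import re

# Illustrative patterns; a real deployment would tune this list.
SENSITIVE = re.compile(r"(token|api_?key|password|secret)", re.IGNORECASE)


def redact_config(config: dict) -> dict:
    """Return a copy of config with likely-secret values masked."""
    return {
        k: "<REDACTED>" if SENSITIVE.search(k) else v
        for k, v in config.items()
    }


print(redact_config({"model_path": "/m", "hf_token": "abc", "api_key": "xyz"}))
# → {'model_path': '/m', 'hf_token': '<REDACTED>', 'api_key': '<REDACTED>'}
```

Matching on key names rather than values keeps the dump deterministic, at the cost of missing secrets passed under unconventional key names.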
---
Duplicate comments:
In `@examples/specdec_bench/run.py`:
- Around line 412-416: After loading args.runtime_params with yaml.safe_load in
the run.py startup block, validate that args.runtime_params is a mapping (dict)
and raise a clear CLI error if it's not; specifically, after yaml.safe_load(...)
or the fallback {} ensure isinstance(args.runtime_params, dict) and if not,
raise a ValueError or SystemExit with a message indicating "--runtime_params
must be a mapping/dictionary" so that run_simple() (which expects a mapping)
won't hit AttributeError later.
In `@examples/specdec_bench/specdec_bench/datasets/humaneval.py`:
- Around line 18-22: The module currently imports load_dataset at top-level and
prints on ImportError, leaving `datasets` undefined on success and causing a
NameError when `_preprocess()` or the code that expects `load_dataset` runs;
change this by removing the top-level import fallback and instead perform a lazy
import of `datasets.load_dataset` inside the function that calls it (reference
the call site that uses load_dataset/_preprocess()), and if the import fails
raise an ImportError with a clear message about the optional dependency (e.g.,
suggest installing the `[hf]` or relevant extra) rather than printing; also
apply the same lazy-import and proper ImportError handling to the other affected
block(s) around the existing load_dataset usage noted in the file (lines
referenced in review).
In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py`:
- Around line 29-40: In _contains_sequence, coerce non-scalar token_sequence to
a list before performing the boundary check so list slices compare correctly to
the sequence; specifically, inside the _contains_sequence method (and before
computing seq_len), if token_sequence is not an int convert it via
list(token_sequence) so tuple inputs (e.g. thinking_end_token) match list slices
and the thinking-exit detection works.
In `@examples/specdec_bench/specdec_bench/metrics/timing.py`:
- Around line 27-40: In process_final, guard against empty timing series by
filtering request_records to only those with non-empty record["token_times"]
before computing start_time/end_time (the min/max on timing) and before calling
compute_statistics; if no records remain, set self.out to a safe empty summary
(e.g., zero TPS and empty stats) and return early. Update the logic that builds
timing, total_tokens, e2e_time, ttft_time, tpot_time, and gen_tp_time to operate
on the filtered list, and ensure any downstream calls to compute_statistics
receive non-empty lists or handle empty inputs safely.
In `@examples/specdec_bench/specdec_bench/models/base.py`:
- Around line 46-61: The collector's token_times is global and ambiguous across
beams; modify CumulativeTokenCollector and _BeamState so timestamps are tracked
per beam: add a token_times list attribute to _BeamState, in
CumulativeTokenCollector.add_update append timestamp to beam_state.token_times
instead of self.token_times, and remove or deprecate
CumulativeTokenCollector.token_times; update any code that reads token_times to
use the per-beam _BeamState.token_times (or alternatively, if multi-beam support
is undesired, detect multiple beam_idx values in add_update and raise an error).
In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 23-24: The helper get_tokenizer currently defaults
trust_remote_code=True which enables remote code execution by default; change
the function signature get_tokenizer(path, trust_remote_code=True) to default
trust_remote_code=False, and ensure the call to
AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code) uses
that parameter so callers must explicitly opt in to trust_remote_code; reference
the get_tokenizer function and the AutoTokenizer.from_pretrained call when
making this change.
---
Nitpick comments:
In `@examples/specdec_bench/specdec_bench/run_utils.py`:
- Around line 351-393: The gather_limited function currently stores exceptions
in the results list instead of propagating them; either document this behavior
in the gather_limited docstring or change the post-processing to raise an
aggregate error if any worker failed: collect any Exception instances from
results after queue.join(), and if any exist raise an ExceptionGroup (or a
custom AggregateError) with those exceptions (reference gather_limited,
worker_loop, and results) so callers don't silently miss failures; ensure
progress cleanup and task cancellation behavior is preserved when raising.
- Around line 230-237: In the dataset_name == "random" branch, avoid allowing
int(dataset_path) to raise a vague ValueError: wrap the conversion of
dataset_path to int for random_isl in a try/except around the int(...) call (the
variables to modify are random_isl and dataset_path in that block), and on
failure raise a new ValueError that includes the invalid dataset_path and
guidance like "invalid --random_isl or dataset_path; expected a positive
integer"; keep the existing check that random_isl is not None and that
int(random_isl) > 0 after conversion.
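The semantics discussed for `gather_limited` — bounded concurrency, order preserved, exceptions stored as results — can be sketched with a semaphore. This is a simplified sketch, not the repo's implementation (which uses a queue plus progress reporting):

```python
import asyncio


async def gather_limited(items, worker, concurrency):
    """Run worker(item) over items with at most `concurrency` in flight.

    Results come back in input order; a failing worker stores its
    exception at that index instead of aborting the whole batch.
    """
    sem = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with sem:
            try:
                return await worker(item)
            except Exception as exc:  # batch semantics: record, don't abort
                return exc

    return await asyncio.gather(*(run_one(i) for i in items))


async def double(x):
    if x < 0:
        raise ValueError("negative")
    return x * 2


results = asyncio.run(gather_limited([1, 2, -1, 4], double, concurrency=2))
print(results)  # → [2, 4, ValueError('negative'), 8]
```

Callers then check `isinstance(result, Exception)` per item, or collect the failures afterward and raise an aggregate error, which is the trade-off the review comment describes.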
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 62563ec6-0e01-472d-b138-76415f75cbbb
📒 Files selected for processing (31)
- examples/specdec_bench/README.md
- examples/specdec_bench/SPECBENCH_PORTING.md
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/requirements_speed.txt
- examples/specdec_bench/run.py
- examples/specdec_bench/specdec_bench/datasets/__init__.py
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/specdec_bench/datasets/humaneval.py
- examples/specdec_bench/specdec_bench/datasets/mtbench.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
- examples/specdec_bench/specdec_bench/datasets/speed.py
- examples/specdec_bench/specdec_bench/metrics/__init__.py
- examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
- examples/specdec_bench/specdec_bench/metrics/base.py
- examples/specdec_bench/specdec_bench/metrics/mtbench.py
- examples/specdec_bench/specdec_bench/metrics/specbench.py
- examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
- examples/specdec_bench/specdec_bench/metrics/timing.py
- examples/specdec_bench/specdec_bench/models/__init__.py
- examples/specdec_bench/specdec_bench/models/auto_deploy.py
- examples/specdec_bench/specdec_bench/models/base.py
- examples/specdec_bench/specdec_bench/models/sglang.py
- examples/specdec_bench/specdec_bench/models/specbench_medusa.py
- examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
- examples/specdec_bench/specdec_bench/models/vllm.py
- examples/specdec_bench/specdec_bench/run_utils.py
- examples/specdec_bench/specdec_bench/runners/__init__.py
- examples/specdec_bench/specdec_bench/runners/base.py
- examples/specdec_bench/specdec_bench/runners/simple.py
- examples/specdec_bench/specdec_bench/utils.py
- examples/specdec_bench/sweep_example.yaml
✅ Files skipped from review due to trivial changes (5)
- examples/specdec_bench/requirements_speed.txt
- examples/specdec_bench/specdec_bench/datasets/base.py
- examples/specdec_bench/prepare_data.py
- examples/specdec_bench/sweep_example.yaml
- examples/specdec_bench/specdec_bench/datasets/init.py
🚧 Files skipped from review as they are similar to previous changes (6)
- examples/specdec_bench/specdec_bench/runners/__init__.py
- examples/specdec_bench/specdec_bench/runners/simple.py
- examples/specdec_bench/specdec_bench/datasets/specbench.py
- examples/specdec_bench/specdec_bench/datasets/mtbench.py
- examples/specdec_bench/specdec_bench/models/__init__.py
- examples/specdec_bench/SPECBENCH_PORTING.md
```
my_sweep_results/
000_speed_c32/   # first entry, concurrency=32
001_speed_c1/    # second entry, concurrency=1
001_speed_c2/    # second entry, concurrency=2
...
```
This README still trips markdownlint.
Please add a language to the fenced directory-tree block (for example, `text`) and restore the trailing newline at EOF so MD040 and MD047 stay clean.
Also applies to: 224-224
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 209-209: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/README.md` around lines 209 - 215, The fenced
directory-tree code blocks in README.md (the block showing "my_sweep_results/"
and the other at the later location) are missing a language tag and the file is
missing a trailing newline; add a language identifier (e.g., change the opening
fence to ```text) for both directory-tree fences and ensure the README.md ends
with a newline so the MD040 and MD047 markdownlint rules are satisfied.
```diff
+        prompt_ar = self.build_prompt_ar(request_records)
+        self._set_request_acceptance_lengths(prompt_ar)
         for request_id, request in enumerate(self.requests):
-            turns = self.prompt_ar[request_id].values()
+            turns = prompt_ar[request_id].values()
             assert len(turns) == len(request.turns), (
                 f"Number of turns {len(turns)} does not match number of turns in request {len(request.turns)}"
             )
-            self.out["Request_AR"][request.question_id] = mean(list(chain(*turns)))
+            self.out["Request_AL"][request.question_id] = mean(list(chain(*turns)))
             for turn in turns:
                 self._get_lengths(turn, lengths)
```
Handle missing or empty acceptance data before calling `mean()`.
`prompt_ar[request_id]` assumes every request has a record, and `mean(list(chain(*turns)))` raises when a request produced no accepted lengths. That turns a valid benchmark run into a metrics crash. Use `.get(request_id, {})` and fall back to `0.0`, or skip the request when the flattened turn list is empty.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/specdec_bench/metrics/specbench.py` around lines 49 -
58, The code assumes acceptance data exists for every request and calls
mean(list(chain(*turns))) which will raise on empty data; update the block that
reads prompt_ar and computes Request_AL to use prompt_ar.get(request_id, {})
when accessing per-request acceptance records, flatten turns and check if the
resulting list is empty before calling mean, and either set
self.out["Request_AL"][request.question_id] = 0.0 (or skip assignment) when no
accepted lengths exist; keep the existing loop calling _get_lengths(turn,
lengths) for side-effects but ensure it only runs when turns contains data.
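The defensive pattern the comment asks for can be sketched as follows (the helper name and the `0.0` fallback are illustrative, not taken from the PR):

```python
from itertools import chain
from statistics import mean


def request_acceptance_length(prompt_ar, request_id, fallback=0.0):
    """Mean accepted length across all turns of one request, tolerating gaps.

    .get() tolerates requests with no record at all, and mean() is only
    called when the flattened turn list is non-empty.
    """
    turns = prompt_ar.get(request_id, {}).values()
    flattened = list(chain(*turns))
    return mean(flattened) if flattened else fallback
```

With a guard like this, the `Request_AL` assignment can run unconditionally instead of crashing when a run produced no accepted lengths.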
```diff
             title="Acceptance Rate Results", show_header=True, header_style="bold magenta"
         )
         table.add_column("Category", style="cyan", no_wrap=True)
-        table.add_column("Average AR", justify="right", style="green")
+        table.add_column("Average AL", justify="right", style="green")

         # Add category rows
-        for category_name, category_ar in sorted(self.out["Category_AR"].items()):
+        for category_name, category_ar in sorted(self.out["Category_AL"].items()):
             table.add_row(category_name, f"{category_ar:.4f}")

         # Add separator and summary row
         table.add_section()
-        table.add_row("[bold]Overall Average[/bold]", f"[bold]{self.out['Average_AR']:.4f}[/bold]")
+        table.add_row("[bold]Overall Average[/bold]", f"[bold]{self.out['Average_AL']:.4f}[/bold]")
```
Rename the remaining “Acceptance Rate” labels.
This path now reports acceptance lengths (`Request_AL` / `Average_AL`), but the table title, axis labels, figure title, and output filename still say "Acceptance Rate". The saved artifacts will be misleading for downstream analysis.
Also applies to: 123-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/specdec_bench/metrics/specbench.py` around lines 99 -
108, The table and other artifacts still use the label "Acceptance Rate" but the
data are acceptance lengths (keys like Category_AL, Request_AL, Average_AL);
update all human-facing labels and filenames that contain "Acceptance Rate" to
"Acceptance Length" (or "Acceptance Lengths") so they match the data: change the
table title argument currently "Acceptance Rate Results" to "Acceptance Length
Results", update any column/row labels (e.g., the overall row label) to mention
Acceptance Length/Average Acceptance Length, and search the surrounding code
(including where figure titles, axis labels, and output filenames are built
between the areas around the table and lines ~123-223) and replace "Acceptance
Rate" occurrences with the new wording while leaving keys like Category_AL,
Request_AL, Average_AL unchanged.
```diff
+        speculative_algorithm = kwargs.pop("speculative_algorithm", None)
+        num_speculative_tokens = kwargs.pop("speculative_num_steps", 3)
+        draft_model_dir = kwargs.pop("draft_model_dir", None)
         if speculative_algorithm == "DRAFT_TARGET":
             specdec = DraftTargetDecodingConfig(
-                max_draft_len=kwargs.get("speculative_num_steps", 3),
-                speculative_model_dir=kwargs.get("draft_model_dir", None),
+                max_draft_len=num_speculative_tokens,
+                speculative_model_dir=draft_model_dir,
             )
-        elif kwargs.get("speculative_algorithm", None) == "EAGLE3":
-            extra_params = {}
-            if "allow_advanced_sampling" in EagleDecodingConfig.model_fields:
-                extra_params["allow_advanced_sampling"] = kwargs.get("allow_advanced_sampling", False)
-            elif "allow_advanced_sampling" in kwargs:
-                print(
-                    f"WARNING: allow_advanced_sampling unsupported in tensorrt_llm version: {trtllm.__version__}"
-                )
-            specdec = EagleDecodingConfig(
-                max_draft_len=kwargs.get("speculative_num_steps", 3),
-                speculative_model_dir=kwargs.get("draft_model_dir", None),
-                eagle3_one_model=kwargs.get("use_one_model", True),
-                eagle3_layers_to_capture=kwargs.get("eagle3_layers_to_capture", None),
-                num_eagle_layers=kwargs.get("num_eagle_layers", 1),
-                **extra_params,
+        elif speculative_algorithm == "EAGLE3":
+            from tensorrt_llm.llmapi import Eagle3DecodingConfig
+
+            specdec = Eagle3DecodingConfig(
+                max_draft_len=num_speculative_tokens,
+                speculative_model_dir=draft_model_dir,
+                eagle3_layers_to_capture=kwargs.pop("eagle3_layers_to_capture", None),
+                num_eagle_layers=kwargs.pop("num_eagle_layers", 1),
+                allow_advanced_sampling=kwargs.pop("allow_advanced_sampling", True),
+                use_sa_spec=kwargs.pop("use_sa_spec", False),
+                sa_spec_threshold=kwargs.pop("sa_spec_threshold", 4),
             )
-        elif kwargs.get("speculative_algorithm", None) == "MTP":
+        elif speculative_algorithm == "MTP":
             specdec = MTPDecodingConfig(
-                num_nextn_predict_layers=kwargs.get("speculative_num_steps", 3),
-                use_relaxed_acceptance_for_thinking=kwargs.get("relaxed_acceptance", False),
-                relaxed_topk=kwargs.get("relaxed_topk", 10),
-                relaxed_delta=kwargs.get("relaxed_delta", 0.6),
+                num_nextn_predict_layers=num_speculative_tokens,
+                use_relaxed_acceptance_for_thinking=kwargs.pop("relaxed_acceptance", False),
+                relaxed_topk=kwargs.pop("relaxed_topk", 10),
+                relaxed_delta=kwargs.pop("relaxed_delta", 0.6),
+                allow_advanced_sampling=kwargs.pop("allow_advanced_sampling", True),
+                use_sa_spec=kwargs.pop("use_sa_spec", False),
+                sa_spec_threshold=kwargs.pop("sa_spec_threshold", 4),
             )
-        elif kwargs.get("speculative_algorithm", None) == "NGRAM":
+        elif speculative_algorithm == "NGRAM":
             specdec = NGramDecodingConfig(
-                max_draft_len=kwargs.get("speculative_num_steps", 5),
-                max_matching_ngram_size=kwargs.get("max_matching_ngram_size", 3),
+                max_draft_len=num_speculative_tokens,
+                max_matching_ngram_size=kwargs.pop("max_matching_ngram_size", 3),
                 is_keep_all=True,
                 is_use_oldest=True,
                 is_public_pool=True,
             )
-        elif kwargs.get("speculative_algorithm", None) == "NONE":
+        elif speculative_algorithm == "PARD":
+            from tensorrt_llm.llmapi import PARDDecodingConfig
+
+            specdec = PARDDecodingConfig(
+                max_draft_len=num_speculative_tokens,
+                speculative_model_dir=draft_model_dir,
+            )
+        elif speculative_algorithm == "NONE":
             specdec = None
         else:
             specdec = None
```
EAGLE silently falls back to plain decoding in this backend.
`create_executor()` no longer has an EAGLE branch, so `--speculative_algorithm EAGLE` falls through to `specdec = None`. For TRTLLM, that means the requested mode is ignored without any error. Either restore the EAGLE branch or raise a clear `ValueError` for unsupported algorithms here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py` around lines
64 - 115, The code silently treats speculative_algorithm "EAGLE" as unsupported
(specdec becomes None) though callers (e.g., create_executor()) expect EAGLE to
be handled; fix by explicitly handling "EAGLE" in the speculative_algorithm
dispatch: either reintroduce the EAGLE branch that constructs the appropriate
decoding config (import and instantiate the correct Eagle/Eagle3 decoding config
such as Eagle3DecodingConfig and assign to specdec, using the same kwargs
pattern as the other branches) or raise a clear ValueError when
speculative_algorithm == "EAGLE" to fail fast. Update the logic around
speculative_algorithm/specdec in trtllm_torch_api.py (referencing
speculative_algorithm, specdec, and create_executor()) so EAGLE is never
silently ignored.
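A fail-fast guard along the lines the comment suggests might look like this (a sketch; `validate_spec_algorithm` is a hypothetical helper, and the supported-name set mirrors the branches in the hunk above):

```python
# Names mirror the branches handled in create_executor(); anything else
# should raise instead of silently disabling speculation.
SUPPORTED_SPEC_ALGORITHMS = {"DRAFT_TARGET", "EAGLE3", "MTP", "NGRAM", "PARD", "NONE"}


def validate_spec_algorithm(speculative_algorithm):
    """Return the algorithm name, None for no speculation, or raise ValueError."""
    if speculative_algorithm is None or speculative_algorithm == "NONE":
        return None
    if speculative_algorithm not in SUPPORTED_SPEC_ALGORITHMS:
        raise ValueError(
            f"Unsupported speculative_algorithm {speculative_algorithm!r}; "
            f"expected one of {sorted(SUPPORTED_SPEC_ALGORITHMS)}"
        )
    return speculative_algorithm
```

With this guard at the top of the dispatch, `--speculative_algorithm EAGLE` would error immediately rather than silently running plain decoding.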
```python
def dump_env(args, save_dir, overrides=None):
    """Write a configuration.json to save_dir capturing the run args and engine version."""
    config = vars(args).copy()
    if overrides:
        config.update(overrides)
    config["engine_version"] = _get_engine_version(config.get("engine"))
    config["gpu"] = _get_gpu_name()
    config["python_version"] = sys.version
    config["argv"] = sys.argv[:]
    os.makedirs(save_dir, exist_ok=True)
    with open(os.path.join(save_dir, "configuration.json"), "w") as f:
        json.dump(config, f, indent=4)
```
Avoid persisting raw CLI arguments without redaction.
`dump_env()` writes both `vars(args)` and `sys.argv` verbatim to `configuration.json`. That will persist any future `--token` / `--api-key` / `--password` style flag to disk unchanged. Please redact sensitive keys and omit raw argv unless there is a strong reason to keep it.
Proposed fix

```diff
+_SENSITIVE_SUBSTRINGS = ("token", "key", "secret", "password")
+
+
+def _redact_config(config):
+    return {
+        key: (
+            "***REDACTED***"
+            if any(part in key.lower() for part in _SENSITIVE_SUBSTRINGS)
+            else value
+        )
+        for key, value in config.items()
+    }
+
+
 def dump_env(args, save_dir, overrides=None):
     """Write a configuration.json to save_dir capturing the run args and engine version."""
-    config = vars(args).copy()
+    config = _redact_config(vars(args).copy())
     if overrides:
-        config.update(overrides)
+        config.update(_redact_config(overrides))
     config["engine_version"] = _get_engine_version(config.get("engine"))
     config["gpu"] = _get_gpu_name()
     config["python_version"] = sys.version
-    config["argv"] = sys.argv[:]
     os.makedirs(save_dir, exist_ok=True)
     with open(os.path.join(save_dir, "configuration.json"), "w") as f:
         json.dump(config, f, indent=4)
```

As per coding guidelines: "be careful that logs don't leak sensitive paths/tokens/model details."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/specdec_bench/specdec_bench/utils.py` around lines 92 - 103, The
dump_env function currently persists vars(args) and sys.argv verbatim which can
leak secrets; update dump_env to sanitize sensitive keys before writing: copy
vars(args) into config as now, but remove or replace values for keys matching
patterns like "token", "api_key", "apikey", "password", "secret", "key",
"access_token" (case-insensitive) and any other configured sensitive names in
overrides, and do not write raw sys.argv — either omit argv entirely or write a
redacted argv where flag values are replaced with "<REDACTED>" while keeping
flag names; keep the rest of the behavior (engine_version via
_get_engine_version, gpu via _get_gpu_name, os.makedirs, json.dump) unchanged
and apply changes in the dump_env function and the local variable config/argv
handling.
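As a standalone, runnable version of the helper in the proposed fix above (the substring list is the reviewer's heuristic, not exhaustive):

```python
# Heuristic masking: any config key containing one of these substrings is
# treated as sensitive and its value replaced before being written to disk.
_SENSITIVE_SUBSTRINGS = ("token", "key", "secret", "password")


def redact_config(config):
    """Return a copy of config with likely-sensitive values masked."""
    return {
        k: "***REDACTED***"
        if any(part in k.lower() for part in _SENSITIVE_SUBSTRINGS)
        else v
        for k, v in config.items()
    }
```

For example, `redact_config({"model_dir": "/models/m", "hf_token": "abc"})` leaves `model_dir` intact and masks `hf_token`. Note the trade-off of substring matching: `"key"` catches `api_key` but would also mask unrelated keys that merely contain the word.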
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
ac7ed96 to 1376d30
What does this PR do?

Type of change: ?

Usage

`# Add a code snippet demonstrating how to use this`

Testing

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Improvements
Documentation