
SpecDec Bench: April Update #1272

Open
IzzyPutterman wants to merge 1 commit into main from iputterman/specdec-04-15

Conversation


@IzzyPutterman IzzyPutterman commented Apr 16, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow the Contributor guidelines and that your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • YAML sweep support with per-run overrides, new CLI flags, per-run output directories, and a sweep example config
    • Added HumanEval dataset and dataset-category filtering
    • New Thinking-Acceptance metric
  • Improvements

    • Metrics now report acceptance lengths and include raw choices and output token IDs; output_logits has been removed from outputs
    • Support for pre-built conversation requests, per-run temperature/output-length control, bounded concurrency, progress reporting, and environment dump
  • Documentation

    • Updated installation image tags and detailed sweep/usage guide
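
As a rough illustration of the YAML sweep support listed above, a sweep config might look like the following. This is a hypothetical sketch only; the actual schema and field names are defined by sweep_example.yaml in this PR and may differ.

```yaml
# Hypothetical sweep config sketch; the field names (runs, overrides,
# concurrency) are illustrative assumptions, not the schema from
# sweep_example.yaml.
runs:
  - name: greedy-baseline
    overrides:
      temperature: 0.0
      output_length: 1024
  - name: sampled
    overrides:
      temperature: 0.8
concurrency: 8
```
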

@IzzyPutterman IzzyPutterman requested a review from a team as a code owner April 16, 2026 04:41
@IzzyPutterman IzzyPutterman requested a review from h-guo18 April 16, 2026 04:41

coderabbitai bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds YAML-driven sweep orchestration and bounded-concurrency execution; new run utilities and CLI sweep flags; introduces HumanEval dataset and per-dataset category filtering; converts metrics to record-driven finalization (adds ThinkingAcceptance); standardizes model streaming outputs with a CumulativeTokenCollector; adds env dumping and tokenizer/chat defaults.

Changes

Cohort / File(s) Summary
Run orchestration & CLI
examples/specdec_bench/run.py, examples/specdec_bench/sweep_example.yaml, examples/specdec_bench/README.md
Adds sweep CLI args (--sweep_config, --sweep_output_root, --temperature, etc.), per-run sweep flow, per-run save-dir resolution, env dumping, and documents YAML sweep format, concurrency mapping, and output layout; updates Docker image tags in README.
Run utilities (new)
examples/specdec_bench/specdec_bench/run_utils.py
New module for sweep parsing/validation/expansion, dataset/metric builders, save-dir resolution, and gather_limited bounded-concurrency executor with progress handling.
Datasets
examples/specdec_bench/specdec_bench/datasets/__init__.py, .../base.py, .../humaneval.py, .../mtbench.py, .../specbench.py, .../speed.py
Re-exports Dataset/HumanEval; Request gains optional messages; adds HumanEval dataset loader; MTBench/SpecBench/SPEEDBench gain optional category filtering; SPEEDBench refactors loading, typing, and truncation behavior.
Runners / runner state
examples/specdec_bench/specdec_bench/runners/__init__.py, .../base.py, .../simple.py
Exports BaseRunner; replaces prompt_ar with metric_records; runner run signatures accept request_id/turn_id; collects per-step metric records and passes them to metrics in process_final.
Metrics core & implementations
examples/specdec_bench/specdec_bench/metrics/__init__.py, .../base.py, .../acceptance_rate.py, .../mtbench.py, .../specbench.py, .../timing.py, .../thinking_acceptance.py
Switches metrics to process_final(text_outputs, request_records) model and removes per-step hook; adds token/chunk helpers in base; AcceptanceRate/MTBench/SpecBench now compute acceptance-lengths (and binned stats); adds ThinkingAcceptance metric and updates exports.
Models core & implementations
examples/specdec_bench/specdec_bench/models/__init__.py, .../base.py, .../auto_deploy.py, .../sglang.py, .../trtllm_torch_api.py, .../vllm.py, .../specbench_medusa.py
Adds Model top-level export; changes Model.run signature to optional request_id/turn_id; introduces CumulativeTokenCollector and build_model_output; standardizes streaming accumulation and updates speculative-decoding and sampling config handling across model backends.
Utilities & misc
examples/specdec_bench/specdec_bench/utils.py, examples/specdec_bench/prepare_data.py, examples/specdec_bench/requirements_speed.txt
get_tokenizer now defaults trust_remote_code=True; encode_chat uses chat_template_args=None; adds _get_engine_version, _get_gpu_name, and dump_env; tweaks argparse choices in prepare_data.py; bumps datasets requirement.
Examples & docs
examples/specdec_bench/README.md, examples/specdec_bench/sweep_example.yaml
README Docker tags updated; new sweep_example.yaml showing sample runs and concurrency/override mappings.
Small edits & formatting
multiple files (models, metrics, datasets, runners)
Numerous signature tweaks, removed trailing newlines, minor refactors and typing adjustments across many modules to align with new run/metric model.
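
The run_utils module summarized above includes a gather_limited bounded-concurrency executor with progress handling. A minimal sketch of what such an executor typically looks like, assuming an asyncio-based design (the actual signature in run_utils.py may differ):

```python
import asyncio

async def gather_limited(coros, limit=8, on_progress=None):
    """Await coroutines with at most `limit` in flight; order is preserved.

    Sketch only: the parameter names `limit` and `on_progress` are
    assumptions, not the real run_utils.py API.
    """
    sem = asyncio.Semaphore(limit)
    done = 0

    async def run_one(coro):
        nonlocal done
        async with sem:  # bound concurrency
            result = await coro
        done += 1
        if on_progress is not None:
            on_progress(done)  # simple progress hook
        return result

    return await asyncio.gather(*(run_one(c) for c in coros))
```

asyncio.gather preserves input order, so results line up with the submitted runs even when they finish out of order.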

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI
    participant RunUtils as run_utils (sweep)
    participant Runner as Runner
    participant Model as Model
    participant Metrics as Metrics

    CLI->>RunUtils: provide args + --sweep_config
    RunUtils->>RunUtils: load_sweep_entries(), expand_sweep_run_specs()
    loop per run_spec
        RunUtils->>RunUtils: build_dataset(), build_metrics()
        RunUtils->>Runner: start run with dataset & metrics
        loop per request
            Runner->>Model: run(prompt_ids, max_length, end_id, request_id, turn_id)
            Model->>Model: CumulativeTokenCollector collects chunks
            Model-->>Runner: build_model_output(output_ids, token_times, chunk_lengths)
            Runner->>Metrics: process_metrics_step(record with token_times, chunk_lengths, output_ids)
        end
        Runner->>Metrics: process_final(text_outputs, metric_records)
        Metrics->>RunUtils: write outputs/plots
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

  • Security Anti-Patterns: ❌ Error
    Explanation: The pull request changes the default value of trust_remote_code from False to True in get_tokenizer(), violating SECURITY.md coding practice #3.
    Resolution: Revert the trust_remote_code default to False, or obtain explicit security review and approval with documented justification for the change.
  • Docstring Coverage: ⚠️ Warning
    Explanation: Docstring coverage is 8.87%, which is below the required threshold of 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check: ❓ Inconclusive
    Explanation: The title 'SpecDec Bench: April Update' is generic and does not convey meaningful information about the changeset.
    Resolution: Provide a specific title that summarizes the main changes, such as 'Add sweep functionality and refactor metrics in SpecDec Bench' or 'SpecDec Bench: Update metrics to use acceptance lengths and add sweep configuration support'.

✅ Passed checks (1 passed)

  • Description Check: ✅ Passed
    Explanation: Check skipped - CodeRabbit's high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands and usage tips.

@IzzyPutterman IzzyPutterman force-pushed the iputterman/specdec-04-15 branch from 689b4a0 to 64bfe8d on April 16, 2026 04:46
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/specdec_bench/specdec_bench/datasets/base.py (1)

21-29: ⚠️ Potential issue | 🟡 Minor

Reject requests that define both turns and messages.

This dataclass now allows two conversation representations at once, which makes invalid requests easy to construct and silently mis-handle downstream. Add a __post_init__ guard or normalize one form into the other here.

Proposed fix
 @dataclass
 class Request:
     question_id: int | None = None
     category: str | None = None
     system_prompt: str | None = None
     turns: list[str] = field(default_factory=list)
     mm_content: Any | None = None  # TODO
     messages: list[dict] | None = None  # pre-built full conversation history
+
+    def __post_init__(self) -> None:
+        if self.messages is not None and self.turns:
+            raise ValueError("Request cannot define both 'turns' and 'messages'.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/datasets/base.py` around lines 21 - 29,
The Request dataclass allows both turns and messages to be set simultaneously
which is invalid; add a __post_init__ method on Request that checks if both
self.turns (non-empty) and self.messages (non-None and non-empty) are provided
and raise a ValueError explaining only one conversation representation is
allowed; optionally, if you prefer normalization instead of rejecting, implement
the same __post_init__ to populate self.messages from self.turns when messages
is None (or populate turns from messages when turns is empty) — but ensure the
check for both being present is enforced in the __post_init__ of Request.
🧹 Nitpick comments (5)
examples/specdec_bench/specdec_bench/models/auto_deploy.py (1)

95-115: Consider exposing mtp_eagle_one_model as configurable.

The EAGLE, EAGLE3, and MTP speculative decoding configurations look correct. However, mtp_eagle_one_model=True on line 114 is hardcoded while similar parameters in other configs (e.g., eagle3_one_model) are configurable via kwargs.

Make mtp_eagle_one_model configurable
     elif speculative_algorithm == "MTP":
         specdec = MTPDecodingConfig(
             speculative_model=kwargs.get("draft_model_dir"),
             num_nextn_predict_layers=kwargs.get("speculative_num_steps", 1),
-            mtp_eagle_one_model=True,
+            mtp_eagle_one_model=kwargs.get("mtp_eagle_one_model", True),
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/auto_deploy.py` around lines 95 -
115, In the MTP branch where MTPDecodingConfig is constructed (the
speculative_algorithm == "MTP" block), make the mtp_eagle_one_model flag
configurable instead of hardcoded: replace the hardcoded
mtp_eagle_one_model=True with
mtp_eagle_one_model=kwargs.get("mtp_eagle_one_model", True) so callers can pass
mtp_eagle_one_model via kwargs while preserving the existing default; update any
related docs/comments if present.
examples/specdec_bench/README.md (1)

219-224: Minor: Missing trailing newline.

The file should end with a newline character per standard convention.

Add trailing newline
 use a more robust benchmarking client like NVIDIA AI Perf.
+
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/README.md` around lines 219 - 224, The README ends
without a trailing newline; open the README.md and add a single newline
character at the end of the file so the final line (e.g., the "## Notes" section
text) is terminated by a newline per POSIX/text-file conventions.
examples/specdec_bench/specdec_bench/metrics/base.py (1)

62-70: Consider documenting the [1] fallback return value.

The trim_beam_lengths method returns [1] as a fallback when beam_lengths is empty or when trimming results in an invalid range. This sentinel value may have specific meaning for downstream consumers, but it's not immediately obvious why [1] was chosen over an empty list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/base.py` around lines 62 - 70,
The trim_beam_lengths function returns a sentinel [1] in edge cases but lacks
documentation; update the trim_beam_lengths docstring to explain why [1] is
returned (e.g., represents a minimum beam length/default sentinel for downstream
metrics consumers), when it will be returned (empty input or invalid trimmed
range), and any assumptions callers must make about this sentinel so maintainers
and callers (who call trim_beam_lengths with beam_lengths) understand its
semantics and won’t misinterpret it.
examples/specdec_bench/specdec_bench/metrics/specbench.py (1)

189-191: Unused loop variable i.

The enumerate provides i which is not used in the loop body.

♻️ Suggested fix
-        for i, (bar, count) in enumerate(zip(bars2, length_counts)):
+        for bar, count in zip(bars2, length_counts):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/specbench.py` around lines 189 -
191, The loop is enumerating but never uses the index variable `i`; change the
loop that builds `bars2` to avoid the unused `i` by iterating over indices or
zipped sequences directly (e.g., use range(len(length_bins)) or zip(length_bins,
length_means, length_stds)) when calling `ax2.bar` so `bars2 =
ax2.bar(range(len(length_bins)), length_means, yerr=length_stds, capsize=5,
alpha=0.8)` has no leftover unused `i` from an `enumerate` elsewhere.
examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py (1)

61-69: Transition chunk is discarded when thinking_end_token is detected.

When the end token is found in a chunk (line 62-65), the chunk's acceptance length is not counted in either thinking_lengths or non_thinking_lengths due to the continue. This may be intentional behavior for boundary handling, but it means the transition chunk's tokens are excluded from statistics. Consider documenting this behavior or counting it in one category.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py` around
lines 61 - 69, The transition chunk is skipped by the continue when
self._contains_sequence(output_id_iter, self.thinking_end_token) is true, so its
acceptance_length isn't recorded; update the block in thinking handling (where
in_thinking and self.thinking_end_token are checked) to record the current
acceptance_length into the appropriate list (e.g., append acceptance_length to
thinking_lengths) before flipping
prompt_state[request_id][turn_id]["in_thinking"] = False and setting in_thinking
= False, or explicitly document that transition chunks are excluded if you
prefer; reference symbols: thinking_end_token, _contains_sequence, prompt_state,
in_thinking, thinking_lengths, non_thinking_lengths, acceptance_length.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/specdec_bench/run.py`:
- Around line 405-409: The YAML loaded into args.runtime_params via
yaml.safe_load may be a scalar or list causing AttributeError later in
run_simple(); after loading (or setting {}), validate that args.runtime_params
is a mapping (dict) and if not raise a clear CLI error (e.g.,
argparse.ArgumentTypeError or print to stderr and exit) mentioning that
runtime_params must be a YAML mapping; reference args.runtime_params,
yaml.safe_load, and run_simple() when adding the validation.
- Around line 169-184: The code reuses one model instance (constructed via
model_class(...) and assigned to model) while mutating
model.sampling_kwargs["temperature"] per-run, but the executor's
allow_advanced_sampling (set in create_executor()) is computed only at model
init and cannot change; fix by either (A) detecting per-run temperature changes
against the initial sampling_kwargs temperature and recreating the model
(re-instantiate model via model_class with the new sampling_kwargs) before
executing that run, ensuring create_executor() sees the correct temperature, or
(B) rejecting per-run temperature overrides for engines that bake sampling
settings at init (validate run_specs and raise an error if any
run_spec.temperature differs from the initial sampling_kwargs temperature).
Update the sweep logic that mutates model.sampling_kwargs (and the code that
builds model from model_class) to implement one of these two behaviors and
reference create_executor, allow_advanced_sampling, model, model_class,
sampling_kwargs, and run_specs when making the change.
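
Option (B) above can be sketched as a small validation pass over the run specs. The run-spec shape (a dict with an optional 'temperature' key) is an assumption for illustration:

```python
def validate_temperature_overrides(run_specs, base_temperature):
    """Reject per-run temperature overrides the engine cannot honor.

    Sketch of option (B): assumes run_specs is a list of dicts with an
    optional 'temperature' key; the helper name is hypothetical.
    """
    for i, spec in enumerate(run_specs):
        t = spec.get("temperature", base_temperature)
        if t != base_temperature:
            raise ValueError(
                f"run {i}: temperature {t} differs from the engine's "
                f"init-time temperature {base_temperature}; this backend "
                "bakes sampling settings at model init"
            )
```
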

In `@examples/specdec_bench/specdec_bench/datasets/humaneval.py`:
- Around line 18-22: The module-level optional import for "datasets" is broken
because load_dataset can be undefined; move the import into the function that
needs it (e.g. _preprocess() or the class __init__ that calls it) and replace
the print fallback with a raised ImportError: try to import load_dataset inside
_preprocess() (or __init__), catch ImportError and raise a new ImportError with
a clear message instructing the user to install the optional extras (e.g. "[hf]"
or "[all]"); ensure no module-level variables like load_dataset remain undefined
after this change.

In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 477-480: After filtering by category, clamp self.num_samples to
the filtered dataset length before calling Dataset.select to avoid IndexError:
compute a local count (e.g., n = min(self.num_samples, len(dataset) or
dataset.num_rows)) after the dataset.filter step and then call
dataset.select(range(n)) only when self.num_samples is not None and n > 0;
update the block that references dataset, category, and self.num_samples
accordingly so select() never receives out-of-bounds indices.
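
The clamp described above is a one-liner; a sketch with a hypothetical helper name, where dataset_len stands in for the length of the filtered HF Dataset:

```python
def clamp_sample_count(num_samples, dataset_len):
    """Clamp a requested sample count to the filtered dataset size (sketch).

    With this in place, dataset.select(range(n)) never receives
    out-of-bounds indices after category filtering.
    """
    if num_samples is None:
        return dataset_len
    return min(num_samples, dataset_len)
```
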

In `@examples/specdec_bench/specdec_bench/metrics/mtbench.py`:
- Around line 34-40: The code currently assumes exactly two turns by indexing
turns[0] and turns[1], which will raise errors for requests with fewer or more
turns; update the loop over prompt_ar so you handle an arbitrary-length turns
list: compute request_ar by aggregating all tokens/values across the turns
(e.g., sum of concatenated turns divided by total length) and set
self.out["Request_AL"][request_id] accordingly, and replace the two explicit
_get_lengths calls with a loop that calls self._get_lengths(turn, lengths) for
each turn in turns (or skip requests with no turns) to defensively handle
varying turn counts.
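
The aggregation over an arbitrary number of turns can be sketched as follows; the helper name and the turn-index-to-lengths mapping shape are assumptions for illustration:

```python
def request_acceptance_length(turns):
    """Average acceptance length over any number of turns (sketch).

    `turns` is assumed to map turn index -> list of per-step acceptance
    lengths. Returns None for requests with no recorded steps instead of
    crashing on turns[0]/turns[1].
    """
    all_lengths = [v for turn in turns.values() for v in turn]
    if not all_lengths:
        return None
    return sum(all_lengths) / len(all_lengths)
```
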

In `@examples/specdec_bench/specdec_bench/metrics/timing.py`:
- Around line 38-40: The min/max calls over timing can crash if request_records
is empty or any token_times list is empty; update the metric calculation (the
block that computes start_time, end_time and assigns self.out["Output TPS"]) to
first filter out empty token_times (e.g., filtered_timing = [t for t in timing
if t]) and check for an empty result, and if filtered_timing is empty (or timing
is empty) set self.out["Output TPS"] to 0 (or another safe sentinel) instead of
calling min/max; otherwise compute start_time = min(t[0] for t in
filtered_timing), end_time = max(t[-1] for t in filtered_timing) and proceed to
compute TPS using total_tokens.
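
A sketch of the guarded TPS computation described above; the function name is hypothetical and `timing` is assumed to be a list of per-request token timestamp lists:

```python
def output_tps(timing, total_tokens):
    """Compute output tokens/sec, tolerating empty records (sketch).

    Returns 0.0 when no timestamps are available or the time window is
    degenerate, instead of letting min()/max() raise on empty input.
    """
    filtered = [t for t in timing if t]  # drop empty token_times lists
    if not filtered:
        return 0.0
    start = min(t[0] for t in filtered)
    end = max(t[-1] for t in filtered)
    if end <= start:
        return 0.0
    return total_tokens / (end - start)
```
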

In `@examples/specdec_bench/specdec_bench/models/base.py`:
- Around line 46-67: CumulativeTokenCollector currently records timestamps in a
single list self.token_times which becomes ambiguous when multiple beams update
interleaved; change the design so timing is stored per-beam: add a timestamp
container on the per-beam state (e.g. add a timestamps list to _BeamState) and
update add_update(beam_idx, cumulative_tokens, timestamp) to append timestamps
to that beam's timestamps instead of self.token_times, and adjust
chunk_lengths() and any consumers of token_times/output_ids to read per-beam
timestamps from each _BeamState (or alternatively enforce single-beam usage by
documenting/raising in CumulativeTokenCollector if more than one beam is used).
Ensure references: CumulativeTokenCollector, _BeamState, add_update,
token_times, chunk_lengths, output_ids are all updated consistently.
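
A minimal sketch of the per-beam redesign; the class and method names mirror those referenced in the comment, but the body is illustrative, not the actual implementation:

```python
import time
from dataclasses import dataclass, field

@dataclass
class _BeamState:
    """Per-beam accumulation state: tokens, chunk sizes, and timestamps."""
    tokens: list = field(default_factory=list)
    timestamps: list = field(default_factory=list)
    chunk_lengths: list = field(default_factory=list)

class CumulativeTokenCollector:
    """Sketch: keep timestamps inside each _BeamState so interleaved
    multi-beam updates remain unambiguous."""

    def __init__(self):
        self.beams = {}

    def add_update(self, beam_idx, cumulative_tokens, timestamp=None):
        beam = self.beams.setdefault(beam_idx, _BeamState())
        # The stream reports cumulative tokens; the new chunk is the suffix.
        new_tokens = cumulative_tokens[len(beam.tokens):]
        beam.tokens = list(cumulative_tokens)
        beam.chunk_lengths.append(len(new_tokens))
        beam.timestamps.append(
            timestamp if timestamp is not None else time.perf_counter()
        )
```
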

In `@examples/specdec_bench/specdec_bench/models/sglang.py`:
- Around line 68-80: Replace the fragile assert with explicit validation: check
sampling_config.get("beam_width", 1) and if it is not 1 raise a clear
RuntimeError or ValueError (e.g., "multi-beam generation not supported") so the
check remains in optimized runs; update the surrounding logic around
sampling_config, the call to self.model.async_generate, and collector.add_update
(which currently hardcodes beam index 0) to rely on this validation instead of
an assert.
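
The point of replacing the assert is that asserts are stripped when Python runs with -O, so the check would disappear in optimized runs. A sketch of the explicit validation (helper name hypothetical):

```python
def check_single_beam(sampling_config):
    """Raise instead of assert so the check survives `python -O` (sketch)."""
    beam_width = sampling_config.get("beam_width", 1)
    if beam_width != 1:
        raise ValueError(
            f"multi-beam generation is not supported by this backend "
            f"(beam_width={beam_width})"
        )
```
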
- Around line 40-60: The sgl.Engine call currently hardcodes
trust_remote_code=True, which is unsafe; change the code to accept a
caller-configurable parameter (e.g., a function/class initializer argument named
trust_remote_code defaulting to False) and pass that value into sgl.Engine
instead of the literal True; locate the sgl.Engine instantiation and replace the
hardcoded trust_remote_code argument with the new parameter (or pull it from
kwargs with kwargs.pop("trust_remote_code", False)) so the default remains False
but callers can opt in.

In `@examples/specdec_bench/specdec_bench/models/vllm.py`:
- Around line 73-74: The kwargs.pop("parallel_drafting") call can raise KeyError
when the key is absent; change it to provide a default (e.g.,
kwargs.pop("parallel_drafting", False) or None) and then set
specdec["parallel_drafting"] = True only when the popped value is truthy; update
the constructor logic around the specdec handling (the block referencing
kwargs.pop("parallel_drafting") and specdec) so it safely handles missing keys
without throwing.
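
A sketch of the safe-pop pattern; the wrapper function is hypothetical, only the kwargs.pop default matters:

```python
def extract_parallel_drafting(kwargs, specdec):
    """Pop the optional flag without raising KeyError when absent (sketch).

    kwargs.pop with a default never throws; the flag is only set when the
    popped value is truthy.
    """
    if kwargs.pop("parallel_drafting", False):
        specdec["parallel_drafting"] = True
    return specdec
```
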
- Around line 76-88: The code hardcodes trust_remote_code=True when constructing
AsyncEngineArgs in the AsyncEngineArgs(...) call, which is a security risk;
change the API to accept a caller-configurable boolean (e.g., add a parameter
named trust_remote_code with default False to the function/class that
instantiates AsyncEngineArgs) and pass that parameter into
AsyncEngineArgs(trust_remote_code=trust_remote_code) instead of the hardcoded
True, ensuring callers can opt in explicitly; update any call sites that
construct this vllm engine to forward the new argument or rely on the default
False.

In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 23-24: The get_tokenizer function currently hardcodes
trust_remote_code=True which is a security risk; change get_tokenizer to accept
a caller-configurable parameter (e.g., trust_remote_code: bool = False) and pass
that variable into AutoTokenizer.from_pretrained(path,
trust_remote_code=trust_remote_code); update any call sites to rely on the
default False or explicitly pass True when the caller trusts the model source.
Ensure the function signature and its usage across the repo are updated so the
hardcoded True is removed.
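
A sketch of the opt-in signature; the `_loader` parameter is a test seam introduced here for illustration and is not part of the real utils.py API:

```python
def get_tokenizer(path, trust_remote_code=False, _loader=None):
    """Sketch of the suggested signature: remote code is opt-in, not default.

    `_loader` stands in for AutoTokenizer.from_pretrained so the pattern
    can be exercised without downloading a model.
    """
    if _loader is None:
        from transformers import AutoTokenizer
        _loader = AutoTokenizer.from_pretrained
    return _loader(path, trust_remote_code=trust_remote_code)
```
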

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 3b5bc295-bd32-42e5-a20d-34a0edcfb946

📥 Commits

Reviewing files that changed from the base of the PR and between 07ae8e7 and 689b4a0.

📒 Files selected for processing (30)
  • examples/specdec_bench/README.md
  • examples/specdec_bench/SPECBENCH_PORTING.md
  • examples/specdec_bench/prepare_data.py
  • examples/specdec_bench/run.py
  • examples/specdec_bench/specdec_bench/datasets/__init__.py
  • examples/specdec_bench/specdec_bench/datasets/base.py
  • examples/specdec_bench/specdec_bench/datasets/humaneval.py
  • examples/specdec_bench/specdec_bench/datasets/mtbench.py
  • examples/specdec_bench/specdec_bench/datasets/specbench.py
  • examples/specdec_bench/specdec_bench/datasets/speed.py
  • examples/specdec_bench/specdec_bench/metrics/__init__.py
  • examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
  • examples/specdec_bench/specdec_bench/metrics/base.py
  • examples/specdec_bench/specdec_bench/metrics/mtbench.py
  • examples/specdec_bench/specdec_bench/metrics/specbench.py
  • examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
  • examples/specdec_bench/specdec_bench/metrics/timing.py
  • examples/specdec_bench/specdec_bench/models/__init__.py
  • examples/specdec_bench/specdec_bench/models/auto_deploy.py
  • examples/specdec_bench/specdec_bench/models/base.py
  • examples/specdec_bench/specdec_bench/models/sglang.py
  • examples/specdec_bench/specdec_bench/models/specbench_medusa.py
  • examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
  • examples/specdec_bench/specdec_bench/models/vllm.py
  • examples/specdec_bench/specdec_bench/run_utils.py
  • examples/specdec_bench/specdec_bench/runners/__init__.py
  • examples/specdec_bench/specdec_bench/runners/base.py
  • examples/specdec_bench/specdec_bench/runners/simple.py
  • examples/specdec_bench/specdec_bench/utils.py
  • examples/specdec_bench/sweep_example.yaml

Comment thread examples/specdec_bench/run.py
Comment on lines 405 to 409
     if args.runtime_params is not None:
         with open(args.runtime_params) as f:
-            args.runtime_params = yaml.safe_load(f)
+            args.runtime_params = yaml.safe_load(f) or {}
     else:
         args.runtime_params = {}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and examine the code at the specified lines
find . -name "run.py" -path "*/specdec_bench/*" | head -5

Repository: NVIDIA/Model-Optimizer

Length of output: 97


🏁 Script executed:

cat -n examples/specdec_bench/run.py | sed -n '400,415p'

Repository: NVIDIA/Model-Optimizer

Length of output: 702


🏁 Script executed:

rg -A 20 "def run_simple" examples/specdec_bench/run.py

Repository: NVIDIA/Model-Optimizer

Length of output: 971


🏁 Script executed:

rg "args\.runtime_params" examples/specdec_bench/run.py

Repository: NVIDIA/Model-Optimizer

Length of output: 550


Validate that runtime_params loads to a mapping.

yaml.safe_load() also accepts scalars and lists, so a malformed YAML file will make the first .get(...) call in run_simple() fail with AttributeError instead of a clear CLI error.

🧩 Proposed fix
     if args.runtime_params is not None:
         with open(args.runtime_params) as f:
             args.runtime_params = yaml.safe_load(f) or {}
+        if not isinstance(args.runtime_params, dict):
+            raise ValueError("--runtime_params must load to a mapping")
     else:
         args.runtime_params = {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/run.py` around lines 405 - 409, The YAML loaded into
args.runtime_params via yaml.safe_load may be a scalar or list causing
AttributeError later in run_simple(); after loading (or setting {}), validate
that args.runtime_params is a mapping (dict) and if not raise a clear CLI error
(e.g., argparse.ArgumentTypeError or print to stderr and exit) mentioning that
runtime_params must be a YAML mapping; reference args.runtime_params,
yaml.safe_load, and run_simple() when adding the validation.

Comment on lines +34 to 40
         for request_id, turns in prompt_ar.items():
             turn_1 = turns[0]
             turn_2 = turns[1]
-            q_id = request_id
-            mtbench_topic = MTBENCH_TOPICS[q_id // 10]
-            self.out["Request_AR"][request_id] = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
+            request_ar = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
+            self.out["Request_AL"][request_id] = request_ar
             self._get_lengths(turn_1, lengths)
             self._get_lengths(turn_2, lengths)

⚠️ Potential issue | 🟡 Minor

Hard-coded assumption of exactly 2 turns per request.

Accessing turns[0] and turns[1] directly will raise KeyError if a request has fewer than 2 turns. This may be intentional for MTBench's fixed format, but defensive handling would improve robustness.

🛡️ Suggested defensive check
         for request_id, turns in prompt_ar.items():
-            turn_1 = turns[0]
-            turn_2 = turns[1]
+            turn_1 = turns.get(0, [])
+            turn_2 = turns.get(1, [])
+            if not turn_1 or not turn_2:
+                continue
             request_ar = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         for request_id, turns in prompt_ar.items():
             turn_1 = turns[0]
             turn_2 = turns[1]
             q_id = request_id
             mtbench_topic = MTBENCH_TOPICS[q_id // 10]
             self.out["Request_AR"][request_id] = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
             request_ar = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
             self.out["Request_AL"][request_id] = request_ar
             self._get_lengths(turn_1, lengths)
             self._get_lengths(turn_2, lengths)
         for request_id, turns in prompt_ar.items():
             if len(turns) < 2:
                 continue
             turn_1 = turns[0]
             turn_2 = turns[1]
             request_ar = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
             self.out["Request_AL"][request_id] = request_ar
             self._get_lengths(turn_1, lengths)
             self._get_lengths(turn_2, lengths)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/mtbench.py` around lines 34 -
40, The code currently assumes exactly two turns by indexing turns[0] and
turns[1], which will raise errors for requests with fewer or more turns; update
the loop over prompt_ar so you handle an arbitrary-length turns list: compute
request_ar by aggregating all tokens/values across the turns (e.g., sum of
concatenated turns divided by total length) and set
self.out["Request_AL"][request_id] accordingly, and replace the two explicit
_get_lengths calls with a loop that calls self._get_lengths(turn, lengths) for
each turn in turns (or skip requests with no turns) to defensively handle
varying turn counts.
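The turn-count-agnostic aggregation the prompt describes can be sketched as follows; the function name and the shape of the turns mapping (turn index to per-step accepted-token counts) mirror the description above but are illustrative, not the metric class's actual API:

```python
def request_acceptance_length(turns):
    """Average acceptance length across all turns of one request.

    `turns` maps turn index -> list of per-step accepted-token counts.
    Returns None for requests with no recorded steps instead of raising.
    """
    all_steps = [count for turn in turns.values() for count in turn]
    if not all_steps:
        return None
    return sum(all_steps) / len(all_steps)
```

Flattening across turns.values() makes the metric valid for one turn, two turns, or more, rather than assuming MTBench's fixed two-turn format.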

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/specdec_bench/specdec_bench/metrics/timing.py (1)

49-54: ⚠️ Potential issue | 🟡 Minor

compute_statistics can fail on empty input lists.

If all records have len(times) <= 1, then e2e_time and ttft_time will be empty, causing compute_statistics to fail when calling np.min/np.max/np.mean on an empty array. Similarly, total_tokens will be empty if request_records is empty.

🛡️ Suggested fix: guard compute_statistics calls
-        self.out["E2E Request Time"] = compute_statistics(e2e_time)
-        self.out["TTFT Time"] = compute_statistics(ttft_time)
+        if e2e_time:
+            self.out["E2E Request Time"] = compute_statistics(e2e_time)
+        if ttft_time:
+            self.out["TTFT Time"] = compute_statistics(ttft_time)
         if tpot_time:
             self.out["Request Generation Step Time"] = compute_statistics(tpot_time)
             self.out["Request Generation Tokens Per Second"] = compute_statistics(gen_tp_time)
-        self.out["Number of Output Tokens"] = compute_statistics(total_tokens)
+        if total_tokens:
+            self.out["Number of Output Tokens"] = compute_statistics(total_tokens)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/timing.py` around lines 49 - 54,
Guard calls to compute_statistics for empty input arrays: before calling
compute_statistics for e2e_time, ttft_time, tpot_time, gen_tp_time and
total_tokens, check that the corresponding list/array is non-empty (and for
tpot_time also keep existing conditional). If empty, assign a sensible fallback
(e.g., None or an empty statistics dict) to self.out keys ("E2E Request Time",
"TTFT Time", "Request Generation Step Time", "Request Generation Tokens Per
Second", "Number of Output Tokens") instead of calling compute_statistics;
otherwise call compute_statistics as before. This change touches the block that
assigns into self.out and references variables e2e_time, ttft_time, tpot_time,
gen_tp_time, and total_tokens.
♻️ Duplicate comments (1)
examples/specdec_bench/specdec_bench/models/vllm.py (1)

76-88: ⚠️ Potential issue | 🔴 Critical

CRITICAL: Hardcoded trust_remote_code=True violates security guidelines.

Per SECURITY.md and coding guidelines, trust_remote_code=True must not be hardcoded. This enables execution of arbitrary Python shipped with a checkpoint, creating an RCE vector if the model source is untrusted. The code should expose this as a caller-configurable parameter defaulting to False.

🔒 Proposed fix
 class VLLMModel(Model):
     def __init__(self, model_dir, max_concurrent_requests, sampling_kwargs, **kwargs):
+        trust_remote_code = kwargs.pop("trust_remote_code", False)
         specdec = None
         # ... existing speculative config code ...
         
         engine_args = AsyncEngineArgs(
             model=model_dir,
-            trust_remote_code=True,
+            trust_remote_code=trust_remote_code,
             tensor_parallel_size=kwargs.pop("tensor_parallel_size", 1),

As per coding guidelines: "Flag trust_remote_code=True hardcoded for transformers model or tokenizer loading as CRITICAL security issue. Code should expose it as a caller-configurable parameter defaulting to False, not hardcode True."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/vllm.py` around lines 76 - 88,
The code currently hardcodes trust_remote_code=True when constructing
AsyncEngineArgs; change this to accept a caller-configurable parameter
defaulting to False by reading a new or existing argument (e.g.,
trust_remote_code = kwargs.pop("trust_remote_code", False)) and pass that
variable into AsyncEngineArgs instead of the literal True; update any callers or
documentation if needed and ensure kwargs.pop("trust_remote_code", False) is
used so the default remains False and callers can opt in.
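The opt-in pattern the fix describes can be shown without importing vLLM; build_engine_kwargs is a hypothetical helper that only assembles the keyword dict that would be handed to AsyncEngineArgs:

```python
def build_engine_kwargs(model_dir, **kwargs):
    """Assemble engine arguments with remote-code execution as opt-in."""
    return {
        "model": model_dir,
        # Caller must explicitly pass trust_remote_code=True to enable it.
        "trust_remote_code": kwargs.pop("trust_remote_code", False),
        "tensor_parallel_size": kwargs.pop("tensor_parallel_size", 1),
        **kwargs,
    }
```

With the default at False, loading a checkpoint that ships arbitrary Python requires a deliberate decision at the call site rather than happening silently.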
🧹 Nitpick comments (1)
examples/specdec_bench/specdec_bench/metrics/timing.py (1)

63-77: Consider adding empty-data guard to compute_statistics.

As a defensive measure, compute_statistics could return a sentinel or raise a clearer error when data is empty, rather than letting NumPy raise an opaque error.

♻️ Optional: add early return for empty data
 def compute_statistics(data, quantiles=[0.25, 0.5, 0.75]):
+    if len(data) == 0:
+        return {"min": "N/A", "max": "N/A", "mean": "N/A", "std": "N/A", "quantiles": {}}
     data = np.array(data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/timing.py` around lines 63 - 77,
compute_statistics currently lets NumPy raise an opaque error on empty input;
add an explicit empty-data guard at the start of compute_statistics (check if
data is empty via len(data) or data.size) and either raise a clear ValueError
(e.g. "compute_statistics: input data is empty") or return a defined sentinel
structure, so downstream code gets a predictable result before calling
np.min/np.max/np.mean/np.std/np.percentile; update the function body around
compute_statistics and the quantile logic to early-return or raise accordingly.
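Putting the timing.py suggestions together, a guarded compute_statistics might look like the sketch below; the "N/A" sentinel shape is an assumption, not the benchmark's actual output format:

```python
import numpy as np


def compute_statistics(data, quantiles=(0.25, 0.5, 0.75)):
    """Summary statistics with an explicit empty-input guard."""
    if len(data) == 0:
        # Return a sentinel instead of letting NumPy raise on an empty array.
        return {"min": "N/A", "max": "N/A", "mean": "N/A", "std": "N/A", "quantiles": {}}
    arr = np.array(data, dtype=float)
    return {
        "min": float(arr.min()),
        "max": float(arr.max()),
        "mean": float(arr.mean()),
        "std": float(arr.std()),
        "quantiles": {q: float(np.quantile(arr, q)) for q in quantiles},
    }
```

With the guard inside the function, callers no longer need to wrap every self.out assignment in its own non-empty check.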
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@examples/specdec_bench/specdec_bench/metrics/timing.py`:
- Around line 49-54: Guard calls to compute_statistics for empty input arrays:
before calling compute_statistics for e2e_time, ttft_time, tpot_time,
gen_tp_time and total_tokens, check that the corresponding list/array is
non-empty (and for tpot_time also keep existing conditional). If empty, assign a
sensible fallback (e.g., None or an empty statistics dict) to self.out keys
("E2E Request Time", "TTFT Time", "Request Generation Step Time", "Request
Generation Tokens Per Second", "Number of Output Tokens") instead of calling
compute_statistics; otherwise call compute_statistics as before. This change
touches the block that assigns into self.out and references variables e2e_time,
ttft_time, tpot_time, gen_tp_time, and total_tokens.

---

Duplicate comments:
In `@examples/specdec_bench/specdec_bench/models/vllm.py`:
- Around line 76-88: The code currently hardcodes trust_remote_code=True when
constructing AsyncEngineArgs; change this to accept a caller-configurable
parameter defaulting to False by reading a new or existing argument (e.g.,
trust_remote_code = kwargs.pop("trust_remote_code", False)) and pass that
variable into AsyncEngineArgs instead of the literal True; update any callers or
documentation if needed and ensure kwargs.pop("trust_remote_code", False) is
used so the default remains False and callers can opt in.

---

Nitpick comments:
In `@examples/specdec_bench/specdec_bench/metrics/timing.py`:
- Around line 63-77: compute_statistics currently lets NumPy raise an opaque
error on empty input; add an explicit empty-data guard at the start of
compute_statistics (check if data is empty via len(data) or data.size) and
either raise a clear ValueError (e.g. "compute_statistics: input data is empty")
or return a defined sentinel structure, so downstream code gets a predictable
result before calling np.min/np.max/np.mean/np.std/np.percentile; update the
function body around compute_statistics and the quantile logic to early-return
or raise accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ab43911b-151a-45c5-9d88-f3af34de4f58

📥 Commits

Reviewing files that changed from the base of the PR and between 689b4a0 and 64bfe8d.

📒 Files selected for processing (30)
  • examples/specdec_bench/README.md
  • examples/specdec_bench/SPECBENCH_PORTING.md
  • examples/specdec_bench/prepare_data.py
  • examples/specdec_bench/run.py
  • examples/specdec_bench/specdec_bench/datasets/__init__.py
  • examples/specdec_bench/specdec_bench/datasets/base.py
  • examples/specdec_bench/specdec_bench/datasets/humaneval.py
  • examples/specdec_bench/specdec_bench/datasets/mtbench.py
  • examples/specdec_bench/specdec_bench/datasets/specbench.py
  • examples/specdec_bench/specdec_bench/datasets/speed.py
  • examples/specdec_bench/specdec_bench/metrics/__init__.py
  • examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
  • examples/specdec_bench/specdec_bench/metrics/base.py
  • examples/specdec_bench/specdec_bench/metrics/mtbench.py
  • examples/specdec_bench/specdec_bench/metrics/specbench.py
  • examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
  • examples/specdec_bench/specdec_bench/metrics/timing.py
  • examples/specdec_bench/specdec_bench/models/__init__.py
  • examples/specdec_bench/specdec_bench/models/auto_deploy.py
  • examples/specdec_bench/specdec_bench/models/base.py
  • examples/specdec_bench/specdec_bench/models/sglang.py
  • examples/specdec_bench/specdec_bench/models/specbench_medusa.py
  • examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
  • examples/specdec_bench/specdec_bench/models/vllm.py
  • examples/specdec_bench/specdec_bench/run_utils.py
  • examples/specdec_bench/specdec_bench/runners/__init__.py
  • examples/specdec_bench/specdec_bench/runners/base.py
  • examples/specdec_bench/specdec_bench/runners/simple.py
  • examples/specdec_bench/specdec_bench/utils.py
  • examples/specdec_bench/sweep_example.yaml
✅ Files skipped from review due to trivial changes (6)
  • examples/specdec_bench/specdec_bench/models/specbench_medusa.py
  • examples/specdec_bench/specdec_bench/datasets/base.py
  • examples/specdec_bench/prepare_data.py
  • examples/specdec_bench/specdec_bench/datasets/__init__.py
  • examples/specdec_bench/sweep_example.yaml
  • examples/specdec_bench/specdec_bench/metrics/__init__.py
🚧 Files skipped from review as they are similar to previous changes (14)
  • examples/specdec_bench/specdec_bench/runners/simple.py
  • examples/specdec_bench/specdec_bench/runners/__init__.py
  • examples/specdec_bench/SPECBENCH_PORTING.md
  • examples/specdec_bench/specdec_bench/models/auto_deploy.py
  • examples/specdec_bench/specdec_bench/models/__init__.py
  • examples/specdec_bench/specdec_bench/datasets/mtbench.py
  • examples/specdec_bench/specdec_bench/datasets/specbench.py
  • examples/specdec_bench/specdec_bench/runners/base.py
  • examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
  • examples/specdec_bench/specdec_bench/metrics/mtbench.py
  • examples/specdec_bench/specdec_bench/metrics/specbench.py
  • examples/specdec_bench/specdec_bench/metrics/base.py
  • examples/specdec_bench/specdec_bench/utils.py
  • examples/specdec_bench/specdec_bench/models/sglang.py

@codecov

codecov bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.04%. Comparing base (6ded36b) to head (1376d30).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1272      +/-   ##
==========================================
+ Coverage   75.58%   77.04%   +1.46%     
==========================================
  Files         459      459              
  Lines       48613    48613              
==========================================
+ Hits        36745    37455     +710     
+ Misses      11868    11158     -710     
Flag Coverage Δ
examples 41.38% <ø> (+11.54%) ⬆️
unit 52.21% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@IzzyPutterman IzzyPutterman force-pushed the iputterman/specdec-04-15 branch from 64bfe8d to 0408d02 on April 16, 2026 16:22
@IzzyPutterman IzzyPutterman requested a review from a team as a code owner April 16, 2026 16:22
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (5)
examples/specdec_bench/specdec_bench/models/vllm.py (2)

76-88: ⚠️ Potential issue | 🔴 Critical

Do not hardcode trust_remote_code=True in engine args.

Line 78 forces remote code execution for model loading. Make this caller-configurable and default to False.

🔒 Proposed fix
 class VLLMModel(Model):
     def __init__(self, model_dir, max_concurrent_requests, sampling_kwargs, **kwargs):
+        trust_remote_code = kwargs.pop("trust_remote_code", False)
         specdec = None
         speculative_algorithm = kwargs.pop("speculative_algorithm", None)
         ...
         engine_args = AsyncEngineArgs(
             model=model_dir,
-            trust_remote_code=True,
+            trust_remote_code=trust_remote_code,
             tensor_parallel_size=kwargs.pop("tensor_parallel_size", 1),
             ...
             **kwargs,
         )

As per coding guidelines: "Be especially careful with remote-code execution flags: do not hardcode trust_remote_code=True."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/vllm.py` around lines 76 - 88,
The code hardcodes trust_remote_code=True in the AsyncEngineArgs instantiation
(engine_args) which forces remote code execution; change it to be
caller-configurable and default to False by reading a parameter or kwargs (e.g.,
use kwargs.pop("trust_remote_code", False) or add a function parameter
trust_remote_code=False) when building AsyncEngineArgs in vllm.py so
AsyncEngineArgs(..., trust_remote_code=trust_remote_code, ...); ensure callers
that rely on the old behavior pass the flag explicitly if needed.

73-74: ⚠️ Potential issue | 🟠 Major

kwargs.pop("parallel_drafting") can crash when key is absent.

Line 73 can raise KeyError if parallel_drafting is not provided, which breaks initialization paths that omit this flag.

🐛 Proposed fix
-        if kwargs.pop("parallel_drafting") and specdec is not None:
+        if kwargs.pop("parallel_drafting", False) and specdec is not None:
             specdec["parallel_drafting"] = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/vllm.py` around lines 73 - 74,
The code currently uses kwargs.pop("parallel_drafting") which raises KeyError
when the key is absent; change the pop to provide a default (e.g.,
kwargs.pop("parallel_drafting", False)) or use kwargs.get("parallel_drafting",
False) so initialization won't crash, then if that value is truthy and specdec
is not None set specdec["parallel_drafting"] = True; update the statement around
kwargs.pop("parallel_drafting") in the vllm model initialization to use the safe
default.
examples/specdec_bench/specdec_bench/models/sglang.py (2)

68-70: ⚠️ Potential issue | 🟠 Major

Replace assert with explicit runtime validation.

Line 70 uses assert for behavior gating. Under optimized Python runs, this check is removed and unsupported multi-beam paths can proceed silently.

♻️ Proposed fix
         if end_id != -1:
             sampling_config["stop_token_ids"] = [end_id]
-        assert sampling_config.get("beam_width", 1) == 1
+        if sampling_config.get("beam_width", 1) != 1:
+            raise ValueError("SGLANGModel only supports beam_width=1")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/sglang.py` around lines 68 - 70,
Replace the assert on beam width with an explicit runtime check: retrieve
beam_width via sampling_config.get("beam_width", 1) in the same block (near
end_id handling) and if it is not equal to 1 raise a clear exception (e.g.,
ValueError or RuntimeError) with a descriptive message explaining that
multi-beam decoding is unsupported; reference the sampling_config and beam_width
symbols and keep the check next to the existing end_id logic so unsupported
configurations fail at runtime instead of being optimized away.
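The reason the swap matters: assert statements are compiled out under python -O, so the guard silently disappears in optimized runs, while an explicit raise always fires. A minimal sketch (this exact helper signature is assumed, not sglang.py's):

```python
def check_sampling_config(sampling_config):
    """Reject unsupported beam widths with a check that survives `python -O`."""
    if sampling_config.get("beam_width", 1) != 1:
        raise ValueError("SGLANGModel only supports beam_width=1")
    return sampling_config
```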

40-44: ⚠️ Potential issue | 🔴 Critical

Do not hardcode trust_remote_code=True in sgl.Engine(...).

Line 43 enables remote code execution unconditionally. Make it caller-configurable with default False.

🔒 Proposed fix
         self.model = sgl.Engine(
             model_path=model_dir,
             skip_tokenizer_init=True,
-            trust_remote_code=True,
+            trust_remote_code=kwargs.pop("trust_remote_code", False),
             mem_fraction_static=0.8,
             disable_overlap_schedule=kwargs.pop("disable_overlap_schedule", False),

As per coding guidelines: "Be especially careful with remote-code execution flags: do not hardcode trust_remote_code=True."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/sglang.py` around lines 40 - 44,
The code currently hardcodes trust_remote_code=True when instantiating
sgl.Engine (see self.model = sgl.Engine(..., trust_remote_code=True, ...)); make
this flag caller-configurable instead: add a parameter (e.g., trust_remote_code:
bool = False) to the surrounding class or function signature and pass that
parameter into sgl.Engine, defaulting to False, so callers can opt in to
remote-code execution; ensure any code paths that construct self.model (using
model_path, skip_tokenizer_init, mem_fraction_static) use the new parameter
rather than the literal True.
examples/specdec_bench/specdec_bench/datasets/speed.py (1)

477-480: ⚠️ Potential issue | 🟠 Major

Clamp num_samples to avoid IndexError after category filtering.

When category filtering reduces the dataset below self.num_samples, dataset.select(range(self.num_samples)) will raise an IndexError. Clamp to the available rows.

Proposed fix
         if category is not None:
             dataset = dataset.filter(lambda example: example["category"] == category, desc=f"Filtering for category {category}")
         if self.num_samples is not None:
-            dataset = dataset.select(range(self.num_samples))
+            dataset = dataset.select(range(min(self.num_samples, len(dataset))))
         return dataset
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/datasets/speed.py` around lines 477 -
480, After filtering by category, clamp self.num_samples to the filtered dataset
size before calling dataset.select to avoid IndexError; i.e., compute an
effective_count from the filtered dataset (e.g., using its length or num_rows)
and use range(min(self.num_samples, effective_count)) in the call to
dataset.select rather than range(self.num_samples), updating the block that uses
category, dataset.filter, and dataset.select accordingly.
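The clamping logic is easiest to see on a plain list, independent of the datasets library; select_samples is a hypothetical stand-in for the filter/select calls named above:

```python
def select_samples(rows, num_samples, category=None):
    """Category filter followed by a clamped head-selection."""
    if category is not None:
        rows = [r for r in rows if r.get("category") == category]
    if num_samples is not None:
        # min() prevents an IndexError when filtering shrank the data
        # below the requested sample count.
        rows = rows[: min(num_samples, len(rows))]
    return rows
```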
🧹 Nitpick comments (5)
examples/specdec_bench/README.md (2)

224-224: Add a trailing newline at end of file.

Per markdownlint MD047, files should end with a single newline character.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/README.md` at line 224, The README.md in the
examples/specdec_bench example is missing a trailing newline; open
examples/specdec_bench/README.md and ensure the file ends with exactly one
newline character (no extra blank lines), then save the file so the file
terminates with a single '\n' to satisfy markdownlint MD047.

209-215: Add a language specifier to the fenced code block.

The output directory structure example should have a language specifier (e.g., text or plaintext) for consistency and to satisfy linter rules.

-```
+```text
 my_sweep_results/
   000_speed_c32/       # first entry, concurrency=32
   001_speed_c1/        # second entry, concurrency=1
   001_speed_c2/        # second entry, concurrency=2
   ...

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/README.md` around lines 209 - 215, Add a language specifier to the fenced code block showing the output directory structure in the README: modify the triple-backtick fence that wraps the my_sweep_results/ example (the directory tree block) to include a language token such as "text" or "plaintext" (e.g., ```text) so the code block is explicitly labeled and satisfies the linter and formatting consistency.
examples/specdec_bench/specdec_bench/runners/base.py (1)

23-25: Consider removing the redundant _ensure_metric_records guard.

Since __init__ already initializes self.metric_records = [] at line 21, the hasattr check is unnecessary unless subclasses are expected to skip calling super().__init__(). If that's intentional for defensive coding, consider adding a brief comment explaining why.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/runners/base.py` around lines 23 - 25, The _ensure_metric_records method redundantly checks hasattr(self, "metric_records") despite __init__ already setting self.metric_records = []; remove the guard (or remove the whole _ensure_metric_records method) so callers can rely on __init__'s initialization, or if subclasses may intentionally skip super().__init__() add a brief comment in the class or above _ensure_metric_records explaining that defensive check is required—refer to __init__ and _ensure_metric_records to locate the code.
examples/specdec_bench/specdec_bench/run_utils.py (1)

379-391: Minor: The second asyncio.gather(*tasks) after queue.join() may be redundant.

Once queue.join() completes and all items are processed, the worker loops exit naturally (via QueueEmpty). The first asyncio.gather(*tasks) at line 384 should already await their completion. The cancellation logic in finally is appropriate for handling interruptions, but under normal flow the tasks will have already finished.

This is not a bug, just a minor observation that the flow is slightly verbose.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/run_utils.py` around lines 379 - 391, Remove the redundant await asyncio.gather(*tasks) that immediately follows await queue.join(); since worker_loop tasks exit naturally once the queue is drained, waiting on queue.join() is sufficient to ensure items are processed—keep the tasks list and the finally cancellation/cleanup (including worker_loop, queue.join, and the finally block with task.cancel() and asyncio.gather(..., return_exceptions=True)) but delete the extra asyncio.gather(*tasks) after queue.join().
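The queue-and-workers flow under discussion reduces to the self-contained sketch below; run_with_workers and handler are illustrative names, not run_utils' actual API. On the happy path the single gather after the queue drains is enough, and the cancellation in finally only matters for interruptions:

```python
import asyncio


async def run_with_workers(items, handler, num_workers=4):
    """Process items with a bounded pool of queue-draining workers."""
    queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    results = []

    async def worker_loop():
        while True:
            try:
                item = queue.get_nowait()
            except asyncio.QueueEmpty:
                # Queue drained: the worker exits naturally.
                return
            try:
                results.append(await handler(item))
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker_loop()) for _ in range(num_workers)]
    try:
        # Workers finish once the queue is empty, so this await is
        # sufficient under normal flow.
        await asyncio.gather(*tasks)
    finally:
        # Cancellation is a no-op for completed tasks; it only matters
        # when the coroutine is interrupted mid-flight.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return results
```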
examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py (1)

61-65: Consider documenting or adjusting the boundary chunk handling.

When the thinking end token is detected, the chunk containing it is skipped entirely (via continue) and not counted toward either thinking or non-thinking lengths. If this is intentional (treating the boundary as transitional), a brief comment would clarify the design. Otherwise, consider splitting the chunk's acceptance length proportionally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py` around lines 61 - 65, The current branch in the block that checks self._contains_sequence(output_id_iter, self.thinking_end_token) simply continues and omits the chunk containing the end token from any length counts; either clarify intent by adding a short comment above that continue (mentioning that boundary chunks are intentionally treated as transitional and excluded), or implement boundary-splitting: detect the token position within the current output chunk via output_id_iter (use the index/offset returned by _contains_sequence or compute it), count the IDs before the token toward the current thinking length, set prompt_state[request_id][turn_id]["in_thinking"]=False and in_thinking=False, then process the IDs after the token as non-thinking (do not use continue so the remainder is counted). Ensure you update the same variables (output_id_iter, in_thinking, prompt_state entries) and preserve existing length-accumulation logic when splitting.
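The tuple-versus-list mismatch flagged for _contains_sequence comes down to normalizing the needle before comparing it against list slices, which are always lists. A sketch of that normalization (the index-or-minus-one return convention here is an assumption):

```python
def contains_sequence(token_list, token_sequence):
    """Subsequence search that tolerates int, list, or tuple needles.

    Slices of token_list are lists, so a tuple token_sequence would
    never compare equal without coercion. Returns the start index of
    the first match, or -1 if absent.
    """
    needle = [token_sequence] if isinstance(token_sequence, int) else list(token_sequence)
    n = len(needle)
    for i in range(len(token_list) - n + 1):
        if token_list[i : i + n] == needle:
            return i
    return -1
```

Returning the match index (rather than a bare boolean) is also what boundary-splitting needs: the IDs before the index count toward the thinking length, the IDs after it toward the non-thinking length.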

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py:

  • Around line 29-40: The _contains_sequence method can fail when token_sequence
    is a tuple because token_list slices produce lists; normalize types before
    comparing by converting token_sequence to the same type as the slice (e.g.,
    list(token_sequence)) or convert the slice to a tuple; update _contains_sequence
    to coerce token_sequence when it is a sequence (but not an int) and then perform
    the equality check using the normalized types so list slices vs tuple sequences
    no longer mismatch.

In @examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py:

  • Around line 62-63: Change the direct indexing of sampling_kwargs to use .get()
    with a default so missing keys don't raise KeyError: replace the use of
    sampling_kwargs["temperature"] in the block that sets
    kwargs["allow_advanced_sampling"] with sampling_kwargs.get("temperature", 0.0)
    (this mirrors the defensive pattern used by check_sampling_config and other
    model files) so that when temperature is absent it defaults to 0.0 and preserves
    the intended behavior.

In @examples/specdec_bench/specdec_bench/models/vllm.py:

  • Around line 128-132: The stop() method currently suppresses all exceptions
    which hides real shutdown errors and masks asyncio.run()'s RuntimeError when
    called inside a running loop; change it to explicitly handle RuntimeError from
    asyncio.run(self.model.shutdown()) by detecting a running loop and scheduling
    the coroutine (e.g., via
    asyncio.get_running_loop().create_task(self.model.shutdown()) or
    asyncio.create_task) instead of calling asyncio.run, and for all other
    exceptions catch and log (or re-raise) the error rather than silencing it;
    update the stop() implementation around self.model.shutdown() and asyncio.run()
    accordingly.

Duplicate comments:
In @examples/specdec_bench/specdec_bench/datasets/speed.py:

  • Around line 477-480: After filtering by category, clamp self.num_samples to
    the filtered dataset size before calling dataset.select to avoid IndexError;
    i.e., compute an effective_count from the filtered dataset (e.g., using its
    length or num_rows) and use range(min(self.num_samples, effective_count)) in the
    call to dataset.select rather than range(self.num_samples), updating the block
    that uses category, dataset.filter, and dataset.select accordingly.

In @examples/specdec_bench/specdec_bench/models/sglang.py:

  • Around line 68-70: Replace the assert on beam width with an explicit runtime
    check: retrieve beam_width via sampling_config.get("beam_width", 1) in the same
    block (near end_id handling) and if it is not equal to 1 raise a clear exception
    (e.g., ValueError or RuntimeError) with a descriptive message explaining that
    multi-beam decoding is unsupported; reference the sampling_config and beam_width
    symbols and keep the check next to the existing end_id logic so unsupported
    configurations fail at runtime instead of being optimized away.
  • Around line 40-44: The code currently hardcodes trust_remote_code=True when
    instantiating sgl.Engine (see self.model = sgl.Engine(...,
    trust_remote_code=True, ...)); make this flag caller-configurable instead: add a
    parameter (e.g., trust_remote_code: bool = False) to the surrounding class or
    function signature and pass that parameter into sgl.Engine, defaulting to False,
    so callers can opt in to remote-code execution; ensure any code paths that
    construct self.model (using model_path, skip_tokenizer_init,
    mem_fraction_static) use the new parameter rather than the literal True.
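The assert-to-exception change from the first point can be sketched like this (`check_beam_width` is a hypothetical helper, not the file's actual code):

```python
def check_beam_width(sampling_config):
    # Explicit runtime check: unlike `assert`, this is not stripped
    # under `python -O`, so unsupported configs always fail loudly.
    beam_width = sampling_config.get("beam_width", 1)
    if beam_width != 1:
        raise ValueError(
            f"multi-beam decoding is unsupported (beam_width={beam_width}); "
            "set beam_width to 1"
        )
    return beam_width
```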

In @examples/specdec_bench/specdec_bench/models/vllm.py:

  • Around line 76-88: The code hardcodes trust_remote_code=True in the
    AsyncEngineArgs instantiation (engine_args) which forces remote code execution;
    change it to be caller-configurable and default to False by reading a parameter
    or kwargs (e.g., use kwargs.pop("trust_remote_code", False) or add a function
    parameter trust_remote_code=False) when building AsyncEngineArgs in vllm.py so
    AsyncEngineArgs(..., trust_remote_code=trust_remote_code, ...); ensure callers
    that rely on the old behavior pass the flag explicitly if needed.
  • Around line 73-74: The code currently uses kwargs.pop("parallel_drafting")
    which raises KeyError when the key is absent; change the pop to provide a
    default (e.g., kwargs.pop("parallel_drafting", False)) or use
    kwargs.get("parallel_drafting", False) so initialization won't crash, then if
    that value is truthy and specdec is not None set specdec["parallel_drafting"] =
    True; update the statement around kwargs.pop("parallel_drafting") in the vllm
    model initialization to use the safe default.
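The safe-default pop from the last point can be sketched in isolation (`apply_parallel_drafting` is an illustrative helper):

```python
def apply_parallel_drafting(kwargs, specdec):
    # Pop with a default so a missing key no longer raises KeyError.
    parallel_drafting = kwargs.pop("parallel_drafting", False)
    if parallel_drafting and specdec is not None:
        specdec["parallel_drafting"] = True
    return specdec
```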

Nitpick comments:
In @examples/specdec_bench/README.md:

  • Line 224: The README.md in the examples/specdec_bench example is missing a
    trailing newline; open examples/specdec_bench/README.md and ensure the file ends
    with exactly one newline character (no extra blank lines), then save the file so
    the file terminates with a single '\n' to satisfy markdownlint MD047.
  • Around line 209-215: Add a language specifier to the fenced code block showing
    the output directory structure in the README: modify the triple-backtick fence
    that wraps the my_sweep_results/ example (the directory tree block) to include a
    language token such as "text" or "plaintext" (e.g., ```text) so the code block
    is explicitly labeled and satisfies the linter and formatting consistency.

In @examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py:

  • Around line 61-65: The current branch in the block that checks
    self._contains_sequence(output_id_iter, self.thinking_end_token) simply
    continues and omits the chunk containing the end token from any length counts;
    either clarify intent by adding a short comment above that continue (mentioning
    that boundary chunks are intentionally treated as transitional and excluded), or
    implement boundary-splitting: detect the token position within the current
    output chunk via output_id_iter (use the index/offset returned by
    _contains_sequence or compute it), count the IDs before the token toward the
    current thinking length, set
    prompt_state[request_id][turn_id]["in_thinking"]=False and in_thinking=False,
    then process the IDs after the token as non-thinking (do not use continue so the
    remainder is counted). Ensure you update the same variables (output_id_iter,
    in_thinking, prompt_state entries) and preserve existing length-accumulation
    logic when splitting.
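The boundary-splitting alternative can be sketched as a standalone helper (`split_at_sequence` is hypothetical): IDs before the end marker would count toward the thinking length, IDs after it as non-thinking.

```python
def split_at_sequence(chunk, end_sequence):
    # Return (before, after) around the first occurrence of
    # end_sequence, or (chunk, None) when the marker is absent.
    seq = [end_sequence] if isinstance(end_sequence, int) else list(end_sequence)
    n = len(seq)
    for i in range(len(chunk) - n + 1):
        if chunk[i : i + n] == seq:
            return chunk[:i], chunk[i + n :]
    return chunk, None
```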

In @examples/specdec_bench/specdec_bench/run_utils.py:

  • Around line 379-391: Remove the redundant await asyncio.gather(*tasks) that
    immediately follows await queue.join(); since worker_loop tasks exit naturally
    once the queue is drained, waiting on queue.join() is sufficient to ensure items
    are processed—keep the tasks list and the finally cancellation/cleanup
    (including worker_loop, queue.join, and the finally block with task.cancel() and
    asyncio.gather(..., return_exceptions=True)) but delete the extra
    asyncio.gather(*tasks) after queue.join().

In @examples/specdec_bench/specdec_bench/runners/base.py:

  • Around line 23-25: The _ensure_metric_records method redundantly checks
    hasattr(self, "metric_records") despite __init__ already setting
    self.metric_records = []; remove the guard (or remove the whole
    _ensure_metric_records method) so callers can rely on __init__'s initialization,
    or if subclasses may intentionally skip super().__init__() add a brief comment
    in the class or above _ensure_metric_records explaining why the defensive check is
    required—refer to __init__ and _ensure_metric_records to locate the code.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended)
- [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro Plus

**Run ID**: `593d9630-5590-4507-b80c-a05eb1e75350`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 64bfe8d4390014c0d9303d3f0f3c5a641540e244 and 0408d02ef8caa2f950f69cb4a4e58261b7aa396e.

</details>

<details>
<summary>📒 Files selected for processing (31)</summary>

* `examples/specdec_bench/README.md`
* `examples/specdec_bench/SPECBENCH_PORTING.md`
* `examples/specdec_bench/prepare_data.py`
* `examples/specdec_bench/requirements_speed.txt`
* `examples/specdec_bench/run.py`
* `examples/specdec_bench/specdec_bench/datasets/__init__.py`
* `examples/specdec_bench/specdec_bench/datasets/base.py`
* `examples/specdec_bench/specdec_bench/datasets/humaneval.py`
* `examples/specdec_bench/specdec_bench/datasets/mtbench.py`
* `examples/specdec_bench/specdec_bench/datasets/specbench.py`
* `examples/specdec_bench/specdec_bench/datasets/speed.py`
* `examples/specdec_bench/specdec_bench/metrics/__init__.py`
* `examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py`
* `examples/specdec_bench/specdec_bench/metrics/base.py`
* `examples/specdec_bench/specdec_bench/metrics/mtbench.py`
* `examples/specdec_bench/specdec_bench/metrics/specbench.py`
* `examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py`
* `examples/specdec_bench/specdec_bench/metrics/timing.py`
* `examples/specdec_bench/specdec_bench/models/__init__.py`
* `examples/specdec_bench/specdec_bench/models/auto_deploy.py`
* `examples/specdec_bench/specdec_bench/models/base.py`
* `examples/specdec_bench/specdec_bench/models/sglang.py`
* `examples/specdec_bench/specdec_bench/models/specbench_medusa.py`
* `examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py`
* `examples/specdec_bench/specdec_bench/models/vllm.py`
* `examples/specdec_bench/specdec_bench/run_utils.py`
* `examples/specdec_bench/specdec_bench/runners/__init__.py`
* `examples/specdec_bench/specdec_bench/runners/base.py`
* `examples/specdec_bench/specdec_bench/runners/simple.py`
* `examples/specdec_bench/specdec_bench/utils.py`
* `examples/specdec_bench/sweep_example.yaml`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (6)</summary>

* examples/specdec_bench/requirements_speed.txt
* examples/specdec_bench/prepare_data.py
* examples/specdec_bench/specdec_bench/datasets/base.py
* examples/specdec_bench/sweep_example.yaml
* examples/specdec_bench/specdec_bench/datasets/__init__.py
* examples/specdec_bench/specdec_bench/models/base.py

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (10)</summary>

* examples/specdec_bench/specdec_bench/models/specbench_medusa.py
* examples/specdec_bench/SPECBENCH_PORTING.md
* examples/specdec_bench/specdec_bench/runners/__init__.py
* examples/specdec_bench/specdec_bench/datasets/specbench.py
* examples/specdec_bench/specdec_bench/models/__init__.py
* examples/specdec_bench/specdec_bench/metrics/timing.py
* examples/specdec_bench/specdec_bench/utils.py
* examples/specdec_bench/specdec_bench/datasets/humaneval.py
* examples/specdec_bench/run.py
* examples/specdec_bench/specdec_bench/metrics/mtbench.py

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +29 to +40
def _contains_sequence(self, token_list, token_sequence):
    if isinstance(token_sequence, int):
        return token_sequence in token_list

    seq_len = len(token_sequence)
    if seq_len == 0:
        return False

    return any(
        token_list[i : i + seq_len] == token_sequence
        for i in range(len(token_list) - seq_len + 1)
    )
⚠️ Potential issue | 🟡 Minor

Type comparison may fail for tuple sequences.

If thinking_end_token is passed as a tuple (e.g., (1, 2, 3)), the comparison token_list[i : i + seq_len] == token_sequence will fail because slicing a list returns a list, not a tuple. Consider converting to list or using a consistent type.

Proposed fix
     def _contains_sequence(self, token_list, token_sequence):
         if isinstance(token_sequence, int):
             return token_sequence in token_list
 
+        token_sequence = list(token_sequence)
         seq_len = len(token_sequence)
         if seq_len == 0:
             return False
 
         return any(
             token_list[i : i + seq_len] == token_sequence
             for i in range(len(token_list) - seq_len + 1)
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py` around
lines 29 - 40, The _contains_sequence method can fail when token_sequence is a
tuple because token_list slices produce lists; normalize types before comparing
by converting token_sequence to the same type as the slice (e.g.,
list(token_sequence)) or convert the slice to a tuple; update _contains_sequence
to coerce token_sequence when it is a sequence (but not an int) and then perform
the equality check using the normalized types so list slices vs tuple sequences
no longer mismatch.
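Putting the normalization together, a standalone version of the helper might look like this (a sketch, assuming token_list is a plain Python list):

```python
def contains_sequence(token_list, token_sequence):
    if isinstance(token_sequence, int):
        return token_sequence in token_list
    # Normalize tuples (and other sequences) to a list so the
    # comparison against list slices can actually match.
    token_sequence = list(token_sequence)
    seq_len = len(token_sequence)
    if seq_len == 0:
        return False
    return any(
        token_list[i : i + seq_len] == token_sequence
        for i in range(len(token_list) - seq_len + 1)
    )
```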

Comment thread examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py Outdated
Comment on lines 128 to +132
    def stop(self):
-        try:
-            self.loop.run_until_complete(self.model.shutdown())
-            self.loop.close()
-        except Exception:
-            pass
+        import contextlib
+
+        with contextlib.suppress(Exception):
+            asyncio.run(self.model.shutdown())
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's verify the file exists and read the relevant section
head -150 examples/specdec_bench/specdec_bench/models/vllm.py | tail -40

Repository: NVIDIA/Model-Optimizer

Length of output: 1624


🏁 Script executed:

# Also check the full context of the stop() method
cat -n examples/specdec_bench/specdec_bench/models/vllm.py | sed -n '120,145p'

Repository: NVIDIA/Model-Optimizer

Length of output: 615


🏁 Script executed:

# Search for usages of the stop() method to understand how it's called
rg "\.stop\(\)" examples/specdec_bench/ -A 2 -B 2

Repository: NVIDIA/Model-Optimizer

Length of output: 827


🏁 Script executed:

# Check if there are any event loop related patterns in this file
rg "asyncio|event.loop|get_running_loop" examples/specdec_bench/specdec_bench/models/vllm.py -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 223


🏁 Script executed:

# Look at run.py to understand the full context
cat -n examples/specdec_bench/run.py | head -80

Repository: NVIDIA/Model-Optimizer

Length of output: 2986


🏁 Script executed:

# Check if there's an async version of stop or any other patterns
rg -A 5 -B 5 "def stop\(" examples/specdec_bench/specdec_bench/

Repository: NVIDIA/Model-Optimizer

Length of output: 4479


🏁 Script executed:

# Check the runner to understand how stop() is called
cat -n examples/specdec_bench/specdec_bench/runners/base.py | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 2448


🏁 Script executed:

# Let me check the broader execution flow to see if stop() could realistically be called from async context
rg "await.*stop\(\)|stop\(\)" examples/specdec_bench/ -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 2097


🏁 Script executed:

# Check if there's any exception handling documentation or comments about why broad suppression is used
grep -r "contextlib.suppress" examples/specdec_bench/ -B 5 -A 5

Repository: NVIDIA/Model-Optimizer

Length of output: 1686


Clarify exception handling in stop() to avoid silently masking shutdown failures.

Line 132's broad contextlib.suppress(Exception) hides all exceptions, including legitimate errors in model.shutdown(). Additionally, asyncio.run() raises RuntimeError if called inside an already-running event loop, which the suppression masks. While the current usage (stop called after asyncio.run() completes in run.py) is safe, the code is fragile.

♻️ Proposed fix
     def stop(self):
         import contextlib

         with contextlib.suppress(Exception):
-            asyncio.run(self.model.shutdown())
+            try:
+                loop = asyncio.get_running_loop()
+            except RuntimeError:
+                asyncio.run(self.model.shutdown())
+            else:
+                loop.create_task(self.model.shutdown())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/vllm.py` around lines 128 - 132,
The stop() method currently suppresses all exceptions which hides real shutdown
errors and masks asyncio.run()'s RuntimeError when called inside a running loop;
change it to explicitly handle RuntimeError from
asyncio.run(self.model.shutdown()) by detecting a running loop and scheduling
the coroutine (e.g., via
asyncio.get_running_loop().create_task(self.model.shutdown()) or
asyncio.create_task) instead of calling asyncio.run, and for all other
exceptions catch and log (or re-raise) the error rather than silencing it;
update the stop() implementation around self.model.shutdown() and asyncio.run()
accordingly.
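The loop-aware shutdown pattern can be shown in isolation (a sketch; `safe_stop` and the shutdown coroutine are stand-ins, not the vLLM API):

```python
import asyncio

def safe_stop(shutdown_coro_factory):
    # Use asyncio.run() only when no loop is active; otherwise
    # schedule the coroutine on the running loop instead of letting
    # asyncio.run() raise RuntimeError.
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        asyncio.run(shutdown_coro_factory())
    else:
        loop.create_task(shutdown_coro_factory())
```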

@kevalmorabia97
Collaborator

kevalmorabia97 commented Apr 16, 2026

@IzzyPutterman

  1. can you update PR title and description with what's changed
  2. Address Coderabbit's Critical and Major concerns?
  3. Fix merge conflicts

@IzzyPutterman IzzyPutterman force-pushed the iputterman/specdec-04-15 branch 2 times, most recently from 0c60eb3 to ac7ed96 Compare April 16, 2026 22:38
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/specdec_bench/README.md (2)

209-215: Add a language identifier to the fenced code block.

The directory structure example is missing a language specifier. Use text or plaintext for non-code content to satisfy markdownlint MD040.

-```
+```text
 my_sweep_results/
   000_speed_c32/       # first entry, concurrency=32
   001_speed_c1/        # second entry, concurrency=1
   001_speed_c2/        # second entry, concurrency=2
   ...
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/README.md` around lines 209 - 215, Update the fenced
code block that shows the directory structure in README.md to include a language
identifier (use "text" or "plaintext") so markdownlint MD040 is satisfied;
locate the three-backtick block containing the my_sweep_results/ directory
listing and change the opening fence from ``` to ```text and keep the closing
``` intact.

224-224: Add a trailing newline at end of file.

Per markdownlint MD047, files should end with a single newline character.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/README.md` at line 224, The README file is missing a
trailing newline character; open examples/specdec_bench/README.md and append a
single newline at the end of the file so it ends with exactly one trailing
newline character to satisfy markdownlint MD047.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/specdec_bench/specdec_bench/metrics/base.py`:
- Around line 27-31: AATiming defines a now-orphaned process_step method in
aa_timing.py; remove that method or move its logic into the new
process_final(self, text_outputs, request_records) implementation so the runner
will call it—update the AATiming class to implement process_final (matching the
base MetricsBase signature) and perform whatever aggregation/recording the old
process_step did using the provided text_outputs and request_records, and keep
clear(self) untouched; reference AATiming and process_final when making the
change.

---

Nitpick comments:
In `@examples/specdec_bench/README.md`:
- Around line 209-215: Update the fenced code block that shows the directory
structure in README.md to include a language identifier (use "text" or
"plaintext") so markdownlint MD040 is satisfied; locate the three-backtick block
containing the my_sweep_results/ directory listing and change the opening fence
from ``` to ```text and keep the closing ``` intact.
- Line 224: The README file is missing a trailing newline character; open
examples/specdec_bench/README.md and append a single newline at the end of the
file so it ends with exactly one trailing newline character to satisfy
markdownlint MD047.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e6bb222f-16ee-4388-b148-64b972b16550

📥 Commits

Reviewing files that changed from the base of the PR and between 0408d02 and 0c60eb3.

📒 Files selected for processing (31)
  • examples/specdec_bench/README.md
  • examples/specdec_bench/SPECBENCH_PORTING.md
  • examples/specdec_bench/prepare_data.py
  • examples/specdec_bench/requirements_speed.txt
  • examples/specdec_bench/run.py
  • examples/specdec_bench/specdec_bench/datasets/__init__.py
  • examples/specdec_bench/specdec_bench/datasets/base.py
  • examples/specdec_bench/specdec_bench/datasets/humaneval.py
  • examples/specdec_bench/specdec_bench/datasets/mtbench.py
  • examples/specdec_bench/specdec_bench/datasets/specbench.py
  • examples/specdec_bench/specdec_bench/datasets/speed.py
  • examples/specdec_bench/specdec_bench/metrics/__init__.py
  • examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
  • examples/specdec_bench/specdec_bench/metrics/base.py
  • examples/specdec_bench/specdec_bench/metrics/mtbench.py
  • examples/specdec_bench/specdec_bench/metrics/specbench.py
  • examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
  • examples/specdec_bench/specdec_bench/metrics/timing.py
  • examples/specdec_bench/specdec_bench/models/__init__.py
  • examples/specdec_bench/specdec_bench/models/auto_deploy.py
  • examples/specdec_bench/specdec_bench/models/base.py
  • examples/specdec_bench/specdec_bench/models/sglang.py
  • examples/specdec_bench/specdec_bench/models/specbench_medusa.py
  • examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
  • examples/specdec_bench/specdec_bench/models/vllm.py
  • examples/specdec_bench/specdec_bench/run_utils.py
  • examples/specdec_bench/specdec_bench/runners/__init__.py
  • examples/specdec_bench/specdec_bench/runners/base.py
  • examples/specdec_bench/specdec_bench/runners/simple.py
  • examples/specdec_bench/specdec_bench/utils.py
  • examples/specdec_bench/sweep_example.yaml
✅ Files skipped from review due to trivial changes (8)
  • examples/specdec_bench/requirements_speed.txt
  • examples/specdec_bench/specdec_bench/metrics/__init__.py
  • examples/specdec_bench/prepare_data.py
  • examples/specdec_bench/specdec_bench/datasets/__init__.py
  • examples/specdec_bench/SPECBENCH_PORTING.md
  • examples/specdec_bench/sweep_example.yaml
  • examples/specdec_bench/specdec_bench/runners/simple.py
  • examples/specdec_bench/specdec_bench/metrics/specbench.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • examples/specdec_bench/specdec_bench/models/specbench_medusa.py
  • examples/specdec_bench/specdec_bench/datasets/base.py
  • examples/specdec_bench/specdec_bench/datasets/mtbench.py
  • examples/specdec_bench/specdec_bench/models/__init__.py
  • examples/specdec_bench/specdec_bench/utils.py
  • examples/specdec_bench/specdec_bench/runners/__init__.py
  • examples/specdec_bench/specdec_bench/runners/base.py
  • examples/specdec_bench/specdec_bench/datasets/specbench.py

Comment thread examples/specdec_bench/specdec_bench/metrics/base.py
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (7)
examples/specdec_bench/specdec_bench/utils.py (1)

23-24: ⚠️ Potential issue | 🔴 Critical

Default trust_remote_code must stay opt-in.

Making this helper default to True reopens the remote-code-execution path for every tokenizer load unless each caller remembers to override it. Please keep the default at False and require trusted call sites to opt in explicitly.

Proposed fix
-def get_tokenizer(path, trust_remote_code=True):
+def get_tokenizer(path, trust_remote_code=False):
     return AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code)

As per coding guidelines "Flag trust_remote_code=True hardcoded for transformers model or tokenizer loading as CRITICAL security issue. Code should expose it as a caller-configurable parameter defaulting to False, not hardcode True".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/utils.py` around lines 23 - 24, The
helper get_tokenizer currently defaults trust_remote_code=True which enables
remote code execution by default; change the function signature
get_tokenizer(path, trust_remote_code=True) to default trust_remote_code=False,
and ensure the call to AutoTokenizer.from_pretrained(path,
trust_remote_code=trust_remote_code) uses that parameter so callers must
explicitly opt in to trust_remote_code; reference the get_tokenizer function and
the AutoTokenizer.from_pretrained call when making this change.
examples/specdec_bench/specdec_bench/datasets/humaneval.py (1)

18-22: ⚠️ Potential issue | 🔴 Critical

datasets fallback is still broken and will crash on the success path.

When the import succeeds, datasets is never defined, so Line 34 raises NameError before _preprocess() runs. Please lazy-load load_dataset where it is used and raise a real ImportError instead of printing during module import.

Proposed fix
-try:
-    from datasets import load_dataset
-except ImportError:
-    print("datasets is not installed.")
-    datasets = None
-
-
 def format_prompt(prompt: str) -> str:
@@
 class HumanEval(Dataset):
     def __init__(self, path, num_samples=164, **kwargs):
-        if datasets is None:
-            raise ImportError("datasets is not installed.")
         self.data: list[Request] = []  # list of list of questions.
         self.num_samples = num_samples
         self._preprocess(path)
 
     def _preprocess(self, path: str):
-        dataset = load_dataset(path, split="test")
+        try:
+            from datasets import load_dataset
+        except ImportError as exc:
+            raise ImportError(
+                "HumanEval requires the HuggingFace `datasets` package. Install the appropriate extra first."
+            ) from exc
+        dataset = load_dataset(path, split="test")
         for item in dataset:
             self.data.append(Request(system_prompt=None, turns=[format_prompt(item["prompt"])]))
         self.data = self.data[: self.num_samples]

As per coding guidelines "Gate optional features by install extras ([onnx], [hf], [all]); avoid hard imports of optional dependencies at module level".

Also applies to: 33-41

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/datasets/humaneval.py` around lines 18 -
22, The module currently imports load_dataset at top-level and prints on
ImportError, leaving `datasets` undefined on success and causing a NameError
when `_preprocess()` or the code that expects `load_dataset` runs; change this
by removing the top-level import fallback and instead perform a lazy import of
`datasets.load_dataset` inside the function that calls it (reference the call
site that uses load_dataset/_preprocess()), and if the import fails raise an
ImportError with a clear message about the optional dependency (e.g., suggest
installing the `[hf]` or relevant extra) rather than printing; also apply the
same lazy-import and proper ImportError handling to the other affected block(s)
around the existing load_dataset usage noted in the file (lines referenced in
review).
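The lazy-import pattern generalizes to any optional dependency; a sketch (`require_optional` is an illustrative helper, not project code):

```python
import importlib

def require_optional(module_name, extra):
    # Import an optional dependency at the call site and raise a
    # descriptive ImportError naming the install extra on failure,
    # instead of printing at module import time.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"`{module_name}` is required for this feature; "
            f"install the [{extra}] extra first"
        ) from exc
```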
examples/specdec_bench/run.py (1)

412-416: ⚠️ Potential issue | 🟡 Minor

Validate --runtime_params is a mapping right after loading.

yaml.safe_load() also returns scalars and lists. If that happens, the first .get(...) in run_simple() fails with AttributeError instead of a clear CLI error. Raise a targeted validation error immediately after Line 414.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/run.py` around lines 412 - 416, After loading
args.runtime_params with yaml.safe_load in the run.py startup block, validate
that args.runtime_params is a mapping (dict) and raise a clear CLI error if it's
not; specifically, after yaml.safe_load(...) or the fallback {} ensure
isinstance(args.runtime_params, dict) and if not, raise a ValueError or
SystemExit with a message indicating "--runtime_params must be a
mapping/dictionary" so that run_simple() (which expects a mapping) won't hit
AttributeError later.
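The validation step can be isolated into a small helper (`validate_runtime_params` is hypothetical; it operates on whatever object yaml.safe_load already returned):

```python
def validate_runtime_params(params):
    # yaml.safe_load can return None, scalars, or lists; only a
    # mapping is acceptable for --runtime_params.
    if params is None:
        return {}
    if not isinstance(params, dict):
        raise ValueError(
            "--runtime_params must be a mapping/dictionary, "
            f"got {type(params).__name__}"
        )
    return params
```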
examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py (1)

29-40: ⚠️ Potential issue | 🟡 Minor

Normalize sequence types before the boundary check.

If thinking_end_token is passed as a tuple, Line 38 compares a list slice to a tuple and never matches, so the metric never exits thinking mode. Coerce non-scalar sequences to list(...) before scanning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py` around
lines 29 - 40, In _contains_sequence, coerce non-scalar token_sequence to a list
before performing the boundary check so list slices compare correctly to the
sequence; specifically, inside the _contains_sequence method (and before
computing seq_len), if token_sequence is not an int convert it via
list(token_sequence) so tuple inputs (e.g. thinking_end_token) match list slices
and the thinking-exit detection works.
examples/specdec_bench/specdec_bench/metrics/timing.py (1)

27-40: ⚠️ Potential issue | 🟠 Major

Guard empty timing series before computing TPS and summary stats.

If request_records is empty, or any record has no emitted tokens, Line 38/39 will raise on min/max, and Line 49/50/54 can still pass empty lists into compute_statistics(). With the new record-driven flow this is a real crash path for empty category filters or requests that terminate before generating output. Filter out empty token_times first, and return a safe zero/empty summary when nothing remains.

Also applies to: 49-54

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/timing.py` around lines 27 - 40,
In process_final, guard against empty timing series by filtering request_records
to only those with non-empty record["token_times"] before computing
start_time/end_time (the min/max on timing) and before calling
compute_statistics; if no records remain, set self.out to a safe empty summary
(e.g., zero TPS and empty stats) and return early. Update the logic that builds
timing, total_tokens, e2e_time, ttft_time, tpot_time, and gen_tp_time to operate
on the filtered list, and ensure any downstream calls to compute_statistics
receive non-empty lists or handle empty inputs safely.
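A sketch of the guarded aggregation (record field names follow the shape described above and should be treated as assumptions):

```python
def summarize_token_times(request_records):
    # Drop records with no emitted tokens before touching min/max,
    # and return a zeroed summary when nothing remains.
    records = [r for r in request_records if r.get("token_times")]
    if not records:
        return {"total_tokens": 0, "tokens_per_second": 0.0}
    start = min(r["token_times"][0] for r in records)
    end = max(r["token_times"][-1] for r in records)
    total = sum(len(r["token_times"]) for r in records)
    elapsed = end - start
    tps = total / elapsed if elapsed > 0 else 0.0
    return {"total_tokens": total, "tokens_per_second": tps}
```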
examples/specdec_bench/specdec_bench/models/base.py (1)

46-61: ⚠️ Potential issue | 🟠 Major

token_times is still ambiguous across beams.

chunk_lengths() and output_ids() are tracked per beam, but token_times is a single flat list for the whole collector. Once more than one beam advances, there is no stable way to align timestamps with each beam’s chunks, so downstream latency metrics become incorrect. Store timestamps on _BeamState or explicitly reject multi-beam usage here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/base.py` around lines 46 - 61,
The collector's token_times is global and ambiguous across beams; modify
CumulativeTokenCollector and _BeamState so timestamps are tracked per beam: add
a token_times list attribute to _BeamState, in
CumulativeTokenCollector.add_update append timestamp to beam_state.token_times
instead of self.token_times, and remove or deprecate
CumulativeTokenCollector.token_times; update any code that reads token_times to
use the per-beam _BeamState.token_times (or alternatively, if multi-beam support
is undesired, detect multiple beam_idx values in add_update and raise an error).
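The per-beam timestamp layout might look like this (class names echo the review's _BeamState/CumulativeTokenCollector, but the sketch is heavily simplified):

```python
class BeamState:
    def __init__(self):
        self.output_ids = []
        self.token_times = []  # timestamps now live with their beam

class TokenCollector:
    def __init__(self):
        self.beams = {}

    def add_update(self, beam_idx, new_ids, timestamp):
        # Each beam accumulates its own ids and timestamps, so
        # downstream latency metrics stay aligned per beam.
        beam = self.beams.setdefault(beam_idx, BeamState())
        beam.output_ids.extend(new_ids)
        beam.token_times.append(timestamp)
```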
examples/specdec_bench/specdec_bench/metrics/mtbench.py (1)

34-50: ⚠️ Potential issue | 🟠 Major

Don't assume both turns always contain acceptance data.

With the new record-driven path, turns[0] / turns[1] can be missing or empty even when the request object itself has two turns, and Line 50 also divides by zero when no request metrics were produced. Aggregate whatever turn buckets are present and guard the empty case instead of indexing both turns unconditionally.

🧹 Nitpick comments (2)
examples/specdec_bench/specdec_bench/run_utils.py (2)

351-393: Worker exceptions are silently stored rather than propagated.

The gather_limited function stores exceptions as results (lines 372-373) instead of propagating them. This is a valid design choice for batch processing, but callers must explicitly check if any result is an Exception instance. Consider either:

  1. Adding a docstring noting this behavior, or
  2. Raising an aggregate error after all workers complete if any failed
📝 Optional: Add documentation about exception handling
```diff
 async def gather_limited(items, worker, concurrency, show_progress=False, progress_desc=None):
+    """Run worker on items with bounded concurrency, preserving order.
+
+    Returns:
+        List of results in input order. If a worker raises an exception,
+        the exception object is stored at that index instead of a result.
+        Callers should check `isinstance(result, Exception)` for each item.
+    """
     if concurrency <= 0:
         raise ValueError("concurrency must be > 0")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/run_utils.py` around lines 351 - 393,
The gather_limited function currently stores exceptions in the results list
instead of propagating them; either document this behavior in the gather_limited
docstring or change the post-processing to raise an aggregate error if any
worker failed: collect any Exception instances from results after queue.join(),
and if any exist raise an ExceptionGroup (or a custom AggregateError) with those
exceptions (reference gather_limited, worker_loop, and results) so callers don't
silently miss failures; ensure progress cleanup and task cancellation behavior
is preserved when raising.

230-237: Consider explicit type conversion error handling for random_isl fallback.

Line 232 uses int(dataset_path) to derive random_isl when a dataset path is provided for the random dataset. If dataset_path is not a valid integer string, this raises an unhelpful ValueError. Consider wrapping with a clearer error message:

♻️ Suggested improvement
```diff
         if random_isl is None and dataset_path is not None:
-            random_isl = int(dataset_path)
+            try:
+                random_isl = int(dataset_path)
+            except ValueError:
+                raise ValueError(
+                    f"For 'random' dataset, dataset_path must be an integer (got '{dataset_path}')"
+                ) from None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/run_utils.py` around lines 230 - 237, In
the dataset_name == "random" branch, avoid allowing int(dataset_path) to raise a
vague ValueError: wrap the conversion of dataset_path to int for random_isl in a
try/except around the int(...) call (the variables to modify are random_isl and
dataset_path in that block), and on failure raise a new ValueError that includes
the invalid dataset_path and guidance like "invalid --random_isl or
dataset_path; expected a positive integer"; keep the existing check that
random_isl is not None and that int(random_isl) > 0 after conversion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/specdec_bench/README.md`:
- Around line 209-215: The fenced directory-tree code blocks in README.md (the
block showing "my_sweep_results/" and the other at the later location) are
missing a language tag and the file is missing a trailing newline; add a
language identifier (e.g., change the opening fence to ```text) for both
directory-tree fences and ensure the README.md ends with a newline so the MD040
and MD047 markdownlint rules are satisfied.

In `@examples/specdec_bench/specdec_bench/metrics/specbench.py`:
- Around line 49-58: The code assumes acceptance data exists for every request
and calls mean(list(chain(*turns))) which will raise on empty data; update the
block that reads prompt_ar and computes Request_AL to use
prompt_ar.get(request_id, {}) when accessing per-request acceptance records,
flatten turns and check if the resulting list is empty before calling mean, and
either set self.out["Request_AL"][request.question_id] = 0.0 (or skip
assignment) when no accepted lengths exist; keep the existing loop calling
_get_lengths(turn, lengths) for side-effects but ensure it only runs when turns
contains data.
- Around line 99-108: The table and other artifacts still use the label
"Acceptance Rate" but the data are acceptance lengths (keys like Category_AL,
Request_AL, Average_AL); update all human-facing labels and filenames that
contain "Acceptance Rate" to "Acceptance Length" (or "Acceptance Lengths") so
they match the data: change the table title argument currently "Acceptance Rate
Results" to "Acceptance Length Results", update any column/row labels (e.g., the
overall row label) to mention Acceptance Length/Average Acceptance Length, and
search the surrounding code (including where figure titles, axis labels, and
output filenames are built between the areas around the table and lines
~123-223) and replace "Acceptance Rate" occurrences with the new wording while
leaving keys like Category_AL, Request_AL, Average_AL unchanged.

In `@examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py`:
- Around line 64-115: The code silently treats speculative_algorithm "EAGLE" as
unsupported (specdec becomes None) though callers (e.g., create_executor())
expect EAGLE to be handled; fix by explicitly handling "EAGLE" in the
speculative_algorithm dispatch: either reintroduce the EAGLE branch that
constructs the appropriate decoding config (import and instantiate the correct
Eagle/Eagle3 decoding config such as Eagle3DecodingConfig and assign to specdec,
using the same kwargs pattern as the other branches) or raise a clear ValueError
when speculative_algorithm == "EAGLE" to fail fast. Update the logic around
speculative_algorithm/specdec in trtllm_torch_api.py (referencing
speculative_algorithm, specdec, and create_executor()) so EAGLE is never
silently ignored.

In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 92-103: The dump_env function currently persists vars(args) and
sys.argv verbatim which can leak secrets; update dump_env to sanitize sensitive
keys before writing: copy vars(args) into config as now, but remove or replace
values for keys matching patterns like "token", "api_key", "apikey", "password",
"secret", "key", "access_token" (case-insensitive) and any other configured
sensitive names in overrides, and do not write raw sys.argv — either omit argv
entirely or write a redacted argv where flag values are replaced with
"<REDACTED>" while keeping flag names; keep the rest of the behavior
(engine_version via _get_engine_version, gpu via _get_gpu_name, os.makedirs,
json.dump) unchanged and apply changes in the dump_env function and the local
variable config/argv handling.

---

Duplicate comments:
In `@examples/specdec_bench/run.py`:
- Around line 412-416: After loading args.runtime_params with yaml.safe_load in
the run.py startup block, validate that args.runtime_params is a mapping (dict)
and raise a clear CLI error if it's not; specifically, after yaml.safe_load(...)
or the fallback {} ensure isinstance(args.runtime_params, dict) and if not,
raise a ValueError or SystemExit with a message indicating "--runtime_params
must be a mapping/dictionary" so that run_simple() (which expects a mapping)
won't hit AttributeError later.
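The validation step after `yaml.safe_load` could look like the helper below. It operates on the already-parsed value so the sketch stays independent of PyYAML; the function name is illustrative:

```python
def ensure_mapping(loaded, flag="--runtime_params"):
    """Reject non-mapping YAML for a CLI flag that must be a dict.

    `yaml.safe_load` happily returns lists, scalars, or None; downstream
    code that expects a mapping would otherwise fail with AttributeError.
    """
    if loaded is None:
        return {}  # empty file or empty string: treat as no overrides
    if not isinstance(loaded, dict):
        raise SystemExit(f"{flag} must be a mapping/dictionary, got {type(loaded).__name__}")
    return loaded
```

Failing with `SystemExit` at argument-parsing time gives the user an actionable message instead of a traceback deep inside `run_simple()`.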

In `@examples/specdec_bench/specdec_bench/datasets/humaneval.py`:
- Around line 18-22: The module currently imports load_dataset at top-level and
prints on ImportError, leaving `datasets` undefined on success and causing a
NameError when `_preprocess()` or the code that expects `load_dataset` runs;
change this by removing the top-level import fallback and instead perform a lazy
import of `datasets.load_dataset` inside the function that calls it (reference
the call site that uses load_dataset/_preprocess()), and if the import fails
raise an ImportError with a clear message about the optional dependency (e.g.,
suggest installing the `[hf]` or relevant extra) rather than printing; also
apply the same lazy-import and proper ImportError handling to the other affected
block(s) around the existing load_dataset usage noted in the file (lines
referenced in review).
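The lazy-import pattern the review asks for can be factored into a small helper; `lazy_import` and `_preprocess_humaneval` are illustrative names, and the `datasets` call site mirrors the file under review:

```python
import importlib


def lazy_import(module_name, install_hint):
    """Import an optional dependency at call time, failing with a clear message.

    Replaces the top-level try/except-with-print pattern: no name is left
    undefined on failure, and the error tells the user what to install.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Optional dependency '{module_name}' is not installed; {install_hint}"
        ) from exc


def _preprocess_humaneval():
    # Hypothetical call site: resolve `datasets.load_dataset` only when needed.
    datasets = lazy_import("datasets", "install it with `pip install datasets`")
    return datasets.load_dataset("openai_humaneval", split="test")
```

Because the import happens inside the function, merely importing the dataset module never touches the optional dependency.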

In `@examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py`:
- Around line 29-40: In _contains_sequence, coerce non-scalar token_sequence to
a list before performing the boundary check so list slices compare correctly to
the sequence; specifically, inside the _contains_sequence method (and before
computing seq_len), if token_sequence is not an int convert it via
list(token_sequence) so tuple inputs (e.g. thinking_end_token) match list slices
and the thinking-exit detection works.
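The coercion fix can be sketched as a standalone function (the real code is a method on the metric class; `contains_sequence` here is a free-function stand-in):

```python
def contains_sequence(token_ids, token_sequence):
    """Return True if `token_sequence` occurs contiguously in `token_ids`.

    Coerces non-scalar inputs to a list first, so a tuple-valued
    thinking_end_token compares equal to list slices of token_ids.
    """
    if isinstance(token_sequence, int):
        token_sequence = [token_sequence]
    else:
        token_sequence = list(token_sequence)  # tuples != list slices otherwise
    seq_len = len(token_sequence)
    return any(
        token_ids[i : i + seq_len] == token_sequence
        for i in range(len(token_ids) - seq_len + 1)
    )
```

Without the `list(...)` coercion, `[2, 3] == (2, 3)` is False in Python, so tuple-configured end tokens would never be detected.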

In `@examples/specdec_bench/specdec_bench/metrics/timing.py`:
- Around line 27-40: In process_final, guard against empty timing series by
filtering request_records to only those with non-empty record["token_times"]
before computing start_time/end_time (the min/max on timing) and before calling
compute_statistics; if no records remain, set self.out to a safe empty summary
(e.g., zero TPS and empty stats) and return early. Update the logic that builds
timing, total_tokens, e2e_time, ttft_time, tpot_time, and gen_tp_time to operate
on the filtered list, and ensure any downstream calls to compute_statistics
receive non-empty lists or handle empty inputs safely.
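A guarded version of that aggregation might look like the sketch below. The record shape (`{"token_times": [...]}`) and the summary keys are assumptions for illustration, not the metric's exact schema:

```python
def summarize_timing(request_records):
    """Build a timing summary, skipping records with no token_times.

    Filters first so min()/max() never see empty sequences, and returns a
    safe zeroed summary when nothing was timed at all.
    """
    timed = [r for r in request_records if r.get("token_times")]
    if not timed:
        return {"tps": 0.0, "e2e_time": []}  # safe empty summary
    start = min(r["token_times"][0] for r in timed)
    end = max(r["token_times"][-1] for r in timed)
    total_tokens = sum(len(r["token_times"]) for r in timed)
    elapsed = max(end - start, 1e-9)  # avoid division by zero
    return {
        "tps": total_tokens / elapsed,
        "e2e_time": [r["token_times"][-1] - r["token_times"][0] for r in timed],
    }
```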

In `@examples/specdec_bench/specdec_bench/models/base.py`:
- Around line 46-61: The collector's token_times is global and ambiguous across
beams; modify CumulativeTokenCollector and _BeamState so timestamps are tracked
per beam: add a token_times list attribute to _BeamState, in
CumulativeTokenCollector.add_update append timestamp to beam_state.token_times
instead of self.token_times, and remove or deprecate
CumulativeTokenCollector.token_times; update any code that reads token_times to
use the per-beam _BeamState.token_times (or alternatively, if multi-beam support
is undesired, detect multiple beam_idx values in add_update and raise an error).

In `@examples/specdec_bench/specdec_bench/utils.py`:
- Around line 23-24: The helper get_tokenizer currently defaults
trust_remote_code=True which enables remote code execution by default; change
the function signature get_tokenizer(path, trust_remote_code=True) to default
trust_remote_code=False, and ensure the call to
AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code) uses
that parameter so callers must explicitly opt in to trust_remote_code; reference
the get_tokenizer function and the AutoTokenizer.from_pretrained call when
making this change.
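The opt-in signature could be as simple as the sketch below; the `transformers` import is deferred so the helper can be defined without the dependency installed:

```python
def get_tokenizer(path, trust_remote_code=False):
    """Load a tokenizer; remote code execution is opt-in, not the default.

    Callers that genuinely need custom tokenizer code must pass
    trust_remote_code=True explicitly.
    """
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(path, trust_remote_code=trust_remote_code)
```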

---

Nitpick comments:
In `@examples/specdec_bench/specdec_bench/run_utils.py`:
- Around line 351-393: The gather_limited function currently stores exceptions
in the results list instead of propagating them; either document this behavior
in the gather_limited docstring or change the post-processing to raise an
aggregate error if any worker failed: collect any Exception instances from
results after queue.join(), and if any exist raise an ExceptionGroup (or a
custom AggregateError) with those exceptions (reference gather_limited,
worker_loop, and results) so callers don't silently miss failures; ensure
progress cleanup and task cancellation behavior is preserved when raising.
- Around line 230-237: In the dataset_name == "random" branch, avoid allowing
int(dataset_path) to raise a vague ValueError: wrap the conversion of
dataset_path to int for random_isl in a try/except around the int(...) call (the
variables to modify are random_isl and dataset_path in that block), and on
failure raise a new ValueError that includes the invalid dataset_path and
guidance like "invalid --random_isl or dataset_path; expected a positive
integer"; keep the existing check that random_isl is not None and that
int(random_isl) > 0 after conversion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 62563ec6-0e01-472d-b138-76415f75cbbb

📥 Commits

Reviewing files that changed from the base of the PR and between 0c60eb3 and ac7ed96.

📒 Files selected for processing (31)
  • examples/specdec_bench/README.md
  • examples/specdec_bench/SPECBENCH_PORTING.md
  • examples/specdec_bench/prepare_data.py
  • examples/specdec_bench/requirements_speed.txt
  • examples/specdec_bench/run.py
  • examples/specdec_bench/specdec_bench/datasets/__init__.py
  • examples/specdec_bench/specdec_bench/datasets/base.py
  • examples/specdec_bench/specdec_bench/datasets/humaneval.py
  • examples/specdec_bench/specdec_bench/datasets/mtbench.py
  • examples/specdec_bench/specdec_bench/datasets/specbench.py
  • examples/specdec_bench/specdec_bench/datasets/speed.py
  • examples/specdec_bench/specdec_bench/metrics/__init__.py
  • examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py
  • examples/specdec_bench/specdec_bench/metrics/base.py
  • examples/specdec_bench/specdec_bench/metrics/mtbench.py
  • examples/specdec_bench/specdec_bench/metrics/specbench.py
  • examples/specdec_bench/specdec_bench/metrics/thinking_acceptance.py
  • examples/specdec_bench/specdec_bench/metrics/timing.py
  • examples/specdec_bench/specdec_bench/models/__init__.py
  • examples/specdec_bench/specdec_bench/models/auto_deploy.py
  • examples/specdec_bench/specdec_bench/models/base.py
  • examples/specdec_bench/specdec_bench/models/sglang.py
  • examples/specdec_bench/specdec_bench/models/specbench_medusa.py
  • examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py
  • examples/specdec_bench/specdec_bench/models/vllm.py
  • examples/specdec_bench/specdec_bench/run_utils.py
  • examples/specdec_bench/specdec_bench/runners/__init__.py
  • examples/specdec_bench/specdec_bench/runners/base.py
  • examples/specdec_bench/specdec_bench/runners/simple.py
  • examples/specdec_bench/specdec_bench/utils.py
  • examples/specdec_bench/sweep_example.yaml
✅ Files skipped from review due to trivial changes (5)
  • examples/specdec_bench/requirements_speed.txt
  • examples/specdec_bench/specdec_bench/datasets/base.py
  • examples/specdec_bench/prepare_data.py
  • examples/specdec_bench/sweep_example.yaml
  • examples/specdec_bench/specdec_bench/datasets/__init__.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • examples/specdec_bench/specdec_bench/runners/__init__.py
  • examples/specdec_bench/specdec_bench/runners/simple.py
  • examples/specdec_bench/specdec_bench/datasets/specbench.py
  • examples/specdec_bench/specdec_bench/datasets/mtbench.py
  • examples/specdec_bench/specdec_bench/models/__init__.py
  • examples/specdec_bench/SPECBENCH_PORTING.md

Comment on lines +209 to +215
```
my_sweep_results/
000_speed_c32/ # first entry, concurrency=32
001_speed_c1/ # second entry, concurrency=1
001_speed_c2/ # second entry, concurrency=2
...
```

⚠️ Potential issue | 🟡 Minor

This README still trips markdownlint.

Please add a language to the fenced directory-tree block (for example, text) and restore the trailing newline at EOF so MD040 and MD047 stay clean.

Also applies to: 224-224

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 209-209: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/README.md` around lines 209 - 215, The fenced
directory-tree code blocks in README.md (the block showing "my_sweep_results/"
and the other at the later location) are missing a language tag and the file is
missing a trailing newline; add a language identifier (e.g., change the opening
fence to ```text) for both directory-tree fences and ensure the README.md ends
with a newline so the MD040 and MD047 markdownlint rules are satisfied.

Comment on lines +49 to 58
```diff
        prompt_ar = self.build_prompt_ar(request_records)
        self._set_request_acceptance_lengths(prompt_ar)
        for request_id, request in enumerate(self.requests):
-           turns = self.prompt_ar[request_id].values()
+           turns = prompt_ar[request_id].values()
            assert len(turns) == len(request.turns), (
                f"Number of turns {len(turns)} does not match number of turns in request {len(request.turns)}"
            )
-           self.out["Request_AR"][request.question_id] = mean(list(chain(*turns)))
+           self.out["Request_AL"][request.question_id] = mean(list(chain(*turns)))
            for turn in turns:
                self._get_lengths(turn, lengths)
```

⚠️ Potential issue | 🟠 Major

Handle missing or empty acceptance data before calling mean().

prompt_ar[request_id] assumes every request has a record, and mean(list(chain(*turns))) raises when a request produced no accepted lengths. That turns a valid benchmark run into a metrics crash. Use .get(request_id, {}) and fall back to 0.0 or skip the request when the flattened turn list is empty.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/specbench.py` around lines 49 -
58, The code assumes acceptance data exists for every request and calls
mean(list(chain(*turns))) which will raise on empty data; update the block that
reads prompt_ar and computes Request_AL to use prompt_ar.get(request_id, {})
when accessing per-request acceptance records, flatten turns and check if the
resulting list is empty before calling mean, and either set
self.out["Request_AL"][request.question_id] = 0.0 (or skip assignment) when no
accepted lengths exist; keep the existing loop calling _get_lengths(turn,
lengths) for side-effects but ensure it only runs when turns contains data.
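A guarded version of the per-request aggregation could look like this sketch (the helper name and the shape of `prompt_ar` — request id mapping to per-turn lists of accepted lengths — are illustrative):

```python
from itertools import chain
from statistics import mean


def request_acceptance_length(prompt_ar, request_id):
    """Mean accepted length for one request, tolerating missing data.

    Uses .get() so a request with no record doesn't KeyError, and falls
    back to 0.0 instead of letting mean() raise on an empty sequence.
    """
    turns = list(prompt_ar.get(request_id, {}).values())
    flat = list(chain(*turns))
    if not flat:
        return 0.0  # no acceptance records produced for this request
    return mean(flat)
```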

Comment on lines +99 to +108
```diff
            title="Acceptance Rate Results", show_header=True, header_style="bold magenta"
        )
        table.add_column("Category", style="cyan", no_wrap=True)
-       table.add_column("Average AR", justify="right", style="green")
+       table.add_column("Average AL", justify="right", style="green")

        # Add category rows
-       for category_name, category_ar in sorted(self.out["Category_AR"].items()):
+       for category_name, category_ar in sorted(self.out["Category_AL"].items()):
            table.add_row(category_name, f"{category_ar:.4f}")

        # Add separator and summary row
        table.add_section()
-       table.add_row("[bold]Overall Average[/bold]", f"[bold]{self.out['Average_AR']:.4f}[/bold]")
+       table.add_row("[bold]Overall Average[/bold]", f"[bold]{self.out['Average_AL']:.4f}[/bold]")
```

⚠️ Potential issue | 🟡 Minor

Rename the remaining “Acceptance Rate” labels.

This path now reports acceptance lengths (Request_AL / Average_AL), but the table title, axis labels, figure title, and output filename still say “Acceptance Rate”. The saved artifacts will be misleading for downstream analysis.

Also applies to: 123-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/metrics/specbench.py` around lines 99 -
108, The table and other artifacts still use the label "Acceptance Rate" but the
data are acceptance lengths (keys like Category_AL, Request_AL, Average_AL);
update all human-facing labels and filenames that contain "Acceptance Rate" to
"Acceptance Length" (or "Acceptance Lengths") so they match the data: change the
table title argument currently "Acceptance Rate Results" to "Acceptance Length
Results", update any column/row labels (e.g., the overall row label) to mention
Acceptance Length/Average Acceptance Length, and search the surrounding code
(including where figure titles, axis labels, and output filenames are built
between the areas around the table and lines ~123-223) and replace "Acceptance
Rate" occurrences with the new wording while leaving keys like Category_AL,
Request_AL, Average_AL unchanged.

Comment on lines +64 to 115
```diff
+speculative_algorithm = kwargs.pop("speculative_algorithm", None)
+num_speculative_tokens = kwargs.pop("speculative_num_steps", 3)
+draft_model_dir = kwargs.pop("draft_model_dir", None)
 if speculative_algorithm == "DRAFT_TARGET":
     specdec = DraftTargetDecodingConfig(
-        max_draft_len=kwargs.get("speculative_num_steps", 3),
-        speculative_model_dir=kwargs.get("draft_model_dir", None),
+        max_draft_len=num_speculative_tokens,
+        speculative_model_dir=draft_model_dir,
     )
-elif kwargs.get("speculative_algorithm", None) == "EAGLE3":
-    extra_params = {}
-    if "allow_advanced_sampling" in EagleDecodingConfig.model_fields:
-        extra_params["allow_advanced_sampling"] = kwargs.get("allow_advanced_sampling", False)
-    elif "allow_advanced_sampling" in kwargs:
-        print(
-            f"WARNING: allow_advanced_sampling unsupported in tensorrt_llm version: {trtllm.__version__}"
-        )
-    specdec = EagleDecodingConfig(
-        max_draft_len=kwargs.get("speculative_num_steps", 3),
-        speculative_model_dir=kwargs.get("draft_model_dir", None),
-        eagle3_one_model=kwargs.get("use_one_model", True),
-        eagle3_layers_to_capture=kwargs.get("eagle3_layers_to_capture", None),
-        num_eagle_layers=kwargs.get("num_eagle_layers", 1),
-        **extra_params,
+elif speculative_algorithm == "EAGLE3":
+    from tensorrt_llm.llmapi import Eagle3DecodingConfig
+
+    specdec = Eagle3DecodingConfig(
+        max_draft_len=num_speculative_tokens,
+        speculative_model_dir=draft_model_dir,
+        eagle3_layers_to_capture=kwargs.pop("eagle3_layers_to_capture", None),
+        num_eagle_layers=kwargs.pop("num_eagle_layers", 1),
+        allow_advanced_sampling=kwargs.pop("allow_advanced_sampling", True),
+        use_sa_spec=kwargs.pop("use_sa_spec", False),
+        sa_spec_threshold=kwargs.pop("sa_spec_threshold", 4),
     )
-elif kwargs.get("speculative_algorithm", None) == "MTP":
+elif speculative_algorithm == "MTP":
     specdec = MTPDecodingConfig(
-        num_nextn_predict_layers=kwargs.get("speculative_num_steps", 3),
-        use_relaxed_acceptance_for_thinking=kwargs.get("relaxed_acceptance", False),
-        relaxed_topk=kwargs.get("relaxed_topk", 10),
-        relaxed_delta=kwargs.get("relaxed_delta", 0.6),
+        num_nextn_predict_layers=num_speculative_tokens,
+        use_relaxed_acceptance_for_thinking=kwargs.pop("relaxed_acceptance", False),
+        relaxed_topk=kwargs.pop("relaxed_topk", 10),
+        relaxed_delta=kwargs.pop("relaxed_delta", 0.6),
+        allow_advanced_sampling=kwargs.pop("allow_advanced_sampling", True),
+        use_sa_spec=kwargs.pop("use_sa_spec", False),
+        sa_spec_threshold=kwargs.pop("sa_spec_threshold", 4),
     )
-elif kwargs.get("speculative_algorithm", None) == "NGRAM":
+elif speculative_algorithm == "NGRAM":
     specdec = NGramDecodingConfig(
-        max_draft_len=kwargs.get("speculative_num_steps", 5),
-        max_matching_ngram_size=kwargs.get("max_matching_ngram_size", 3),
+        max_draft_len=num_speculative_tokens,
+        max_matching_ngram_size=kwargs.pop("max_matching_ngram_size", 3),
         is_keep_all=True,
         is_use_oldest=True,
         is_public_pool=True,
     )
-elif kwargs.get("speculative_algorithm", None) == "NONE":
+elif speculative_algorithm == "PARD":
+    from tensorrt_llm.llmapi import PARDDecodingConfig
+
+    specdec = PARDDecodingConfig(
+        max_draft_len=num_speculative_tokens,
+        speculative_model_dir=draft_model_dir,
+    )
+elif speculative_algorithm == "NONE":
     specdec = None
 else:
     specdec = None
```

⚠️ Potential issue | 🟠 Major

EAGLE silently falls back to plain decoding in this backend.

create_executor() no longer has an EAGLE branch, so --speculative_algorithm EAGLE drops into specdec = None. For TRTLLM that means the requested mode is ignored without any error. Either restore the EAGLE branch or raise a clear ValueError for unsupported algorithms here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py` around lines
64 - 115, The code silently treats speculative_algorithm "EAGLE" as unsupported
(specdec becomes None) though callers (e.g., create_executor()) expect EAGLE to
be handled; fix by explicitly handling "EAGLE" in the speculative_algorithm
dispatch: either reintroduce the EAGLE branch that constructs the appropriate
decoding config (import and instantiate the correct Eagle/Eagle3 decoding config
such as Eagle3DecodingConfig and assign to specdec, using the same kwargs
pattern as the other branches) or raise a clear ValueError when
speculative_algorithm == "EAGLE" to fail fast. Update the logic around
speculative_algorithm/specdec in trtllm_torch_api.py (referencing
speculative_algorithm, specdec, and create_executor()) so EAGLE is never
silently ignored.
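The fail-fast option could be as small as the sketch below. The supported-algorithm set mirrors the branches visible in the diff above; the helper name is illustrative:

```python
# Algorithms handled by the dispatch in create_executor(); None / "NONE"
# mean speculation is intentionally disabled.
_SUPPORTED = {"DRAFT_TARGET", "EAGLE3", "MTP", "NGRAM", "PARD", "NONE", None}


def resolve_specdec_algorithm(speculative_algorithm):
    """Raise instead of silently disabling specdec for unknown algorithms."""
    if speculative_algorithm not in _SUPPORTED:
        raise ValueError(
            f"Unsupported speculative_algorithm '{speculative_algorithm}'; "
            f"expected one of {sorted(a for a in _SUPPORTED if a)}"
        )
    return speculative_algorithm
```

Calling this at the top of the dispatch turns `--speculative_algorithm EAGLE` into an immediate, explicit error rather than a silent plain-decoding run.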

Comment on lines +92 to +103
```python
def dump_env(args, save_dir, overrides=None):
    """Write a configuration.json to save_dir capturing the run args and engine version."""
    config = vars(args).copy()
    if overrides:
        config.update(overrides)
    config["engine_version"] = _get_engine_version(config.get("engine"))
    config["gpu"] = _get_gpu_name()
    config["python_version"] = sys.version
    config["argv"] = sys.argv[:]
    os.makedirs(save_dir, exist_ok=True)
    with open(os.path.join(save_dir, "configuration.json"), "w") as f:
        json.dump(config, f, indent=4)
```

⚠️ Potential issue | 🟠 Major

Avoid persisting raw CLI arguments without redaction.

dump_env() writes both vars(args) and sys.argv verbatim to configuration.json. That will persist any future --token / --api-key / --password style flag to disk unchanged. Please redact sensitive keys and omit raw argv unless there is a strong reason to keep it.

Proposed fix
```diff
+_SENSITIVE_SUBSTRINGS = ("token", "key", "secret", "password")
+
+
+def _redact_config(config):
+    return {
+        key: (
+            "***REDACTED***"
+            if any(part in key.lower() for part in _SENSITIVE_SUBSTRINGS)
+            else value
+        )
+        for key, value in config.items()
+    }
+
+
 def dump_env(args, save_dir, overrides=None):
     """Write a configuration.json to save_dir capturing the run args and engine version."""
-    config = vars(args).copy()
+    config = _redact_config(vars(args).copy())
     if overrides:
-        config.update(overrides)
+        config.update(_redact_config(overrides))
     config["engine_version"] = _get_engine_version(config.get("engine"))
     config["gpu"] = _get_gpu_name()
     config["python_version"] = sys.version
-    config["argv"] = sys.argv[:]
     os.makedirs(save_dir, exist_ok=True)
     with open(os.path.join(save_dir, "configuration.json"), "w") as f:
         json.dump(config, f, indent=4)
```

As per coding guidelines "be careful that logs don’t leak sensitive paths/tokens/model details."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/utils.py` around lines 92 - 103, The
dump_env function currently persists vars(args) and sys.argv verbatim which can
leak secrets; update dump_env to sanitize sensitive keys before writing: copy
vars(args) into config as now, but remove or replace values for keys matching
patterns like "token", "api_key", "apikey", "password", "secret", "key",
"access_token" (case-insensitive) and any other configured sensitive names in
overrides, and do not write raw sys.argv — either omit argv entirely or write a
redacted argv where flag values are replaced with "<REDACTED>" while keeping
flag names; keep the rest of the behavior (engine_version via
_get_engine_version, gpu via _get_gpu_name, os.makedirs, json.dump) unchanged
and apply changes in the dump_env function and the local variable config/argv
handling.

Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
@IzzyPutterman IzzyPutterman force-pushed the iputterman/specdec-04-15 branch from ac7ed96 to 1376d30 on April 16, 2026 at 22:51
@github-actions

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1272/

Built to branch gh-pages at 2026-04-16 22:55 UTC.
Preview will be ready when the GitHub Pages deployment is complete.
