
Modified comprehensive_8dev_experiments.yaml from 20 num runs to 100#123

Closed
danielwu7115 wants to merge 90 commits into AI-Hypercomputer:main from danielwu7115:tpu7x-h2dd2h

@danielwu7115

What this PR does:

This PR updates the microbenchmark workload configuration for the 8-device host-to-device and device-to-host (h2dd2h) transfers. Specifically, it:

  • Increases the number of runs for the benchmark from 20 to 100.

Why these changes are needed:

  • Increased runs: The original 20 runs were too few to characterize the performance distribution of this workload, given its run-to-run variability. Increasing num_runs to 100 yields a fivefold larger sample, which tightens the estimate of the mean and gives tail percentiles more observations to rest on, so results are more repeatable.
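
The statistical argument above can be made concrete with a small sketch. It assumes the runs are independent and share the same underlying spread, which the PR does not verify; the helper names are hypothetical and are not part of the benchmark code.

```python
import math
import statistics

def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))

def sem_shrink_factor(old_runs, new_runs):
    """Factor by which the standard error shrinks when the run count grows,
    assuming independent runs with the same underlying spread."""
    return math.sqrt(new_runs / old_runs)

def samples_beyond_percentile(num_runs, percentile):
    """Expected number of runs that fall beyond the given percentile."""
    return num_runs * (100 - percentile) / 100

# Going from 20 to 100 runs shrinks the standard error by sqrt(5).
shrink = sem_shrink_factor(20, 100)

# A p95 estimate rests on about 1 run with 20 samples and about 5 with 100.
tail_20 = samples_beyond_percentile(20, 95)
tail_100 = samples_beyond_percentile(100, 95)
```

Under these assumptions the mean is estimated roughly 2.2 times more tightly, and a p95 figure no longer hinges on a single observation.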

How these changes were implemented:

  • The configuration value for num_runs was updated from 20 to 100 in Ironwood/configs/host_device/comprehensive_8dev_experiments.yaml.

How this was tested:

  • Verified locally that the benchmark accepts the updated configuration and completes all 100 runs.

rahul-anand and others added 30 commits January 30, 2026 08:11
Let run_benchmark.py accept the `gcs-bucket-csv-dir` argument to configure the
directory where the CSV/TSV results are written.
* Fix the issue where `kubectl wait` could only wait for one condition.
  Use a poll loop to check the status instead.

* Store the failed jobs and retry them up to 3 times.

TEST=Use dummy `must-fail` and `must-succeed` jobs that exit with 1/0
directly. Make sure the script retries the failed job 3 times
and eventually prints the command to retry it.
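
A minimal sketch of the poll-and-retry behavior described in this commit message follows. It is not the script's actual code: `launch`, `poll_until_done`, and `run_with_retries` are hypothetical names, and the sketch assumes each job ends in either a `Complete` or a `Failed` status.

```python
import time

MAX_RETRIES = 3  # from the commit message: retry failed jobs up to 3 times

def poll_until_done(get_status, interval_s=0.0, max_polls=100):
    """Poll a status callable until it reports 'Complete' or 'Failed'.

    Polling lets the caller react to either terminal status, rather than
    blocking on a single condition.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in ("Complete", "Failed"):
            return status
        time.sleep(interval_s)
    return "Failed"  # assumption: a timeout counts as a failure

def run_with_retries(jobs, launch):
    """Run jobs, retrying the failed ones; return the jobs that still fail.

    `launch(job)` is a hypothetical callable that starts the job and returns
    a zero-argument function reporting its current status.
    """
    pending = list(jobs)
    for _attempt in range(1 + MAX_RETRIES):  # first attempt plus retries
        failed = [job for job in pending
                  if poll_until_done(launch(job)) == "Failed"]
        if not failed:
            return []
        pending = failed
    return pending  # the caller can print a retry command for these
```

With dummy jobs like those in the TEST note, a job that always fails would be launched four times in this sketch (once plus three retries) and then returned so a manual retry command can be printed.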
linamy85 and others added 29 commits February 11, 2026 09:22
…#111)

We add a tuned table instead of running the sweep in each microbenchmark.
For configs that are not tuned yet, we use the
default block sizes and emit the output with the flag has_optimized=false.
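
The lookup with a default fallback can be sketched as below. The table keys, block-size names, and numeric values are made-up placeholders for illustration; only the `has_optimized` flag comes from the commit message.

```python
# Hypothetical defaults used when a config has no tuned entry.
DEFAULT_BLOCK_SIZES = {"block_q": 128, "block_kv": 128}

# Hypothetical tuned table: (seq_len, head_dim) -> block sizes.
TUNED_TABLE = {
    (4096, 128): {"block_q": 512, "block_kv": 1024},
}

def lookup_block_sizes(seq_len, head_dim):
    """Return (block_sizes, has_optimized) for a benchmark config.

    Tuned configs come from the table; everything else falls back to the
    defaults and is reported with has_optimized=False.
    """
    key = (seq_len, head_dim)
    if key in TUNED_TABLE:
        return dict(TUNED_TABLE[key]), True
    return dict(DEFAULT_BLOCK_SIZES), False

tuned_sizes, tuned_flag = lookup_block_sizes(4096, 128)
fallback_sizes, fallback_flag = lookup_block_sizes(1024, 64)
```

Reporting the flag next to each result makes it possible to tell tuned measurements apart from ones taken with default block sizes.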
…-Hypercomputer#112)

Use segment IDs to filter out padded KV entries when needed.
Because segment IDs affect latency,
the benchmark should include them to reflect the padded case.
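
As a rough illustration of how segment IDs can mask out padding, the sketch below builds a boolean attention mask in plain Python. It assumes padded KV positions carry a segment ID that differs from every query's segment ID (0 here, chosen arbitrarily); the actual kernel and its ID convention are not shown in this commit.

```python
PADDING_SEGMENT_ID = 0  # assumption: padded positions use segment ID 0

def segment_mask(q_segment_ids, kv_segment_ids):
    """mask[i][j] is True when query i may attend to KV position j.

    A query attends only to KV positions in the same segment, so padded KV
    positions (which carry a different segment ID) are filtered out.
    """
    return [[q == kv for kv in kv_segment_ids] for q in q_segment_ids]

# Two real queries, three KV positions where the last one is padding.
mask = segment_mask([1, 1], [1, 1, PADDING_SEGMENT_ID])
```

Applying such a mask adds work to the kernel, which is consistent with the commit's point that latency should be measured with segment IDs enabled.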
* Adding CCC based autoscaler files

Signed-off-by: pulasthi <pulasthi@google.com>

* adding Readme file

---------

Signed-off-by: pulasthi <pulasthi@google.com>
Test out larger matrix

9 participants