Modified comprehensive_8dev_experiments.yaml from 20 num runs to 100 by danielwu7115 · Pull Request #123 · AI-Hypercomputer/accelerator-microbenchmarks

danielwu7115 · 2026-05-25T01:08:07Z

What this PR does:

This PR updates the microbenchmark workload configuration for the 8-device host-to-device and device-to-host (h2dd2h) benchmark. Specifically, it increases the number of runs for the benchmark from 20 to 100.

Why these changes are needed:

Increased Runs: The original 20 runs were too few to characterize the performance distribution of this workload, given its inherent run-to-run variability. Increasing num_runs to 100 gives a larger sample, so summary statistics of the distribution are more stable and results are more repeatable across benchmark invocations.
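As a rough illustration of why more runs help, the sketch below computes the standard error of the mean, which scales as 1/sqrt(n) for a fixed per-run spread. The latencies are synthetic, and the helper names (`standard_error`, `simulate_runs`) and the 10 ms / 1 ms parameters are assumptions made for this example, not part of the benchmark code.

```python
import math
import random
import statistics


def standard_error(samples):
    """Standard error of the mean: sample std dev divided by sqrt(n)."""
    return statistics.stdev(samples) / math.sqrt(len(samples))


def simulate_runs(num_runs, mean_ms=10.0, jitter_ms=1.0, seed=0):
    """Draw synthetic per-run latencies (hypothetical values, not measured data)."""
    rng = random.Random(seed)
    return [rng.gauss(mean_ms, jitter_ms) for _ in range(num_runs)]


# For a fixed per-run spread, the standard error scales as 1/sqrt(n),
# so going from 20 to 100 runs shrinks it by sqrt(100 / 20) = sqrt(5).
expected_shrink = math.sqrt(100 / 20)

runs_20 = simulate_runs(20)
runs_100 = simulate_runs(100)
print(f"expected shrink factor: {expected_shrink:.2f}")
print(f"SE with 20 runs:  {standard_error(runs_20):.3f} ms")
print(f"SE with 100 runs: {standard_error(runs_100):.3f} ms")
```

Under these assumptions, the uncertainty on the mean drops by a factor of about 2.24 when num_runs goes from 20 to 100; real transfers may have heavier-tailed variability than the Gaussian used here.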

How these changes were implemented:

The configuration value for num_runs was updated from 20 to 100 in Ironwood/configs/host_device/comprehensive_8dev_experiments.yaml.

How this was tested:

Verified locally that the benchmark accepts the updated configuration and successfully executes all 100 runs.

* Fix the issue where `kubectl wait` could only wait for one condition. Use a poll loop to check for status.
* Store the failed jobs and retry them up to 3 times.

TEST=Use dummy `must-fail` and `must-succeed` jobs, which exit 1/0 directly. Make sure the script retries the failed one 3 times and eventually prints out the command to retry.
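The retry behavior described above can be sketched as follows. This is a minimal illustration in Python, not the automation script itself: `run_with_retries`, `fake_run`, and the printed message are hypothetical, and `fake_run` stands in for launching a job and polling its status.

```python
def run_with_retries(jobs, run_job, max_retries=3):
    """Run each job once, then retry the failures up to max_retries times.

    jobs: list of job names (hypothetical identifiers).
    run_job: callable returning True on success, False on failure.
    Returns the jobs that still fail after all retries.
    """
    failed = [job for job in jobs if not run_job(job)]
    for attempt in range(1, max_retries + 1):
        if not failed:
            break
        print(f"retry {attempt}/{max_retries}: {failed}")
        failed = [job for job in failed if not run_job(job)]
    return failed


def fake_run(job):
    # Stand-in for launching a job and polling its status (assumption).
    return job != "must-fail"


still_failed = run_with_retries(["must-succeed", "must-fail"], fake_run)
for job in still_failed:
    # The message below is illustrative; the real script prints a retry command.
    print(f"job still failing, retry manually: {job}")
```

Keeping the failed jobs in a list means only those jobs are relaunched on each pass, which mirrors the `must-fail` / `must-succeed` test in the commit message.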

rahul-anand and others added 30 commits January 30, 2026 08:11

fix gemm timing logic (AI-Hypercomputer#92)

e2f2a81

Add gcs-bucket-csv-dir to support GCS upload

20286e5

Add support in run_benchmark.py for the `gcs-bucket-csv-dir` argument, which configures the directory where CSV/TSV results are written.

Add automation script and an HBM yaml example.

8dedb4c

Add aggregator yaml file.

2bf6a94

[Automation] Add readme and node-pools topology check

00fa7b2

Update automation script and yaml files for different topologies.

99fa6b1

[Automation] Add missing topology tracking in check_node_pool_setup.sh

62a0461

[Automation] Add topology-aware node pool validation.

991cb2f

[Automation] Update configurations for GEMM, H2D and Collectives

6b5c66b

[Automation] Update automation_launch.sh

926a7c0

[Automation] Enable kueue to prevent deadlock from race condition

a86ab39

[Automation] Update aggregator

f53bf5e

[Automation] Update aggregator and rename host to device yaml files

a06944d

[Automation] Delete unused yaml file and update aggregator file

ddd3473

[Automation] Update aggregator

9ea7110

Add dtype to H2D/D2H

bb5fc2f

[Automation] Automatically delete aggregator after completion

5315132

Update README with kueue and reformat

4bdf77f

Add dtype to aggregator H2D method

2de55f4

Remove unnecessary columns when aggregating and fix a typo of per_device

ed9f6ef

Create config folder and modify kubenetes yaml for gemm test

aad5d9d

Update aggregator for gemm test

cb79abb

Add dtype string in aggregated TSV file

b10e6bb

Add multiple precisions for HBM test

6f525bf

Print pending process status every minute

69661f9

Revert the changes that were made for an urgent demo (AI-Hypercomputer#90)

3e4b59a

[Ironwood] Add pipelined H2D mode to H2D benchmark

0b24e56

add extra datatypes in configs (AI-Hypercomputer#94)

a70b701

add GCS service account name to job yamls (AI-Hypercomputer#95)

94ddada

linamy85 and others added 29 commits February 11, 2026 09:22

Add step time to matmul series

a585ecb

update benchmark_attention not sweep at the runtime (AI-Hypercomputer#111)

adb47d1

We add a tuned table instead of running the sweep in each microbenchmark. For configs that are not tuned yet, we use the default block sizes and output the flag has_optimized=false.

Add attention into automation

9bf9e38

Update attention aggregate logic

1d36fa8

Set automation timeout to 2 hours

5d958cc

Set attention num_runs to 20

1ab8008

Try pinned memory

9b4e8de

fix numeric error cause by padding and improve default block size (AI-Hypercomputer#112)

e0a9abc

Use segment IDs to filter out the padding KV if needed. Because segment IDs affect latency, we add them so the benchmark reflects the padding situation.

Fix retry command

e7c1649

Remove BMM multi-host runs from the 2x2x1 yaml file to avoid confusion.

c2bec50

Adding CCC based autoscaler files (AI-Hypercomputer#109)

f4f89ee

* Adding CCC based autoscaler files
Signed-off-by: pulasthi <pulasthi@google.com>
* adding Readme file
---------
Signed-off-by: pulasthi <pulasthi@google.com>

adding all benchmarks to automation script (AI-Hypercomputer#114)

1629d32

Add missing 8192 gemm

5885a28

Remove peak flops for fp32, which is unspecified in spec (AI-Hypercomputer#117)

4a28403

Increase sweeping range for all reduce

a495fd6

Extend configs for gemm and collectives

c378bdb

Extend configs for gemm and collectives

dc795d9

Fix collectives aggregator for multi dtypes

55fa0ea

Address too much event issue

cb56a43

Use larger transfering size

dd61804

Test out larger matrix

Optimize H2D/D2H transfer pipelines and add comprehensive benchmark configs

f924a7e

Add benchmark guide and run script

7439b2a

Allow sweeping dtype in host_device benchmarks

38ec530

Added sample variance as a metric for h2dd2h and increased the num_runs from 20 to 100.

aa1e67c

Triggering CLA recheck

ac83fee

Triggering CLA recheck 2

f302e98

shorten sample_variance as variance

9e88b7b

check if the variance is nan and set the value to zero

c8eca6f

Updated comprehensive_8dev_experiments.yaml from 20 to 100 num runs

07fc9b3

danielwu7115 closed this May 25, 2026

danielwu7115 wants to merge 90 commits into AI-Hypercomputer:main from danielwu7115:tpu7x-h2dd2h

danielwu7115 commented May 25, 2026

9 participants
