Adopt latest BS_thread_pool library version v5.1.0 by 3manifold · Pull Request #2053 · OpenNMT/CTranslate2

3manifold · 2026-05-22T13:06:56Z

BS_thread_pool

Adopt BS_thread_pool version 5.1.0 and update the relevant code.

A notable feature of the pool is the set of flavors one can select when declaring it:

// ...
// ...
/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class.
 *
 * @tparam OptFlags A bitmask of flags which can be used to enable optional features. The flags are members of the `BS::tp` enumeration: `BS::tp::priority`, `BS::tp::pause`, and `BS::tp::wait_deadlock_checks`. The default is `BS::tp::none`, which disables all optional features. To enable multiple features, use the bitwise OR operator `|`, e.g. `BS::tp::priority | BS::tp::pause`.
 */
// ...
// ...
/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class. This alias defines a thread pool with all optional features disabled.
 */
using light_thread_pool = thread_pool<tp::none>;

/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class. This alias defines a thread pool with task priority enabled.
 */
using priority_thread_pool = thread_pool<tp::priority>;

/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class. This alias defines a thread pool with pausing enabled.
 */
using pause_thread_pool = thread_pool<tp::pause>;

/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class. This alias defines a thread pool with wait deadlock checks enabled.
 */
using wdc_thread_pool = thread_pool<tp::wait_deadlock_checks>;
// ...
// ...

This PR currently adopts light_thread_pool, the variant with all optional features disabled.
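To illustrate how a bitmask template parameter can select optional features at compile time, here is a minimal self-contained sketch. It does not include or reproduce BS_thread_pool.hpp: the `sketch` namespace, the flag values, and the `*_enabled` members are assumptions made for illustration, and only the flag and alias names mirror the doc comments quoted above.

```cpp
#include <cstdint>

namespace sketch {

// Hypothetical stand-in for the BS::tp enumeration; the numeric values are assumed.
enum tp : std::uint8_t {
    none = 0,
    priority = 1 << 0,
    pause = 1 << 1,
    wait_deadlock_checks = 1 << 2
};

using opt_t = std::uint8_t;

// The flags are a template parameter, so each combination is a distinct type
// and disabled features can be compiled out entirely.
template <opt_t OptFlags = tp::none>
class thread_pool {
public:
    static constexpr bool priority_enabled = (OptFlags & tp::priority) != 0;
    static constexpr bool pause_enabled = (OptFlags & tp::pause) != 0;
    static constexpr bool wait_deadlock_checks_enabled =
        (OptFlags & tp::wait_deadlock_checks) != 0;
};

// Aliases named after the ones quoted from the header.
using light_thread_pool = thread_pool<tp::none>;
using priority_thread_pool = thread_pool<tp::priority>;
using pause_thread_pool = thread_pool<tp::pause>;
using wdc_thread_pool = thread_pool<tp::wait_deadlock_checks>;

}  // namespace sketch
```

In this sketch, `sketch::thread_pool<sketch::tp::priority | sketch::tp::pause>` enables two features at once, while `sketch::light_thread_pool` enables none.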

ReplicaPool

Regarding include/ctranslate2/thread_pool.h & include/ctranslate2/replica_pool.h:
It doesn't make sense to replace ReplicaPool with BS::thread_pool directly. The CTranslate2 design is intentionally tighter than a general thread pool — the coupling between thread identity and model replica is load-bearing for GPU correctness and memory management.
Where BS::thread_pool would make sense is if we were replacing the internal ThreadPool primitive (the lower-level class ReplicaPool builds on), but we'd still need to re-implement the worker lifecycle on top of it. The gain would be modest (BS has nicer ergonomics and optional task priorities), but the migration cost would be non-trivial.
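The coupling between thread identity and replica can be pictured with a small sketch. This is not the CTranslate2 implementation: `Replica`, `ReplicaWorkers`, and `device_of_next_job` are hypothetical names, and the sketch only assumes that each worker thread creates, uses, and destroys exactly one replica, which is the property a general-purpose pool of interchangeable `void()` tasks does not provide by itself.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a model replica bound to one device.
struct Replica {
    int device_index;
};

class ReplicaWorkers {
public:
    // Every job receives the replica owned by the thread that runs it.
    using Job = std::function<void(Replica&)>;

    explicit ReplicaWorkers(std::size_t num_replicas) {
        for (std::size_t i = 0; i < num_replicas; ++i)
            threads_.emplace_back([this, i] { run(static_cast<int>(i)); });
    }

    ~ReplicaWorkers() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& thread : threads_) thread.join();
    }

    void post(Job job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run(int device_index) {
        // The replica lives on this thread's stack for the whole worker lifetime.
        Replica replica{device_index};
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(replica);
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Job> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;  // declared last so it starts after the other members
};

// Posts one job and blocks until it reports which replica ran it.
int device_of_next_job(ReplicaWorkers& workers) {
    auto result = std::make_shared<std::promise<int>>();
    std::future<int> future = result->get_future();
    workers.post([result](Replica& replica) { result->set_value(replica.device_index); });
    return future.get();
}
```

Swapping the queue and threads for a generic pool would still leave the per-thread replica setup and teardown in `run` to be re-implemented, which is the migration cost described above.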

Benchmark

#2053 (comment)

resolves #2052

This reverts commit b194809.

This reverts commit 5dfaa51.

jordimas · 2026-05-22T18:07:42Z

Thanks for the changes

By default we use OPENMP_RUNTIME=INTEL which does not execute this code path, including in the unit test. Can you please build with -DOPENMP_RUNTIME=NONE and do a benchmark to see that there is no regression in output or performance?

3manifold · 2026-05-22T20:20:31Z

> Thanks for the changes
>
> By default we use OPENMP_RUNTIME=INTEL which does not execute this code path, including in the unit test. Can you please build with -DOPENMP_RUNTIME=NONE and do a benchmark to see that there is no regression in output or performance?

Hi, apart from OPENMP_RUNTIME, how do you set the remaining cmake flags in these cases (mkl, ryu, etc.)? Knowing that would make the before/after benchmark numbers more meaningful.

3manifold · 2026-05-27T08:54:19Z

> Thanks for the changes
>
> By default we use OPENMP_RUNTIME=INTEL which does not execute this code path, including in the unit test. Can you please build with -DOPENMP_RUNTIME=NONE and do a benchmark to see that there is no regression in output or performance?

@jordimas

Benchmarks

Results (cpu, float32)

| Operation | Before (ms) | After (ms) | Delta (ms) | Change (%) | Result |
|---|---|---|---|---|---|
| gather | 0.00779288 | 0.00689337 | -0.00089951 | -11.54% | Faster |
| transpose | 0.18562000 | 0.16001400 | -0.02560600 | -13.79% | Faster |
| split | 0.00593810 | 0.00550140 | -0.00043670 | -7.35% | Faster |
| layer_norm | 0.00808120 | 0.00649450 | -0.00158670 | -19.64% | Faster |
| softmax | 0.03118040 | 0.02679670 | -0.00438370 | -14.06% | Faster |
| masked_softmax | 0.35382600 | 0.30794200 | -0.04588400 | -12.97% | Faster |
| topk | 0.51753900 | 0.44593000 | -0.07160900 | -13.84% | Faster |
| dequantize | 0.04008460 | 0.03651400 | -0.00357060 | -8.91% | Faster |
| conv1d | 46.99660000 | 42.37420000 | -4.62240000 | -9.84% | Faster |
| median_filter | 1.28043000 | 1.18569000 | -0.09474000 | -7.40% | Faster |

Summary

All 10 benchmarks improved.
Largest improvement: layer_norm (-19.64%).
Biggest absolute gain: conv1d (-4.62 ms).
Mean improvement across all ops: approximately -11.93%.

Build & run specifics

cmake -DCMAKE_INSTALL_PREFIX=$PWD/install -DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Release -DOPENMP_RUNTIME=NONE ..

for op in gather transpose split layer_norm softmax masked_softmax topk dequantize conv1d median_filter; do echo "=== benchmark_ops cpu $op float32 ==="; tests/benchmark_ops $op cpu; done

Benchmark Fix

I also fixed the masked_softmax benchmark, which previously failed with:

terminate called after throwing an instance of 'std::invalid_argument' what(): Length mask has size 32 which is different than the current batch size 6144
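The message suggests a precondition of roughly the following shape: the length mask must hold one entry per batch row. The sketch below is an assumption about that kind of check, not the CTranslate2 source. `check_length_mask` is a hypothetical helper, and only the message wording is taken from the error above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: rejects a length mask whose size differs from the batch size.
void check_length_mask(std::size_t lengths_size, std::size_t batch_size) {
    if (lengths_size != batch_size) {
        throw std::invalid_argument(
            "Length mask has size " + std::to_string(lengths_size) +
            " which is different than the current batch size " +
            std::to_string(batch_size));
    }
}
```

Under this assumption, a benchmark that builds 32 lengths for an input with 6144 rows fails the check, and the fix is to make the two sizes agree.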

jordimas · 2026-05-27T15:41:03Z

Thanks so much

3manifold added 5 commits May 22, 2026 14:51

Adopt BS_thread_pool.hpp version 5.1.0

48bef50

[TMP] Update .gitignore

ee53326

[TMP] Read local test data

30ab6d7

Revert "[TMP] Read local test data"

2dd27c0

This reverts commit b194809.

Revert "[TMP] Update .gitignore"

1d1764f

This reverts commit 5dfaa51.

3manifold mentioned this pull request May 22, 2026

Fix ThreadPool shutdown deadlock on Windows with CUDA #2027

Open

6 tasks

Fix benchmark_masked_softmax

c2c2648

jordimas changed the title ~~Adopt latest BS_thread_pool~~ Adopt latest BS_thread_pool library version v5.1.0 May 27, 2026

jordimas merged commit 4a9ed43 into OpenNMT:master May 27, 2026
21 checks passed

jordimas merged 6 commits into OpenNMT:master from 3manifold:adopt-latest-bs_thread_pool
