Adopt latest BS_thread_pool library version v5.1.0 #2053

Merged: jordimas merged 6 commits into OpenNMT:master from 3manifold:adopt-latest-bs_thread_pool on May 27, 2026

@3manifold (Contributor) commented May 22, 2026

BS_thread_pool

Adopt BS_thread_pool version 5.1.0 and update the relevant code.

A notable feature of the pool is the set of flavors it offers, selected through the OptFlags template parameter. The relevant excerpt from the header:

// ...
// ...
/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class.
 *
 * @tparam OptFlags A bitmask of flags which can be used to enable optional features. The flags are members of the `BS::tp` enumeration: `BS::tp::priority`, `BS::tp::pause`, and `BS::tp::wait_deadlock_checks`. The default is `BS::tp::none`, which disables all optional features. To enable multiple features, use the bitwise OR operator `|`, e.g. `BS::tp::priority | BS::tp::pause`.
 */
// ...
// ...
/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class. This alias defines a thread pool with all optional features disabled.
 */
using light_thread_pool = thread_pool<tp::none>;

/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class. This alias defines a thread pool with task priority enabled.
 */
using priority_thread_pool = thread_pool<tp::priority>;

/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class. This alias defines a thread pool with pausing enabled.
 */
using pause_thread_pool = thread_pool<tp::pause>;

/**
 * @brief A fast, lightweight, modern, and easy-to-use C++17/C++20/C++23 thread pool class. This alias defines a thread pool with wait deadlock checks enabled.
 */
using wdc_thread_pool = thread_pool<tp::wait_deadlock_checks>;
// ...
// ...

The light_thread_pool flavor is the one currently adopted.
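To illustrate how a bitmask template parameter can select features at compile time, here is a minimal self-contained sketch. It is not the library's code: the `demo` namespace, the bit values, the `*_enabled` members, and the `describe` helper are assumptions made for illustration; only the flag and alias names mirror the excerpt above.

```cpp
#include <cstdint>
#include <string>

namespace demo {  // hypothetical stand-in, NOT the BS namespace

using opt_t = std::uint8_t;

// Unscoped enum so that `priority | pause` yields an integer that can be
// used directly as a template argument. Bit values are assumed.
enum tp : opt_t {
  none = 0,
  priority = 1 << 0,
  pause = 1 << 1,
  wait_deadlock_checks = 1 << 2
};

template <opt_t OptFlags = tp::none>
class thread_pool {
public:
  // Each optional feature is decided at compile time from the bitmask.
  static constexpr bool priority_enabled = (OptFlags & tp::priority) != 0;
  static constexpr bool pause_enabled = (OptFlags & tp::pause) != 0;
  static constexpr bool wait_deadlock_checks_enabled =
      (OptFlags & tp::wait_deadlock_checks) != 0;
};

using light_thread_pool = thread_pool<tp::none>;
using priority_thread_pool = thread_pool<tp::priority>;

static_assert(!light_thread_pool::priority_enabled,
              "the light flavor enables no optional feature");
static_assert(priority_thread_pool::priority_enabled,
              "the priority flavor enables task priority");

// Lists the features enabled for a given pool type, joined by '|'.
template <typename Pool>
std::string describe() {
  std::string out;
  auto add = [&out](bool enabled, const char* name) {
    if (!enabled) return;
    if (!out.empty()) out += '|';
    out += name;
  };
  add(Pool::priority_enabled, "priority");
  add(Pool::pause_enabled, "pause");
  add(Pool::wait_deadlock_checks_enabled, "wait_deadlock_checks");
  if (out.empty()) out = "none";
  return out;
}

}  // namespace demo
```

With this sketch, `demo::describe<demo::thread_pool<demo::tp::priority | demo::tp::pause>>()` lists both features, while the light flavor reports none. A disabled feature costs nothing at run time because the choice is fixed when the template is instantiated.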

ReplicaPool

Regarding include/ctranslate2/thread_pool.h & include/ctranslate2/replica_pool.h:
It doesn't make sense to replace ReplicaPool with BS::thread_pool directly. The CTranslate2 design is intentionally tighter than a general thread pool — the coupling between thread identity and model replica is load-bearing for GPU correctness and memory management.
Where BS::thread_pool would make sense is if we were replacing the internal ThreadPool primitive (the lower-level class ReplicaPool builds on), but we'd still need to re-implement the worker lifecycle on top of it. The gain would be modest (BS has nicer ergonomics and optional task priorities), but the migration cost would be non-trivial.
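To make the thread-to-replica coupling concrete, here is a minimal sketch of a pool in which every worker thread is bound to exactly one replica for its whole lifetime. This is not CTranslate2's actual ReplicaPool or ThreadPool: `Replica`, `ReplicaBoundPool`, `post`, and `jobs_per_replica` are hypothetical names, and a real replica would hold a model and device state rather than an integer id.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a model replica pinned to one device.
struct Replica {
  int id;
};

// Hypothetical pool: one worker thread per replica, created up front.
class ReplicaBoundPool {
public:
  using Job = std::function<void(Replica&)>;

  explicit ReplicaBoundPool(int num_replicas) {
    // Build all replicas first so references to them stay stable.
    for (int i = 0; i < num_replicas; ++i)
      replicas_.push_back(Replica{i});
    for (int i = 0; i < num_replicas; ++i)
      workers_.emplace_back(
          [this, i] { run(replicas_[static_cast<std::size_t>(i)]); });
  }

  ~ReplicaBoundPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (std::thread& worker : workers_) worker.join();
  }

  ReplicaBoundPool(const ReplicaBoundPool&) = delete;
  ReplicaBoundPool& operator=(const ReplicaBoundPool&) = delete;

  // Enqueue a job; whichever worker picks it up passes in its own replica.
  void post(Job job) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      jobs_.push(std::move(job));
    }
    cv_.notify_one();
  }

private:
  // Worker loop: this thread only ever touches the replica it was given.
  void run(Replica& replica) {
    for (;;) {
      Job job;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
        if (jobs_.empty()) return;  // stopping and queue drained
        job = std::move(jobs_.front());
        jobs_.pop();
      }
      job(replica);
    }
  }

  std::vector<Replica> replicas_;
  std::vector<std::thread> workers_;
  std::queue<Job> jobs_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stopping_ = false;
};

// Runs num_jobs trivial jobs and reports how many each replica handled.
std::vector<int> jobs_per_replica(int num_replicas, int num_jobs) {
  std::vector<int> counts(static_cast<std::size_t>(num_replicas), 0);
  {
    ReplicaBoundPool pool(num_replicas);
    for (int j = 0; j < num_jobs; ++j)
      pool.post([&counts](Replica& r) {
        ++counts[static_cast<std::size_t>(r.id)];  // one slot per thread
      });
  }  // destructor drains the queue and joins the workers
  return counts;
}
```

A general-purpose pool hands out anonymous threads, so the per-thread replica binding shown in `run` would have to be rebuilt on top of it, which is the re-implementation cost mentioned above.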

Benchmark

#2053 (comment)

resolves #2052

@jordimas (Collaborator)

Thanks for the changes

By default we use OPENMP_RUNTIME=INTEL which does not execute this code path, including in the unit test. Can you please build with -DOPENMP_RUNTIME=NONE and do a benchmark to see that there is no regression in output or performance?

@3manifold (Contributor, Author) commented May 22, 2026

> Thanks for the changes
>
> By default we use OPENMP_RUNTIME=INTEL which does not execute this code path, including in the unit test. Can you please build with -DOPENMP_RUNTIME=NONE and do a benchmark to see that there is no regression in output or performance?

Hi, apart from OPENMP_RUNTIME, how do you set the rest of the CMake flags in these cases (mkl, ryu, etc.)? Knowing that would make the before and after benchmark numbers comparable.

@3manifold (Contributor, Author)

> Thanks for the changes
>
> By default we use OPENMP_RUNTIME=INTEL which does not execute this code path, including in the unit test. Can you please build with -DOPENMP_RUNTIME=NONE and do a benchmark to see that there is no regression in output or performance?

@jordimas

Benchmarks

Results (cpu, float32)

| Operation | Before (ms) | After (ms) | Delta (ms) | Change (%) | Result |
|---|---|---|---|---|---|
| gather | 0.00779288 | 0.00689337 | -0.00089951 | -11.54% | Faster |
| transpose | 0.18562000 | 0.16001400 | -0.02560600 | -13.79% | Faster |
| split | 0.00593810 | 0.00550140 | -0.00043670 | -7.35% | Faster |
| layer_norm | 0.00808120 | 0.00649450 | -0.00158670 | -19.64% | Faster |
| softmax | 0.03118040 | 0.02679670 | -0.00438370 | -14.06% | Faster |
| masked_softmax | 0.35382600 | 0.30794200 | -0.04588400 | -12.97% | Faster |
| topk | 0.51753900 | 0.44593000 | -0.07160900 | -13.84% | Faster |
| dequantize | 0.04008460 | 0.03651400 | -0.00357060 | -8.91% | Faster |
| conv1d | 46.99660000 | 42.37420000 | -4.62240000 | -9.84% | Faster |
| median_filter | 1.28043000 | 1.18569000 | -0.09474000 | -7.40% | Faster |

Summary

  • All 10 benchmarks improved.
  • Largest improvement: layer_norm (-19.64%).
  • Biggest absolute gain: conv1d (-4.62 ms).
  • Mean improvement across all ops: approximately -11.93%.
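For readers who want to recompute the derived columns, the following sketch shows how the delta, the percentage change, and the unweighted mean can be obtained from the before and after timings. The `Timing` struct and the function names are assumptions made for illustration; they are not part of `benchmark_ops`.

```cpp
#include <cstddef>
#include <vector>

// One benchmark row: operation name plus before/after timings in ms.
struct Timing {
  const char* op;
  double before_ms;
  double after_ms;
};

// Absolute change in milliseconds; negative means the operation got faster.
double delta_ms(const Timing& t) {
  return t.after_ms - t.before_ms;
}

// Relative change, expressed as a percentage of the "before" time.
double change_pct(const Timing& t) {
  return 100.0 * delta_ms(t) / t.before_ms;
}

// Unweighted mean of the per-operation percentage changes.
double mean_change_pct(const std::vector<Timing>& rows) {
  if (rows.empty()) return 0.0;
  double sum = 0.0;
  for (const Timing& t : rows) sum += change_pct(t);
  return sum / static_cast<double>(rows.size());
}
```

Feeding in the conv1d row (46.9966 ms before, 42.3742 ms after) gives a delta of about -4.62 ms and a change of about -9.84%. Note that the mean is unweighted, so sub-microsecond operations such as split count as much as conv1d.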

Build & run specifics

cmake -DCMAKE_INSTALL_PREFIX=$PWD/install -DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Release -DOPENMP_RUNTIME=NONE ..
for op in gather transpose split layer_norm softmax masked_softmax topk dequantize conv1d median_filter; do
  echo "=== benchmark_ops cpu $op float32 ==="
  tests/benchmark_ops $op cpu
done

Benchmark Fix

I also fixed the masked_softmax benchmark, which previously failed with:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  Length mask has size 32 which is different than the current batch size 6144

jordimas changed the title from "Adopt latest BS_thread_pool" to "Adopt latest BS_thread_pool library version v5.1.0" on May 27, 2026
@jordimas (Collaborator)

Thanks so much

jordimas merged commit 4a9ed43 into OpenNMT:master on May 27, 2026
21 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature] Adopt latest BS_thread_pool
