A few fixes in the threadpool semaphore. Unify Windows/Unix implementation of LIFO policy. #123921
VSadov merged 25 commits into dotnet:main
Tagging subscribers to this area: @agocke, @VSadov
Pull request overview
This PR addresses performance regressions in the threadpool semaphore (issue #123159) and unifies the Windows/Unix implementation of the LIFO (Last-In-First-Out) policy for threadpool worker thread management.
Changes:
- Introduces a unified `LowLevelThreadBlocker` class that uses OS-provided compare-and-wait APIs (futex on Linux, WaitOnAddress on Windows) for efficient thread blocking, with a fallback to a monitor-based implementation for other platforms
- Refactors `LowLevelLifoSemaphore` to use the new blocker infrastructure, removes platform-specific Windows/Unix implementations, and improves spinning heuristics based on CPU availability
- Adds native futex support for Linux through syscalls and Windows WaitOnAddress API interop
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/native/libs/System.Native/pal_threading.h | Adds declarations for Linux futex operations |
| src/native/libs/System.Native/pal_threading.c | Implements futex wait/wake operations for Linux using syscalls |
| src/native/libs/System.Native/entrypoints.c | Registers new futex entrypoints for Linux |
| src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs | Fixes spelling, removes spin count configuration, passes active thread count to semaphore |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelThreadBlocker.cs | New class providing portable thread blocking using futex/WaitOnAddress or monitor fallback |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs | Major refactoring to use LowLevelThreadBlocker, implements LIFO queue with pending signals, improves spin heuristics |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.Windows.cs | Deleted - functionality moved to unified implementation |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.Unix.cs | Deleted - functionality moved to unified implementation |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelFutex.Windows.cs | New file providing Windows WaitOnAddress API wrapper |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelFutex.Unix.cs | New file providing Linux futex wrapper |
| src/libraries/System.Private.CoreLib/src/System/Threading/Backoff.cs | Modified to return spin count and skip spinning on first attempt |
| src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems | Updates project to include new files and remove deleted platform-specific files |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.WaitOnAddress.cs | New interop declarations for Windows WaitOnAddress and WakeByAddressSingle APIs |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.CriticalSection.cs | Adds SuppressGCTransition attribute to LeaveCriticalSection |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.ConditionVariable.cs | Adds SuppressGCTransition attribute to WakeConditionVariable |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.LowLevelMonitor.cs | Adds SuppressGCTransition attributes to Release and Signal_Release |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.Futex.cs | New interop declarations for Linux futex operations |
One test that was affected by #123159 has one thread rent an array, mutate it, and pass it to another thread via a one-element buffer; the other thread inspects the buffer and releases it, and so on.
Since the scenario needs to wait on a task only occasionally, how frequently that need arises varies with the environment (CPU speed, memory speed, ...), but generally the test is sensitive to the threadpool spinning long enough to execute a task without waking a thread.
The results after this PR, vs baseline:
=== Linux x64
Same tests on Windows:
BenchmarkDotNet v0.14.1-nightly.20250107.205, Windows 11 (10.0.26200.7623)
=== baseline:
=== this PR:
TE benchmarks seem to favor the change as well. Unlike the ProducerConsumer microbenchmark, TE does not like long threadpool spins, likely because there are non-threadpool threads, such as epoll threads, that also need CPU time.
Using command:
=== Baseline:
=== This PR:
…evelThreadBlocker.cs Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Thanks!!!
Re: #123159
Changes:
- `Backoff.Exponential(0)`. Embarrassing bug.
To get an exponentially growing random spin count for an iteration, we generate a pseudorandom `uint` and do `>> (32 - attempt)`. Since C# masks the shift operand with 31, when `attempt == 0` we end up not shifting at all, and the first iteration gets a large random spin count.
That caused many noisy results and, interestingly, some improvements (in scenarios that benefit from very long spins).
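The shift behavior described above can be reproduced in isolation. The following is a minimal sketch, not the actual `Backoff` code: the class and method names are hypothetical, and the guard shown is only one possible way to avoid the problem (it matches the file summary's note that spinning is skipped on the first attempt, but the PR's real fix may differ). It assumes `0 <= attempt <= 32`.

```csharp
using System;

// Hypothetical names; this is an illustration, not the runtime's Backoff class.
public static class BackoffShiftSketch
{
    // Buggy shape: for a 32-bit left operand, C# uses only the low 5 bits
    // of the shift count, so a count of 32 (attempt == 0) becomes 0.
    public static uint BuggySpinLimit(uint random, int attempt) =>
        random >> (32 - attempt);

    // One possible guard (an assumption): treat attempt 0 as "no spin".
    public static uint GuardedSpinLimit(uint random, int attempt) =>
        attempt <= 0 ? 0u : random >> (32 - attempt);

    public static void Main()
    {
        uint random = 0xDEADBEEF;

        Console.WriteLine(BuggySpinLimit(random, 0));   // 3735928559 (unshifted)
        Console.WriteLine(BuggySpinLimit(random, 1));   // 1
        Console.WriteLine(GuardedSpinLimit(random, 0)); // 0
        Console.WriteLine(GuardedSpinLimit(random, 3)); // 6
    }
}
```

With the buggy shape, attempt 0 yields the full 32-bit random value instead of the smallest spin count, which is consistent with the unexpectedly long first spins described above.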
Once we are done spinning, we block threads and when workers are needed again wake them in LIFO order.
The Unix WaitSubsystem is pretty heavy for these needs. It supports interruptible waits, waiting on multiple objects, etc. None of that is interesting here. Most calls into the subsystem take a global process-wide lock, which can contend under load with other uses, and threads that wake workers may contend with workers going to sleep.
Windows used an opaque `GetQueuedCompletionStatus` for the side effect of releasing threads in LIFO order when a completion is posted, with unknown overheads and interactions, even though it is typically more efficient than the Unix WaitSubsystem.
The portable implementation seems to be faster than either of the platform-specific ones (measured by disabling spinning and running a few latency-sensitive benchmarks).
The portable implementation is also easier to reason about and to debug anomalies.
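To make the LIFO wake policy concrete, here is a minimal monitor-based sketch. It is not the PR's `LowLevelLifoSemaphore`: all type and member names are hypothetical, it has no spinning, and it uses `Monitor` rather than futex or WaitOnAddress. It only illustrates the idea of keeping blocked waiters on a stack, waking the most recently blocked one first, and remembering signals that arrive when nobody is waiting.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch; not the runtime's implementation.
public sealed class LifoSemaphoreSketch
{
    private sealed class Waiter { public bool Signaled; }

    private readonly object _lock = new object();
    private readonly Stack<Waiter> _waiters = new Stack<Waiter>();
    private int _pendingSignals;

    public int WaiterCount { get { lock (_lock) return _waiters.Count; } }

    public void Wait()
    {
        Waiter waiter;
        lock (_lock)
        {
            // Consume a signal that arrived while nobody was waiting.
            if (_pendingSignals > 0) { _pendingSignals--; return; }
            waiter = new Waiter();
            _waiters.Push(waiter); // the newest waiter sits on top of the stack
        }
        lock (waiter)
        {
            // The flag guards against a wake that arrives before we block.
            while (!waiter.Signaled) Monitor.Wait(waiter);
        }
    }

    public void Release()
    {
        Waiter waiter;
        lock (_lock)
        {
            if (_waiters.Count == 0) { _pendingSignals++; return; }
            waiter = _waiters.Pop(); // LIFO: wake the most recent waiter
        }
        lock (waiter)
        {
            waiter.Signaled = true;
            Monitor.Pulse(waiter);
        }
    }
}

public static class LifoDemo
{
    public static List<int> Run()
    {
        var sem = new LifoSemaphoreSketch();
        var order = new List<int>();
        var threads = new Thread[3];
        for (int i = 0; i < threads.Length; i++)
        {
            int id = i;
            threads[i] = new Thread(() => { sem.Wait(); lock (order) order.Add(id); });
            threads[i].Start();
            // Ensure thread i is registered before starting the next one.
            while (sem.WaiterCount != i + 1) Thread.Yield();
        }
        for (int i = 0; i < threads.Length; i++)
        {
            sem.Release();
            // Let the woken thread record itself before the next release.
            while (Count(order) != i + 1) Thread.Yield();
        }
        foreach (var t in threads) t.Join();
        return order;
    }

    private static int Count(List<int> list) { lock (list) return list.Count; }

    public static void Main() => Console.WriteLine(string.Join(",", Run())); // 2,1,0
}
```

Waking the most recently blocked thread tends to favor a thread whose state is still warm in caches, and lets the oldest waiters stay idle; this sketch shows only the ordering, not the performance characteristics discussed above.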
Spinning in threadpool is very tricky and spinning benefits differ greatly between scenarios. For some scenarios the longer the spin the better. But there are scenarios that benefit when the threadpool releases cores quickly once it sees no work. No preset fixed spin count is going to be good for everything.
An adaptive approach appears to be necessary to improve some scenarios without regressing many others.
We can further improve the heuristic, if there are more ideas.