A software implementation of the dwcas (CMPXCHG16B on x86) instruction
Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. Then they try to swap in the new value using CAS, where the comparison checks for the location still being equal to the old value. If CAS indicates that the attempt has failed, it has to be repeated from the beginning: the location is re-read, a new value is re-computed and the CAS is tried again. Instead of immediately retrying after a CAS operation fails, researchers have found that total system performance can be improved in multiprocessor systems—where many threads constantly update some particular shared variable if threads that see their CAS fail use exponential backoff, in other words, wait a little before retrying the CAS.
soft_dwcas
The actual (hardware-)implementation is surpising and confusing (hence soft_dwcas, exposing its operation). It is (slightly) usefull in developing as mis-using the CMPXCHG16B instruction results in an instant crash, while the soft_dwcas let's one inspect results more easily (print-debugging).
namespace lockless {
template<typename MutexType>
[[nodiscard]] HEDLEY_ALWAYS_INLINE bool soft_dwcas ( uint128_t * dest_, uint128_t ex_new_, uint128_t * cr_old_ ) noexcept {
alignas ( 64 ) static MutexType cas_mutex;
std::scoped_lock lock ( cas_mutex );
bool check = not equal_m128 ( dest_, cr_old_ );
std::memcpy ( cr_old_, dest_, 16 );
if ( check )
return false;
std::memcpy ( dest_, &ex_new_, 16 );
return true;
}
} // namespace lockless
Implementing this little function made me understand the difference between lockless and concurrent (mutex based) programming. I've listed in random order some of my thoughts:
- mutexes prevent concurrency
soft_dwcasimplies full serialization (i.e. has no real-world use other than real-world developing)soft_dwcascannot be used to test concurrency itself, just that the mechanics [of lockless progression] work- the mechanism used in lockless-programming is locks
dwcasuses 'the data' as a mutex, intrusive mutexes- as many intrusive mutexes as there are data points (as compared to 1, or a few)
CMPXCHG24B(fully lockless dl-list) andCMPXCHG32B?- illustrative, a mental model (
CMPXCHG16Bis 'similar' tosoft_dwcasminus the mutex) - below a lockless implementation (for completeness sake)
dwcas (win/nix/ios)
namespace lockless {
[[nodiscard]] HEDLEY_ALWAYS_INLINE bool dwcas ( volatile uint128_t * dest_, uint128_t ex_new_, uint128_t * cr_old_ ) noexcept {
#if ( defined( __clang__ ) or defined( __GNUC__ ) )
bool value;
__asm__ __volatile__( "lock cmpxchg16b %1\n\t"
"setz %0"
: "=q"( value ), "+m"( dest_ ), "+d"( cr_old_->hi ), "+a"( cr_old_->lo )
: "c"( ex_new_.hi ), "b"( ex_new_.lo )
: "cc" );
return value;
#else
return _InterlockedCompareExchange128 ( ( volatile long long * ) dest_->lo, ex_new_.hi, ex_new_.lo,
( long long * ) cr_old_->lo );
#endif
}
} // namespace lockless