void _mm_stream_si128(__m128i *p, __m128i a)
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte aligned.
*p := a
void _mm_stream_si32(int *p, int a)
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated.
*p := a
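As an illustration of the two streaming stores above, the following is a minimal sketch rather than a reference implementation. The helper name stream_fill_i32 is hypothetical, and the sketch assumes the destination is 16-byte aligned. It ends with _mm_sfence (an SSE intrinsic not described in this section) because non-temporal stores are weakly ordered and a store fence is the usual way to order them before later stores.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_stream_si32, _mm_set1_epi32 */
#include <stddef.h>

/* Hypothetical helper: fill `count` ints at `dst` with `value` using
   non-temporal stores. Assumption: `dst` is 16-byte aligned. */
static void stream_fill_i32(int *dst, size_t count, int value)
{
    __m128i v = _mm_set1_epi32(value);  /* four copies of value */
    size_t i = 0;

    /* Bulk part: one 16-byte streaming store per four ints. */
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);

    /* Tail: remaining ints, one 4-byte streaming store each. */
    for (; i < count; ++i)
        _mm_stream_si32(dst + i, value);

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

Streaming stores tend to pay off for large buffers that will not be read again soon; for small or soon-reused data, ordinary stores are usually the better choice.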
void _mm_clflush(void const *p)
Cache line containing p is flushed and invalidated from all caches in the coherency domain.
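A hedged sketch of flushing a whole buffer with this intrinsic follows. The helper name flush_range and the 64-byte line size are assumptions made for illustration; the actual line size is processor-specific and can be queried with CPUID. The trailing _mm_mfence orders the flushes before later memory accesses.

```c
#include <emmintrin.h>  /* SSE2: _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Query CPUID in real code. */
#define ASSUMED_LINE_SIZE 64

/* Hypothetical helper: flush every cache line that overlaps
   the byte range [start, start + len). */
static void flush_range(const void *start, size_t len)
{
    if (len == 0)
        return;

    /* Round the start address down to a cache-line boundary. */
    uintptr_t p   = (uintptr_t)start & ~(uintptr_t)(ASSUMED_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)start + len;

    for (; p < end; p += ASSUMED_LINE_SIZE)
        _mm_clflush((const void *)p);

    /* Order the flushes before subsequent memory accesses. */
    _mm_mfence();
}
```

Flushing does not change the contents of memory; it only writes back modified lines and evicts them from the caches.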
void _mm_lfence(void)
Guarantees that every load instruction that precedes, in program order, the load fence instruction is globally visible before any load instruction which follows the fence in program order.
void _mm_mfence(void)
Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.
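One situation where a full memory fence matters is a store followed by a load from a different location, which IA-32 processors may otherwise reorder. The sketch below uses Peterson's two-thread lock as an illustration. It is not taken from this reference: the names want, turn, peterson_lock and peterson_unlock are hypothetical, and the use of volatile variables is a simplification (production code would normally use the language's atomic types).

```c
#include <emmintrin.h>  /* SSE2: _mm_mfence, _mm_pause */

/* Hypothetical shared state for two threads numbered 0 and 1. */
static volatile int want[2];  /* want[i] != 0: thread i wants the lock */
static volatile int turn;     /* which thread has priority when both want it */

static void peterson_lock(int self)
{
    int other = 1 - self;

    want[self] = 1;   /* announce intent */
    turn = other;     /* give the other thread priority */

    /* Make both stores globally visible before the loads below.
       Without the fence, the loads could be satisfied before the
       stores become visible, and both threads could enter. */
    _mm_mfence();

    while (want[other] && turn == other)
        _mm_pause();  /* spin-wait hint */
}

static void peterson_unlock(int self)
{
    /* Keep critical-section accesses ahead of the release store. */
    _mm_mfence();
    want[self] = 0;
}
```

Thread 0 would call peterson_lock(0) and peterson_unlock(0) around its critical section, and thread 1 would do the same with argument 1.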
void _mm_pause(void)
The execution of the next instruction is delayed by an implementation-specific amount of time. The instruction does not modify the architectural state. This intrinsic can provide a significant performance gain in spin-wait loops and is described in more detail below.
The PAUSE intrinsic is used in spin-wait loops on processors that implement dynamic execution (especially out-of-order execution). In the spin-wait loop, PAUSE improves the speed at which the code detects the release of the lock. For dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin-loop.
Future generations of Intel microarchitectures will see increasing performance benefit from the use of PAUSE in spin-wait loops.
Example of loop with the PAUSE instruction:
spin_loop: pause
cmp eax, A
jne spin_loop
In the above example, the program spins until memory location A matches the value in register eax. The code sequence that follows shows a test-and-test-and-set. In this example, the spin occurs only after the attempt to get a lock has failed.
get_lock: mov eax, 1
xchg eax, A ; Try to get lock
cmp eax, 0 ; Test if successful
jne spin_loop
critical_section:
<critical_section code>
mov A, 0 ; Release lock
jmp continue
spin_loop: pause ; Spin-loop hint
cmp A, 0 ; Check lock availability
jne spin_loop
jmp get_lock
continue: <other code>
Note that the first branch is predicted to fall through to the critical section in anticipation of successfully gaining access to the lock. It is highly recommended that all spin-wait loops include the PAUSE instruction. Since PAUSE is backward compatible with all existing IA-32 processor generations, a test for processor type (a CPUID test) is not needed. All legacy processors will execute PAUSE as a NOP, but on processors that use PAUSE as a hint there can be a significant performance benefit.
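The test-and-test-and-set sequence above can be sketched in C with the _mm_pause intrinsic. This is an illustrative sketch under stated assumptions, not the reference implementation: the type tts_lock and the functions tts_acquire and tts_release are hypothetical names, and C11 atomics from stdatomic.h stand in for the xchg and mov instructions of the assembly listing.

```c
#include <emmintrin.h>   /* SSE2: _mm_pause */
#include <stdatomic.h>   /* C11 atomics, standing in for xchg/mov */

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct { atomic_int locked; } tts_lock;

static void tts_acquire(tts_lock *l)
{
    for (;;) {
        /* get_lock: atomically swap in 1; an old value of 0 means success. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;

        /* spin_loop: wait with plain loads until the lock looks free,
           issuing the spin-wait hint on every iteration. */
        do {
            _mm_pause();
        } while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0);
    }
}

static void tts_release(tts_lock *l)
{
    /* Release lock: store 0 so a waiting thread can take it. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A caller would declare a lock such as tts_lock l = {0}, then bracket the critical section with tts_acquire(&l) and tts_release(&l). Spinning on loads rather than on the exchange keeps the waiting threads from repeatedly taking exclusive ownership of the cache line.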