Memory and Initialization Using Streaming SIMD Extensions

This section describes the load, set, and store operations, which let you initialize __m128 data and move it to and from memory. The load and set operations are similar in that both initialize __m128 data. However, the set operations take float arguments and are intended for initialization with constants, whereas the load operations take a pointer to float and are intended to mimic the instructions for loading data from memory. The store operations write __m128 data to the address supplied.

The intrinsics are listed in the following table. Syntax and a brief description of each follow the table.

The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.

Intrinsic Name  Alternate Name  Operation                                                        Corresponding Instruction
_mm_load_ss                     Load the low value and clear the three high values               MOVSS
_mm_load_ps1    _mm_load1_ps    Load one value into all four words                               MOVSS + Shuffling
_mm_load_ps                     Load four values, address aligned                                MOVAPS
_mm_loadu_ps                    Load four values, address unaligned                              MOVUPS
_mm_loadr_ps                    Load four values, in reverse order                               MOVAPS + Shuffling
_mm_set_ss                      Set the low value and clear the three high values                Composite
_mm_set_ps1     _mm_set1_ps     Set all four words with the same value                           Composite
_mm_set_ps                      Set four values                                                  Composite
_mm_setr_ps                     Set four values, in reverse order                                Composite
_mm_setzero_ps                  Clear all four values                                            Composite
_mm_store_ss                    Store the low value                                              MOVSS
_mm_store_ps1   _mm_store1_ps   Store the low value across all four words, address aligned       Shuffling + MOVSS
_mm_store_ps                    Store four values, address aligned                               MOVAPS
_mm_storeu_ps                   Store four values, address unaligned                             MOVUPS
_mm_storer_ps                   Store four values, in reverse order                              MOVAPS + Shuffling
_mm_move_ss                     Set the low word, and pass in three high values                  MOVSS
_mm_getcsr                      Return the control register contents                             STMXCSR
_mm_setcsr                      Set the control register                                         LDMXCSR
_mm_prefetch                    Load a cache line closer to the processor                        PREFETCH
_mm_stream_pi                   Store 64 bits without polluting the caches                       MOVNTQ
_mm_stream_ps                   Store four values without polluting the caches, address aligned  MOVNTPS
_mm_sfence                      Make preceding stores globally visible before later stores       SFENCE
_mm_cvtss_f32                   Extract the low value as a float                                 No specific instruction

__m128 _mm_load_ss(float const*a)

Loads a single-precision floating-point (SP FP) value into the low word and clears the upper three words.
r0 := *a
r1 := 0.0 ; r2 := 0.0 ; r3 := 0.0

__m128 _mm_load_ps1(float const*a)

Loads a single SP FP value, copying it into all four words.
r0 := *a
r1 := *a
r2 := *a
r3 := *a
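
The difference between the two single-value loads can be checked with a minimal sketch such as the one below. The helper names load_low and load_broadcast are hypothetical and exist only for illustration; the result is copied out with _mm_storeu_ps so that no alignment is assumed for the output array.

```c
#include <xmmintrin.h>

/* Hypothetical helper: loads *p into the low element (r0) and
   zeroes r1..r3, then copies r0..r3 to out[0..3]. */
void load_low(const float *p, float out[4]) {
    _mm_storeu_ps(out, _mm_load_ss(p));
}

/* Hypothetical helper: copies *p into all four elements,
   then copies r0..r3 to out[0..3]. */
void load_broadcast(const float *p, float out[4]) {
    _mm_storeu_ps(out, _mm_load_ps1(p));
}
```

Calling load_low on a pointer to 2.5f should leave 2.5f in out[0] and 0.0f elsewhere, while load_broadcast should leave 2.5f in all four slots.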

__m128 _mm_load_ps(float const*a)

Loads four SP FP values. The address must be 16-byte-aligned.
r0 := a[0]
r1 := a[1]
r2 := a[2]
r3 := a[3]

__m128 _mm_loadu_ps(float const*a)

Loads four SP FP values. The address need not be 16-byte-aligned.
r0 := a[0]
r1 := a[1]
r2 := a[2]
r3 := a[3]
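
As a rough illustration of the alignment rule, the sketch below sums four floats with each load. The function names are hypothetical. It assumes the caller guarantees 16-byte alignment for sum4_aligned (for example with the C11 _Alignas specifier) and makes no such assumption for sum4_unaligned.

```c
#include <xmmintrin.h>

/* Sums four floats starting at p. p must be 16-byte aligned;
   an unaligned address may fault with _mm_load_ps. */
float sum4_aligned(const float *p) {
    float out[4];
    _mm_storeu_ps(out, _mm_load_ps(p));
    return out[0] + out[1] + out[2] + out[3];
}

/* Sums four floats starting at p. p may have any alignment. */
float sum4_unaligned(const float *p) {
    float out[4];
    _mm_storeu_ps(out, _mm_loadu_ps(p));
    return out[0] + out[1] + out[2] + out[3];
}
```

Given _Alignas(16) float buf[8], the address buf is suitable for sum4_aligned, whereas buf + 1 is only suitable for sum4_unaligned.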

__m128 _mm_loadr_ps(float const*a)

Loads four SP FP values in reverse order. The address must be 16-byte-aligned.
r0 := a[3]
r1 := a[2]
r2 := a[1]
r3 := a[0]

__m128 _mm_set_ss(float a)

Sets the low word to the SP FP value a and clears the upper three words.
r0 := a
r1 := r2 := r3 := 0.0

__m128 _mm_set_ps1(float a)

Sets the four SP FP values to a.
r0 := r1 := r2 := r3 := a

__m128 _mm_set_ps(float a, float b, float c, float d)

Sets the four SP FP values to the four inputs. The first argument goes to the highest word and the last argument to the low word.
r0 := d
r1 := c
r2 := b
r3 := a

__m128 _mm_setr_ps(float a, float b, float c, float d)

Sets the four SP FP values to the four inputs in reverse order, that is, the first argument goes to the low word.
r0 := a
r1 := b
r2 := c
r3 := d
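
Argument order is easy to get wrong, so a small check can help. In the sketch below (function names are hypothetical), the result is stored with _mm_storeu_ps so that out[0] receives r0. In the common implementations of xmmintrin.h, the last argument of _mm_set_ps ends up in r0, while the first argument of _mm_setr_ps ends up in r0.

```c
#include <xmmintrin.h>

/* Writes elements r0..r3 of _mm_set_ps(a, b, c, d) to out[0..3]. */
void set_order(float a, float b, float c, float d, float out[4]) {
    _mm_storeu_ps(out, _mm_set_ps(a, b, c, d));
}

/* Writes elements r0..r3 of _mm_setr_ps(a, b, c, d) to out[0..3]. */
void setr_order(float a, float b, float c, float d, float out[4]) {
    _mm_storeu_ps(out, _mm_setr_ps(a, b, c, d));
}
```

With the inputs 1, 2, 3, 4, set_order should produce 4, 3, 2, 1 in memory order and setr_order should produce 1, 2, 3, 4.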

__m128 _mm_setzero_ps(void)

Clears the four SP FP values.
r0 := r1 := r2 := r3 := 0.0

void _mm_store_ss(float *v, __m128 a)

Stores the lower SP FP value.
*v := a0

void _mm_store_ps1(float *v, __m128 a)

Stores the lower SP FP value across four words. The address must be 16-byte-aligned.
v[0] := a0
v[1] := a0
v[2] := a0
v[3] := a0
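
One possible use of this store is filling a four-float block with a single value. The sketch below uses a hypothetical helper name, fill4, and assumes the caller passes a 16-byte-aligned destination.

```c
#include <xmmintrin.h>

/* Hypothetical helper: writes value to dst[0..3].
   dst is assumed to be 16-byte aligned. */
void fill4(float *dst, float value) {
    __m128 v = _mm_set_ss(value);   /* value in r0, zeros in r1..r3 */
    _mm_store_ps1(dst, v);          /* r0 is written to all four slots */
}
```

Only the low element of the source matters here; the zeros in the upper three elements are never written to memory.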

void _mm_store_ps(float *v, __m128 a)

Stores four SP FP values. The address must be 16-byte-aligned.
v[0] := a0
v[1] := a1
v[2] := a2
v[3] := a3

void _mm_storeu_ps(float *v, __m128 a)

Stores four SP FP values. The address need not be 16-byte-aligned.
v[0] := a0
v[1] := a1
v[2] := a2
v[3] := a3

void _mm_storer_ps(float *v, __m128 a)

Stores four SP FP values in reverse order. The address must be 16-byte-aligned.
v[0] := a3
v[1] := a2
v[2] := a1
v[3] := a0
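
A small sketch of the reversed store: loading four values in memory order and writing them back with _mm_storer_ps reverses the block in place. The helper name reverse4 is hypothetical, and the pointer is assumed to be 16-byte aligned because both intrinsics used require it.

```c
#include <xmmintrin.h>

/* Hypothetical helper: reverses the four floats at p in place.
   p is assumed to be 16-byte aligned. */
void reverse4(float *p) {
    __m128 v = _mm_load_ps(p);  /* r0..r3 = p[0..3] */
    _mm_storer_ps(p, v);        /* p[0..3] = r3, r2, r1, r0 */
}
```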

__m128 _mm_move_ss(__m128 a, __m128 b)

Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from a.
r0 := b0
r1 := a1
r2 := a2
r3 := a3
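
To illustrate, the following sketch replaces only the low element of a vector and leaves the other three untouched. The helper name replace_low is hypothetical; unaligned loads and stores are used so that no alignment is assumed.

```c
#include <xmmintrin.h>

/* Hypothetical helper: out[0] = b_low, out[1..3] = a[1..3]. */
void replace_low(const float a[4], float b_low, float out[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_set_ss(b_low);           /* b_low in r0 */
    _mm_storeu_ps(out, _mm_move_ss(va, vb)); /* r0 from vb, r1..r3 from va */
}
```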

unsigned int _mm_getcsr(void)

Returns the contents of the control register.

void _mm_setcsr(unsigned int i)

Sets the control register to the value specified.
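
A common pattern is to save the control register (MXCSR), change one field, and restore the saved value afterward. The sketch below assumes the _MM_FLUSH_ZERO_ON and _MM_FLUSH_ZERO_MASK macros, which xmmintrin.h provides in the common compilers but which are not described in this section. The function name is hypothetical.

```c
#include <xmmintrin.h>

/* Enables flush-to-zero, checks that the bit reads back as set,
   then restores the caller's control register.
   Returns nonzero if the bit was observed set. */
int flush_to_zero_roundtrip(void) {
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | _MM_FLUSH_ZERO_ON);
    int was_set = (_mm_getcsr() & _MM_FLUSH_ZERO_MASK) != 0;
    _mm_setcsr(saved);   /* leave the register as it was found */
    return was_set;
}
```

Restoring the saved value keeps the change local, since the control register affects every subsequent SSE floating-point operation on the thread.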

void _mm_prefetch(char const*a, int sel)

(uses PREFETCH) Loads one cache line of data from address a to a location "closer" to the processor. The value sel specifies the type of prefetch operation: the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used, corresponding to the type of prefetch instruction.
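
As a hedged illustration, the sketch below issues a prefetch hint some distance ahead of the element being read in a simple summation loop. The function name is hypothetical, and the distance of 16 floats (64 bytes) is an illustrative guess rather than a tuned value; whether prefetching helps at all depends on the processor and the access pattern.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Sums n floats, hinting the data 16 floats ahead once per 16 elements.
   The prefetch distance is an assumption made for illustration. */
float sum_with_prefetch(const float *p, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i % 16 == 0 && i + 16 < n)
            _mm_prefetch((const char *)(p + i + 16), _MM_HINT_T0);
        total += p[i];
    }
    return total;
}
```

The prefetch is only a hint: the result of the function is the same with or without it.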

void _mm_stream_pi(__m64 *p, __m64 a)

(uses MOVNTQ) Stores the data in a to the address p without polluting the caches. Because this intrinsic uses an MMX register, you must empty the multimedia state afterward. See the topic The EMMS Instruction: Why You Need It and When to Use It.

void _mm_stream_ps(float *p, __m128 a)

(uses MOVNTPS) Stores the data in a to the address p without polluting the caches. The address must be 16-byte-aligned.

void _mm_sfence(void)

(uses SFENCE) Guarantees that every preceding store is globally visible before any subsequent store.
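
The streaming store and the store fence are often used together. A minimal sketch, assuming a 16-byte-aligned destination and a length that is a multiple of four (the function name stream_fill is hypothetical):

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Fills n floats at dst with value using streaming stores.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void stream_fill(float *dst, size_t n, float value) {
    __m128 v = _mm_set_ps1(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);
    _mm_sfence();  /* order the streaming stores before any later store */
}
```

The fence is placed once after the loop rather than after every store, since it only needs to separate the streamed data from whatever stores follow.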

float _mm_cvtss_f32(__m128 a)

This intrinsic extracts a single-precision floating-point value from the first vector element of an __m128. It does so in the most efficient manner possible in the context used. This intrinsic does not map to any specific SSE instruction.
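
For example, the low element can be returned directly as a float without going through a temporary in memory. The helper name first_element below is hypothetical, and an unaligned load is used so that no alignment is assumed.

```c
#include <xmmintrin.h>

/* Hypothetical helper: returns p[0] by way of an __m128 value. */
float first_element(const float *p) {
    __m128 v = _mm_loadu_ps(p);  /* r0..r3 = p[0..3] */
    return _mm_cvtss_f32(v);     /* r0 as a scalar float */
}
```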