Native Intrinsics for Itanium(TM) Instructions

For more information on the instructions, refer to:

Itanium(TM)-based Application Developer's Architecture Guide, Intel Corporation

or

Itanium(TM) Architecture Software Developer's Manual Vol. 3: Instruction Set Reference, Intel Corporation, doc. number 245319-001

Both of these documents are available from http://developer.intel.com.

Intrinsic Corresponding Instruction

__m64 _m64_czx1l(__m64 a)

czx1.l (Compute Zero Index)

__m64 _m64_czx1r(__m64 a)

czx1.r (Compute Zero Index)

__m64 _m64_czx2l(__m64 a)

czx2.l (Compute Zero Index)

__m64 _m64_czx2r(__m64 a)

czx2.r (Compute Zero Index)

__int64 _i64_dep_mr(__int64 r, __int64 s, const int pos, const int len)

dep (Deposit)

__int64 _i64_dep_mi(const int r, __int64 s, const int pos, const int len)

dep (Deposit)

__int64 _i64_dep_zr(__int64 r, const int pos, const int len)

dep.z (Deposit)

__int64 _i64_dep_zi(const int v, const int pos, const int len)

dep.z (Deposit)

__int64 _i64_extr(__int64 r, const int pos, const int len)

extr (Extract)

__int64 _i64_extru(__int64 r, const int pos, const int len)

extr.u (Extract)

__int64 _i64_muladd64lo( __int64 a, __int64 b, __int64 c)

xma.l (Fixed-point multiply add)

__int64 _i64_muladd64lo_u( __int64 a, __int64 b, __int64 c)

xma.lu (Fixed-point multiply add)

__int64 _i64_muladd64hi( __int64 a, __int64 b, __int64 c)

xma.h (Fixed-point multiply add)

__int64 _i64_muladd64hi_u( __int64 a, __int64 b, __int64 c)

xma.hu (Fixed-point multiply add)

__m64 _m64_mix1l(__m64 a, __m64 b)

mix1.l (Mix)

__m64 _m64_mix1r(__m64 a, __m64 b)

mix1.r (Mix)

__m64 _m64_mix2l(__m64 a, __m64 b)

mix2.l (Mix)

__m64 _m64_mix2r(__m64 a, __m64 b)

mix2.r (Mix)

__m64 _m64_mix4l(__m64 a, __m64 b)

mix4.l (Mix)

__m64 _m64_mix4r(__m64 a, __m64 b)

mix4.r (Mix)

__m64 _m64_mux1(__m64 a, const int n)

mux1 (Mux)

__m64 _m64_mux2(__m64 a, const int n)

mux2 (Mux)

__int64 _i64_popcnt(__int64 a)

popcnt (Population count)

__m64 _m64_pavgsub1(__m64 a, __m64 b)

pavgsub1 (Parallel average subtract)

__m64 _m64_pavgsub2(__m64 a, __m64 b)

pavgsub2 (Parallel average subtract)

__m64 _m64_pmpy2r(__m64 a, __m64 b)

pmpy2.r (Parallel multiply)

__m64 _m64_pmpy2l(__m64 a, __m64 b)

pmpy2.l (Parallel multiply)

__m64 _m64_pmpyshr2(__m64 a, __m64 b, const int count)

pmpyshr2 (Parallel multiply and shift right)

__m64 _m64_pmpyshr2u(__m64 a, __m64 b, const int count)

pmpyshr2.u (Parallel multiply and shift right)

__m64 _m64_pshladd2(__m64 a, const int count, __m64 b)

pshladd2 (Parallel shift left and add)

__m64 _m64_pshradd2(__m64 a, const int count, __m64 b)

pshradd2 (Parallel shift right and add)

__int64 _i64_shladd(__int64 a, const int count, __int64 b)

shladd (Shift left and add)

__int64 _i64_shrp(__int64 a, __int64 b, const int count)

shrp (Shift right pair)

__m64 _m64_padd1uus(__m64 a, __m64 b)

padd1.uus (Parallel add)

__m64 _m64_padd2uus(__m64 a, __m64 b)

padd2.uus (Parallel add)

__m64 _m64_psub1uus(__m64 a, __m64 b)

psub1.uus (Parallel subtract)

__m64 _m64_psub2uus(__m64 a, __m64 b)

psub2.uus (Parallel subtract)

__m64 _m64_pavg1_nraz(__m64 a, __m64 b)

pavg1 (Parallel average)

__m64 _m64_pavg2_nraz(__m64 a, __m64 b)

pavg2 (Parallel average)

 

Other Native Intrinsics Description

void __lfetch(int lfhint, __int64)

Line prefetch, non fault form. Maps to the lfetch.lfhint [r] instruction.

void __lfetch_fault(int lfhint, __int64)

Line prefetch, fault form. Maps to the lfetch.fault.lfhint [r] instruction.

void _fclrf(void)

Clears the floating point status flags (the 6-bit flags of FPSR.sf0). Maps to the fclrf.sf0 instruction.

void _fsetc(int amask, int omask)

Sets the control bits of FPSR.sf0. Maps to the fsetc.sf0 r, r instruction. There is no corresponding instruction to read the control bits. Use _mm_getfpsr().

void _mm_setfpsr(unsigned __int64 i)

Set the bits of the FPSR that cannot be set using the macros described in the Macro Functions to Read and Write the Control Registers topic.

unsigned __int64 _mm_getfpsr(void)

Get the bits of the FPSR that cannot be accessed using the macros described in the Macro Functions to Read and Write the Control Registers topic.

__int64 _m_to_int64(__m64 a)

Convert a of type __m64 to type __int64. Translates to nop since both types reside in the same register on Itanium-based systems.

__m64 _m_from_int64(__int64 a)

Convert a of type __int64 to type __m64. Translates to nop since both types reside in the same register on Itanium-based systems.

 

__m64 _m64_czx1l(__m64 a)

The 64-bit value a is scanned for a zero element from the most significant element to the least significant element, and the index of the first zero element is returned. The element width is 8 bits, so the range of the result is from 0 - 7. If no zero element is found, the default result is 8.

 

__m64 _m64_czx1r(__m64 a)

The 64-bit value a is scanned for a zero element from the least significant element to the most significant element, and the index of the first zero element is returned. The element width is 8 bits, so the range of the result is from 0 - 7. If no zero element is found, the default result is 8.

 

__m64 _m64_czx2l(__m64 a)

The 64-bit value a is scanned for a zero element from the most significant element to the least significant element, and the index of the first zero element is returned. The element width is 16 bits, so the range of the result is from 0 - 3. If no zero element is found, the default result is 4.

 

__m64 _m64_czx2r(__m64 a)

The 64-bit value a is scanned for a zero element from the least significant element to the most significant element, and the index of the first zero element is returned. The element width is 16 bits, so the range of the result is from 0 - 3. If no zero element is found, the default result is 4.

 

__int64 _i64_dep_mr(__int64 r, __int64 s, const int pos, const int len)

The right-justified 64-bit value r is deposited into the value in s at an arbitrary bit position and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len.

 

__int64 _i64_dep_mi(const int r, __int64 s, const int pos, const int len)

The sign-extended value r (either all 1s or all 0s) is deposited into the value in s at an arbitrary bit position and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len.

 

__int64 _i64_dep_zr(__int64 r, const int pos, const int len)

The right-justified 64-bit value r is deposited into a 64-bit field of all zeros at an arbitrary bit position and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len.

 

__int64 _i64_dep_zi(const int v, const int pos, const int len)

The sign-extended value v (either all 1s or all 0s) is deposited into a 64-bit field of all zeros at an arbitrary bit position and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len.

 

__int64 _i64_extr(__int64 r, const int pos, const int len)

A field is extracted from the 64-bit value r and is returned right-justified and sign extended. The extracted field begins at position pos and extends len bits to the left. The sign is taken from the most significant bit of the extracted field.

 

__int64 _i64_extru(__int64 r, const int pos, const int len)

A field is extracted from the 64-bit value r and is returned right-justified and zero extended. The extracted field begins at position pos and extends len bits to the left.

 

__int64 _i64_muladd64lo( __int64 a, __int64 b, __int64 c)

The 64-bit values a and b are treated as signed integers and multiplied to produce a full 128-bit signed result. The 64-bit value c is zero-extended and added to the product. The least significant 64 bits of the sum are then returned.

 

__int64 _i64_muladd64lo_u( __int64 a, __int64 b, __int64 c)

The 64-bit values a and b are treated as unsigned integers and multiplied to produce a full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the product. The least significant 64 bits of the sum are then returned.

 

__int64 _i64_muladd64hi( __int64 a, __int64 b, __int64 c)

The 64-bit values a and b are treated as signed integers and multiplied to produce a full 128-bit signed result. The 64-bit value c is zero-extended and added to the product. The most significant 64 bits of the sum are then returned.

 

__int64 _i64_muladd64hi_u( __int64 a, __int64 b, __int64 c)

The 64-bit values a and b are treated as unsigned integers and multiplied to produce a full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the product. The most significant 64 bits of the sum are then returned.

 

__m64 _m64_mix1l(__m64 a, __m64 b)

Interleave 64-bit quantities a and b in 1-byte groups, starting from the left, as shown in Figure 1, and return the result.

 

__m64 _m64_mix1r(__m64 a, __m64 b)

Interleave 64-bit quantities a and b in 1-byte groups, starting from the right, as shown in Figure 2, and return the result.

 

__m64 _m64_mix2l(__m64 a, __m64 b)

Interleave 64-bit quantities a and b in 2-byte groups, starting from the left, as shown in Figure 3, and return the result.

 

__m64 _m64_mix2r(__m64 a, __m64 b)

Interleave 64-bit quantities a and b in 2-byte groups, starting from the right, as shown in Figure 4, and return the result.

 

__m64 _m64_mix4l(__m64 a, __m64 b)

Interleave 64-bit quantities a and b in 4-byte groups, starting from the left, as shown in Figure 5, and return the result.

 

__m64 _m64_mix4r(__m64 a, __m64 b)

Interleave 64-bit quantities a and b in 4-byte groups, starting from the right, as shown in Figure 6, and return the result.

 

__m64 _m64_mux1(__m64 a, const int n)

Based on the value of n, a permutation is performed on a as shown in Figure 7, and the result is returned. Table 1 shows the possible values of n.

Table 1. Values of n for _m64_mux1
Operation n

@brcst

0

@mix

8

@shuf

9

@alt

0xA

@rev

0xB

 

__m64 _m64_mux2(__m64 a, const int n)

Based on the value of n, a permutation is performed on a as shown in Figure 8, and the result is returned.

 

__int64 _i64_popcnt(__int64 a)

The number of bits in the 64-bit integer a that have the value 1 are counted, and the resulting sum is returned.

 

__m64 _m64_pavgsub1(__m64 a, __m64 b)

The unsigned data elements (bytes) of b are subtracted from the unsigned data elements (bytes) of a and the results of the subtraction are then each independently shifted to the right by one position. The high-order bits of each element are filled with the borrow bits of the subtraction.

 

__m64 _m64_pavgsub2(__m64 a, __m64 b)

The unsigned data elements (double bytes) of b are subtracted from the unsigned data elements (double bytes) of a and the results of the subtraction are then each independently shifted to the right by one position. The high-order bits of each element are filled with the borrow bits of the subtraction.

 

__m64 _m64_pmpy2l(__m64 a, __m64 b)

Two signed 16-bit data elements of a, starting with the most significant data element, are multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit results are returned as shown in Figure 9.

__m64 _m64_pmpy2r(__m64 a, __m64 b)

Two signed 16-bit data elements of a, starting with the least significant data element, are multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit results are returned as shown in Figure 10.

 

__m64 _m64_pmpyshr2(__m64 a, __m64 b, const int count)

The four signed 16-bit data elements of a are multiplied by the corresponding signed 16-bit data elements of b, yielding four 32-bit products. Each product is then shifted to the right count bits and the least significant 16 bits of each shifted product form 4 16-bit results, which are returned as one 64-bit word.

 

__m64 _m64_pmpyshr2u(__m64 a, __m64 b, const int count)

The four unsigned 16-bit data elements of a are multiplied by the corresponding unsigned 16-bit data elements of b, yielding four 32-bit products. Each product is then shifted to the right count bits and the least significant 16 bits of each shifted product form 4 16-bit results, which are returned as one 64-bit word.

 

__m64 _m64_pshladd2(__m64 a, const int count, __m64 b)

a is shifted to the left by count bits and then is added to b. The upper 32 bits of the result are forced to 0, and then bits [31:30] of b are copied to bits [62:61] of the result. The result is returned.

 

__m64 _m64_pshradd2(__m64 a, const int count, __m64 b)

The four signed 16-bit data elements of a are each independently shifted to the right by count bits (the high order bits of each element are filled with the initial value of the sign bits of the data elements in a); they are then added to the four signed 16-bit data elements of b. The result is returned.

 

__int64 _i64_shladd(__int64 a, const int count, __int64 b)

a is shifted to the left by count bits and then added to b. The result is returned.

 

__int64 _i64_shrp(__int64 a, __int64 b, const int count)

a and b are concatenated to form a 128-bit value and shifted to the right count bits. The least significant 64 bits of the result are returned.

 

__m64 _m64_padd1uus(__m64 a, __m64 b)

a is added to b as eight separate byte-wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word.

 

__m64 _m64_padd2uus(__m64 a, __m64 b)

a is added to b as four separate 16-bit wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word.

 

__m64 _m64_psub1uus(__m64 a, __m64 b)

b is subtracted from a as eight separate byte-wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word.

 

__m64 _m64_psub2uus(__m64 a, __m64 b)

b is subtracted from a as four separate 16-bit wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word.

 

__m64 _m64_pavg1_nraz(__m64 a, __m64 b)

The unsigned byte-wide data elements of a are added to the unsigned byte-wide data elements of b and the results of each add are then independently shifted to the right by one position. The high-order bits of each element are filled with the carry bits of the sums.

 

__m64 _m64_pavg2_nraz(__m64 a, __m64 b)

The unsigned 16-bit wide data elements of a are added to the unsigned 16-bit wide data elements of b and the results of each add are then independently shifted to the right by one position. The high-order bits of each element are filled with the carry bits of the sums.