Coding Guidelines for Intel® Architectures

This section provides general guidelines for coding practices and techniques that ensure you get the most benefit from the architecture features of IA-32 and Itanium® processors. It describes the practices, tools, coding rules, and recommendations that can improve performance on the IA-32 and Itanium processor families. For details about optimization for IA-32 processors, see the Intel® Architecture Optimization Reference Manual. For details about optimization for the Itanium processor family, see the Intel Itanium 2 Processor Reference Manual for Software Development and Optimization.

Note

If a guideline refers to a particular architecture only, that architecture is explicitly named; otherwise, the guideline applies to both IA-32 and Itanium architectures.

Performance of compiler-generated code may vary from one compiler to another. The Intel® Fortran Compiler generates code that is highly optimized for Intel architectures. You can significantly improve performance by using various compiler optimization options. In addition, you can help the compiler optimize your Fortran program by following the guidelines described in this section.

When coding in Fortran, the most important factors to consider in achieving optimum processor performance are memory access patterns, memory layout, floating-point arithmetic, auto-vectorization, and multithreading.

The following sections summarize and describe the coding practices, rules, and recommendations associated with these factors that contribute to optimizing performance on Intel architecture-based processors.

Memory Access

The Intel compiler lays out Fortran arrays in column-major order. For example, in a two-dimensional array, elements A(22, 34) and A(23, 34) are contiguous in memory. For best performance, code arrays so that inner loops access them in a contiguous manner. Consider the following examples.

The code in Example 1 will likely have higher performance than the code in Example 2.

Example 1

DO J = 1, N
  DO I = 1, N
    B(I,J) = A(I,J) + 1
  END DO
END DO

The code above accesses arrays A and B contiguously in the inner loop on I, which results in good performance.

Example 2

DO I = 1, N
  DO J = 1, N
    B(I,J) = A(I,J) + 1
  END DO
END DO

The code above accesses arrays A and B non-contiguously in the inner loop on J, which results in poor performance.

The compiler itself can transform code so that inner loops access memory contiguously. To enable such transformations, use advanced optimization options: -O3 for both IA-32 and Itanium architectures, or -O3 together with -ax{W|N|B|P} for IA-32 only.

Memory Layout

Alignment is a very important factor in ensuring good performance; aligned memory accesses are faster than unaligned accesses. If you use interprocedural optimization on multiple files (the -ipo option), the compiler analyzes the code and decides whether it is beneficial to pad arrays so that they start from an aligned boundary. Multiple arrays specified in a single common block can impose extra constraints on the compiler. For example, consider the following COMMON statement:

COMMON /AREA1/ A(200), X, B(200)

If the compiler added padding to align A(1) on a 16-byte boundary, element B(1) would not fall on a 16-byte boundary. It is therefore better to split AREA1 as follows:

COMMON /AREA1/ A(200)
COMMON /AREA2/ X
COMMON /AREA3/ B(200)

The above code provides the compiler with maximum flexibility in determining the padding required for both A and B.
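For illustration, the following sketch (the REAL declarations and the UPDATE routine are assumptions added here, not part of the example above) shows that splitting the common block changes only the COMMON statements; code that references A, X, and B is unaffected, while the compiler is free to pad each block independently.

SUBROUTINE UPDATE(N)
  INTEGER N, I
  REAL A(200), X, B(200)
  ! Each array is now in its own common block, so the compiler can pad
  ! each block so that A(1) and B(1) start on aligned boundaries.
  COMMON /AREA1/ A
  COMMON /AREA2/ X
  COMMON /AREA3/ B
  DO I = 1, MIN(N, 200)
    B(I) = A(I) * X
  END DO
END SUBROUTINE UPDATE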

Optimizing for Floating-point Applications

To improve floating-point performance, generally follow these rules:

On IA-32, another way to avoid these problems is to use the -x{K|W|N|B|P} options so that the computation is performed with SSE instructions.

Denormal Exceptions

Floating-point computations that underflow can produce denormal values, which have an adverse impact on performance.

For IA-32: take advantage of the SIMD capabilities of Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2) instructions. The -x{K|W|N|B|P} options enable the flush-to-zero (FTZ) mode in SSE and SSE2 instructions, whereby underflow results are automatically converted to zero, which improves application performance. In addition, the -xP option also enables the denormals-as-zero (DAZ) mode, whereby denormals are converted to zero on input, further improving performance. An application developer willing to trade pure IEEE-754 compliance for speed would benefit from these options. For more information on FTZ and DAZ, see Setting FTZ and DAZ Flags and "Floating-point Exceptions" in the Intel® Architecture Optimization Reference Manual.

For the Itanium architecture: enable flush-to-zero (FTZ) mode with the -ftz option, which is set by the -O3 option.
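As an illustration, the following sketch (the program and its constants are assumptions, not part of the original examples) shows a single-precision computation that underflows into the denormal range. On IA-32, compiling it with one of the -x{K|W|N|B|P} options enables FTZ (and -xP also enables DAZ) so that such values flush to zero; on Itanium, -O3 sets -ftz for a similar effect.

PROGRAM UNDERFLOW_DEMO
  REAL S(8)
  INTEGER I
  ! Repeated scaling drives single-precision values below the smallest
  ! normal number (about 1.2E-38), producing denormal intermediate results.
  S(1) = 1.0E-30
  DO I = 2, 8
    S(I) = S(I-1) * 1.0E-2
  END DO
  ! With FTZ enabled, the underflowed results are flushed to zero.
  PRINT *, S(8)
END PROGRAM UNDERFLOW_DEMO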

Auto-vectorization

Many applications can significantly increase their performance through vectorization, which uses Streaming SIMD Extensions 2 (SSE2) instructions for the main computational loops. The Intel compiler can perform vectorization automatically (auto-vectorization), or you can direct it with compiler directives.

See the Auto-vectorization (IA-32 Only) section for complete details.
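The following sketch (the subroutine and the use of the IVDEP directive are illustrative assumptions; see the Auto-vectorization section for the directives the compiler accepts) shows the kind of unit-stride computational loop that auto-vectorization targets when you compile with the -x or -ax options on IA-32.

SUBROUTINE SAXPY_LIKE(N, ALPHA, X, Y)
  INTEGER N, I
  REAL ALPHA, X(N), Y(N)
  ! A unit-stride loop with no loop-carried dependence is a good
  ! candidate for SSE/SSE2 vectorization. The directive below asserts
  ! that there is no dependence the compiler must assume.
  !DEC$ IVDEP
  DO I = 1, N
    Y(I) = Y(I) + ALPHA * X(I)
  END DO
END SUBROUTINE SAXPY_LIKE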

Creating Multithreaded Applications

The Intel Fortran Compiler and the Intel® Threading Toolset provide capabilities that make developing multithreaded applications easy. See Parallel Programming with Intel Fortran. Multithreaded applications can show significant benefit on Intel symmetric multiprocessing (SMP) systems or on Intel processors with Hyper-Threading Technology.
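As a minimal sketch (the subroutine and its names are assumptions; see Parallel Programming with Intel Fortran for the supported approaches), an OpenMP directive such as the one below parallelizes a loop across threads when the program is compiled with the compiler's OpenMP option (-openmp on Linux).

SUBROUTINE SCALE_ARRAY(N, A, FACTOR)
  INTEGER N, I
  REAL A(N), FACTOR
  ! Distribute the loop iterations across the available threads;
  ! on SMP or Hyper-Threading systems the iterations run concurrently.
  !$OMP PARALLEL DO
  DO I = 1, N
    A(I) = A(I) * FACTOR
  END DO
  !$OMP END PARALLEL DO
END SUBROUTINE SCALE_ARRAY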