Parallelism: an Overview

This section discusses the three major features of parallel programming supported by the Intel® Fortran compiler: OpenMP*, Auto-parallelization, and Auto-vectorization. Each feature contributes to application performance depending on the number of processors, the target architecture (IA-32 or Itanium® architecture), and the nature of the application. The three features can be combined in any way to improve application performance.

Parallel programming can be explicit, that is, defined by the programmer using OpenMP directives, or implicit, that is, detected automatically by the compiler. Implicit parallelism comprises Auto-parallelization of outermost loops and Auto-vectorization of innermost loops.

Parallelism defined with OpenMP and Auto-parallelization directives is based on thread-level parallelism (TLP). Parallelism defined with Auto-vectorization techniques is based on instruction-level parallelism (ILP).

The Intel Fortran compiler supports OpenMP and Auto-parallelization on both IA-32 and Itanium architectures for multiprocessor systems, as well as on single IA-32 processors with Hyper-Threading Technology (for Hyper-Threading Technology, refer to the IA-32 Intel® Architecture Optimization Reference Manual). Auto-vectorization is supported on the Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processor families. To enhance compilation with Auto-vectorization, users can also add vectorizer directives to their programs. A closely related technique available on Itanium-based systems is software pipelining (SWP).

The table below summarizes the different ways in which parallelism can be exploited with the Intel Fortran compiler.

Parallelism

Explicit: parallelism programmed by the user

  OpenMP* (TLP)
    IA-32 and Itanium architectures.
    Supported on IA-32 or Itanium-based multiprocessor systems and on
    IA-32 Hyper-Threading Technology-enabled systems.

Implicit: parallelism generated by the compiler and by user-supplied hints

  Auto-parallelization (TLP) of outermost loops
    IA-32 and Itanium architectures.
    Supported on IA-32 or Itanium-based multiprocessor systems and on
    IA-32 Hyper-Threading Technology-enabled systems.

  Auto-vectorization (ILP) of innermost loops
    IA-32 only; software pipelining is the related technique for the
    Itanium architecture.
    Supported on Pentium®, Pentium with MMX™ technology, Pentium II,
    Pentium III, and Pentium 4 processors.

Parallel Program Development

The Intel Fortran Compiler supports the OpenMP Fortran version 2.0 API specification available from the www.openmp.org web site. The OpenMP directives relieve the user of the low-level details of iteration-space partitioning, data sharing, and thread scheduling and synchronization.

The Auto-parallelization feature of the Intel Fortran Compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines which loops are good worksharing candidates, performs dataflow analysis to verify correct parallel execution, and partitions the data for threaded code generation, much as is needed when programming with OpenMP directives. Applications built with OpenMP or Auto-parallelization gain performance from shared memory on multiprocessor systems and on IA-32 processors with Hyper-Threading Technology.

Auto-vectorization detects low-level operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8 or up to 16 elements in one operation, depending on the data type. In some cases auto-parallelization and vectorization can be combined for better performance results. For example, in the code below, TLP can be exploited in the outermost loop, while ILP can be exploited in the innermost loop.

DO I = 1, 100         ! execute groups of iterations in different
                      ! threads (TLP)
  DO J = 1, 32        ! execute in SIMD style with multimedia
                      ! extension (ILP)
    A(J,I) = A(J,I) + 1
  ENDDO
ENDDO

Auto-vectorization can help improve the performance of an application that runs on systems based on Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors.

The following table lists the options that enable Auto-vectorization, Auto-parallelization, and OpenMP support.

Auto-vectorization, IA-32 only

-x{K|W|N|B|P}

Generates specialized code to run exclusively on processors with the extensions specified by {K|W|N|B|P}.

-ax{K|W|N|B|P}

Generates, in a single binary, code specialized to the extensions specified by {K|W|N|B|P} and also generic IA-32 code. The generic code is usually slower.

-vec_report{0|1|2|3|4|5}

Controls the diagnostic messages from the vectorizer; see the subsection that follows the table.

Auto-parallelization, IA-32 and Itanium architectures

-parallel

Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. Default: OFF.

-par_threshold{n}

Sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel, n=0 to 100. n=0 implies "always." Default: n=75.

-par_report{0|1|2|3}

Controls the auto-parallelizer's diagnostic levels.
Default: -par_report1.

OpenMP, IA-32 and Itanium architectures

-openmp

Enables the parallelizer to generate multithreaded code based on the OpenMP directives. Default: OFF.

-openmp_report{0|1|2}

Controls the OpenMP parallelizer's diagnostic levels. Default: -openmp_report1.

-openmp_stubs

Enables compilation of OpenMP programs in sequential mode. The OpenMP directives are ignored and a stub OpenMP library is linked. Default: OFF.

Note

When both -openmp and -parallel are specified on the command line, the -parallel option is honored only in routines that do not contain OpenMP directives. For routines that contain OpenMP directives, only the -openmp option is honored.
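As an illustration only (the source file names below are hypothetical), typical command lines combining the options from the table might look like:

```
ifort -xW -vec_report3 vec_demo.f90          ! vectorize for a specific extension, with diagnostics
ifort -parallel -par_report2 par_demo.f90    ! auto-parallelize safe loops, report results
ifort -openmp -openmp_report1 omp_demo.f90   ! honor OpenMP directives in the source
```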

With the right choice of options and a relatively small effort of adding OpenMP directives to their code, programmers can transform a sequential program into a parallel program. The following are examples of OpenMP directives within the code:

!$OMP PARALLEL PRIVATE(NUM), SHARED(X,A,B,C)
            ! Defines a parallel region
!$OMP PARALLEL DO ! Specifies a parallel region that
            ! implicitly contains a single DO directive
DO I = 1, 1000
  NUM = FOO(B(I), C(I))
  X(I) = BAR(A(I), NUM)
            ! Assume FOO and BAR have no side effects
ENDDO

See examples of the Auto-parallelization and Auto-vectorization directives in the respective sections.