IA-32 Intel® Architecture
Optimization Reference
Manual
Order Number: 248966-013US
April 2006

Summary of Contents

Page 1 - Optimization Reference

IA-32 Intel® Architecture Optimization Reference Manual, Order Number: 248966-013US, April 2006

Page 2

Contents (p. x): Hardware Prefetch … 6-19; Example of Effective …

Page 3 - Contents

In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element. …
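The branch-elimination idea behind this snippet can be sketched in C (the function and names here are illustrative, not from the manual): the even/odd assignment is computed arithmetically from the index parity, so the loop body contains no conditional branch for the predictor to miss.

```c
#include <assert.h>

/* Illustrative sketch: fill even-numbered elements with x and
 * odd-numbered elements with y without branching in the loop body. */
static void fill_alternating(int *a, int n, int x, int y)
{
    for (int i = 0; i < n; i++) {
        /* i & 1 is 0 for even indices and 1 for odd ones, so the
         * value is selected arithmetically instead of with if/else. */
        int odd = i & 1;
        a[i] = x + odd * (y - x);
    }
}
```

A compiler may generate similar branch-free code from an if/else on `i % 2`, but writing the selection arithmetically makes the intent explicit.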

Page 4

Memory Accesses: This section discusses guidelines for optimizing code and data memory accesses. The most important …

Page 5

Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand size address boundaries. …

Page 6

Alignment of code is less of an issue for the Pentium 4 processor. Alignment of branch targets to maximize bandwidth …

Page 7

Store Forwarding: The processor's memory system only sends stores to memory (including cache) after store retirement. …

Page 8

If a variable is known not to change between when it is stored and when it is used again, the register that was stored …

Page 9

The size and alignment restrictions for store forwarding are illustrated in Figure 2-2. Coding rules to help …
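The size rule can be sketched in portable C (an illustrative example, not the manual's): a load forwards cleanly when its data is fully contained in a single preceding store of at least the same size, so assembling a value in a register and storing it once at full width is preferable to several narrow stores followed by one wide load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store-forwarding-friendly pattern: combine four 16-bit values in a
 * register, perform one 64-bit store, then reload at the same size and
 * address. Four separate 16-bit stores followed by one 64-bit load
 * would compute the same result but could not be forwarded. */
static uint64_t pack_then_reload(uint16_t w0, uint16_t w1,
                                 uint16_t w2, uint16_t w3)
{
    uint64_t v = (uint64_t)w0 | ((uint64_t)w1 << 16) |
                 ((uint64_t)w2 << 32) | ((uint64_t)w3 << 48);
    uint64_t mem;
    memcpy(&mem, &v, sizeof mem);   /* one full-width store */
    return mem;                     /* one same-size load from the same address */
}
```

The two variants are functionally identical; only the first satisfies the forwarding restrictions in hardware.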

Page 10

A load that forwards from a store must wait for the store's data to be written to the store buffer before proceeding. …

Page 11

Example 2-14 illustrates a stalled store-forwarding situation that may appear in compiler-generated code. …

Page 12

When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are …

Page 13

Contents (p. xi): Key Practices of System Bus Optimization … 7-17; Key Practices of Memory Optimization …

Page 14 - Appendix DStack Alignment

Store-forwarding Restriction on Data Availability: The value to be stored must be available before the load operation …

Page 15 - Examples

An example of a loop-carried dependence chain is shown in Example 2-17. Data Layout Optimizations: User/Source Coding …

Page 16

Cache line size for Pentium 4 and Pentium M processors can impact streaming applications (for example, multimedia). …

Page 17

However, if the access pattern of the array exhibits locality, such as if the array index is being swept through, …

Page 18

…non-sequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can recognize …

Page 19

If for some reason it is not possible to align the stack for 64 bits, the routine should access the parameter and …

Page 20

Capacity Limits in Set-Associative Caches: Capacity limits may occur if the number of outstanding memory references …

Page 21

Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors: Aliasing conditions that are specific to the Pentium …

Page 22

Aliasing Cases in the Pentium M Processor: Pentium M, Intel Core Solo and Intel Core Duo processors have the following …

Page 23 - Introduction

Mixing Code and Data: The Pentium 4 processor's aggressive prefetching and pre-decoding of instructions has two related …

Page 24 - About This Manual

Contents (p. xii): Sign Extension to Full 64-Bits … 8-3; Alternate Coding Rules …

Page 25

…and cross-modifying code (when more than one processor in a multi-processor system is writing to a code page) …

Page 26

…write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining …

Page 27 - Related Documentation

…be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see the …

Page 28 - Notational Conventions

Locality enhancement to the last level cache can be accomplished by sequencing the data access pattern to take advantage …

Page 29 - Processor Family Overview

Minimizing Bus Latency: The system bus on Intel Xeon and Pentium 4 processors provides up to 6.4 GB/sec bandwidth …

Page 30 - SIMD Technology

User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software …

Page 31 - OP OP OP OP

Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions; Example 2-22 Non-temporal Stores and Partial …

Page 32 - • inherently parallel

Prefetching: The Pentium 4 processor has three prefetching mechanisms: • hardware instruction prefetcher • software prefetch for data …

Page 33 - Summary of SIMD Technologies

…access patterns to suit the hardware prefetcher is highly recommended, and should be a higher-priority consideration …

Page 34 - Streaming SIMD Extensions 3

• new cache line flush instruction • new memory fencing instructions. For a detailed description of using cacheability …

Page 35

Contents (p. xiii): Time-based Sampling … A-9; Event-based Sampling …

Page 36 - Microarchitecture

Guidelines for Optimizing Floating-point Code: User/Source Coding Rule 10. (M impact, M generality) Enable the …

Page 37

…to early out). However, be careful of introducing more than a total of two values for the floating point control word …

Page 38 - /HVVIUHTXHQWO\XVHGSDWKV

…desired numeric precision, the size of the look-up table, and taking advantage of the parallelism of the Streaming SIMD Extensions …

Page 39 - The Front End

…executing SSE/SSE2/SSE3 instructions and when speed is more important than complying with the IEEE standard. The following …

Page 40 - Retirement

Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification …

Page 41 - Front End Pipeline Detail

…FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo and Intel Core Duo processors, …

Page 42 - Execution Trace Cache

Assembly/Compiler Coding Rule 31. (H impact, M generality) Minimize changes to bits 8-12 of the floating point control word …

Page 43 - Branch Prediction

If there is more than one change to rounding, precision and infinity bits and the rounding mode is not important, …

Page 44 - Execution Core Detail

Example 2-23 Algorithm to Avoid Changing the Rounding Mode
_fto132proc
    lea ecx, [esp-8]
    sub esp, 16    ; allocate frame
    …

Page 45

Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do …

Page 46

Contents (p. xiv): Using Performance Metrics with Hyper-Threading Technology … B-50; Using Performance Events of Intel Core Solo …

Page 47

Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the number of changes to the precision mode …

Page 48

This in turn allows instructions to be reordered to make instructions available to be executed in parallel. Out-of-order …

Page 49 - Data Prefetch

• Scalar floating-point registers may be accessed directly, avoiding fxch and top-of-stack restrictions. On …

Page 50

Recommendation: Use the compiler switch to generate SSE2 scalar floating-point code over x87 code. When working with …

Page 51

Floating-Point Stalls: Floating-point instructions have a latency of at least two cycles. But, because of the …

Page 52 - • buffering of writes

Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or …

Page 53

Complex Instructions: Assembly/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions …

Page 54 - Pentium

Use of the inc and dec Instructions: The inc and dec instructions modify only a subset of the bits in the flag register …

Page 55 - • instruction cache

…CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly with a partial flag register stall …

Page 56

…(model 9) does incur a penalty. This is because every operation on a partial register updates the whole register. …

Page 57 - Data Prefetching

Examples (p. xv): Example 2-1 Assembly Code with an Unpredictable Branch … 2-17; Example 2-2 Code Optimization to Eliminate Branches …

Page 58 - Out-of-Order Core

Table 2-3 illustrates using movzx to avoid a partial register stall when packing three byte values into a register …

Page 59 - Core™ Duo Processors

…less delay than the partial register update problem mentioned above, but the performance gain may vary. If the …

Page 60 - • Micro-op fusion

Prefixes and Instruction Decoding: An IA-32 instruction can be up to 15 bytes in length. Prefixes can change the …

Page 61

• Processing an instruction with the 0x66 prefix that (i) has a modr/m byte in its encoding and (ii) the opcode byte …

Page 62

String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities …

Page 63

• Cache eviction: If the amount of data to be processed by a memory routine approaches half the size of the last level …

Page 64 - • operational fairness

…improve address alignment, a small piece of prolog code using movsb/stosb with a count less than 4 can be used …

Page 65 - Shared Resources

Memory routines in the runtime library generated by Intel Compilers are optimized across a wide range of address alignments …

Page 66 - Front End Pipeline

In some situations, the byte count of the data to operate on is known by the context (versus from a parameter passed …

Page 67 - Multi-Core Processors

Clearing Registers: The Pentium 4 processor provides special support for xor, sub, or pxor operations when executed with …

Page 68

Contents (p. xvi): Example 3-4 Identification of SSE2 with cpuid … 3-5; Example 3-5 Identification of SSE2 by the OS …

Page 69

Using a test instruction between the instruction that may modify part of the flag register and the instruction that …

Page 70 - Load and Store Operations

Use movapd as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the μops …

Page 71

Prolog Sequences: Assembly/Compiler Coding Rule 57. (M impact, MH generality) In routines that do not need a frame …

Page 72

Using memory as a destination operand may further reduce register pressure at the slight risk of making trace cache …

Page 73 - General Optimization

Spill Scheduling: The spill scheduling algorithm used by a code generator will be impacted by the Pentium 4 processor's …

Page 74

Because micro-ops are delivered from the trace cache in the common cases, decoding rules are not required. Scheduling …

Page 75

…data elements in parallel. The number of elements that can be operated on in parallel ranges from four single-precision …

Page 76

User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider …

Page 77 - Optimize Memory Access

The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware …

Page 78

User/Source Coding Rules: User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more …

Page 79 - Enable Vectorization

Examples (p. xvii): Example 4-20 Clipping to an Arbitrary Signed Range [high, low] … 4-27; Example 4-21 Simplified Clipping to an Arbitrary Signed Range …

Page 80

User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software …

Page 81 - Performance Tools

…look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance …

Page 82 - VTune™ Performance Analyzer

…order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. …

Page 83 - Processor Perspectives

Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in 16-byte chunks. …

Page 84

Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the …

Page 85

…first-level cache working set. Avoid having more than 8 cache lines that are some multiple of 64 KB apart in the …

Page 86

Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode …

Page 87

Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and dec instructions should be replaced with an add or sub instruction …

Page 88 - A and B. If the condition is

…instead of a cmp of the register to zero; this saves the need to encode the zero and saves encoding space. …

Page 89

Assembly/Compiler Coding Rule 56. (M impact, ML generality) For arithmetic or logical operations that have their …

Page 90 - Spin-Wait and Idle Loops

Examples (p. xviii): Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation … 6-50; Example 7-1 Serial Execution of Producer and Consumer Work Items …

Page 91 - Static Prediction

Tuning Suggestions: Tuning Suggestion 1. Rarely, a performance problem may be noted due to executing data on …

Page 92

Coding for SIMD Architectures: Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming …

Page 93

Checking for Processor Support of SIMD Technologies: This section shows how to check whether a processor supports …

Page 94 - Inlining, Calls and Returns

For more information on cpuid see Intel® Processor Identification with CPUID Instruction, order number 241618. Checking …

Page 95 - Branch Type Selection

To find out whether the operating system supports SSE, execute an SSE instruction and trap for an exception if …

Page 96

Checking for Streaming SIMD Extensions 2 Support: Checking for support of SSE2 is like checking for SSE support. You must …

Page 97

Checking for Streaming SIMD Extensions 3 Support: SSE3 includes 13 instructions, 11 of which are suited for SIMD …
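As a sketch of such a feature check with a modern compiler helper (the manual's own examples use raw cpuid inline assembly instead), GCC and Clang expose `__get_cpuid` in `cpuid.h`; CPUID leaf 1 reports SSE2 in EDX bit 26 and SSE3 in ECX bit 0:

```c
#include <assert.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Sketch of a CPUID-based feature check on x86; on other
 * architectures these helpers simply report "not supported". */
static int has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 26) & 1;      /* SSE2 feature flag */
#else
    return 0;
#endif
}

static int has_sse3(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx & 1;              /* SSE3 feature flag */
#else
    return 0;
#endif
}
```

As the text notes, a positive CPUID result is not sufficient on its own; the operating system must also save and restore the SSE state on context switches.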

Page 98 - Loop Unrolling

Example 3-6 Identification of SSE3 with cpuid. SSE3 requires the same support from the operating system as SSE. To find …

Page 99

Example 3-7 Identification of SSE3 by the OS. Considerations for Code Conversion to SIMD Programming: The VTune Performance …

Page 100 - • inlining where appropriate

Figure 3-1 Converting to Streaming SIMD Extensions Chart (flowchart: identify hot spots in code; determine whether the code benefits from SIMD; …)

Page 101 - Memory Accesses

Figures (p. xix): Figure 1-1 Typical SIMD Operations … 1-3; Figure 1-2 SIMD Instruction Register …

Page 102

To use any of the SIMD technologies optimally, you must evaluate the following situations in your code: • fragments …

Page 103 - Line 029e7140h

…specific optimizations. Where appropriate, the coach displays pseudo-code to suggest the use of highly optimized intrinsics …

Page 104 - Store Forwarding

…costly application processing time. However, these routines have potential for increased performance when you …

Page 105 - Alignment

Coding Methodologies: Software developers need to compare the performance improvement that can be obtained from assembly …

Page 106 - Figure 2-2

The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the …

Page 107

Assembly: Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) …

Page 108 - Example 2-14

…SIMD Extensions 2 integer SIMD, and __m128d is used for double-precision floating-point SIMD. These types enable …

Page 109

The intrinsic data types, however, are not basic ANSI C data types, and therefore you must observe the following usage …

Page 110 - • parameter passing

Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The …

Page 111 - Data Layout Optimizations

The caveat to this is that only certain types of loops can be automatically vectorized, and in most cases user interaction …

Page 112

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. …

Page 113

Figures (p. xx): Figure 6-2 Memory Access Latency and Execution Without Prefetch … 6-23; Figure 6-3 Memory Access Latency and Execution With Prefetch …

Page 114 - Stack Alignment

Stack and Data Alignment: To get the most performance out of code written for SIMD technologies, data should be …

Page 115

By adding the padding variable pad, the structure is now 8 bytes, and if the first element is aligned to 8 bytes (64 bits), …
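The padding idea can be sketched as follows (the struct and field names are illustrative, not the manual's): a 6-byte structure is padded to 8 bytes so that, provided the array itself starts on an 8-byte boundary, every element's first member stays 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative padded structure: 6 bytes of payload plus 2 bytes of
 * explicit padding, giving each array element a size of 8 bytes. */
struct vertex {
    short x, y, z;   /* 6 bytes of payload */
    short pad;       /* explicit padding to reach 8 bytes */
};
```

Without `pad`, an array of 6-byte elements would place every other element on an odd multiple of 2, defeating the alignment of SIMD loads.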

Page 116

Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation …

Page 117 - Processors

• Functions that use Streaming SIMD Extensions or Streaming SIMD Extensions 2 data need to provide a 16-byte aligned …

Page 118

Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries …

Page 119 - Mixing Code and Data

The __declspec(align(16)) specification can be placed before data declarations to force 16-byte alignment. This is …
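`__declspec(align(16))` is the Microsoft compiler's spelling; as a sketch, C11 offers a portable equivalent via `alignas` (GCC and Clang also accept `__attribute__((aligned(16)))`):

```c
#include <assert.h>
#include <stdint.h>
#include <stdalign.h>

/* Force 16-byte alignment of a data declaration, the portable C11
 * counterpart of __declspec(align(16)). Values are illustrative. */
static alignas(16) float coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
```

A 16-byte-aligned array such as this can be loaded with `movaps` (or `_mm_load_ps`) rather than the slower unaligned forms.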

Page 120 - Write Combining

In C++ (but not in C) it is also possible to force the alignment of a class/struct/union type, as in the code …

Page 121

Improving Memory Utilization: Memory performance can be improved by rearranging data and algorithms for SSE2, SSE, and …

Page 122 - Locality Enhancement

There are two options for computing data in AoS format: perform the operation on the data as it stands in AoS format, or …

Page 123

Performing SIMD operations on the original AoS format can require more calculations, and some of the operations do not …
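The layout contrast can be sketched in C (types and values here are illustrative): in the SoA form each component lives in its own contiguous stream, which is exactly the shape packed SIMD loads and a vectorizing compiler want.

```c
#include <assert.h>

enum { N = 4 };

struct aos { float x, y, z; };           /* array of structures */
struct soa { float x[N], y[N], z[N]; };  /* structure of arrays  */

/* With SoA, each loop walks one contiguous stream, so the compiler
 * can turn these loops into packed SIMD operations. */
static void scale_soa(struct soa *s, float k)
{
    for (int i = 0; i < N; i++) s->x[i] *= k;
    for (int i = 0; i < N; i++) s->y[i] *= k;
    for (int i = 0; i < N; i++) s->z[i] *= k;
}
```

The same scaling over `struct aos v[N]` would load interleaved x/y/z triples, requiring the swizzling the text describes before packed arithmetic can be applied.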

Page 124 - Minimizing Bus Latency

Tables (p. xxi): Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters … 1-20; Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo …

Page 125

…but is somewhat inefficient, as there is the overhead of extra instructions during computation. Performing the …

Page 126

Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation that …

Page 127 - • software prefetch for data

Strip Mining: Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD …

Page 128 - Cacheability Instructions

The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a transformation …

Page 129 - Applications

In Example 3-19, the computation has been strip-mined to a size strip_size. The value strip_size is chosen such …
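A minimal strip-mining sketch in C (names and the two passes are illustrative stand-ins for the transform and lighting passes in the text): the single loop over n elements becomes an outer loop over strips and an inner loop within a strip, so each strip's data can stay resident in cache between the two passes.

```c
#include <assert.h>

#define STRIP_SIZE 4   /* illustrative; chosen so a strip fits in cache */

static void process(int *v, int n)
{
    for (int i = 0; i < n; i += STRIP_SIZE) {
        int end = (i + STRIP_SIZE < n) ? i + STRIP_SIZE : n;
        /* first pass over the strip (stands in for "transform") */
        for (int j = i; j < end; j++) v[j] += 1;
        /* second pass reuses the strip while it is still cached
         * (stands in for "lighting") */
        for (int j = i; j < end; j++) v[j] *= 2;
    }
}
```

Running both passes over the whole array back-to-back would evict each strip before the second pass returned to it; strip mining keeps the working set bounded by STRIP_SIZE.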

Page 130

For the first iteration of the inner loop, each access to array B will generate a cache miss. If the size of one row …

Page 131

This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 3-3, a block_size is …

Page 132 - • denormalized operand

As one can see, all the redundant cache misses can be eliminated by applying this loop-blocking technique. If MAX is …
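A loop-blocking sketch in C for the transposed access pattern discussed above (MAX and BLOCK are illustrative values, not the manual's): the j loop is blocked so the column-wise walk over B stays within a block that fits in cache.

```c
#include <assert.h>

#define MAX   8   /* illustrative matrix dimension */
#define BLOCK 4   /* illustrative block size; chosen so a block of B fits in cache */

/* A[i][j] += B[j][i] with the j loop blocked: each pass over i reuses
 * the same BLOCK columns of B, so B's cache lines are not evicted
 * between consecutive i iterations. */
static void add_transpose_blocked(int a[MAX][MAX], int b[MAX][MAX])
{
    for (int jj = 0; jj < MAX; jj += BLOCK)
        for (int i = 0; i < MAX; i++)
            for (int j = jj; j < jj + BLOCK; j++)
                a[i][j] += b[j][i];
}
```

The unblocked version touches a different cache line of B on every inner iteration; with real matrix sizes that makes every B access a miss, which is exactly what blocking removes.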

Page 133

Note that this can be applied to both SIMD integer and SIMD floating-point code. If there are multiple consumers …

Page 134 - Floating-point Modes

Recommendation: When targeting code generation for Intel Core Solo and Intel Core Duo processors, favor instructions …

Page 135

Tables (p. xxii): Table C-5 Streaming SIMD Extension 64-bit Integer Instructions … C-14; Table C-7 IA-32 x87 Floating-point Instructions …


Page 137

Optimizing for SIMD Integer Applications: The SIMD integer instructions provide performance improvements in applications that are integer-intensive …

Page 138

For planning considerations of using the new SIMD integer instructions, refer to “Checking for Streaming SIMD Extensions 2 Support”. …

Page 139

Using SIMD Integer with x87 Floating-point: All 64-bit SIMD integer instructions use the MMX registers, which …

Page 140

Using emms clears all of the valid bits, effectively emptying the x87 floating-point stack and making it ready …

Page 141

• Don't empty when already empty: If the next instruction uses an MMX register, _mm_empty() incurs a cost …

Page 142 - Core Duo Processors

Data Alignment: Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned. …

Page 143 - Memory Operands

Signed Unpack: Signed numbers should be sign-extended when unpacking the values. This is similar to the zero-extend …

Page 144 - Floating-Point Stalls

Interleaved Pack with Saturation: The pack instructions pack two values into the destination register in a predetermined …

Page 145 - Instruction Selection

Figure 4-2 illustrates two values interleaved in the destination register, and Example 4-4 shows code that …

Page 146 - Use of the lea Instruction

Introduction (p. xxiii): The IA-32 Intel® Architecture Optimization Reference Manual describes how to optimize software to take advantage of the performance characteristics …

Page 147 - Flag Register Accesses

The pack instructions always assume that the source operands are signed numbers. The result in the destination …

Page 148 - Integer Divide

Non-Interleaved Unpack: The unpack instructions perform an interleave merge of the data elements of the destination …

Page 149

The other destination register will contain the opposite combination illustrated in Figure 4-4. Code in the …

Page 150 - Partial Register Stall

Extract Word: The pextrw instruction takes the word in the designated MMX register selected by the two least significant …

Page 151

Insert Word: The pinsrw instruction loads a word from the lower half of a 32-bit integer register or from memory …

Page 152 - • Address size prefix (0x67)

If all of the operands in a register are being replaced by a series of pinsrw instructions, it can be …

Page 153 - REP Prefix and Data Movement

Move Byte Mask to Integer: The pmovmskb instruction returns a bit mask formed from the most significant bits of …

Page 154 - • Address alignment:

Figure 4-7 pmovmskb Instruction Example; Example 4-10 pmovmskb Instruction Code (Input: source value; Output: …)
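The pmovmskb operation is also exposed through the SSE2 intrinsic `_mm_movemask_epi8`, which is a convenient way to sketch it (the values here are illustrative): the sign bit of each of the 16 bytes is collected into a 16-bit integer mask.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* pmovmskb via its intrinsic: byte 0 (0x80) and byte 2 (0xFF) have
 * their most significant bit set, so bits 0 and 2 of the mask are set. */
static int sign_mask_example(void)
{
    __m128i v = _mm_setr_epi8((char)0x80, 0x01, (char)0xFF, 0x00,
                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    return _mm_movemask_epi8(v);
}
```

A typical use is testing a packed-compare result with one scalar branch instead of sixteen.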

Page 155 - • Cache eviction:

Packed Shuffle Word for 64-bit Registers: The pshuf instruction (see Figure 4-8, Example 4-11) uses the immediate …

Page 156

Packed Shuffle Word for 128-bit Registers: The pshuflw/pshufhw instructions perform a full shuffle of any …

Page 157 - Destination

…target the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture. Tuning Your Application: …

Page 158 - • scaled index register

Unpacking/interleaving 64-bit Data in 128-bit Registers: The punpcklqdq/punpckhqdq instructions interleave the …

Page 159 - Compares

Data Movement: There are two additional instructions to enable data movement from the 64-bit SIMD integer registers …

Page 160 - Floating Point/SIMD Operands

pxor   MM0, MM0
pcmpeq MM1, MM1
psubb  MM0, MM1   [psubw MM0, MM1] (psubd MM0, MM1)
; the three instructions above generate …

Page 161

Building Blocks: This section describes instructions and algorithms which implement common code building blocks …

Page 162 - Prolog Sequences

Absolute Difference of Signed Numbers: This section computes the absolute difference of two signed numbers. The technique …

Page 163 - Instruction Scheduling

Absolute Value: Use Example 4-18 to compute |x|, where x is signed. This example assumes signed words to be …
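A scalar rendition of the branchless absolute-value idiom for signed words (an illustrative sketch, not the manual's exact example): an arithmetic right shift produces an all-ones mask for negative inputs, and `(x ^ mask) - mask` then yields |x| with no branch; the SIMD form applies the same three steps per 16-bit lane.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless |x| for a signed 16-bit word. Note the usual caveat:
 * INT16_MIN has no positive counterpart, so abs16(INT16_MIN) wraps
 * back to INT16_MIN and must be handled separately if it can occur. */
static int16_t abs16(int16_t x)
{
    int16_t mask = (int16_t)(x >> 15);      /* 0x0000 or 0xFFFF */
    return (int16_t)((x ^ mask) - mask);    /* negate when mask is all ones */
}
```

For a negative x the xor flips every bit and subtracting -1 adds 1, which is exactly two's-complement negation; for a non-negative x both operations are no-ops.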

Page 164 - Spill Scheduling

Clipping to an Arbitrary Range [high, low]: This section explains how to clip a value to a range [high, low]. …

Page 165 - Vectorization

Highly Efficient Clipping: For clipping signed words to an arbitrary range, the pmaxsw and pminsw instructions …
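The pmaxsw/pminsw pair is exposed through the SSE2 intrinsics shown below, which makes the two-instruction clip easy to sketch (bounds and data here are illustrative): take the maximum against the low bound, then the minimum against the high bound.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Clip eight signed words to [low, high] with pmaxsw then pminsw. */
static __m128i clip_epi16(__m128i x, short low, short high)
{
    x = _mm_max_epi16(x, _mm_set1_epi16(low));   /* raise values below low  */
    x = _mm_min_epi16(x, _mm_set1_epi16(high));  /* lower values above high */
    return x;
}
```

No compares or branches are needed; each lane is clipped independently in two instructions.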

Page 166 - • avoid global variables

The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last …

Page 167 - Miscellaneous

…packed-subtract instructions with unsigned saturation; thus this technique can only be used on packed-byte …

Page 168

Introduction (p. xxv): The manual consists of the following parts: Introduction. Defines the purpose and outlines the contents of this manual. Chapter 1: IA-32 …

Page 169 - User/Source Coding Rules

Unsigned Byte: The pmaxub instruction returns the maximum between the eight unsigned bytes in either of two SIMD …

Page 170

The subtraction operation presented above is an absolute difference; that is, t = abs(x-y). The byte values …

Page 171

The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on packed unsigned words. …

Page 172

Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the result …

Page 173

Memory Optimizations: You can improve memory accesses using the following techniques: • Avoiding partial memory accesses …

Page 174

Partial Memory Accesses: Consider a case with a large load after a series of small stores to the same area of memory …

Page 175

Let us now consider a case with a series of small loads after a large store to the same area of memory (beginning …
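The recommended fix for this case can be sketched in C (an illustrative example; the function name is not from the manual): instead of reading two 32-bit halves straight from the just-stored 64-bit location, which blocks store forwarding, reload the full 64 bits once at the same size and extract the halves with register operations.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Small-loads-after-large-store fix: one same-size reload that can be
 * forwarded, then shifts and truncation in registers. */
static void split_halves(uint64_t value, uint32_t *lo, uint32_t *hi)
{
    uint64_t mem;
    memcpy(&mem, &value, sizeof mem);  /* the large (64-bit) store */
    uint64_t v;
    memcpy(&v, &mem, sizeof v);        /* one 64-bit reload: forwards cleanly */
    *lo = (uint32_t)v;                 /* halves extracted via register ops */
    *hi = (uint32_t)(v >> 32);
}
```

Loading the two 32-bit halves directly from `mem` would produce the same values, but each narrow load would have to wait for the wide store to retire instead of being forwarded.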

Page 176

These transformations, in general, increase the number of instructions required to perform the desired operation …

Page 177

SSE3 provides the LDDQU instruction for loading from memory addresses that are not 16-byte aligned. LDDQU is a …

Page 178

Increasing Bandwidth of Memory Fills and Video Fills: It is beneficial to understand how memory is accessed …

Page 179 - PUSH, CALL, RET). 2-84

Chapter 7: Multiprocessor and Hyper-Threading Technology. Describes guidelines and techniques for optimizing …

Page 180 - Tuning Suggestions

…same DRAM page have shorter latencies than sequential accesses to different DRAM pages. In many systems the …

Page 181 - Architectures

…aligned versions; this can reduce the performance gains when using the 128-bit SIMD integer extensions. …

Page 182 - Technologies

Packed SSE2 Integer versus MMX Instructions: In general, 128-bit SIMD integer instructions should be favored over …

Page 183

Optimizing for SIMD Floating-point Applications: This chapter discusses general rules of optimizing for the single-instruction, multiple-data (SIMD) …

Page 184 - bool OSSupportCheck() {

• Use MMX technology instructions and registers for copying data that is not used later in SIMD floating-point …

Page 185

• Is the data arranged for efficient utilization of the SIMD floating-point registers? • Is this application …

Page 186

When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector …

Page 187

For some applications, e.g., 3D geometry, the traditional data arrangement requires some changes to …

Page 188 - Programming

…simultaneously (referred to as an xyz data representation; see the diagram below) are computed in parallel, and …

Page 189

Optimizing for SIMD Floating-point Applications 55-7To utilize all 4 computation slots, the vertex data can be reorganized to allow computation on eac

Page 190 - Identifying Hot Spots

Introduction xxvii
Related Documentation
For more information on the Intel architecture, specific techniques, and processor architecture terminology refe

Page 191

IA-32 Intel® Architecture Optimization5-8Figure 5-2 shows how 1 result would be computed for 7 instructions if the data were organized as AoS and usin

Page 192 - Coding Techniques

Optimizing for SIMD Floating-point Applications 55-9Now consider the case when the data is organized as SoA. Example 5-2 demonstrates how 4 results ar
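A minimal sketch of the SoA payoff the text describes — with structure-of-arrays layout, one SSE operation per component produces four results at once. The function name and the squared-length computation are illustrative assumptions, not Example 5-2 itself.

```c
#include <xmmintrin.h>   /* SSE */

/* SoA layout: x[], y[], z[] each hold one component for four
   vertices, so a single mul/add sequence yields four squared
   lengths per call. */
static void squared_length_soa(const float *x, const float *y,
                               const float *z, float *out)
{
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    __m128 vz = _mm_loadu_ps(z);
    __m128 r  = _mm_add_ps(_mm_mul_ps(vx, vx),
                _mm_add_ps(_mm_mul_ps(vy, vy),
                           _mm_mul_ps(vz, vz)));
    _mm_storeu_ps(out, r);   /* four results per iteration */
}
```

With AoS layout the same work would need shuffles or scalar code; SoA lets every SIMD slot do useful work.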

Page 193 - Coding Methodologies

IA-32 Intel® Architecture Optimization 5-10
To gather data from 4 different memory locations on the fly, follow these steps:
1. Identify the first half of the

Page 194

Optimizing for SIMD Floating-point Applications 5-11
                       //         y1 x1
movhps xmm7, [ecx+16]  // xmm7 = y2 x2 y1 x1
movlps xmm0, [ecx+32]  // xmm0 = -- -- y3 x3
m

Page 195 - Intrinsics

IA-32 Intel® Architecture Optimization 5-12
Example 5-4 shows the same data-swizzling algorithm encoded using the Intel C++ Compiler’s intrinsics for S

Page 196

Optimizing for SIMD Floating-point Applications 55-13 Although the generated result of all zeros does not depend on the specific data contained in the

Page 197 - +”, “>>”)

IA-32 Intel® Architecture Optimization 5-14
Data Deswizzling
In the deswizzle operation, we want to arrange the SoA format back into AoS format so the xx

Page 198 - Automatic Vectorization

Optimizing for SIMD Floating-point Applications 55-15You may have to swizzle data in the registers, but not in memory. This occurs when two different

Page 199

IA-32 Intel® Architecture Optimization 5-16
// Start deswizzling here
movaps  xmm7, xmm4   // xmm7 = a1 a2 a3 a4
movhlps xmm7, xmm3   // xmm7 = b3 b4 a

Page 200 - Stack and Data Alignment

Optimizing for SIMD Floating-point Applications 5-17
Using MMX Technology Code for Copy or Shuffling Functions
If there are some parts in the code that

Page 201

IA-32 Intel® Architecture Optimization xxviii
Notational Conventions
This manual uses the following conventions:
This type style    Indicates an element of s

Page 202 - __m128* datatypes

IA-32 Intel® Architecture Optimization 5-18
Example 5-8 illustrates how to use MMX technology code for copying or shuffling.
Horizontal ADD Using SSE
Alth

Page 203 - __m128*

Optimizing for SIMD Floating-point Applications 5-19
Figure 5-3 Horizontal Add Using movhlps/movlhps
Example 5-9 Horizontal Add Using movhlps/movlhps
vo

Page 204 - Compiler-Supported Alignment

IA-32 Intel® Architecture Optimization 5-20
// START HORIZONTAL ADD
movaps  xmm5, xmm0  // xmm5 = A1,A2,A3,A4
movlhps xmm5, xmm1  // xmm5 = A1,A2,B1,
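The movhlps-based reduction in the example can be sketched with intrinsics. This is a hedged, single-register variant (summing the four lanes of one register), using `_mm_shuffle_ps` for the final lane swap; it is not the manual's multi-register Example 5-9 verbatim.

```c
#include <xmmintrin.h>   /* SSE */

/* Horizontal add of the four floats in v via a movhlps-style
   fold: add upper half to lower half, then fold once more. */
static float hsum_ps(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);           /* upper two lanes */
    __m128 s2 = _mm_add_ps(v, hi);             /* lane0+lane2, lane1+lane3 */
    __m128 s1 = _mm_shuffle_ps(s2, s2, 0x55);  /* broadcast lane 1 */
    return _mm_cvtss_f32(_mm_add_ss(s2, s1));
}
```

On SSE3 hardware, `haddps` (`_mm_hadd_ps`) can replace the shuffle/add pair, as a later section of the chapter notes.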

Page 205

Optimizing for SIMD Floating-point Applications 5-21
Use of cvttps2pi/cvttss2si Instructions
The cvttps2pi and cvttss2si instructions encode the trunca

Page 206

IA-32 Intel® Architecture Optimization5-22avoided since there is a penalty associated with writing this register; typically, through the use of the cv
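The truncating converts avoid the MXCSR rewrite entirely: `_mm_cvttss_si32` (cvttss2si) always rounds toward zero, matching a C-style cast without touching the rounding-mode field. A minimal sketch:

```c
#include <xmmintrin.h>   /* SSE */

/* C-style float-to-int truncation in one instruction; no change
   to (and no penalty for writing) the MXCSR rounding mode. */
static int float_to_int_trunc(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```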

Page 207 - Improving Memory Utilization

Optimizing for SIMD Floating-point Applications 5-23
SSE3 and Complex Arithmetics
The flexibility of SSE3 in dealing with AOS-type of data structure

Page 208 - SoA Data Structure

IA-32 Intel® Architecture Optimization5-24instructions to perform multiplications of single-precision complex numbers. Example 5-12 demonstrates using

Page 209

Optimizing for SIMD Floating-point Applications 5-25
Example 5-12 Division of Two Pairs of Single-precision Complex Numbers
// Division of (ak + i bk ) /

Page 210

IA-32 Intel® Architecture Optimization 5-26
SSE3 and Horizontal Computation
Sometimes the AOS type of data organization is more natural in many algebrai

Page 211

Optimizing for SIMD Floating-point Applications 5-27
SIMD Optimizations and Microarchitectures
Pentium M, Intel Core Solo and Intel Core Duo processors

Page 212 - Strip Mining

1-1
IA-32 Intel® Architecture Processor Family Overview
This chapter gives an overview of the features relevant to software optimization for the curre

Page 213 - Example 3-19 Strip Mined Code

IA-32 Intel® Architecture Optimization5-28When targeting complex arithmetics on Intel Core Solo and Intel Core Duo processors, using single-precision

Page 214 - Loop Blocking

6-1
Optimizing Cache Usage
Over the past decade, processor speed has increased more than ten times. Memory access speed has increased at a slower pace.

Page 215 - A. Original Loop

IA-32 Intel® Architecture Optimization6-2• Memory Optimization Using Hardware Prefetching, Software Prefetch and Cacheability Instructions: discusses

Page 216 - Blocking

Optimizing Cache Usage 6-3
• Facilitate compiler optimization:
— Minimize use of global variables and pointers
— Minimize use of complex control flow
—

Page 217

IA-32 Intel® Architecture Optimization 6-4
• Optimize software prefetch scheduling distance:
— Far ahead enough to allow interim computation to overlap m

Page 218

Optimizing Cache Usage 6-5
3. Follows only one stream per 4K page (load or store)
4. Can prefetch up to 8 simultaneous independent streams from eight d

Page 219 - Tuning the Final Application

IA-32 Intel® Architecture Optimization 6-6
Data reference patterns can be classified as follows:
Temporal: data will be used again soon
Spatial: data will b

Page 220

Optimizing Cache Usage 66-7The prefetch instruction is implementation-specific; applications need to be tuned to each implementation to maximize perfo

Page 221 - Optimizing for SIMD Integer

IA-32 Intel® Architecture Optimization 6-8
The Prefetch Instructions – Pentium 4 Processor Implementation
Streaming SIMD Extensions include four flavors

Page 222

Optimizing Cache Usage 66-9Currently, the prefetch instruction provides a greater performance gain than preloading because it:• has no destination reg

Page 223 - Using the EMMS Instruction

iii
Contents
Introduction
Chapter 1  IA-32 Intel® Architecture Processor Family Overview
SIMD Technology...

Page 224

IA-32 Intel® Architecture Optimization1-2Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and

Page 225

IA-32 Intel® Architecture Optimization 6-10
The Non-temporal Store Instructions
This section describes the behavior of streaming stores and reiterates so

Page 226 - Data Alignment

Optimizing Cache Usage 6-11
• Reduce disturbance of frequently used cached (temporal) data, since they write around the processor caches.
Streaming sto

Page 227 - Signed Unpack

IA-32 Intel® Architecture Optimization6-12evicting data from all processor caches). The Pentium M processor implements a combination of both approache

Page 228

Optimizing Cache Usage 66-13possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation ris

Page 229 - MM/M64 mm

IA-32 Intel® Architecture Optimization6-14In case the region is not mapped as WC, the streaming might update in-place in the cache and a subsequent sf

Page 230

Optimizing Cache Usage 66-15The maskmovq/maskmovdqu (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions

Page 231 - Non-Interleaved Unpack

IA-32 Intel® Architecture Optimization6-16The degree to which a consumer of data knows that the data is weakly-ordered can vary for these cases. As a

Page 232

Optimizing Cache Usage 6-17
The clflush Instruction
The cache line associated with the linear address specified by the value of byte address is invalidated

Page 233 - Extract Word

IA-32 Intel® Architecture Optimization 6-18
Memory Optimization Using Prefetch
The Pentium 4 processor has two mechanisms for data prefetch: software-con

Page 234 - Insert Word

Optimizing Cache Usage 6-19
Hardware Prefetch
The automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior d

Page 235 - Figure 4-6 pinsrw Instruction

IA-32 Intel® Architecture Processor Family Overview1-3each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The r

Page 236 - Move Byte Mask to Integer

IA-32 Intel® Architecture Optimization6-20• May consume extra system bandwidth if the application’s memory traffic has significant portions with strid

Page 237 - 55 47 39 23 15 7

Optimizing Cache Usage 6-21
Example 6-2 Populating an Array for Circular Pointer Chasing with Constant Stride
register char **p;
char *next; // Populat
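The population loop of Example 6-2 can be sketched as follows — each slot stores the address of the next slot, so every load in the measurement loop depends on the previous one (pointer chasing). The function name and parameterization are illustrative assumptions.

```c
#include <stddef.h>

/* Link a buffer into a circular pointer chain with constant
   stride: slot i points to slot i+1, and the last slot points
   back to the start. buf must be suitably aligned for char*. */
static void populate_chain(char *buf, size_t size, size_t stride)
{
    size_t i, n = size / stride;
    for (i = 0; i + 1 < n; i++)
        *(char **)(buf + i * stride) = buf + (i + 1) * stride;
    *(char **)(buf + (n - 1) * stride) = buf;   /* close the circle */
}
```

Walking the chain (`p = *(char **)p;`) then exposes raw load-to-load latency, since no address can be computed before the prior load completes.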

Page 238 - X1 X2 X3 X4

IA-32 Intel® Architecture Optimization 6-22
Example of Latency Hiding with S/W Prefetch Instruction
Achieving the highest level of memory optimization u

Page 239

Optimizing Cache Usage 66-23execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution

Page 240

IA-32 Intel® Architecture Optimization6-24The performance loss caused by poor utilization of resources can be completely eliminated by correctly sched

Page 241 - Generating Constants

Optimizing Cache Usage 6-25
• Balance single-pass versus multi-pass execution
• Resolve memory bank conflict issues
• Resolve cache management issues
The

Page 242

IA-32 Intel® Architecture Optimization6-26lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines

Page 243 - Building Blocks

Optimizing Cache Usage 66-27This memory de-pipelining creates inefficiency in both the memory pipeline and execution pipeline. This de-pipelining effe

Page 244

IA-32 Intel® Architecture Optimization6-28Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and i

Page 245 - Absolute Value

Optimizing Cache Usage 6-29
Minimize Number of Software Prefetches
Prefetch instructions are not completely free in terms of bus cycles, machine cycles

Page 246 - 0x8000800080008000

IA-32 Intel® Architecture Optimization1-4SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications a

Page 247 - Highly Efficient Clipping

IA-32 Intel® Architecture Optimization 6-30
Figure 6-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis indica

Page 248

Optimizing Cache Usage 6-31
Figure 6-5 Memory Access Latency and Execution With Prefetch (2 load streams, 1 store stream)

Page 249 - Signed Word

IA-32 Intel® Architecture Optimization 6-32
Mix Software Prefetch with Computation Instructions
It may seem convenient to cluster all of the prefetch ins

Page 250 - Packed Multiply High Unsigned

Optimizing Cache Usage 6-33
Example 6-6 Spread Prefetch Instructions
NOTE. To avoid instruction execution stalls due to the over-utilization of the

Page 251 - Packed Average (Byte/Word)

IA-32 Intel® Architecture Optimization 6-34
Software Prefetch and Cache Blocking Techniques
Cache blocking techniques, such as strip-mining, are used to

Page 252

Optimizing Cache Usage 66-35In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefet

Page 253 - 128-bit Shifts

IA-32 Intel® Architecture Optimization6-36Figure 6-7 shows how prefetch instructions and strip-mining can be applied to increase performance in both o

Page 254 - Memory Optimizations

Optimizing Cache Usage 6-37
In the scenario to the right in Figure 6-7, keeping the data in one way of the second-level cache does not improve cache loca

Page 255 - Partial Memory Accesses

IA-32 Intel® Architecture Optimization6-38Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the
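Strip-mining/blocking can be sketched in plain C: process the data in tiles small enough to stay cache-resident instead of streaming whole rows per pass. The blocked transpose below is an illustrative stand-in for the vertex example, and the tile size is an assumption — the text sizes it from the target cache.

```c
#define BLOCK 8   /* illustrative tile edge; tune to the cache */

/* Blocked (strip-mined) transpose of an n x n matrix: each
   BLOCK x BLOCK tile of src is read and written while it is
   still cache-resident, avoiding re-fetching rows from memory. */
static void transpose_blocked(const float *src, float *dst, int n)
{
    int ii, jj, i, j;
    for (ii = 0; ii < n; ii += BLOCK)
        for (jj = 0; jj < n; jj += BLOCK)
            for (i = ii; i < ii + BLOCK && i < n; i++)
                for (j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The result is identical to a naive transpose; only the traversal order (and hence the cache behavior) changes.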

Page 256

Optimizing Cache Usage 66-39Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The st

Page 257

IA-32 Intel® Architecture Processor Family Overview1-5SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can

Page 258

IA-32 Intel® Architecture Optimization 6-40
happen to be powers of 2, an aliasing condition due to the finite number of ways of associativity (see “Capacity Limits

Page 259 - Instruction

Optimizing Cache Usage 66-41references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually refe

Page 260

IA-32 Intel® Architecture Optimization6-42selected to ensure that the batch stays within the processor caches through all passes. An intermediate cach

Page 261

Optimizing Cache Usage 66-43The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipel

Page 262

IA-32 Intel® Architecture Optimization6-44a line burst transaction. To achieve the best possible performance, it is recommended to align data along th

Page 263 - Floating-point Applications

Optimizing Cache Usage 66-45The following examples of using prefetching instructions in the operation of video encoder and decoder as well as in simpl

Page 264 - Planning Considerations

IA-32 Intel® Architecture Optimization6-46Later, the processor re-reads the data using prefetchnta, which ensures maximum bandwidth, yet minimizes dis

Page 265 - Scalar Floating-point Code

Optimizing Cache Usage 6-47
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:
• alignment of da

Page 266

IA-32 Intel® Architecture Optimization 6-48
Using the 8-byte Streaming Stores and Software Prefetch
Example 6-11 presents the copy algorithm that uses se

Page 267

Optimizing Cache Usage 66-49In Example 6-11, eight _mm_load_ps and _mm_stream_ps intrinsics are used so that all of the data prefetched (a 128-byte ca

Page 268

IA-32 Intel® Architecture Optimization1-6SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decodin

Page 269

IA-32 Intel® Architecture Optimization 6-50
The instruction, temp = a[kk+CACHESIZE], is used to ensure the page table entry for array a is entered

Page 270

Optimizing Cache Usage 6-51
prefetch_loop:
    movaps xmm0, [esi+ecx]
    movaps xmm0, [esi+ecx+64]
    add ecx, 128
    cmp ecx, BLOCK_SIZE
    jne prefetch_loop
    xor ecx, ecx
alig

Page 271 - Data Swizzling

IA-32 Intel® Architecture Optimization 6-52
Performance Comparisons of Memory Copy Routines
The throughput of a large-region memory copy routine depends

Page 272 - Example 5-3 Swizzling Data

Optimizing Cache Usage 66-53The baseline for performance comparison is the throughput (bytes/sec) of 8-MByte region memory copy on a first-generation

Page 273

IA-32 Intel® Architecture Optimization6-54query each level of the cache hierarchy. Enumeration of each cache level is by specifying an index value (st

Page 274

Optimizing Cache Usage 66-55• Determine multi-threading resource topology in an MP system (See Section 7.10 of IA-32 Intel® Architecture Software Deve

Page 275

IA-32 Intel® Architecture Optimization6-56platform, software can extract information on the number and the identities of each logical processor sharin

Page 276 - Data Deswizzling

7-1
Multi-Core and Hyper-Threading Technology
This chapter describes software optimization techniques for multithreaded applications running in an envi

Page 277 - Instructions

IA-32 Intel® Architecture Optimization7-2cores but shared by two logical processors in the same core if Hyper-Threading Technology is enabled. This ch

Page 278 - Instructions (continued)

Multi-Core and Hyper-Threading Technology 77-3Figure 7-1 illustrates how performance gains can be realized for any workload according to Amdahl’s law.
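The curve in Figure 7-1 follows directly from Amdahl's law: if a fraction p of the work is parallelizable across n processors, the overall speedup is 1 / ((1 - p) + p/n). A one-line sketch:

```c
/* Amdahl's law: overall speedup on n processors when fraction p
   of the work runs in parallel and (1 - p) stays serial. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

The serial fraction bounds the gain: even with many processors, p = 0.5 can never exceed a 2x speedup.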

Page 279 - Functions

IA-32 Intel® Architecture Processor Family Overview 1-7
Intel® Extended Memory 64 Technology (Intel® EM64T)
Intel EM64T is an extension of the IA-32 Intel

Page 280 - Horizontal ADD Using SSE

IA-32 Intel® Architecture Optimization7-4When optimizing application performance in a multithreaded environment, control flow parallelism is likely to

Page 281 - C1 C2 D1 D2 C3 C4 D3 D4

Multi-Core and Hyper-Threading Technology 77-5terms of time of completion relative to the same task when in a single-threaded environment) will vary,

Page 282

IA-32 Intel® Architecture Optimization7-6When two applications are employed as part of a multi-tasking workload, there is little synchronization overh

Page 283 - MXCSR register should be

Multi-Core and Hyper-Threading Technology 7-7
Parallel Programming Models
Two common programming models for transforming independent task requirements

Page 284

IA-32 Intel® Architecture Optimization 7-8
Functional Decomposition
Applications usually process a wide variety of tasks with diverse functions and many

Page 285 - SSE3 and Complex Arithmetics

Multi-Core and Hyper-Threading Technology 77-9overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with t

Page 286

IA-32 Intel® Architecture Optimization7-10Producer-Consumer Threading Models Figure 7-3 illustrates the basic scheme of interaction between a pair of

Page 287

Multi-Core and Hyper-Threading Technology 77-11It is possible to structure the producer-consumer model in an interlaced manner such that it can minimi

Page 288

IA-32 Intel® Architecture Optimization7-12corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel i

Page 289

Multi-Core and Hyper-Threading Technology 7-13
Example 7-3 Thread Function for an Interlaced Producer Consumer Model
// master thread starts the first

Page 290

IA-32 Intel® Architecture Optimization 1-8
Intel NetBurst® Microarchitecture
The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hype

Page 291 - Optimizing Cache Usage

IA-32 Intel® Architecture Optimization 7-14
Tools for Creating Multithreaded Applications
Programming directly to a multithreading application programmin

Page 292

Multi-Core and Hyper-Threading Technology 77-15Automatic Parallelization of Code. While OpenMP directives allow programmers to quickly transform seria

Page 293 - Optimizing Cache Usage 6

IA-32 Intel® Architecture Optimization 7-16
Optimization Guidelines
This section summarizes optimization guidelines for tuning multithreaded applications

Page 294 - Hardware Prefetching of Data

Multi-Core and Hyper-Threading Technology 77-17• Place each synchronization variable alone, separated by 128 bytes or in a separate cache line. See “T

Page 295

IA-32 Intel® Architecture Optimization7-18• Adjust the private stack of each thread in an application so the spacing between these stacks is not offse

Page 296 - Prefetch

Multi-Core and Hyper-Threading Technology 77-19• For each processor supporting Hyper-Threading Technology, consider adding functionally uncorrelated t

Page 297

IA-32 Intel® Architecture Optimization7-20The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s

Page 298 - Implementation

Multi-Core and Hyper-Threading Technology 77-21the white paper “Developing Multi-threaded Applications: A Platform Consistent Approach” (referenced in

Page 299 - Cacheability Control

IA-32 Intel® Architecture Optimization 7-22
Synchronization for Short Periods
The frequency and duration that a thread needs to synchronize with other th

Page 300 - Streaming Non-temporal Stores

Multi-Core and Hyper-Threading Technology 77-23the processor must guarantee no violations of memory order occur. The necessity of maintaining the orde

Page 301 - WC semantics)

IA-32 Intel® Architecture Processor Family Overview1-9• to operate at high clock rates and to scale to higher performance and clock rates in the futur

Page 302 - Write-Combining

IA-32 Intel® Architecture Optimization 7-24
Example 7-4 Spin-wait Loop and PAUSE Instructions
(a) An un-optimized spin-wait loop experiences performan

Page 303 - Streaming Store Usage Models

Multi-Core and Hyper-Threading Technology 7-25
User/Source Coding Rule 21. (M impact, H generality) Insert the PAUSE instruction in fast spin loops an
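The coding rule can be sketched as a bounded spin-wait with `_mm_pause` (the PAUSE intrinsic) in the loop body; the hint avoids the memory-order mis-speculation penalty on spin exit and reduces power. The function shape and bound are illustrative assumptions.

```c
#include <emmintrin.h>   /* _mm_pause */

/* Spin on *flag for at most max_spins iterations, issuing PAUSE
   each pass. Returns 1 if the flag was observed set, 0 on timeout
   (when the caller should fall back to an OS blocking call, as the
   chapter recommends for long waits). */
static int spin_wait(volatile int *flag, int max_spins)
{
    int i;
    for (i = 0; i < max_spins; i++) {
        if (*flag)
            return 1;
        _mm_pause();   /* PAUSE hint: cheap delay inside the spin */
    }
    return 0;
}
```

A real lock would pair this with an atomic test-and-set; this sketch shows only where PAUSE belongs in the loop.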

Page 304

IA-32 Intel® Architecture Optimization7-26To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acq

Page 305 - • hand-crafted code

Multi-Core and Hyper-Threading Technology 77-27If an application thread must remain idle for a long time, the application should use a thread blocking

Page 306 - The mfence Instruction

IA-32 Intel® Architecture Optimization 7-28
Avoid Coding Pitfalls in Thread Synchronization
Synchronization between multiple threads must be designed and

Page 307 - The clflush Instruction

Multi-Core and Hyper-Threading Technology 77-29In general, OS function calls should be used with care when synchronizing threads. When using OS-suppor

Page 308 - Software-controlled Prefetch

IA-32 Intel® Architecture Optimization 7-30
Prevent Sharing of Modified Data and False-Sharing
On an Intel Core Duo processor, sharing of modified data i

Page 309 - Hardware Prefetch

Multi-Core and Hyper-Threading Technology 77-31User/Source Coding Rule 24. (H impact, M generality) Beware of false sharing within a cache line (64 by

Page 310

IA-32 Intel® Architecture Optimization7-32• Objects allocated dynamically by different threads may share cache lines. Make sure that the variables use

Page 311 - Constant Stride

Multi-Core and Hyper-Threading Technology 77-33• In managed environments that provide automatic object allocation, the object allocators and garbage

Page 312

IA-32 Intel® Architecture Optimization1-10The out-of-order core aggressively reorders µops so that µops whose inputs are ready (and have execution res

Page 313

IA-32 Intel® Architecture Optimization 7-34
Conserve Bus Bandwidth
In a multi-threading environment, bus bandwidth may be shared by memory traffic origin

Page 314

Multi-Core and Hyper-Threading Technology 77-35reads. An approximate working guideline for software to operate below bus saturation is to check if bus

Page 315

IA-32 Intel® Architecture Optimization 7-36
Avoid Excessive Software Prefetches
Pentium 4 and Intel Xeon Processors have an automatic hardware prefetcher

Page 316

Multi-Core and Hyper-Threading Technology 77-37latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to over

Page 317

IA-32 Intel® Architecture Optimization7-38Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software wri

Page 318

Multi-Core and Hyper-Threading Technology 77-39block size for loop blocking should be determined by dividing the target cache size by the number of lo

Page 319

IA-32 Intel® Architecture Optimization7-40User/Source Coding Rule 33. (H impact, M generality) Minimize the sharing of data between threads that execu

Page 320

Multi-Core and Hyper-Threading Technology 7-41
Example 7-8 shows the batched implementation of the producer and consumer thread functions.
Example 7-8

Page 321

IA-32 Intel® Architecture Optimization 7-42
Eliminate 64-KByte Aliased Data Accesses
The 64 KB aliasing condition is discussed in Chapter 2. Memory acces

Page 322

Multi-Core and Hyper-Threading Technology 7-43
Preventing Excessive Evictions in First-Level Data Cache
Cached data in a first-level data cache are ind

Page 323

IA-32 Intel® Architecture Processor Family Overview 1-11
The Front End
The front end of the Intel NetBurst microarchitecture consists of two parts:
• fetc

Page 324

IA-32 Intel® Architecture Optimization 7-44
Per-thread Stack Offset
To prevent private stack accesses in concurrent threads from thrashing the first-leve

Page 325

Multi-Core and Hyper-Threading Technology 7-45
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads
void Func_thread_entry(DWORD *pArg)
{
D

Page 326 - Non-Adjacent Passes Loops

IA-32 Intel® Architecture Optimization 7-46
Per-instance Stack Offset
Each instance of an application runs in its own linear address space; but the address

Page 327

Multi-Core and Hyper-Threading Technology 77-47However, the buffer space does enable the first-level data cache to be shared cooperatively when two co

Page 328

IA-32 Intel® Architecture Optimization 7-48
Front-end Optimization
In the Intel NetBurst microarchitecture family of processors, the instructions are dec

Page 329

Multi-Core and Hyper-Threading Technology 77-49On Hyper-Threading-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trac

Page 330

IA-32 Intel® Architecture Optimization7-50initial APIC_ID (See Section 7.10 of IA-32 Intel Architecture Software Developer’s Manual, Volume 3A for mor

Page 331

Multi-Core and Hyper-Threading Technology 7-51
Affinity masks can be used to optimize shared multi-threading resources.
Example 7-11 Assembling 3-le

Page 332 - 60 invis

IA-32 Intel® Architecture Optimization 7-52
Arrangements of affinity-binding can benefit performance more than other arrangements. This applies to:
• Sc

Page 333 - • write-once (non-temporal)

Multi-Core and Hyper-Threading Technology 77-53first to the primary logical processor of each processor core. This example is also optimized to the si

Page 334 - Cache Management

iv
Out-of-Order Core... 1-30
In-Order Retirement...

Page 335 - Video Decoder

IA-32 Intel® Architecture Optimization1-12The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch tar

Page 336

IA-32 Intel® Architecture Optimization7-54Example 7-12 Assembling a Look up Table to Manage Affinity Masks and Schedule Threads to Each Core First AFF

Page 337 - • cache size

Multi-Core and Hyper-Threading Technology 77-55Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache // Lo

Page 338

IA-32 Intel® Architecture Optimization7-56 PackageID[ProcessorNUM] = PACKAGE_ID;CoreID[ProcessorNum] = CORE_ID;SmtID[Processor

Page 339

Multi-Core and Hyper-Threading Technology 77-57For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {ProcessorMask << = 1;For

Page 340

IA-32 Intel® Architecture Optimization 7-58
Optimization of Other Shared Resources
Resource optimization in multi-threaded applications depends on the cac

Page 341

Multi-Core and Hyper-Threading Technology 77-59seldom reaches 50% of peak retirement bandwidth. Thus, improving single-thread execution throughput sho

Page 342

IA-32 Intel® Architecture Optimization7-60throughput of a physical processor package. The non-halted CPI metric can be interpreted as the inverse of t

Page 343

Multi-Core and Hyper-Threading Technology 77-61Using a function decomposition threading model, a multithreaded application can pair up a thread with c

Page 344 - Bit Location Name Meaning

IA-32 Intel® Architecture Optimization7-62Write-combining buffers are another example of execution resources shared between two logical processors. Wi

Page 345 - • Determine prefetch stride

8-1
64-bit Mode Coding Guidelines
Introduction
This chapter describes coding guidelines for application software written to run in 64-bit mode. These gu

Page 346 - Parameters

IA-32 Intel® Architecture Processor Family Overview1-13correct execution, the results of IA-32 instructions must be committed in original program orde

Page 347 - Hyper-Threading Technology

IA-32 Intel® Architecture Optimization8-2This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, ED

Page 348 - Performance and Usage Models

64-bit Mode Coding Guidelines 88-3If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compil

Page 349 - Multi-Thread on MP

IA-32 Intel® Architecture Optimization 8-4
Can be replaced with:
    movsx r8, r9w   ; If bits 63:8 do not need to be preserved.
    movsx r8, r10b  ; If bits 63:

Page 350 - Multitasking Environment

64-bit Mode Coding Guidelines 8-5
IMUL RAX, RCX
The 64-bit version above is more efficient than using the following 32-bit version:
MOV EAX, DWORD PTR[

Page 351

IA-32 Intel® Architecture Optimization 8-6
Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible
The CVTSI2SS and CVTSI2SD instructions convert a si

Page 352 - • hardware utilization

9-1
Power Optimization for Mobile Usages
Overview
Mobile computing allows computers to operate anywhere, anytime. Battery life is a key factor in delive

Page 353 - • functional decomposition

IA-32 Intel® Architecture Optimization9-2Pentium M, Intel Core Solo and Intel Core Duo processors implement features designed to enable the reduction

Page 354 - Functional Decomposition

Power Optimization for Mobile Usages 99-3to accommodate demand and adapt power consumption. The interaction between the OS power management policy and

Page 355 - P(1)P(1) C(1)C(1)P(1)

IA-32 Intel® Architecture Optimization 9-4
ACPI C-States
When computational demands are less than 100%, part of the time the processor is doing useful wo

Page 356 - C: consumer

Power Optimization for Mobile Usages 99-5The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and l

Page 357

IA-32 Intel® Architecture Optimization1-14• a mechanism fetches data only and includes two distinct components: (1) a hardware mechanism to fetch the

Page 358 - Thread 1

IA-32 Intel® Architecture Optimization 9-6
Figure 9-3 Application of C-states to Idle Time
Consider that a processor is in the lowest frequency (LFM - low fre

Page 359

Power Optimization for Mobile Usages 99-7• In an Intel Core Solo or Duo processor, after staying in C4 for an extended time, the processor may enter i

Page 360

IA-32 Intel® Architecture Optimization 9-8
Adjust Performance to Meet Quality of Features
When a system is battery powered, applications can extend batte

Page 361

Power Optimization for Mobile Usages 99-9• GetActivePwrScheme: Retrieves the active power scheme (current system power scheme) index. An application c

Page 362 - Optimization Guidelines

IA-32 Intel® Architecture Optimization9-10workload (usually that equates to reducing the number of instructions that the processor needs to execute, o

Page 363

Power Optimization for Mobile Usages 99-11disk operations over time. Use the GetDevicePowerState() Windows API to test disk state and delay the disk

Page 364

IA-32 Intel® Architecture Optimization 9-12
Using Enhanced Intel SpeedStep® Technology
Use Enhanced Intel SpeedStep Technology to adjust the processor to

Page 365 - Thread Synchronization

Power Optimization for Mobile Usages 99-13The same application can be written in such a way that work units are divided into smaller granularity, but

Page 366

IA-32 Intel® Architecture Optimization9-14An additional positive effect of continuously operating at a lower frequency is that frequent changes in pow

Page 367

Power Optimization for Mobile Usages 99-15Eventually, if the interval is large enough, the processor will be able to enter deeper sleep and save a con

Page 368

IA-32 Intel® Architecture Processor Family Overview 1-15
Branch Prediction
Branch prediction is important to the performance of a deeply pipelined proces

Page 369

IA-32 Intel® Architecture Optimization9-16thread enables the physical processor to operate at lower frequency relative to a single-threaded version. T

Page 370

Power Optimization for Mobile Usages 99-17demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by s

Page 371 - Optimization with Spin-Locks

IA-32 Intel® Architecture Optimization9-18processor to enter the lowest possible C-state type (lower-numbered C state has less power saving). For exam

Page 372 - PAUSE instruction in the

Power Optimization for Mobile Usages 99-19imbalance can be accomplished using performance monitoring events. Intel Core Duo processor provides an even

Page 373 - Example 7-5

IA-32 Intel® Architecture Optimization9-20

Page 374

A-1
Application Performance Tools
Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture

Page 375

IA-32 Intel® Architecture Optimization A-2
• Intel Performance Libraries
The Intel Performance Library family consists of a set of software libraries opt

Page 376

Application Performance Tools AA-3family. Vectorization, processor dispatch, inter-procedural optimization, profile-guided optimization and OpenMP par

Page 377

IA-32 Intel® Architecture OptimizationA-4default, and targets the Intel Pentium 4 processor and subsequent processors. Code produced will run on any I

Page 378

Application Performance Tools AA-5Vectorizer Switch OptionsThe Intel C++ and Fortran Compilers can vectorize your code using the vectorizer switch opti

Page 379 - System Bus Optimization

IA-32 Intel® Architecture Optimization1-16To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so

Page 380 - Conserve Bus Bandwidth

IA-32 Intel® Architecture OptimizationA-6Multithreading with OpenMP*Both the Intel C++ and Fortran Compilers support shared memory parallelism via Ope

Page 381

Application Performance Tools AA-7The -Qrcd option disables the change to truncation of the rounding mode in floating-point-to-integer conversions. Fo

Page 382

IA-32 Intel® Architecture OptimizationA-8When you use PGO, consider the following guidelines:• Minimize the changes to your program after instrumented

Page 383

Application Performance Tools AA-9SamplingSampling allows you to profile all active software on your system, including operating system, device driver

Page 384 - Memory Optimization

IA-32 Intel® Architecture OptimizationA-10Figure A-1 provides an example of a hotspots report by location. Event-based SamplingEvent-based sampling (

Page 385 - Shared-Memory Optimization

Application Performance Tools AA-11different events at a time. The number of the events that the VTune analyzer can collect at once on the Pentium 4 a

Page 386

IA-32 Intel® Architecture OptimizationA-12duration of read traffic compared to the duration of the workload is significantly less than unity, it indic

Page 387

Application Performance Tools AA-13stride inefficiency is most prominent on memory traffic. A useful indicator for large-stride inefficiency in a work

Page 388 - 4 KB in each thread

IA-32 Intel® Architecture OptimizationA-14The Call Graph View depicts the caller / callee relationships. Each thread in the application is the root of

Page 389

Application Performance Tools AA-15(SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). The library set includes the Inte

Page 390 - Per-thread Stack Offset

IA-32 Intel® Architecture Processor Family Overview1-17Some parts of the core may speculate that a common condition holds to allow faster execution. I

Page 391

IA-32 Intel® Architecture OptimizationA-16• Performance: Highly-optimized routines with a C interface that give Assembly-level performance in a C/C++

Page 392 - Per-instance Stack Offset

Application Performance Tools AA-17developed with the Intel Performance Libraries benefit from new architectural features of future generations of Int

Page 393

IA-32 Intel® Architecture OptimizationA-18The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes

Page 394 - Front-end Optimization

Application Performance Tools AA-19Figure A-2 shows Intel Thread Checker displaying the source code of the selected instance from a list of detected d

Page 395 - Resources

IA-32 Intel® Architecture OptimizationA-20Intel® Software CollegeThe Intel® Software College is a valuable resource for classes on Streaming SIMD Exte

Page 396

B-1BUsing Performance Monitoring EventsPerformance monitoring events provide facilities to characterize the interaction between programmed sequences

Page 397 - Processor

IA-32 Intel® Architecture OptimizationB-2The performance metrics listed in Tables B-1 through B-5 may be applicable to processors that support Hy

Page 398 - Processor (Contd.)

Using Performance Monitoring Events BB-3ReplayIn order to maximize performance for the common case, the Intel NetBurst microarchitecture sometimes ag

Page 399

IA-32 Intel® Architecture OptimizationB-4miss more than once during its lifetime, but a Misses Retired metric (for example, 1st-Level Cache Misses Re

Page 400

Using Performance Monitoring Events BB-5The first two metrics use performance counters, and thus can be used to cause interrupt upon overflow for samp

Page 401 - Sharing the Same Cache

IA-32 Intel® Architecture Optimization1-18execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the thro

Page 402

IA-32 Intel® Architecture OptimizationB-6Non-Sleep Clockticks The performance monitoring counters can also be configured to count clocks whenever the

Page 403

Using Performance Monitoring Events BB-7that logical processor is not halted (it may include some portion of the clock cycles for that logical process

Page 404

IA-32 Intel® Architecture OptimizationB-8Microarchitecture NotesTrace Cache EventsThe trace cache is not directly comparable to an instruction cache.

Page 405

Using Performance Monitoring Events BB-9Below is a simplified block diagram of the sub-systems connected to the IOQ unit in the front side bus s

Page 406

IA-32 Intel® Architecture OptimizationB-10 Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side BusChip SetSystem Memory1st

Page 407

Using Performance Monitoring Events BB-11Core references are nominally 64 bytes, the size of a 1st-level cache line. Smaller sizes are called partial

Page 408

IA-32 Intel® Architecture OptimizationB-12• IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses

Page 409 - Guidelines

Using Performance Monitoring Events BB-13transactions of the writeback (WB) memory type for the FSB IOQ and the BSQ can be an indication of how often

Page 410 - Only When Necessary

IA-32 Intel® Architecture OptimizationB-14Current implementations of the BSQ_cache_reference event do not distinguish between programmatic read and wr

Page 411 - Assembly/Compiler Coding rule

Using Performance Monitoring Events BB-15Usage Notes on Bus ActivitiesA number of performance metrics in Table B-1 are based on IOQ_active_entries and

Page 412 - 64-Bit Arithmetic

IA-32 Intel® Architecture Processor Family Overview1-19CachesThe Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At lea

Page 413 - Assembly/Compiler Coding Rule

IA-32 Intel® Architecture OptimizationB-16accesses (i.e., are also 3rd-level misses). This can decrease the average measured BSQ latencies for worklo

Page 414 - Using Software Prefetch

Using Performance Monitoring Events BB-17an expression built up from other metrics; for example, IPC is derived from two single-event metrics.• Column

Page 415 - Mobile Usages

IA-32 Intel® Architecture OptimizationB-18Table B-1 Pentium 4 Processor Performance MetricsMetric DescriptionEvent Name or Metric ExpressionEvent Mask

Page 416 - Mobile Usage Scenarios

Using Performance Monitoring Events BB-19Speculative Uops Retired Number of uops retired (include both instructions executed to completion and specula

Page 417

IA-32 Intel® Architecture OptimizationB-20Mispredicted returns The number of mispredicted returns including all causes. retired_mispred_branch_typeRET

Page 418 - ACPI C-States

Using Performance Monitoring Events BB-21TC Flushes Number of TC flushes (The counter will count twice for each occurrence. Divide the count by 2 to g

Page 419

IA-32 Intel® Architecture OptimizationB-22Logical Processor 1 Deliver ModeThe number of cycles that the trace and delivery engine (TDE) is delivering

Page 420

Using Performance Monitoring Events BB-23Logical Processor 0 Build ModeThe number of cycles that the trace and delivery engine (TDE) is building trace

Page 421

IA-32 Intel® Architecture OptimizationB-24Trace Cache MissesThe number of times that significant delays occurred in order to decode instructions and b

Page 422

Using Performance Monitoring Events BB-25Memory MetricsPage Walk DTLB All MissesThe number of page walk requests due to DTLB misses from either load o

Page 423 - Reducing Amount of Work

IA-32 Intel® Architecture Optimization1-20Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does not imply that it i

Page 424 - • Switch off unused devices

IA-32 Intel® Architecture OptimizationB-2664K Aliasing Conflicts1The number of 64K aliasing conflicts. A memory reference causing 64K aliasing conflic

Page 425

Using Performance Monitoring Events BB-27MOB Load ReplaysThe number of replayed loads related to the Memory Order Buffer (MOB). This metric counts onl

Page 426 - Technology

IA-32 Intel® Architecture OptimizationB-282nd-Level Cache Reads Hit Shared The number of 2nd-level cache read references (loads and RFOs) that hit the

Page 427

Using Performance Monitoring Events BB-293rd-Level Cache Reads Hit Modified The number of 3rd-level cache read references (loads and RFOs) that hit th

Page 428 - Enhanced Deeper Sleep

IA-32 Intel® Architecture OptimizationB-30All WCB Evictions The number of times a WC buffer eviction occurred due to any causes (This can be used to d

Page 429 - Multi-Core Considerations

Using Performance Monitoring Events BB-31Bus MetricsBus Accesses from the Processor The number of all bus transactions that were allocated in the IO Q

Page 430

IA-32 Intel® Architecture OptimizationB-32Prefetch Ratio Fraction of all bus transactions (including retires) that were for HW or SW prefetching.(Bus

Page 431

Using Performance Monitoring Events BB-33Writes from the Processor The number of all write transactions on the bus that were allocated in IO Queue fro

Page 432

IA-32 Intel® Architecture OptimizationB-34All WC from the Processor The number of Write Combining memory transactions on the bus that originated from

Page 433 - (C1-C4)

Using Performance Monitoring Events BB-35Bus Accesses from All Agents The number of all bus transactions that were allocated in the IO Queue by all ag

Page 434

IA-32 Intel® Architecture Processor Family Overview1-21back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion.

Page 435 - Application Performance

IA-32 Intel® Architecture OptimizationB-36Bus Reads Underway from the processor7 This is an accrued sum of the durations of all read (includes RFOs) t

Page 436 - Compilers

Using Performance Monitoring Events BB-37All UC Underway from the processor7 This is an accrued sum of the durations of all UC transactions by this pr

Page 437 - Code Optimization Options

IA-32 Intel® Architecture OptimizationB-38Bus Writes Underway from the processor7 This is an accrued sum of the durations of all write transactions by

Page 438

Using Performance Monitoring Events BB-39Write WC Full (BSQ)The number of write (but neither writeback nor RFO) transactions to WC-type memory. BSQ_al

Page 439 - Vectorizer Switch Options

IA-32 Intel® Architecture OptimizationB-40Reads Non-prefetch Full (BSQ) The number of read (excludes RFOs and HW|SW prefetches) transactions to WB-typ

Page 440 - Multithreading with OpenMP*

Using Performance Monitoring Events BB-41UC Write Partial (BSQ) The number of UC write transactions. Beware of granularity issues between BSQ and FSB

Page 441

IA-32 Intel® Architecture OptimizationB-42WB Writes Full Underway (BSQ)8 This is an accrued sum of the durations of writeback (evicted from cache) tra

Page 442 - VTune™ Performance Analyzer

Using Performance Monitoring Events BB-43Write WC Partial Underway (BSQ)8This is an accrued sum of the durations of partial write transactions to WC-t

Page 443 - Sampling

IA-32 Intel® Architecture OptimizationB-44SSE Input AssistsThe number of occurrences of SSE/SSE2 floating-point operations needing assistance to handl

Page 444 - Event-based Sampling

Using Performance Monitoring Events BB-451. A memory reference causing 64K aliasing conflict can be counted more than once in this stat. The resulting

Page 445 - Workload Characterization

vBranch Prediction... 2-15Eliminating B

Page 446

IA-32 Intel® Architecture Optimization1-22• avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal

Page 447 - Call Graph

IA-32 Intel® Architecture OptimizationB-464. Most commonly used x87 instructions (e.g., fmul, fadd, fdiv, fsqrt, fstp, etc.) decode into a single μop.

Page 448 - Performance Libraries

Using Performance Monitoring Events BB-47Table B-2 Metrics That Utilize Replay Tagging MechanismReplay Metric Tags1Bit field to set:IA32_PEBS_ENABLE B

Page 449 - Benefits Summary

IA-32 Intel® Architecture OptimizationB-48Tags for front_end_eventTable B-3 provides a list of the tags that are used by various metrics derived from

Page 450 - Optimizations with the Intel

Using Performance Monitoring Events BB-49Table B-4 Metrics That Utilize the Execution Tagging MechanismExecution Metric Tags Upstream ESCRTag Value in

Page 451 - Threading Tools

IA-32 Intel® Architecture OptimizationB-50Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3)Using Performance Metrics with Hyper-Threa

Page 452

Using Performance Monitoring Events BB-51The performance metrics listed in Table B-1 fall into three categories:• Logical processor specific and suppo

Page 453 - Thread Profiler

IA-32 Intel® Architecture OptimizationB-52Branching Metrics Branches RetiredTagged Mispredicted Branches RetiredMispredicted Branches RetiredAll retur

Page 454 - Software College

Using Performance Monitoring Events BB-53Memory Metrics Split Load Replays1Split Store Replays1MOB Load Replays164k Aliasing Conflicts1st-Level Cache

Page 455 - Using Performance Monitoring

IA-32 Intel® Architecture OptimizationB-54Bus Metrics Bus Accesses from the Processor1Non-prefetch Bus Accesses from the Processor1Reads from the Proc

Page 456 - Bus Ratio

Using Performance Monitoring Events BB-55Characterization Metrics x87 Input Assistsx87 Output AssistsMachine Clear CountMemory Order Machine ClearSelf

Page 457

IA-32 Intel® Architecture Processor Family Overview1-23Hardware prefetching for Pentium 4 processor has the following characteristics:• works with exi

Page 458 - Counting Clocks

IA-32 Intel® Architecture OptimizationB-56Using Performance Events of Intel Core Solo and Intel Core Duo processorsThere are performance events specif

Page 459 - Non-Halted Clockticks

Using Performance Monitoring Events BB-57There are three cycle-counting events which will not progress on a halted core, even if the halted core is be

Page 460 - Non-Sleep Clockticks

IA-32 Intel® Architecture OptimizationB-58• Some events, such as writebacks, may have non-deterministic behavior for different runs. In such a case, o

Page 461 - Time Stamp Counter

Using Performance Monitoring Events BB-59• Serial_Execution_Cycles, event number 3C, unit mask 02HThis event counts the bus cycles during which the co

Page 462 - Microarchitecture Notes

IA-32 Intel® Architecture OptimizationB-60

Page 463

C-1CIA-32 Instruction Latency and ThroughputThis appendix contains tables of the latency, throughput and execution units that are associated with more

Page 464 - Side Bus

IA-32 Intel® Architecture OptimizationC-2OverviewThe current generation of the IA-32 processor family uses out-of-order execution with dynamic scheduli

Page 465 - Reads due to program loads

IA-32 Instruction Latency and Throughput CC-3While several items on the above list involve selecting the right instruction, this appendix focuses on t

Page 466 - Writebacks (dirty evictions)

IA-32 Intel® Architecture OptimizationC-4DefinitionsThe IA-32 instruction performance data are listed in several tables. The tables contain the follow

Page 467

IA-32 Instruction Latency and Throughput CC-5accurately predict realistic performance of actual code sequences based on adding instruction latency dat

Page 468

IA-32 Intel® Architecture Optimization1-24Thus, software optimization of a data access pattern should emphasize tuning for hardware prefetch first to

Page 469 - Usage Notes on Bus Activities

IA-32 Intel® Architecture OptimizationC-6 Latency and Throughput with Register OperandsIA-32 instruction latency and throughput data are presented in

Page 470

IA-32 Instruction Latency and Throughput CC-7Table C-2 Streaming SIMD Extension 2 128-bit Integer InstructionsInstruction Latency1ThroughputExecution

Page 471

IA-32 Intel® Architecture OptimizationC-8PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm2 2 1 2 2 1 MMX_ALUPEXTRW r32, xmm, imm8 7 7 3 2 2 2 MMX_SHFT,FP_MISCPINSRW x

Page 472

IA-32 Instruction Latency and Throughput CC-9 PSUBB/PSUBW/PSUBD xmm, xmm2 2 1 2 2 1 MMX_ALU PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm2 2 1 2 2 1 MMX_A

Page 473

IA-32 Intel® Architecture OptimizationC-10COMISD xmm, xmm 7 6 1 2 2 1 FP_ADD, FP_MISCCVTDQ2PD xmm, xmm 8 8 4+1 3 3 4 FP_ADD, MMX_SHFTCVTPD2PI mm, xmm

Page 474

IA-32 Instruction Latency and Throughput CC-11DIVPD xmm, xmm 70 69 32+31 70 69 62 FP_DIVDIVSD xmm, xmm 39 38 32 39 38 31 FP_DIVMAXPD xmm, xmm 5 4 4 2

Page 475

IA-32 Intel® Architecture OptimizationC-12Table C-4 Streaming SIMD Extension Single-precision Floating-point Instructions Instruction Latency1Throughp

Page 476

IA-32 Instruction Latency and Throughput CC-13MOVLHPS3 xmm, xmm 44 22 MMX_SHFTMOVMSKPS r32, xmm 6 6 2 2 FP_MISCMOVSS xmm, xmm 4 4 2 2 MMX_SHFTMOVUPS x

Page 477

IA-32 Intel® Architecture OptimizationC-14 Table C-5 Streaming SIMD Extension 64-bit Integer Instructions Instruction Latency1Throughput Execution Uni

Page 478

IA-32 Instruction Latency and Throughput CC-15 PCMPGTB/PCMPGTD/PCMPGTW mm, mm22 11 MMX_ALUPMADDWD3 mm, mm 98 11 FP_MULPMULHW/PMULLW3 mm, mm98 11 FP_M

Page 479

IA-32 Intel® Architecture Processor Family Overview1-25Reordering loads with respect to each other can prevent a load miss from stalling later loads.

Page 480

IA-32 Intel® Architecture OptimizationC-16Table C-7 IA-32 x87 Floating-point Instructions Instruction Latency1ThroughputExecution Unit2CPUID 0F3n 0F2n

Page 481

IA-32 Instruction Latency and Throughput CC-17 FSCALE460 7FRNDINT430 11FXCH501FP_MOVEFLDZ60FINCSTP/FDECSTP60See “Table Footnotes”Table C-8 IA-32 Gener

Page 482

IA-32 Intel® Architecture OptimizationC-18Jcc7Not Appli-cable0.5 ALULOOP 8 1.5 ALUMOV 1 0.5 0.5 0.5 ALUMOVSB/MOVSW 1 0.5 0.5 0.5 ALUMOVZB/MOVZW 1 0.5

Page 483

IA-32 Instruction Latency and Throughput CC-19Table FootnotesThe following footnotes refer to all tables in this appendix.1. Latency information for m

Page 484

IA-32 Intel® Architecture OptimizationC-204. Latency and Throughput of transcendental instructions can vary substantially in a dynamic execution envir

Page 485

IA-32 Instruction Latency and Throughput CC-21For the sake of simplicity, all data being requested is assumed to reside in the first level data cache

Page 486

IA-32 Intel® Architecture OptimizationC-22

Page 487

D-1DStack AlignmentThis appendix details the alignment of stacks of data for Streaming SIMD Extensions and Streaming SIMD Extensions 2.Stack Fr

Page 488

IA-32 Intel® Architecture OptimizationD-2alignment for __m64 and double type data by enforcing that these 64-bit data items are at least eight-byte al

Page 489

Stack Alignment DD-3As an optimization, an alternate entry point can be created that can be called when proper stack alignment is provided by the call

Page 490

IA-32 Intel® Architecture Optimization1-26Intel® Pentium® M Processor MicroarchitectureLike the Intel NetBurst microarchitecture, the pipeline of the

Page 491

Stack Alignment DD-4Example D-1 in the following sections illustrates this technique. Note the entry points foo and foo.aligned, the latter being the alte

Page 492

Stack Alignment DD-5Example D-1 Aligned esp-Based Stack Framesvoid _cdecl foo (int k){ int j; foo: // See Note A push

Page 493

Stack Alignment DD-6Aligned ebp-Based Stack FramesIn ebp-based frames, padding is also inserted immediately before the return address. However, this f

Page 494

Stack Alignment DD-7Example D-2 Aligned ebp-based Stack Framesvoid _stdcall foo (int k){ int j; foo: push ebxmov ebx, espsub esp, 0x00000008and esp, 0

Page 495

Stack Alignment DD-8// the goal is to make esp and ebp// (0 mod 16) herej = k;mov edx, [ebx + 8] // k is (0 mod 16) if caller aligned// its stackmov [

Page 496

Stack Alignment DD-9Stack Frame OptimizationsThe Intel C++ Compiler provides certain optimizations that may improve the way aligned frames are set up

Page 497

IA-32 Intel® Architecture OptimizationD-10Inlined Assembly and ebxWhen using aligned frames, the ebx register generally should not be modified in inli

Page 498

E-1EMathematics of Prefetch Scheduling DistanceThis appendix discusses how far away to insert prefetch instructions. It presents a mathematical model

Page 499

IA-32 Intel® Architecture OptimizationE-2Ninst is the number of instructions in the scope of one loop iteration. Consider the following example of a heu

Page 500 - Tags for replay_event

Mathematics of Prefetch Scheduling Distance EE-3Tb is the data transfer latency, which is equal to the number of lines per iteration * line burst latency. Note that

Page 501

IA-32 Intel® Architecture Processor Family Overview1-27The Intel Pentium M processor microarchitecture is designed for lower power consumption. There

Page 502 - Tags for execution_event

IA-32 Intel® Architecture OptimizationE-4Memory access plays a pivotal role in prefetch scheduling. For more understanding of a memory subsystem, cons

Page 503

Mathematics of Prefetch Scheduling Distance EE-5Tl varies dynamically and is also system hardware-dependent. The static variants include the core-to-f

Page 504 - Technology

IA-32 Intel® Architecture OptimizationE-6No Preloading or PrefetchThe traditional programming approach does not perform data preloading or prefetch. I

Page 505 - Parallel Counting

Mathematics of Prefetch Scheduling Distance EE-7The iteration latency is approximately equal to the computation latency plus the memory leadoff latenc

Page 506 - Parallel Counting (continued)

IA-32 Intel® Architecture OptimizationE-8The following formula shows the relationship among the parameters:It can be seen from this relationship that

Page 507

Mathematics of Prefetch Scheduling Distance EE-9For this particular example the prefetch scheduling distance is greater than 1. Data being prefetched

Page 508

IA-32 Intel® Architecture OptimizationE-10Memory Throughput Bound (Case: Tb >= Tc)When the application or loop is memory throughput bound, the memo

Page 509

Mathematics of Prefetch Scheduling Distance EE-11memory, so you cannot do much about it. Typically, data copy from one space to another space, for exam

Page 510 - Intel Core Duo processors

IA-32 Intel® Architecture OptimizationE-12Now for the case Tl =18, Tb =8 (2 cache lines are needed per iteration) examine the following graph. Conside

Page 511 - Ratio Interpretation

Mathematics of Prefetch Scheduling Distance EE-13In reality, the front-side bus (FSB) pipelining depth is limited, that is, only four transactions are

Page 512 - Notes on Selected Events

IA-32 Intel® Architecture Optimization1-28The fetch and decode unit includes a hardware instruction prefetcher and three decoders that enable parallel

Page 513

IA-32 Intel® Architecture OptimizationE-14

Page 514

Index-1Index64-bit modedefault operand size, 8-1introduction, 8-1legacy instructions, 8-1multiplication notes, 8-2register usage, 8-2, 8-4sign-extensi

Page 515 - Throughput

IA-32 Intel® Architecture OptimizationIndex-2coding methodologies, 3-13coding techniques, 3-12absolute difference of signed numbers, 4-24absolute diff

Page 516 - Overview

IndexIndex-3floating-point stalls, 2-72flow dependency, E-7flush to zero, 5-22FXCH instruction, 2-70Ggeneral optimization techniques, 2-1branch predic

Page 517 - PADDQ and PMULUDQ, each have

IA-32 Intel® Architecture OptimizationIndex-4Llarge load stalls, 2-37latency, 2-72, 6-5lea instruction, 2-74loading and storing to and from the same D

Page 518 - Latency and Throughput

IndexIndex-5Ooptimizing cache utilizationcache management, 6-44examples, 6-15non-temporal store instructions, 6-10prefetch and load, 6-9prefetch Instr

Page 519

IA-32 Intel® Architecture OptimizationIndex-6Rreciprocal instructions, 5-2rounding control option, A-6Ssamplingevent-based, A-10Self-modifying code, 2

Page 520 - See “Table Footnotes”

INTEL SALES OFFICESASIA PACIFICAustraliaIntel Corp.Level 2448 St Kilda Road Melbourne VIC3004AustraliaFax:613-9862 5599ChinaIntel Corp.Rm 709, Shaanxi

Page 521

Intel Corp.999 CANADA PLACE, Suite 404,#11Vancouver BCV6C 3E2CanadaFax:604-844-2813Intel Corp.2650 Queensview Drive, Suite 250Ottawa ONK2B 8H6CanadaFa

Page 522

IA-32 Intel® Architecture Processor Family Overview1-29• Micro-ops (µops) fusion. Some of the most frequent pairs of µops derived from the same instru

Page 523

IA-32 Intel® Architecture Optimization1-30Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entri

Page 524 - Instructions (continued)

IA-32 Intel® Architecture Processor Family Overview1-31In-Order RetirementThe retirement unit in the Pentium M processor buffers completed µops is the

Page 525

viFloating-Point Stalls... 2-72x87 Floating-point

Page 526

IA-32 Intel® Architecture Optimization1-32• Power-optimized busThe system bus is optimized for power efficiency; increased bus speed supports 667 MHz.

Page 527

IA-32 Intel® Architecture Processor Family Overview1-33Data PrefetchingIntel Core Solo and Intel Core Duo processors provide hardware mechanisms to pr

Page 528

IA-32 Intel® Architecture Optimization1-34The two logical processors each have a complete set of architectural registers while sharing one single phys

Page 529

IA-32 Intel® Architecture Processor Family Overview1-35In the first implementation of HT Technology, the physical execution resources are shared and t

Page 530

IA-32 Intel® Architecture Optimization1-36Processor Resources and Hyper-Threading TechnologyThe majority of microarchitecture resources in a physical

Page 531

IA-32 Intel® Architecture Processor Family Overview1-37For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a lo

Page 532

IA-32 Intel® Architecture Optimization1-38Microarchitecture Pipeline and Hyper-Threading TechnologyThis section describes the HT Technology microarchi

Page 533 - Table Footnotes

IA-32 Intel® Architecture Processor Family Overview1-39Execution CoreThe core can dispatch up to six µops per cycle, provided the µops are ready to ex

Page 534

IA-32 Intel® Architecture Optimization1-40Pentium Processor Extreme Edition provide four logical processors in a physical package that has two executi

Page 535

IA-32 Intel® Architecture Processor Family Overview1-41Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo ProcessorS

Page 536

viiConsiderations for Code Conversion to SIMD Programming... 3-8Identifying Hot Spots ...

Page 537

IA-32 Intel® Architecture Optimization1-42Microarchitecture Pipeline and Multi-Core ProcessorsIn general, each core in a multi-core processor resemble

Page 538

IA-32 Intel® Architecture Processor Family Overview1-43that the cache line that contains the memory location is owned by the first-level data cache of

Page 539 - Stack Alignment D

IA-32 Intel® Architecture Optimization1-44when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple c

Page 540

2-12General Optimization GuidelinesThis chapter discusses general optimization techniques that can improve the performance of applications running on

Page 541

IA-32 Intel® Architecture Optimization2-2The following sections describe practices, tools, coding rules and recommendations associated with these fact

Page 542 - & 0x0f) == 0x08

General Optimization Guidelines 22-3* Streaming SIMD Extensions (SSE)** Streaming SIMD Extensions 2 (SSE2)General Practices and Coding GuidelinesThi

Page 543

IA-32 Intel® Architecture Optimization2-4Use Available Performance Tools• Current-generation compiler, such as the Intel C++ Compiler:— Set this compi

Page 544

General Optimization Guidelines 22-5Optimize Branch Predictability• Improve branch predictability and optimize instruction prefetching by arranging co

Page 545 - Stack Frame Optimizations

IA-32 Intel® Architecture Optimization2-6• Minimize use of global variables and pointers.• Use the const modifier; use the static modifier for global

Page 546 - Inlined Assembly and ebx

General Optimization Guidelines 22-7• Avoid longer latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e

Page 547 - Scheduling Distance

viiiPacked Shuffle Word for 64-bit Registers ... 4-18Packed Shuffle Word for 128-bi

Page 548 - Mathematical Model for PSD

IA-32 Intel® Architecture Optimization2-8• Avoid the use of conditionals.• Keep induction (loop) variable expressions simple.• Avoid using pointers, t

Page 549

General Optimization Guidelines 22-9Performance ToolsIntel offers several tools that can facilitate optimizing your application’s performance.Intel® C

Page 550 - L2 lookup miss latency

IA-32 Intel® Architecture Optimization2-10General Compiler RecommendationsA compiler that has been extensively tuned for the target microarchitec-ture

Page 551 - • Optimize T

General Optimization Guidelines 22-11The VTune Performance Analyzer also enables engineers to use these counters to measure a number of workload chara

Page 552 - No Preloading or Prefetch

IA-32 Intel® Architecture Optimization2-12Intel Core Solo and Intel Core Duo processors have an enhanced front end that is less sensitive to the 4-1-1 te

Page 553 - Execution cycles

General Optimization Guidelines 22-13• On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cac

Page 554 - Compute Bound (Case: T

IA-32 Intel® Architecture Optimization2-14Transparent Cache-Parameter StrategyIf CPUID instruction supports function leaf 4, also known as determinist

Page 555

General Optimization Guidelines 22-15Branch PredictionBranch optimizations have a significant impact on performance. By understanding the flow of bran

Page 556 - >= T

IA-32 Intel® Architecture Optimization2-16Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and

Page 557

General Optimization Guidelines 22-17See Example 2-2. The optimized code sets ebx to zero, then compares A and B. If A is greater than or equal to B,

Page 558

ixData Alignment... 5-4Data Arran

Page 559

IA-32 Intel® Architecture Optimization2-18The cmov and fcmov instructions are available on the Pentium II and subsequent processors, but not on Pentiu

Page 560

General Optimization Guidelines 22-19Static PredictionBranches that do not have a history in the BTB (see the “Branch Prediction” section) are predict

Page 561

IA-32 Intel® Architecture Optimization2-20Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static bra

Page 562

General Optimization Guidelines 22-21Examples 2-6, Example 2-7 provide basic rules for a static prediction algorithm.In Example 2-6, the backward bran

Page 563

IA-32 Intel® Architecture Optimization2-22Inlining, Calls and ReturnsThe return address stack mechanism augments the static and dynamic predictors to

Page 564

General Optimization Guidelines 22-23Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the work

Page 565

IA-32 Intel® Architecture Optimization2-24Placing data immediately following an indirect branch can cause a performance problem. If the data consist o

Page 566

General Optimization Guidelines 22-25indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those tar

Page 567 - INTEL SALES OFFICES

IA-32 Intel® Architecture Optimization2-26best performance from a coding effort. An example of peeling out the most favored target of an indirect bran

Page 568

General Optimization Guidelines 22-27• The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations
