IA-32 Intel® Architecture Optimization Reference Manual
Order Number: 248966-013US
April 2006
In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element.
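The idea can be sketched in C (the function name and the unroll-by-two formulation are illustrative, not the manual's exact example): unrolling the loop by two removes the per-iteration even/odd branch entirely.

```c
#include <stddef.h>

/* Hypothetical sketch: fill even-numbered elements with x and
   odd-numbered elements with y without testing index parity on
   every iteration -- the loop is unrolled by two instead. */
void fill_alternating(int *a, size_t n, int x, int y)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     = x;   /* even index */
        a[i + 1] = y;   /* odd index  */
    }
    if (i < n)          /* odd-length tail */
        a[i] = x;
}
```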
Memory Accesses
This section discusses guidelines for optimizing code and data memory accesses.
Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand size address boundaries.
Alignment of code is less of an issue for the Pentium 4 processor. Alignment of branch targets to maximize the bandwidth of fetching cached instructions matters mainly when not executing out of the trace cache.
Store Forwarding
The processor’s memory system only sends stores to memory (including cache) after store retirement.
If a variable is known not to change between when it is stored and when it is used again, the register that was stored can be used directly, avoiding a reload from memory.
The size and alignment restrictions for store forwarding are illustrated in Figure 2-2. Coding rules to help programs avoid store-forwarding stalls follow.
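A minimal C sketch of the restriction (illustrative, not the manual's example): a load can forward from a preceding store when it reads the same address with the same size, while a wide load that spans two narrower stores cannot forward and must wait for the stores to retire.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: same-size, same-address read-back forwards from the
   store buffer. */
uint32_t forward_ok(uint32_t v)
{
    uint32_t slot = v;   /* 32-bit store              */
    return slot;         /* 32-bit load, same address */
}

/* Sketch: a 64-bit load spanning two 32-bit stores cannot be
   forwarded and incurs a store-forwarding stall on Pentium 4. */
uint64_t forward_blocked(uint32_t lo, uint32_t hi)
{
    uint32_t parts[2] = { lo, hi };    /* two 32-bit stores      */
    uint64_t wide;
    memcpy(&wide, parts, sizeof wide); /* 64-bit load spans both */
    return wide;
}
```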
A load that forwards from a store must wait for the store’s data to be written to the store buffer before proceeding.
Example 2-14 illustrates a stalled store-forwarding situation that may appear in compiler-generated code.
When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves can be considered, subject to the store-forwarding restrictions discussed above.
Store-forwarding Restriction on Data Availability
The value to be stored must be available before the load operation can proceed.
An example of a loop-carried dependence chain is shown in Example 2-17.
Data Layout Optimizations
Cache line size for Pentium 4 and Pentium M processors can impact streaming applications (for example, multimedia applications).
However, if the access pattern of the array exhibits locality, for example if the array index is being swept through sequentially, the caches and prefetchers are used more effectively.
When data is accessed in a non-sequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can recognize up to eight simultaneous independent streams.
If for some reason it is not possible to align the stack for 64 bits, the routine should access the parameter and save it into a register or known aligned storage.
Capacity Limits in Set-Associative Caches
Capacity limits may occur if the number of outstanding memory references that are mapped to the same set exceeds the associativity of the cache.
Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors
Aliasing conditions that are specific to the Pentium 4 processor and Intel Xeon processor are described below.
Aliasing Cases in the Pentium M Processor
Pentium M, Intel Core Solo and Intel Core Duo processors have their own set of aliasing cases.
Mixing Code and Data
The Pentium 4 processor’s aggressive prefetching and pre-decoding of instructions has two related effects.
Avoid self-modifying code and cross-modifying code (when more than one processor in a multi-processor system writes to a code page).
Only four write-combining buffers are guaranteed to be available for simultaneous use when handling write misses.
There will be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see the IA-32 Intel® Architecture Software Developer’s Manual.
Locality enhancement to the last level cache can be accomplished by sequencing the data access pattern to take advantage of hardware prefetching.
Minimizing Bus Latency
The system bus on Intel Xeon and Pentium 4 processors provides up to 6.4 GB/sec bandwidth.
User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should favor data access patterns that result in higher concentrations of cache miss patterns, with cache miss strides that are significantly smaller than half the hardware prefetch trigger threshold.
Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions
Example 2-22 Non-temporal Stores and Partial Bus Write Transactions
Prefetching
The Pentium 4 processor has three prefetching mechanisms:
• hardware instruction prefetcher
• software prefetch for data
• hardware prefetch for cache lines of data or instructions
Arranging access patterns to suit the hardware prefetcher is highly recommended, and should be a higher-priority consideration than using software prefetch instructions.
• new cache line flush instruction
• new memory fencing instructions
For a detailed description of using cacheability instructions, see Chapter 6.
Guidelines for Optimizing Floating-point Code
User/Source Coding Rule 10. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches.
However, be careful of introducing more than a total of two values for the floating-point control word; changing it more often incurs significant penalties.
These depend on the desired numeric precision, the size of the look-up table, and taking advantage of the parallelism of the Streaming SIMD Extensions.
This is acceptable when executing SSE/SSE2/SSE3 instructions and when speed is more important than complying with the IEEE standard.
Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification.
There is a penalty for frequent changes to the FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo and Intel Core Duo processors, fldcw is improved over previous generations.
Assembly/Compiler Coding Rule 31. (H impact, M generality) Minimize changes to bits 8-12 of the floating point control word. Changes for more than two values (each value being a combination of the precision, rounding and infinity control bits) can incur delays.
If there is more than one change to rounding, precision and infinity bits, and the rounding mode is not important to the result, consider the approach shown in Example 2-23.
Example 2-23 Algorithm to Avoid Changing the Rounding Mode
_fto132 proc
    lea ecx, [esp-8]
    sub esp, 16        ; allocate frame
Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits.
Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the number of changes to the precision mode.
This in turn allows instructions to be reordered to make instructions available to be executed in parallel.
• Scalar floating-point registers may be accessed directly, avoiding fxch and top-of-stack restrictions.
Recommendation: Use the compiler switch to generate SSE2 scalar floating-point code over x87 code.
Floating-Point Stalls
Floating-point instructions have a latency of at least two cycles. But, because of the out-of-order nature of the processor, stalls will not necessarily occur on an instruction basis.
Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or Streaming SIMD Extensions 2.
Complex Instructions
Assembly/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.
Use of the inc and dec Instructions
The inc and dec instructions modify only a subset of the bits in the flag register.
Instructions that update only part of the flag register include CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly code with a partial flag register stall is shown in the accompanying example.
The Pentium M processor (model 9) does incur a penalty here. This is because every operation on a partial register updates the whole register.
Table 2-3 illustrates using movzx to avoid a partial register stall when packing three byte values into a register.
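The idea behind the movzx form can be sketched in C (function name illustrative): each byte is zero-extended to the full register width before merging, so no instruction writes only part of the accumulator register.

```c
#include <stdint.h>

/* Sketch: zero-extend each byte, then merge with full-width ORs.
   A compiler typically emits movzx for each (uint32_t) cast, so
   no partial register write occurs. */
uint32_t pack3(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((uint32_t)b2 << 16) |
           ((uint32_t)b1 << 8)  |
           (uint32_t)b0;
}
```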
This incurs less delay than the partial register update problem mentioned above, but the performance gain may vary.
Prefixes and Instruction Decoding
An IA-32 instruction can be up to 15 bytes in length. Prefixes can change the operand size, the address size, or the semantics of an instruction.
• Processing an instruction with the 0x66 prefix that has a modr/m byte in its encoding, where the prefix changes the instruction’s length, incurs a decode penalty.
String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable.
• Cache eviction: if the amount of data to be processed by a memory routine approaches half the size of the last-level cache, temporal locality of the cache may suffer.
To improve address alignment, a small piece of prolog code using movsb/stosb with a count less than 4 can be used to peel off the non-aligned bytes.
Memory routines in the runtime library generated by Intel Compilers are optimized across a wide range of address alignments.
In some situations, the byte count of the data to operate on is known by the context (versus being passed in as a parameter).
Clearing Registers
The Pentium 4 processor provides special support for the xor, sub, and pxor operations when the source and destination are the same register, recognizing that the result does not depend on the register’s old value.
Using a test instruction between the instruction that may modify part of the flag register and the instruction that uses the flag register avoids the partial flag register stall.
Use movapd as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the µops for movapd use a different execution port and this port is more likely to be free.
Prolog Sequences
Assembly/Compiler Coding Rule 57. (M impact, MH generality) In routines that do not need a frame pointer and that do not have called routines that modify ESP, use ESP as the base register to free up EBP.
Using memory as a destination operand may further reduce register pressure at the slight risk of making trace cache packing less efficient.
Spill Scheduling
The spill scheduling algorithm used by a code generator will be impacted by the Pentium 4 processor’s memory subsystem.
Because micro-ops are delivered from the trace cache in the common cases, decoding rules are not required.
SIMD instructions operate on data elements in parallel. The number of elements that can be operated on in parallel ranges from four single-precision floating-point data items to sixteen byte data items.
User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches.
The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware.
User/Source Coding Rules
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets, and at least one of those targets is correlated with branch history leading up to the branch, then convert the indirect branch to a tree where one or more indirect branches are preceded by conditional branches to those targets.
User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should favor data access patterns that result in higher concentrations of cache miss patterns, with cache miss strides that are significantly smaller than half the hardware prefetch trigger threshold.
For transcendental functions, consider a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance by choosing the desired numeric precision and the size of the look-up table.
When tuning, note that all IA-32 based processors have very high branch prediction rates. Consistently mispredicted branches are rare.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in 16-byte chunks.
Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data.
Avoid having more than 8 cache lines that are some multiple of 64 KB apart in the same second-level cache working set.
Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits.
Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags.
Use a test of a register with itself instead of a cmp of the register to zero; this saves the need to encode the zero and saves encoding space.
Assembly/Compiler Coding Rule 56. (M impact, ML generality) For arithmetic or logical operations that have their source operand in memory and whose result is reused, load the data into a register first.
Tuning Suggestions
Tuning Suggestion 1. Rarely, a performance problem may be noted due to executing data on a code page as instructions.
Coding for SIMD Architectures
Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions (SSE), and MMX™ technology.
Checking for Processor Support of SIMD Technologies
This section shows how to check whether a processor supports MMX technology, SSE, SSE2 or SSE3.
For more information on cpuid, see Intel® Processor Identification with CPUID Instruction, order number 241618.
To find out whether the operating system supports SSE, execute an SSE instruction and trap for an exception if one occurs.
Checking for Streaming SIMD Extensions 2 Support
Checking for support of SSE2 is like checking for SSE support: both the processor and the operating system must be checked.
Checking for Streaming SIMD Extensions 3 Support
SSE3 includes 13 instructions, 11 of which are suited for SIMD programming.
Example 3-6 Identification of SSE3 with cpuid
SSE3 requires the same support from the operating system as SSE. To find out whether the operating system supports SSE3, execute an SSE3 instruction and trap for an exception if one occurs.
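A hedged C sketch of the processor-support check (the GCC/Clang `<cpuid.h>` helper is a toolchain assumption; MSVC would use `__cpuid` instead). SSE2 is reported in CPUID.1:EDX bit 26 and SSE3 in CPUID.1:ECX bit 0; on non-x86 builds these functions simply report no support.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

int has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    return (d >> 26) & 1;     /* EDX bit 26: SSE2 */
#else
    return 0;                 /* not an IA-32/Intel 64 build */
#endif
}

int has_sse3(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    return c & 1;             /* ECX bit 0: SSE3 */
#else
    return 0;
#endif
}
```

Note that this covers only processor support; operating system support must still be verified separately, as described above.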
Example 3-7 Identification of SSE3 by the OS
Considerations for Code Conversion to SIMD Programming
The VTune™ Performance Enhancement Environment CD provides tools to aid in the evaluation and tuning.
Figure 3-1 Converting to Streaming SIMD Extensions Chart (flowchart: identify hot spots in the code and determine whether the code benefits from SIMD before converting)
To use any of the SIMD technologies optimally, you must evaluate the following situations in your code:
• fragments that are computationally intensive
• fragments that are executed often enough to have an impact on performance
Where appropriate, the coach displays pseudo-code to suggest the use of highly optimized intrinsics and functions in the Intel® Performance Library Suite.
These routines consume the most costly application processing time; however, they have potential for increased performance when converted to use one of the SIMD technologies.
Coding Methodologies
Software developers need to compare the performance improvement that can be obtained from assembly code versus the cost of those improvements.
The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the SSE.
Assembly
Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) in C/C++ code.
__m128i is used for Streaming SIMD Extensions 2 integer SIMD and __m128d is used for double-precision floating-point SIMD. These types enable the programmer to choose the implementation of an algorithm directly.
The intrinsic data types, however, are not a basic ANSI C data type, and you must therefore observe a number of usage restrictions.
Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats.
The caveat to this is that only certain types of loops can be automatically vectorized, and in most cases user interaction is required to help the compiler.
Stack and Data Alignment
To get the most performance out of code written for SIMD technologies, data should be formatted in memory according to the guidelines described below.
By adding the padding variable pad, the structure is now 8 bytes, and if the first element is aligned to 8 bytes (64 bits), all following elements are also aligned.
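The padding idea can be sketched in C (member names are illustrative): the unpadded structure is 6 bytes, so array elements drift off their natural alignment, while one extra padding word restores an 8-byte stride.

```c
/* Sketch: three shorts occupy 6 bytes; adding one padding short
   makes each array element 8 bytes, preserving 8-byte alignment
   for every element when the first is 8-byte aligned. */
struct unpadded { short a, b, c; };             /* 6 bytes */
struct padded   { short a, b, c; short pad; };  /* 8 bytes */
```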
Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation on the first data element will be fully aligned.
• Functions that use Streaming SIMD Extensions or Streaming SIMD Extensions 2 data need to provide a 16-byte aligned stack frame.
Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries.
The __declspec(align(16)) specifications can be placed before data declarations to force 16-byte alignment. This is useful for local or global data declarations that are assigned to SIMD data types.
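A hedged sketch of the same technique in another toolchain: __declspec(align(16)) is the MSVC spelling; GCC and Clang use the aligned attribute shown here (C11 also offers _Alignas). The variable name is illustrative.

```c
#include <stdint.h>

/* 16-byte aligned global, suitable for movaps-style SSE loads */
static float coeffs[4] __attribute__((aligned(16)));

/* An address is 16-byte aligned when its low four bits are clear. */
int is_16byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```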
In C++ (but not in C) it is also possible to force the alignment of a class/struct/union type, as in the code that follows.
Improving Memory Utilization
Memory performance can be improved by rearranging data and algorithms for SSE2, SSE, and MMX technology intrinsics.
There are two options for computing data in AoS format: perform the operation on the data as it stands in AoS format, or re-arrange it (swizzle it) into SoA format.
Performing SIMD operations on the original AoS format can require more calculations, and some of the operations do not take advantage of all of the SIMD elements available.
Computing on AoS data directly is somewhat inefficient, as there is the overhead of extra instructions during computation. Performing the swizzle statically, when the data structures are laid out, avoids this overhead.
Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation that uses separate x, y and z arrays, for example, requires three distinct data streams.
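The two layouts can be sketched in C (types, names and the vertex count are illustrative): AoS interleaves the components of each vertex, while SoA keeps each component contiguous so four x values can be loaded with a single SIMD load.

```c
#define NVERT 4

/* AoS: x, y, z interleaved per vertex */
struct aos_vertex { float x, y, z; };

/* SoA: each component stored contiguously */
struct soa_block  { float x[NVERT], y[NVERT], z[NVERT]; };

/* The "swizzle": rearrange AoS input into SoA form. */
void aos_to_soa(const struct aos_vertex *in, struct soa_block *out)
{
    for (int i = 0; i < NVERT; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```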
Strip Mining
Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD encodings of loops, as well as providing a means of improving memory performance.
The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine.
In Example 3-19, the computation has been strip-mined to a size strip_size. The value strip_size is chosen so that strip_size elements of the arrays fit within the cache hierarchy (most likely the second-level cache).
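A minimal C sketch of strip mining (the strip size and the two stand-in passes are illustrative, not the manual's Example 3-19): each strip is run through both passes while its data is still resident in cache, instead of sweeping the whole array twice.

```c
#define STRIP_SIZE 64

/* First pass ("transform") scales the strip in place; second pass
   ("lighting") accumulates it while the strip is still cached. */
float two_pass_strip_mined(float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i += STRIP_SIZE) {
        int len = (n - i < STRIP_SIZE) ? (n - i) : STRIP_SIZE;
        for (int j = 0; j < len; j++)   /* pass 1 over the strip */
            a[i + j] *= 2.0f;
        for (int j = 0; j < len; j++)   /* pass 2 reuses cached strip */
            sum += a[i + j];
    }
    return sum;
}
```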
For the first iteration of the inner loop, each access to array B will generate a cache miss. If the size of one row of array A is large enough, by the time the second iteration starts, each access to array B will again generate a cache miss.
This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 3-3, a block_size is selected as the loop blocking factor.
As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. If MAX is huge, loop blocking can also help reduce the penalty from DTLB (data translation look-aside buffer) misses.
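A C sketch of the blocking transformation (the sizes are illustrative, not the manual's figures): A[i][j] += B[j][i] walks B column-wise, so tiling the j loop keeps a block of B's rows resident in cache across the whole i sweep.

```c
#define N 8
#define BLOCK 4

/* Blocked version of a[i][j] += b[j][i]: the jj loop tiles the
   column sweep so each BLOCK-wide slice of b is reused from cache. */
void add_transpose_blocked(float a[N][N], float b[N][N])
{
    for (int jj = 0; jj < N; jj += BLOCK)
        for (int i = 0; i < N; i++)
            for (int j = jj; j < jj + BLOCK; j++)
                a[i][j] += b[j][i];
}
```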
Note that this technique can be applied to both SIMD integer and SIMD floating-point code.
Recommendation: When targeting code generation for Intel Core Solo and Intel Core Duo processors, favor instruction sequences that perform well on both microarchitectures.
Optimizing for SIMD Integer Applications
The SIMD integer instructions provide performance improvements in applications that are integer-intensive and can take advantage of the SIMD architecture.
For planning considerations of using the new SIMD integer instructions, refer to “Checking for Streaming SIMD Extensions 2 Support” in Chapter 3.
Using SIMD Integer with x87 Floating-point
All 64-bit SIMD integer instructions use the MMX registers, which share register state with the x87 floating-point stack.
Using emms clears all of the valid bits, effectively emptying the x87 floating-point stack and making it ready for new x87 floating-point operations.
• Don’t empty when already empty: if the next instruction uses an MMX register, _mm_empty() incurs a cost with no benefit.
Data Alignment
Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned.
Signed Unpack
Signed numbers should be sign-extended when unpacking the values. This is similar to the zero-extend shown in the unsigned unpack example.
Interleaved Pack with Saturation
The pack instructions pack two values into the destination register in a predetermined order.
Figure 4-2 illustrates two values interleaved in the destination register, and Example 4-4 shows code that uses the operation.
The pack instructions always assume that the source operands are signed numbers. The result in the destination register is always defined by the pack instruction that performs the operation.
Non-Interleaved Unpack
The unpack instructions perform an interleave merge of the data elements of the destination and source operands into the destination register.
The other destination register will contain the opposite combination, illustrated in Figure 4-4.
Extract Word
The pextrw instruction takes the word in the designated MMX register selected by the two least significant bits of the immediate value and moves it to the lower half of a 32-bit integer register.
Insert Word
The pinsrw instruction loads a word from the lower half of a 32-bit integer register or from memory and inserts it in an MMX technology register at a word position selected by the two least significant bits of the immediate value.
If all of the operands in a register are being replaced by a series of pinsrw instructions, it can be useful to clear the content and break the dependence chain by either using the pxor instruction or loading the register.
Move Byte Mask to Integer
The pmovmskb instruction returns a bit mask formed from the most significant bits of each byte of its source operand.
Figure 4-7 pmovmskb Instruction Example
Example 4-10 pmovmskb Instruction Code
; Input: source value
; Output: a bit mask of the bytes’ most significant bits
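A scalar C sketch of what pmovmskb computes (for the 64-bit MMX form; the 128-bit SSE2 form gathers sixteen bits the same way): the most significant bit of each source byte becomes one bit of the result mask.

```c
#include <stdint.h>

/* Gather the MSB of each of eight bytes into one 8-bit mask:
   bit i of the result is the sign bit of byte i. */
unsigned byte_msb_mask(const uint8_t src[8])
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (unsigned)(src[i] >> 7) << i;
    return mask;
}
```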
Packed Shuffle Word for 64-bit Registers
The pshufw instruction (see Figure 4-8, Example 4-11) uses the immediate (imm8) operand to select between the four words in either two MMX registers or one MMX register.
Packed Shuffle Word for 128-bit Registers
The pshuflw/pshufhw instructions perform a full shuffle of any source word field within the low/high 64 bits to any result word field in the low/high 64 bits.
Unpacking/interleaving 64-bit Data in 128-bit Registers
The punpcklqdq/punpckhqdq instructions interleave the low/high-order 64 bits of the source operand and the low/high-order 64 bits of the destination operand and write them to the destination register.
Data Movement
There are two additional instructions to enable data movement from the 64-bit SIMD integer registers to the 128-bit SIMD registers.
pxor   MM0, MM0
pcmpeq MM1, MM1
psubb  MM0, MM1  [psubw MM0, MM1]  (psubd MM0, MM1)
; the three instructions above generate
; the constant 1 in every packed-byte
; [packed-word] (packed-dword) field
Building Blocks
This section describes instructions and algorithms which implement common code building blocks.
Absolute Difference of Signed Numbers
Example 4-16 computes the absolute difference of two signed numbers.
Absolute Value
Use Example 4-18 to compute |x|, where x is signed. This example assumes signed words to be the operands.
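A scalar C sketch of the branchless |x| idiom the SIMD sequence relies on: m is 0 for non-negative x and all ones for negative x, so (x ^ m) - m negates exactly when needed. As with the SIMD form, the most negative word (-32768) has no positive counterpart and wraps; the arithmetic right shift of a signed value is a common-toolchain assumption.

```c
#include <stdint.h>

int16_t abs_word(int16_t x)
{
    int16_t m = x >> 15;             /* arithmetic shift: 0 or -1 */
    return (int16_t)((x ^ m) - m);   /* conditional negate         */
}
```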
Clipping to an Arbitrary Range [high, low]
This section explains how to clip values to an arbitrary range [high, low].
Highly Efficient Clipping
For clipping signed words to an arbitrary range, the pmaxsw and pminsw instructions may be used.
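A scalar C sketch of the min/max clipping pair (one word at a time; the SIMD form does eight words branchlessly): pminsw bounds the value from above and pmaxsw from below.

```c
#include <stdint.h>

int16_t clip_word(int16_t x, int16_t low, int16_t high)
{
    if (x > high) x = high;   /* what pminsw does per packed word */
    if (x < low)  x = low;    /* what pmaxsw does per packed word */
    return x;
}
```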
The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last instruction converts the data back to signed data.
This technique uses packed-add and packed-subtract instructions with unsigned saturation; thus it can only be used on packed-byte and packed-word data types.
Unsigned Byte
The pmaxub instruction returns the maximum between the eight unsigned bytes in either of two SIMD registers.
The subtraction operation presented above is an absolute difference; that is, t = abs(x−y).
The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on packed unsigned words.
Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the result to 16 bits, saturating if appropriate.
Memory Optimizations
You can improve memory accesses using the following techniques:
• Avoiding partial memory accesses
• Increasing the bandwidth of memory fills and video fills
• Prefetching data
Partial Memory Accesses
Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address mem).
Let us now consider a case with a series of small loads after a large store to the same area of memory (beginning at memory address mem).
These transformations, in general, increase the number of instructions required to perform the desired operation.
SSE3 provides the LDDQU instruction for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits.
Increasing Bandwidth of Memory Fills and Video Fills
It is beneficial to understand how memory is accessed and filled.
Accesses to the same DRAM page have shorter latencies than sequential accesses to different DRAM pages.
Unaligned accesses are slower than their aligned versions; this can reduce the performance gains when using the 128-bit SIMD integer extensions.
Packed SSE2 Integer versus MMX Instructions
In general, 128-bit SIMD integer instructions should be favored over 64-bit MMX instructions.
Optimizing for SIMD Floating-point Applications
This chapter discusses general rules for optimizing for the single-instruction, multiple-data (SIMD) floating-point instructions.
• Use MMX technology instructions and registers for copying data that is not used later in SIMD floating-point computations.
• Is the data arranged for efficient utilization of the SIMD floating-point registers?
• Is this application targeted for processors with SIMD floating-point support?
When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector form.
For some applications, e.g., 3D geometry, the traditional data arrangement requires some changes to fully utilize the SIMD registers and parallel techniques.
In this mode, the x, y and z components of a single vertex (referred to as an xyz data representation; see the diagram below) are computed in parallel.
To utilize all 4 computation slots, the vertex data can be reorganized to allow computation on each component of 4 separate vertices.
Figure 5-2 shows how 1 result would be computed for 7 instructions if the data were organized as AoS and using SSE alone: 4 results would require 28 instructions.
Now consider the case when the data is organized as SoA. Example 5-2 demonstrates how 4 results are computed for 5 instructions.
To gather data from 4 different memory locations on the fly, follow these steps:
1. Identify the first half of the 128-bit memory location.
movlps xmm7, [ecx]       // xmm7 = -- -- y1 x1
movhps xmm7, [ecx+16]    // xmm7 = y2 x2 y1 x1
movlps xmm0, [ecx+32]    // xmm0 = -- -- y3 x3
movhps xmm0, [ecx+48]    // xmm0 = y4 x4 y3 x3
Example 5-4 shows the same data-swizzling algorithm encoded using the Intel C++ Compiler’s intrinsics for SSE.
Although the generated result of all zeros does not depend on the specific data contained in the source operand (that is, XOR of a register with itself always produces all zeros), the instruction cannot execute until the register value is available.
Data Deswizzling
In the deswizzle operation, we want to arrange the SoA format back into AoS format so the xxxx, yyyy, zzzz arrangement is converted back to xyz per vertex.
You may have to swizzle data in the registers, but not in memory. This occurs when two different functions need to process the data in different layouts.
// Start deswizzling here
movaps  xmm7, xmm4   // xmm7 = a1 a2 a3 a4
movhlps xmm7, xmm3   // xmm7 = b3 b4 a3 a4
Using MMX Technology Code for Copy or Shuffling Functions
If there are some parts in the code that are mainly copying, shuffling, or doing logical manipulations that do not require use of SSE code, consider performing these actions with MMX technology code.
Example 5-8 illustrates how to use MMX technology code for copying or shuffling.
Horizontal ADD Using SSE
Although vertical computations generally use SIMD performance better than horizontal computations, in some cases the code must use a horizontal operation.
Figure 5-3 Horizontal Add Using movhlps/movlhps
Example 5-9 Horizontal Add Using movhlps/movlhps
// START HORIZONTAL ADD
movaps  xmm5, xmm0   // xmm5 = A1,A2,A3,A4
movlhps xmm5, xmm1   // xmm5 = A1,A2,B1,B2
Use of cvttps2pi/cvttss2si Instructions
The cvttps2pi and cvttss2si instructions encode the truncate/chop rounding mode implicitly in the instruction, taking precedence over the rounding mode specified in the MXCSR register.
Writes to the MXCSR register should be avoided since there is a penalty associated with writing this register; typically, through the use of the cvttps2pi and cvttss2si instructions, the rounding control in MXCSR can always be set to round-to-nearest.
SSE3 and Complex Arithmetic
The flexibility of SSE3 in dealing with AoS-type data structures can be demonstrated by the example of multiplication and division of complex numbers.
SSE3 offers instructions to perform multiplications of single-precision complex numbers. Example 5-12 demonstrates using SSE3 instructions to perform division of complex numbers.
Example 5-12 Division of Two Pairs of Single-precision Complex Numbers
// Division of (ak + i bk) / (ck + i dk)
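A scalar C sketch of one such division, the operation the SSE3 sequence performs on two pairs at once: multiply numerator and denominator through by the conjugate of the denominator.

```c
/* (a + ib) / (c + id) = ((ac + bd) + i(bc - ad)) / (c^2 + d^2) */
void cdiv(float a, float b, float c, float d, float *re, float *im)
{
    float denom = c * c + d * d;    /* |c + id|^2      */
    *re = (a * c + b * d) / denom;  /* real part       */
    *im = (b * c - a * d) / denom;  /* imaginary part  */
}
```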
SSE3 and Horizontal Computation
The AoS type of data organization is more natural in many algebraic formulas.
SIMD Optimizations and Microarchitectures
Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than the Intel NetBurst® microarchitecture.
IA-32 Intel® Architecture Processor Family Overview
This chapter gives an overview of the features relevant to software optimization for the current generation of IA-32 processors.
When targeting complex arithmetic on Intel Core Solo and Intel Core Duo processors, using single-precision SSE3 instruction sequences is recommended.
Optimizing Cache Usage
Over the past decade, processor speed has increased more than ten times, while memory access speed has increased at a slower pace.
• Memory Optimization Using Hardware Prefetching, Software Prefetch and Cacheability Instructions: discusses techniques for using these instructions.
• Facilitate compiler optimization:
— Minimize use of global variables and pointers
— Minimize use of complex control flow
• Optimize software prefetch scheduling distance:
— Far enough ahead to allow interim computation to overlap memory access time
— Near enough that the prefetched data is not replaced from the data cache before it is used
3. Follows only one stream per 4K page (load or store)
4. Can prefetch up to 8 simultaneous independent streams from eight different 4K regions
Data reference patterns can be classified as follows:
Temporal: data will be used again soon
Spatial: data will be used in adjacent locations, for example, on the same cache line
Non-temporal: data which is referenced once and not reused in the immediate future
The prefetch instruction is implementation-specific; applications need to be tuned to each implementation to maximize performance.
The Prefetch Instructions – Pentium 4 Processor Implementation
Streaming SIMD Extensions include four flavors of prefetch instructions: one non-temporal (prefetchnta) and three temporal (prefetcht0, prefetcht1, prefetcht2).
Currently, the prefetch instruction provides a greater performance gain than preloading because it:
• has no destination register and only updates cache lines
• does not stall normal instruction retirement
• does not affect the functional behavior of the program
The Non-temporal Store Instructions
This section describes the behavior of streaming stores and reiterates some…
• Reduce disturbance of frequently used cached (temporal) data, since they write around the processor caches.
Streaming stores…
…evicting data from all processor caches). The Pentium M processor implements a combination of both approaches…
…possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks…
In case the region is not mapped as WC, the streaming store might update in-place in the cache and a subsequent sfence…
The maskmovq/maskmovdqu (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions register)…
The degree to which a consumer of data knows that the data is weakly-ordered can vary for these cases. As a…

The clflush Instruction
The cache line associated with the linear address specified by the value of the byte address is invalidated…

Memory Optimization Using Prefetch
The Pentium 4 processor has two mechanisms for data prefetch: software-controlled…

Hardware Prefetch
The automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data…
• May consume extra system bandwidth if the application’s memory traffic has significant portions with strides…

Example 6-2 Populating an Array for Circular Pointer Chasing with Constant Stride
register char **p;
char *next;
// Populate…
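The populate step in Example 6-2 can be sketched in portable C as follows; sizes and names are illustrative. Each load in the walk depends on the previous one, which is what makes a pointer-chase useful for measuring (and hiding) memory latency:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Build a circular chain of pointers with constant stride through a buffer.
   NUM and STRIDE are illustrative; alignas keeps the pointer slots aligned. */
enum { NUM = 16, STRIDE = 64 };
static alignas(64) char buf[NUM * STRIDE];

static char **build_chain(void)
{
    for (int i = 0; i < NUM; i++) {
        char **slot = (char **)&buf[i * STRIDE];
        *slot = &buf[((i + 1) % NUM) * STRIDE];  /* wrap to form a circle */
    }
    return (char **)&buf[0];
}

/* Walk the chain n steps; each load's address depends on the prior load. */
static char *walk(char **p, int n)
{
    char *cur = (char *)p;
    while (n-- > 0)
        cur = *(char **)cur;
    return cur;
}
```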
Example of Latency Hiding with S/W Prefetch Instruction
Achieving the highest level of memory optimization using…
…execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution…
The performance loss caused by poor utilization of resources can be completely eliminated by correctly scheduling…
• Balance single-pass versus multi-pass execution
• Resolve memory bank conflict issues
• Resolve cache management issues
…lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines…
This memory de-pipelining creates inefficiency in both the memory pipeline and execution pipeline. This de-pipelining effect…
Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its…
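A minimal C sketch of prefetch concatenation, using the GCC/Clang builtin __builtin_prefetch to stand in for the prefetch instructions discussed here; loop bounds and array names are illustrative:

```c
#include <assert.h>

/* Near the end of each inner loop, start fetching the NEXT outer
   iteration's data, bridging the pipeline bubble at the loop boundary. */
#define ROWS 8
#define COLS 128                 /* 128 floats = 8 cache lines of 64 bytes */

static float src[ROWS][COLS], dst[ROWS][COLS];

static void copy_with_concatenated_prefetch(void)
{
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++) {
            dst[r][c] = src[r][c];
            /* prefetch concatenation: late in this row, touch the next row */
            if (c == COLS - 16 && r + 1 < ROWS)
                __builtin_prefetch(&src[r + 1][0], 0 /*read*/, 1);
        }
    }
}
```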
Minimize Number of Software Prefetches
Prefetch instructions are not completely free in terms of bus cycles, machine cycles…
Figure 6-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis indicates…
Figure 6-5 Memory Access Latency and Execution With Prefetch (2 load streams, 1 store stream)

Mix Software Prefetch with Computation Instructions
It may seem convenient to cluster all of the prefetch instructions…

Example 6-6 Spread Prefetch Instructions
NOTE: To avoid instruction execution stalls due to the over-utilization of the…

Software Prefetch and Cache Blocking Techniques
Cache blocking techniques, such as strip-mining, are used to…
In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch…
Figure 6-7 shows how prefetch instructions and strip-mining can be applied to increase performance in both of…
In the scenario to the right in Figure 6-7, keeping the data in one way of the second-level cache does not improve cache locality…
Without strip-mining, all the x, y, z coordinates for the four vertices must be re-fetched from memory in the…
Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps…
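The strip-mining pattern in this usage model can be sketched as follows; the strip size and the two passes are illustrative placeholders for real processing stages:

```c
#include <assert.h>

/* Instead of making multiple full passes over a large array (each pass
   re-fetching everything from memory), process the data in strips sized
   to stay resident in the second-level cache, applying every pass to one
   strip before moving on. */
enum { STRIP = 128 };   /* illustrative: real code sizes this to L2 */

static void pass_a(float *v, int n) { for (int i = 0; i < n; i++) v[i] += 1.0f; }
static void pass_b(float *v, int n) { for (int i = 0; i < n; i++) v[i] *= 2.0f; }

static void process_strip_mined(float *v, int n)
{
    for (int s = 0; s < n; s += STRIP) {
        int len = (n - s < STRIP) ? n - s : STRIP;
        pass_a(v + s, len);   /* both passes touch the strip while it */
        pass_b(v + s, len);   /* is still warm in cache               */
    }
}
```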
…happen to be powers of 2, an aliasing condition due to the finite number of ways of set associativity (see “Capacity Limits…
…references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually references…
…selected to ensure that the batch stays within the processor caches through all passes. An intermediate cache…
The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipeline…
…a line burst transaction. To achieve the best possible performance, it is recommended to align data along the…
The following examples of using prefetching instructions in the operation of video encoder and decoder as well as in simple…
Later, the processor re-reads the data using prefetchnta, which ensures maximum bandwidth yet minimizes disturbance…
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:
• alignment of data…

Using the 8-byte Streaming Stores and Software Prefetch
Example 6-11 presents the copy algorithm that uses…
In Example 6-11, eight _mm_load_ps and _mm_stream_ps intrinsics are used so that all of the data prefetched (a 128-byte cache line)…
The instruction temp = a[kk+CACHESIZE] is used to ensure that the page table entry for array a is entered…

prefetch_loop:
movaps xmm0, [esi+ecx]
movaps xmm0, [esi+ecx+64]
add ecx, 128
cmp ecx, BLOCK_SIZE
jne prefetch_loop
xor ecx, ecx
align…
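The two-phase structure of the assembly above (touch a block into cache, then copy it) can be sketched in C; __builtin_prefetch is a GCC/Clang builtin standing in for the block-load instructions, and the block size is illustrative:

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 4096, CACHE_LINE = 64 };   /* illustrative sizes */

static void block_copy(char *dst, const char *src, size_t n)
{
    for (size_t off = 0; off < n; off += BLOCK) {
        size_t len = (n - off < BLOCK) ? n - off : BLOCK;
        /* phase 1: bring the whole block into cache */
        for (size_t i = 0; i < len; i += CACHE_LINE)
            __builtin_prefetch(src + off + i, 0, 3);
        /* phase 2: copy the now-resident block */
        memcpy(dst + off, src + off, len);
    }
}
```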
Performance Comparisons of Memory Copy Routines
The throughput of a large-region memory copy routine depends…
The baseline for performance comparison is the throughput (bytes/sec) of an 8-MByte region memory copy on a first-generation…
…query each level of the cache hierarchy. Enumeration of each cache level is by specifying an index value (starting…
• Determine multi-threading resource topology in an MP system (see Section 7.10 of IA-32 Intel® Architecture Software Developer’s…
…platform, software can extract information on the number and the identities of each logical processor sharing…

Multi-Core and Hyper-Threading Technology
This chapter describes software optimization techniques for multithreaded applications running in an environment…
…cores but shared by two logical processors in the same core if Hyper-Threading Technology is enabled. This chapter…
Figure 7-1 illustrates how performance gains can be realized for any workload according to Amdahl’s law…
When optimizing application performance in a multithreaded environment, control flow parallelism is likely to…
…terms of time of completion relative to the same task when in a single-threaded environment) will vary…
When two applications are employed as part of a multi-tasking workload, there is little synchronization overhead…

Parallel Programming Models
Two common programming models for transforming independent task requirements…

Functional Decomposition
Applications usually process a wide variety of tasks with diverse functions and many…
…overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with the…

Producer-Consumer Threading Models
Figure 7-3 illustrates the basic scheme of interaction between a pair of…
It is possible to structure the producer-consumer model in an interlaced manner such that it can minimize…
…corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel in…

Example 7-3 Thread Function for an Interlaced Producer Consumer Model
// master thread starts the first…
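The alternation of the two buffers in Example 7-3 can be modeled single-threaded; a real implementation runs the produce and consume calls in separate threads with per-buffer synchronization. All names and sizes here are illustrative:

```c
#include <assert.h>

/* While the consumer works on buffer i, the producer fills buffer 1 - i. */
enum { NBUF = 2, LEN = 4, ITERS = 6 };
static int bufs[NBUF][LEN];

static void produce(int *b, int iter)
{
    for (int i = 0; i < LEN; i++) b[i] = iter * 100 + i;
}

static int consume(const int *b)
{
    int sum = 0;
    for (int i = 0; i < LEN; i++) sum += b[i];
    return sum;
}

static int run_interlaced(void)
{
    int total = 0;
    produce(bufs[0], 0);                  /* master primes the first buffer */
    for (int it = 0; it < ITERS; it++) {
        /* producer would fill the other buffer here, concurrently */
        if (it + 1 < ITERS) produce(bufs[(it + 1) % NBUF], it + 1);
        total += consume(bufs[it % NBUF]);
    }
    return total;
}
```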
Tools for Creating Multithreaded Applications
Programming directly to a multithreading application programming interface…
Automatic Parallelization of Code. While OpenMP directives allow programmers to quickly transform serial…

Optimization Guidelines
This section summarizes optimization guidelines for tuning multithreaded applications…
• Place each synchronization variable alone, separated by 128 bytes, or in a separate cache line. See “…
• Adjust the private stack of each thread in an application so the spacing between these stacks is not offset…
• For each processor supporting Hyper-Threading Technology, consider adding functionally uncorrelated threads…
The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s…
…the white paper “Developing Multi-threaded Applications: A Platform Consistent Approach” (referenced in…

Synchronization for Short Periods
The frequency and duration that a thread needs to synchronize with other threads…
…the processor must guarantee no violations of memory order occur. The necessity of maintaining the order…
Example 7-4 Spin-wait Loop and PAUSE Instructions
(a) An un-optimized spin-wait loop experiences performance…
User/Source Coding Rule 21. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and…
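A C sketch of the rule: _mm_pause emits PAUSE on IA-32 targets; elsewhere the loop degenerates to a plain spin. The flag is pre-set here so the sketch terminates; real code would have another thread set it:

```c
#include <assert.h>
#if defined(__i386__) || defined(__x86_64__)
#include <xmmintrin.h>
#define cpu_relax() _mm_pause()     /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)       /* no PAUSE on this target */
#endif

static volatile int sync_var;       /* set by another thread in real code */

static int spin_until_set(int max_iters)
{
    int iters = 0;
    while (sync_var == 0 && iters < max_iters) {
        cpu_relax();   /* hint: spin-wait; avoids exit mis-speculation,
                          reduces power while spinning */
        iters++;
    }
    return sync_var != 0;   /* 1 if the flag was observed set */
}
```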
To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acquire…
If an application thread must remain idle for a long time, the application should use a thread blocking…

Avoid Coding Pitfalls in Thread Synchronization
Synchronization between multiple threads must be designed and…
In general, OS function calls should be used with care when synchronizing threads. When using OS-supported…

Prevent Sharing of Modified Data and False-Sharing
On an Intel Core Duo processor, sharing of modified data incurs…
User/Source Coding Rule 24. (H impact, M generality) Beware of false sharing within a cache line (64 bytes…
• Objects allocated dynamically by different threads may share cache lines. Make sure that the variables used…
• In managed environments that provide automatic object allocation, the object allocators and garbage collectors…
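One common remedy for Rule 24 is to pad and align each thread's hot variable to its own 64-byte cache line; a C11 sketch (the structure name is illustrative):

```c
#include <assert.h>
#include <stdalign.h>

#define CACHE_LINE 64

/* Each counter occupies exactly one cache line, so writes by one thread
   never invalidate another thread's line (no false sharing). */
struct padded_counter {
    alignas(CACHE_LINE) long count;
    char pad[CACHE_LINE - sizeof(long)];   /* fill out the line */
};

static struct padded_counter counters[4];  /* e.g., one per thread */
```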
Conserve Bus Bandwidth
In a multi-threading environment, bus bandwidth may be shared by memory traffic originated…
…reads. An approximate working guideline for software to operate below bus saturation is to check if bus…

Avoid Excessive Software Prefetches
Pentium 4 and Intel Xeon processors have an automatic hardware prefetcher…
…latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to overlap…
Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software write-combining…
…block size for loop blocking should be determined by dividing the target cache size by the number of logical…
User/Source Coding Rule 33. (H impact, M generality) Minimize the sharing of data between threads that execute…
Example 7-8 shows the batched implementation of the producer and consumer thread functions.
Example 7-8…

Eliminate 64-KByte Aliased Data Accesses
The 64-KByte aliasing condition is discussed in Chapter 2. Memory accesses…
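The 64-KByte aliasing condition can be checked with simple address arithmetic; this helper (illustrative, at cache-line granularity) flags two references whose line addresses coincide modulo 64 KBytes:

```c
#include <assert.h>
#include <stdint.h>

/* Two references alias in the 64-KByte window when their cache-line
   addresses (low 6 bits masked off) are equal modulo 64 KBytes.
   Simplified sketch; the remedy in the text is to offset one access. */
static int aliases_64k(uintptr_t a, uintptr_t b)
{
    return (((a & ~(uintptr_t)63) ^ (b & ~(uintptr_t)63)) & 0xFFFFu) == 0;
}
```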
Preventing Excessive Evictions in First-Level Data Cache
Cached data in a first-level data cache are indexed…
Per-thread Stack Offset
To prevent private stack accesses in concurrent threads from thrashing the first-level…

Example 7-9 Adding an Offset to the Stack Pointer of Three Threads
Void Func_thread_entry(DWORD *pArg)
{
DWORD…
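The effect of Example 7-9 can be modeled portably by computing the per-thread offsets it applies with _alloca; the 1-KByte unit below is illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define STACK_OFFSET_UNIT 1024   /* illustrative spacing between threads */

/* Offset applied to each thread's stack so concurrently live frames do
   not map to the same first-level cache sets. */
static size_t stack_offset_for(int thread_num)
{
    return (size_t)thread_num * STACK_OFFSET_UNIT;
}

/* Two threads collide when their offsets coincide modulo the span of
   cache sets (span is illustrative, e.g. the L1 size). */
static int offsets_collide(int t1, int t2, size_t span)
{
    return stack_offset_for(t1) % span == stack_offset_for(t2) % span;
}
```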
Per-instance Stack Offset
Each instance of an application runs in its own linear address space, but the address…
However, the buffer space does enable the first-level data cache to be shared cooperatively when two…

Front-end Optimization
In the Intel NetBurst microarchitecture family of processors, the instructions are decoded…
On Hyper-Threading-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache’s…
…initial APIC_ID (see Section 7.10 of IA-32 Intel Architecture Software Developer’s Manual, Volume 3A for more…
Affinity masks can be used to optimize shared multi-threading resources.
Example 7-11 Assembling 3-level…
Arrangements of affinity-binding can benefit performance more than other arrangements. This applies to:
• Scheduling…
…first to the primary logical processor of each processor core. This example is also optimized to the…
Example 7-12 Assembling a Lookup Table to Manage Affinity Masks and Schedule Threads to Each Core First
AFF…

Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache
// Lo…

PackageID[ProcessorNum] = PACKAGE_ID;
CoreID[ProcessorNum] = CORE_ID;
SmtID[Processor…

For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {
    ProcessorMask <<= 1;
    For…
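The ID-extraction step that Examples 7-12 and 7-13 rely on decomposes the initial APIC ID into SMT, core and package sub-fields. A sketch with illustrative field widths (real code derives the widths from CPUID):

```c
#include <assert.h>
#include <stdint.h>

/* Initial APIC ID as a bit field: [ PackageID | CoreID | SmtID ].
   Widths below assume 2 logical processors per core and 2 cores per
   package, for illustration only. */
#define SMT_WIDTH  1
#define CORE_WIDTH 1

static uint32_t smt_id(uint32_t apic)
{
    return apic & ((1u << SMT_WIDTH) - 1);
}

static uint32_t core_id(uint32_t apic)
{
    return (apic >> SMT_WIDTH) & ((1u << CORE_WIDTH) - 1);
}

static uint32_t package_id(uint32_t apic)
{
    return apic >> (SMT_WIDTH + CORE_WIDTH);
}
```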
Optimization of Other Shared Resources
Resource optimization in multi-threaded applications depends on the cache…
…seldom reaches 50% of peak retirement bandwidth. Thus, improving single-thread execution throughput should…
…throughput of a physical processor package. The non-halted CPI metric can be interpreted as the inverse of the…
Using a function decomposition threading model, a multithreaded application can pair up a thread with…
Write-combining buffers are another example of execution resources shared between two logical processors. With…

64-bit Mode Coding Guidelines
Introduction
This chapter describes coding guidelines for application software written to run in 64-bit mode. These guidelines…
This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, EDI…
If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compiler…
Can be replaced with:
movsx r8, r9w  ; if bits 63:16 do not need to be preserved
movsx r8, r10b ; if bits 63:8…
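In C terms, the movsx forms above correspond to sign-extending casts, which compilers emit as a single movsx; helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 (or 8) bits of a 64-bit register value,
   as movsx r64, r16 / movsx r64, r8 do. */
static int64_t sext16(uint64_t r) { return (int64_t)(int16_t)(r & 0xFFFF); }
static int64_t sext8 (uint64_t r) { return (int64_t)(int8_t)(r & 0xFF); }
```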
IMUL RAX, RCX
The 64-bit version above is more efficient than using the following 32-bit version:
MOV EAX, DWORD PTR[…
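The same choice in C: multiplying two 64-bit operands lets the compiler emit the single IMUL RAX, RCX form rather than the 32-bit multiply sequence; the function is an illustrative sketch:

```c
#include <assert.h>
#include <stdint.h>

/* A full 64-bit product; compiles to one 64-bit IMUL in 64-bit mode. */
static int64_t mul64(int64_t a, int64_t b) { return a * b; }
```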
Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible
The CVTSI2SS and CVTSI2SD instructions convert a signed…

Power Optimization for Mobile Usages
Overview
Mobile computing allows computers to operate anywhere, anytime. Battery life is a key factor in delivering…
Pentium M, Intel Core Solo and Intel Core Duo processors implement features designed to enable the reduction…
…to accommodate demand and adapt power consumption. The interaction between the OS power management policy and…

ACPI C-States
When computational demands are less than 100%, part of the time the processor is doing useful work…
The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and lower…
Figure 9-3 Application of C-states to Idle Time
Consider that a processor is in the lowest frequency (LFM, low frequency mode)…
• In an Intel Core Solo or Duo processor, after staying in C4 for an extended time, the processor may enter…

Adjust Performance to Meet Quality of Features
When a system is battery powered, applications can extend battery…
• GetActivePwrScheme: Retrieves the active power scheme (current system power scheme) index. An application can…
…workload (usually that equates to reducing the number of instructions that the processor needs to execute, or…
…disk operations over time. Use the GetDevicePowerState() Windows API to test disk state and delay the disk…

Using Enhanced Intel SpeedStep® Technology
Use Enhanced Intel SpeedStep Technology to adjust the processor to…
The same application can be written in such a way that work units are divided into smaller granularity, but…
An additional positive effect of continuously operating at a lower frequency is that frequent changes in power…
Eventually, if the interval is large enough, the processor will be able to enter deeper sleep and save a considerable…
…thread enables the physical processor to operate at lower frequency relative to a single-threaded version. This…
…demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by…
…processor to enter the lowest possible C-state type (a lower-numbered C-state has less power saving). For example…
…imbalance can be accomplished using performance monitoring events. The Intel Core Duo processor provides an event…
Application Performance Tools
Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture…
• Intel Performance Libraries: The Intel Performance Library family consists of a set of software libraries optimized…
…family. Vectorization, processor dispatch, inter-procedural optimization, profile-guided optimization and OpenMP parallelism…
…default, and targets the Intel Pentium 4 processor and subsequent processors. Code produced will run on any IA-32…

Vectorizer Switch Options
The Intel C++ and Fortran Compilers can vectorize your code using the vectorizer switch options…
Multithreading with OpenMP*
Both the Intel C++ and Fortran Compilers support shared memory parallelism via OpenMP…
The -Qrcd option disables the change to truncation of the rounding mode in floating-point-to-integer conversions. For…
When you use PGO, consider the following guidelines:
• Minimize the changes to your program after instrumented…

Sampling
Sampling allows you to profile all active software on your system, including operating system, device drivers…
Figure A-1 provides an example of a hotspots report by location.

Event-based Sampling
Event-based sampling (…
…different events at a time. The number of the events that the VTune analyzer can collect at once on the Pentium 4 and…
…duration of read traffic compared to the duration of the workload is significantly less than unity, it indicates…
…stride inefficiency is most prominent on memory traffic. A useful indicator for large-stride inefficiency in a workload…
The Call Graph View depicts the caller/callee relationships. Each thread in the application is the root of…
…(SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). The library set includes the Intel…
• Performance: Highly optimized routines with a C interface that give assembly-level performance in a C/C++…
…developed with the Intel Performance Libraries benefit from new architectural features of future generations of Intel…
The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes…
Figure A-2 shows Intel Thread Checker displaying the source code of the selected instance from a list of detected…

Intel® Software College
The Intel® Software College is a valuable resource for classes on Streaming SIMD Extensions…

Using Performance Monitoring Events
Performance monitoring events provide facilities to characterize the interaction between programmed sequences…
The performance metrics listed in Tables B-1 through B-5 may be applicable to processors that support Hyper-Threading…

Replay
In order to maximize performance for the common case, the Intel NetBurst microarchitecture sometimes aggressively…
…miss more than once during its lifetime, but a Misses Retired metric (for example, 1st-Level Cache Misses Retired)…
The first two metrics use performance counters, and thus can be used to cause an interrupt upon overflow for sampling…
Non-Sleep Clockticks
The performance monitoring counters can also be configured to count clocks whenever the…
…that logical processor is not halted (it may include some portion of the clock cycles for that logical processor…

Microarchitecture Notes
Trace Cache Events
The trace cache is not directly comparable to an instruction cache…
A simplified block diagram of the sub-systems connected to the IOQ unit in the front side bus sub-system is shown below…
Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus
Core references are nominally 64 bytes, the size of a 1st-level cache line. Smaller sizes are called partials…
• IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials’ hits or misses…
…transactions of the writeback (WB) memory type for the FSB IOQ and the BSQ can be an indication of how often…
Current implementations of the BSQ_cache_reference event do not distinguish between programmatic read and write…

Usage Notes on Bus Activities
A number of performance metrics in Table B-1 are based on IOQ_active_entries and…
…accesses (i.e., are also 3rd-level misses). This can decrease the average measured BSQ latencies for workloads…
…an expression built up from other metrics; for example, IPC is derived from two single-event metrics.
• Column…

Table B-1 Pentium 4 Processor Performance Metrics
(Columns: Metric, Description, Event Name or Metric Expression, Event Mask…)
Speculative Uops Retired: Number of uops retired (includes both instructions executed to completion and speculatively…
Mispredicted Returns: The number of mispredicted returns, including all causes. retired_mispred_branch_type, RET…
TC Flushes: Number of TC flushes (the counter will count twice for each occurrence; divide the count by 2 to get…
Logical Processor 1 Deliver Mode: The number of cycles that the trace and delivery engine (TDE) is delivering…
Logical Processor 0 Build Mode: The number of cycles that the trace and delivery engine (TDE) is building traces…
Trace Cache Misses: The number of times that significant delays occurred in order to decode instructions and build…

Memory Metrics
Page Walk DTLB All Misses: The number of page walk requests due to DTLB misses from either load or…
64K Aliasing Conflicts1: The number of 64K aliasing conflicts. A memory reference causing a 64K aliasing conflict…
MOB Load Replays: The number of replayed loads related to the Memory Order Buffer (MOB). This metric counts only…
2nd-Level Cache Reads Hit Shared: The number of 2nd-level cache read references (loads and RFOs) that hit the…
3rd-Level Cache Reads Hit Modified: The number of 3rd-level cache read references (loads and RFOs) that hit the…
All WCB Evictions: The number of times a WC buffer eviction occurred due to any cause (this can be used to…

Bus Metrics
Bus Accesses from the Processor: The number of all bus transactions that were allocated in the IO Queue…
Prefetch Ratio: Fraction of all bus transactions (including retires) that were for HW or SW prefetching. (Bus…
Writes from the Processor: The number of all write transactions on the bus that were allocated in the IO Queue from…
All WC from the Processor: The number of Write Combining memory transactions on the bus that originated from…
Bus Accesses from All Agents: The number of all bus transactions that were allocated in the IO Queue by all agents…
IA-32 Intel® Architecture Processor Family Overview1-21back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion.
IA-32 Intel® Architecture OptimizationB-36Bus Reads Underway from the processor7 This is an accrued sum of the durations of all read (includes RFOs) t
Using Performance Monitoring Events BB-37All UC Underway from the processor7 This is an accrued sum of the durations of all UC transactions by this pr
IA-32 Intel® Architecture OptimizationB-38Bus Writes Underway from the processor7 This is an accrued sum of the durations of all write transactions by
Using Performance Monitoring Events BB-39Write WC Full (BSQ)The number of write (but neither writeback nor RFO) transactions to WC-type memory. BSQ_al
IA-32 Intel® Architecture OptimizationB-40Reads Non-prefetch Full (BSQ) The number of read (excludes RFOs and HW|SW prefetches) transactions to WB-typ
Using Performance Monitoring Events BB-41UC Write Partial (BSQ) The number of UC write transactions. Beware of granularity issues between BSQ and FSB
IA-32 Intel® Architecture OptimizationB-42WB Writes Full Underway (BSQ)8 This is an accrued sum of the durations of writeback (evicted from cache) tra
Using Performance Monitoring Events BB-43Write WC Partial Underway (BSQ)8This is an accrued sum of the durations of partial write transactions to WC-t
IA-32 Intel® Architecture OptimizationB-44SSE Input AssistsThe number of occurrences of SSE/SSE2 floating-point operations needing assistance to handl
Using Performance Monitoring Events BB-451. A memory reference causing 64K aliasing conflict can be counted more than once in this stat. The resulting
IA-32 Intel® Architecture Optimization1-22• avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal
IA-32 Intel® Architecture Optimization B-46
4. Most commonly used x87 instructions (e.g., fmul, fadd, fdiv, fsqrt, fstp, etc.) decode into a single µop.
Using Performance Monitoring Events BB-47Table B-2 Metrics That Utilize Replay Tagging MechanismReplay Metric Tags1Bit field to set:IA32_PEBS_ENABLE B
IA-32 Intel® Architecture OptimizationB-48Tags for front_end_eventTable B-3 provides a list of the tags that are used by various metrics derived from
Using Performance Monitoring Events BB-49Table B-4 Metrics That Utilize the Execution Tagging MechanismExecution Metric Tags Upstream ESCRTag Value in
IA-32 Intel® Architecture OptimizationB-50Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3)Using Performance Metrics with Hyper-Threa
Using Performance Monitoring Events B-51
The performance metrics listed in Table B-1 fall into three categories:
• Logical processor specific and suppo
IA-32 Intel® Architecture OptimizationB-52Branching Metrics Branches RetiredTagged Mispredicted Branches RetiredMispredicted Branches RetiredAll retur
Using Performance Monitoring Events BB-53Memory Metrics Split Load Replays1Split Store Replays1MOB Load Replays164k Aliasing Conflicts1st-Level Cache
IA-32 Intel® Architecture OptimizationB-54Bus Metrics Bus Accesses from the Processor1Non-prefetch Bus Accesses from the Processor1Reads from the Proc
Using Performance Monitoring Events BB-55Characterization Metrics x87 Input Assistsx87 Output AssistsMachine Clear CountMemory Order Machine ClearSelf
IA-32 Intel® Architecture Processor Family Overview1-23Hardware prefetching for Pentium 4 processor has the following characteristics:• works with exi
IA-32 Intel® Architecture OptimizationB-56Using Performance Events of Intel Core Solo and Intel Core Duo processorsThere are performance events specif
Using Performance Monitoring Events BB-57There are three cycle-counting events which will not progress on a halted core, even if the halted core is be
IA-32 Intel® Architecture OptimizationB-58• Some events, such as writebacks, may have non-deterministic behavior for different runs. In such a case, o
Using Performance Monitoring Events BB-59• Serial_Execution_Cycles, event number 3C, unit mask 02HThis event counts the bus cycles during which the co
C-1 Appendix C: IA-32 Instruction Latency and Throughput
This appendix contains tables of the latency, throughput, and execution units that are associated with more
IA-32 Intel® Architecture OptimizationC-2OverviewThe current generation of IA-32 family of processors use out-of-order execution with dynamic scheduli
IA-32 Instruction Latency and Throughput CC-3While several items on the above list involve selecting the right instruction, this appendix focuses on t
IA-32 Intel® Architecture OptimizationC-4DefinitionsThe IA-32 instruction performance data are listed in several tables. The tables contain the follow
IA-32 Instruction Latency and Throughput CC-5accurately predict realistic performance of actual code sequences based on adding instruction latency dat
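The caveat above can be illustrated with a hedged C sketch (not from the appendix; function names are illustrative): a single-accumulator sum forms one dependence chain bounded by add latency, while four accumulators expose independent adds bounded only by throughput, so simply adding per-instruction latencies would mispredict the second version.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous one, so the loop
 * is bound by add latency. */
static long sum1(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the adds in one iteration are independent and the
 * out-of-order core can overlap them, so the loop is bound by add
 * throughput instead. Both functions compute the same sum. */
static long sum4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```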
IA-32 Intel® Architecture Optimization1-24Thus, software optimization of a data access pattern should emphasize tuning for hardware prefetch first to
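Tuning for the hardware prefetcher mostly means arranging accesses into ascending, regular strides. A hedged C sketch (illustrative, not from the manual) contrasts two traversals of the same array:

```c
#include <stddef.h>

enum { ROWS = 64, COLS = 64 };

/* Unit-stride, ascending traversal: the regular pattern the hardware
 * prefetcher recognizes and runs ahead of. */
static long sum_row_major(int m[ROWS][COLS])
{
    long s = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += m[r][c];
    return s;
}

/* Same result, but each access strides by COLS * sizeof(int) bytes;
 * large strides defeat the prefetcher and touch a new cache line on
 * almost every access when a row exceeds the line size. */
static long sum_col_major(int m[ROWS][COLS])
{
    long s = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += m[r][c];
    return s;
}
```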
IA-32 Intel® Architecture OptimizationC-6 Latency and Throughput with Register OperandsIA-32 instruction latency and throughput data are presented in
IA-32 Instruction Latency and Throughput CC-7Table C-2 Streaming SIMD Extension 2 128-bit Integer InstructionsInstruction Latency1ThroughputExecution
IA-32 Intel® Architecture Optimization C-8
PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm 2 2 1 2 2 1 MMX_ALU
PEXTRW r32, xmm, imm8 7 7 3 2 2 2 MMX_SHFT, FP_MISC
PINSRW x
IA-32 Instruction Latency and Throughput C-9
PSUBB/PSUBW/PSUBD xmm, xmm 2 2 1 2 2 1 MMX_ALU
PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm 2 2 1 2 2 1 MMX_A
IA-32 Intel® Architecture Optimization C-10
COMISD xmm, xmm 7 6 1 2 2 1 FP_ADD, FP_MISC
CVTDQ2PD xmm, xmm 8 8 4+1 3 3 4 FP_ADD, MMX_SHFT
CVTPD2PI mm, xmm
IA-32 Instruction Latency and Throughput C-11
DIVPD xmm, xmm 70 69 32+31 70 69 62 FP_DIV
DIVSD xmm, xmm 39 38 32 39 38 31 FP_DIV
MAXPD xmm, xmm 5 4 4 2
IA-32 Intel® Architecture OptimizationC-12Table C-4 Streaming SIMD Extension Single-precision Floating-point Instructions Instruction Latency1Throughp
IA-32 Instruction Latency and Throughput C-13
MOVLHPS3 xmm, xmm 4 4 2 2 MMX_SHFT
MOVMSKPS r32, xmm 6 6 2 2 FP_MISC
MOVSS xmm, xmm 4 4 2 2 MMX_SHFT
MOVUPS x
IA-32 Intel® Architecture OptimizationC-14 Table C-5 Streaming SIMD Extension 64-bit Integer Instructions Instruction Latency1Throughput Execution Uni
IA-32 Instruction Latency and Throughput C-15
PCMPGTB/PCMPGTD/PCMPGTW mm, mm 2 2 1 1 MMX_ALU
PMADDWD3 mm, mm 9 8 1 1 FP_MUL
PMULHW/PMULLW3 mm, mm 9 8 1 1 FP_M
IA-32 Intel® Architecture Processor Family Overview1-25Reordering loads with respect to each other can prevent a load miss from stalling later loads.
IA-32 Intel® Architecture OptimizationC-16Table C-7 IA-32 x87 Floating-point Instructions Instruction Latency1ThroughputExecution Unit2CPUID 0F3n 0F2n
IA-32 Instruction Latency and Throughput C-17
FSCALE4 60 7
FRNDINT4 30 11
FXCH5 0 1 FP_MOVE
FLDZ6 0
FINCSTP/FDECSTP6 0
See “Table Footnotes”
Table C-8 IA-32 Gener
IA-32 Intel® Architecture Optimization C-18
Jcc7 Not Applicable 0.5 ALU
LOOP 8 1.5 ALU
MOV 1 0.5 0.5 0.5 ALU
MOVSB/MOVSW 1 0.5 0.5 0.5 ALU
MOVZB/MOVZW 1 0.5
IA-32 Instruction Latency and Throughput CC-19Table FootnotesThe following footnotes refer to all tables in this appendix.1. Latency information for m
IA-32 Intel® Architecture OptimizationC-204. Latency and Throughput of transcendental instructions can vary substantially in a dynamic execution envir
IA-32 Instruction Latency and Throughput CC-21For the sake of simplicity, all data being requested is assumed to reside in the first level data cache
D-1 Appendix D: Stack Alignment
This appendix details the alignment of stacks of data for Streaming SIMD Extensions and Streaming SIMD Extensions 2.
Stack Fr
IA-32 Intel® Architecture OptimizationD-2alignment for __m64 and double type data by enforcing that these 64-bit data items are at least eight-byte al
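At the source level, C11 offers a portable way to request these alignments explicitly; this is a hedged sketch (the compiler keyword and struct name are illustrative stand-ins, not the manual's mechanism):

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 _Alignas requests the alignment the compiler must enforce:
 * at least 8 bytes for 64-bit data such as double, and 16 bytes for
 * data used with 128-bit SSE loads/stores (e.g. MOVAPS). */
typedef struct {
    _Alignas(16) float v[4];   /* 16-byte aligned: safe for aligned SSE moves */
    _Alignas(8)  double d;     /* 8-byte aligned 64-bit datum */
} aligned_block;

/* Returns 1 when an instance satisfies both alignment requirements. */
static int check_alignment(const aligned_block *b)
{
    return ((uintptr_t)b->v % 16 == 0) && ((uintptr_t)&b->d % 8 == 0);
}
```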
Stack Alignment DD-3As an optimization, an alternate entry point can be created that can be called when proper stack alignment is provided by the call
IA-32 Intel® Architecture Optimization1-26Intel® Pentium® M Processor MicroarchitectureLike the Intel NetBurst microarchitecture, the pipeline of the
Stack Alignment D-4
Example D-1 in the following section illustrates this technique. Note the entry points foo and foo.aligned; the latter is the alte
Stack Alignment D-5
Example D-1 Aligned esp-Based Stack Frames
void _cdecl foo (int k)
{
  int j;
  foo:            // See Note A
    push
Stack Alignment DD-6Aligned ebp-Based Stack FramesIn ebp-based frames, padding is also inserted immediately before the return address. However, this f
Stack Alignment D-7
Example D-2 Aligned ebp-based Stack Frames
void _stdcall foo (int k)
{
  int j;
  foo:
    push ebx
    mov ebx, esp
    sub esp, 0x00000008
    and esp, 0
Stack Alignment D-8
    // the goal is to make esp and ebp
    // (0 mod 16) here
    j = k;
    mov edx, [ebx + 8]  // k is (0 mod 16) if caller aligned its stack
    mov [
Stack Alignment DD-9Stack Frame OptimizationsThe Intel C++ Compiler provides certain optimizations that may improve the way aligned frames are set up
IA-32 Intel® Architecture OptimizationD-10Inlined Assembly and ebxWhen using aligned frames, the ebx register generally should not be modified in inli
E-1 Appendix E: Mathematics of Prefetch Scheduling Distance
This appendix discusses how far ahead of their use prefetch instructions should be inserted. It presents a mathematical model
IA-32 Intel® Architecture Optimization E-2
Ninst is the number of instructions in the scope of one loop iteration. Consider the following example of a heu
Mathematics of Prefetch Scheduling Distance E-3
Tb: data transfer latency, which is equal to (number of lines per iteration) * (line burst latency). Note that
IA-32 Intel® Architecture Processor Family Overview1-27The Intel Pentium M processor microarchitecture is designed for lower power consumption. There
IA-32 Intel® Architecture OptimizationE-4Memory access plays a pivotal role in prefetch scheduling. For more understanding of a memory subsystem, cons
Mathematics of Prefetch Scheduling Distance EE-5Tl varies dynamically and is also system hardware-dependent. The static variants include the core-to-f
IA-32 Intel® Architecture OptimizationE-6No Preloading or PrefetchThe traditional programming approach does not perform data preloading or prefetch. I
Mathematics of Prefetch Scheduling Distance EE-7The iteration latency is approximately equal to the computation latency plus the memory leadoff latenc
IA-32 Intel® Architecture Optimization E-8
The following formula shows the relationship among the parameters:
il = Tc + Tl + Tb
It can be seen from this relationship that
Mathematics of Prefetch Scheduling Distance EE-9For this particular example the prefetch scheduling distance is greater than 1. Data being prefetched
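When the scheduling distance exceeds 1, each iteration issues the prefetch for data needed that many iterations later. A hedged C sketch (not the manual's code; __builtin_prefetch is a GCC/Clang intrinsic standing in for prefetcht0, and the constants are illustrative):

```c
#include <stddef.h>

#define PSD       2   /* prefetch scheduling distance, in cache lines */
#define LINE_INTS 16  /* 64-byte line / 4-byte int */

/* Each outer iteration prefetches the line needed PSD iterations ahead,
 * so its data is in flight while the current line is summed. */
static long sum_with_prefetch(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i += LINE_INTS) {
        if (i + PSD * LINE_INTS < n)
            __builtin_prefetch(&a[i + PSD * LINE_INTS], 0, 3);
        for (size_t j = i; j < i + LINE_INTS && j < n; j++)
            s += a[j];
    }
    return s;
}
```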
IA-32 Intel® Architecture OptimizationE-10Memory Throughput Bound (Case: Tb >= Tc)When the application or loop is memory throughput bound, the memo
Mathematics of Prefetch Scheduling Distance E-11
memory throughput, you cannot do much about it. Typically, data copy from one space to another space, for exam
IA-32 Intel® Architecture OptimizationE-12Now for the case Tl =18, Tb =8 (2 cache lines are needed per iteration) examine the following graph. Conside
Mathematics of Prefetch Scheduling Distance EE-13In reality, the front-side bus (FSB) pipelining depth is limited, that is, only four transactions are
IA-32 Intel® Architecture Optimization1-28The fetch and decode unit includes a hardware instruction prefetcher and three decoders that enable parallel
IA-32 Intel® Architecture Processor Family Overview1-29• Micro-ops (µops) fusion. Some of the most frequent pairs of µops derived from the same instru
IA-32 Intel® Architecture Optimization1-30Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entri
IA-32 Intel® Architecture Processor Family Overview1-31In-Order RetirementThe retirement unit in the Pentium M processor buffers completed µops is the
IA-32 Intel® Architecture Optimization1-32• Power-optimized busThe system bus is optimized for power efficiency; increased bus speed supports 667 MHz.
IA-32 Intel® Architecture Processor Family Overview1-33Data PrefetchingIntel Core Solo and Intel Core Duo processors provide hardware mechanisms to pr
IA-32 Intel® Architecture Optimization1-34The two logical processors each have a complete set of architectural registers while sharing one single phys
IA-32 Intel® Architecture Processor Family Overview1-35In the first implementation of HT Technology, the physical execution resources are shared and t
IA-32 Intel® Architecture Optimization1-36Processor Resources and Hyper-Threading TechnologyThe majority of microarchitecture resources in a physical
IA-32 Intel® Architecture Processor Family Overview1-37For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a lo
IA-32 Intel® Architecture Optimization1-38Microarchitecture Pipeline and Hyper-Threading TechnologyThis section describes the HT Technology microarchi
IA-32 Intel® Architecture Processor Family Overview1-39Execution CoreThe core can dispatch up to six µops per cycle, provided the µops are ready to ex
IA-32 Intel® Architecture Optimization1-40Pentium Processor Extreme Edition provide four logical processors in a physical package that has two executi
IA-32 Intel® Architecture Processor Family Overview1-41Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo ProcessorS
IA-32 Intel® Architecture Optimization1-42Microarchitecture Pipeline and Multi-Core ProcessorsIn general, each core in a multi-core processor resemble
IA-32 Intel® Architecture Processor Family Overview1-43that the cache line that contains the memory location is owned by the first-level data cache of
IA-32 Intel® Architecture Optimization1-44when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple c
2-1 Chapter 2: General Optimization Guidelines
This chapter discusses general optimization techniques that can improve the performance of applications running on
IA-32 Intel® Architecture Optimization2-2The following sections describe practices, tools, coding rules and recommendations associated with these fact
General Optimization Guidelines 22-3* Streaming SIMD Extensions (SSE)** Streaming SIMD Extensions 2 (SSE2)General Practices and Coding GuidelinesThi
IA-32 Intel® Architecture Optimization2-4Use Available Performance Tools• Current-generation compiler, such as the Intel C++ Compiler:— Set this compi
General Optimization Guidelines 22-5Optimize Branch Predictability• Improve branch predictability and optimize instruction prefetching by arranging co
IA-32 Intel® Architecture Optimization2-6• Minimize use of global variables and pointers.• Use the const modifier; use the static modifier for global
General Optimization Guidelines 22-7• Avoid longer latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e
IA-32 Intel® Architecture Optimization2-8• Avoid the use of conditionals.• Keep induction (loop) variable expressions simple.• Avoid using pointers, t
General Optimization Guidelines 22-9Performance ToolsIntel offers several tools that can facilitate optimizing your application’s performance.Intel® C
IA-32 Intel® Architecture Optimization2-10General Compiler RecommendationsA compiler that has been extensively tuned for the target microarchitec-ture
General Optimization Guidelines 22-11The VTune Performance Analyzer also enables engineers to use these counters to measure a number of workload chara
IA-32 Intel® Architecture Optimization2-12Intel Core Solo and Intel Core Duo processors have enhanced front end that is less sensitive to the 4-1-1 te
General Optimization Guidelines 22-13• On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cac
IA-32 Intel® Architecture Optimization2-14Transparent Cache-Parameter StrategyIf CPUID instruction supports function leaf 4, also known as determinist
General Optimization Guidelines 22-15Branch PredictionBranch optimizations have a significant impact on performance. By understanding the flow of bran
IA-32 Intel® Architecture Optimization2-16Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and
General Optimization Guidelines 22-17See Example 2-2. The optimized code sets ebx to zero, then compares A and B. If A is greater than or equal to B,
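The flag-based selection described for Example 2-2 corresponds at the C level to a simple ternary, which compilers can implement with cmov/setcc instead of a conditional jump. A hedged sketch (the function name is illustrative):

```c
/* Branchless form of "if (a >= b) r = x; else r = y;": the compiler can
 * emit cmov/setcc for the ternary rather than a conditional jump,
 * removing the misprediction penalty when the comparison is
 * unpredictable. */
static int select_ge(int a, int b, int x, int y)
{
    return (a >= b) ? x : y;
}
```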
IA-32 Intel® Architecture Optimization2-18The cmov and fcmov instructions are available on the Pentium II and subsequent processors, but not on Pentiu
General Optimization Guidelines 22-19Static PredictionBranches that do not have a history in the BTB (see the “Branch Prediction” section) are predict
IA-32 Intel® Architecture Optimization2-20Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static bra
General Optimization Guidelines 2-21
Example 2-6 and Example 2-7 provide basic rules for a static prediction algorithm. In Example 2-6, the backward bran
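Since the static rule predicts forward conditional branches not-taken, the likely code should sit on the fall-through path. A hedged C sketch (__builtin_expect is a GCC/Clang extension, not part of the manual; names are illustrative):

```c
/* Static prediction assumes a forward branch is not taken, so the common
 * case should fall through. __builtin_expect (GCC/Clang) is one way to
 * tell the compiler which arm belongs on the fall-through path. */
static int process(int x)
{
    if (__builtin_expect(x < 0, 0)) {
        return -1;      /* rare error path, laid out out of line */
    }
    return x * 2;       /* common path falls through */
}
```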
IA-32 Intel® Architecture Optimization2-22Inlining, Calls and ReturnsThe return address stack mechanism augments the static and dynamic predictors to
General Optimization Guidelines 22-23Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the work
IA-32 Intel® Architecture Optimization2-24Placing data immediately following an indirect branch can cause a performance problem. If the data consist o
General Optimization Guidelines 22-25indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those tar
IA-32 Intel® Architecture Optimization2-26best performance from a coding effort. An example of peeling out the most favored target of an indirect bran
General Optimization Guidelines 22-27• The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations