IA-32 Intel® Architecture Optimization

6-26

lines of data per iteration. The PSD would need to be increased or decreased if more or fewer than two cache lines are used per iteration.

Software Prefetch Concatenation

Maximum performance is achieved when the execution pipeline runs at maximum throughput without incurring any memory latency penalties. This can be done by prefetching data to be used in successive iterations of a loop. De-pipelining memory generates bubbles in the execution pipeline. To explain this performance issue, a 3D geometry pipeline that processes 3D vertices in strip format is used as an example. A strip contains a list of vertices whose predefined vertex order forms contiguous triangles. The memory pipe is de-pipelined on the strip boundary due to ineffective prefetch arrangement: the execution pipeline is stalled for the first two iterations of each strip. As a result, the average latency for completing an iteration will be 165(FIX) clocks. (See Appendix E, “Mathematics of Prefetch Scheduling Distance”, for a detailed memory pipeline description.)
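The idea of keeping the memory pipe busy across strip boundaries can be sketched in C. This is an illustrative sketch only, not the manual's own listing: the `Vertex` and `Strip` layouts, the look-ahead value `STRIP_PSD`, and the helper names are all assumptions, and `__builtin_prefetch` is a GCC/Clang builtin rather than anything defined by this manual. When the look-ahead runs past the end of the current strip, the prefetch is redirected to the start of the next strip, so the first iterations of that strip find their data already on the way.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vertex;          /* hypothetical vertex layout */
typedef struct { const Vertex *v; size_t n; } Strip;  /* hypothetical strip layout */

enum { STRIP_PSD = 2 };  /* assumed look-ahead, in loop iterations */

/* Returns the vertex to prefetch while vertex j of strip s is processed,
 * or NULL when no data is left. Requires s < n_strips and j < strips[s].n.
 * A look-ahead that passes the end of a strip spills into the strips after it. */
const Vertex *prefetch_target(const Strip *strips, size_t n_strips,
                              size_t s, size_t j)
{
    size_t k = j + STRIP_PSD;
    while (s < n_strips && k >= strips[s].n) {
        k -= strips[s].n;   /* skip the rest of this strip */
        ++s;                /* continue in the next strip */
    }
    return s < n_strips ? &strips[s].v[k] : NULL;
}

/* Walks all strips; the sum of x coordinates stands in for the real
 * per-vertex work, which is not specified here. */
float process_strips(const Strip *strips, size_t n_strips)
{
    float acc = 0.0f;
    for (size_t s = 0; s < n_strips; ++s) {
        for (size_t j = 0; j < strips[s].n; ++j) {
            const Vertex *next = prefetch_target(strips, n_strips, s, j);
            if (next)
                __builtin_prefetch(next, 0, 0);  /* read, no temporal locality */
            acc += strips[s].v[j].x;
        }
    }
    return acc;
}
```

With this arrangement only the first `STRIP_PSD` vertices of the very first strip are processed without a preceding prefetch; every later strip boundary is covered by prefetches issued at the tail of the previous strip.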

Example 6-3 Prefetch Scheduling Distance

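top_loop:
    prefetchnta [edx + esi + 128*3]      ; stream 1: prefetch 3 iterations (3 * 128 bytes) ahead
    prefetchnta [edx*4 + esi + 128*3]    ; stream 2: same look-ahead distance
    . . . . .
    movaps xmm1, [edx + esi]             ; load the data for the current iteration
    movaps xmm2, [edx*4 + esi]
    movaps xmm3, [edx + esi + 16]
    movaps xmm4, [edx*4 + esi + 16]
    . . . . .
    add esi, 128                         ; advance 128 bytes per iteration
    cmp esi, ecx
    jl top_loop                          ; repeat while esi < ecx (signed compare)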
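For readers working in C rather than assembly, the loop in Example 6-3 can be approximated as follows. This is a minimal sketch under stated assumptions: the function name and the summing kernel are hypothetical stand-ins for the elided computation, `float` is assumed to be 4 bytes, and `__builtin_prefetch` is a GCC/Clang builtin (with locality 0 it typically compiles to `prefetchnta` on x86, but that is up to the compiler). The constants mirror the listing: 128 bytes per stream per iteration and a look-ahead of three iterations.

```c
#include <assert.h>
#include <stddef.h>

enum {
    FLOATS_PER_ITER = 128 / sizeof(float), /* 128 bytes per stream per iteration */
    PSD = 3                                /* look-ahead in iterations, as in 128*3 */
};

/* Hypothetical kernel: sums every element of two equally long streams,
 * prefetching PSD iterations ahead on both streams. */
float sum_two_streams(const float *a, const float *b, size_t n)
{
    assert(n % FLOATS_PER_ITER == 0);  /* this sketch handles whole iterations only */
    const size_t ahead = (size_t)PSD * FLOATS_PER_ITER;
    float sum = 0.0f;

    for (size_t i = 0; i < n; i += FLOATS_PER_ITER) {
        /* The assembly prefetches unconditionally. The guard here only
         * avoids forming a pointer beyond the end of the arrays in C. */
        if (i + ahead < n) {
            __builtin_prefetch(a + i + ahead, 0, 0);  /* read, no temporal locality */
            __builtin_prefetch(b + i + ahead, 0, 0);
        }
        for (size_t k = 0; k < FLOATS_PER_ITER; ++k)
            sum += a[i + k] + b[i + k];
    }
    return sum;
}
```

If an iteration touched more or fewer than two cache lines, `PSD` would be adjusted accordingly, in line with the remark on the PSD above.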
