Intel ARCHITECTURE IA-32 User Manual Page 316

IA-32 Intel® Architecture Optimization
lines of data per iteration. The prefetch scheduling distance (PSD)
would need to be increased or decreased if more or fewer than two
cache lines are used per iteration.
Software Prefetch Concatenation
Maximum performance is achieved when the execution pipeline runs at
maximum throughput without incurring any memory latency penalties.
This can be achieved by prefetching the data to be used in successive
iterations of a loop. De-pipelining memory accesses generates bubbles
in the execution pipeline. To illustrate this performance issue,
consider a 3D geometry pipeline that processes 3D vertices in strip
format. A strip contains a list of vertices whose predefined vertex
order forms contiguous triangles. With an ineffective prefetch
arrangement, the memory pipe is de-pipelined at each strip boundary,
and the execution pipeline is stalled for the first two iterations of
each strip. As a result, the average latency for completing an
iteration will be 165(FIX) clocks. (See Appendix E, “Mathematics of
Prefetch Scheduling Distance”, for a detailed memory pipeline
description.)
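The stall at each strip boundary can be avoided by letting the prefetch stream run on into the next strip. The following C fragment is a minimal sketch of that idea, not the manual's own code. It assumes GCC or Clang (for `__builtin_prefetch`), a 64-byte cache line, and a prefetch distance of two lines; `sum_strips`, `LINE_FLOATS`, `PSD_LINES`, and the array-of-strips layout are hypothetical names chosen for illustration.

```c
#include <stddef.h>

#define LINE_FLOATS 16  /* assumption: 64-byte cache line / 4-byte float */
#define PSD_LINES   2   /* assumption: prefetch two cache lines ahead */

/* Sums all elements of all strips. strips[s] points to lens[s] floats
 * (hypothetical layout). Once per cache line, a prefetch is issued
 * PSD_LINES lines ahead; when that target falls past the end of the
 * current strip, it is redirected to the start of the next strip. */
float sum_strips(const float *const *strips, const size_t *lens,
                 size_t nstrips)
{
    float total = 0.0f;
    for (size_t s = 0; s < nstrips; s++) {
        const float *cur = strips[s];
        size_t n = lens[s];
        for (size_t i = 0; i < n; i++) {
            if (i % LINE_FLOATS == 0) {
                size_t ahead = i + (size_t)PSD_LINES * LINE_FLOATS;
                if (ahead < n) {
                    /* Target is still inside the current strip. */
                    __builtin_prefetch(cur + ahead, 0, 0);
                } else if (s + 1 < nstrips) {
                    /* Concatenate: carry the overshoot into the next
                     * strip, rounded down to a whole line. */
                    size_t off = (ahead - n) / LINE_FLOATS * LINE_FLOATS;
                    if (off < lens[s + 1])
                        __builtin_prefetch(strips[s + 1] + off, 0, 0);
                }
            }
            total += cur[i];
        }
    }
    return total;
}
```

Because a prefetch is only a hint, the result is the same with or without it; only the memory access timing is intended to change. Whether the redirected prefetches actually hide the boundary stall depends on the strip lengths and the real memory latency, so the distance should be tuned by measurement.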
Example 6-3 Prefetch Scheduling Distance
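top_loop:
    prefetchnta [edx + esi + 128*3]     ; first stream, 3 iterations ahead
    prefetchnta [edx*4 + esi + 128*3]   ; second stream, 3 iterations ahead
    . . . . .
    movaps xmm1, [edx + esi]            ; load current data, first stream
    movaps xmm2, [edx*4 + esi]          ; load current data, second stream
    movaps xmm3, [edx + esi + 16]       ; next 16 bytes, first stream
    movaps xmm4, [edx*4 + esi + 16]     ; next 16 bytes, second stream
    . . . . .
    . . . . .
    add esi, 128                        ; advance 128 bytes per iteration
    cmp esi, ecx                        ; ecx holds the end offset
    jl top_loop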
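For readers more comfortable with C, the loop structure of Example 6-3 can be approximated as below. This is a hedged sketch rather than a translation of the manual's code: the elided loop body is replaced by a dot product chosen only for illustration, `dot_prefetched`, `STRIDE_FLOATS`, and `PSD_ITERS` are hypothetical names, and `__builtin_prefetch` with locality 0 is a GCC/Clang builtin that is assumed here to stand in for `prefetchnta`.

```c
#include <stddef.h>

enum {
    STRIDE_FLOATS = 32, /* 128 bytes per iteration, as in "add esi, 128" */
    PSD_ITERS     = 3   /* prefetch 3 iterations ahead, as in "128*3" */
};

/* Dot product of two float streams, prefetching both streams
 * PSD_ITERS iterations ahead of the data currently being used. */
float dot_prefetched(const float *a, const float *b, size_t n)
{
    const size_t ahead = (size_t)STRIDE_FLOATS * PSD_ITERS;
    float acc = 0.0f;
    size_t i = 0;

    for (; i + STRIDE_FLOATS <= n; i += STRIDE_FLOATS) {
        /* Guard keeps the pointer arithmetic inside the arrays; the
         * assembly version issues the prefetch unconditionally. */
        if (i + ahead < n) {
            __builtin_prefetch(a + i + ahead, 0, 0);
            __builtin_prefetch(b + i + ahead, 0, 0);
        }
        for (size_t j = 0; j < STRIDE_FLOATS; j++)
            acc += a[i + j] * b[i + j];
    }
    /* Remaining elements when n is not a multiple of the stride. */
    for (; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```

Changing `PSD_ITERS` is the C-level equivalent of changing the `128*3` displacement in the assembly listing.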