Intel ARCHITECTURE IA-32 User Manual Page 514
IA-32 Intel® Architecture Optimization
B-60
IA-32 Intel® Architecture
1
Optimization Reference
1
Contents
3
Appendix D Stack Alignment
14
Examples
15
Introduction
23
Tuning Your Application
24
About This Manual
24
Related Documentation
27
Notational Conventions
28
IA-32 Intel
29
Architecture
29
Processor Family Overview
29
SIMD Technology
30
Y4 Y3 Y2 Y1
31
OP OP OP OP
31
• inherently parallel
32
Summary of SIMD Technologies
33
Streaming SIMD Extensions 2
34
Streaming SIMD Extensions 3
34
Intel NetBurst
36
Microarchitecture
36
Frequently used paths
38
Less frequently used paths
38
The Front End
39
The Out-of-order Core
40
Retirement
40
Front End Pipeline Detail
41
Execution Trace Cache
42
Branch Prediction
43
Execution Core Detail
44
Data Prefetch
49
• multiple outstanding misses
52
• buffering of writes
52
Pentium
54
• fetch/decode unit
55
• instruction cache
55
Data Prefetching
57
Out-of-Order Core
58
Microarchitecture of Intel
59
Core™ Solo and
59
Core™ Duo Processors
59
• Power-optimized bus
60
• Data Prefetch
60
• Micro-op fusion
60
• operational fairness
64
Shared Resources
65
Front End Pipeline
66
Multi-Core Processors
67
Load and Store Operations
70
General Optimization
73
Optimize Memory Access
77
Enable Vectorization
79
Performance Tools
81
VTune™ Performance Analyzer
82
Processor Perspectives
83
A and B. If the condition is
88
Spin-Wait and Idle Loops
90
Static Prediction
91
Inlining, Calls and Returns
94
Branch Type Selection
95
Loop Unrolling
98
• inlining where appropriate
100
Memory Accesses
101
Line 029e7100h
103
Line 029e70c0h
103
Line 029e7140h
103
Store Forwarding
104
Alignment
105
Figure 2-2
106
Example 2-14
108
• parameter passing
110
Data Layout Optimizations
111
Stack Alignment
114
Aliasing Cases in the Pentium
117
4 and Intel
117
Processors
117
Mixing Code and Data
119
Write Combining
120
Locality Enhancement
122
Minimizing Bus Latency
124
• software prefetch for data
127
Cacheability Instructions
128
Applications
129
• arithmetic overflow
132
• arithmetic underflow
132
• denormalized operand
132
Floating-point Modes
134
Core Duo Processors
142
Memory Operands
143
Floating-Point Stalls
144
Instruction Selection
145
Complex Instructions
146
Use of the lea Instruction
146
Flag Register Accesses
147
Integer Divide
148
Alternate Sequence without
150
Partial Register Stall
150
• Operand size prefix (0x66)
152
• Address size prefix (0x67)
152
REP Prefix and Data Movement
153
• Throughput per iteration:
154
• Address alignment:
154
• Cache eviction:
155
Destination
157
• immediate constant
158
• base register
158
• scaled index register
158
Clearing Registers
159
Compares
159
Floating Point/SIMD Operands
160
Prolog Sequences
162
Instruction Scheduling
163
Spill Scheduling
164
Vectorization
165
• avoid global pointers
166
• avoid global variables
166
Miscellaneous
167
User/Source Coding Rules
169
PUSH, CALL, RET). 2-84
179
Tuning Suggestions
180
Coding for SIMD
181
Architectures
181
Technologies
182
bool OSSupportCheck() {
184
Programming
188
Identifying Hot Spots
190
Coding Techniques
192
Coding Methodologies
193
Assembly
195
Intrinsics
195
+”, “>>”)
197
Automatic Vectorization
198
Stack and Data Alignment
200
__m128* datatypes
202
__m128*
203
Compiler-Supported Alignment
204
Improving Memory Utilization
207
SoA Data Structure
208
Strip Mining
212
Example 3-19 Strip Mined Code
213
Loop Blocking
214
Example 3-20 Loop Blocking
215
A. Original Loop
215
Blocking
216
Tuning the Final Application
219
Optimizing for SIMD Integer
221
Using the EMMS Instruction
223
Data Alignment
226
Signed Unpack
227
MM/M64 mm
229
Non-Interleaved Unpack
231
Extract Word
233
Insert Word
234
Figure 4-6 pinsrw Instruction
235
Move Byte Mask to Integer
236
55 47 39 23 15 7
237
X4 X3 X2 X1
238
X1 X2 X3 X4
238
Generating Constants
241
Building Blocks
243
Absolute Value
245
0x8000800080008000
246
Highly Efficient Clipping
247
Signed Word
249
Packed Multiply High Unsigned
250
Packed Average (Byte/Word)
251
Packed 32*32 Multiply
253
Packed 64-bit Add/Subtract
253
128-bit Shifts
253
Memory Optimizations
254
Partial Memory Accesses
255
Instruction
259
Optimizing for SIMD
263
Floating-point Applications
263
Planning Considerations
264
Scalar Floating-point Code
265
Data Swizzling
271
Example 5-3 Swizzling Data
272
Data Deswizzling
276
Instructions
277
Instructions (continued)
278
Functions
279
Horizontal ADD Using SSE
280
C1 C2 C3 C4 D1 D2 D3 D4
281
C1 C2 D1 D2 C3 C4 D3 D4
281
MXCSR register should be
283
SSE3 and Complex Arithmetics
285
Optimizing Cache Usage
291
Optimizing Cache Usage 6
293
Hardware Prefetching of Data
294
Prefetch
296
Implementation
298
Cacheability Control
299
Fencing
300
Streaming Non-temporal Stores
300
WB) or Write-Through (WT)
301
WC semantics)
301
Write-Combining
302
Streaming Store Usage Models
303
• hand-crafted code
305
The lfence Instruction
306
The mfence Instruction
306
The clflush Instruction
307
Software-controlled Prefetch
308
Hardware Prefetch
309
Constant Stride
311
Non-Adjacent Passes Loops
326
60 invis
332
• write-once (non-temporal)
333
Cache Management
334
Video Encoder
335
Video Decoder
335
• alignment of data
337
• cache size
337
Bit Location Name Meaning
344
• Determine prefetch stride
345
Parameters
346
Multi-Core and
347
Hyper-Threading Technology
347
Performance and Usage Models
348
Single Thread
349
Multi-Thread on MP
349
Multitasking Environment
350
• workload
352
• thread interaction
352
• hardware utilization
352
• domain decomposition
353
• functional decomposition
353
Functional Decomposition
354
P(1)P(1) C(1)C(1)P(1)
355
P: producer
356
C: consumer
356
Thread 0
358
Thread 1
358
Optimization Guidelines
362
Thread Synchronization
365
Optimization with Spin-Locks
371
PAUSE instruction in the
372
Example 7-5
373
System Bus Optimization
379
Conserve Bus Bandwidth
380
Memory Optimization
384
Shared-Memory Optimization
385
4 KB in each thread
388
Per-thread Stack Offset
390
Per-instance Stack Offset
392
Front-end Optimization
394
Resources
395
Processor
397
Processor (Contd.)
398
Sharing the Same Cache
401
64-bit Mode Coding
409
Guidelines
409
Only When Necessary
410
Assembly/Compiler Coding Rule
411
64-Bit Arithmetic
412
Assembly/Compiler Coding Rule
413
Possible
414
Using Software Prefetch
414
Power Optimization for
415
Mobile Usages
415
Mobile Usage Scenarios
416
ACPI C-States
418
Reducing Amount of Work
423
• Switch off unused devices
424
Technology
426
Enabling Intel
428
Enhanced Deeper Sleep
428
Multi-Core Considerations
429
(C1-C4)
433
Application Performance
435
Compilers
436
Code Optimization Options
437
Vectorizer Switch Options
439
Multithreading with OpenMP*
440
VTune™ Performance Analyzer
442
Sampling
443
Event-based Sampling
444
Workload Characterization
445
Call Graph
447
Performance Libraries
448
Benefits Summary
449
Optimizations with the Intel
450
Enhanced Debugger (EDB)
451
Threading Tools
451
Thread Profiler
453
Software College
454
Using Performance Monitoring
455
Bogus, Non-bogus, Retire
456
Bus Ratio
456
Counting Clocks
458
Non-Halted Clockticks
459
Non-Sleep Clockticks
460
Time Stamp Counter
461
Microarchitecture Notes
462
Side Bus
464
Reads due to program loads
465
Writebacks (dirty evictions)
466
Usage Notes on Bus Activities
469
Tags for replay_event
500
Tags for front_end_event
502
Tags for execution_event
502
Technology
504
Parallel Counting
505
Parallel Counting (continued)
506
Intel Core Duo processors
510
Ratio Interpretation
511
Notes on Selected Events
512
Throughput
515
Overview
516
PADDQ and PMULUDQ, each have
517
Definitions
518
Latency and Throughput
518
See “Table Footnotes”
520
Instructions (continued)
524
Table Footnotes
533
Stack Alignment D
539
& 0x0f) == 0x08
542
Stack Frame Optimizations
545
Inlined Assembly and ebx
546
Mathematics of Prefetch
547
Scheduling Distance
547
Mathematical Model for PSD
548
L2 lookup miss latency
550
• Optimize T
551
No Preloading or Prefetch
552
Front-Side Bus
553
Execution pipeline
553
Execution cycles
553
Compute Bound (Case: T
554
>= T
556
INTEL SALES OFFICES
567