AMD64 Technology

AMD64 Architecture Programmer's Manual
Volume 1: Application Programming

Publication No. 24592
Revision 3.09
Date September 2003

AMD64 Technology 24592--Rev. 3.09--September 2003

© 2002, 2003 Advanced Micro Devices, Inc. All rights reserved.

The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Trademarks

AMD, the AMD arrow logo, and combinations thereof, and 3DNow! are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc.

MMX is a trademark and Pentium is a registered trademark of Intel Corporation.

Windows NT is a registered trademark of Microsoft Corporation.

Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Contents

Figures
Tables
Revision History
Preface
    About This Book
    Audience
    Contact Information
    Organization
    Definitions
    Related Documents

1 Overview of the AMD64 Architecture
  1.1 Introduction
    New Features
    Registers
    Instruction Set
    Media Instructions
    Floating-Point Instructions
  1.2 Modes of Operation
    Long Mode
    64-Bit Mode
    Compatibility Mode
    Legacy Mode

2 Memory Model
  2.1 Memory Organization
    Virtual Memory
    Segment Registers
    Physical Memory
    Memory Management
  2.2 Memory Addressing
    Byte Ordering
    64-bit Canonical Addresses
    Effective Addresses
    Address-Size Prefix
    RIP-Relative Addressing
  2.3 Pointers
    Near and Far Pointers
  2.4 Stack Operation
  2.5 Instruction Pointer

3 General-Purpose Programming
  3.1 Registers
    Legacy Registers
    64-Bit-Mode Registers
    Implicit Uses of GPRs
    Flags Register
    Instruction Pointer Register
  3.2 Operands
    Data Types
    Operand Sizes and Overrides
    Operand Addressing
    Data Alignment
  3.3 Instruction Summary
    Syntax
    Data Transfer
    Data Conversion
    Load Segment Registers
    Load Effective Address
    Arithmetic
    Rotate and Shift
    Compare and Test
    Logical
    String
    Control Transfer
    Flags
    Input/Output
    Semaphores
    Processor Information
    Cache and Memory Management
    No Operation
    System Calls
  3.4 General Rules for Instructions in 64-Bit Mode
    Address Size
    Canonical Address Format
    Branch-Displacement Size
    Operand Size
    High 32 Bits
    Invalid and Reassigned Instructions
    Instructions with 64-Bit Default Operand Size
  3.5 Instruction Prefixes
    Legacy Prefixes
    REX Prefixes
  3.6 Feature Detection
  3.7 Control Transfers
    Overview
    Privilege Levels
    Procedure Stack
    Jumps
    Procedure Calls
    Returning from Procedures
    System Calls
    General Considerations for Branching
    Branching in 64-Bit Mode
    Interrupts and Exceptions
  3.8 Input/Output
    I/O Addressing
    I/O Ordering
    Protected-Mode I/O
  3.9 Memory Optimization
    Accessing Memory
    Forcing Memory Order
    Caches
    Cache Operation
    Cache Pollution
    Cache-Control Instructions
  3.10 Performance Considerations
    Use Large Operand Sizes
    Use Short Instructions
    Align Data
    Avoid Branches
    Prefetch Data
    Keep Common Operands in Registers
    Avoid True Dependencies
    Avoid Store-to-Load Dependencies
    Optimize Stack Allocation
    Consider Repeat-Prefix Setup Time
    Replace GPR with Media Instructions
    Organize Data in Memory Blocks

4 128-Bit Media and Scientific Programming
  4.1 Overview
    Origins
    Compatibility
  4.2 Capabilities
    Types of Applications
    Integer Vector Operations
    Floating-Point Vector Operations
    Data Conversion and Reordering
    Block Operations
    Matrix and Special Arithmetic Operations
    Branch Removal
  4.3 Registers
    XMM Registers
    MXCSR Register
    Other Data Registers
    rFLAGS Registers
  4.4 Operands
    Data Types
    Operand Sizes and Overrides
    Operand Addressing
    Data Alignment
    Integer Data Types
    Floating-Point Data Types
    Floating-Point Number Representation
    Floating-Point Number Encodings
    Floating-Point Rounding
  4.5 Instruction Summary--Integer Instructions
    Syntax
    Data Transfer
    Data Conversion
    Data Reordering
    Arithmetic
    Shift
    Compare
    Logical
    Save and Restore State
  4.6 Instruction Summary--Floating-Point Instructions
    Syntax
    Data Transfer
    Data Conversion
    Data Reordering
    Arithmetic
    Compare
    Logical
  4.7 Instruction Effects on Flags
  4.8 Instruction Prefixes
    Supported Prefixes
    Special-Use and Reserved Prefixes
    Prefixes That Cause Exceptions
  4.9 Feature Detection
  4.10 Exceptions
    General-Purpose Exceptions
    SIMD Floating-Point Exception Causes
    SIMD Floating-Point Exception Priority
    SIMD Floating-Point Exception Masking
  4.11 Saving, Clearing, and Passing State
    Saving and Restoring State
    Parameter Passing
    Accessing Operands in MMX™ Registers
  4.12 Performance Considerations
    Use Small Operand Sizes
    Reorganize Data for Parallel Operations
    Remove Branches
    Use Streaming Stores
    Align Data
    Organize Data for Cacheability
    Prefetch Data
    Use 128-Bit Media Code for Moving Data
    Retain Intermediate Results in XMM Registers
    Replace GPR Code with 128-bit media Code
    Replace x87 Code with 128-Bit Media Code

5 64-Bit Media Programming
  5.1 Origins
  5.2 Compatibility
  5.3 Capabilities
    Parallel Operations
    Data Conversion and Reordering
    Matrix Operations
    Saturation
    Branch Removal
    Floating-Point (3DNow!™) Vector Operations
  5.4 Registers
    MMX™ Registers
    Other Registers
  5.5 Operands
    Data Types
    Operand Sizes and Overrides
    Operand Addressing
    Data Alignment
    Integer Data Types
    Floating-Point Data Types
  5.6 Instruction Summary--Integer Instructions
    Syntax
    Exit Media State
    Data Transfer
    Data Conversion
    Data Reordering
    Arithmetic
    Shift
    Compare
    Logical
    Save and Restore State
  5.7 Instruction Summary--Floating-Point Instructions
    Syntax
    Data Conversion
    Arithmetic
    Compare
  5.8 Instruction Effects on Flags
  5.9 Instruction Prefixes
    Supported Prefixes
    Special-Use and Reserved Prefixes
    Prefixes That Cause Exceptions
  5.10 Feature Detection
  5.11 Exceptions
    General-Purpose Exceptions
    x87 Floating-Point Exceptions (#MF)
  5.12 Actions Taken on Executing 64-Bit Media Instructions
  5.13 Mixing Media Code with x87 Code
    Mixing Code
    Clearing MMX™ State
  5.14 State-Saving
    Saving and Restoring State
    State-Saving Instructions
  5.15 Performance Considerations
    Use Small Operand Sizes
    Reorganize Data for Parallel Operations
    Remove Branches
    Align Data
    Organize Data for Cacheability
    Prefetch Data
    Retain Intermediate Results in MMX™ Registers

6 x87 Floating-Point Programming
  6.1 Overview
    Origins
    Compatibility
  6.2 Capabilities
  6.3 Registers
    x87 Data Registers
    x87 Status Word Register
    x87 Control Word Register
    x87 Tag Word Register
    Pointers and Opcode State
    x87 Environment
    Floating-Point Emulation (CR0.EM)
  6.4 Operands
    Operand Addressing
    Data Types
    Number Representation
    Number Encodings
    Precision
    Rounding
  6.5 Instruction Summary
    Syntax
    Data Transfer and Conversion
    Load Constants
    Arithmetic
    Transcendental Functions
    Compare and Test
    Stack Management
    No Operation
    Control
  6.6 Instruction Effects on rFLAGS
  6.7 Instruction Prefixes
  6.8 Feature Detection
  6.9 Exceptions
    General-Purpose Exceptions
    x87 Floating-Point Exception Causes
    x87 Floating-Point Exception Priority
    x87 Floating-Point Exception Masking
  6.10 State-Saving
    State-Saving Instructions
  6.11 Performance Considerations
    Replace x87 Code with 128-Bit Media Code
    Use FCOMI-FCMOVx Branching
    Use FSINCOS Instead of FSIN and FCOS
    Break Up Dependency Chains

Index

Figures

Figure 1-1. Application-Programming Register Set
Figure 2-1. Virtual-Memory Segmentation
Figure 2-2. Segment Registers
Figure 2-3. Long-Mode Memory Management
Figure 2-4. Legacy-Mode Memory Management
Figure 2-5. Byte Ordering
Figure 2-6. Example of 10-Byte Instruction in Memory
Figure 2-7. Complex Address Calculation (Protected Mode)
Figure 2-8. Near and Far Pointers
Figure 2-9. Stack Pointer Mechanism
Figure 2-10. Instruction Pointer (rIP) Register
Figure 3-1. General-Purpose Programming Registers
Figure 3-2. General Registers in Legacy and Compatibility Modes
Figure 3-3. General Registers in 64-Bit Mode
Figure 3-4. GPRs in 64-Bit Mode
Figure 3-5. rFLAGS Register--Flags Visible to Application Software
Figure 3-6. General-Purpose Data Types
Figure 3-7. Mnemonic Syntax Example
Figure 3-8. BSWAP Doubleword Exchange
Figure 3-9. Privilege-Level Relationships
Figure 3-10. Procedure Stack, Near Call
Figure 3-11. Procedure Stack, Far Call to Same Privilege
Figure 3-12. Procedure Stack, Far Call to Greater Privilege
Figure 3-13. Procedure Stack, Near Return
Figure 3-14. Procedure Stack, Far Return from Same Privilege
Figure 3-15. Procedure Stack, Far Return from Less Privilege
Figure 3-16. Procedure Stack, Interrupt to Same Privilege
Figure 3-17. Procedure Stack, Interrupt to Higher Privilege
Figure 3-18. I/O Address Space
Figure 3-19. Memory Hierarchy Example
Figure 4-1. Parallel Operations on Vectors of Integer Elements
Figure 4-2. Parallel Operations on Vectors of Floating-Point Elements
Figure 4-3. Unpack and Interleave Operation
Figure 4-4. Pack Operation
Figure 4-5. Shuffle Operation
Figure 4-6. Move Operations
Figure 4-7. Move Mask Operation
Figure 4-8. Multiply-Add Operation
Figure 4-9. Sum-of-Absolute-Differences Operation
Figure 4-10. Branch-Removal Sequence
Figure 4-11. Move Mask Operation
Figure 4-12. 128-bit Media Registers
Figure 4-13. 128-Bit Media Control and Status Register (MXCSR)
Figure 4-14. 128-Bit Media Data Types
Figure 4-15. 128-Bit Media Floating-Point Data Types
Figure 4-16. Mnemonic Syntax for Typical Instruction
Figure 4-17. Integer Move Operations
Figure 4-18. MASKMOVDQU Move Mask Operation
Figure 4-19. PMOVMSKB Move Mask Operation
Figure 4-20. PACKSSDW Pack Operation
Figure 4-21. PUNPCKLWD Unpack and Interleave Operation
Figure 4-22. PINSRW Operation
Figure 4-23. PSHUFD Shuffle Operation
Figure 4-24. PSHUFHW Shuffle Operation
Figure 4-25. Arithmetic Operation on Vectors of Bytes
Figure 4-26. PMULxW Multiply Operation
Figure 4-27. PMULUDQ Multiply Operation
Figure 4-28. PMADDWD Multiply-Add Operation
Figure 4-29. PSADBW Sum-of-Absolute-Differences Operation
Figure 4-30. PCMPEQB Compare Operation
Figure 4-31. Floating-Point Move Operations
Figure 4-32. MOVMSKPS Move Mask Operation
Figure 4-33. UNPCKLPS Unpack and Interleave Operation
Figure 4-34. SHUFPS Shuffle Operation
Figure 4-35. ADDPS Arithmetic Operation
Figure 4-36. CMPPD Compare Operation
Figure 4-37. COMISD Compare Operation
Figure 4-38. SIMD Floating-Point Detection Process
Figure 5-1. Parallel Integer Operations on Elements of Vectors
Figure 5-2. Unpack and Interleave Operation
Figure 5-3. Shuffle Operation (1 of 256)
Figure 5-4. Multiply-Add Operation
Figure 5-5. Branch-Removal Sequence
Figure 5-6. Floating-Point (3DNow!™ Instruction) Operations
Figure 5-7. 64-bit Media Registers
Figure 5-8.
173 Figure 4-25.Arithmetic Operation on Vectors of Bytes . . . . . . . . . . . . . . . 174 Figure 4-26.PMULxW Multiply Operation. . . . . . . . . . . . . . . . . . . . . . . . . . 177 Figure 4-27.PMULUDQ Multiply Operation . . . . . . . . . . . . . . . . . . . . . . . . 178 Figure 4-28.PMADDWD Multiply-Add Operation. . . . . . . . . . . . . . . . . . . . 179 Figure 4-29.PSADBW Sum-of-Absolute-Differences Operation. . . . . . . . . 181 Figure 4-30.PCMPEQB Compare Operation . . . . . . . . . . . . . . . . . . . . . . . . 184 Figure 4-31.Floating-Point Move Operations. . . . . . . . . . . . . . . . . . . . . . . . 189 Figure 4-32.MOVMSKPS Move Mask Operation. . . . . . . . . . . . . . . . . . . . . 192 Figure 4-33.UNPCKLPS Unpack and Interleave Operation . . . . . . . . . . . 196 Figure 4-34.SHUFPS Shuffle Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 xii Figures 24593--Rev. 3.09--September 2003 AMD 64-Bit Technology Figure 4-35.ADDPS Arithmetic Operation. . . . . . . . . . . . . . . . . . . . . . . . . . 198 Figure 4-36.CMPPD Compare Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 204 Figure 4-37.COMISD Compare Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 206 Figure 4-38.SIMD Floating-Point Detection Process. . . . . . . . . . . . . . . . . . 217 Figure 5-1. Parallel Integer Operations on Elements of Vectors . . . . . . . 231 Figure 5-2. Unpack and Interleave Operation . . . . . . . . . . . . . . . . . . . . . . 232 Figure 5-3. Shuffle Operation (1 of 256) . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Figure 5-4. Multiply-Add Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 Figure 5-5. Branch-Removal Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Figure 5-6. Floating-Point (3DNow!TM Instruction) Operations . . . . . . . . 236 Figure 5-7. 64-bit Media Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 Figure 5-8. 
64-Bit Media Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 Figure 5-9. 64-Bit Floating-Point (3DNow!) Vector Operand . . . . . . . . . . 243 Figure 5-10.Mnemonic Syntax for Typical Instruction . . . . . . . . . . . . . . . . 246 Figure 5-11.MASKMOVQ Move Mask Operation . . . . . . . . . . . . . . . . . . . . 249 Figure 5-12.PACKSSDW Pack Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . 252 Figure 5-13.PUNPCKLWD Unpack and Interleave Operation . . . . . . . . . 253 Figure 5-14.PSHUFW Shuffle Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 254 Figure 5-15.PSWAPD Swap Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255 Figure 5-16.PMADDWD Multiply-Add Operation. . . . . . . . . . . . . . . . . . . . 259 Figure 5-17.PFACC Accumulate Operation . . . . . . . . . . . . . . . . . . . . . . . . . 269 Figure 6-1. x87 Registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 Figure 6-2. x87 Physical and Stack Registers . . . . . . . . . . . . . . . . . . . . . . . 288 Figure 6-3. x87 Status Word Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290 Figure 6-4. x87 Control Word Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 Figure 6-5. x87 Tag Word Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296 Figure 6-6. x87 Pointers and Opcode State . . . . . . . . . . . . . . . . . . . . . . . . . 297 Figure 6-7. x87 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 Figure 6-8. x87 Floating-Point Data Types . . . . . . . . . . . . . . . . . . . . . . . . . 302 Figure 6-9. x87 Packed Decimal Data Type . . . . . . . . . . . . . . . . . . . . . . . . 304 Figure 6-10.Mnemonic Syntax for Typical Instruction . . . . . . . . . . . . . . . . 316 Figures xiii AMD 64-Bit Technology 24593--Rev. 3.09--September 2003 xiv Figures 24593--Rev. 
Tables

Table 1-1. Operating Modes
Table 1-2. Application Registers and Stack, by Operating Mode
Table 2-1. Address-Size Prefixes
Table 3-1. Implicit Uses of Legacy GPRs
Table 3-2. Representable Values of General-Purpose Data Types
Table 3-3. Operand-Size Overrides
Table 3-4. rFLAGS for CMOVcc Instructions
Table 3-5. rFLAGS for SETcc Instructions
Table 3-6. rFLAGS for Jcc Instructions
Table 3-7. Legacy Instruction Prefixes
Table 3-8. Instructions that Implicitly Reference RSP in 64-Bit Mode
Table 3-9. Near Branches in 64-Bit Mode
Table 3-10. Interrupts and Exceptions
Table 4-1. MXCSR Register Reset Values
Table 4-2. Range of Values in 128-Bit Media Integer Data Types
Table 4-3. Saturation Examples
Table 4-4. Range of Values in Normalized Floating-Point Data Types
Table 4-5. Example of Denormalization
Table 4-6. NaN Results
Table 4-7. Supported Floating-Point Encodings
Table 4-8. Indefinite-Value Encodings
Table 4-9. Types of Rounding
Table 4-10. Example PANDN Bit Values
Table 4-11. SIMD Floating-Point Exception Flags
Table 4-12. Invalid-Operation Exception (IE) Causes
Table 4-13. Priority of SIMD Floating-Point Exceptions
Table 4-14. SIMD Floating-Point Exception Masks
Table 4-15. Masked Responses to SIMD Floating-Point Exceptions
Table 5-1. Range of Values in 64-Bit Media Integer Data Types
Table 5-2. Saturation Examples
Table 5-3. Range of Values in 64-Bit Media Floating-Point Data Types
Table 5-4. 64-Bit Floating-Point Exponent Ranges
Table 5-5. Example PANDN Bit Values
Table 5-6. Mapping Between Internal and Software-Visible Tag Bits
Table 6-1. Precision Control (PC) Summary
Table 6-2. Types of Rounding
Table 6-3. Mapping Between Internal and Software-Visible Tag Bits
Table 6-4. Instructions that Access the x87 Environment
Table 6-5. Range of Finite Floating-Point Values
Table 6-6. Example of Denormalization
Table 6-7. NaN Results from NaN Source Operands
Table 6-8. Supported Floating-Point Encodings
Table 6-9. Unsupported Floating-Point Encodings
Table 6-10. Indefinite-Value Encodings
Table 6-11. Precision Control Field (PC) Values and Bit Precision
Table 6-12. Types of Rounding
Table 6-13. rFLAGS Conditions for FCMOVcc
Table 6-14. rFLAGS Values for FCOMI Instruction
Table 6-15. Condition-Code Settings for FXAM
Table 6-16. Instruction Effects on rFLAGS
Table 6-17. x87 Floating-Point (#MF) Exception Flags
Table 6-18. Invalid-Operation Exception (IE) Causes
Table 6-19. Priority of x87 Floating-Point Exceptions
Table 6-20. x87 Floating-Point (#MF) Exception Masks
Table 6-21. Masked Responses to x87 Floating-Point Exceptions
Table 6-22. Unmasked Responses to x87 Floating-Point Exceptions

Revision History

Date              Revision
September 2003    3.09
September 2002    3.07

Description
    Corrected several factual errors.
    Corrected minor organizational problems in sections dealing with `Prefetch' instructions in chapters 3, 4, and 5.
    Clarified the general description of the operation of certain 128-bit media instructions in chapter 1.
    Corrected a factual error in the description of the FNINIT/FINIT instructions in chapter 6.
    Corrected operand descriptions for the CMOVcc instructions in chapter 3.
    Added Revision History.
    Corrected marketing denotations.
Preface

About This Book

This book is part of a multivolume work entitled the AMD64 Architecture Programmer's Manual. This table lists each volume and its order number.

Title                                                           Order No.
Volume 1, Application Programming                               24592
Volume 2, System Programming                                    24593
Volume 3, General-Purpose and System Instructions               24594
Volume 4, 128-Bit Media Instructions                            26568
Volume 5, 64-Bit Media and x87 Floating-Point Instructions      26569

Audience

This volume (Volume 1) is intended for programmers writing application programs, compilers, or assemblers. It assumes prior experience in microprocessor programming, although it does not assume prior experience with the legacy x86 or AMD64 microprocessor architecture.

This volume describes the AMD64 architecture's resources and functions that are accessible to application software, including memory, registers, instructions, operands, I/O facilities, and application-software aspects of control transfers (including interrupts and exceptions) and performance optimization. System-programming topics--including the use of instructions running at a current privilege level (CPL) of 0 (most privileged)--are described in Volume 2. Details about each instruction are described in volumes 3, 4, and 5.

Contact Information

To submit questions or comments concerning this document, contact our technical documentation staff at AMD64.Feedback@amd.com.

Organization

This volume begins with an overview of the architecture and its memory organization and is followed by chapters that describe the four application-programming models available in the AMD64 architecture:

General-Purpose Programming--This model uses the integer general-purpose registers (GPRs). The chapter describing it also describes the basic application environment for exceptions, control transfers, I/O, and memory optimization that applies to all other application-programming models.

128-bit Media Programming--This model uses the 128-bit XMM registers and supports integer and floating-point operations on vector (packed) and scalar data types.

64-bit Media Programming--This model uses the 64-bit MMX(TM) registers and supports integer and floating-point operations on vector (packed) and scalar data types.

x87 Floating-Point Programming--This model uses the 80-bit x87 registers and supports floating-point operations on scalar data types.

Definitions assumed throughout this volume are listed below. The index at the end of this volume cross-references topics within the volume. For other topics relating to the AMD64 architecture, see the tables of contents and indexes of the other volumes.

Definitions

Some of the following definitions assume a knowledge of the legacy x86 architecture. See "Related Documents" on page xxxi for further information about the legacy x86 architecture.

Terms and Notation

1011b
    A binary value--in this example, a 4-bit value.
F0EAh
    A hexadecimal value--in this example a 2-byte value.
[1,2)
    A range that includes the left-most value (in this case, 1) but excludes the right-most value (in this case, 2).
7-4
    A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.
128-bit media instructions
    Instructions that use the 128-bit XMM registers. These are a combination of the SSE and SSE2 instruction sets.
64-bit media instructions
    Instructions that use the 64-bit MMX registers. These are primarily a combination of MMX and 3DNow!(TM) instruction sets, with some additional instructions from the SSE and SSE2 instruction sets.
16-bit mode
    Legacy mode or compatibility mode in which a 16-bit address size is active. See legacy mode and compatibility mode.
32-bit mode
    Legacy mode or compatibility mode in which a 32-bit address size is active. See legacy mode and compatibility mode.
64-bit mode
    A submode of long mode. In 64-bit mode, the default address size is 64 bits and new features, such as register extensions, are supported for system and application software.
#GP(0)
    Notation indicating a general-protection exception (#GP) with error code of 0.
absolute
    Said of a displacement that references the base of a code segment rather than an instruction pointer. Contrast with relative.
biased exponent
    The sum of a floating-point value's exponent and a constant bias for a particular floating-point data type. The bias makes the range of the biased exponent always positive, which allows reciprocation without overflow.
byte
    Eight bits.
clear
    To write a bit value of 0. Compare set.
compatibility mode
    A submode of long mode. In compatibility mode, the default address size is 32 bits, and legacy 16-bit and 32-bit applications run without modification.
commit
    To irreversibly write, in program order, an instruction's result to software-visible storage, such as a register (including flags), the data cache, an internal write buffer, or memory.
CPL
    Current privilege level.
CR0-CR4
    A register range, from register CR0 through CR4, inclusive, with the low-order register first.
CR0.PE = 1
    Notation indicating that the PE bit of the CR0 register has a value of 1.
direct
    Referencing a memory location whose address is included in the instruction's syntax as an immediate operand. The address may be an absolute or relative address. Compare indirect.
dirty data
    Data held in the processor's caches or internal buffers that is more recent than the copy held in main memory.
displacement
    A signed value that is added to the base of a segment (absolute addressing) or an instruction pointer (relative addressing). Same as offset.
doubleword
    Two words, or four bytes, or 32 bits.
double quadword
    Eight words, or 16 bytes, or 128 bits. Also called octword.
DS:rSI
    The contents of a memory location whose segment address is in the DS register and whose offset relative to that segment is in the rSI register.
EFER.LME = 0
    Notation indicating that the LME bit of the EFER register has a value of 0.
effective address size
    The address size for the current instruction after accounting for the default address size and any address-size override prefix.
effective operand size
    The operand size for the current instruction after accounting for the default operand size and any operand-size override prefix.
element
    See vector.
exception
    An abnormal condition that occurs as the result of executing an instruction. The processor's response to an exception depends on the type of the exception. For all exceptions except 128-bit media SIMD floating-point exceptions and x87 floating-point exceptions, control is transferred to the handler (or service routine) for that exception, as defined by the exception's vector. For floating-point exceptions defined by the IEEE 754 standard, there are both masked and unmasked responses. When unmasked, the exception handler is called, and when masked, a default response is provided instead of calling the handler.
FF /0
    Notation indicating that FF is the first byte of an opcode, and a subopcode in the ModR/M byte has a value of 0.
flush
    An often ambiguous term meaning (1) writeback, if modified, and invalidate, as in "flush the cache line," or (2) invalidate, as in "flush the pipeline," or (3) change a value, as in "flush to zero."
GDT
    Global descriptor table.
IDT
    Interrupt descriptor table.
IGN
    Ignore. Field is ignored.
indirect
    Referencing a memory location whose address is in a register or other memory location. The address may be an absolute or relative address. Compare direct.
IRB
    The virtual-8086 mode interrupt-redirection bitmap.
IST
    The long-mode interrupt-stack table.
IVT
    The real-address mode interrupt-vector table.
LDT
    Local descriptor table.
legacy x86
    The legacy x86 architecture. See "Related Documents" on page xxxi for descriptions of the legacy x86 architecture.
legacy mode
    An operating mode of the AMD64 architecture in which existing 16-bit and 32-bit applications and operating systems run without modification. A processor implementation of the AMD64 architecture can run in either long mode or legacy mode. Legacy mode has three submodes, real mode, protected mode, and virtual-8086 mode.
long mode
    An operating mode unique to the AMD64 architecture. A processor implementation of the AMD64 architecture can run in either long mode or legacy mode. Long mode has two submodes, 64-bit mode and compatibility mode.
lsb
    Least-significant bit.
LSB
    Least-significant byte.
main memory
    Physical memory, such as RAM and ROM (but not cache memory) that is installed in a particular computer system.
mask
    (1) A control bit that prevents the occurrence of a floating-point exception from invoking an exception-handling routine. (2) A field of bits used for a control purpose.
MBZ
    Must be zero. If software attempts to set an MBZ bit to 1, a general-protection exception (#GP) occurs.
memory
    Unless otherwise specified, main memory.
ModRM
    A byte following an instruction opcode that specifies address calculation based on mode (Mod), register (R), and memory (M) variables.
moffset
    A 16, 32, or 64-bit offset that specifies a memory operand directly, without using a ModRM or SIB byte.
msb
    Most-significant bit.
MSB
    Most-significant byte.
multimedia instructions
    A combination of 128-bit media instructions and 64-bit media instructions.
octword
    Same as double quadword.
offset
    Same as displacement.
overflow
    The condition in which a floating-point number is larger in magnitude than the largest, finite, positive or negative number that can be represented in the data-type format being used.
packed
    See vector.
PAE
    Physical-address extensions.
physical memory
    Actual memory, consisting of main memory and cache.
probe
    A check for an address in a processor's caches or internal buffers. External probes originate outside the processor, and internal probes originate within the processor.
protected mode
    A submode of legacy mode.
quadword
    Four words, or eight bytes, or 64 bits.
RAZ
    Read as zero (0), regardless of what is written.
real-address mode
    See real mode.
real mode
    A short name for real-address mode, a submode of legacy mode.
relative
    Referencing with a displacement (also called offset) from an instruction pointer rather than the base of a code segment. Contrast with absolute.
REX
    An instruction prefix that specifies a 64-bit operand size and provides access to additional registers.
RIP-relative addressing
    Addressing relative to the 64-bit RIP instruction pointer.
set
    To write a bit value of 1. Compare clear.
SIB
    A byte following an instruction opcode that specifies address calculation based on scale (S), index (I), and base (B).
SIMD
    Single instruction, multiple data. See vector.
SSE
    Streaming SIMD extensions instruction set. See 128-bit media instructions and 64-bit media instructions.
SSE2
    Extensions to the SSE instruction set. See 128-bit media instructions and 64-bit media instructions.
sticky bit
    A bit that is set or cleared by hardware and that remains in that state until explicitly changed by software.
TOP
    The x87 top-of-stack pointer.
TPR
    Task-priority register (CR8).
TSS
    Task-state segment.
underflow
    The condition in which a floating-point number is smaller in magnitude than the smallest nonzero, positive or negative number that can be represented in the data-type format being used.
vector
    (1) A set of integer or floating-point values, called elements, that are packed into a single operand. Most of the 128-bit and 64-bit media instructions use vectors as operands. Vectors are also called packed or SIMD (single-instruction multiple-data) operands. (2) An index into an interrupt descriptor table (IDT), used to access exception handlers. Compare exception.
virtual-8086 mode
    A submode of legacy mode.
word
    Two bytes, or 16 bits.
x86
    See legacy x86.

Registers

In the following list of registers, the names are used to refer either to a given register or to the contents of that register:

AH-DH
    The high 8-bit AH, BH, CH, and DH registers. Compare AL-DL.
AL-DL
    The low 8-bit AL, BL, CL, and DL registers. Compare AH-DH.
AL-r15B
    The low 8-bit AL, BL, CL, DL, SIL, DIL, BPL, SPL, and R8B-R15B registers, available in 64-bit mode.
BP
    Base pointer register.
CRn
    Control register number n.
CS
    Code segment register.
eAX-eSP
    The 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers or the 32-bit EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP registers. Compare rAX-rSP.
EFER
    Extended features enable register.
eFLAGS
    16-bit or 32-bit flags register. Compare rFLAGS.
EFLAGS
    32-bit (extended) flags register.
eIP
    16-bit or 32-bit instruction-pointer register. Compare rIP.
EIP
    32-bit (extended) instruction-pointer register.
FLAGS
    16-bit flags register.
GDTR
    Global descriptor table register.
GPRs
    General-purpose registers. For the 16-bit data size, these are AX, BX, CX, DX, DI, SI, BP, and SP. For the 32-bit data size, these are EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP. For the 64-bit data size, these include RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP, and R8-R15.
IDTR
    Interrupt descriptor table register.
IP
    16-bit instruction-pointer register.
LDTR
    Local descriptor table register.
MSR
    Model-specific register.
r8-r15
    The 8-bit R8B-R15B registers, or the 16-bit R8W-R15W registers, or the 32-bit R8D-R15D registers, or the 64-bit R8-R15 registers.
rAX-rSP
    The 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers, or the 32-bit EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP registers, or the 64-bit RAX, RBX, RCX, RDX, RDI, RSI, RBP, and RSP registers. Replace the placeholder r with nothing for 16-bit size, "E" for 32-bit size, or "R" for 64-bit size.
RAX
    64-bit version of the EAX register.
RBP
    64-bit version of the EBP register.
RBX
    64-bit version of the EBX register.
RCX
    64-bit version of the ECX register.
RDI
    64-bit version of the EDI register.
RDX
    64-bit version of the EDX register.
rFLAGS
    16-bit, 32-bit, or 64-bit flags register. Compare RFLAGS.
RFLAGS
    64-bit flags register. Compare rFLAGS.
rIP
    16-bit, 32-bit, or 64-bit instruction-pointer register. Compare RIP.
RIP
    64-bit instruction-pointer register.
RSI
    64-bit version of the ESI register.
RSP
    64-bit version of the ESP register.
SP
    Stack pointer register.
SS
    Stack segment register.
TPR
    Task priority register, a new register introduced in the AMD64 architecture to speed interrupt management.
TR
    Task register.

Endian Order

The x86 and AMD64 architectures address memory using little-endian byte-ordering. Multibyte values are stored with their least-significant byte at the lowest byte address, and they are illustrated with their least-significant byte at the right side. Strings are illustrated in reverse order, because the addresses of their bytes increase from right to left.
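As a minimal sketch of the little-endian byte ordering described above, the following Python snippet lays out a doubleword the way it would be stored in memory. The sample value and the helper name `little_endian_bytes` are illustrative assumptions and do not come from this manual.

```python
# Illustrative sketch only: the sample value and the helper name are
# assumptions chosen for this example, not part of the AMD64 manual.

def little_endian_bytes(value: int, size: int) -> bytes:
    """Return the size-byte little-endian memory image of an unsigned value."""
    return value.to_bytes(size, byteorder="little")


doubleword = 0x12345678                      # a 4-byte (doubleword) value
image = little_endian_bytes(doubleword, 4)   # bytes in increasing address order

# The least-significant byte (78h) is stored at the lowest byte address.
for offset, byte in enumerate(image):
    print(f"base+{offset}: {byte:02X}h")

# Reading the bytes back in little-endian order recovers the original value.
assert int.from_bytes(image, byteorder="little") == doubleword
```

Printed in increasing address order, the bytes are 78h, 56h, 34h, 12h, which is the reverse of the way the value is written (and illustrated) with its most-significant byte on the left.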
Related Documents

Peter Abel, IBM PC Assembly Language and Programming, Prentice-Hall, Englewood Cliffs, NJ, 1995.
Rakesh Agarwal, 80x86 Architecture & Programming: Volume II, Prentice-Hall, Englewood Cliffs, NJ, 1991.
AMD data sheets and application notes for particular hardware implementations of the AMD64 architecture.
AMD, AMD-K6(R) MMX(TM) Enhanced Processor Multimedia Technology, Sunnyvale, CA, 2000.
AMD, 3DNow!(TM) Technology Manual, Sunnyvale, CA, 2000.
AMD, AMD Extensions to the 3DNow!(TM) and MMX(TM) Instruction Sets, Sunnyvale, CA, 2000.
Don Anderson and Tom Shanley, Pentium(R) Processor System Architecture, Addison-Wesley, New York, 1995.
Nabajyoti Barkakati and Randall Hyde, Microsoft Macro Assembler Bible, Sams, Carmel, Indiana, 1992.
Barry B. Brey, 8086/8088, 80286, 80386, and 80486 Assembly Language Programming, Macmillan Publishing Co., New York, 1994.
Barry B. Brey, Programming the 80286, 80386, 80486, and Pentium Based Personal Computer, Prentice-Hall, Englewood Cliffs, NJ, 1995.
Ralf Brown and Jim Kyle, PC Interrupts, Addison-Wesley, New York, 1994.
Penn Brumm and Don Brumm, 80386/80486 Assembly Language Programming, Windcrest McGraw-Hill, 1993.
Geoff Chappell, DOS Internals, Addison-Wesley, New York, 1994.
Chips and Technologies, Inc., Super386 DX Programmer's Reference Manual, Chips and Technologies, Inc., San Jose, 1992.
John Crawford and Patrick Gelsinger, Programming the 80386, Sybex, San Francisco, 1987.
Cyrix Corporation, 5x86 Processor BIOS Writer's Guide, Cyrix Corporation, Richardson, TX, 1995.
Cyrix Corporation, M1 Processor Data Book, Cyrix Corporation, Richardson, TX, 1996.
Cyrix Corporation, MX Processor MMX Extension Opcode Table, Cyrix Corporation, Richardson, TX, 1996.
Cyrix Corporation, MX Processor Data Book, Cyrix Corporation, Richardson, TX, 1997.
Ray Duncan, Extending DOS: A Programmer's Guide to Protected-Mode DOS, Addison Wesley, NY, 1991.
William B. Giles, Assembly Language Programming for the Intel 80xxx Family, Macmillan, New York, 1991.
Frank van Gilluwe, The Undocumented PC, Addison-Wesley, New York, 1994.
John L. Hennessy and David A. Patterson, Computer Architecture, Morgan Kaufmann Publishers, San Mateo, CA, 1996.
Thom Hogan, The Programmer's PC Sourcebook, Microsoft Press, Redmond, WA, 1991.
Hal Katircioglu, Inside the 486, Pentium(R), and Pentium Pro, Peer-to-Peer Communications, Menlo Park, CA, 1997.
IBM Corporation, 486SLC Microprocessor Data Sheet, IBM Corporation, Essex Junction, VT, 1993.
IBM Corporation, 486SLC2 Microprocessor Data Sheet, IBM Corporation, Essex Junction, VT, 1993.
IBM Corporation, 80486DX2 Processor Floating Point Instructions, IBM Corporation, Essex Junction, VT, 1995.
IBM Corporation, 80486DX2 Processor BIOS Writer's Guide, IBM Corporation, Essex Junction, VT, 1995.
IBM Corporation, Blue Lightening 486DX2 Data Book, IBM Corporation, Essex Junction, VT, 1994.
Institute of Electrical and Electronics Engineers, IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985.
Institute of Electrical and Electronics Engineers, IEEE Standard for Radix-Independent Floating-Point Arithmetic, ANSI/IEEE Std 854-1987.
Muhammad Ali Mazidi and Janice Gillispie Mazidi, 80X86 IBM PC and Compatible Computers, Prentice-Hall, Englewood Cliffs, NJ, 1997.
Hans-Peter Messmer, The Indispensable Pentium Book, Addison-Wesley, New York, 1995.
Karen Miller, An Assembly Language Introduction to Computer Architecture: Using the Intel Pentium(R), Oxford University Press, New York, 1999.
Stephen Morse, Eric Isaacson, and Douglas Albert, The 80386/387 Architecture, John Wiley & Sons, New York, 1987.
NexGen Inc., Nx586 Processor Data Book, NexGen Inc., Milpitas, CA, 1993.
NexGen Inc., Nx686 Processor Data Book, NexGen Inc., Milpitas, CA, 1994.
Bipin Patwardhan, Introduction to the Streaming SIMD Extensions in the Pentium(R) III, www.x86.org/articles/sse_pt1/simd1.htm, June, 2000.
Peter Norton, Peter Aitken, and Richard Wilton, PC Programmer's Bible, Microsoft(R) Press, Redmond, WA, 1993.
PharLap 386|ASM Reference Manual, Pharlap, Cambridge MA, 1993.
PharLap TNT DOS-Extender Reference Manual, Pharlap, Cambridge MA, 1995.
Sen-Cuo Ro and Sheau-Chuen Her, i386/i486 Advanced Programming, Van Nostrand Reinhold, New York, 1993.
Jeffrey P. Royer, Introduction to Protected Mode Programming, course materials for an onsite class, 1992.
Tom Shanley, Protected Mode System Architecture, Addison Wesley, NY, 1996.
SGS-Thomson Corporation, 80486DX Processor SMM Programming Manual, SGS-Thomson Corporation, 1995.
Walter A. Triebel, The 80386DX Microprocessor, Prentice-Hall, Englewood Cliffs, NJ, 1992.
John Wharton, The Complete x86, MicroDesign Resources, Sebastopol, California, 1994.

Web sites and newsgroups:
www.amd.com
news.comp.arch
news.comp.lang.asm.x86
news.intel.microprocessors
news.microsoft

1 Overview of the AMD64 Architecture

1.1 Introduction

The AMD64 architecture is a simple yet powerful 64-bit, backward-compatible extension of the industry-standard (legacy) x86 architecture. It adds 64-bit addressing and expands register resources to support higher performance for recompiled 64-bit programs, while supporting legacy 16-bit and 32-bit applications and operating systems without modification or recompilation. It is the architectural basis on which new processors can provide seamless, high-performance support for both the vast body of existing software and new 64-bit software required for higher-performance applications.
The need for a 64-bit x86 architecture is driven by applications that address large amounts of virtual and physical memory, such as high-performance servers, database management systems, and CAD tools. These applications benefit from both 64-bit addresses and an increased number of registers. The small number of registers available in the legacy x86 architecture limits performance in computation-intensive applications. Increasing the number of registers provides a performance boost to many such applications.

1.1.1 New Features

The AMD64 architecture introduces these new features:

Register Extensions (see Figure 1-1 on page 2):
- 8 new general-purpose registers (GPRs).
- All 16 GPRs are 64 bits wide.
- 8 new 128-bit XMM registers.
- Uniform byte-register addressing for all GPRs.
- A new instruction prefix (REX) accesses the extended registers.

Long Mode (see Table 1-1 on page 3):
- Up to 64 bits of virtual address.
- 64-bit instruction pointer (RIP).
- New instruction-pointer-relative data-addressing mode.
- Flat address space.

Figure 1-1. Application-Programming Register Set (general-purpose registers RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8-R15, bits 63-0; 64-bit media and floating-point registers MMX0/FPR0-MMX7/FPR7; 128-bit media registers XMM0-XMM15, bits 127-0; flags register EFLAGS within RFLAGS; instruction pointer EIP within RIP. The legend distinguishes legacy x86 registers, supported in all modes, from register extensions, supported in 64-bit mode. Application-programming registers also include the 128-bit media control-and-status register and the x87 tag-word, control-word, and status-word registers.)

Table 1-1. Operating Modes

Operating    Submode             Operating System  Application  Default       Default       Register    Typical GPR
Mode                             Required          Recompile    Address Size  Operand Size  Extensions  Width (bits)
                                                   Required     (bits)        (bits)
Long Mode    64-Bit Mode         New 64-bit OS     yes          64            32            yes         64
             Compatibility Mode  New 64-bit OS     no           32 or 16      32 or 16      no          32 or 16
Legacy Mode  Protected Mode      Legacy 32-bit OS  no           32 or 16      32 or 16      no          32
             Virtual-8086 Mode   Legacy 32-bit OS  no           16            16            no          16
             Real Mode           Legacy 16-bit OS  no           16            16            no          16

1.1.2 Registers

Table 1-2 on page 4 compares the register and stack resources available to application software, by operating mode. The left set of columns shows the legacy x86 resources, which are available in the AMD64 architecture's legacy and compatibility modes. The right set of columns shows the comparable resources in 64-bit mode. Gray shading indicates differences between the modes. These register differences (not including stack-width difference) represent the register extensions shown in Figure 1-1.

Table 1-2. Application Registers and Stack, by Operating Mode

                                      Legacy and Compatibility Modes                               64-Bit Mode (1)
Register or Stack                     Name                                    Number  Size (bits)  Name                                            Number  Size (bits)
General-Purpose Registers (GPRs) (2)  EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP  8       32           RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, R8-R15  16      64
128-Bit XMM Registers                 XMM0-XMM7                               8       128          XMM0-XMM15                                      16      128
64-Bit MMX Registers                  MMX0-MMX7 (3)                           8       64           MMX0-MMX7 (3)                                   8       64
x87 Registers                         FPR0-FPR7 (3)                           8       80           FPR0-FPR7 (3)                                   8       80
Instruction Pointer (2)               EIP                                     1       32           RIP                                             1       64
Flags (2)                             EFLAGS                                  1       32           RFLAGS                                          1       64
Stack                                 --                                      --      16 or 32     --                                              --      64

Note:
1. Gray-shaded entries indicate differences between the modes. These differences (except stack-width difference) are the AMD64 architecture's register extensions.
2. This list of GPRs shows only the 32-bit registers. The 16-bit and 8-bit mappings of the 32-bit registers are also accessible, as described in "Registers" on page 27.
3. The MMX0-MMX7 registers are mapped onto the FPR0-FPR7 physical registers, as shown in Figure 1-1. The x87 stack registers, ST(0)-ST(7), are the logical mappings of the FPR0-FPR7 physical registers.

As Table 1-2 shows, the legacy x86 architecture (called legacy mode in the AMD64 architecture) supports eight GPRs. In reality, however, the general use of at least four registers (EBP, ESI, EDI, and ESP) is compromised because they serve special purposes when executing many instructions. The AMD64 architecture's addition of eight new GPRs--and the increased width of these registers from 32 bits to 64 bits--allows compilers to substantially improve software performance. Compilers have more flexibility in using registers to hold variables. Compilers can also minimize memory traffic--and thus boost performance--by localizing work within the GPRs.

1.1.3 Instruction Set

The AMD64 architecture supports the full legacy x86 instruction set, and it adds a few new instructions to support long mode (see Table 1-1 for a summary of operating modes). The application-programming instructions are organized and described in the following subsets:

- General-Purpose Instructions--These are the basic x86 integer instructions used in virtually all programs. Most of these instructions load, store, or operate on data located in the general-purpose registers (GPRs) or memory. Some of the instructions alter sequential program flow by branching to other program locations.
- 128-Bit Media Instructions--These are the streaming SIMD extension (SSE and SSE2) instructions that load, store, or operate on data located primarily in the 128-bit XMM registers. They perform integer and floating-point operations on vector (packed) and scalar data types.
  Because the vector instructions can independently and simultaneously perform a single operation on multiple sets of data, they are called single-instruction, multiple-data (SIMD) instructions. They are useful for high-performance media and scientific applications that operate on blocks of data.
- 64-Bit Media Instructions--These are the multimedia extension (MMX(TM) technology) and AMD 3DNow!(TM) technology instructions. They load, store, or operate on data located primarily in the 64-bit MMX registers. Like their 128-bit counterparts, described above, they perform integer and floating-point operations on vector (packed) and scalar data types. Thus, they are also SIMD instructions and are useful in media applications that operate on blocks of data.
- x87 Floating-Point Instructions--These are the floating-point instructions used in legacy x87 applications. They load, store, or operate on data located in the x87 registers.

Some of these application-programming instructions bridge two or more of the above subsets. For example, there are instructions that move data between the general-purpose registers and the XMM or MMX registers, and many of the integer vector (packed) instructions can operate on either XMM or MMX registers, although not simultaneously. If instructions bridge two or more subsets, their descriptions are repeated in all subsets to which they apply.

1.1.4 Media Instructions

Media applications--such as image processing, music synthesis, speech recognition, full-motion video, and 3D graphics rendering--share certain characteristics:

- They process large amounts of data.
- They often perform the same sequence of operations repeatedly across the data.
- The data are often represented as small quantities, such as 8 bits for pixel values, 16 bits for audio samples, and 32 bits for object coordinates in floating-point format.
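As a rough illustration of the SIMD idea described under "Instruction Set" above (one operation applied independently to every element of a packed operand), the following Python sketch models a 128-bit register as an integer holding 16 byte lanes. The helper names are hypothetical and are not part of the architecture. The wraparound add is loosely modeled on a packed byte add such as PADDB, and the second function shows an unsigned saturating variant that clamps each lane instead of wrapping.

```python
LANES = 16  # a 128-bit register viewed as sixteen 8-bit elements


def unpack_bytes(vec128: int) -> list[int]:
    """Split a 128-bit integer into 16 byte lanes, lane 0 = least significant."""
    return list(vec128.to_bytes(LANES, "little"))


def pack_bytes(lanes: list[int]) -> int:
    """Combine 16 byte lanes back into one 128-bit integer."""
    return int.from_bytes(bytes(lanes), "little")


def packed_add_bytes(a: int, b: int) -> int:
    """Lane-wise add: each lane wraps modulo 256 and never carries into its neighbor."""
    sums = [(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(sums)


def packed_add_bytes_saturating(a: int, b: int) -> int:
    """Lane-wise unsigned add: a lane that would overflow is clamped to 255."""
    sums = [min(x + y, 0xFF) for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(sums)


# Hypothetical pixel data: sixteen brightness values and a uniform boost.
bright = pack_bytes([200] * LANES)
boost = pack_bytes([100] * LANES)
wrapped = unpack_bytes(packed_add_bytes(bright, boost))
clamped = unpack_bytes(packed_add_bytes_saturating(bright, boost))
```

With these inputs every wrapped lane holds 44 (300 modulo 256), while every clamped lane holds 255, which is the difference between a dark pixel and a fully bright one.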
The 128-bit and 64-bit media instructions are designed to accelerate these applications. The instructions use a form of vector (or packed) parallel processing known as single-instruction, multiple-data (SIMD) processing. This vector technology has the following characteristics:

- A single register can hold multiple independent pieces of data. For example, a single 128-bit XMM register can hold 16 8-bit integer data elements, or four 32-bit single-precision floating-point data elements.
- The vector instructions can operate on all data elements in a register, independently and simultaneously. For example, a PADDB instruction operating on byte elements of two vector operands in 128-bit XMM registers performs 16 simultaneous additions and returns 16 independent results in a single operation.

128-bit and 64-bit media instructions take SIMD vector technology a step further by including special instructions that perform operations commonly found in media applications. For example, a graphics application that adds the brightness values of two pixels must prevent the add operation from wrapping around to a small value if the result overflows the destination register, because an overflow result can produce unexpected effects such as a dark pixel where a bright one is expected. The 128-bit and 64-bit media instructions include saturating-arithmetic instructions to simplify this type of operation. A result that otherwise would wrap around due to overflow or underflow is instead forced to saturate at the largest or smallest value that can be represented in the destination register.

1.1.5 Floating-Point Instructions

The AMD64 architecture provides three floating-point instruction subsets, using three distinct register sets:

- 128-Bit Media Instructions support 32-bit single-precision and 64-bit double-precision floating-point operations, in addition to integer operations.
  Operations on both vector data and scalar data are supported, with a dedicated floating-point exception-reporting mechanism. These floating-point operations comply with the IEEE-754 standard.
- 64-Bit Media Instructions (the subset of 3DNow! technology instructions) support single-precision floating-point operations. Operations on both vector data and scalar data are supported, but these instructions do not support floating-point exception reporting.
- x87 Floating-Point Instructions support single-precision, double-precision, and 80-bit extended-precision floating-point operations. Only scalar data are supported, with a dedicated floating-point exception-reporting mechanism. The x87 floating-point instructions contain special instructions for performing trigonometric and logarithmic transcendental operations. The single-precision and double-precision floating-point operations comply with the IEEE-754 standard.

Maximum floating-point performance can be achieved using the 128-bit media instructions. One of these vector instructions can support up to four single-precision (or two double-precision) operations in parallel. In 64-bit mode, the AMD64 architecture doubles the number of legacy XMM registers from 8 to 16.

Applications gain additional benefits using the 64-bit media and x87 instructions. The separate register sets supported by these instructions relieve pressure on the XMM registers available to the 128-bit media instructions. This provides application programs with three distinct sets of floating-point registers. In addition, certain high-end implementations of the AMD64 architecture may support 128-bit media, 64-bit media, and x87 instructions with separate execution units.

1.2 Modes of Operation

Table 1-1 on page 3 summarizes the modes of operation supported by the AMD64 architecture.
In most cases, the default address and operand sizes can be overridden with instruction prefixes. The register extensions shown in the second-from-right column of Table 1-1 are those illustrated in Figure 1-1 on page 2.

1.2.1 Long Mode

Long mode is an extension of legacy protected mode. Long mode consists of two submodes: 64-bit mode and compatibility mode. 64-bit mode supports all of the new features and register extensions of the AMD64 architecture. Compatibility mode supports binary compatibility with existing 16-bit and 32-bit applications. Long mode does not support legacy real mode or legacy virtual-8086 mode, and it does not support hardware task switching.

Throughout this document, references to long mode refer to both 64-bit mode and compatibility mode. If a function is specific to either of these submodes, then the name of the specific submode is used instead of the name long mode.

1.2.2 64-Bit Mode

64-bit mode--a submode of long mode--supports the full range of 64-bit virtual-addressing and register-extension features. This mode is enabled by the operating system on an individual code-segment basis. Because 64-bit mode supports a 64-bit virtual-address space, it requires a new 64-bit operating system and tool chain. Existing application binaries can run without recompilation in compatibility mode, under an operating system that runs in 64-bit mode, or the applications can also be recompiled to run in 64-bit mode.

Addressing features include a 64-bit instruction pointer (RIP) and a new RIP-relative data-addressing mode. This mode accommodates modern operating systems by supporting only a flat address space, with single code, data, and stack space.

Register Extensions. 64-bit mode implements register extensions through a new group of instruction prefixes, called REX prefixes.
These extensions add eight GPRs (R8-R15), widen all GPRs to 64 bits, and add eight 128-bit XMM registers (XMM8-XMM15). The REX instruction prefixes also provide a new byte-register capability that makes the low byte of any of the sixteen GPRs available for byte operations. This results in a uniform set of byte, word, doubleword, and quadword registers that is better suited to compiler register-allocation.

64-Bit Addresses and Operands. In 64-bit mode, the default virtual-address size is 64 bits (implementations can have fewer). The default operand size for most instructions is 32 bits. For most instructions, these defaults can be overridden on an instruction-by-instruction basis using instruction prefixes. REX prefixes specify the 64-bit operand size and new registers.

RIP-Relative Data Addressing. 64-bit mode supports data addressing relative to the 64-bit instruction pointer (RIP). The legacy x86 architecture supports IP-relative addressing only in control-transfer instructions. RIP-relative addressing improves the efficiency of position-independent code and code that addresses global data.

Opcodes. A few instruction opcodes and prefix bytes are redefined to allow register extensions and 64-bit addressing. These differences are described in "General-Purpose Instructions in 64-Bit Mode" in Volume 3 and "Differences Between Long Mode and Legacy Mode" in Volume 3.

1.2.3 Compatibility Mode

Compatibility mode--the second submode of long mode--allows 64-bit operating systems to run existing 16-bit and 32-bit x86 applications. These legacy applications run in compatibility mode without recompilation.

Applications running in compatibility mode use 32-bit or 16-bit addressing and can access the first 4GB of virtual-address space. Legacy x86 instruction prefixes toggle between 16-bit and 32-bit address and operand sizes.
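The operand-size rules described above for 64-bit mode and compatibility mode can be sketched as a small selection function. This is a minimal Python model and not a decoder: the function name and parameters are hypothetical, the 66h operand-size override prefix and the REX.W bit are named here from general x86 knowledge and do not appear in this passage, and the sketch covers only the common case of instructions whose default operand size in 64-bit mode is 32 bits.

```python
def effective_operand_size(mode: str, default_size: int = 32,
                           opsize_prefix: bool = False, rex_w: bool = False) -> int:
    """Return the operand size in bits for one instruction.

    mode          -- "64-bit" or "compat" (compatibility or legacy protected mode)
    default_size  -- code-segment default (16 or 32); ignored in 64-bit mode
    opsize_prefix -- True if a 66h operand-size override prefix is present
    rex_w         -- True if a REX prefix with the W bit set is present
    """
    if mode == "64-bit":
        if rex_w:
            return 64                       # REX.W selects 64-bit operands
        return 16 if opsize_prefix else 32  # 32-bit default, 66h selects 16
    if mode != "compat":
        raise ValueError(f"unknown mode: {mode}")
    if rex_w:
        raise ValueError("REX prefixes are available only in 64-bit mode")
    if default_size not in (16, 32):
        raise ValueError("default operand size must be 16 or 32")
    if opsize_prefix:
        return 16 if default_size == 32 else 32  # the prefix toggles the default
    return default_size


sizes_64 = [effective_operand_size("64-bit"),
            effective_operand_size("64-bit", opsize_prefix=True),
            effective_operand_size("64-bit", rex_w=True)]
```

In this model a REX prefix with W set wins over a 66h prefix when both are present, so the two prefixes never combine into a fourth size.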
As with 64-bit mode, compatibility mode is enabled by the operating system on an individual code-segment basis. Unlike 64-bit mode, however, x86 segmentation functions the same as in the legacy x86 architecture, using 16-bit or 32-bit protected-mode semantics. From the application viewpoint, compatibility mode looks like the legacy x86 protected-mode environment. From the operating-system viewpoint, however, address translation, interrupt and exception handling, and system data structures use the 64-bit long-mode mechanisms.

1.2.4 Legacy Mode

Legacy mode preserves binary compatibility not only with existing 16-bit and 32-bit applications but also with existing 16-bit and 32-bit operating systems. Legacy mode consists of the following three submodes:

- Protected Mode--Protected mode supports 16-bit and 32-bit programs with memory segmentation, optional paging, and privilege-checking. Programs running in protected mode can access up to 4GB of memory space.
- Virtual-8086 Mode--Virtual-8086 mode supports 16-bit real-mode programs running as tasks under protected mode. It uses a simple form of memory segmentation, optional paging, and limited protection-checking. Programs running in virtual-8086 mode can access up to 1MB of memory space.
- Real Mode--Real mode supports 16-bit programs using simple register-based memory segmentation. It does not support paging or protection-checking. Programs running in real mode can access up to 1MB of memory space.

Legacy mode is compatible with existing 32-bit processor implementations of the x86 architecture. Processors that implement the AMD64 architecture boot in legacy real mode, just like processors that implement the legacy x86 architecture.

Throughout this document, references to legacy mode refer to all three submodes--protected mode, virtual-8086 mode, and real mode.
If a function is specific to one of these submodes, then the name of the specific submode is used instead of the name legacy mode.

2 Memory Model

This chapter describes the memory characteristics that apply to application software in the various operating modes of the AMD64 architecture. These characteristics apply to all instructions in the architecture. Several additional system-level details about memory and cache management are described in Volume 2.

2.1 Memory Organization

2.1.1 Virtual Memory

Virtual memory consists of the entire address space available to programs. It is a large linear-address space that is translated by a combination of hardware and operating-system software to a smaller physical-address space, parts of which are located in memory and parts on disk or other external storage media.

Figure 2-1 on page 12 shows how the virtual-memory space is treated in the two submodes of long mode:

- 64-bit mode--This mode uses a flat segmentation model of virtual memory. The 64-bit virtual-memory space is treated as a single, flat (unsegmented) address space. Program addresses access locations that can be anywhere in the linear 64-bit address space. The operating system can use separate selectors for code, stack, and data segments for memory-protection purposes, but the base address of all these segments is always 0. (For an exception to this general rule, see "FS and GS as Base of Address Calculation" on page 20.)
- Compatibility mode--This mode uses a protected, multi-segment model of virtual memory, just as in legacy protected mode. The 32-bit virtual-memory space is treated as a segmented set of address spaces for code, stack, and data segments, each with its own base address and protection parameters. A segmented space is specified by adding a segment selector to an address.
Figure 2-1. Virtual-Memory Segmentation (64-bit mode, flat segmentation model: a single address space from 0 to 2^64 - 1, with a base address of 0 for all segments; legacy and compatibility mode, multi-segment model: an address space from 0 to 2^32 - 1 containing code, stack, and data segments that start at the code-segment (CS), stack-segment (SS), and data-segment (DS) bases.)

Segmented memory has been used as a method by which operating systems could isolate programs, and the data used by programs, from each other in an effort to increase the reliability of systems running multiple programs simultaneously. However, most modern operating systems do not use the segmentation features available in the legacy x86 architecture. Instead, these operating systems handle segmentation functions entirely in software. For this reason, the AMD64 architecture dispenses with most of the legacy segmentation functions in 64-bit mode. This allows new 64-bit operating systems to be coded more simply, and it supports more efficient management of multiprogramming environments than is possible in the legacy x86 architecture.

2.1.2 Segment Registers

Segment registers hold the selectors used to access memory segments. Figure 2-2 on page 13 shows the application-visible portion of the segment registers. In legacy and compatibility modes, all segment registers are accessible to software. In 64-bit mode, only the CS, FS, and GS segments are recognized by the processor, and software can use the FS and GS segment-base registers as base registers for address calculation, as described in "FS and GS as Base of Address Calculation" on page 20. For references to the DS, ES, or SS segments in 64-bit mode, the processor assumes that the base for each of these segments is zero, neither their segment limit nor attributes are checked, and the processor simply checks that all such addresses are in canonical form, as described in "64-bit Canonical Addresses" on page 18.
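The 64-bit-mode behavior just described (a zero base for every segment except FS and GS, followed by a canonical-form check) can be sketched in Python. This is only an illustrative model under stated assumptions: the function names are hypothetical, 48 implemented virtual-address bits is an assumed example value because the number of implemented bits is implementation-specific, and the exception that a processor would raise is represented here by a plain ValueError.

```python
MASK64 = (1 << 64) - 1
SEGMENTS = ("CS", "DS", "ES", "SS", "FS", "GS")


def is_canonical(addr: int, implemented_bits: int = 48) -> bool:
    """True if bits 63 down to the most-significant implemented bit are all 0s or all 1s."""
    upper = (addr & MASK64) >> (implemented_bits - 1)
    all_ones = (1 << (64 - implemented_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def linear_address_64(effective: int, segment: str = "DS",
                      fs_base: int = 0, gs_base: int = 0,
                      implemented_bits: int = 48) -> int:
    """Form a 64-bit-mode virtual address from an effective address."""
    if segment not in SEGMENTS:
        raise ValueError(f"unknown segment: {segment}")
    # Only FS and GS contribute a base; every other segment is treated as base 0.
    base = {"FS": fs_base, "GS": gs_base}.get(segment, 0)
    linear = (base + effective) & MASK64
    if not is_canonical(linear, implemented_bits):
        raise ValueError("non-canonical address")  # stands in for the processor exception
    return linear


flat = linear_address_64(0x1000, "DS")
via_fs = linear_address_64(0x10, "FS", fs_base=0x7000_0000_0000)
```

Under the 48-bit assumption, addresses from 0x0000_8000_0000_0000 through 0xFFFF_7FFF_FFFF_FFFF fail the check, because bit 47 and the bits above it do not all agree.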
Figure 2-2. Segment Registers (legacy mode and compatibility mode: CS, DS, ES, FS, GS, and SS, bits 15-0; 64-bit mode: CS (attributes only), FS (base only), and GS (base only), with DS, ES, and SS ignored.)

For details on segmentation and the segment registers, see "Segmented Virtual Memory" in Volume 2.

2.1.3 Physical Memory

Physical memory is the installed memory (excluding cache memory) in a particular computer system that can be accessed through the processor's bus interface. The maximum size of the physical memory space is determined by the number of address bits on the bus interface.

In a virtual-memory system, the large virtual-address space (also called linear-address space) is translated to a smaller physical-address space by a combination of segmentation and paging hardware and software.

Segmentation is illustrated in Figure 2-1 on page 12. Paging is a mechanism for translating linear (virtual) addresses into fixed-size blocks called pages, which the operating system can move, as needed, between memory and external storage media (typically disk). The AMD64 architecture supports an expanded version of the legacy x86 paging mechanism, one that is able to translate the full 64-bit virtual-address space into the physical-address space supported by the particular implementation.

2.1.4 Memory Management

Memory management consists of the methods by which addresses generated by programs are translated via segmentation and/or paging into addresses in physical memory. Memory management is not visible to application programs. It is handled by the operating system and processor hardware. The following description gives a very brief overview of these functions. Details are given in "System-Management Instructions" in Volume 2.

Long-Mode Memory Management. Figure 2-3 shows the flow, from top to bottom, of memory management functions performed in the two submodes of long mode.
Figure 2-3. Long-Mode Memory Management (64-bit mode: a virtual (linear) address, bits 63-0, passes through paging to a physical address, bits 51-0; compatibility mode: a selector, bits 15-0, and an effective address, bits 31-0, pass through segmentation to a virtual address whose bits 63-32 are 0, and then through paging to a physical address, bits 51-0.)

In 64-bit mode, programs generate virtual (linear) addresses that can be up to 64 bits in size. The virtual addresses are passed to the long-mode paging function, which generates physical addresses that can be up to 52 bits in size. (Specific implementations of the architecture can support smaller virtual-address and physical-address sizes.)

In compatibility mode, legacy 16-bit and 32-bit applications run using legacy x86 protected-mode segmentation semantics. The 16-bit or 32-bit effective addresses generated by programs are combined with their segments to produce 32-bit virtual (linear) addresses that are zero-extended to a maximum of 64 bits. The paging that follows is the same long-mode paging function used in 64-bit mode. It translates the virtual addresses into physical addresses. The combination of segment selector and effective address is also called a logical address or far pointer. The virtual address is also called the linear address.

Legacy-Mode Memory Management. Figure 2-4 shows the memory-management functions performed in the three submodes of legacy mode.

Figure 2-4. Legacy-Mode Memory Management (protected mode: a selector, bits 15-0, and an effective address (EA), bits 31-0, pass through segmentation to a linear address, bits 31-0, and then through paging to a physical address (PA), bits 31-0; virtual-8086 mode: a selector and an EA, each bits 15-0, pass through segmentation to a linear address, bits 19-0, and then through paging to a PA, bits 31-0; real mode: a selector and an EA, each bits 15-0, pass through segmentation to a linear address, bits 19-0, which is used as the PA without paging.)
The memory-management functions differ, depending on the submode, as follows:

- Protected Mode--Protected mode supports 16-bit and 32-bit programs with table-based memory segmentation, paging, and privilege-checking. The segmentation function takes 32-bit effective addresses and 16-bit segment selectors and produces 32-bit linear addresses into one of 16K memory segments, each of which can be up to 4GB in size. Paging is optional. The 32-bit physical addresses are either produced by the paging function or the linear addresses are used without modification as physical addresses.
- Virtual-8086 Mode--Virtual-8086 mode supports 16-bit programs running as tasks under protected mode. 20-bit linear addresses are formed in the same way as in real mode, but they can optionally be translated through the paging function to form 32-bit physical addresses that access up to 4GB of memory space.
- Real Mode--Real mode supports 16-bit programs using register-based shift-and-add segmentation, but it does not support paging. Sixteen-bit effective addresses are zero-extended and added to a 16-bit segment-base address that is left-shifted four bits, producing a 20-bit linear address. The linear address is zero-extended to a 32-bit physical address that can access up to 1MB of memory space.

2.2 Memory Addressing

2.2.1 Byte Ordering

Instructions and data are stored in memory in little-endian byte order. Little-endian ordering places the least-significant byte of the instruction or data item at the lowest memory address and the most-significant byte at the highest memory address.

Figure 2-5 on page 17 shows a generalization of little-endian memory and register images of a quadword data type. The least-significant byte is at the lowest address in memory and at the right-most byte location of the register image.
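The little-endian layout of a quadword can be checked with a short Python sketch. The function name and the sample value are chosen here for illustration only; the sketch uses the standard int.to_bytes and int.from_bytes methods to produce the memory image, lowest address first.

```python
def quadword_memory_image(value: int) -> bytes:
    """Return the 8 bytes of a quadword as stored in memory, lowest address first."""
    return value.to_bytes(8, byteorder="little")


value = 0x1122334455667788           # register image: 11 is the most-significant byte
image = quadword_memory_image(value)

# image[0] is the byte at the lowest address, image[7] the byte at the highest.
lowest_address_byte = image[0]
highest_address_byte = image[7]
round_trip = int.from_bytes(image, byteorder="little")
```

Here the byte at the lowest address is 88h and the byte at the highest address is 11h, so reading the memory image from low to high addresses gives the register image in reverse byte order.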
Figure 2-5. Byte Ordering (quadword in memory: byte 0, the low (least-significant) byte, at address 00h through byte 7, the high (most-significant) byte, at address 07h; quadword in general-purpose register: byte 7 at the left, ending at bit 63, through byte 0 at the right, ending at bit 0.)

Figure 2-6 on page 18 shows the memory image of a 10-byte instruction. Instructions are byte data types. They are read from memory one byte at a time, starting with the least-significant byte (lowest address). For example, the 64-bit instruction MOV RAX, 1122334455667788 consists of the following ten bytes:

48 B8 8877665544332211

48 is a REX instruction prefix that specifies a 64-bit operand size, B8 is the opcode that--together with the REX prefix--specifies the 64-bit RAX destination register, and 8877665544332211 is the 8-byte immediate value to be moved, where 88 represents the eighth (least-significant) byte and 11 represents the first (most-significant) byte. In memory, the REX prefix byte (48) would be stored at the lowest address, and the first immediate byte (11) would be stored at the highest instruction address.

Figure 2-6. Example of 10-Byte Instruction in Memory (from the low address 00h to the high address 09h: 48, B8, 88, 77, 66, 55, 44, 33, 22, 11.)

2.2.2 64-bit Canonical Addresses

Long mode defines 64 bits of virtual address, but implementations of the AMD64 architecture may support fewer bits of virtual address. Although implementations might not use all 64 bits of the virtual address, they check bits 63 through the most-significant implemented bit to see if those bits are all zeros or all ones.
An address that complies with this property is said to be in canonical address form. If a virtual-memory reference is not in canonical form, the implementation causes a general-protection exception or stack fault.

2.2.3 Effective Addresses

Programs provide effective addresses to the hardware prior to segmentation and paging translations. Long-mode effective addresses are a maximum of 64 bits wide, as shown in Figure 2-3 on page 14. Programs running in compatibility mode generate (by default) 32-bit effective addresses, which the hardware zero-extends to 64 bits. Legacy-mode effective addresses, with no address-size override, are 32 or 16 bits wide, as shown in Figure 2-4. These sizes can be overridden with an address-size instruction prefix, as described in "Instruction Prefixes" on page 85.

There are five methods for generating effective addresses, depending on the specific instruction encoding:

- Absolute Addresses--These addresses are given as displacements (or offsets) from the base address of a data segment. They point directly to a memory location in the data segment.
- Instruction-Relative Addresses--These addresses are given as displacements (or offsets) from the current instruction pointer (IP), also called the program counter (PC). They are generated by control-transfer instructions. A displacement in the instruction encoding, or one read from memory, serves as an offset from the address that follows the transfer. See "RIP-Relative Addressing" on page 22 for details about RIP-relative addressing in 64-bit mode.
- ModR/M Addressing--These addresses are calculated using a scale, index, base, and displacement. Instruction encodings contain two bytes--ModR/M and optional SIB (scale, index, base)--and a variable-length displacement, which together specify the variables for the calculation. The base and index values are contained in general-purpose registers specified by the SIB byte.
  The scale and displacement values are specified directly in the instruction encoding. Figure 2-7 shows the components of a complex-address calculation. The resultant effective address is added to the data-segment base address to form a linear address, as described in "Segmented Virtual Memory" in Volume 2. "Instruction Formats" in Volume 3 gives further details on specifying this form of address. The encoding of instructions specifies how the address is calculated.

Figure 2-7. Complex Address Calculation (Protected Mode) (effective address = base + (index * scale) + displacement, where the index is scaled by 1, 2, 4, or 8.)

- Stack Addresses--PUSH, POP, CALL, RET, IRET, and INT instructions implicitly use the stack pointer, which contains the address of the procedure stack. See "Stack Operation" on page 23 for details about the size of the stack pointer.
- String Addresses--String instructions generate sequential addresses using the rDI and rSI registers, as described in "Implicit Uses of GPRs" on page 34.

In 64-bit mode, with no address-size override, the size of effective-address calculations is 64 bits. An effective-address calculation uses 64-bit base and index registers and sign-extends displacements to 64 bits. Due to the flat address space in 64-bit mode, virtual addresses are equal to effective addresses. (For an exception to this general rule, see "FS and GS as Base of Address Calculation" on page 20.)

Long-Mode Zero-Extension of 16-Bit and 32-Bit Addresses. In long mode, all 16-bit and 32-bit address calculations are zero-extended to form 64-bit addresses. Address calculations are first truncated to the effective-address size of the current mode (64-bit mode or compatibility mode), as overridden by any address-size prefix. The result is then zero-extended to the full 64-bit address width.
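The scale-index-base calculation and the truncate-then-zero-extend rule described above can be combined in one short Python sketch. The function name and parameters are hypothetical, register contents are passed in as plain integers, and a negative displacement stands for a sign-extended one; this is a model of the arithmetic only and not of instruction decoding.

```python
def effective_address(base: int, index: int, scale: int, disp: int,
                      addr_size_bits: int) -> int:
    """Compute base + index * scale + disp, truncated to the effective-address size.

    The returned Python integer has all bits above addr_size_bits clear, which
    corresponds to zero-extension to the full 64-bit address width.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    if addr_size_bits not in (16, 32, 64):
        raise ValueError("effective-address size must be 16, 32, or 64 bits")
    mask = (1 << addr_size_bits) - 1
    # Truncate the sum to the effective-address size; a carry out is discarded.
    return (base + index * scale + disp) & mask


# The same operands with a 32-bit and a 64-bit effective-address size.
wrapped_32 = effective_address(0xFFFF_FFFF, 0, 1, 1, addr_size_bits=32)
full_64 = effective_address(0xFFFF_FFFF, 0, 1, 1, addr_size_bits=64)
element = effective_address(0x1000, 0x10, 8, -4, addr_size_bits=64)
```

With a 32-bit effective-address size the sum wraps to 0 before zero-extension, whereas the 64-bit calculation yields 1_0000_0000h, which is why a 32-bit address cannot reach above the low 4GB.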
Because of this zero-extension, 16-bit and 32-bit applications running in compatibility mode can access only the low 4GB of the long-mode virtual-address space. Likewise, a 32-bit address generated in 64-bit mode can access only the low 4GB of the long-mode virtual-address space.

Displacements and Immediates. In general, the maximum size of address displacements and immediate operands is 32 bits. They can be 8, 16, or 32 bits in size, depending on the instruction or, for displacements, the effective address size. In 64-bit mode, displacements are sign-extended to 64 bits during use, but their actual size (for value representation) remains a maximum of 32 bits. The same is true for immediates in 64-bit mode, when the operand size is 64 bits. However, support is provided in 64-bit mode for some 64-bit displacement and immediate forms of the MOV instruction.

FS and GS as Base of Address Calculation. In 64-bit mode, the FS and GS segment-base registers (unlike the DS, ES, and SS segment-base registers) can be used as non-zero data-segment base registers for address calculations, as described in "Segmented Virtual Memory" in Volume 2. 64-bit mode assumes all other data-segment registers (DS, ES, and SS) have a base address of 0.

2.2.4 Address-Size Prefix

The default address size of an instruction is determined by the default-size (D) bit and long-mode (L) bit in the current code-segment descriptor (for details, see "Segmented Virtual Memory" in Volume 2). Application software can override the default address size in any operating mode by using the 67h address-size instruction prefix byte. The address-size prefix allows mixing 32-bit and 64-bit addresses on an instruction-by-instruction basis.

Table 2-1 shows the effects of using the address-size prefix in all operating modes. In 64-bit mode, the default address size is 64 bits. The address size can be overridden to 32 bits.
16-bit addresses are not supported in 64-bit mode. In compatibility and legacy modes, the address-size prefix works the same as in the legacy x86 architecture.

Table 2-1. Address-Size Prefixes

Operating Mode                  Default Address   Effective Address   Address-Size Prefix
                                Size (Bits)       Size (Bits)         (67h) Required? [1]
Long Mode: 64-Bit Mode          64                64                  no
                                64                32                  yes
Long Mode: Compatibility Mode   32                32                  no
                                32                16                  yes
                                16                32                  yes
                                16                16                  no
Legacy Mode (Protected,         32                32                  no
Virtual-8086, or Real Mode)     32                16                  yes
                                16                32                  yes
                                16                16                  no

Note:
1. "No" indicates that the default address size is used.

2.2.5 RIP-Relative Addressing

RIP-relative addressing--that is, addressing relative to the 64-bit instruction pointer (also called program counter)--is available in 64-bit mode. The effective address is formed by adding the displacement to the 64-bit RIP of the next instruction.

In the legacy x86 architecture, addressing relative to the instruction pointer (IP or EIP) is available only in control-transfer instructions. In 64-bit mode, any instruction that uses ModRM addressing (see "ModRM and SIB Bytes" in Volume 3) can use RIP-relative addressing. The feature is particularly useful for addressing data in position-independent code and for code that addresses global data.

Programs usually have many references to data, especially global data, that are not register-based. To load such a program, the loader typically selects a location for the program in memory and then adjusts the program's references to global data based on the load location. RIP-relative addressing of data makes this adjustment unnecessary.

Range of RIP-Relative Addressing. Without RIP-relative addressing, instructions encoded with a ModRM byte address memory relative to zero. With RIP-relative addressing, instructions with a ModRM byte can address memory relative to the 64-bit RIP using a signed 32-bit displacement. This provides an offset range of ±2GB from the RIP.
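As a rough illustration of the range rule above, the following sketch computes a RIP-relative effective address from the RIP of the next instruction and a signed 32-bit displacement, and also derives the displacement needed to reach a given target. The helper names (rip_relative_address, disp32_for) and the sample addresses are hypothetical; they are not taken from the manual or from any assembler.

```python
MASK64 = (1 << 64) - 1
DISP32_MIN = -(1 << 31)
DISP32_MAX = (1 << 31) - 1


def rip_relative_address(next_rip, disp32):
    """Add the sign-extended 32-bit displacement to the next instruction's RIP."""
    disp = disp32 & 0xFFFF_FFFF
    if disp & 0x8000_0000:
        disp -= 1 << 32  # sign-extend the 32-bit field
    return (next_rip + disp) & MASK64


def disp32_for(next_rip, target):
    """Return the 32-bit displacement field that reaches `target`.

    Raises ValueError when the target lies outside the signed 32-bit
    (roughly +/-2GB) range around the next instruction.
    """
    delta = target - next_rip
    if not (DISP32_MIN <= delta <= DISP32_MAX):
        raise ValueError("target is out of RIP-relative range")
    return delta & 0xFFFF_FFFF


# A data item 0xFF9 bytes after the next instruction
print(hex(rip_relative_address(0x40_1007, 0x0000_0FF9)))  # 0x402000
# A negative displacement (-7) reaches backward
print(hex(rip_relative_address(0x40_1007, 0xFFFF_FFF9)))  # 0x401000
```

Because the displacement depends only on the distance between the instruction and its data, it stays valid when the whole program is loaded at a different address.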
Effect of Address-Size Prefix on RIP-relative Addressing. RIP-relative addressing is enabled by 64-bit mode, not by a 64-bit address-size. Conversely, use of the address-size prefix does not disable RIP-relative addressing. The effect of the address-size prefix is to truncate the computed effective address to 32 bits and zero-extend it, as with any other addressing mode.

Encoding. For details on instruction encoding of RIP-relative addressing, see "RIP-Relative Addressing" in Volume 3.

2.3 Pointers

Pointers are variables that contain addresses rather than data. They are used by instructions to reference memory. Instructions access data using near and far pointers. Stack pointers locate the current stack.

2.3.1 Near and Far Pointers

Near pointers contain only an effective address, which is used as an offset into the current segment. Far pointers contain both an effective address and a segment selector that specifies one of several segments. Figure 2-8 illustrates the two types of pointers.

[Figure 2-8. Near and Far Pointers: a near pointer holds an effective address (EA); a far pointer holds a selector and an effective address (EA)]

In 64-bit mode, the AMD64 architecture supports only the flat-memory model in which there is only one data segment, so the effective address is used as the virtual (linear) address and far pointers are not needed. In compatibility mode and legacy protected mode, the AMD64 architecture supports multiple memory segments, so effective addresses can be combined with segment selectors to form far pointers, and the terms logical address (segment selector and effective address) and far pointer are synonyms. Near pointers can also be used in compatibility mode and legacy mode.

2.4 Stack Operation

A stack is a portion of a stack segment in memory that is used to link procedures.
Software conventions typically define stacks using a stack frame, which consists of two registers--a stack-frame base pointer (rBP) and a stack pointer (rSP)--as shown in Figure 2-9 on page 24. These stack pointers can be either near pointers or far pointers.

The stack-segment (SS) register points to the base address of the current stack segment. The stack pointers contain offsets from the base address of the current stack segment. All instructions that address memory using the rBP or rSP registers cause the processor to access the current stack segment.

[Figure 2-9. Stack Pointer Mechanism: before a procedure call, the stack-frame base pointer (rBP) and stack pointer (rSP) point to the same location above the stack-segment (SS) base address; after the call, rBP is unchanged and rSP points below the passed data]

In typical APIs, the stack-frame base pointer and the stack pointer point to the same location before a procedure call (the top-of-stack of the prior stack frame). After data is pushed onto the stack, the stack-frame base pointer remains where it was and the stack pointer advances downward to the address below the pushed data, where it becomes the new top-of-stack.

In legacy and compatibility modes, the default stack pointer size is 16 bits (SP) or 32 bits (ESP), depending on the default-size (B) bit in the stack-segment descriptor, and multiple stacks can be maintained in separate stack segments. In 64-bit mode, stack pointers are always 64 bits wide (RSP).

Further application-programming details on the stack mechanism are described in "Control Transfers" on page 93. System-programming details on the stack segments are described in "Segmented Virtual Memory" in Volume 2.

2.5 Instruction Pointer

The instruction pointer is used in conjunction with the code-segment (CS) register to locate the next instruction in memory.
The instruction-pointer register contains the displacement (offset)--from the base address of the current CS segment, or from address 0 in 64-bit mode--to the next instruction to be executed. The pointer is incremented sequentially, except for branch instructions, as described in "Control Transfers" on page 93.

In legacy and compatibility modes, the instruction pointer is a 16-bit (IP) or 32-bit (EIP) register. In 64-bit mode, the instruction pointer is extended to a 64-bit (RIP) register to support 64-bit offsets. The case-sensitive acronym, rIP, is used to refer to any of these three instruction-pointer sizes, depending on the software context.

Figure 2-10 shows the relationship between RIP, EIP, and IP. The 64-bit RIP can be used for RIP-relative addressing, as described in "RIP-Relative Addressing" on page 22.

[Figure 2-10. Instruction Pointer (rIP) Register: IP is bits 15-0, EIP is bits 31-0, and RIP is bits 63-0 of rIP]

The contents of the rIP are not directly readable by software. However, the rIP is pushed onto the stack by a call instruction.

The memory model described in this chapter is used by all of the programming environments that make up the AMD64 architecture. The next four chapters of this volume describe the application programming environments, which include:

    General-purpose programming (Chapter 3 on page 27).
    128-bit media programming (Chapter 4 on page 127).
    64-bit media programming (Chapter 5 on page 229).
    x87 floating-point programming (Chapter 6 on page 285).

3 General-Purpose Programming

The general-purpose programming model includes the general-purpose registers (GPRs), integer instructions and operands that use the GPRs, program-flow control methods, memory optimization methods, and I/O.
This programming model includes the original x86 integer-programming architecture, plus 64-bit extensions and a few additional instructions. Only the application-programming instructions and resources are described in this chapter. Integer instructions typically used in system programming, including all of the privileged instructions, are described in Volume 2, along with other system-programming topics.

The general-purpose programming model is used to some extent by almost all programs, including programs consisting primarily of 128-bit media instructions, 64-bit media instructions, x87 floating-point instructions, or system instructions. For this reason, an understanding of the general-purpose programming model is essential for any programming work using the AMD64 instruction set architecture.

3.1 Registers

Figure 3-1 on page 28 shows an overview of the registers used in general-purpose application programming. They include the general-purpose registers (GPRs), segment registers, flags register, and instruction-pointer register. The number and width of available registers depends on the operating mode.

The registers and register ranges shaded light gray in Figure 3-1 are available only in 64-bit mode. Those shaded dark gray are available only in legacy mode and compatibility mode. Thus, in 64-bit mode, the 32-bit general-purpose, flags, and instruction-pointer registers available in legacy mode and compatibility mode are extended to 64-bit widths, eight new GPRs are available, and the DS, ES, and SS segment registers are ignored.

When naming registers, if reference is made to multiple register widths, a lower-case r notation is used. For example, the notation rAX refers to the 16-bit AX, 32-bit EAX, or 64-bit RAX register, depending on an instruction's effective operand size.

[Figure 3-1. General-Purpose Programming Registers: GPRs (rAX, rBX, rCX, rDX, rBP, rSI, rDI, rSP, R8-R15; bits 63-0), segment registers (CS, DS, ES, FS, GS, SS; bits 15-0), and the flags and instruction-pointer registers (rFLAGS, rIP; bits 63-0). Legend: available to software in all modes; available to software only in 64-bit mode; ignored by hardware in 64-bit mode]

3.1.1 Legacy Registers

In legacy and compatibility modes, all of the legacy x86 registers are available. Figure 3-2 shows a detailed view of the GPR, flag, and instruction-pointer registers.

[Figure 3-2. General Registers in Legacy and Compatibility Modes]

Register   High 8-Bit   Low 8-Bit   16-Bit   32-Bit
Encoding   (Encoding)
0          AH (4)       AL          AX       EAX
3          BH (7)       BL          BX       EBX
1          CH (5)       CL          CX       ECX
2          DH (6)       DL          DX       EDX
6                                   SI       ESI
7                                   DI       EDI
5                                   BP       EBP
4                                   SP       ESP

FLAGS occupies bits 15-0 of the 32-bit EFLAGS register, and IP occupies bits 15-0 of the 32-bit EIP register.

The legacy GPRs include:

    Eight 8-bit registers (AH, AL, BH, BL, CH, CL, DH, DL).
    Eight 16-bit registers (AX, BX, CX, DX, DI, SI, BP, SP).
    Eight 32-bit registers (EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP).

The size of register used by an instruction depends on the effective operand size or, for certain instructions, the opcode, address size, or stack size. The 16-bit and 32-bit registers are encoded as 0 through 7 in Figure 3-2. For opcodes that specify a byte operand, registers encoded as 0 through 3 refer to the low-byte registers (AL, BL, CL, DL) and registers encoded as 4 through 7 refer to the high-byte registers (AH, BH, CH, DH).

The 16-bit FLAGS register, which is also the low 16 bits of the 32-bit EFLAGS register, shown in Figure 3-2, contains control and status bits accessible to application software, as described in Section 3.1.4, "Flags Register," on page 37. The 16-bit IP or 32-bit EIP instruction-pointer register contains the address of the next instruction to be executed, as described in Section 2.5, "Instruction Pointer," on page 24.

3.1.2 64-Bit-Mode Registers

In 64-bit mode, eight new GPRs are added to the eight legacy GPRs, all 16 GPRs are 64 bits wide, and the low bytes of all registers are accessible. Figure 3-3 on page 31 shows the GPRs, flags register, and instruction-pointer register available in 64-bit mode. The GPRs include:

    Sixteen 8-bit low-byte registers (AL, BL, CL, DL, SIL, DIL, BPL, SPL, R8B, R9B, R10B, R11B, R12B, R13B, R14B, R15B).
    Four 8-bit high-byte registers (AH, BH, CH, DH), addressable only when no REX prefix is used.
    Sixteen 16-bit registers (AX, BX, CX, DX, DI, SI, BP, SP, R8W, R9W, R10W, R11W, R12W, R13W, R14W, R15W).
    Sixteen 32-bit registers (EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP, R8D, R9D, R10D, R11D, R12D, R13D, R14D, R15D).
    Sixteen 64-bit registers (RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP, R8, R9, R10, R11, R12, R13, R14, R15).

The size of register used by an instruction depends on the effective operand size or, for certain instructions, the opcode, address size, or stack size. For most instructions, access to the extended GPRs requires a REX prefix (Section 3.5.2, "REX Prefixes," on page 89). The four high-byte registers (AH, BH, CH, DH) available in legacy mode are not addressable when a REX prefix is used.

In general, byte and word operands are stored in the low 8 or 16 bits of GPRs without modifying their high 56 or 48 bits, respectively. Doubleword operands, however, are normally stored in the low 32 bits of GPRs and zero-extended to 64 bits.

The 64-bit RFLAGS register, shown in Figure 3-3 on page 31, contains the legacy EFLAGS in its low 32-bit range. The high 32 bits are reserved. They can be written with anything but they always read as zero (RAZ).
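The GPR write rules described above for byte, word, and doubleword operands can be modeled with a short sketch. The function names (write_gpr, add_gpr) are hypothetical, the 8-bit case models only the low-byte registers (not AH, BH, CH, or DH), and the operand values are borrowed from the ADD examples that appear later in this section.

```python
MASK64 = (1 << 64) - 1


def write_gpr(old, result, operand_bits):
    """Return the new 64-bit register value after writing `result`.

    64-bit operands replace the register, 32-bit operands are zero-extended,
    and 8-bit (low-byte) and 16-bit operands preserve the upper bits.
    """
    if operand_bits == 64:
        return result & MASK64
    if operand_bits == 32:
        return result & 0xFFFF_FFFF            # upper 32 bits become zero
    if operand_bits in (8, 16):
        mask = (1 << operand_bits) - 1
        return (old & ~mask & MASK64) | (result & mask)
    raise ValueError("operand size must be 8, 16, 32, or 64 bits")


def add_gpr(dest, src, operand_bits):
    """Model ADD dest, src at the given operand size (flags are not modeled)."""
    return write_gpr(dest, dest + src, operand_bits)


RAX = 0x0002_0001_8000_2201
RBX = 0x0002_0002_0123_3301
for bits in (64, 32, 16, 8):
    print(bits, format(add_gpr(RBX, RAX, bits), "016X"))
# 64 0004000381235502
# 32 0000000081235502
# 16 0002000201235502
# 8 0002000201233302
```

The asymmetry between the 32-bit case and the 8-bit and 16-bit cases is the point of the sketch: only a doubleword write clears the upper half of the register.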
The 64-bit RIP instruction-pointer register contains the address of the next instruction to be executed, as described in Section 3.1.5, "Instruction Pointer Register," on page 41.

[Figure 3-3. General Registers in 64-Bit Mode. Bits 63-8 are not modified for 8-bit operands, bits 63-16 are not modified for 16-bit operands, and bits 63-32 are zero-extended for 32-bit operands.]

Register   High 8-Bit   Low 8-Bit   16-Bit      32-Bit      64-Bit
Encoding
0          AH*          AL          AX          EAX         RAX
3          BH*          BL          BX          EBX         RBX
1          CH*          CL          CX          ECX         RCX
2          DH*          DL          DX          EDX         RDX
6                       SIL**       SI          ESI         RSI
7                       DIL**       DI          EDI         RDI
5                       BPL**       BP          EBP         RBP
4                       SPL**       SP          ESP         RSP
8-15                    R8B-R15B    R8W-R15W    R8D-R15D    R8-R15

RFLAGS and RIP are 64 bits wide (bits 63-0).

* Not addressable when a REX prefix is used.
** Only addressable when a REX prefix is used.

Figure 3-4 on page 32 illustrates another way of viewing the 64-bit-mode GPRs, showing how the legacy GPRs overlap the extended GPRs. Gray-shaded bits are not modified in 64-bit mode.

[Figure 3-4. GPRs in 64-Bit Mode: each 64-bit register (RAX through R15, register encodings 0 through 15) contains its 32-bit form in bits 31-0, its 16-bit form in bits 15-0, and its low 8-bit form in bits 7-0; AH, BH, CH, and DH occupy bits 15-8. Gray areas are not modified in 64-bit mode. * Not addressable when a REX prefix is used. ** Only addressable when a REX prefix is used.]

Default Operand Size. For most instructions, the default operand size in 64-bit mode is 32 bits.
To access 16-bit operand sizes, an instruction must contain an operand-size prefix (66h), as described in Section 3.2.2, "Operand Sizes and Overrides," on page 44. To access the full 64-bit operand size, most instructions must contain a REX prefix. For details on operand size, see Section 3.2.2, "Operand Sizes and Overrides," on page 44.

Byte Registers. 64-bit mode provides a uniform set of low-byte, low-word, low-doubleword, and quadword registers that is well-suited for register allocation by compilers. Access to the four new low-byte registers in the legacy-GPR range (SIL, DIL, BPL, SPL), or any of the low-byte registers in the extended registers (R8B-R15B), requires a REX instruction prefix. However, the legacy high-byte registers (AH, BH, CH, DH) are not accessible when a REX prefix is used.

Zero-Extension of 32-Bit Results. As Figure 3-3 and Figure 3-4 show, when performing 32-bit operations with a GPR destination in 64-bit mode, the processor zero-extends the 32-bit result into the full 64-bit destination. 8-bit and 16-bit operations on GPRs preserve all unwritten upper bits of the destination GPR. This is consistent with legacy 16-bit and 32-bit semantics for partial-width results.

Software should explicitly sign-extend the results of 8-bit, 16-bit, and 32-bit operations to the full 64-bit width before using the results in 64-bit address calculations.

The following four code examples show how 64-bit, 32-bit, 16-bit, and 8-bit ADDs work. In these examples, "48" is a REX prefix specifying 64-bit operand size, and "01C3" and "00C3" are the opcode and ModRM bytes of each instruction (see "Opcode Syntax" in Volume 3 for details on the opcode and ModRM encoding).

Example 1: 64-bit Add:
    Before: RAX = 0002_0001_8000_2201
            RBX = 0002_0002_0123_3301
    48 01C3   ADD RBX,RAX   ;48 is a REX prefix for size.
    Result: RBX = 0004_0003_8123_5502

Example 2: 32-bit Add:
    Before: RAX = 0002_0001_8000_2201
            RBX = 0002_0002_0123_3301
    01C3      ADD EBX,EAX   ;32-bit add
    Result: RBX = 0000_0000_8123_5502 (32-bit result is zero extended)

Example 3: 16-bit Add:
    Before: RAX = 0002_0001_8000_2201
            RBX = 0002_0002_0123_3301
    66 01C3   ADD BX,AX     ;66 is 16-bit size override
    Result: RBX = 0002_0002_0123_5502 (bits 63:16 are preserved)

Example 4: 8-bit Add:
    Before: RAX = 0002_0001_8000_2201
            RBX = 0002_0002_0123_3301
    00C3      ADD BL,AL     ;8-bit add
    Result: RBX = 0002_0002_0123_3302 (bits 63:08 are preserved)

GPR High 32 Bits Across Mode Switches. The processor does not preserve the upper 32 bits of the 64-bit GPRs across switches from 64-bit mode to compatibility or legacy modes. When using 32-bit operands in compatibility or legacy mode, the high 32 bits of GPRs are undefined. Software must not rely on these undefined bits, because they can change from one implementation to the next or even on a cycle-to-cycle basis within a given implementation. The undefined bits are not a function of the data left by any previously running process.

3.1.3 Implicit Uses of GPRs

Most instructions can use any of the GPRs for operands. However, as Table 3-1 shows, some instructions use some GPRs implicitly. Details about implicit use of GPRs are described in "General-Purpose Instruction Reference" in Volume 3.

Table 3-1 on page 35 shows implicit register uses only for application instructions. Certain system instructions also make implicit use of registers. These system instructions are described in "System Instruction Reference" in Volume 3.

Table 3-1. Implicit Uses of Legacy GPRs

Each entry lists the registers [1] (low 8-bit, 16-bit, 32-bit, 64-bit), the register name, and the implicit uses.

AL, AX, EAX, RAX [2] -- Accumulator
    * Operand for decimal arithmetic, multiply, divide, string, compare-and-exchange, table-translation, and I/O instructions.
    * Special accumulator encoding for ADD, XOR, and MOV instructions.
    * Used with EDX to hold double-precision operands.
    * CPUID processor-feature information.

BL, BX, EBX, RBX [2] -- Base
    * Address generation in 16-bit code.
    * Memory address for XLAT instruction.
    * CPUID processor-feature information.

CL, CX, ECX, RCX [2] -- Count
    * Bit index for shift instructions.
    * Iteration count for loop and repeated string instructions.
    * Jump conditional if zero.
    * CPUID processor-feature information.

DL, DX, EDX, RDX [2] -- I/O Address
    * Operand for multiply and divide instructions.
    * Port number for I/O instructions.
    * Used with EAX to hold double-precision operands.
    * CPUID processor-feature information.

SIL [2], SI, ESI, RSI [2] -- Source Index
    * Memory address of source operand for string instructions.
    * Memory index for 16-bit addresses.

DIL [2], DI, EDI, RDI [2] -- Destination Index
    * Memory address of destination operand for string instructions.
    * Memory index for 16-bit addresses.

BPL [2], BP, EBP, RBP [2] -- Base Pointer
    * Memory address of stack-frame base pointer.

SPL [2], SP, ESP, RSP [2] -- Stack Pointer
    * Memory address of last stack entry (top of stack).

R8B-R15B [2], R8W-R15W [2], R8D-R15D [2], R8-R15 [2] -- None
    No implicit uses.

Note:
1. Gray-shaded registers have no implicit uses.
2. Accessible only in 64-bit mode.

Arithmetic Operations. Several forms of the add, subtract, multiply, and divide instructions use AL or rAX implicitly. The multiply and divide instructions also use the concatenation of rDX:rAX for double-sized results (multiplies) or quotient and remainder (divides).

Sign-Extensions. The instructions that double the size of operands by sign extension (for example, CBW, CWDE, CDQE, CWD, CDQ, CQO) use the rAX register implicitly for the operand. The CWD, CDQ, and CQO instructions also use the rDX register.

Special MOVs. The MOV instruction has several opcodes that implicitly use the AL or rAX register for one operand.

String Operations.
Many types of string instructions use the accumulators implicitly. Load string, store string, and scan string instructions use AL or rAX for data and rDI or rSI for the offset of a memory address.

I/O-Address-Space Operations. The I/O and string I/O instructions use rAX to hold data that is received from or sent to a device located in the I/O-address space. DX holds the device I/O-address (the port number).

Table Translations. The table translate instruction (XLATB) uses AL for a memory index and rBX for the memory base address.

Compares and Exchanges. Compare and exchange instructions (CMPXCHG) use the AL or rAX register for one operand.

Decimal Arithmetic. The decimal arithmetic instructions (AAA, AAD, AAM, AAS, DAA, DAS) that adjust binary-coded decimal (BCD) operands implicitly use the AL and AH registers for their operations.

Shifts and Rotates. Shift and rotate instructions can use the CL register to specify the number of bits an operand is to be shifted or rotated.

Conditional Jumps. Special conditional-jump instructions use the rCX register instead of flags. The JCXZ and JrCXZ instructions check the value of the rCX register and pass control to the target instruction when the value of the rCX register reaches 0.

Repeated String Operations. With the exception of I/O string instructions, all string operations use rSI as the source-operand pointer and rDI as the destination-operand pointer. I/O string instructions use rDX to specify the input-port or output-port number. For repeated string operations (those preceded with a repeat-instruction prefix), the rSI and rDI registers are incremented or decremented as the string elements are moved from the source location to the destination. Repeat-string operations also use rCX to hold the string length, and decrement it as data is moved from one location to the other.

Stack Operations.
Stack operations make implicit use of the rSP register, and in some cases, the rBP register. The rSP register is used to hold the top-of-stack pointer (or simply, stack pointer). rSP is decremented when items are pushed onto the stack, and incremented when they are popped off the stack. The ENTER and LEAVE instructions use rBP as a stack-frame base pointer. Here, rBP points to the last entry in a data structure that is passed from one block-structured procedure to another.

The use of rSP or rBP as a base register in an address calculation implies the use of SS (stack segment) as the default segment. Using any other GPR as a base register without a segment-override prefix implies the use of the DS data segment as the default segment.

The push all and pop all instructions (PUSHA, PUSHAD, POPA, POPAD) implicitly use all of the GPRs.

CPUID Information. The CPUID instruction makes implicit use of the EAX, EBX, ECX, and EDX registers. Software loads a function code into EAX, executes the CPUID instruction, and then reads the associated processor-feature information in EAX, EBX, ECX, and EDX.

3.1.4 Flags Register

Figure 3-5 on page 38 shows the 64-bit RFLAGS register and the flag bits visible to application software. Bits 15-0 are the FLAGS register (accessed in legacy real and virtual-8086 modes), bits 31-0 are the EFLAGS register (accessed in legacy protected mode and compatibility mode), and bits 63-0 are the RFLAGS register (accessed in 64-bit mode). The name rFLAGS refers to any of the three register widths, depending on the current software context.

[Figure 3-5. rFLAGS Register--Flags Visible to Application Software. Bits 63-32 are reserved and read as zero (RAZ). See Volume 2 for the system flags. Bits not listed below are reserved or system flags.]

Symbol   Description            Bit
OF       Overflow Flag          11
DF       Direction Flag         10
SF       Sign Flag              7
ZF       Zero Flag              6
AF       Auxiliary Carry Flag   4
PF       Parity Flag            2
CF       Carry Flag             0

The low 16 bits (FLAGS portion) of rFLAGS are accessible by application software and hold the following flags:

    One control flag (the direction flag DF).
    Six status flags (carry flag CF, parity flag PF, auxiliary carry flag AF, zero flag ZF, sign flag SF, and overflow flag OF).

The direction flag (DF) controls the direction of string operations. The status flags provide result information from logical and arithmetic operations and control information for conditional move and jump instructions.

Bits 31-16 of the rFLAGS register contain flags that are accessible only to system software. These flags are described in "System Registers" in Volume 2. The highest 32 bits of RFLAGS are reserved. In 64-bit mode, writes to these bits are ignored. They are read as zeros (RAZ). The rFLAGS register is initialized to 02h on reset, so that all of the programmable bits are cleared to zero.

The effects that rFLAGS bit-values have on instructions are summarized in the following places:

    Conditional Moves (CMOVcc)--Table 3-4 on page 51.
    Conditional Jumps (Jcc)--Table 3-5 on page 66.
    Conditional Sets (SETcc)--Table 3-6 on page 71.

The effects that instructions have on rFLAGS bit-values are summarized in "Instruction Effects on RFLAGS" in Volume 3.

The sections below describe each application-visible flag. All of these flags are readable and writable. For example, the POPF, POPFD, POPFQ, IRET, IRETD, and IRETQ instructions write all flags. The carry and direction flags are writable by dedicated application instructions. Other application-visible flags are written indirectly by specific instructions. Reserved bits and bits whose writability is prevented by the current values of system flags, current privilege level (CPL), or the current operating mode, are unaffected by the POPFx instructions.

Carry Flag (CF). Bit 0.
Hardware sets the carry flag to 1 if the last integer addition or subtraction operation resulted in a carry (for addition) or a borrow (for subtraction) out of the most-significant bit position of the result. Otherwise, hardware clears the flag to 0.

The increment and decrement instructions--unlike the addition and subtraction instructions--do not affect the carry flag. The bit shift and bit rotate instructions shift bits of operands into the carry flag. Logical instructions like AND, OR, XOR clear the carry flag. Bit-test instructions (BTx) set the value of the carry flag depending on the value of the tested bit of the operand.

Software can set or clear the carry flag with the STC and CLC instructions, respectively. Software can complement the flag with the CMC instruction.

Parity Flag (PF). Bit 2. Hardware sets the parity flag to 1 if there is an even number of 1 bits in the least-significant byte of the last result of certain operations. Otherwise (i.e., for an odd number of 1 bits), hardware clears the flag to 0. Software can read the flag to implement parity checking.

Auxiliary Carry Flag (AF). Bit 4. Hardware sets the auxiliary carry flag to 1 if the last binary-coded decimal (BCD) operation resulted in a carry (for addition) or a borrow (for subtraction) out of bit 3. Otherwise, hardware clears the flag to 0.

The main application of this flag is to support decimal arithmetic operations. Most commonly, this flag is used internally by correction commands for decimal addition (AAA) and subtraction (AAS).

Zero Flag (ZF). Bit 6. Hardware sets the zero flag to 1 if the last arithmetic operation resulted in a value of zero. Otherwise (for a non-zero result), hardware clears the flag to 0. The compare and test instructions also affect the zero flag.
The zero flag is typically used to test whether the result of an arithmetic or logical operation is zero, or to test whether two operands are equal.

Sign Flag (SF). Bit 7. Hardware sets the sign flag to 1 if the last arithmetic operation resulted in a negative value. Otherwise (for a zero or positive-valued result), hardware clears the flag to 0. Thus, in such operations, the value of the sign flag is set equal to the value of the most-significant bit of the result. Depending on the size of operands, the most-significant bit is bit 7 (for bytes), bit 15 (for words), bit 31 (for doublewords), or bit 63 (for quadwords).

Direction Flag (DF). Bit 10. The direction flag determines the order in which strings are processed. Software can set the direction flag to 1 to specify decrementing the data pointer for the next string instruction (LODSx, STOSx, MOVSx, SCASx, CMPSx, OUTSx, or INSx). Clearing the direction flag to 0 specifies incrementing the data pointer. The pointers are stored in the rSI or rDI register. Software can set or clear the flag with the STD and CLD instructions, respectively.

Overflow Flag (OF). Bit 11. Hardware sets the overflow flag to 1 to indicate that the most-significant (sign) bit of the result of the last signed integer operation differed from the signs of both source operands. Otherwise, hardware clears the flag to 0. A set overflow flag means that the magnitude of the positive or negative result is too big (overflow) or too small (underflow) to fit its defined data type.

The OF flag is undefined after the DIV instruction and after a shift of more than one bit. Logical instructions clear the overflow flag.

3.1.5 Instruction Pointer Register

The instruction pointer register--IP, EIP, or RIP, or simply rIP for any of the three depending on the context--is used in conjunction with the code-segment (CS) register to locate the next instruction in memory.
See Section 2.5, "Instruction Pointer," on page 24 for details.

3.2 Operands

Operands are either referenced by an instruction's encoding or included as an immediate value in the instruction encoding. Depending on the instruction, referenced operands can be located in registers, memory locations, or I/O ports.

3.2.1 Data Types

Figure 3-6 on page 42 shows the register images of the general-purpose data types. In the general-purpose programming environment, these data types can be interpreted by instruction syntax or the software context as the following types of numbers and strings:

- Signed (two's-complement) integers.
- Unsigned integers.
- BCD digits.
- Packed BCD digits.
- Strings, including bit strings.

The double quadword data type is supported in the RDX:RAX registers by the MUL, IMUL, DIV, IDIV, and CQO instructions.

Software can interpret the data types in ways other than those shown in Figure 3-6 on page 42, but the AMD64 instruction set does not directly support such interpretations and software must handle them entirely on its own.

Table 3-2 on page 43 shows the range of representable values for the general-purpose data types.

[Figure 3-6. General-Purpose Data Types. Register images of signed and unsigned integers as byte (bits 7-0), word (2 bytes, bits 15-0), doubleword (4 bytes, bits 31-0), quadword (8 bytes, bits 63-0, 64-bit mode only), and double quadword (16 bytes, bits 127-0, 64-bit mode only), with the sign bit (s) in the most-significant bit of each signed type; also packed BCD (bits 7-0), BCD digit (bits 3-0), and bit.]

Signed and Unsigned Integers. The architecture supports signed and unsigned 1-byte, 2-byte, 4-byte, and 8-byte integers. The sign bit is stored in the most-significant bit.

Table 3-2. Representable Values of General-Purpose Data Types

Data Type             Signed Integers (1)        Unsigned Integers                      Packed BCD Digits                 BCD Digit
Byte                  -2^7 to +(2^7 - 1)         0 to +2^8 - 1 (0 to 255)               00 to 99                          0 to 9
Word                  -2^15 to +(2^15 - 1)       0 to +2^16 - 1 (0 to 65,535)           multiple packed BCD-digit bytes   multiple BCD-digit bytes
Doubleword            -2^31 to +(2^31 - 1)       0 to +2^32 - 1 (0 to 4.29 x 10^9)      multiple packed BCD-digit bytes   multiple BCD-digit bytes
Quadword              -2^63 to +(2^63 - 1)       0 to +2^64 - 1 (0 to 1.84 x 10^19)     multiple packed BCD-digit bytes   multiple BCD-digit bytes
Double Quadword (2)   -2^127 to +(2^127 - 1)     0 to +2^128 - 1 (0 to 3.40 x 10^38)    multiple packed BCD-digit bytes   multiple BCD-digit bytes

Note:
1. The sign bit is the most-significant bit (e.g., bit 7 for a byte, bit 15 for a word, etc.).
2. The double quadword data type is supported in the RDX:RAX registers by the MUL, IMUL, DIV, IDIV, and CQO instructions.

Binary-Coded-Decimal (BCD) Digits. BCD digits have values ranging from 0 to 9. These values can be represented in binary encoding with four bits. For example, 0000b represents the decimal number 0 and 1001b represents the decimal number 9. Values ranging from 1010b to 1111b are invalid for this data type. Because a byte contains eight bits, two BCD digits can be stored in a single byte. This is referred to as packed-BCD. If a single BCD digit is stored per byte, it is referred to as unpacked-BCD. In the x87 floating-point programming environment (described in Section 6, "x87 Floating-Point Programming," on page 285) an 80-bit packed BCD data type is also supported, along with conversions between floating-point and BCD data types, so that data expressed in the BCD format can be operated on as floating-point values.

Integer add, subtract, multiply, and divide instructions can be used to operate on single (unpacked) BCD digits. The result must be adjusted to produce a correct BCD representation. For unpacked BCD numbers, the ASCII-adjust instructions are provided to simplify that correction. In the case of division, the adjustment must be made prior to executing the integer-divide instruction.
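To illustrate why a binary add of two unpacked BCD digits needs an adjustment, the following Python sketch models the correction in software. It is an illustration in the spirit of the ASCII-adjust-after-addition step, not a definition of any instruction: the function name add_unpacked_bcd_digits is hypothetical, and the rule used (add 6 when the low nibble exceeds 9 or a carry left bit 3) is the usual decimal-correction rule.

```python
def add_unpacked_bcd_digits(a: int, b: int) -> tuple:
    """Add two unpacked BCD digits; return (decimal_carry, result_digit).

    Hypothetical helper for illustration only.
    """
    if not (0 <= a <= 9 and 0 <= b <= 9):
        raise ValueError("operands must be single BCD digits (0-9)")
    raw = a + b                                      # plain binary addition
    nibble_carry = (a & 0x0F) + (b & 0x0F) > 0x0F    # carry out of bit 3
    carry = 0
    if (raw & 0x0F) > 9 or nibble_carry:
        raw += 6                                     # skip the six invalid codes 1010b-1111b
        carry = 1                                    # decimal carry into the next digit
    return carry, raw & 0x0F


if __name__ == "__main__":
    print(add_unpacked_bcd_digits(7, 8))   # 7 + 8 = 15 decimal
    print(add_unpacked_bcd_digits(9, 9))   # 9 + 9 = 18 decimal
```

For 7 + 8 the raw binary sum is 1111b, which is not a valid BCD digit; adding 6 yields 15h, so the low digit is 5 with a decimal carry of 1.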
Similarly, integer add and subtract instructions can be used to operate on packed-BCD digits. The result must be adjusted to produce a correct packed-BCD representation. Decimal-adjust instructions are provided to simplify packed-BCD result corrections.

Strings. Strings are a continuous sequence of a single data type. The string instructions can be used to operate on byte, word, doubleword, or quadword data types. The maximum length of a string of any data type is 2^32 - 1 bytes in legacy or compatibility modes, or 2^64 - 1 bytes in 64-bit mode. One of the more common types of strings used by applications is the byte data-type string known as an ASCII string, which can be used to represent character data.

Bit strings are also supported by instructions that operate specifically on bit strings. In general, bit strings can start and end at any bit location within any byte, although the BTx bit-string instructions assume that strings start on a byte boundary. The length of a bit string can range in size from a single bit up to 2^32 - 1 bits in legacy or compatibility modes, or 2^64 - 1 bits in 64-bit mode.

3.2.2 Operand Sizes and Overrides

Default Operand Size. In legacy and compatibility modes, the default operand size is either 16 bits or 32 bits, as determined by the default-size (D) bit in the current code-segment descriptor (for details, see "Segmented Virtual Memory" in Volume 2). In 64-bit mode, the default operand size for most instructions is 32 bits.

Application software can override the default operand size by using an operand-size instruction prefix. Table 3-3 on page 45 shows the instruction prefixes for operand-size overrides in all operating modes. In 64-bit mode, a REX prefix (see Section 3.5.2, "REX Prefixes," on page 89) specifies a 64-bit operand size, and a 66h prefix specifies a 16-bit operand size.
The REX prefix takes precedence over the 66h prefix.

Table 3-3. Operand-Size Overrides

Operating Mode                          Default Operand   Effective Operand   Instruction Prefix
                                        Size (Bits)       Size (Bits)         66h (1)   REX
Long Mode, 64-Bit Mode                  32 (2)            64                  x         yes
Long Mode, 64-Bit Mode                  32 (2)            32                  no        no
Long Mode, 64-Bit Mode                  32 (2)            16                  yes       no
Long Mode, Compatibility Mode           32                32                  no        Not Applicable
Long Mode, Compatibility Mode           32                16                  yes       Not Applicable
Long Mode, Compatibility Mode           16                32                  yes       Not Applicable
Long Mode, Compatibility Mode           16                16                  no        Not Applicable
Legacy Mode (Protected, Virtual-8086,   32                32                  no        Not Applicable
or Real Mode)                           32                16                  yes       Not Applicable
                                        16                32                  yes       Not Applicable
                                        16                16                  no        Not Applicable

Note:
1. A "no" indicates that the default operand size is used. An "x" means "don't care."
2. Near branches, instructions that implicitly reference the stack pointer, and certain other instructions default to 64-bit operand size. See "General-Purpose Instructions in 64-Bit Mode" in Volume 3.

There are several exceptions to the 32-bit operand-size default in 64-bit mode, including near branches and instructions that implicitly reference the RSP stack pointer. For example, the near CALL, near JMP, Jcc, LOOPcc, POP, and PUSH instructions all default to a 64-bit operand size in 64-bit mode. Such instructions do not need a REX prefix for the 64-bit operand size. For details, see "General-Purpose Instructions in 64-Bit Mode" in Volume 3.

Effective Operand Size. The term effective operand size describes the operand size for the current instruction, after accounting for the instruction's default operand size and any operand-size override or REX prefix that is used with the instruction.

Immediate Operand Size. In legacy and compatibility modes, the size of immediate operands can be 8, 16, or 32 bits, depending on the instruction. In 64-bit mode, the maximum size of an immediate operand is also 32 bits, except that 64-bit immediates can be copied into a 64-bit GPR using the MOV instruction.
When the operand size of a MOV instruction is 64 bits, the processor sign-extends immediates to 64 bits before using them. Support for true 64-bit immediates is accomplished by expanding the semantics of the MOV reg, imm16/32 instructions. In legacy and compatibility modes, these instructions--opcodes B8h through BFh--copy a 16-bit or 32-bit immediate (depending on the effective operand size) into a GPR. In 64-bit mode, if the operand size is 64 bits (requires a REX prefix), these instructions can be used to copy a true 64-bit immediate into a GPR.

3.2.3 Operand Addressing

Operands for general-purpose instructions are referenced by the instruction's syntax or they are incorporated in the instruction as an immediate value. Referenced operands can be in registers, memory, or I/O ports.

Register Operands. Most general-purpose instructions that take register operands reference the general-purpose registers (GPRs). A few general-purpose instructions reference operands in the RFLAGS register, XMM registers, or MMX(TM) registers. The type of register addressed is specified in the instruction syntax. When addressing GPRs or XMM registers, the REX instruction prefix can be used to access the extended GPRs or XMM registers, as described in Section 3.5, "Instruction Prefixes," on page 85.

Memory Operands. Many general-purpose instructions can access operands in memory. Section 2.2, "Memory Addressing," on page 16 describes the general methods and conditions for addressing memory operands.

I/O Ports. Operands in I/O ports are referenced according to the conventions described in Section 3.8, "Input/Output," on page 109.

Immediate Operands. In certain instructions, a source operand--called an immediate operand, or simply immediate--is included as part of the instruction rather than being accessed from a register or memory location.
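The sign-extension of immediates to a 64-bit operand size, mentioned under "Immediate Operand Size," can be illustrated with a short Python sketch. The helper name sign_extend is hypothetical and the code only models the arithmetic; it says nothing about instruction encoding.

```python
def sign_extend(value: int, from_bits: int, to_bits: int = 64) -> int:
    """Sign-extend a from_bits-wide value to to_bits and return it as an
    unsigned bit pattern. Hypothetical helper for illustration only."""
    if not 0 < from_bits <= to_bits:
        raise ValueError("need 0 < from_bits <= to_bits")
    value &= (1 << from_bits) - 1            # keep only the source bits
    if value & (1 << (from_bits - 1)):       # sign bit set: value is negative
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)      # re-encode as to_bits-wide pattern


if __name__ == "__main__":
    # A 32-bit immediate with its sign bit set fills the upper 32 bits with 1s.
    print(hex(sign_extend(0x80000000, 32)))
    # A 32-bit immediate with its sign bit clear is simply zero-padded.
    print(hex(sign_extend(0x7FFFFFFF, 32)))
```

This is why a 32-bit immediate cannot express an arbitrary 64-bit constant, and why the 64-bit form of MOV reg, imm is needed for a true 64-bit immediate.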
For details on the size of immediate operands, see "Immediate Operand Size" on page 46.

3.2.4 Data Alignment

A data access is aligned if its address is a multiple of its operand size, in bytes. The following examples illustrate this definition:

- Byte accesses are always aligned. Bytes are the smallest addressable parts of memory.
- Word (two-byte) accesses are aligned if their address is a multiple of 2.
- Doubleword (four-byte) accesses are aligned if their address is a multiple of 4.
- Quadword (eight-byte) accesses are aligned if their address is a multiple of 8.

The AMD64 architecture does not impose data-alignment requirements for accessing data in memory. However, depending on the location of the misaligned operand with respect to the width of the data bus and other aspects of the hardware implementation (such as store-to-load forwarding mechanisms), a misaligned memory access can require more bus cycles than an aligned access. For maximum performance, avoid misaligned memory accesses.

Performance on many hardware implementations will benefit from observing the following operand-alignment and operand-size conventions:

- Avoid misaligned data accesses.
- Maintain consistent use of operand size across all loads and stores. Larger operand sizes (doubleword and quadword) tend to make more efficient use of the data bus and any data-forwarding features that are implemented by the hardware.
- When using word or byte stores, avoid loading data from the same doubleword of memory, other than the identical start addresses of the stores.

3.3 Instruction Summary

This section summarizes the functions of the general-purpose instructions. The instructions are organized by functional group--such as data-transfer instructions, arithmetic instructions, and so on. Details on individual instructions are given in the alphabetically organized "General-Purpose Instruction Reference" in Volume 3.
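Before the instruction groups are described, the alignment definition from Section 3.2.4 (an access is aligned if its address is a multiple of its operand size in bytes) can be expressed as a one-line check. This is a minimal Python sketch; the function name is_aligned and the sample addresses are hypothetical.

```python
def is_aligned(address: int, operand_size_bytes: int) -> bool:
    """Return True if an access of operand_size_bytes at address is aligned,
    i.e. the address is a multiple of the operand size.
    Hypothetical helper for illustration only."""
    if operand_size_bytes not in (1, 2, 4, 8):
        raise ValueError("operand size must be 1, 2, 4, or 8 bytes")
    return address % operand_size_bytes == 0


if __name__ == "__main__":
    for size in (1, 2, 4, 8):
        # 0x1002 is aligned for byte and word accesses only.
        print(size, is_aligned(0x1002, size))
```

A check like this is mainly useful in debugging or in allocation code that wants to follow the performance conventions listed above; the architecture itself does not require alignment for these accesses.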
3.3.1 Syntax

Each instruction has a mnemonic syntax used by assemblers to specify the operation and the operands to be used for source and destination (result) data. Figure 3-7 shows an example of the mnemonic syntax for a compare (CMP) instruction. In this example, the CMP mnemonic is followed by two operands, a 32-bit register or memory operand and an 8-bit immediate operand.

    CMP reg/mem32, imm8

    CMP          Mnemonic
    reg/mem32    First Source Operand and Destination Operand
    imm8         Second Source Operand

Figure 3-7. Mnemonic Syntax Example

In most instructions that take two operands, the first (left-most) operand is both a source operand and the destination operand. The second (right-most) operand serves only as a source. Instructions can have one or more prefixes that modify default instruction functions or operand properties. These prefixes are summarized in Section 3.5, "Instruction Prefixes," on page 85. Instructions that access 64-bit operands in a general-purpose register (GPR) or any of the extended GPR or XMM registers require a REX instruction prefix.

Unless otherwise stated in this section, the word register means a general-purpose register (GPR). Several instructions affect the flag bits in the RFLAGS register. "Instruction Effects on RFLAGS" in Volume 3 summarizes the effects that instructions have on rFLAGS bits.

3.3.2 Data Transfer

The data-transfer instructions copy data between registers and memory.

Move.

- MOV--Move
- MOVSX--Move with Sign-Extend
- MOVZX--Move with Zero-Extend
- MOVD--Move Doubleword or Quadword
- MOVNTI--Move Non-Temporal Doubleword or Quadword

MOVx copies a byte, word, doubleword, or quadword from a register or memory location to a register or memory location. The source and destination cannot both be memory locations. An immediate constant can be used as a source operand with the MOV instruction.
For MOV, the destination must be of the same size as the source, but the MOVSX and MOVZX instructions copy values of smaller size to a larger size by using sign-extension or zero-extension. The MOVD instruction copies a doubleword or quadword between a general-purpose register or memory and an XMM or MMX register.

The MOV instruction is in many aspects similar to the assignment operator in high-level languages. The simplest example of its use is to initialize variables. To initialize a register to 0, rather than using a MOV instruction it may be more efficient to use the XOR instruction with identical destination and source operands.

The MOVNTI instruction stores a doubleword or quadword from a register into memory as "non-temporal" data, which assumes a single access (as opposed to frequent subsequent accesses of "temporal data"). The operation therefore minimizes cache pollution. The exact method by which cache pollution is minimized depends on the hardware implementation of the instruction. For further information, see Section 3.9, "Memory Optimization," on page 113.

Conditional Move.

- CMOVcc--Conditional Move If condition

The CMOVcc instructions conditionally copy a word, doubleword, or quadword from a register or memory location to a register location. The source and destination must be of the same size.

The CMOVcc instructions perform the same task as MOV but work conditionally, depending on the state of status flags in the RFLAGS register. If the condition is not satisfied, the instruction has no effect and control is passed to the next instruction. The mnemonics of CMOVcc instructions indicate the condition that must be satisfied. Several mnemonics are often used for one opcode to make the mnemonics easier to remember.
For example, CMOVE (conditional move if equal) and CMOVZ (conditional move if zero) are aliases and assemble to the same opcode. Table 3-4 on page 51 shows the RFLAGS values required for each CMOVcc instruction.

In assembly language, the conditional move instructions implement small conditional statements like:

    IF a = b THEN x = y

CMOVcc instructions can replace two instructions--a conditional jump and a move. For example, to perform a high-level statement like:

    IF ECX = 5 THEN EAX = EBX

without a CMOVcc instruction, the code would look like:

    cmp ecx, 5        ; test if ecx equals 5
    jnz Continue      ; test condition and skip if not met
    mov eax, ebx      ; move
    Continue:         ; continuation

but with a CMOVcc instruction, the code would look like:

    cmp ecx, 5        ; test if ecx equals 5
    cmovz eax, ebx    ; test condition and move

Replacing conditional jumps with conditional moves also has the advantage that it can avoid branch-prediction penalties that may be caused by conditional jumps.

Support for CMOVcc instructions depends on the processor implementation. To find out if a processor is able to perform CMOVcc instructions, use the CPUID instruction.

Table 3-4. rFLAGS for CMOVcc Instructions

Mnemonic   Required Flag State     Description
CMOVO      OF = 1                  Conditional move if overflow
CMOVNO     OF = 0                  Conditional move if not overflow
CMOVB      CF = 1                  Conditional move if below
CMOVC      CF = 1                  Conditional move if carry
CMOVNAE    CF = 1                  Conditional move if not above or equal
CMOVAE     CF = 0                  Conditional move if above or equal
CMOVNB     CF = 0                  Conditional move if not below
CMOVNC     CF = 0                  Conditional move if not carry
CMOVE      ZF = 1                  Conditional move if equal
CMOVZ      ZF = 1                  Conditional move if zero
CMOVNE     ZF = 0                  Conditional move if not equal
CMOVNZ     ZF = 0                  Conditional move if not zero
CMOVBE     CF = 1 or ZF = 1        Conditional move if below or equal
CMOVNA     CF = 1 or ZF = 1        Conditional move if not above
CMOVA      CF = 0 and ZF = 0       Conditional move if above
CMOVNBE    CF = 0 and ZF = 0       Conditional move if not below or equal
CMOVS      SF = 1                  Conditional move if sign
CMOVNS     SF = 0                  Conditional move if not sign
CMOVP      PF = 1                  Conditional move if parity
CMOVPE     PF = 1                  Conditional move if parity even
CMOVNP     PF = 0                  Conditional move if not parity
CMOVPO     PF = 0                  Conditional move if parity odd
CMOVL      SF <> OF                Conditional move if less
CMOVNGE    SF <> OF                Conditional move if not greater or equal
CMOVGE     SF = OF                 Conditional move if greater or equal
CMOVNL     SF = OF                 Conditional move if not less
CMOVLE     ZF = 1 or SF <> OF      Conditional move if less or equal
CMOVNG     ZF = 1 or SF <> OF      Conditional move if not greater
CMOVG      ZF = 0 and SF = OF      Conditional move if greater
CMOVNLE    ZF = 0 and SF = OF      Conditional move if not less or equal

Stack Operations.

- POP--Pop Stack
- POPA--Pop All to GPR Words
- POPAD--Pop All to GPR Doublewords
- PUSH--Push onto Stack
- PUSHA--Push All GPR Words onto Stack
- PUSHAD--Push All GPR Doublewords onto Stack
- ENTER--Create Procedure Stack Frame
- LEAVE--Delete Procedure Stack Frame

PUSH copies the specified register, memory location, or immediate value to the top of stack.
This instruction decrements the stack pointer by 2, 4, or 8, depending on the operand size, and then copies the operand into the memory location pointed to by SS:rSP.

POP copies a word, doubleword, or quadword from the memory location pointed to by the SS:rSP registers (the top of stack) to a specified register or memory location. Then, the rSP register is incremented by 2, 4, or 8. After the POP operation, rSP points to the new top of stack.

PUSHA or PUSHAD stores eight word-sized or doubleword-sized registers onto the stack: eAX, eCX, eDX, eBX, eSP, eBP, eSI and eDI, in that order. The stored value of eSP is sampled at the moment when the PUSHA instruction started. The resulting stack-pointer value is decremented by 16 or 32.

POPA or POPAD extracts eight word-sized or doubleword-sized registers from the stack: eDI, eSI, eBP, eSP, eBX, eDX, eCX and eAX, in that order (which is the reverse of the order used in the PUSHA instruction). The stored eSP value is ignored by the POPA instruction. The resulting stack pointer value is incremented by 16 or 32.

It is a common practice to use PUSH instructions to pass parameters (via the stack) to functions and subroutines. The typical instruction sequence used at the beginning of a subroutine looks like:

    push ebp          ; save current EBP
    mov ebp, esp      ; set stack frame pointer value
    sub esp, N        ; allocate space for local variables

The rBP register is used as a stack frame pointer--a base address of the stack area used for parameters passed to subroutines and local variables. Positive offsets from the stack frame pointer in rBP provide access to parameters passed, while negative offsets give access to local variables. This technique allows creating reentrant subroutines.

The ENTER and LEAVE instructions provide support for procedure calls, and are mainly used in high-level languages.
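Before turning to ENTER and LEAVE in detail, the PUSH and POP stack-pointer behavior described above can be sketched as a toy model. The Python class below is an illustration only: the name ToyStack, the dictionary used as memory, and the starting address are all hypothetical, and segmentation (SS) is ignored.

```python
class ToyStack:
    """Toy model of PUSH/POP: the stack grows toward lower addresses and
    rSP always points at the current top of stack.
    Hypothetical class for illustration only."""

    def __init__(self, rsp: int):
        self.rsp = rsp
        self.mem = {}          # address -> value; stands in for memory

    def push(self, value: int, size: int) -> None:
        if size not in (2, 4, 8):
            raise ValueError("operand size must be 2, 4, or 8 bytes")
        self.rsp -= size                               # decrement first...
        self.mem[self.rsp] = value & ((1 << (8 * size)) - 1)   # ...then store

    def pop(self, size: int) -> int:
        if size not in (2, 4, 8):
            raise ValueError("operand size must be 2, 4, or 8 bytes")
        value = self.mem.pop(self.rsp)                 # read the top of stack...
        self.rsp += size                               # ...then increment
        return value


if __name__ == "__main__":
    s = ToyStack(0x1000)
    s.push(0xAABB, 2)
    s.push(1, 8)
    print(hex(s.rsp))
    print(s.pop(8), hex(s.pop(2)), hex(s.rsp))
```

Pushing a word and then a quadword lowers the stack pointer by 2 and then by 8; popping them in reverse order restores the original value.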
The ENTER instruction is typically the first instruction of the procedure, and the LEAVE instruction is the last before the RET instruction.

The ENTER instruction creates a stack frame for a procedure. The first operand, size, specifies the number of bytes allocated in the stack. The second operand, depth, specifies the number of stack-frame pointers copied from the calling procedure's stack (i.e., the nesting level). The depth should be an integer in the range 0-31.

Typically, when a procedure is called, the stack contains the following four components:

- Parameters passed to the called procedure (created by the calling procedure).
- Return address (created by the CALL instruction).
- Array of stack-frame pointers (pointers to stack frames of procedures with smaller nesting-level depth) which are used to access the local variables of such procedures.
- Local variables used by the called procedure.

All these data are called the stack frame. The ENTER instruction simplifies management of the last two components of a stack frame. First, the current value of the rBP register is pushed onto the stack. The value of the rSP register at that moment is a frame pointer for the current procedure: positive offsets from this pointer give access to the parameters passed to the procedure, and negative offsets give access to the local variables which will be allocated later. During procedure execution, the value of the frame pointer is stored in the rBP register, which at that moment contains a frame pointer of the calling procedure. This frame pointer is saved in a temporary register. If the depth operand is greater than one, the array of depth-1 frame pointers of procedures with smaller nesting level is pushed onto the stack. This array is copied from the stack frame of the calling procedure, and it is addressed by the rBP register from the calling procedure.
If the depth operand is greater than zero, the saved frame pointer of the current procedure is pushed onto the stack (forming an array of depth frame pointers). Finally, the saved value of the frame pointer is copied to the rBP register, and the rSP register is decremented by the value of the first operand, allocating space for local variables used in the procedure. See "Stack Operations" on page 52 for a parameter-passing instruction sequence using PUSH that is equivalent to ENTER.

The LEAVE instruction removes local variables and the array of frame pointers, allocated by the previous ENTER instruction, from the stack frame. This is accomplished by the following two steps: first, the value of the frame pointer is copied from the rBP register to the rSP register. This releases the space allocated for local variables and an array of frame pointers of procedures with smaller nesting levels. Second, the rBP register is popped from the stack, restoring the previous value of the frame pointer (or simply the value of the rBP register, if the depth operand is zero). Thus, the LEAVE instruction is equivalent to the following code:

    mov rSP, rBP
    pop rBP

3.3.3 Data Conversion

The data-conversion instructions perform various transformations of data, such as operand-size doubling by sign extension, conversion of little-endian to big-endian format, extraction of sign masks, searching a table, and support for operations with decimal numbers.

Sign Extension.

- CBW--Convert Byte to Word
- CWDE--Convert Word to Doubleword
- CDQE--Convert Doubleword to Quadword
- CWD--Convert Word to Doubleword
- CDQ--Convert Doubleword to Quadword
- CQO--Convert Quadword to Octword

The CBW, CWDE, and CDQE instructions sign-extend the AL, AX, or EAX register to the upper half of the AX, EAX, or RAX register, respectively. By doing so, these instructions create a double-sized destination operand in rAX that has the same numerical value as the source operand. The CBW, CWDE, and CDQE instructions have the same opcode, and the action taken depends on the effective operand size.

The CWD, CDQ and CQO instructions sign-extend the AX, EAX, or RAX register to all bit positions of the DX, EDX, or RDX register, respectively. By doing so, these instructions create a double-sized destination operand in rDX:rAX that has the same numerical value as the source operand. The CWD, CDQ, and CQO instructions have the same opcode, and the action taken depends on the effective operand size.

Flags are not affected by these instructions. The instructions can be used to prepare an operand for signed division (performed by the IDIV instruction) by doubling its storage size.

Extract Sign Mask.

- MOVMSKPS--Extract Packed Single-Precision Floating-Point Sign Mask
- MOVMSKPD--Extract Packed Double-Precision Floating-Point Sign Mask

The MOVMSKPS instruction moves the sign bits of four packed single-precision floating-point values in an XMM register to the four low-order bits of a general-purpose register, with zero-extension. MOVMSKPD does a similar operation for two packed double-precision floating-point values: it moves the two sign bits to the two low-order bits of a general-purpose register, with zero-extension. The result of either instruction is a sign-bit mask.

Translate.

- XLAT--Translate Table Index

The XLAT instruction replaces the value stored in the AL register with a table element. The initial value in AL serves as an unsigned index into the table, and the start (base) of table is specified by the DS:rBX registers (depending on the effective address size).

This instruction is not recommended. The following instruction serves to replace it:

    MOV AL,[rBX + AL]

ASCII Adjust.
- AAA--ASCII Adjust After Addition
- AAD--ASCII Adjust Before Division
- AAM--ASCII Adjust After Multiply
- AAS--ASCII Adjust After Subtraction

The AAA, AAD, AAM, and AAS instructions perform corrections of arithmetic operations with non-packed BCD values (i.e., when the decimal digit is stored in a byte register). There are no instructions which directly operate on decimal numbers (either packed or non-packed BCD). However, the ASCII-adjust instructions correct decimal-arithmetic results. These instructions assume that an arithmetic instruction, such as ADD, was performed on two BCD operands, and that the result was stored in the AL or AX register. This result can be incorrect or it can be a non-BCD value (for example, when a decimal carry occurs). After executing the proper ASCII-adjust instruction, the AX register contains a correct BCD representation of the result. (The AAD instruction is an exception to this, because it should be applied before a DIV instruction, as explained below.) All of the ASCII-adjust instructions are able to operate with multiple-precision decimal values.

- AAA should be applied after addition of two non-packed decimal digits.
- AAS should be applied after subtraction of two non-packed decimal digits.
- AAM should be applied after multiplication of two non-packed decimal digits.
- AAD should be applied before the division of two non-packed decimal numbers.

Although the base of the numeration for ASCII-adjust instructions is assumed to be 10, the AAM and AAD instructions can be used to correct multiplication and division with other bases.

BCD Adjust.

- DAA--Decimal Adjust after Addition
- DAS--Decimal Adjust after Subtraction

The DAA and DAS instructions perform corrections of addition and subtraction operations on packed BCD values.
(Packed BCD values have two decimal digits stored in a byte register, with the higher digit in the higher four bits, and the lower one in the lower four bits.) There are no instructions for correction of multiplication and division with packed BCD values.

- DAA should be applied after addition of two packed-BCD numbers.
- DAS should be applied after subtraction of two packed-BCD numbers.

DAA and DAS can be used in a loop to perform addition or subtraction of two multiple-precision decimal numbers stored in packed-BCD format. Each loop cycle would operate on corresponding bytes (containing two decimal digits) of operands.

Endian Conversion.

- BSWAP--Byte Swap

The BSWAP instruction changes the byte order of a doubleword or quadword operand in a register, as shown in Figure 3-8. In a doubleword, bits 7-0 are exchanged with bits 31-24, and bits 15-8 are exchanged with bits 23-16. In a quadword, bits 7-0 are exchanged with bits 63-56, bits 15-8 with bits 55-48, bits 23-16 with bits 47-40, and bits 31-24 with bits 39-32.

[Figure 3-8. BSWAP Doubleword Exchange. Byte fields 31-24, 23-16, 15-8, and 7-0 of the doubleword before and after the swap.]

A second application of the BSWAP instruction to the same operand restores its original value. The result of applying the BSWAP instruction to a 16-bit register is undefined. To swap bytes of a 16-bit register, use the XCHG instruction.

The BSWAP instruction is used to convert data between little-endian and big-endian byte order.

3.3.4 Load Segment Registers

These instructions load segment registers.

- LDS, LES, LFS, LGS, LSS--Load Far Pointer
- MOV segReg--Move Segment Register
- POP segReg--Pop Stack Into Segment Register

The LDS, LES, LFS, LGS, and LSS instructions atomically load the two parts of a far pointer into a segment register and a general-purpose register.
A far pointer is a 16-bit segment selector and a 16-bit or 32-bit offset. The load copies the segment-selector portion of the pointer from memory into the segment register and the offset portion of the pointer from memory into a general-purpose register.

The effective operand size determines the size of the offset loaded by the LDS, LES, LFS, LGS, and LSS instructions. The instructions load not only the software-visible segment selector into the segment register, but they also cause the hardware to load the associated segment-descriptor information into the software-invisible (hidden) portion of that segment register.

The MOV segReg and POP segReg instructions load a segment selector from a general-purpose register or memory (for MOV segReg) or from the top of the stack (for POP segReg) to a segment register. These instructions not only load the software-visible segment selector into the segment register but also cause the hardware to load the associated segment-descriptor information into the software-invisible (hidden) portion of that segment register.

In 64-bit mode, the POP DS, POP ES, and POP SS instructions are invalid.

3.3.5 Load Effective Address

- LEA--Load Effective Address

The LEA instruction calculates and loads the effective address (offset within a given segment) of a source operand and places it in a general-purpose register.

LEA is related to MOV, which copies data from a memory location to a register, but LEA takes the address of the source operand, whereas MOV takes the contents of the memory location specified by the source operand. In the simplest cases, LEA can be replaced with MOV. For example:

    lea eax, [ebx]

has the same effect as:

    mov eax, ebx

However, LEA allows software to use any valid addressing mode for the source operand. For example:

    lea eax, [ebx+edi]

loads the sum of EBX and EDI registers into the EAX register.
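The address arithmetic that LEA exposes can be modelled in a few lines. The Python sketch below assumes the usual base + index * scale + displacement form of a memory operand with a 32-bit destination; the function name lea32 and its keyword parameters are hypothetical, and no memory is read, which is exactly what distinguishes LEA from MOV.

```python
def lea32(base: int = 0, index: int = 0, scale: int = 1, disp: int = 0) -> int:
    """Compute base + index*scale + disp and truncate to 32 bits, modelling
    the value LEA would place in a 32-bit register.
    Hypothetical helper for illustration only."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale factor must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF


if __name__ == "__main__":
    ebx, edi = 0x10, 0x20
    print(hex(lea32(base=ebx)))                 # like: lea eax, [ebx]
    print(hex(lea32(base=ebx, index=edi)))      # like: lea eax, [ebx+edi]
    print(lea32(base=7, index=7, scale=8))      # like: lea eax, [ebx+ebx*8] with ebx = 7
```

The last call shows how a scaled index lets a single address computation multiply a register value by 9.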
This could not be accomplished by a single MOV instruction.

LEA has a limited capability to perform multiplication of operands in general-purpose registers using scaled-index addressing. For example:

    lea eax, [ebx+ebx*8]

loads the value of the EBX register, multiplied by 9, into the EAX register.

3.3.6 Arithmetic

The arithmetic instructions perform basic arithmetic operations, such as addition, subtraction, multiplication, and division on integer operands.

Add and Subtract.
ADC--Add with Carry
ADD--Signed or Unsigned Add
SBB--Subtract with Borrow
SUB--Subtract
NEG--Two's Complement Negation

The ADD instruction performs addition of two integer operands. There are opcodes that add an immediate value to a byte, word, doubleword, or quadword register or a memory location. In these opcodes, if the size of the immediate is smaller than that of the destination, the immediate is first sign-extended to the size of the destination operand. The arithmetic flags (OF, SF, ZF, AF, CF, PF) are set according to the resulting value of the destination operand.

The ADC instruction performs addition of two integer operands, plus 1 if the carry flag (CF) is set.

The SUB instruction performs subtraction of two integer operands.

The SBB instruction performs subtraction of two integer operands, and it also subtracts an additional 1 if the carry flag is set. The ADC and SBB instructions simplify addition and subtraction of multiple-precision integer operands, because they correctly handle carries (and borrows) between parts of a multiple-precision operand.

The NEG instruction performs negation of an integer operand. The value of the operand is replaced with the result of subtracting the operand from zero.

Multiply and Divide.
MUL--Multiply Unsigned
IMUL--Signed Multiply
DIV--Unsigned Divide
IDIV--Signed Divide

The MUL instruction performs multiplication of unsigned integer operands. The size of operands can be byte, word, doubleword, or quadword. The product is stored in a destination which is double the size of the source operands (multiplicand and factor).

The MUL instruction's mnemonic has only one operand, which is a factor. The multiplicand operand is always assumed to be an accumulator register. For byte-sized multiplies, AL contains the multiplicand, and the result is stored in AX. For word-sized, doubleword-sized, and quadword-sized multiplies, rAX contains the multiplicand, and the result is stored in rDX:rAX (upper half in rDX, lower half in rAX).

The IMUL instruction performs multiplication of signed integer operands. There are forms of the IMUL instruction with one, two, and three operands, and it is thus more powerful than the MUL instruction. The one-operand form of the IMUL instruction behaves similarly to the MUL instruction, except that the operands and product are signed integer values. In the two-operand form of IMUL, the multiplicand and product use the same register (the first operand), and the factor is specified in the second operand. In the three-operand form of IMUL, the product is stored in the first operand, the multiplicand is specified in the second operand, and the factor is specified in the third operand.

The DIV instruction performs division of unsigned integers. The instruction divides a double-sized dividend in AH:AL or rDX:rAX by the divisor specified in the operand of the instruction. It stores the quotient in AL or rAX and the remainder in AH or rDX.

The IDIV instruction performs division of signed integers. It behaves similarly to DIV, with the exception that the operands are treated as signed integer values.
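The register pairing used by MUL and DIV can be illustrated with a small model. The following Python sketch is not AMD code; the function names and the fixed 32-bit operand size are assumptions chosen for illustration, and flag updates are omitted. It shows a doubleword MUL producing a 64-bit product in EDX:EAX, and a doubleword DIV consuming that same register pair.

```python
MASK32 = 0xFFFFFFFF  # doubleword operand size assumed throughout this sketch


def mul32(eax, src):
    """Model of doubleword MUL: unsigned EAX * src, product returned as (EDX, EAX)."""
    product = (eax & MASK32) * (src & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def div32(edx, eax, src):
    """Model of doubleword DIV: unsigned EDX:EAX / src, returned as (EDX, EAX).

    EDX receives the remainder and EAX the quotient. Real hardware raises a
    divide-error exception where this sketch raises ArithmeticError.
    """
    divisor = src & MASK32
    if divisor == 0:
        raise ArithmeticError("division by zero")
    dividend = ((edx & MASK32) << 32) | (eax & MASK32)
    quotient, remainder = divmod(dividend, divisor)
    if quotient > MASK32:
        raise ArithmeticError("quotient does not fit in EAX")
    return remainder, quotient


if __name__ == "__main__":
    edx, eax = mul32(0xFFFFFFFF, 2)
    print(hex(edx), hex(eax))
    print(div32(edx, eax, 2))
```

Multiplying and then dividing by the same factor returns the original multiplicand with a zero remainder, which is a convenient sanity check for the EDX:EAX pairing.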
Division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One possibility for improving performance is to replace division with multiplication, such as by replacing i/j/k with i/(j*k). This replacement is possible if no overflow occurs during the computation of the product. This can be determined by considering the possible ranges of the divisors.

Increment and Decrement.
DEC--Decrement by 1
INC--Increment by 1

The INC and DEC instructions are used to increment and decrement, respectively, an integer operand by one. For both instructions, an operand can be a byte, word, doubleword, or quadword register or memory location.

These instructions behave in all respects like the corresponding ADD and SUB instructions, with the second operand as an immediate value equal to 1. The only exception is that the carry flag (CF) is not affected by the INC and DEC instructions.

Apart from their obvious arithmetic uses, the INC and DEC instructions are often used to modify addresses of operands. In this case it can be desirable to preserve the value of the carry flag (to use it later), so these instructions do not modify the carry flag.

3.3.7 Rotate and Shift

The rotate and shift instructions perform cyclic rotation or noncyclic shift, by a given number of bits (called the count), in a given byte-sized, word-sized, doubleword-sized or quadword-sized operand.

When the count is greater than 1, the result of the rotate and shift instructions can be considered as an iteration of the same 1-bit operation by count number of times. Because of this, the descriptions below describe the result of 1-bit operations.

The count can be 1, the value of the CL register, or an immediate 8-bit value.
To avoid redundancy and make rotation and shifting quicker, the count is masked to the 5 or 6 least-significant bits, depending on the effective operand size, so that its value does not exceed 31 or 63 before the rotation or shift takes place.

Rotate.
RCL--Rotate Through Carry Left
RCR--Rotate Through Carry Right
ROL--Rotate Left
ROR--Rotate Right

The RCx instructions rotate the bits of the first operand to the left or right by the number of bits specified by the source (count) operand. The bits rotated out of the destination operand are rotated into the carry flag (CF) and the carry flag is rotated into the opposite end of the first operand.

The ROx instructions rotate the bits of the first operand to the left or right by the number of bits specified by the source operand. Bits rotated out are rotated back in at the opposite end. The value of the CF flag is determined by the value of the last bit rotated out. In single-bit left-rotates, the overflow flag (OF) is set to the XOR of the CF flag after rotation and the most-significant bit of the result. In single-bit right-rotates, the OF flag is set to the XOR of the two most-significant bits. Thus, in both cases, the OF flag is set to 1 if the single-bit rotation changed the value of the most-significant bit (sign bit) of the operand. The value of the OF flag is undefined for multi-bit rotates.

Bit-rotation instructions provide many ways to reorder bits in an operand. This can be useful, for example, in character conversion, including cryptography techniques.

Shift.
SAL--Shift Arithmetic Left
SAR--Shift Arithmetic Right
SHL--Shift Left
SHR--Shift Right
SHLD--Shift Left Double
SHRD--Shift Right Double

The SHx instructions (including SHxD) perform shift operations on unsigned operands. The SAx instructions operate with signed operands.
SHL and SAL instructions effectively perform multiplication of an operand by a power of 2, in which case they work as more-efficient alternatives to the MUL instruction. Similarly, SHR and SAR instructions can be used to divide an operand (signed or unsigned, depending on the instruction used) by a power of 2.

Although the SAR instruction divides the operand by a power of 2, the behavior is different from the IDIV instruction. For example, shifting -11 (FFFFFFF5h) by two bits to the right (that is, dividing -11 by 4) gives a result of FFFFFFFDh, or -3, whereas the IDIV instruction for dividing -11 by 4 gives a result of -2. This is because the IDIV instruction rounds the quotient toward zero, whereas the SAR instruction rounds toward negative infinity. The two rounding rules coincide for positive dividends and can differ for negative dividends. This means that, for positive operands, SAR behaves like the corresponding IDIV instruction, and for negative operands, it gives the same result if and only if all the shifted-out bits are zeroes, and otherwise the result is smaller by 1.

The SAR instruction treats the most-significant bit (msb) of an operand in a special way: the msb (the sign bit) is not changed, but is copied to the next bit, preserving the sign of the result. The least-significant bit (lsb) is shifted out to the CF flag. In the SAL instruction, the msb is shifted out to the CF flag, and the lsb is cleared to 0.

The SHx instructions perform logical shift, that is, without special treatment of the sign bit. SHL is the same as SAL (in fact, their opcodes are the same). SHR copies 0 into the most-significant bit, and shifts the least-significant bit to the CF flag.

The SHxD instructions perform a double shift. These instructions perform left and right shift of the destination operand, taking the bits to copy into the most-significant bit (for the SHRD instruction) or into the least-significant bit (for the SHLD instruction) from the source operand.
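The shift behavior described in this section can be checked with a small model. The following Python sketch is illustrative only: the function names are hypothetical, operands are fixed at 32 bits, and flag updates are omitted. It reproduces the SAR and IDIV results for -11 and 4 quoted above, and then models the SHLD and SHRD double shifts.

```python
MASK32 = 0xFFFFFFFF  # doubleword operand size assumed throughout this sketch


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def sar32(value, count):
    """Arithmetic right shift: the sign bit is replicated (rounds toward -infinity)."""
    count &= 31  # count is masked to 5 bits for 32-bit operands
    return (to_signed32(value) >> count) & MASK32


def idiv_quotient32(dividend, divisor):
    """Signed quotient truncated toward zero, as IDIV produces (32-bit inputs only)."""
    a, b = to_signed32(dividend), to_signed32(divisor)
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return quotient & MASK32


def shld32(dest, src, count):
    """Shift dest left; vacated low bits are filled from the high bits of src."""
    count &= 31
    if count == 0:
        return dest & MASK32
    return ((dest << count) | ((src & MASK32) >> (32 - count))) & MASK32


def shrd32(dest, src, count):
    """Shift dest right; vacated high bits are filled from the low bits of src."""
    count &= 31
    if count == 0:
        return dest & MASK32
    return (((dest & MASK32) >> count) | (src << (32 - count))) & MASK32


if __name__ == "__main__":
    print(hex(sar32(0xFFFFFFF5, 2)))            # -11 shifted right by 2
    print(hex(idiv_quotient32(0xFFFFFFF5, 4)))  # -11 divided by 4
    print(hex(shld32(0x12345678, 0xABCDEF01, 8)))
    print(hex(shrd32(0x12345678, 0xABCDEF01, 8)))
```

In the last two calls the destination is 12345678h and the source is ABCDEF01h: SHLD by 8 pulls the source's top byte (ABh) into the low end of the destination, and SHRD by 8 pulls the source's bottom byte (01h) into the high end.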
These instructions behave like SHx, but use bits from the source operand instead of zero bits to shift into the destination operand. The source operand is not changed.

3.3.8 Compare and Test

The compare and test instructions perform arithmetic and logical comparison of operands and set corresponding flags, depending on the result of comparison. These instructions are used in conjunction with conditional instructions such as Jcc or SETcc to organize branching and conditionally executing blocks in programs. Assembler equivalents of conditional operators in high-level languages (do...while, if...then...else, and similar) also include compare and test instructions.

Compare.
CMP--Compare

The CMP instruction performs subtraction of the second operand (source) from the first operand (destination), like the SUB instruction, but it does not store the resulting value in the destination operand. It leaves both operands intact. The only effect of the CMP instruction is to set or clear the arithmetic flags (OF, SF, ZF, AF, CF, PF) according to the result of subtraction.

The CMP instruction is often used together with the conditional jump instructions (Jcc), conditional SET instructions (SETcc) and other instructions such as conditional loops (LOOPcc) whose behavior depends on flag state.

Test.
TEST--Test Bits

The TEST instruction is in many ways similar to the AND instruction: it performs logical conjunction of the corresponding bits of both operands, but unlike the AND instruction it leaves the operands unchanged. The purpose of this instruction is to update flags for further testing.

The TEST instruction is often used to test whether one or more bits in an operand are zero. In this case, one of the instruction operands would contain a mask in which all bits are cleared to zero except the bits being tested.
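The mask-based use of TEST described above can be sketched as a flag computation. The Python model below is an illustration, not a definitive description: the function name and the dictionary return format are assumptions, and it relies on TEST updating flags the same way AND does (CF and OF cleared, ZF, SF, and PF set from the result, with PF reflecting the low byte only).

```python
def flags_after_test(first, second, width=32):
    """Model the flags left by TEST first, second for a given operand width.

    Neither operand is modified; only the flag values are returned.
    """
    mask = (1 << width) - 1
    result = first & second & mask
    low_byte_ones = bin(result & 0xFF).count("1")
    return {
        "ZF": int(result == 0),              # 1 when every tested bit is zero
        "SF": (result >> (width - 1)) & 1,   # copy of the result's sign bit
        "PF": int(low_byte_ones % 2 == 0),   # 1 for even parity of the low byte
        "CF": 0,
        "OF": 0,
    }


if __name__ == "__main__":
    BIT3_MASK = 0b1000  # hypothetical mask: test bit 3 only
    print(flags_after_test(0b1010_0000, BIT3_MASK))  # bit 3 clear, so ZF = 1
    print(flags_after_test(0b1010_1000, BIT3_MASK))  # bit 3 set, so ZF = 0
```

A conditional jump or SETcc that tests ZF can then branch on whether the masked bits were all zero.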
For more advanced bit testing and bit modification, use the BTx instructions.

Bit Scan.
BSF--Bit Scan Forward
BSR--Bit Scan Reverse

The BSF and BSR instructions search a source operand for the least-significant (BSF) or most-significant (BSR) bit that is set to 1. If a set bit is found, its bit index is loaded into the destination operand, and the zero flag (ZF) is cleared. If no set bit is found (the source operand is zero), the zero flag is set and the contents of the destination are undefined.

Bit Test.
BT--Bit Test
BTC--Bit Test and Complement
BTR--Bit Test and Reset
BTS--Bit Test and Set

The BTx instructions copy a specified bit in the first operand to the carry flag (CF) and leave the source bit unchanged (BT), or complement the source bit (BTC), or clear the source bit to 0 (BTR), or set the source bit to 1 (BTS).

These instructions are useful for implementing semaphore arrays. Unlike the XCHG instruction, the BTx instructions set the carry flag, so no additional test or compare instruction is needed. Also, because these instructions operate directly on bits rather than larger data types, the semaphore arrays can be smaller than is possible when using XCHG. In such semaphore applications, bit-test instructions should be preceded by the LOCK prefix.

Set Byte on Condition.
SETcc--Set Byte if condition

The SETcc instructions store a 1 or 0 value to their byte operand depending on whether their condition (represented by certain rFLAGS bits) is true or false, respectively. Table 3-5 on page 66 shows the rFLAGS values required for each SETcc instruction.

Table 3-5. rFLAGS for SETcc Instructions

Mnemonic  Required Flag State   Description
SETO      OF = 1                Set byte if overflow
SETNO     OF = 0                Set byte if not overflow
SETB      CF = 1                Set byte if below (unsigned operands)
SETC      CF = 1                Set byte if carry
SETNAE    CF = 1                Set byte if not above or equal (unsigned operands)
SETAE     CF = 0                Set byte if above or equal (unsigned operands)
SETNB     CF = 0                Set byte if not below (unsigned operands)
SETNC     CF = 0                Set byte if not carry
SETE      ZF = 1                Set byte if equal
SETZ      ZF = 1                Set byte if zero
SETNE     ZF = 0                Set byte if not equal
SETNZ     ZF = 0                Set byte if not zero
SETBE     CF = 1 or ZF = 1      Set byte if below or equal (unsigned operands)
SETNA     CF = 1 or ZF = 1      Set byte if not above (unsigned operands)
SETA      CF = 0 and ZF = 0     Set byte if above (unsigned operands)
SETNBE    CF = 0 and ZF = 0     Set byte if not below or equal (unsigned operands)
SETS      SF = 1                Set byte if sign
SETNS     SF = 0                Set byte if not sign
SETP      PF = 1                Set byte if parity
SETPE     PF = 1                Set byte if parity even
SETNP     PF = 0                Set byte if not parity
SETPO     PF = 0                Set byte if parity odd
SETL      SF <> OF              Set byte if less (signed operands)
SETNGE    SF <> OF              Set byte if not greater or equal (signed operands)
SETGE     SF = OF               Set byte if greater or equal (signed operands)
SETNL     SF = OF               Set byte if not less (signed operands)
SETLE     ZF = 1 or SF <> OF    Set byte if less or equal (signed operands)
SETNG     ZF = 1 or SF <> OF    Set byte if not greater (signed operands)
SETG      ZF = 0 and SF = OF    Set byte if greater (signed operands)
SETNLE    ZF = 0 and SF = OF    Set byte if not less or equal (signed operands)

SETcc instructions are often used to set logical indicators. Like CMOVcc instructions (page 49), SETcc instructions can replace two instructions--a conditional jump and a move. Replacing conditional jumps with conditional sets can help avoid branch-prediction penalties that may be caused by conditional jumps.

If the logical value True (logical 1) is represented in a high-level language as an integer with all bits set to 1, software can accomplish such representation by first executing the opposite SETcc instruction--for example, the opposite of SETZ is SETNZ--and then decrementing the result. The decrement turns 0 into all ones and 1 into 0.

Bounds.
BOUND--Check Array Bounds

The BOUND instruction checks whether the value of the first operand, a signed integer index into an array, is within the minimal and maximal bound values pointed to by the second operand. The values of array bounds are often stored at the beginning of the array. If the bounds of the range are exceeded, the processor generates a bound-range exception.

The primary disadvantage of using the BOUND instruction is its use of the time-consuming exception mechanism to signal a failure of the bounds test.

3.3.9 Logical

The logical instructions perform bitwise operations.
AND--Logical AND
OR--Logical OR
XOR--Exclusive OR
NOT--One's Complement Negation

The AND, OR, and XOR instructions perform their respective logical operations on the corresponding bits of both operands and store the result in the first operand. The CF flag and OF flag are cleared to 0, and the ZF flag, SF flag, and PF flag are set according to the resulting value of the first operand.

The NOT instruction performs logical inversion of all bits of its operand. Each zero bit becomes one and vice versa. All flags remain unchanged.

Apart from performing logical operations, AND and OR can test a register for a zero or non-zero value, sign (negative or positive), and parity status of its lowest byte. To do this, both operands must be the same register. The XOR instruction with two identical operands is an efficient way of loading the value 0 into a register.

3.3.10 String

The string instructions perform common string operations such as copying, moving, comparing, or searching strings. These instructions are widely used for processing text.

Compare Strings.
CMPS--Compare Strings
CMPSB--Compare Strings by Byte
CMPSW--Compare Strings by Word
CMPSD--Compare Strings by Doubleword
CMPSQ--Compare Strings by Quadword

The CMPSx instructions compare the values of two implicit operands of the same size located at seg:[rSI] and ES:[rDI]. After the comparison, both the rSI and rDI registers are auto-incremented (if the DF flag is 0) or auto-decremented (if the DF flag is 1).

Scan String.
SCAS--Scan String
SCASB--Scan String as Bytes
SCASW--Scan String as Words
SCASD--Scan String as Doubleword
SCASQ--Scan String as Quadword

The SCASx instructions compare the value of a memory operand at ES:[rDI] with a value of the same size in the AL/rAX register. Bits in rFLAGS are set to indicate the outcome of the comparison. After the comparison, the rDI register is auto-incremented (if the DF flag is 0) or auto-decremented (if the DF flag is 1).

Move String.
MOVS--Move String
MOVSB--Move String Byte
MOVSW--Move String Word
MOVSD--Move String Doubleword
MOVSQ--Move String Quadword

The MOVSx instructions copy an operand from the memory location seg:[rSI] to the memory location ES:[rDI]. After the copy, both the rSI and rDI registers are auto-incremented (if the DF flag is 0) or auto-decremented (if the DF flag is 1).

Load String.
LODS--Load String
LODSB--Load String Byte
LODSW--Load String Word
LODSD--Load String Doubleword
LODSQ--Load String Quadword

The LODSx instructions load a value from the memory location seg:[rSI] to the accumulator register (AL or rAX). After the load, the rSI register is auto-incremented (if the DF flag is 0) or auto-decremented (if the DF flag is 1).

Store String.
STOS--Store String
STOSB--Store String Bytes
STOSW--Store String Words
STOSD--Store String Doublewords
STOSQ--Store String Quadword

The STOSx instructions copy the accumulator register (AL or rAX) to a memory location ES:[rDI].
After the copy, the rDI register is auto-incremented (if the DF flag is 0) or auto-decremented (if the DF flag is 1).

3.3.11 Control Transfer

Control-transfer instructions, or branches, are used to iterate through loops and move through conditional program logic.

Jump.
JMP--Jump

JMP performs an unconditional jump to the specified address. There are several ways to specify the target address.

- Relative Short Jump and Relative Near Jump--The target address is determined by adding an 8-bit (short jump) or 16-bit or 32-bit (near jump) signed displacement to the rIP of the instruction following the JMP. The jump is performed within the current code segment (CS).
- Register-Indirect and Memory-Indirect Near Jump--The target rIP value is contained in a register or in a memory location. The jump is performed within the current CS.
- Direct Far Jump--For all far jumps, the target address is outside the current code segment. Here, the instruction specifies the 16-bit target-address code segment and the 16-bit or 32-bit offset as an immediate value. The direct far jump form is invalid in 64-bit mode.
- Memory-Indirect Far Jump--For this form, the target address (CS:rIP) is at an address outside the current code segment. A 32-bit or 48-bit far pointer in a specified memory location points to the target address.

The size of the target rIP is determined by the effective operand size for the JMP instruction.

For far jumps, the target selector can specify a code-segment selector, in which case it is loaded into CS, and a 16-bit or 32-bit target offset is loaded into rIP. The target selector can also be a call-gate selector or a task-state-segment (TSS) selector, used for performing task switches. In these cases, the target offset of the JMP instruction is ignored, and the new values loaded into CS and rIP are taken from the call gate or from the TSS.

Conditional Jump.
Jcc--Jump if condition

Conditional jump instructions jump to an instruction specified by the operand, depending on the state of flags in the rFLAGS register. The operand specifies a signed relative offset from the current contents of the rIP. If the state of the corresponding flags meets the condition, a conditional jump instruction passes control to the target instruction, otherwise control is passed to the instruction following the conditional jump instruction. The flags tested by a specific Jcc instruction depend on the opcode. In several cases, multiple mnemonics correspond to one opcode.

Table 3-6 on page 71 shows the rFLAGS values required for each Jcc instruction.

Table 3-6. rFLAGS for Jcc Instructions

Mnemonic  Required Flag State   Description
JO        OF = 1                Jump near if overflow
JNO       OF = 0                Jump near if not overflow
JB        CF = 1                Jump near if below
JC        CF = 1                Jump near if carry
JNAE      CF = 1                Jump near if not above or equal
JNB       CF = 0                Jump near if not below
JNC       CF = 0                Jump near if not carry
JAE       CF = 0                Jump near if above or equal
JZ        ZF = 1                Jump near if 0
JE        ZF = 1                Jump near if equal
JNZ       ZF = 0                Jump near if not zero
JNE       ZF = 0                Jump near if not equal
JNA       CF = 1 or ZF = 1      Jump near if not above
JBE       CF = 1 or ZF = 1      Jump near if below or equal
JNBE      CF = 0 and ZF = 0     Jump near if not below or equal
JA        CF = 0 and ZF = 0     Jump near if above
JS        SF = 1                Jump near if sign
JNS       SF = 0                Jump near if not sign
JP        PF = 1                Jump near if parity
JPE       PF = 1                Jump near if parity even
JNP       PF = 0                Jump near if not parity
JPO       PF = 0                Jump near if parity odd
JL        SF <> OF              Jump near if less
JNGE      SF <> OF              Jump near if not greater or equal
JGE       SF = OF               Jump near if greater or equal
JNL       SF = OF               Jump near if not less
JNG       ZF = 1 or SF <> OF    Jump near if not greater
JLE       ZF = 1 or SF <> OF    Jump near if less or equal
JNLE      ZF = 0 and SF = OF    Jump near if not less or equal
JG        ZF = 0 and SF = OF    Jump near if greater

Unlike the unconditional jump (JMP), conditional jump instructions have only two forms--near conditional jumps and short conditional jumps. To create a far-conditional-jump code sequence corresponding to a high-level language statement like:

    IF A = B THEN GOTO FarLabel

where FarLabel is located in another code segment, use the opposite condition in a conditional short jump before the unconditional far jump. For example:

    cmp A,B               ; compare operands
    jne NextInstr         ; continue program if not equal
    jmp far ptr FarLabel  ; far jump if operands are equal
    NextInstr:            ; continue program

Three special conditional jump instructions use the rCX register instead of flags. The JCXZ, JECXZ, and JRCXZ instructions check the value of the CX, ECX, and RCX registers, respectively, and pass control to the target instruction when the value of the rCX register is 0. These instructions are often used to guard loops, preventing execution of the loop body when the value in rCX is 0.

Loop.
LOOPcc--Loop if condition

The LOOPcc instructions include LOOP, LOOPE, LOOPNE, LOOPNZ, and LOOPZ. These instructions decrement the rCX register by 1 without changing any flags, and then check to see if the loop condition is met. If the condition is met, the program jumps to the specified target code.

LOOPE and LOOPZ are synonyms. Their loop condition is met if the value of the rCX register is non-zero and the zero flag (ZF) is set to 1 when the instruction starts. LOOPNE and LOOPNZ are also synonyms. Their loop condition is met if the value of the rCX register is non-zero and the ZF flag is cleared to 0 when the instruction starts. LOOP, unlike the other mnemonics, does not check the ZF flag. Its loop condition is met if the value of the rCX register is non-zero.

Call.
CALL--Procedure Call

The CALL instruction performs a call to a procedure whose address is specified in the operand.
The return address is placed on the stack by the CALL, and points to the instruction immediately following the CALL. When the called procedure finishes execution and is exited using a return instruction, control is transferred to the return address saved on the stack.

The CALL instruction has the same forms as the JMP instruction, except that CALL lacks the short-relative (1-byte offset) form.

- Relative Near Call--These specify an offset relative to the instruction following the CALL instruction. The operand is an immediate 16-bit or 32-bit offset to the called procedure, within the same code segment.
- Register-Indirect and Memory-Indirect Near Call--These specify a target address contained in a register or memory location.
- Direct Far Call--These specify a target address outside the current code segment. The address is pointed to by a 32-bit or 48-bit far pointer specified by the instruction, which consists of a 16-bit code selector and a 16-bit or 32-bit offset. The direct far call form is invalid in 64-bit mode.
- Memory-Indirect Far Call--These specify a target address outside the current code segment. The address is pointed to by a 32-bit or 48-bit far pointer in a specified memory location.

The size of the rIP is in all cases determined by the operand-size attribute of the CALL instruction. CALLs push the return address to the stack. The data pushed on the stack depends on whether a near or far call is performed, and whether a privilege change occurs. See Section 3.7.5, "Procedure Calls," on page 96 for further information.

For far CALLs, the selector portion of the target address can specify a code-segment selector (in which case the selector is loaded into the CS register), or a call-gate selector (used for calls that change privilege level), or a task-state-segment (TSS) selector (used for task switches).
In the latter two cases, the offset portion of the CALL instruction's target address is ignored, and the new values loaded into CS and rIP are taken from the call gate or TSS.

Return.
RET--Return from Call

The RET instruction returns from a procedure originally called using the CALL instruction. CALL places a return address (which points to the instruction following the CALL) on the stack. RET takes the return address from the stack and transfers control to the instruction located at that address.

Like CALL instructions, RET instructions have both a near and far form. An optional immediate operand for the RET specifies the number of bytes to be popped from the procedure stack for parameters placed on the stack. See Section 3.7.6, "Returning from Procedures," on page 99 for additional information.

Interrupts and Exceptions.
INT--Interrupt to Vector Number
INTO--Interrupt to Overflow Vector
IRET--Interrupt Return Word
IRETD--Interrupt Return Doubleword
IRETQ--Interrupt Return Quadword

The INT instruction implements a software interrupt by calling an interrupt handler. The operand of the INT instruction is an immediate byte value specifying an index in the interrupt descriptor table (IDT), which contains addresses of interrupt handlers (see Section 3.7.10, "Interrupts and Exceptions," on page 104 for further information on the IDT).

The 1-byte INTO instruction calls interrupt 4 (the overflow exception, #OF), if the overflow flag in RFLAGS is set to 1, otherwise it does nothing. Signed arithmetic instructions can be followed by the INTO instruction if the result of the arithmetic operation can potentially overflow. (The 1-byte INT 3 instruction is considered a system instruction and is therefore not described in this volume).

IRET, IRETD, and IRETQ perform a return from an interrupt handler.
The mnemonic specifies the operand size, which determines the format of the return addresses popped from the stack (IRET for 16-bit operand size, IRETD for 32-bit operand size, and IRETQ for 64-bit operand size). However, some assemblers can use the IRET mnemonic for all operand sizes. Actions performed by IRET are opposite to actions performed by an interrupt or exception. In real and protected mode, IRET pops the rIP, CS, and RFLAGS contents from the stack, and it pops SS:rSP if a privilege-level change occurs or if it executes from 64-bit mode. In protected mode, the IRET instruction can also cause a task switch if the nested task (NT) bit in the RFLAGS register is set. For details on using IRET to switch tasks, see "Task Management" in Volume 2.

3.3.12 Flags

The flags instructions read and write bits of the RFLAGS register that are visible to application software. "Flags Register" on page 37 illustrates the RFLAGS register.

Push and Pop Flags.
POPF--Pop to FLAGS Word
POPFD--Pop to EFLAGS Doubleword
POPFQ--Pop to RFLAGS Quadword
PUSHF--Push FLAGS Word onto Stack
PUSHFD--Push EFLAGS Doubleword onto Stack
PUSHFQ--Push RFLAGS Quadword onto Stack

The push and pop flags instructions copy data between the rFLAGS register and the stack. POPF and PUSHF copy 16 bits of data between the stack and the FLAGS register (the low 16 bits of EFLAGS), leaving the high 48 bits of RFLAGS unchanged. POPFD and PUSHFD copy 32 bits between the stack and the RFLAGS register. POPFQ and PUSHFQ copy 64 bits between the stack and the RFLAGS register. Only the bits illustrated in Figure 3-5 on page 38 are affected. Reserved bits and bits whose writability is prevented by the current values of system flags, current privilege level (CPL), or current operating mode, are unaffected by the POPF, POPFQ, and POPFD instructions.

For details on stack operations, see "Control Transfers" on page 93.
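The 16-bit case described above can be sketched as a masked merge. The Python model below is only an illustration: the function names are hypothetical, and the writable_mask parameter is an assumed stand-in for the reserved-bit, CPL, and operating-mode rules, which are not modeled here.

```python
def pushf16(rflags):
    """Model of PUSHF with 16-bit operand size: the word that lands on the stack."""
    return rflags & 0xFFFF


def popf16(rflags, stack_word, writable_mask=0xFFFF):
    """Model of POPF with 16-bit operand size.

    Only bits that are both in the low 16 bits and in writable_mask are
    replaced; the upper 48 bits of RFLAGS always keep their old values.
    """
    changed = writable_mask & 0xFFFF
    return (rflags & ~changed) | (stack_word & changed)


if __name__ == "__main__":
    old = 0x0000_0000_0024_0202
    print(hex(popf16(old, 0x0AD7)))                           # low word replaced
    print(hex(popf16(0x202, 0xFFFF, writable_mask=0x0001)))   # only CF writable
```

In the second call, the assumed mask lets only bit 0 (CF) change, so every other bit of the popped word is ignored.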
Set and Clear Flags.
CLC--Clear Carry Flag
CMC--Complement Carry Flag
STC--Set Carry Flag
CLD--Clear Direction Flag
STD--Set Direction Flag
CLI--Clear Interrupt Flag
STI--Set Interrupt Flag

These instructions change the value of a flag in the rFLAGS register that is visible to application software. Each instruction affects only one specific flag.

The CLC, CMC, and STC instructions change the carry flag (CF). CLC clears the flag to 0, STC sets the flag to 1, and CMC inverts the flag. These instructions are useful prior to executing instructions whose behavior depends on the CF flag--for example, shift and rotate instructions.

The CLD and STD instructions change the direction flag (DF) and influence the function of string instructions (CMPSx, SCASx, MOVSx, LODSx, STOSx, INSx, OUTSx). CLD clears the flag to 0, and STD sets the flag to 1. A cleared DF flag indicates the forward direction in string sequences, and a set DF flag indicates the backward direction. Thus, in string instructions, the rSI and/or rDI register values are auto-incremented when DF = 0 and auto-decremented when DF = 1.

Two other instructions, CLI and STI, clear and set the interrupt flag (IF). CLI clears the flag, causing the processor to ignore external maskable interrupts. STI sets the flag, allowing the processor to recognize maskable external interrupts. These instructions are used primarily by system software--especially, interrupt handlers--and are described in "Exceptions and Interrupts" in Volume 2.

Load and Store Flags.
LAHF--Load Status Flags into AH Register
SAHF--Store AH into Flags

LAHF loads the lowest byte of the RFLAGS register into the AH register. This byte contains the carry flag (CF), parity flag (PF), auxiliary flag (AF), zero flag (ZF), and sign flag (SF). SAHF stores the AH register into the lowest byte of the RFLAGS register.
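The layout of the byte moved by LAHF and SAHF can be made concrete with a short model. This Python sketch is illustrative, with hypothetical function names; it uses the standard x86 positions of the five status flags in the low byte of the flags register (CF bit 0, PF bit 2, AF bit 4, ZF bit 6, SF bit 7) and treats the remaining bits of that byte as not writable by SAHF.

```python
# Bit positions of the status flags in the low byte of RFLAGS.
FLAG_BITS = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7}
STATUS_MASK = sum(1 << bit for bit in FLAG_BITS.values())  # 0xD5


def lahf(rflags):
    """Model of LAHF: the value loaded into AH is the low byte of RFLAGS."""
    return rflags & 0xFF


def decode_ah(ah):
    """Split an AH value produced by LAHF into named flag values."""
    return {name: (ah >> bit) & 1 for name, bit in FLAG_BITS.items()}


def sahf(rflags, ah):
    """Model of SAHF: copy the five status flags from AH into RFLAGS."""
    return (rflags & ~STATUS_MASK) | (ah & STATUS_MASK)


if __name__ == "__main__":
    ah = lahf(0x246)            # example flags value with ZF and PF set
    print(hex(ah), decode_ah(ah))
    print(hex(sahf(0x202, 0xFF)))
```

Because only the five status flags are copied, storing FFh with the sahf model changes bits 0, 2, 4, 6, and 7 and leaves the other bits of the register as they were.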
3.3.13 Input/Output

The I/O instructions perform reads and writes of bytes, words, and doublewords from and to the I/O address space. This address space can be used to access and manage external devices, and is independent of the main-memory address space. By contrast, memory-mapped I/O uses the main-memory address space and is accessed using the MOV instructions rather than the I/O instructions.

When operating in legacy protected mode or in long mode, the RFLAGS register's I/O privilege level (IOPL) field and the I/O-permission bitmap in the current task-state segment (TSS) are used to control access to the I/O addresses (called I/O ports). See "Input/Output" on page 109 for further information.

General I/O.

- IN--Input from Port
- OUT--Output to Port

The IN instruction reads a byte, word, or doubleword from the I/O port address specified by the source operand, and loads it into the accumulator register (AL or eAX). The source operand can be an immediate byte or the DX register.

The OUT instruction writes a byte, word, or doubleword from the accumulator register (AL or eAX) to the I/O port address specified by the destination operand, which can be either an immediate byte or the DX register.

If the I/O port address is specified with an immediate operand, the range of port addresses accessible by the IN and OUT instructions is limited to ports 0 through 255. If the I/O port address is specified by a value in the DX register, all 65,536 ports are accessible.

String I/O.

- INS--Input String
- INSB--Input String Byte
- INSW--Input String Word
- INSD--Input String Doubleword
- OUTS--Output String
- OUTSB--Output String Byte
- OUTSW--Output String Word
- OUTSD--Output String Doubleword

The INSx instructions (INSB, INSW, INSD) read a byte, word, or doubleword from the I/O port specified by the DX register, and load it into the memory location specified by ES:[rDI].
The OUTSx instructions (OUTSB, OUTSW, OUTSD) write a byte, word, or doubleword from an implicit memory location specified by seg:[rSI], to the I/O port address stored in the DX register.

The INSx and OUTSx instructions are commonly used with a repeat prefix to transfer blocks of data. The I/O port address in DX is not incremented or decremented; only the memory pointer (rDI or rSI) advances, in the direction selected by the DF flag. This usage is intended for peripheral I/O devices that are expecting a stream of data.

3.3.14 Semaphores

The semaphore instructions support the implementation of reliable signaling between processors in a multi-processing environment, usually for the purpose of sharing resources.

- CMPXCHG--Compare and Exchange
- CMPXCHG8B--Compare and Exchange Eight Bytes
- XADD--Exchange and Add
- XCHG--Exchange

The CMPXCHG instruction compares a value in the AL or rAX register with the first (destination) operand, and sets the arithmetic flags (ZF, OF, SF, AF, CF, PF) according to the result. If the compared values are equal, the source operand is loaded into the destination operand. If they are not equal, the first operand is loaded into the accumulator.

CMPXCHG can be used to try to acquire a semaphore, that is, to test whether its state is free and, if so, load a new value into the semaphore, making its state busy. The test and load are performed atomically, so that concurrent processes or threads which use the semaphore to access a shared object will not conflict.

The CMPXCHG8B instruction compares the 64-bit value in the EDX:EAX registers with a 64-bit memory location. If the values are equal, the zero flag (ZF) is set, and the ECX:EBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to EDX:EAX.

The XADD instruction exchanges the values of its two operands, then it stores their sum in the first (destination) operand.
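The compare-and-exchange idiom described above can be sketched portably with C11 atomics. This is a minimal illustration, not the manual's own code: the names `sem_try_acquire`, `sem_release`, and `counter_fetch_add` are hypothetical, and it is only an assumption (true of common compilers, but not guaranteed) that an AMD64 build turns these calls into locked CMPXCHG and XADD instructions.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { SEM_FREE = 0, SEM_BUSY = 1 };

/* Try to take the semaphore: succeeds only if it currently reads
 * SEM_FREE, in which case it is atomically changed to SEM_BUSY.
 * This is the "test, and if free, load a new value" step. */
bool sem_try_acquire(atomic_int *sem)
{
    int expected = SEM_FREE;
    return atomic_compare_exchange_strong(sem, &expected, SEM_BUSY);
}

/* Mark the semaphore free again. */
void sem_release(atomic_int *sem)
{
    atomic_store(sem, SEM_FREE);
}

/* Exchange-and-add: returns the old value and stores old + delta,
 * which is the observable effect XADD has on a memory destination. */
int counter_fetch_add(atomic_int *counter, int delta)
{
    return atomic_fetch_add(counter, delta);
}
```

A second `sem_try_acquire` on a busy semaphore fails without modifying it, so callers can spin, yield, or back off as their design requires.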
A LOCK prefix can be used to make the CMPXCHG, CMPXCHG8B, and XADD instructions atomic if one of the operands is a memory location.

The XCHG instruction exchanges the values of its two operands. If one of the operands is in memory, the processor's bus-locking mechanism is engaged automatically during the exchange, whether or not the LOCK prefix is used.

3.3.15 Processor Information

- CPUID--Processor Identification

The CPUID instruction returns information about the processor implementation and its support for instruction subsets and architectural features. Software operating at any privilege level can execute the CPUID instruction to read this information. After the information is read, software can select procedures that optimize performance for a particular hardware implementation.

Some processor implementations may not support the CPUID instruction. Support for the CPUID instruction is determined by testing the RFLAGS.ID bit. If software can write this bit, then the CPUID instruction is supported by the processor implementation. Otherwise, execution of CPUID results in an invalid-opcode exception.

See "Feature Detection" on page 90 for details about using the CPUID instruction. For a full description of the CPUID instruction and its function codes, see "CPUID" in Volume 3.

3.3.16 Cache and Memory Management

Applications can use the cache and memory-management instructions to control memory reads and writes and to influence the caching of read/write data. "Memory Optimization" on page 113 describes how these instructions interact with the memory subsystem.
- LFENCE--Load Fence
- SFENCE--Store Fence
- MFENCE--Memory Fence
- PREFETCHlevel--Prefetch Data to Cache Level level
- PREFETCH--Prefetch L1 Data-Cache Line
- PREFETCHW--Prefetch L1 Data-Cache Line for Write
- CLFLUSH--Cache Line Invalidate

The LFENCE, SFENCE, and MFENCE instructions can be used to force ordering on memory accesses. The order of memory accesses can be important when the reads and writes are to a memory-mapped I/O device, and in multiprocessor environments where memory synchronization is required. LFENCE affects ordering on memory reads, but not writes. SFENCE affects ordering on memory writes, but not reads. MFENCE orders both memory reads and writes. These instructions do not take operands. They are simply inserted between the memory references that are to be ordered. For details about the fence instructions, see "Forcing Memory Order" on page 115.

The PREFETCHlevel, PREFETCH, and PREFETCHW instructions load data from memory into one or more cache levels. PREFETCHlevel loads a memory block into a specified level in the data-cache hierarchy (including a non-temporal caching level). The size of the memory block is implementation dependent. PREFETCH loads a cache line into the L1 data cache. PREFETCHW loads a cache line into the L1 data cache and sets the cache line's memory-coherency state to modified, in anticipation of subsequent data writes to that line. (Both PREFETCH and PREFETCHW are 3DNow! instructions.) For details about the prefetch instructions, see "Cache-Control Instructions" on page 122. For a description of MOESI memory-coherency states, see "Memory System" in Volume 2.
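The idea of placing a fence between the memory references that must be ordered can be illustrated with portable C11 fences. This is an analogue, not a use of the AMD64 instructions themselves: whether a compiler emits SFENCE, LFENCE, MFENCE, or nothing at all for these fences is implementation-specific, and the names `publish`, `try_consume`, `payload`, and `ready` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary data being handed over */
static atomic_bool ready;    /* flag that announces the data    */

/* Writer: the release fence sits between the data write and the flag
 * write, so the data write cannot be reordered after the flag write. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: the acquire fence sits between the flag read and the data
 * read, so the data read cannot be reordered before the flag read. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

The writer-side fence plays a role comparable to ordering stores and the reader-side fence to ordering loads; a full `memory_order_seq_cst` fence would order both.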
The CLFLUSH instruction writes unsaved data back to memory for the specified cache line from all processor caches, invalidates the specified cache line, and causes the processor to send a bus cycle which signals external caching devices to write back and invalidate their copies of the cache line. CLFLUSH provides a finer-grained mechanism than the WBINVD instruction, which writes back and invalidates all cache lines. Moreover, CLFLUSH can be used at all privilege levels, unlike WBINVD, which can be used only by system software running at privilege level 0.

3.3.17 No Operation

- NOP--No Operation

The NOP instruction performs no operation (except incrementing the instruction pointer rIP by one). It is an alternative mnemonic for the XCHG rAX, rAX instruction. Depending on the hardware implementation, the NOP instruction may use one or more cycles of processor time.

3.3.18 System Calls

System Call and Return.

- SYSENTER--System Call
- SYSEXIT--System Return
- SYSCALL--Fast System Call
- SYSRET--Fast System Return

The SYSENTER and SYSCALL instructions perform a call to a routine running at current privilege level (CPL) 0--for example, a kernel procedure--from a user-level program (CPL 3). The addresses of the target procedure and (for SYSENTER) the target stack are specified implicitly through the model-specific registers (MSRs). Control returns from the operating system to the caller when the operating system executes a SYSEXIT or SYSRET instruction. SYSEXIT and SYSRET are privileged instructions and thus can be issued only by a privilege-level-0 procedure.

The SYSENTER and SYSEXIT instructions form a complementary pair, as do SYSCALL and SYSRET. SYSENTER and SYSEXIT are invalid in 64-bit mode. In this case, use the faster SYSCALL and SYSRET instructions.
For details on these and other system-related instructions, see "System-Management Instructions" in Volume 2 and "System Instruction Reference" in Volume 3.

3.4 General Rules for Instructions in 64-Bit Mode

This section provides details of the general-purpose instructions in 64-bit mode, and how they differ from the same instructions in legacy and compatibility modes. The differences apply only to general-purpose instructions. Most of them do not apply to 128-bit media, 64-bit media, or x87 floating-point instructions.

3.4.1 Address Size

In 64-bit mode, the following rules apply to address size:

- Defaults to 64 bits.
- Can be overridden to 32 bits (by means of opcode prefix 67h).
- Cannot be overridden to 16 bits.

3.4.2 Canonical Address Format

Bits 63 through the most-significant implemented virtual-address bit must be all zeros or all ones in any memory reference. See "64-bit Canonical Addresses" on page 18 for details. (This rule applies to long mode, which includes both 64-bit mode and compatibility mode.)

3.4.3 Branch-Displacement Size

Branch-address displacements are 8 bits or 32 bits, as in legacy mode, but are sign-extended to 64 bits prior to using them for address computations. See "Displacements and Immediates" on page 20 for details.

3.4.4 Operand Size

In 64-bit mode, the following rules apply to operand size:

- 64-Bit Operand Size Option: If an instruction's operand size (16-bit or 32-bit) in legacy mode depends on the default-size (D) bit in the current code-segment descriptor and the operand-size prefix, then the operand-size choices in 64-bit mode are extended from 16-bit and 32-bit to include 64 bits (with a REX prefix), or the operand size is fixed at 64 bits. See "General-Purpose Instructions in 64-Bit Mode" in Volume 3 for details.
- Default Operand Size: The default operand size for most instructions is 32 bits, and a REX prefix must be used to change the operand size to 64 bits. However, two groups of instructions default to 64-bit operand size and do not need a REX prefix: (1) near branches and (2) all instructions, except far branches, that implicitly reference the RSP. See "General-Purpose Instructions in 64-Bit Mode" in Volume 3 for details.
- Fixed Operand Size: If an instruction's operand size is fixed in legacy mode, that operand size is usually fixed at the same size in 64-bit mode. (There are some exceptions.) For example, the CPUID instruction always operates on 32-bit operands, irrespective of attempts to override the operand size. See "General-Purpose Instructions in 64-Bit Mode" in Volume 3 for details.
- Immediate Operand Size: The maximum size of immediate operands is 32 bits, as in legacy mode, except that 64-bit immediates can be MOVed into 64-bit GPRs. When the operand size is 64 bits, immediates are sign-extended to 64 bits prior to using them. See "Immediate Operand Size" on page 46 for details.
- Shift-Count and Rotate-Count Operand Size: When the operand size is 64 bits, shifts and rotates use one additional bit (6 bits total) to specify shift-count or rotate-count, allowing 64-bit shifts and rotates.

3.4.5 High 32 Bits

In 64-bit mode, the following rules apply to extension of results into the high 32 bits when results smaller than 64 bits are written:

- Zero-Extension of 32-Bit Results: 32-bit results are zero-extended into the high 32 bits of 64-bit GPR destination registers.
- No Extension of 8-Bit and 16-Bit Results: 8-bit and 16-bit results leave the high 56 or 48 bits, respectively, of 64-bit GPR destination registers unchanged.
- Undefined High 32 Bits After Mode Change: The processor does not preserve the upper 32 bits of the 64-bit GPRs across changes from 64-bit mode to compatibility or legacy modes. In compatibility and legacy mode, the upper 32 bits of the GPRs are undefined and not accessible to software.

3.4.6 Invalid and Reassigned Instructions

The following general-purpose instructions are invalid in 64-bit mode:

- AAA--ASCII Adjust After Addition
- AAD--ASCII Adjust Before Division
- AAM--ASCII Adjust After Multiply
- AAS--ASCII Adjust After Subtraction
- BOUND--Check Array Bounds
- CALL (far absolute)--Procedure Call Far
- DAA--Decimal Adjust after Addition
- DAS--Decimal Adjust after Subtraction
- INTO--Interrupt to Overflow Vector
- JMP (far absolute)--Jump Far
- LDS--Load DS Segment Register
- LES--Load ES Segment Register
- POP DS--Pop Stack into DS Segment
- POP ES--Pop Stack into ES Segment
- POP SS--Pop Stack into SS Segment
- POPA, POPAD--Pop All to GPR Words or Doublewords
- PUSH CS--Push CS Segment Selector onto Stack
- PUSH DS--Push DS Segment Selector onto Stack
- PUSH ES--Push ES Segment Selector onto Stack
- PUSH SS--Push SS Segment Selector onto Stack
- PUSHA, PUSHAD--Push All to GPR Words or Doublewords

The following general-purpose instructions are invalid in long mode (64-bit mode and compatibility mode):

- SYSENTER--System Call (use SYSCALL instead)
- SYSEXIT--System Exit (use SYSRET instead)

The opcodes for the following general-purpose instructions are reassigned in 64-bit mode:

- ARPL--Adjust Requestor Privilege Level. Opcode becomes the MOVSXD instruction.
- DEC (one-byte opcode only)--Decrement by 1. Opcode becomes a REX prefix. Use the two-byte DEC opcode instead.
- INC (one-byte opcode only)--Increment by 1. Opcode becomes a REX prefix. Use the two-byte INC opcode instead.

3.4.7 Instructions with 64-Bit Default Operand Size

Most instructions default to 32-bit operand size in 64-bit mode.
However, the following near-branch instructions and instructions that implicitly reference the stack pointer (RSP) default to 64-bit operand size in 64-bit mode:

- Near Branches:
  - Jcc--Jump Conditional Near
  - JMP--Jump Near
  - LOOP--Loop
  - LOOPcc--Loop Conditional
- Instructions That Implicitly Reference RSP:
  - ENTER--Create Procedure Stack Frame
  - LEAVE--Delete Procedure Stack Frame
  - POP reg/mem--Pop Stack (register or memory)
  - POP reg--Pop Stack (register)
  - POP FS--Pop Stack into FS Segment Register
  - POP GS--Pop Stack into GS Segment Register
  - POPF, POPFD, POPFQ--Pop to rFLAGS Word, Doubleword, or Quadword
  - PUSH imm32--Push onto Stack (sign-extended doubleword)
  - PUSH imm8--Push onto Stack (sign-extended byte)
  - PUSH reg/mem--Push onto Stack (register or memory)
  - PUSH reg--Push onto Stack (register)
  - PUSH FS--Push FS Segment Register onto Stack
  - PUSH GS--Push GS Segment Register onto Stack
  - PUSHF, PUSHFD, PUSHFQ--Push rFLAGS Word, Doubleword, or Quadword onto Stack

The default 64-bit operand size eliminates the need for a REX prefix with these instructions when registers RAX-RSP (the first set of eight GPRs) are used as operands. A REX prefix is still required if R8-R15 (the extended set of eight GPRs) are used as operands, because the prefix is required to address the extended registers.

The 64-bit default operand size can be overridden to 16 bits using the 66h operand-size override. However, it is not possible to override the operand size to 32 bits, because there is no 32-bit operand-size override prefix for 64-bit mode. For details on the operand-size prefix, see "Instruction Prefixes" in Volume 3.

For details on near branches, see "Near Branches in 64-Bit Mode" on page 103. For details on instructions that implicitly reference RSP, see "Stack Operand-Size in 64-Bit Mode" on page 95.
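Several of the 64-bit-mode rules in Sections 3.4.2 through 3.4.5 reduce to simple bit manipulation, which the following C sketch models. It is illustrative only: the function names are hypothetical, `va_bits` (the number of implemented virtual-address bits, for example 48) is a parameter assumed for the example, and the 5-bit count used for operand sizes below 64 bits is the usual legacy x86 behavior rather than something stated in this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* 3.4.2: bits 63 down to the most-significant implemented address bit
 * (bit va_bits - 1) must be all zeros or all ones. Requires
 * 1 <= va_bits <= 64. */
bool is_canonical(uint64_t addr, unsigned va_bits)
{
    uint64_t upper    = addr >> (va_bits - 1);
    uint64_t all_ones = ~UINT64_C(0) >> (va_bits - 1);
    return upper == 0 || upper == all_ones;
}

/* 3.4.3: a 32-bit branch displacement is sign-extended to 64 bits
 * before it is added to the address of the next instruction. */
uint64_t branch_target(uint64_t next_rip, int32_t disp32)
{
    return next_rip + (uint64_t)(int64_t)disp32;
}

/* 3.4.5: a 32-bit result is zero-extended into the 64-bit register. */
uint64_t write32(uint64_t reg, uint32_t result)
{
    (void)reg;                       /* old upper bits are discarded */
    return (uint64_t)result;
}

/* 3.4.5: 16-bit and 8-bit results leave the upper bits unchanged. */
uint64_t write16(uint64_t reg, uint16_t result)
{
    return (reg & ~UINT64_C(0xFFFF)) | result;
}

uint64_t write8(uint64_t reg, uint8_t result)
{
    return (reg & ~UINT64_C(0xFF)) | result;
}

/* 3.4.4: shift and rotate counts use 6 bits for 64-bit operands
 * (assumed 5 bits otherwise). */
unsigned shift_count(unsigned count, unsigned operand_bits)
{
    return count & (operand_bits == 64 ? 0x3Fu : 0x1Fu);
}
```

The asymmetry between `write32` and `write16`/`write8` is the detail most often overlooked when porting 32-bit code: a 32-bit write clears the upper half of the destination, while narrower writes do not.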
For details on opcodes and operand-size overrides, see "General-Purpose Instructions in 64-Bit Mode" in Volume 3.

3.5 Instruction Prefixes

An instruction prefix is a byte that precedes an instruction's opcode and modifies the instruction's operation or operands. Instruction prefixes are of two types:

- Legacy Prefixes
- REX Prefixes

Legacy prefixes are organized into five groups, in which each prefix has a unique value. REX prefixes, which enable use of the AMD64 register extensions in 64-bit mode, are organized as a single group in which the value of the prefix indicates the combination of register-extension features to be enabled.

3.5.1 Legacy Prefixes

Table 3-7 shows the legacy prefixes. These are organized into five groups, as shown in the left-most column of the table. Each prefix has a unique hexadecimal value. The legacy prefixes can appear in any order in the instruction, but only one prefix from each of the five groups can be used in a single instruction. The result of using multiple prefixes from a single group is undefined.

There are several restrictions on the use of prefixes. For example, the address-size prefix changes address size only for a memory operand, and only a single memory operand can be overridden in an instruction. In general, the operand-size prefix cannot be used with x87 floating-point instructions, and when used with 128-bit or 64-bit media instructions that prefix acts in a special way to modify the opcode. The repeat prefixes cause repetition only with certain string instructions, and when used with 128-bit or 64-bit media instructions the prefixes act in a special way to modify the opcode. The lock prefix can be used with only a small number of general-purpose instructions.

Table 3-7 summarizes the functionality of instruction prefixes. Details about the prefixes and their restrictions are given in "Instruction Prefixes" in Volume 3.
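The one-prefix-per-group rule can be sketched as a small checker. The following C is an illustrative model rather than a decoder: the byte values are the prefix codes listed in Table 3-7, while the type and function names are hypothetical and the choice to stop at the first non-prefix byte is a simplification made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    PFX_NONE = -1,
    PFX_OPERAND_SIZE,
    PFX_ADDRESS_SIZE,
    PFX_SEGMENT,
    PFX_LOCK,
    PFX_REPEAT,
    PFX_GROUP_COUNT
} prefix_group;

/* Map a byte to its legacy-prefix group, or PFX_NONE if it is not a
 * legacy prefix. */
prefix_group legacy_prefix_group(uint8_t b)
{
    switch (b) {
    case 0x66: return PFX_OPERAND_SIZE;
    case 0x67: return PFX_ADDRESS_SIZE;
    case 0x2E: case 0x3E: case 0x26:
    case 0x64: case 0x65: case 0x36: return PFX_SEGMENT;
    case 0xF0: return PFX_LOCK;
    case 0xF3: case 0xF2: return PFX_REPEAT;
    default:   return PFX_NONE;
    }
}

/* Scan the leading prefix bytes of an instruction. Returns false if
 * two prefixes from the same group appear (a combination whose result
 * is undefined), true otherwise. */
bool legacy_prefixes_ok(const uint8_t *bytes, size_t n)
{
    bool seen[PFX_GROUP_COUNT] = { false };

    for (size_t i = 0; i < n; i++) {
        prefix_group g = legacy_prefix_group(bytes[i]);
        if (g == PFX_NONE)
            break;                   /* first non-prefix byte */
        if (seen[g])
            return false;
        seen[g] = true;
    }
    return true;
}
```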
Table 3-7. Legacy Instruction Prefixes

Prefix Group | Mnemonic | Prefix Code (Hex) | Description
Operand-Size Override | none | 66 (note 1) | Changes the default operand size of a memory or register operand, as shown in Table 3-3 on page 45.
Address-Size Override | none | 67 | Changes the default address size of a memory operand, as shown in Table 2-1 on page 21.
Segment Override | CS | 2E | Forces use of the CS segment for memory operands.
Segment Override | DS | 3E | Forces use of the DS segment for memory operands.
Segment Override | ES | 26 | Forces use of the ES segment for memory operands.
Segment Override | FS | 64 | Forces use of the FS segment for memory operands.
Segment Override | GS | 65 | Forces use of the GS segment for memory operands.
Segment Override | SS | 36 | Forces use of the SS segment for memory operands.
Lock | LOCK | F0 | Causes certain read-modify-write instructions on memory to occur atomically.
Repeat | REP | F3 (note 1) | Repeats a string operation (INS, MOVS, OUTS, LODS, and STOS) until the rCX register equals 0.
Repeat | REPE or REPZ | F3 (note 1) | Repeats a compare-string or scan-string operation (CMPSx and SCASx) until the rCX register equals 0 or the zero flag (ZF) is cleared to 0.
Repeat | REPNE or REPNZ | F2 (note 1) | Repeats a compare-string or scan-string operation (CMPSx and SCASx) until the rCX register equals 0 or the zero flag (ZF) is set to 1.

Note 1: When used with 128-bit or 64-bit media instructions, this prefix acts in a special-purpose way to modify the opcode.

Operand-Size and Address-Size Prefixes. The operand-size and address-size prefixes allow mixing of data and address sizes on an instruction-by-instruction basis. An instruction's default address size can be overridden in any operating mode by using the 67h address-size prefix.
Table 3-3 on page 45 shows the operand-size overrides for all operating modes. In 64-bit mode, the default operand size for most general-purpose instructions is 32 bits. A REX prefix (described in "REX Prefixes" on page 89) specifies a 64-bit operand size, and a 66h prefix specifies a 16-bit operand size. The REX prefix takes precedence over the 66h prefix.

Table 2-1 on page 21 shows the address-size overrides for all operating modes. In 64-bit mode, the default address size is 64 bits. The address size can be overridden to 32 bits. 16-bit addresses are not supported in 64-bit mode. In compatibility mode, the address-size prefix works the same as in the legacy x86 architecture.

For further details on these prefixes, see "Operand-Size Override Prefix" in Volume 3 and "Address-Size Override Prefix" in Volume 3.

Segment Override Prefix. The DS segment is the default segment for most memory operands. Many instructions allow this default data segment to be overridden using one of the six segment-override prefixes shown in Table 3-7. Data-segment overrides will be ignored when accessing data in the following cases:

- When a stack reference is made that pushes data onto or pops data off of the stack. In those cases, the SS segment is always used.
- When the destination of a string is memory, it is always referenced using the ES segment.

Instruction fetches from the CS segment cannot be overridden. However, the CS segment-override prefix can be used to access instructions as data objects and to access data stored in the code segment.

For further details on these prefixes, see "Segment-Override Prefixes" in Volume 3.

Lock Prefix. The LOCK prefix causes certain read-modify-write instructions that access memory to occur atomically.
The mechanism for doing so is implementation-dependent (for example, the mechanism may involve locking of data-cache lines that contain copies of the referenced memory operands, and/or bus signaling or packet-messaging on the bus). The prefix is intended to give the processor exclusive use of shared memory operands in a multiprocessor system.

The prefix can only be used with forms of the following instructions that write a memory operand: ADC, ADD, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XADD, XCHG, and XOR. An invalid-opcode exception occurs if LOCK is used with any other instruction.

For further details on this prefix, see "Lock Prefix" in Volume 3.

Repeat Prefixes. There are two repeat-prefix byte codes, F3h and F2h. Byte code F3h is the more general and is usually treated as two distinct prefixes (REP and REPE/REPZ) by assemblers. Byte code F2h is only used with CMPSx and SCASx instructions:

- REP (F3h)--This more generalized repeat prefix repeats its associated string instruction the number of times specified in the counter register (rCX). Repetition stops when the value in rCX reaches 0. This prefix is used with the INS, LODS, MOVS, OUTS, and STOS instructions.
- REPE or REPZ (F3h)--This version of the REP prefix repeats its associated string instruction the number of times specified in the counter register (rCX). Repetition stops when the value in rCX reaches 0 or when the zero flag (ZF) is cleared to 0. The prefix can only be used with the CMPSx and SCASx instructions.
- REPNE or REPNZ (F2h)--The REPNE or REPNZ prefix repeats its associated string instruction the number of times specified in the counter register (rCX). Repetition stops when the value in rCX reaches 0 or when the zero flag (ZF) is set to 1. The prefix can only be used with the CMPSx and SCASx instructions.

The size of the rCX counter is determined by the effective address size.
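The two termination conditions of the conditional repeat prefixes can be made concrete with a small C model of a repeated SCASB in the forward direction (DF = 0). This is a sketch under stated assumptions, not a description of any implementation: memory is a plain byte array indexed by the modeled rDI, segmentation is ignored, and the names `scan_state` and `repeat_scasb` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t rcx;   /* remaining count            */
    size_t rdi;   /* index of the next byte     */
    int    zf;    /* zero flag after the scan   */
} scan_state;

/* Model of REPE SCASB (stop_zf = 0) or REPNE SCASB (stop_zf = 1).
 * Each iteration compares AL with mem[rdi], sets ZF on equality,
 * advances rDI, and decrements rCX. Repetition ends when rCX reaches 0
 * or when ZF equals stop_zf. If rCX is 0 on entry, nothing executes
 * and ZF keeps the value passed in. */
scan_state repeat_scasb(const uint8_t *mem, size_t rdi, size_t rcx,
                        uint8_t al, int zf, int stop_zf)
{
    scan_state s = { rcx, rdi, zf };

    while (s.rcx != 0) {
        s.zf = (mem[s.rdi] == al);
        s.rdi++;
        s.rcx--;
        if (s.zf == stop_zf)
            break;
    }
    return s;
}
```

Because the count and pointer are updated before the flag is tested, a REPNE scan that finds its byte leaves the modeled rDI one past the match, which is why callers typically subtract one to recover the matching position.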
For further details about these prefixes, including optimization of their use, see "Repeat Prefixes" in Volume 3.

3.5.2 REX Prefixes

REX prefixes are a new group of instruction-prefix bytes that can be used only in 64-bit mode. They enable the 64-bit register extensions. REX prefixes specify the following features:

- Use of an extended GPR register, shown in Figure 3-3 on page 31.
- Use of an extended XMM register, shown in Figure 4-12 on page 140.
- Use of a 64-bit (quadword) operand size, as described in "Operands" on page 41.
- Use of extended control and debug registers, as described in Volume 2.

REX prefix bytes have a value in the range 40h to 4Fh, depending on the particular combination of register extensions desired. With few exceptions, a REX prefix is required to access a 64-bit GPR or one of the extended GPR or XMM registers. A few instructions (described in "General-Purpose Instructions in 64-Bit Mode" in Volume 3) default to 64-bit operand size and do not need the REX prefix to select a 64-bit operand size.

An instruction can have only one REX prefix, and one such prefix is all that is needed to express the full selection of 64-bit-mode register-extension features. The prefix, if used, must immediately precede the first opcode byte of an instruction. Any other placement of a REX prefix is ignored. The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.

For further details on the REX prefixes, see "REX Prefixes" in Volume 3.

3.6 Feature Detection

The CPUID instruction provides information about the processor implementation and its capabilities. Software operating at any privilege level can execute the CPUID instruction to collect this information. After the information is collected, software can select procedures that optimize performance for a particular hardware implementation.
For example, application software can determine whether the AMD64 architecture's long mode is supported by the processor, and it can determine the processor implementation's performance capabilities.

Support for the CPUID instruction is implementation-dependent, as determined by software's ability to write the RFLAGS.ID bit. The following code sample shows how to test for the presence of the CPUID instruction by toggling the RFLAGS.ID bit (bit 21).

    pushfd                  ; save EFLAGS
    pop   eax               ; store EFLAGS in EAX
    mov   ebx, eax          ; save in EBX for later testing
    xor   eax, 00200000h    ; toggle bit 21
    push  eax               ; push to stack
    popfd                   ; save changed EAX to EFLAGS
    pushfd                  ; push EFLAGS to TOS
    pop   eax               ; store EFLAGS in EAX
    cmp   eax, ebx          ; see if bit 21 has changed
    jz    NO_CPUID          ; if no change, no CPUID

After software has determined that the processor implementation supports the CPUID instruction, software can test for support of specific features by loading a function code (value) into the EAX register and executing the CPUID instruction. Processor feature information is returned in the EAX, EBX, ECX, and EDX registers, as described fully in "CPUID" in Volume 3.

The architecture supports CPUID information about standard functions and extended functions. In general, standard functions include the earliest features offered in the x86 architecture. Extended functions include newer features of the x86 and AMD64 architectures, such as SSE, SSE2, and 3DNow! instructions, and long mode.

Standard functions are accessed by loading EAX with the value 0 (standard-function 0) or 1 (standard-function 1) and executing the CPUID instruction. All software using the CPUID instruction must execute standard-function 0, which identifies the processor vendor and the largest standard-function input value supported by the processor implementation.
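As a small illustration of interpreting the registers returned by standard-function 0, the following C sketch assembles the 12-character vendor string. It does not execute CPUID itself; the register values are passed in by the caller. The EBX, EDX, ECX ordering is taken from the general CPUID definition rather than from this section, and the function name `cpuid_vendor_string` is hypothetical.

```c
#include <stdint.h>

/* Build the vendor string from the values CPUID function 0 returns in
 * EBX, EDX, and ECX (in that order, least-significant byte first).
 * 'out' must have room for 12 characters plus the terminator. */
void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                         char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };

    for (int r = 0; r < 3; r++) {
        for (int i = 0; i < 4; i++) {
            out[r * 4 + i] = (char)((regs[r] >> (8 * i)) & 0xFF);
        }
    }
    out[12] = '\0';
}
```

For example, the register values 68747541h, 69746E65h, and 444D4163h decode to "AuthenticAMD".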
The CPUID standard-function 1 returns the processor version and standard-feature bits.

Software can test for support of extended functions by first executing the CPUID instruction with the value 8000_0000h in EAX. The processor returns, in EAX, the largest extended-function input value defined for the CPUID instruction on the processor implementation. If the value in EAX is greater than 8000_0000h, extended functions are supported, although specific extended functions must be tested individually.

The following code sample shows how to test for support of any extended functions:

    mov   eax, 80000000h    ; query for extended functions
    CPUID                   ; get extended function limit
    cmp   eax, 80000000h    ; is EAX greater than 80000000?
    jbe   NO_EXTENDEDMSR    ; no extended-feature support

If extended functions are supported, software can test for support of specific extended features. For example, software can determine whether the processor implementation supports long mode by executing the CPUID instruction with extended function 8000_0001h in the EAX register, then testing to see if bit 29 in the EDX register is set to 1. The following code sample shows how to test for long-mode support.

    mov   eax, 80000001h    ; query for function 8000_0001h
    CPUID                   ; get feature bits in EDX
    test  edx, 20000000h    ; test bit 29 in EDX
    jnz   YES_Long_Mode     ; long mode is supported

General-purpose instructions are supported in all hardware implementations of the AMD64 architecture, except that the general-purpose instructions discussed below are implemented only if their associated CPUID function bit is set. The following features are reported by CPUID function 1:

- CMPXCHG8B, indicated by bit 8.
- CMOVcc (conditional moves), indicated by bit 15.
- CLFLUSH, indicated by bit 19.
- LFENCE and MFENCE, indicated by the SSE2 bit (bit 26).
- MOVD, MOVMSKPD, and MOVNTI, indicated by the SSE2 bit (bit 26).
- MOVMSKPS, indicated by the SSE bit (bit 25).
- PREFETCHlevel, indicated by the SSE bit (bit 25).
- SFENCE, indicated by the SSE bit (bit 25).
- SYSENTER and SYSEXIT, indicated by bit 11.

The following features are reported by CPUID function 8000_0001h:

- MOVSXD, indicated by the long-mode bit (bit 29).
- SYSCALL and SYSRET, indicated by bit 11.
- PREFETCH and PREFETCHW, indicated by both the long-mode bit (bit 29) and the 3DNow! technology bit (bit 31).

Also, implementation of certain media instructions (such as FXSAVE and FXRSTOR) and system instructions (such as RDMSR and WRMSR) is indicated by CPUID function bits. See "Processor Feature Identification" in Volume 2 for a full description of the CPUID instruction and its function codes.

3.7 Control Transfers

3.7.1 Overview

From the application program's viewpoint, program-control flow is sequential--that is, instructions are addressed and executed sequentially--except when a branch instruction (a call, return, jump, interrupt, or return from interrupt) is encountered, in which case program flow changes to the branch instruction's target address. Branches are used to iterate through loops and move through conditional program logic. Branches cause a new instruction pointer to be loaded into the rIP register, and sometimes cause the CS register to point to a different code segment. The CS:rIP values can be specified as part of a branch instruction, or they can be read from a register or memory.

Branches can also be used to transfer control to another program or procedure running at a different privilege level. In such cases, the processor automatically checks the source program and target program privileges to ensure that the transfer is allowed before loading CS:rIP with the new values.

3.7.2 Privilege Levels

The processor's protected modes include legacy protected mode and long mode (both compatibility mode and 64-bit mode).
In all protected modes and virtual x86 mode, privilege levels are used to isolate and protect programs and data from each other. The privilege levels are designated with a numerical value from 0 to 3, with 0 being the most privileged and 3 being the least privileged. Privilege 0 is normally reserved for critical system-software components that require direct access to, and control over, all processor and system resources. Privilege 3 is used by application software. The intermediate privilege levels (1 and 2) are used, for example, by device drivers and library routines that access and control a limited set of processor and system resources.

Figure 3-9 on page 94 shows the relationship of the four privilege levels to each other. The protection scheme is implemented using the segmented memory-management mechanism described in "Segmented Virtual Memory" in Volume 2.

(Figure: Privilege 0: Memory Management, File Allocation, Interrupt Handling; Privilege 1 and Privilege 2: Device-Drivers, Library Routines; Privilege 3: Application Programs.)
Figure 3-9. Privilege-Level Relationships

3.7.3 Procedure Stack

A procedure stack is often used by control transfer operations, particularly those that change privilege levels. Information from the calling program is passed to the target program on the procedure stack. CALL instructions, interrupts, and exceptions all push information onto the procedure stack. The pushed information includes a return pointer to the calling program and, for call instructions, optionally includes parameters. When a privilege-level change occurs, the calling program's stack pointer (the pointer to the top of the stack) is pushed onto the stack. Interrupts and exceptions also push a copy of the calling program's rFLAGS register and, in some cases, an error code associated with the interrupt or exception.
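The rules in the preceding paragraph can be summarized in a short C model. This is only an illustrative sketch, not processor behavior captured in full: the names `transfer`, `pushed_items`, and the item labels are hypothetical, the push order follows the stack descriptions given later in this chapter, and neither call-gate parameter copying nor the long-mode rule that interrupts always push SS:rSP is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the kinds of control transfer discussed in the text. */
enum transfer { NEAR_CALL, FAR_CALL, INTERRUPT };

/*
 * Fill `out` (capacity of at least 6) with the names of the items pushed onto
 * the procedure stack, in push order, and return how many there are.
 * Simplification: parameters are left out, and `error_code` is only
 * meaningful for interrupts and exceptions.
 */
size_t pushed_items(enum transfer t, bool privilege_change, bool error_code,
                    const char *out[])
{
    size_t n = 0;

    /* A near call stays within one code segment and cannot change privilege. */
    assert(!(t == NEAR_CALL && privilege_change));

    if (privilege_change) {          /* caller's stack pointer, SS:rSP */
        out[n++] = "SS";
        out[n++] = "rSP";
    }
    if (t == INTERRUPT)              /* interrupts and exceptions save rFLAGS */
        out[n++] = "rFLAGS";
    if (t != NEAR_CALL)              /* far transfers save the code segment */
        out[n++] = "CS";
    out[n++] = "rIP";                /* every transfer saves a return offset */
    if (t == INTERRUPT && error_code)
        out[n++] = "error code";
    return n;
}
```

For example, under these assumptions a near call pushes one item (the return rIP), while an exception that changes privilege and reports an error code pushes six.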
The RET or IRET control-transfer instructions reverse the operation of CALLs, interrupts, and exceptions. These return instructions pop the return pointer off the stack and transfer control back to the calling program. If the calling program's stack pointer was pushed, it is restored by popping the saved values off the stack and into the SS and rSP registers.

Stack Alignment. Control-transfer performance can degrade significantly when the stack pointer is not aligned properly. Stack pointers should be word aligned in 16-bit segments, doubleword aligned in 32-bit segments, and quadword aligned in 64-bit mode.

Stack Operand-Size in 64-Bit Mode. In 64-bit mode, the stack pointer size is always 64 bits. The stack size is not controlled by the default-size (B) bit in the SS descriptor, as it is in compatibility and legacy modes, nor can it be overridden by an instruction prefix. Address-size overrides are ignored for implicit stack references.

Except for far branches, all instructions that implicitly reference the stack pointer default to 64-bit operand size in 64-bit mode. Table 3-8 on page 96 lists these instructions.

The default 64-bit operand size eliminates the need for a REX prefix with these instructions. However, a REX prefix is still required if R8-R15 (the extended set of eight GPRs) are used as operands, because the prefix is required to address the extended registers. Pushes and pops of 32-bit stack values are not possible in 64-bit mode with these instructions, because there is no 32-bit operand-size override prefix for 64-bit mode.

3.7.4 Jumps

Jump instructions provide a simple means for transferring program control from one location to another. Jumps do not affect the procedure stack, and return instructions cannot transfer control back to the instruction following a jump.
Two general types of jump instruction are available: unconditional (JMP) and conditional (Jcc). There are two types of unconditional jumps (JMP):

- Near Jumps--When the target address is within the current code segment.
- Far Jumps--When the target address is outside the current code segment.

Although unconditional jumps can be used to change code segments, they cannot be used to change privilege levels.

Conditional jumps (Jcc) test the state of various bits in the rFLAGS register (or rCX) and jump to a target location based on the results of that test. Only near forms of conditional jumps are available, so Jcc cannot be used to transfer control to another code segment.

Table 3-8. Instructions that Implicitly Reference RSP in 64-Bit Mode

                                                                              Operand Size (bits)
Mnemonic        Opcode (hex)   Description                                    Default   Possible Overrides (1)
CALL            E8, FF /2      Call Procedure Near                            64        16
ENTER           C8             Create Procedure Stack Frame                   64        16
LEAVE           C9             Delete Procedure Stack Frame                   64        16
POP reg/mem     8F /0          Pop Stack (register or memory)                 64        16
POP reg         58 to 5F       Pop Stack (register)                           64        16
POP FS          0F A1          Pop Stack into FS Segment Register             64        16
POP GS          0F A9          Pop Stack into GS Segment Register             64        16
POPF, POPFQ     9D             Pop to EFLAGS Word or Quadword                 64        16
PUSH imm32      68             Push onto Stack (sign-extended doubleword)     64        16
PUSH imm8       6A             Push onto Stack (sign-extended byte)           64        16
PUSH reg/mem    FF /6          Push onto Stack (register or memory)           64        16
PUSH reg        50 to 57       Push onto Stack (register)                     64        16
PUSH FS         0F A0          Push FS Segment Register onto Stack            64        16
PUSH GS         0F A8          Push GS Segment Register onto Stack            64        16
PUSHF, PUSHFQ   9C             Push rFLAGS Word or Quadword onto Stack        64        16
RET             C2, C3         Return From Call (near)                        64        16

Note 1. There is no 32-bit operand-size override prefix in 64-bit mode.

3.7.5 Procedure Calls

The CALL instruction transfers control unconditionally to a new address, but unlike jump instructions, it saves a return pointer (CS:rIP) on the stack.
The called procedure can use the RET instruction to pop the return pointer to the calling procedure from the stack and continue execution with the instruction following the CALL.

There are four types of CALL:

- Near Call--When the target address is within the current code segment.
- Far Call--When the target address is outside the current code segment.
- Interprivilege-Level Far Call--A far call that changes privilege level.
- Task Switch--A call to a target address in another task.

Near Call. When a near CALL is executed, only the calling procedure's rIP (the return offset) is pushed onto the stack. After the rIP is pushed, control is transferred to the new rIP value specified by the CALL instruction. Parameters can be pushed onto the stack by the calling procedure prior to executing the CALL instruction. Figure 3-10 shows the stack pointer before (old rSP value) and after (new rSP value) the CALL. The stack segment (SS) is not changed.

(Figure: procedure stack with labels Parameters, Return rIP, Old rSP, New rSP.)
Figure 3-10. Procedure Stack, Near Call

Far Call, Same Privilege. A far CALL changes the code segment, so the full return pointer (CS:rIP) is pushed onto the stack. After the return pointer is pushed, control is transferred to the new CS:rIP value specified by the CALL instruction. Parameters can be pushed onto the stack by the calling procedure prior to executing the CALL instruction. Figure 3-11 on page 98 shows the stack pointer before (old rSP value) and after (new rSP value) the CALL. The stack segment (SS) is not changed.

(Figure: procedure stack with labels Parameters, Old rSP, Return CS, Return rIP, New rSP.)
Figure 3-11. Procedure Stack, Far Call to Same Privilege

Far Call, Greater Privilege.
A far CALL to a more-privileged procedure performs a stack switch prior to transferring control to the called procedure. Switching stacks isolates the more-privileged procedure's stack from the less-privileged procedure's stack, and it provides a mechanism for saving the return pointer back to the procedure that initiated the call.

Calls to more-privileged software can only take place through a system descriptor called a call-gate descriptor. Call-gate descriptors are created and maintained by system software. In 64-bit mode, only indirect far calls (those whose target memory address is in a register or other memory location) are supported. Absolute far calls (those that reference the base of the code segment) are not supported in 64-bit mode.

When a call to a more-privileged procedure occurs, the processor locates the new procedure's stack pointer from its task-state segment (TSS). The old stack pointer (SS:rSP) is pushed onto the new stack, and (in legacy mode only) any parameters specified by the count field in the call-gate descriptor are copied from the old stack to the new stack (long mode does not support this automatic parameter copying). The return pointer (CS:rIP) is then pushed, and control is transferred to the new procedure. Figure 3-12 on page 99 shows an example of a stack switch resulting from a call to a more-privileged procedure. "Segmented Virtual Memory" in Volume 2 provides additional information on privilege-changing CALLs.

(Figure: old procedure stack with labels Parameters, Old SS:rSP; called procedure stack with labels Return SS, Return rSP, Parameters *, Return CS, Return rIP, New SS:rSP. * Parameters are copied only in legacy mode, not in long mode.)
Figure 3-12. Procedure Stack, Far Call to Greater Privilege

Task Switch.
In legacy mode, when a call to a new task occurs, the processor suspends the currently-executing task and stores the processor-state information at the point of suspension in the current task's task-state segment (TSS). The new task's state information is loaded from its TSS, and the processor resumes execution within the new task. In long mode, hardware task switching is disabled. Task switching is fully described in "Segmented Virtual Memory" in Volume 2.

3.7.6 Returning from Procedures

The RET instruction reverses the effect of a CALL instruction. The return address is popped off the procedure stack, transferring control unconditionally back to the calling procedure at the instruction following the CALL. A return that changes privilege levels also switches stacks.

The three types of RET are:

- Near Return--Transfers control back to the calling procedure within the current code segment.
- Far Return--Transfers control back to the calling procedure outside the current code segment.
- Interprivilege-Level Far Return--A far return that changes privilege levels.

All of the RET instruction types can be used with an immediate operand indicating the number of parameter bytes present on the stack. These parameters are released from the stack--that is, the stack pointer is adjusted by the value of the immediate operand--but the parameter bytes are not actually popped off of the stack (i.e., read into a register or memory location).

Near Return. When a near RET is executed, the calling procedure's return offset is popped off of the stack and into the rIP register. Execution begins from the newly-loaded offset. If an immediate operand is included with the RET instruction, the stack pointer is adjusted by the number of bytes indicated. Figure 3-13 shows the stack pointer before (old rSP value) and after (new rSP value) the RET.
The stack segment (SS) is not changed.

(Figure: procedure stack with labels New rSP, Parameters, Return rIP, Old rSP.)
Figure 3-13. Procedure Stack, Near Return

Far Return, Same Privilege. A far RET changes the code segment, so the full return pointer is popped off the stack and into the CS and rIP registers. Execution begins from the newly-loaded segment and offset. If an immediate operand is included with the RET instruction, the stack pointer is adjusted by the number of bytes indicated. Figure 3-14 on page 101 shows the stack pointer before (old rSP value) and after (new rSP value) the RET. The stack segment (SS) is not changed.

(Figure: procedure stack with labels New rSP, Parameters, Return CS, Return rIP, Old rSP.)
Figure 3-14. Procedure Stack, Far Return from Same Privilege

Far Return, Less Privilege. Privilege-changing far RETs can only return to less-privileged code segments; otherwise, a general-protection exception occurs. The full return pointer is popped off the stack and into the CS and rIP registers, and execution begins from the newly-loaded segment and offset. A far RET that changes privilege levels also switches stacks. The return procedure's stack pointer is popped off the stack and into the SS and rSP registers. If an immediate operand is included with the RET instruction, the newly-loaded stack pointer is adjusted by the number of bytes indicated. Figure 3-15 shows the stack pointer before (old SS:rSP value) and after (new SS:rSP value) the RET. "Segmented Virtual Memory" in Volume 2 provides additional information on privilege-changing RETs.

(Figure: old procedure stack with labels Return SS, Return rSP, Parameters, Return CS, Return rIP, Old SS:rSP; return procedure stack with labels Parameters, New SS:rSP.)
Figure 3-15. Procedure Stack, Far Return from Less Privilege

3.7.7 System Calls

A disadvantage of far CALLs and far RETs is that they use segment-based protection and privilege-checking. This involves significant overhead associated with loading new segment selectors and their corresponding descriptors into the segment registers. The overhead includes not only the time required to load the descriptors from memory but also the time required to perform the privilege, type, and limit checks. Privilege-changing CALLs to the operating system are slowed further by the control transfer through a gate descriptor.

SYSCALL and SYSRET. SYSCALL and SYSRET are low-latency system-call and system-return control-transfer instructions. They can be used in protected mode. These instructions eliminate segment-based privilege checking by using predetermined target and return code segments and stack segments. The operating system sets up and maintains the predetermined segments using special registers within the processor, so the segment descriptors do not need to be fetched from memory when the instructions are used. The simplifications made to privilege checking allow SYSCALL and SYSRET to complete in far fewer processor clock cycles than CALL and RET.

SYSRET can only be used to return from CPL = 0 procedures and is not available to application software. SYSCALL can be used by applications to call operating system service routines running at CPL = 0. The SYSCALL instruction does not take operands. Linkage conventions are initialized and maintained by the operating system. "System-Management Instructions" in Volume 2 contains detailed information on the operation of SYSCALL and SYSRET.

SYSENTER and SYSEXIT. The SYSENTER and SYSEXIT instructions provide similar capabilities to SYSCALL and SYSRET. However, these instructions can be used only in legacy mode and are not supported in long mode. SYSCALL and SYSRET are the preferred instructions for calling privileged software.
See "System-Management Instructions" in Volume 2 for further information on SYSENTER and SYSEXIT.

3.7.8 General Considerations for Branching

Branching causes delays which are a function of the hardware implementation's branch-prediction capabilities. Sequential flow avoids the delays caused by branching but is still exposed to delays caused by cache misses, memory bus bandwidth, and other factors.

In general, branching code should be replaced with sequential code whenever practical. This is especially important if the branch body is small (resulting in frequent branching) and when branches depend on random data (resulting in frequent mispredictions of the branch target). In certain hardware implementations, far branches (as opposed to near branches) may not be predictable by the hardware, and recursive functions (those that call themselves) may overflow a return-address stack.

All calls and returns should be paired for optimal performance. Hardware implementations that include a return-address stack can lose stack synchronization if calls and returns are not paired.

3.7.9 Branching in 64-Bit Mode

Near Branches in 64-Bit Mode. The long-mode architecture expands the near-branch mechanisms to accommodate branches in the full 64-bit virtual-address space. In 64-bit mode, the operand size for all near branches defaults to 64 bits, so these instructions update the full 64-bit RIP. Table 3-9 lists the near-branch instructions.

Table 3-9. Near Branches in 64-Bit Mode

                                                                                             Operand Size (bits)
Mnemonic              Opcode (hex)                 Description                               Default   Possible Overrides (1)
CALL                  E8, FF /2                    Call Procedure Near                       64        16
Jcc                   70 to 7F, 0F 80 to 0F 8F     Jump Conditional                          64        16
JCXZ, JECXZ, JRCXZ    E3                           Jump on CX/ECX/RCX Zero                   64        16
JMP                   EB, E9, FF /4                Jump Near                                 64        16
LOOP                  E2                           Loop                                      64        16
LOOPcc                E0, E1                       Loop if Zero/Equal or Not Zero/Equal      64        16
RET                   C2, C3                       Return From Call (near)                   64        16

Note 1. There is no 32-bit operand-size override prefix in 64-bit mode.
The default 64-bit operand size eliminates the need for a REX prefix with these instructions when registers RAX-RSP (the first set of eight GPRs) are used as operands. A REX prefix is still required if R8-R15 (the extended set of eight GPRs) are used as operands, because the prefix is required to address the extended registers.

The following aspects of near branches are controlled by the effective operand size:

- Truncation of the instruction pointer.
- Size of a stack pop or push, resulting from a CALL or RET.
- Size of a stack-pointer increment or decrement, resulting from a CALL or RET.
- Indirect-branch operand size.

In 64-bit mode, all of the above actions are forced to 64 bits. However, the size of the displacement field for relative branches is still limited to 32 bits.

The operand size of near branches is fixed at 64 bits without the need for a REX prefix. However, the address size of near branches is not forced in 64-bit mode. Such addresses are 64 bits by default, but they can be overridden to 32 bits by a prefix.

Branches to 64-Bit Offsets. Because immediates are generally limited to 32 bits, the only way a full 64-bit absolute RIP can be specified in 64-bit mode is with an indirect branch. For this reason, direct forms of far branches are invalid in 64-bit mode.

3.7.10 Interrupts and Exceptions

Interrupts and exceptions are a form of control transfer operation. They are used to call special system-service routines, called interrupt handlers, which are designed to respond to the interrupt or exception condition. Pointers to the interrupt handlers are stored by the operating system in an interrupt-descriptor table, or IDT. In legacy real mode, the IDT contains an array of 4-byte far pointers to interrupt handlers. In legacy protected mode, the IDT contains an array of 8-byte gate descriptors. In long mode, the gate descriptors are 16 bytes.
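Because the IDT is an array of fixed-size entries indexed by the interrupt vector, the byte offset of an entry follows directly from the entry sizes just listed. The following C fragment is a minimal sketch of that arithmetic; the names `idt_mode`, `idt_entry_size`, and `idt_entry_offset` are hypothetical, and the offset is relative to the start of the table, not a linear address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the three IDT layouts described in the text. */
enum idt_mode { IDT_REAL_MODE, IDT_LEGACY_PROTECTED_MODE, IDT_LONG_MODE };

/* Size in bytes of one IDT entry for the given operating mode. */
uint32_t idt_entry_size(enum idt_mode mode)
{
    switch (mode) {
    case IDT_REAL_MODE:             return 4;   /* far pointer */
    case IDT_LEGACY_PROTECTED_MODE: return 8;   /* gate descriptor */
    case IDT_LONG_MODE:             return 16;  /* expanded gate descriptor */
    }
    assert(!"unknown IDT mode");
    return 0;
}

/* Byte offset of the entry for `vector` (0 to 255) from the start of the IDT. */
uint32_t idt_entry_offset(enum idt_mode mode, uint8_t vector)
{
    return (uint32_t)vector * idt_entry_size(mode);
}
```

With these definitions, vector 14 in long mode sits at byte offset 224 of the table, and a full 256-entry long-mode IDT occupies 4096 bytes.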
Interrupt gates, task gates, and trap gates can be stored in the IDT, but not call gates.

Interrupt handlers are usually privileged software because they typically require access to restricted system resources. System software is responsible for creating the interrupt gates and storing them in the IDT. "Exceptions and Interrupts" in Volume 2 contains detailed information on the interrupt mechanism and the requirements on system software for managing the mechanism.

The IDT is indexed using the interrupt number, or vector. How the vector is specified depends on the source, as described below. The first 32 of the available 256 interrupt vectors are reserved for internal use by the processor--for exceptions (as described below) and other purposes.

Interrupts are caused either by software or hardware. The INT, INT3, and INTO instructions implement a software interrupt by calling an interrupt handler directly. These are general-purpose (privilege-level-3) instructions. The operand of the INT instruction is an immediate byte value specifying the interrupt vector used to index the IDT. INT3 and INTO are specific forms of software interrupts used to call interrupt 3 and interrupt 4, respectively.

External interrupts are produced by system logic which passes the IDT index to the processor via input signals. External interrupts can be either maskable or non-maskable.

Exceptions usually occur as a result of software execution errors or other internal-processor errors. Exceptions can also occur in non-error situations, such as debug-program single-stepping or address-breakpoint detection. In the case of exceptions, the processor produces the IDT index based on the detected condition. The handlers for interrupts and exceptions are identical for a given vector.

The processor's response to an exception depends on the type of the exception.
For all exceptions except 128-bit-media and x87 floating-point exceptions, control automatically transfers to the handler (or service routine) for that exception, as defined by the exception's vector. For 128-bit-media and x87 floating-point exceptions, there is both a masked and unmasked response. When unmasked, these exceptions invoke their exception handler. When masked, a default masked response is provided instead of invoking the exception handler.

Exceptions and software-initiated interrupts occur synchronously with respect to the processor clock. There are three types of exceptions:

- Faults--A fault is a precise exception that is reported on the boundary before the interrupted instruction. Generally, faults are caused by an undesirable error condition involving the interrupted instruction, although some faults (such as page faults) are common and normal occurrences. After the service routine completes, the machine state prior to the faulting instruction is restored, and the instruction is retried.
- Traps--A trap is a precise exception that is reported on the boundary following the interrupted instruction. The instruction causing the exception finishes before the service routine is invoked. Software interrupts and certain breakpoint exceptions used in debugging are traps.
- Aborts--Aborts are imprecise exceptions. The instruction causing the exception, and possibly an indeterminate additional number of instructions, complete execution before the service routine is invoked. Because they are imprecise, aborts typically do not allow reliable program restart.

Table 3-10 shows the interrupts and exceptions that can occur, together with their vector numbers, mnemonics, source, and causes. For a detailed description of interrupts and exceptions, see "Exceptions and Interrupts" in Volume 2.
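The difference between the three exception types can be expressed in terms of which instruction boundary is reported to the handler. The C sketch below is an illustration only: `exc_class` and `reported_rip` are hypothetical names, and the trap case assumes the interrupted instruction is not a branch, so that the following boundary is simply the next sequential address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three exception types described above. */
enum exc_class { EXC_FAULT, EXC_TRAP, EXC_ABORT };

/*
 * Compute the instruction boundary reported for an exception raised by the
 * instruction at `insn_addr`, which is `insn_len` bytes long.
 * Returns true and stores the boundary in `*rip` for precise exceptions;
 * returns false for aborts, which are imprecise.
 */
bool reported_rip(enum exc_class c, uint64_t insn_addr, uint8_t insn_len,
                  uint64_t *rip)
{
    assert(rip != NULL);
    switch (c) {
    case EXC_FAULT:                   /* boundary before the instruction */
        *rip = insn_addr;
        return true;
    case EXC_TRAP:                    /* boundary after the instruction */
        *rip = insn_addr + insn_len;  /* assumes a non-branch instruction */
        return true;
    case EXC_ABORT:                   /* no reliable restart point */
        break;
    }
    return false;
}
```

Under this model, returning from a fault handler re-executes the faulting instruction, whereas returning from a trap handler continues with the instruction that follows it.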
Control transfers to interrupt handlers are similar to far calls, except that for the former, the rFLAGS register is pushed onto the stack before the return address. Interrupts and exceptions to several of the first 32 interrupts can also push an error code onto the stack. No parameters are passed by an interrupt. As with CALLs, interrupts that cause a privilege change also perform a stack switch.

Table 3-10. Interrupts and Exceptions

Vector    Interrupt (Exception)                  Mnemonic  Source              Cause                                                             Generated By General-Purpose Instructions
0         Divide-By-Zero-Error                   #DE       Software            DIV, IDIV instructions                                            yes
1         Debug                                  #DB       Internal            Instruction accesses and data accesses                            yes
2         Non-Maskable-Interrupt                 NMI       External            External NMI signal                                               no
3         Breakpoint                             #BP       Software            INT3 instruction                                                  yes
4         Overflow                               #OF       Software            INTO instruction                                                  yes
5         Bound-Range                            #BR       Software            BOUND instruction                                                 yes
6         Invalid-Opcode                         #UD       Internal            Invalid instructions                                              yes
7         Device-Not-Available                   #NM       Internal            x87 instructions                                                  no
8         Double-Fault                           #DF       Internal            Interrupt during an interrupt                                     yes
9         Coprocessor-Segment-Overrun            --        External            Unsupported (reserved)                                            --
10        Invalid-TSS                            #TS       Internal            Task-state segment access and task switch                         yes
11        Segment-Not-Present                    #NP       Internal            Segment access through a descriptor                               yes
12        Stack                                  #SS       Internal            SS register loads and stack references                            yes
13        General-Protection                     #GP       Internal            Memory accesses and protection checks                             yes
14        Page-Fault                             #PF       Internal            Memory accesses when paging enabled                               yes
15        Reserved                               --        --                  --                                                                --
16        x87 Floating-Point Exception-Pending   #MF       Software            x87 floating-point and 64-bit media floating-point instructions   no
17        Alignment-Check                        #AC       Internal            Memory accesses                                                   yes
18        Machine-Check                          #MC       Internal, External  Model specific                                                    yes
19        SIMD Floating-Point                    #XF       Internal            128-bit media floating-point instructions                         no
20--31    Reserved (Internal and External)       --        --                  --                                                                --
32--255   External Interrupts (Maskable)         --        External            External interrupt signalling                                     no
0--255    Software Interrupts                    --        Software            INT instruction                                                   yes

Interrupt to Same Privilege in Legacy Mode. When an interrupt to a handler running at the same privilege occurs, the processor pushes a copy of the rFLAGS register, followed by the return pointer (CS:rIP), onto the stack. If the interrupt generates an error code, it is pushed onto the stack as the last item. Control is then transferred to the interrupt handler. Figure 3-16 shows the stack pointer before (old rSP value) and after (new rSP value) the interrupt. The stack segment (SS) is not changed.

(Figure: interrupt handler stack with labels Old rSP, rFLAGS, Return CS, Return rIP, Error Code, New rSP.)
Figure 3-16. Procedure Stack, Interrupt to Same Privilege

Interrupt to More Privilege or in Long Mode. When an interrupt to a more-privileged handler occurs, or the processor is operating in long mode, the processor locates the handler's stack pointer from the TSS. The old stack pointer (SS:rSP) is pushed onto the new stack, along with a copy of the rFLAGS register. The return pointer (CS:rIP) to the interrupted program is then copied to the stack. If the interrupt generates an error code, it is pushed onto the stack as the last item. Control is then transferred to the interrupt handler. Figure 3-17 shows an example of a stack switch resulting from an interrupt with a change in privilege.

(Figure: old procedure stack with label Old SS:rSP; interrupt handler stack with labels Return SS, Return rSP, rFLAGS, Return CS, Return rIP, Error Code, New SS:rSP.)
Figure 3-17.
Procedure Stack, Interrupt to Higher Privilege

Interrupt Returns. The IRET, IRETD, and IRETQ instructions are used to return from an interrupt handler. Prior to executing an IRET, the interrupt handler must pop the error code off of the stack if one was pushed by the interrupt or exception. IRET restores the interrupted program's rIP, CS, and rFLAGS by popping their saved values off of the stack and into their respective registers. If a privilege change occurs or IRET is executed in 64-bit mode, the interrupted program's stack pointer (SS:rSP) is also popped off of the stack. Control is then transferred back to the interrupted program.

3.8 Input/Output

I/O devices allow the processor to communicate with the outside world, usually to a human or to another system. In fact, a system without I/O has little utility. Typical I/O devices include a keyboard, mouse, LAN connection, printer, storage devices, and monitor. The speeds at which these devices must operate vary greatly, and usually depend on whether the communication is to a human (slow) or to another machine (fast). There are exceptions. For example, humans can consume graphics data at very high rates.

There are two methods for communicating with I/O devices in AMD64 processor implementations. One method involves accessing I/O through ports located in I/O-address space ("I/O Addressing" on page 110), and the other method involves accessing I/O devices located in the memory-address space ("Memory Organization" on page 11). The address spaces are separate and independent of each other.

I/O-address space was originally introduced as an optimized means for accessing I/O-device control ports.
At that time, systems usually had few I/O devices, devices tended to be relatively low-speed, device accesses needed to be strongly ordered to guarantee proper operation, and device protection requirements were minimal or non-existent. Memory-mapped I/O has largely supplanted I/O-address space access as the preferred means for modern operating systems to interface with I/O devices. Memory-mapped I/O offers greater flexibility in protection, vastly more I/O ports, higher speeds, and strong or weak ordering to suit the device requirements.

3.8.1 I/O Addressing

Access to I/O-address space is provided by the IN and OUT instructions, and the string variants of these instructions, INS and OUTS. The operation of these instructions is described in "Input/Output" on page 76. Although not required, processor implementations generally transmit I/O-port addresses and I/O data over the same external signals used for memory addressing and memory data. Different bus-cycles generated by the processor differentiate I/O-address space accesses from memory-address space accesses.

I/O-Address Space. Figure 3-18 on page 111 shows the 64 Kbyte I/O-address space. I/O ports can be addressed as bytes, words, or doublewords. As with memory addressing, word-I/O and doubleword-I/O ports are simply two or four consecutively-addressed byte-I/O ports. Word and doubleword I/O ports can be aligned on any byte boundary, but there is typically a performance penalty for unaligned accesses. Performance is optimized by aligning word-I/O ports on word boundaries, and doubleword-I/O ports on doubleword boundaries.

(Figure: I/O-address space extending from 0000h (0) to FFFFh (2^16 - 1).)
Figure 3-18. I/O Address Space

Memory-Mapped I/O. Memory-mapped I/O devices are attached to the system memory bus and respond to memory transactions as if they were memory devices, such as DRAM.
Access to memory-mapped I/O devices can be performed using any instruction that accesses memory, but typically MOV instructions are used to transfer data between the processor and the device. Some I/O devices may have restrictions on read-modify-write accesses.

Any location in memory can be used as a memory-mapped I/O address. System software can use the paging facilities to virtualize memory devices and protect them from unauthorized access. See "System-Management Instructions" in Volume 2 for a discussion of memory virtualization and paging.

3.8.2 I/O Ordering

The order of read and write accesses between the processor and an I/O device is usually important for properly controlling device operation. Accesses to I/O-address space and memory-address space differ in the default ordering enforced by the processor and the ability of software to control ordering.

I/O-Address Space. The processor always orders I/O-address space operations strongly, with respect to other I/O and memory operations. Software cannot modify the I/O ordering enforced by the processor. IN instructions are not executed until all previous writes to I/O space and memory have completed. OUT instructions delay execution of the following instruction until all writes--including the write performed by the OUT--have completed. Unlike memory writes, writes to I/O addresses are never buffered by the processor.

The processor can use more than one bus transaction to access an unaligned, multi-byte I/O port. Unaligned accesses to I/O-address space do not have a defined bus transaction ordering, and that ordering can change from one implementation to another. If the use of an unaligned I/O port is required, and the order of bus transactions to that port is important, software should decompose the access into multiple, smaller aligned accesses.

Memory-Mapped I/O.
To maximize software performance, processor implementations can execute instructions out of program order. This can cause the sequence of memory accesses to also be out of program order, called weakly ordered. As described in "Accessing Memory" on page 113, the processor can perform memory reads in any order, it can perform reads without knowing whether it requires the result (speculation), and it can reorder reads ahead of writes. In the case of writes, multiple writes to memory locations in close proximity to each other can be combined into a single write or a burst of multiple writes. Writes can also be delayed, or buffered, by the processor.

Application software that needs to force memory ordering to memory-mapped I/O devices can do so using the read/write barrier instructions: LFENCE, SFENCE, and MFENCE. These instructions are described in "Forcing Memory Order" on page 115. Serializing instructions, I/O instructions, and locked instructions can also be used as read/write barriers, but they modify program state and are an inferior method for enforcing strong-memory ordering.

Typically, the operating system controls access to memory-mapped I/O devices. The AMD64 architecture provides facilities for system software to specify the types of accesses and their ordering for entire regions of memory. These facilities are also used to manage the cacheability of memory regions. See "System-Management Instructions" in Volume 2 for further information.

3.8.3 Protected-Mode I/O

In protected mode, access to the I/O-address space is governed by the I/O privilege level (IOPL) field in the rFLAGS register, and the I/O-permission bitmap in the current task-state segment (TSS).

I/O-Privilege Level. RFLAGS.IOPL governs access to IOPL-sensitive instructions. All of the I/O instructions (IN, INS, OUT, and OUTS) are IOPL-sensitive.
IOPL-sensitive instructions cannot be executed by a program unless the program's current privilege level (CPL) is numerically less than or equal to (that is, at least as privileged as) the RFLAGS.IOPL field; otherwise, a general-protection exception (#GP) occurs.

Only software running at CPL=0 can change the RFLAGS.IOPL field. Two instructions, POPF and IRET, can be used to change the field. If application software (or any software running at CPL>0) attempts to change RFLAGS.IOPL, the attempt is ignored.

System software uses RFLAGS.IOPL to control the privilege level required to access I/O-address space devices. Access can be granted on a program-by-program basis using different copies of RFLAGS for every program, each with a different IOPL. RFLAGS.IOPL acts as a global control over a program's access to I/O-address space devices. System software can grant less-privileged programs access to individual I/O devices (overriding RFLAGS.IOPL) by using the I/O-permission bitmap stored in a program's TSS. For details about the I/O-permission bitmap, see "I/O-Permission Bitmap" in Volume 2.

3.9 Memory Optimization

Generally, application software is unaware of the memory hierarchy implemented within a particular system design. The application simply sees a homogeneous address space within a single level of memory. In reality, both system and processor implementations can use any number of techniques to speed up accesses into memory, doing so in a manner that is transparent to applications. Application software can be written to maximize this speed-up even though the methods used by the hardware are not visible to the application. This section gives an overview of the memory hierarchy and access techniques that can be implemented within a system design, and how applications can optimize their use.
3.9.1 Accessing Memory

Implementations of the AMD64 architecture commit the results of each instruction--i.e., store the result of the executed instruction in software-visible resources, such as a register (including flags), the data cache, an internal write buffer, or memory--in program order, which is the order specified by the instruction sequence in a program. Transparent to the application, implementations can execute instructions in any order and temporarily hold out-of-order results until the instructions are committed. Implementations can also speculatively execute instructions--executing instructions before knowing their results will be used (for example, executing both sides of a branch). By executing instructions out-of-order and speculatively, a processor can boost application performance by executing instructions that are ready, rather than delaying them behind instructions that are waiting for data. However, the processor commits results in program order (the order expected by software).

When executing instructions out-of-order and speculatively, processor implementations often find it useful to also allow out-of-order and speculative memory accesses. However, such memory accesses are potentially visible to software and system devices. The following sections describe the architectural rules for memory accesses. See "Memory System" in Volume 2 for information on how system software can further specify the flexibility of memory accesses.

Read Ordering. The ordering of memory reads does not usually affect program execution because the ordering does not usually affect the state of software-visible resources. The rules governing read ordering are:
- Out-of-order reads are allowed. Out-of-order reads can occur as a result of out-of-order instruction execution.
The processor can read memory out-of-order to prevent stalling instructions that are executed out-of-order.
- Speculative reads are allowed. A speculative read occurs when the processor begins executing a memory-read instruction before it knows whether the instruction's result will actually be needed. For example, the processor can predict a branch to occur and begin executing instructions following the predicted branch, before it knows whether the prediction is valid. When one of the speculative instructions reads data from memory, the read itself is speculative.
- Reads can usually be reordered ahead of writes. Reads are generally given a higher priority by the processor than writes because instruction execution stalls if the read data required by an instruction is not immediately available. Allowing reads ahead of writes usually maximizes software performance.

Reads can be reordered ahead of writes, except that a read cannot be reordered ahead of a prior write if the read is from the same location as the prior write. In this case, the read instruction stalls until the write instruction is committed. This is because the result of the write instruction is required by the read instruction for software to operate correctly.

Some system devices might be sensitive to reads. Normally, applications do not have direct access to system devices, but instead call an operating-system service routine to perform the access on the application's behalf. In this case, it is system software's responsibility to enforce strong read-ordering.

Write Ordering. Writes affect program order because they affect the state of software-visible resources. The rules governing write ordering are restrictive:
- Generally, out-of-order writes are not allowed. Write instructions executed out-of-order cannot commit (write) their result to memory until all previous instructions have completed in program order.
The processor can, however, hold the result of an out-of-order write instruction in a private buffer (not visible to software) until that result can be committed to memory. System software can create non-cacheable write-combining regions in memory when the order of writes is known to not affect system devices. When writes are performed to write-combining memory, they can appear to complete out of order relative to other writes. See "Memory System" in Volume 2 for additional information.
- Speculative writes are not allowed. As with out-of-order writes, speculative write instructions cannot commit their result to memory until all previous instructions have completed in program order. Processors can hold the result in a private buffer (not visible to software) until the result can be committed.

3.9.2 Forcing Memory Order

Special instructions are provided for application software to force memory ordering in situations where such ordering is important. These instructions are:
- Load Fence--The LFENCE instruction forces ordering of memory loads (reads). All memory loads preceding the LFENCE (in program order) are completed prior to completing memory loads following the LFENCE. Memory loads cannot be reordered around an LFENCE instruction, but other non-serializing instructions (such as memory writes) can be reordered around the LFENCE.
- Store Fence--The SFENCE instruction forces ordering of memory stores (writes). All memory stores preceding the SFENCE (in program order) are completed prior to completing memory stores following the SFENCE. Memory stores cannot be reordered around an SFENCE instruction, but other non-serializing instructions (such as memory loads) can be reordered around the SFENCE.
- Memory Fence--The MFENCE instruction forces ordering of all memory accesses (reads and writes).
All memory accesses preceding the MFENCE (in program order) are completed prior to completing any memory access following the MFENCE. Memory accesses cannot be reordered around an MFENCE instruction, but other non-serializing instructions that do not access memory can be reordered around the MFENCE.

Although they serve different purposes, other instructions can be used as read/write barriers when the order of memory accesses must be strictly enforced. These read/write barrier instructions force all prior reads and writes to complete before subsequent reads or writes are executed. Unlike the fence instructions listed above, these other instructions alter the software-visible state. This makes these instructions less general and more difficult to use as read/write barriers than the fence instructions, although their use may reduce the total number of instructions executed. The following instructions are usable as read/write barriers:
- Serializing instructions--Serializing instructions force the processor to commit the serializing instruction and all previous instructions before the next instruction is fetched from memory. The serializing instructions available to applications are CPUID and IRET. A serializing instruction is committed when the following operations are complete:
  - The instruction has executed.
  - All registers modified by the instruction are updated.
  - All memory updates performed by the instruction are complete.
  - All data held in the write buffers has been written to memory. (Write buffers are described in "Write Buffering" on page 119.)
- I/O instructions--Reads from and writes to I/O-address space use the IN and OUT instructions, respectively.
When the processor executes an I/O instruction, it orders it with respect to other loads and stores, depending on the instruction:
  - IN instructions (IN, INS, and REP INS) are not executed until all previous stores to memory and I/O-address space are complete.
  - Instructions following an OUT instruction (OUT, OUTS, or REP OUTS) are not executed until all previous stores to memory and I/O-address space are complete, including the store performed by the OUT.
- Locked instructions--A locked instruction is one that contains the LOCK instruction prefix. A locked instruction is used to perform an atomic read-modify-write operation on a memory operand, so it needs exclusive access to the memory location for the duration of the operation. Locked instructions order memory accesses in the following way:
  - All previous loads and stores (in program order) are completed prior to executing the locked instruction.
  - The locked instruction is completed before allowing loads and stores for subsequent instructions (in program order) to occur.
  Only certain instructions can be locked. See "Lock Prefix" in Volume 3 for a list of instructions that can use the LOCK prefix.

3.9.3 Caches

Depending on the instruction, operands can be encoded in the instruction opcode or located in registers, I/O ports, or memory locations. An operand that is located in memory can actually be physically present in one or more locations within a system's memory hierarchy.

Memory Hierarchy. A system's memory hierarchy may have some or all of the following levels:
- Main Memory--Main memory is external to the processor chip and is the memory-hierarchy level farthest from the processor's execution units. All physical-memory addresses are present in main memory, which is implemented using relatively slow, but high-density memory devices.
- External Caches--External caches are external to the processor chip, but are implemented using lower-capacity, higher-performance memory devices than system memory.
The system uses external caches to hold copies of frequently-used instructions and data found in main memory. A subset of the physical-memory addresses can be present in the external caches at any time. A system can contain any number of external caches, or none at all.
- Internal Caches--Internal caches are present on the processor chip itself, and are the closest memory-hierarchy level to the processor's execution units. Because of their presence on the processor chip, access to internal caches is very fast. Internal caches contain copies of the most frequently-used instructions and data found in main memory and external caches, and their capacities are relatively small in comparison to external caches. A processor implementation can contain any number of internal caches, or none at all. Implementations often contain a first-level instruction cache and first-level data (operand) cache, and they may also contain a higher-capacity (and slower) second-level internal cache for storing both instructions and data.

Figure 3-19 on page 119 shows an example of a four-level memory hierarchy that combines main memory, external third-level (L3) cache, and internal second-level (L2) and two first-level (L1) caches. As the figure shows, the first-level and second-level caches are implemented on the processor chip, and the third-level cache is external to the processor. The first-level cache is a split cache, with separate caches used for instructions and data. The second-level and third-level caches are unified (they contain both instructions and data).

Memory at the highest levels of the hierarchy has greater capacity (larger size), but slower access, than memory at the lowest levels. Using caches to store frequently used instructions and data can result in significantly improved software performance by avoiding accesses to the slower main memory.
Applications function identically on systems without caches and on systems with caches, although cacheless systems typically execute applications more slowly. Application software can, however, be optimized to make efficient use of caches when they are present, as described later in this section.

Figure 3-19. Memory Hierarchy Example (in order of decreasing size and increasing access speed: main memory and L3 cache in the system; L2 cache, L1 instruction cache, and L1 data cache in the processor)

Write Buffering. Processor implementations can contain write buffers attached to the internal caches. Write buffers can also be present on the interface used to communicate with the external portions of the memory hierarchy. Write buffers temporarily hold data writes when main memory or the caches are busy responding to other memory-system accesses. The existence of write buffers is transparent to software. However, some of the instructions used to optimize memory-hierarchy performance can affect the write buffers, as described in "Forcing Memory Order" on page 115.

3.9.4 Cache Operation

Although the existence of caches is transparent to application software, a simple understanding of how caches are accessed can assist application developers in optimizing their code to run efficiently when caches are present.

Caches are divided into fixed-size blocks, called cache lines. Typically, implementations have either 32-byte or 64-byte cache lines. The processor allocates a cache line to correspond to an identically-sized region in main memory. After a cache line is allocated, the addresses in the corresponding region of main memory are used as addresses into the cache line. It is the processor's responsibility to keep the contents of the allocated cache line coherent with main memory.
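The relationship between a memory address and the cache line that holds it can be sketched in C as follows. This is a minimal illustration that assumes a 64-byte line size (CACHE_LINE_BYTES); the function names are hypothetical, and the actual line size depends on the processor implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumed; the text cites 32 or 64 bytes as typical */

/* Address of the first byte of the line-sized memory region containing
   `address` (clears the low-order offset bits). */
uintptr_t cache_line_base(uintptr_t address)
{
    return address & ~(uintptr_t)(CACHE_LINE_BYTES - 1u);
}

/* Byte offset of `address` within its line-sized region. */
size_t cache_line_offset(uintptr_t address)
{
    return (size_t)(address & (CACHE_LINE_BYTES - 1u));
}

/* Nonzero when both addresses fall in the same line-sized region. */
int same_cache_line(uintptr_t a, uintptr_t b)
{
    return cache_line_base(a) == cache_line_base(b);
}
```

Under these assumptions, address 1234h lies at offset 34h within the line-sized region that starts at 1200h, while 123Fh and 1240h fall in different regions.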
Should another system device access a memory address that is cached, the processor maintains coherency by providing the correct data back to the device and main memory.

When a memory-read occurs as a result of an instruction fetch or operand access, the processor first checks the cache to see if the requested information is available. A read hit occurs if the information is available in the cache, and a read miss occurs if the information is not available. Likewise, a write hit occurs if a memory write can be stored in the cache, and a write miss occurs if it cannot be stored in the cache.

A read miss or write miss can result in the allocation of a cache line, followed by a cache-line fill. Even if only a single byte is needed, all bytes in a cache line are loaded from memory by a cache-line fill. Typically, a cache-line fill must write over an existing cache line in a process called a cache-line replacement. In this case, if the existing cache line is modified, the processor performs a cache-line writeback to main memory prior to performing the cache-line fill.

Cache-line writebacks help maintain coherency between the caches and main memory. Internally, the processor can also maintain cache coherency by internally probing (checking) the other caches and write buffers for a more recent version of the requested data. External devices can also check a processor's caches and write buffers for more recent versions of data by externally probing the processor. All coherency operations are performed in hardware and are completely transparent to applications.

Cache Coherency and MOESI. Implementations of the AMD64 architecture maintain coherency between memory and caches using a five-state protocol known as MOESI. The five MOESI states are modified, owned, exclusive, shared, and invalid. See "Memory System" in Volume 2 for additional information on MOESI and cache coherency.

Self-Modifying Code. Software that writes into a code segment is classified as self-modifying code. To avoid cache-coherency problems due to self-modifying code, implementations of the AMD64 architecture invalidate a cache line during a memory write if the cache line corresponds to a code-segment memory location. By invalidating the cache line, the processor is forced to write the modified instruction into main memory. A subsequent fetch of the modified instruction goes to main memory to get the coherent version of the instruction.

3.9.5 Cache Pollution

Because cache sizes are limited, caches should be filled only with data that is frequently used by an application. Data that is used infrequently, or not at all, is said to pollute the cache because it occupies otherwise useful cache lines.

Ideally, the best data to cache is data that adheres to the principle of locality. This principle has two components: temporal locality and spatial locality.

Temporal locality refers to data that is likely to be used more than once in a short period of time. It is useful to cache temporal data because subsequent accesses can retrieve the data quickly. Non-temporal data is assumed to be used once, and then not used again for a long period of time, or ever. Caching of non-temporal data pollutes the cache and should be avoided. Cache-control instructions ("Cache-Control Instructions" on page 122) are available to applications to minimize cache pollution caused by non-temporal data.

Spatial locality refers to data that resides at addresses adjacent to or very close to the data being referenced. Typically, when data is accessed, it is likely the data at nearby addresses will be accessed in a short period of time. Caches perform cache-line fills in order to take advantage of spatial locality. During a cache-line fill, the referenced data and nearest neighbors are loaded into the cache.
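The cost of poor spatial locality can be estimated with a small C sketch. It counts how many distinct cache lines a sequence of strided accesses touches and what fraction of the filled bytes is actually used. The function names are hypothetical, and the model rests on stated assumptions: 64-byte lines, a first access on a line boundary, a stride at least as large as the element size, and elements that never straddle two lines.

```c
#include <stddef.h>

#define LINE_BYTES 64u  /* assumed cache-line size */

/* Number of distinct cache lines touched by `count` accesses placed
   `stride_bytes` apart, starting at a line-aligned address. Offsets only
   increase, so comparing with the previous line index is sufficient. */
size_t lines_touched(size_t count, size_t stride_bytes)
{
    size_t touched = 0;
    size_t previous_line = (size_t)-1;  /* sentinel: no line seen yet */
    for (size_t i = 0; i < count; i++) {
        size_t line = (i * stride_bytes) / LINE_BYTES;
        if (line != previous_line) {
            touched++;
            previous_line = line;
        }
    }
    return touched;
}

/* Percentage (rounded down) of the bytes brought in by cache-line fills
   that the accesses actually reference. */
unsigned fill_utilization_percent(size_t count, size_t elem_bytes,
                                  size_t stride_bytes)
{
    size_t lines = lines_touched(count, stride_bytes);
    if (lines == 0) {
        return 0;  /* no accesses, so nothing was filled */
    }
    return (unsigned)((100u * count * elem_bytes) / (lines * LINE_BYTES));
}
```

In this model, sixteen 4-byte elements read at unit stride occupy a single line and use all of it, whereas the same sixteen elements read 64 bytes apart touch sixteen lines and use about 6 percent of the bytes that were filled.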
If the characteristics of spatial locality do not fit the data used by an application, then the cache becomes polluted with a large amount of unreferenced data. Applications can avoid problems with this type of cache pollution by using data structures with good spatial-locality characteristics.

Another form of cache pollution is stale data. Data that adheres to the principle of locality can become stale when it is no longer used by the program, or won't be used again for a long time. Applications can use the CLFLUSH instruction to remove stale data from the cache.

3.9.6 Cache-Control Instructions

General control and management of the caches is performed by system software and not application software. System software uses special registers to assign memory types to physical-address ranges, and page-attribute tables are used to assign memory types to virtual address ranges. Memory types define the cacheability characteristics of memory regions and how coherency is maintained with main memory. See "Memory System" in Volume 2 for additional information on memory typing.

Instructions are available that allow application software to control the cacheability of data it uses on a more limited basis. These instructions can be used to boost an application's performance by prefetching data into the cache, and by avoiding cache pollution. Run-time analysis tools and compilers may be able to suggest the use of cache-control instructions for critical sections of application code.

Cache Prefetching. Applications can prefetch entire cache lines into the caching hierarchy using one of the prefetch instructions. The prefetch should be performed in advance, so that the data is available in the cache when needed.
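A loop that prefetches ahead of its own reads might look like the following C sketch. It is an illustration under several assumptions rather than a recommended implementation: it relies on the __builtin_prefetch builtin provided by GCC and Clang (other compilers may not offer it), which these compilers typically lower to a prefetch instruction on targets that have one; it assumes 64-byte cache lines and 4-byte int elements; and the prefetch distance (LINES_AHEAD) is an untuned guess. The function name is hypothetical.

```c
#include <stddef.h>

#define INTS_PER_LINE 16u  /* assumes 64-byte cache lines and 4-byte int */
#define LINES_AHEAD    4u  /* prefetch distance in lines; a tuning guess */

/* Sum an array while requesting data a few cache lines ahead of the
   element currently being read. */
long sum_with_prefetch(const int *data, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        size_t ahead = i + (size_t)INTS_PER_LINE * LINES_AHEAD;
        /* Issue roughly one prefetch per line's worth of elements, and
           only for elements that lie inside the array. */
        if ((i % INTS_PER_LINE) == 0 && ahead < count) {
            /* Arguments: address, 0 = prefetch for reading,
               3 = high temporal locality (keep in all cache levels). */
            __builtin_prefetch(&data[ahead], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch changes only timing, never the result, so the function returns the same sum with or without it. Whether it helps at all depends on the implementation and on how far ahead the request is issued.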
Although load instructions can mimic the prefetch function, they do not offer the same performance advantage, because a load instruction may cause a subsequent instruction to stall until the load completes, but a prefetch instruction will never cause such a stall. Load instructions also unnecessarily require the use of a register, but prefetch instructions do not.

The instructions available in the AMD64 architecture for cache-line prefetching include one SSE instruction and two 3DNow! instructions:
- PREFETCHlevel--(an SSE instruction) Prefetches read/write data into a specific level of the cache hierarchy. If the requested data is already in the desired cache level or closer to the processor (lower cache-hierarchy level), the data is not prefetched. If the operand specifies an invalid memory address, no exception occurs, and the instruction has no effect. Attempts to prefetch data from non-cacheable memory, such as video frame buffers, or data from write-combining memory, are also ignored. The exact actions performed by the PREFETCHlevel instructions depend on the processor implementation.
  - PREFETCHT0--Prefetches temporal data into the entire cache hierarchy.
  - PREFETCHT1--Prefetches temporal data into the second-level (L2) and higher-level caches, but not into the L1 cache.
  - PREFETCHT2--Prefetches temporal data into the third-level (L3) and higher-level caches, but not into the L1 or L2 cache.
  - PREFETCHNTA--Prefetches non-temporal data into the processor, minimizing cache pollution. The specific technique for minimizing cache pollution is implementation-dependent and can include such techniques as allocating space in a software-invisible buffer, allocating a cache line in a single cache or a specific way of a cache, etc.
- PREFETCH--(a 3DNow! instruction) Prefetches read data into the L1 data cache.
Data can be written to such a cache line, but doing so can result in additional delay because the processor must signal externally to negotiate the right to change the cache line's cache-coherency state for the purpose of writing to it.
- PREFETCHW--(a 3DNow! instruction) Prefetches write data into the L1 data cache. Data can be written to the cache line without additional delay, because the data is already prefetched in the modified cache-coherency state. Data can also be read from the cache line without additional delay. However, prefetching write data takes longer than prefetching read data if the processor must wait for another caching master to first write-back its modified copy of the requested data to memory before the prefetch request is satisfied.

The PREFETCHW instruction provides a hint to the processor that the cache line is to be modified, and is intended for use when the cache line will be written to shortly after the prefetch is performed. The processor can place the cache line in the modified state when it is prefetched, but before it is actually written. Doing so can save time compared to a PREFETCH instruction, followed by a subsequent cache-state change due to a write.

To prevent a false-store dependency from stalling a prefetch instruction, prefetched data should be located at least one cache-line away from the address of any surrounding data write. For example, if the cache-line size is 32 bytes, avoid prefetching from data addresses within 32 bytes of the data address in a preceding write instruction.

Non-Temporal Stores. Non-temporal store instructions are provided to prevent memory writes from being stored in the cache, thereby reducing cache pollution. These non-temporal store instructions are specific to the type of register they write:
- GPR Non-Temporal Stores--MOVNTI.
XMM Non-Temporal Stores--MASKMOVDQU, MOVNTDQ, MOVNTPD, and MOVNTPS. MMX Non-Temporal Stores--MASKMOVQ and MOVNTQ. Removing Stale Cache Lines. When cache data becomes stale, it occupies space in the cache that could be used to store frequently-accessed data. Applications can use the CLFLUSH instruction to free a stale cache-line for use by other data. CLFLUSH writes the contents of a cache line to memory and then invalidates the line in the cache and in all other caches in the cache hierarchy that contain the line. Once invalidated, the line is available for use by the processor and can be filled with other data. 3.10 Performance Considerations In addition to typical code optimization techniques, such as those affecting loops and the inlining of function calls, the following considerations may help improve the performance of a p p l i c a t i o n p r og ra m s w r i t t e n w i t h g e n e ra l - p u r p o s e instructions. These are implementation-independent performance considerations. Other considerations depend on the hardware implementation. For information about such implementationdependent considerations and for more information about application performance in general, see the data sheets and the software-optimization guides relating to particular hardware implementations. 124 Chapter 3: General-Purpose Programming 24593--Rev. 3.09--September 2003 AMD 64-Bit Technology 3.10.1 Use Large Operand Sizes 3.10.2 Use Short Instructions Loading, storing, and moving data with the largest relevant operand size maximizes the memory bandwidth of these instructions. Use the shortest possible form of an instruction (the form with fewest opcode bytes). This increases the number of instructions that can be decoded at any one time, and it reduces overall code size. Data alignment directly affects memory-access performance. 
Data alignment is particularly important when accessing streaming (also called non-temporal) data--data that will not be reused and therefore should not be cached. Data alignment is also important in cases where data that is written by one instruction is subsequently read by a subsequent instruction soon after the write. Branching can be very time-consuming. If the body of a branch is small, the branch may be replaceable with conditional move (CMOVcc) instructions, or with 128-bit or 64-bit media instructions that simulate predicated parallel execution or parallel conditional moves. Memory latency can be substantially reduced--especially for data that will be used multiple times--by prefetching such data into various levels of the cache hierarchy. Software can use the PREFETCHx instructions very effectively in such cases. One PREFETCHx per cache line should be used. Some of the best places to use prefetch instructions are inside loops that process large amounts of data. If the loop goes through less than one cache line of data per iteration, partially unroll the loop. Try to use virtually all of the prefetched data. This usually requires unit-stride memory accesses--those in which all accesses are to contiguous memory locations. For data that will be used only once in a procedure, consider using non-temporal accesses. Such accesses are not burdened by the overhead of cache protocols. 3.10.3 Align Data 3.10.4 Avoid Branches 3.10.5 Prefetch Data 3.10.6 Keep Common Operands in Registers 3.10.7 Avoid True Dependencies Keep frequently used values in registers rather than in memory. This avoids the comparatively long latencies for accessing memory. S p re a d o u t t r u e d e p e n d e n c i e s ( w r i t e - re a d o r f l o w dependencies) to increase the opportunities for parallel 125 Chapter 3: General-Purpose Programming AMD 64-Bit Technology 24593--Rev. 3.09--September 2003 execution. 
This spreading out is not necessary for antidependencies and output dependencies.

3.10.8 Avoid Store-to-Load Dependencies
Store-to-load dependencies occur when data is stored to memory, only to be read back shortly thereafter. Hardware implementations of the architecture may contain means of accelerating such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory. However, this acceleration might be available only when the addresses and operand sizes of the store and the dependent load are matched, and when both memory accesses are aligned. Performance is typically optimized by avoiding such dependencies altogether and keeping the data, including temporary variables, in registers.

3.10.9 Optimize Stack Allocation
When allocating space on the stack for local variables and/or outgoing parameters within a procedure, adjust the stack pointer and use moves rather than pushes. This method of allocation allows random access to the outgoing parameters, so that they can be set up when they are calculated instead of being held in a register or memory until the procedure call. This method also reduces stack-pointer dependencies.

3.10.10 Consider Repeat-Prefix Setup Time
The repeat instruction prefixes have a setup overhead. If the repeated count is variable, the overhead can sometimes be avoided by substituting a simple loop to move or store the data. Repeated string instructions can be expanded into equivalent sequences of inline loads and stores. For details, see "Repeat Prefixes" in Volume 3.

3.10.11 Replace GPR with Media Instructions
Some integer-based programs can be made to run faster by using 128-bit media or 64-bit media instructions. These instructions have their own register sets. Because of this, they relieve register pressure on the GPR registers. For loads, stores, adds, shifts, etc., media instructions may be good substitutes for general-purpose integer instructions. GPR registers are freed up, and the media instructions increase opportunities for parallel operations.

3.10.12 Organize Data in Memory Blocks
Organize frequently accessed constants and coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available memory bandwidth.

4 128-Bit Media and Scientific Programming
This chapter describes the 128-bit media and scientific programming model. This model includes all instructions that access the 128-bit XMM registers--called the 128-bit media instructions. These instructions perform integer and floating-point operations primarily on vector operands (a few of the instructions take scalar operands). They can speed up certain types of procedures--typically high-performance media and scientific procedures--by substantial factors, depending on data-element size and the regularity and locality of data accesses to memory.

4.1 Overview

4.1.1 Origins
The 128-bit media instruction set includes instructions originally introduced as the streaming SIMD extension (SSE) and SSE2 instructions. For details on the instruction set origin of each instruction, see "Instruction Subsets and CPUID Feature Sets" in Volume 3.

4.1.2 Compatibility
The 128-bit media instructions can be executed in any of the architecture's operating modes. Existing SSE and SSE2 binary programs run in legacy and compatibility modes without modification. The support provided by the AMD64 architecture for such binaries is identical to that provided by legacy x86 architectures.

To run in 64-bit mode, legacy 128-bit media programs must be recompiled. The recompilation has no side effects on such programs, other than to provide access to the following additional resources:
- Access to the eight extended XMM registers (for a total of 16 XMM registers).
- Access to the eight extended general-purpose registers (for a total of 16 GPRs).
- Access to the extended 64-bit width of all GPRs.
- Access to the 64-bit virtual address space.
- Access to the RIP-relative addressing mode.

The 128-bit media instructions use data registers, a control and status register (MXCSR), rounding control, and an exception reporting and response mechanism that are distinct from and functionally independent of those used by the x87 floating-point instructions. Because of this, 128-bit media programming support usually requires exception handlers that are distinct from those used for x87 exceptions. This support is provided by virtually all legacy operating systems for the x86 architecture.

4.2 Capabilities
The 128-bit media instructions are designed to support media and scientific applications. The vector operands used by these instructions allow applications to operate in parallel on multiple elements of vectors. The elements can be integers (from bytes to quadwords) or floating-point (either single-precision or double-precision). Arithmetic operations produce signed, unsigned, and/or saturating results.

The availability of several types of vector move instructions and (in 64-bit mode) twice the legacy number of XMM registers (a total of 16 such registers) can eliminate substantial memory-access overhead, making a substantial difference in performance.

4.2.1 Types of Applications
Typical media applications well-suited to the 128-bit media programming model include a broad range of audio, video, and graphics programs.
For example, music synthesis, speech synthesis, speech recognition, audio and video compression (encoding) and decompression (decoding), 2D and 3D graphics, streaming video (up to high-definition TV), and digital signal processing (DSP) kernels are all likely to experience higher performance using 128-bit media instructions than using other types of instructions in the AMD64 architecture.

Such applications commonly use small-sized integer or single-precision floating-point data elements in repetitive loops, in which the typical operations are inherently parallel. For example, 8-bit and 16-bit data elements are commonly used for pixel information in graphics applications, in which each of the RGBA pixel components (red, green, blue, and alpha) is represented by an 8-bit or 16-bit integer. 16-bit data elements are also commonly used for audio sampling.

The 128-bit media instructions allow multiple data elements like these to be packed into 128-bit vector operands located in XMM registers or memory. The instructions operate in parallel on each of the elements in these vectors. For example, 16 elements of 8-bit data can be packed into a 128-bit vector operand, so that all 16 byte elements are operated on simultaneously, and in pairs of source operands, by a single instruction.

The 128-bit media instructions also support a broad spectrum of scientific applications. For example, their ability to operate in parallel on double-precision floating-point vector elements makes them well-suited to computations like dense systems of linear equations, including matrix and vector-space operations with real and complex numbers. In professional CAD applications, for example, high-performance physical-modeling algorithms can be implemented to simulate processes such as heat transfer or fluid dynamics.
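As a rough illustration of the packed-byte case described above, the following C sketch models a 128-bit operand as 16 byte lanes and adds two such operands lane by lane. It is a scalar model for reasoning about lane behavior, not a description of how the hardware or any particular instruction is implemented. The type `vec128_u8` and the function `add_bytes` are hypothetical names introduced here, and wraparound (non-saturating) addition is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit vector operand holding 16 byte elements. */
typedef struct {
    uint8_t b[16];
} vec128_u8;

/* Lane-by-lane add of two 16-byte vectors. Each lane is independent of the
   others; a carry out of one lane never reaches its neighbor. */
static vec128_u8 add_bytes(vec128_u8 x, vec128_u8 y) {
    vec128_u8 r;
    for (size_t i = 0; i < 16; i++) {
        r.b[i] = (uint8_t)(x.b[i] + y.b[i]); /* wraps modulo 256 (assumed) */
    }
    return r;
}
```

A real 128-bit media instruction would process all 16 lanes at once; the loop only spells out what each lane computes.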
4.2.2 Integer Vector Operations
Most of the 128-bit media arithmetic instructions perform parallel operations on pairs of vectors. Vector operations are also called packed or SIMD (single-instruction, multiple-data) operations. They take vector operands consisting of multiple elements, and all elements are operated on in parallel. Figure 4-1 shows an example of parallel operations on pairs of 16 byte-sized integers in the source operands. The result of the operation replaces the first source operand. There are also instructions that operate on vectors of words, doublewords, or quadwords.

Figure 4-1. Parallel Operations on Vectors of Integer Elements

4.2.3 Floating-Point Vector Operations
There are almost as many 128-bit floating-point instructions as integer instructions. Figure 4-2 shows an example of parallel operations on vectors containing four 32-bit single-precision floating-point values. There are also instructions that operate on vectors containing two 64-bit double-precision floating-point values.

Figure 4-2. Parallel Operations on Vectors of Floating-Point Elements

Integer and floating-point instructions can be freely intermixed in the same procedure. The floating-point instructions allow media applications such as 3D graphics to accelerate geometry, clipping, and lighting calculations. Pixel data are typically integer-based, although both integer and floating-point instructions are often required to operate completely on the data.
For example, software can change the viewing perspective of a 3D scene through transformation matrices by using floating-point instructions in the same procedure that contains integer operations on other aspects of the graphics data.

It is typically much easier to write 128-bit media programs using floating-point instructions. Such programs perform better than x87 floating-point programs, because the XMM register file is flat rather than stack-oriented, there are twice as many registers (in 64-bit mode), and 128-bit media instructions can operate on two or four times the number of floating-point operands as can x87 instructions. This ability to operate in parallel on multiple pairs of floating-point elements often makes it possible to remove local temporary variables that would otherwise be needed in x87 floating-point code.

4.2.4 Data Conversion and Reordering
There are instructions that support data conversion of vector elements, including conversions between integer and floating-point data types--located in XMM registers, MMX™ registers, GPR registers, or memory--and conversions of element-ordering or precision.

For example, the unpack instructions take two vector operands and interleave their low or high elements. Figure 4-3 shows an unpack and interleave operation on word-sized elements (PUNPCKLWD). If the left-hand source operand has elements whose value is zero, the operation converts each element in the low half of the right-hand operand to a data type of twice its original precision--useful, for example, in multiply operations in which results may overflow or underflow.

Figure 4-3. Unpack and Interleave Operation

There are also pack instructions, such as PACKSSDW shown in Figure 4-4 on page 132, that convert each element in a pair of vectors to lower precision by selecting the elements in the low half of each vector. Vector-shift instructions are also supported. They can scale each element in a vector to higher or lower values.

Figure 4-4. Pack Operation

Figure 4-5 shows one of many types of shuffle operation (PSHUFD). Here, the second operand is a vector containing doubleword elements, and an immediate byte provides shuffle control for up to 256 permutations of the elements. Shuffles are useful, for example, in color imaging when computing alpha saturation of RGB values. In this case, a shuffle instruction can replicate an alpha value in a register so that parallel comparisons with three RGB values can be performed.

Figure 4-5. Shuffle Operation

There is an instruction that inserts a single word from a general-purpose register or memory into an XMM register, at a specified location, leaving the other words in the XMM register unmodified.

4.2.5 Block Operations
Move instructions--along with unpack instructions--are among the most frequently used instructions in 128-bit media procedures. Figure 4-6 on page 134 shows the combined set of move operations supported by the integer and floating-point move instructions. These instructions provide a fast way to copy large amounts of data between registers or between registers and memory. They support block copies and sequential processing of contiguous data.
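Returning briefly to the unpack operation of Section 4.2.4, the following C sketch models a low-word interleave and shows why pairing data with an all-zero operand widens each 16-bit element to 32 bits. The function names are hypothetical, and the operand naming is an assumption of this sketch: the words of `a` are taken to land in the low half of each result doubleword and the words of `b` in the high half, so the zero operand must be `b` for a zero-extension.

```c
#include <stdint.h>

/* Interleave the four low words of a and b.
   Result word order, low to high: a0, b0, a1, b1, a2, b2, a3, b3.
   out must not alias a or b. */
static void unpack_low_words(const uint16_t a[8], const uint16_t b[8],
                             uint16_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a[i]; /* low half of doubleword i  */
        out[2 * i + 1] = b[i]; /* high half of doubleword i */
    }
}

/* Read doubleword i of an 8-word vector (word 2i is bits 15-0). */
static uint32_t dword_at(const uint16_t v[8], int i) {
    return (uint32_t)v[2 * i] | ((uint32_t)v[2 * i + 1] << 16);
}
```

With `b` all zeros, `dword_at(out, i)` equals `a[i]` for i = 0 to 3, which is the doubled-precision form that leaves headroom for a following multiply or accumulate.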
When moving between XMM registers, or between an XMM register and memory, each integer move instruction can copy up to 16 bytes of data. When moving between an XMM register and an MMX or GPR register, an integer move instruction can move 8 bytes of data. The floating-point move instructions can copy vectors of four single-precision or two double-precision floating-point operands in parallel.

Streaming-store versions of the move instructions permit bypassing the cache when storing data that is accessed only once. This maximizes memory-bus utilization and minimizes cache pollution. There is also a streaming-store integer move-mask instruction that stores bytes from one vector, as selected by mask values in a second vector. Figure 4-7 on page 135 shows the MASKMOVDQU operation. It can be used, for example, to handle end cases in block copies and block fills based on streaming stores.

Figure 4-6. Move Operations

Figure 4-7. Move Mask Operation

4.2.6 Matrix and Special Arithmetic Operations
The instruction set provides a broad assortment of vector add, subtract, multiply, divide, and square-root operations for use on matrices and other data structures common to media and scientific applications. It also provides special arithmetic operations including multiply-add, average, sum-of-absolute differences, reciprocal square-root, and reciprocal estimation.
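Of the special arithmetic operations just listed, the unsigned average is the simplest to model. The C sketch below assumes the average rounds up on a tie, that is, it computes (a + b + 1) >> 1 per element, which is how the packed byte-average instruction is normally specified; check the instruction reference in Volume 4 before relying on that detail. The function names are hypothetical.

```c
#include <stdint.h>

/* Rounded average of two unsigned bytes. The sum is formed in a wider type
   so that the intermediate value (up to 511) cannot overflow. */
static uint8_t avg_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Apply the average independently to each of 16 byte lanes. */
static void avg_bytes(const uint8_t a[16], const uint8_t b[16],
                      uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        out[i] = avg_u8(a[i], b[i]);
    }
}
```

Written with general-purpose instructions, each lane would need a widen, an add, an increment, and a shift, which is the register and instruction overhead that a single packed average removes.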
Media applications often multiply and accumulate vector and matrix data. In 3D-graphics geometry, for example, objects are typically represented by triangles, each of whose vertices are located in 3D space by a matrix of coordinate values, and matrix transforms are performed to simulate object movement.

128-bit media integer and floating-point instructions can perform several types of matrix-vector or matrix-matrix operations, such as addition, subtraction, multiplication, and accumulation, to effect 3D transforms of vertices. Efficient matrix multiplication is further supported with instructions that can first transpose the elements of matrix rows and columns. These transpositions can make subsequent accesses to memory or cache more efficient when performing arithmetic matrix operations.

Figure 4-8 shows a vector multiply-add instruction (PMADDWD) that multiplies vectors of 16-bit integer elements to yield intermediate results of 32-bit elements, which are then summed pair-wise to yield four 32-bit elements. This operation can be used with one source operand (for example, a coefficient) taken from memory and the other source operand (for example, the data to be multiplied by that coefficient) taken from an XMM register. It can also be used together with a vector-add operation to accumulate dot product results (also called inner or scalar products), which are used in many media algorithms such as those required for finite impulse response (FIR) filters, one of the commonly used DSP algorithms.

Figure 4-8. Multiply-Add Operation

There is also a sum-of-absolute-differences instruction (PSADBW), shown in Figure 4-9 on page 137. This is useful, for example, in computing motion-estimation algorithms for video compression.
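The multiply-add and sum-of-absolute-differences operations described above can be sketched as scalar C models. The function names are hypothetical. The sketch also assumes, based on how these instructions are normally specified, that the 128-bit sum-of-absolute-differences produces one 16-bit sum per 64-bit half of the operands; treat that as an assumption to verify against Volume 4.

```c
#include <stdint.h>
#include <stdlib.h>

/* Multiply-add model: eight signed 16x16 products, adjacent pairs summed
   into four 32-bit results. */
static void multiply_add_words(const int16_t a[8], const int16_t b[8],
                               int32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int64_t lo = (int64_t)a[2 * i] * b[2 * i];
        int64_t hi = (int64_t)a[2 * i + 1] * b[2 * i + 1];
        /* Keep the low 32 bits. Only the case where all four inputs are
           -32768 overflows; the cast then wraps on mainstream compilers. */
        out[i] = (int32_t)(uint32_t)(lo + hi);
    }
}

/* Dot product of two 8-element vectors: one multiply-add followed by an
   accumulation of the four partial sums. */
static int64_t dot_product(const int16_t a[8], const int16_t b[8]) {
    int32_t partial[4];
    multiply_add_words(a, b, partial);
    return (int64_t)partial[0] + partial[1] + partial[2] + partial[3];
}

/* Sum-of-absolute-differences model: out[0] covers bytes 0-7 and out[1]
   covers bytes 8-15 (assumed per-half behavior). Each sum is at most 2040. */
static void sum_abs_diff_bytes(const uint8_t x[16], const uint8_t y[16],
                               uint16_t out[2]) {
    for (int half = 0; half < 2; half++) {
        unsigned sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += (unsigned)abs((int)x[8 * half + i] - (int)y[8 * half + i]);
        }
        out[half] = (uint16_t)sum;
    }
}
```

In an FIR filter, `a` would hold coefficients and `b` a window of samples; in motion estimation, `x` and `y` would be rows of two candidate pixel blocks, and a smaller sum indicates a closer match.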
Figure 4-9. Sum-of-Absolute-Differences Operation

There is an instruction for computing the average of unsigned bytes or words. The instruction is useful for MPEG decoding, in which motion compensation involves many byte-averaging operations between and within macroblocks. In addition to speeding up these operations, the instruction also frees up registers and makes it possible to unroll the averaging loops.

Some of the arithmetic and pack instructions produce vector results in which each element saturates independently of the other elements in the result vector. Such results are clamped (limited) to the maximum or minimum value representable by the destination data type when the true result exceeds that maximum or minimum representable value. Saturating data is useful for representing physical-world data, such as sound and color. It is used, for example, when combining values for pixel coloring.

4.2.7 Branch Removal
Branching is a time-consuming operation that, unlike most 128-bit media vector operations, does not exhibit parallel behavior (there is only one branch target, not multiple targets, per branch instruction). In many media applications, a branch involves selecting between only a few (often only two) cases. Such branches can be replaced with 128-bit media vector compare and vector logical instructions that simulate predicated execution or conditional moves.

Figure 4-10 shows an example of a non-branching sequence that implements a two-way multiplexer--one that is equivalent to the ternary operator "?:" in C and C++.
The comparable code sequence is explained in "Compare and Write Mask" on page 183.

The sequence in Figure 4-10 begins with a vector compare instruction that compares the elements of two source operands in parallel and produces a mask vector containing elements of all 1s or 0s. This mask vector is ANDed with one source operand and ANDed-Not with the other source operand to isolate the desired elements of both operands. These results are then ORed to select the relevant elements from each operand. A similar branch-removal operation can be done using floating-point source operands.

operand 1                 a7   a6   a5   a4   a3   a2   a1   a0
operand 2                 b7   b6   b5   b4   b3   b2   b1   b0
Compare and Write Mask    FFFF 0000 0000 FFFF FFFF 0000 0000 FFFF
And (operand 1, mask)     a7   0000 0000 a4   a3   0000 0000 a0
And-Not (mask, operand 2) 0000 b6   b5   0000 0000 b2   b1   0000
Or                        a7   b6   b5   a4   a3   b2   b1   a0

Figure 4-10. Branch-Removal Sequence

The min/max compare instructions, for example, are useful for clamping, such as color clamping in 3D graphics, without the need for branching. Figure 4-11 illustrates a move-mask instruction (PMOVMSKB) that copies sign bits to a general-purpose register (GPR). The instruction can extract bits from mask patterns, or zero values from quantized data, or sign bits--resulting in a 16-bit mask that can be used for data-dependent branching.

Figure 4-11. Move Mask Operation (the most-significant bit of each of the 16 bytes in the XMM register is concatenated into the GPR)

4.3 Registers
Operands for most 128-bit media instructions are located in XMM registers or memory. Operation of the 128-bit media instructions is supported by the MXCSR control and status register. A few 128-bit media instructions--those that perform data conversion or move operations--can have operands located in MMX™ registers or general-purpose registers (GPRs).
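Before the register details, the branch-removal sequence of Figure 4-10 and the move-mask operation of Figure 4-11 can be sketched as scalar C models. The function names are hypothetical, and the choice of a signed greater-than compare is an assumption made for the example; the figure does not say which comparison produced its mask.

```c
#include <stdint.h>

/* Compare step: each mask element is all 1s (FFFFh) where a > b, else all
   0s. The ?: here only models the result of a hardware compare. */
static void compare_gt_words(const int16_t a[8], const int16_t b[8],
                             uint16_t mask[8]) {
    for (int i = 0; i < 8; i++) {
        mask[i] = (a[i] > b[i]) ? 0xFFFFu : 0x0000u;
    }
}

/* Select step: (a AND mask) OR (b AND NOT mask), with no per-element
   branch. Equivalent to out[i] = mask[i] ? a[i] : b[i]. */
static void select_words(const uint16_t mask[8], const int16_t a[8],
                         const int16_t b[8], int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        uint16_t from_a = (uint16_t)((uint16_t)a[i] & mask[i]);
        uint16_t from_b = (uint16_t)((uint16_t)b[i] & (uint16_t)~mask[i]);
        out[i] = (int16_t)(from_a | from_b);
    }
}

/* Move-mask model: gather the most-significant bit of each of 16 bytes
   into a 16-bit value, byte 0 in bit 0. */
static uint16_t move_mask_bytes(const uint8_t v[16]) {
    uint16_t m = 0;
    for (int i = 0; i < 16; i++) {
        m |= (uint16_t)((v[i] >> 7) << i);
    }
    return m;
}
```

With a greater-than mask, the select step yields the element-wise maximum of `a` and `b`. A caller can test the move-mask result against zero to skip work when no element matched, which is the data-dependent branching mentioned above.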
4.3.1 XMM Registers
Sixteen 128-bit XMM data registers, xmm0-xmm15, support the 128-bit media instructions. Figure 4-12 on page 140 shows these registers. They can hold operands for both vector and scalar operations with integer and floating-point data types. The high eight XMM registers, xmm8-xmm15, are available to software running in 64-bit mode for instructions that use a REX prefix (see "REX Prefixes" on page 89).

XMM Data Registers (bits 127-0)
  xmm0-xmm7     Available in all modes
  xmm8-xmm15    Available only in 64-bit mode
128-Bit Media Control and Status Register (bits 31-0)
  MXCSR

Figure 4-12. 128-Bit Media Registers

Upon power-on reset, all 16 XMM registers are cleared to +0.0. However, initialization by means of the #INIT external input signal does not change the state of the XMM registers.

4.3.2 MXCSR Register
Figure 4-13 on page 141 shows a detailed view of the 128-bit media-instruction control and status register (MXCSR). All bits in this register are read/write. The fields within the MXCSR apply only to operations performed by 128-bit media instructions. Software can load the register from memory using the FXRSTOR or LDMXCSR instructions, and it can store the register to memory using the FXSAVE or STMXCSR instructions.
Bits    Symbol  Description
31-16           Reserved
15      FZ      Flush-to-Zero for Masked Underflow
14-13   RC      Floating-Point Rounding Control
                Exception Masks
12      PM      Precision Exception Mask
11      UM      Underflow Exception Mask
10      OM      Overflow Exception Mask
9       ZM      Zero-Divide Exception Mask
8       DM      Denormalized-Operand Exception Mask
7       IM      Invalid-Operation Exception Mask
6       DAZ     Denormals Are Zeros
                Exception Flags
5       PE      Precision Exception
4       UE      Underflow Exception
3       OE      Overflow Exception
2       ZE      Zero-Divide Exception
1       DE      Denormalized-Operand Exception
0       IE      Invalid-Operation Exception

Figure 4-13. 128-Bit Media Control and Status Register (MXCSR)

Upon power-on reset, the low 16 bits of the MXCSR are initialized with the value 1F80h, the bit values of which are shown in Table 4-1 on page 142. However, initialization by means of the #INIT external input signal does not change the state of the XMM registers.

Table 4-1. MXCSR Register Reset Values

Field  Bit    Description                           Reset Bit-Value
IE     0      Invalid-Operation Exception           0
DE     1      Denormalized-Operand Exception        0
ZE     2      Zero-Divide Exception                 0
OE     3      Overflow Exception                    0
UE     4      Underflow Exception                   0
PE     5      Precision Exception                   0
DAZ    6      Denormals are Zeros                   0
IM     7      Invalid-Operation Exception Mask      1
DM     8      Denormalized-Operand Exception Mask   1
ZM     9      Zero-Divide Exception Mask            1
OM     10     Overflow Exception Mask               1
UM     11     Underflow Exception Mask              1
PM     12     Precision Exception Mask              1
RC     14-13  Floating-Point Rounding Control       00
FZ     15     Flush-to-Zero for Masked Underflow    0

The bits in the MXCSR register are defined immediately below, starting with bit 0. The six exception flags (IE, DE, ZE, OE, UE, PE) are sticky bits. Once set by the processor, such a bit remains set until software clears it.
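The field layout in Table 4-1 can be checked against the reset value with a small decoder. The following C sketch is illustrative only: the structure and function names are hypothetical, and it operates on a plain 32-bit value rather than reading the register, which on real hardware would require STMXCSR or FXSAVE.

```c
#include <stdint.h>

/* Hypothetical decoded view of the low 16 bits of an MXCSR image. */
typedef struct {
    unsigned flags; /* bits 5-0:   PE UE OE ZE DE IE (sticky flags)  */
    unsigned daz;   /* bit 6:      Denormals Are Zeros               */
    unsigned masks; /* bits 12-7:  PM UM OM ZM DM IM                 */
    unsigned rc;    /* bits 14-13: rounding control                  */
    unsigned fz;    /* bit 15:     Flush-to-Zero                     */
} mxcsr_fields;

/* Split an MXCSR image into its fields using the bit positions of
   Table 4-1. */
static mxcsr_fields decode_mxcsr(uint32_t value) {
    mxcsr_fields f;
    f.flags = value & 0x3Fu;
    f.daz   = (value >> 6) & 0x1u;
    f.masks = (value >> 7) & 0x3Fu;
    f.rc    = (value >> 13) & 0x3u;
    f.fz    = (value >> 15) & 0x1u;
    return f;
}
```

Decoding 1F80h gives `masks` equal to 3Fh with every other field zero: all six exceptions masked, no flags set, round-to-nearest selected, and both DAZ and FZ disabled, in agreement with the reset values in the table.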
For details about the causes of SIMD floating-point exceptions indicated by bits 5-0, see "SIMD Floating-Point Exception Causes" on page 211. For details about the masking of these exceptions, see "SIMD Floating-Point Exception Masking" on page 218.

Invalid-Operation Exception (IE). Bit 0. The processor sets this bit to 1 when an invalid-operation exception occurs. These exceptions are caused by many types of errors, such as an invalid operand.

Denormalized-Operand Exception (DE). Bit 1. The processor sets this bit to 1 when one of the source operands of an instruction is in denormalized form, except that if software has set the denormals are zeros (DAZ) bit, the processor does not set the DE bit. (See "Denormalized (Tiny) Numbers" on page 154.)

Zero-Divide Exception (ZE). Bit 2. The processor sets this bit to 1 when a non-zero number is divided by zero.

Overflow Exception (OE). Bit 3. The processor sets this bit to 1 when the absolute value of a rounded result is larger than the largest representable normalized floating-point number for the destination format. (See "Normalized Numbers" on page 153.)

Underflow Exception (UE). Bit 4. The processor sets this bit to 1 when the absolute value of a rounded non-zero result is too small to be represented as a normalized floating-point number for the destination format. (See "Normalized Numbers" on page 153.)

The underflow exception has an unusual behavior. When masked by the UM bit (bit 11), the processor only reports a UE exception if the UE occurs together with a precision exception (PE). Also, see bit 15, the flush-to-zero (FZ) bit.

Precision Exception (PE). Bit 5. The processor sets this bit to 1 when a floating-point result, after rounding, differs from the infinitely precise result and thus cannot be represented exactly in the specified destination format.
The PE exception is also called the inexact-result exception.

Denormals Are Zeros (DAZ). Bit 6. Software can set this bit to 1 to enable the DAZ mode, if the hardware implementation supports this mode. In the DAZ mode, when the processor encounters source operands in the denormalized format it converts them to signed zero values, with the sign of the denormalized source operand, before operating on them, and the processor does not set the denormalized-operand exception (DE) bit, regardless of whether such exceptions are masked or unmasked. Support for the DAZ bit is indicated by the MXCSR Mask field in the FXSAVE memory image, as described in "Saving Media and x87 Processor State" in Volume 2. The DAZ mode does not comply with the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754).

Exception Masks (PM, UM, OM, ZM, DM, IM). Bits 12-7. Software can set these bits to mask, or clear these bits to unmask, the corresponding six types of SIMD floating-point exceptions (PE, UE, OE, ZE, DE, IE). A bit masks its exception type when set to 1, and unmasks it when cleared to 0.

In general, masking a type of exception causes the processor to handle all subsequent instances of the exception type in a default way (the UE exception has an unusual behavior). Unmasking the exception type causes the processor to branch to the SIMD floating-point exception service routine when an exception occurs. For details about the processor's responses to masked and unmasked exceptions, see "SIMD Floating-Point Exception Masking" on page 218.

Floating-Point Rounding Control (RC). Bits 14-13. Software uses these bits to specify the rounding method for 128-bit media floating-point operations. The choices are:
00 = round to nearest (default)
01 = round down
10 = round up
11 = round toward zero
For details, see "Floating-Point Rounding" on page 158.
Flush-to-Zero for Masked Underflow (FZ). Bit 15. Setting this bit to 1 causes the processor to set the UE and PE flags and return a zero result, with the sign of the true result, if an underflow occurs while the underflow mask (UM) bit is set to 1. This response does not comply with the IEEE 754 standard, but it may offer higher performance than can be achieved by responding to an underflow in this circumstance.

The FZ bit is only effective if the UM bit is set to 1. If the UM bit is cleared to 0, the FZ bit is ignored. For details, see Table 4-15 on page 219.

4.3.3 Other Data Registers
Some 128-bit media instructions that perform data transfer, data conversion or data reordering operations ("Data Transfer" on page 162, "Data Conversion" on page 166, and "Data Reordering" on page 168) can access operands in the MMX or general-purpose registers (GPRs). When addressing GPR registers in 64-bit mode, the REX instruction prefix can be used to access the extended GPRs, as described in "REX Prefixes" on page 89.

For a description of the GPR registers, see "Registers" on page 27. For a description of the MMX registers, see "MMX™ Registers" on page 237.

4.3.4 rFLAGS Register
The COMISS, COMISD, UCOMISS, and UCOMISD instructions, described in "Compare" on page 202, write flag bits in the rFLAGS register. For a description of the rFLAGS register, see "Flags Register" on page 37.

4.4 Operands
Operands for a 128-bit media instruction are either referenced by the instruction's opcode or included as an immediate value in the instruction encoding. Depending on the instruction, referenced operands can be located in registers or memory. The data types of these operands include vector and scalar floating-point, and vector and scalar integer.

4.4.1 Data Types
Figure 4-14 on page 146 shows the register images of the 128-bit media data types.
These data types can be interpreted by instruction syntax and/or the software context as one of the following types of values:
- Vector (packed) single-precision (32-bit) floating-point numbers.
- Vector (packed) double-precision (64-bit) floating-point numbers.
- Vector (packed) signed (two's-complement) integers.
- Vector (packed) unsigned integers.
- Scalar signed (two's-complement) integers.
- Scalar unsigned integers.

Hardware does not check or enforce the data types for instructions. Software is responsible for ensuring that each operand for an instruction is of the correct data type. If data produced by a previous instruction is of a type different from that used by the current instruction, and the current instruction sources such data, the current instruction may incur a latency penalty, depending on the hardware implementation.

Figure 4-14. 128-Bit Media Data Types
  Vector (Packed) Floating-Point: Double Precision and Single Precision
  Vector (Packed) Signed Integer: Quadword, Doubleword, Word, Byte
  Vector (Packed) Unsigned Integer: Quadword, Doubleword, Word, Byte
  Scalar Floating-Point: Double Precision and Single Precision
  Scalar Unsigned Integers: Double Quadword, Quadword, Doubleword, Word, Byte, Bit

Software can interpret the data types in ways other than those shown in Figure 4-14--such as bit fields or fractional numbers--but the 128-bit media instructions do not directly support such interpretations and software must handle them entirely on its own.

4.4.2 Operand Sizes and Overrides
Operand sizes for 128-bit media instructions are determined by instruction opcodes. Some of these opcodes include an operand-size override prefix, but this prefix acts in a special way to modify the opcode and is considered an integral part of the opcode. The general use of the 66h operand-size override prefix described in "Instruction Prefixes" on page 85 does not apply to 128-bit media instructions.

For details on the use of operand-size override prefixes in 128-bit media instructions, see the opcodes in "128-Bit Media Instruction Reference" in Volume 4.

4.4.3 Operand Addressing
Depending on the 128-bit media instruction, referenced operands may be in registers or memory.

Register Operands. Most 128-bit media instructions can access source and destination operands in XMM registers. A few of these instructions access the MMX registers, GPR registers, rFLAGS register, or MXCSR register. The type of register addressed is specified in the instruction syntax. When addressing GPR or XMM registers, the REX instruction prefix can be used to access the extended GPR or XMM registers, as described in "Instruction Prefixes" on page 208.

Memory Operands. Most 128-bit media instructions can read memory for source operands, and some of the instructions can write results to memory. "Memory Addressing" on page 16 describes the general methods for addressing memory operands.

Immediate Operands.
Immediate operands are used in certain data-conversion, vector-shift, and vector-compare instructions. Such instructions take 8-bit immediates, which provide control for the operation. I/O Ports. I/O ports in the I/O address space cannot be directly addressed by 128-bit media instructions, and although memorymapped I/O ports can be addressed by such instructions, doing Chapter 4: 128-Bit Media and Scientific Programming 147 AMD 64-Bit Technology 24593--Rev. 3.09--September 2003 so may produce unpredictable results, depending on the hardware implementation of the architecture. 4.4.4 Data Alignment 128-bit media instructions that access a 128-bit operand in memory incur a general-protection exception (#GP) if the operand is not aligned to a 16-byte boundary, except for the following instructions: MASKMOVDQU--Masked Move Double Quadword Unaligned. MOVDQU--Move Unaligned Double Quadword. MOVUPD--Move Unaligned Packed Double-Precision Floating-Point. MOVUPS--Move Unaligned Packed Single-Precision Floating-Point. For other 128-bit media instructions, the architecture does not i m p o s e d a t a - a l i g n m e n t re q u i re m e n t s . H oweve r, t h e consequence of storing operands at unaligned locations is that accesses to those operands may require more processor and bus cycles than for aligned accesses. See "Data Alignment" on page 47 for details. 4.4.5 Integer Data Types The 128-bit media instructions that support operations on integer data types are summarized in "Instruction Summary-- Integer Instructions" on page 160. The characteristics of these data types are described below. Sign. Many of the 128-bit media instructions have variants for operating on signed or unsigned integers. For signed integers, the sign bit is the most-significant bit--bit 7 for a byte, bit 15 for a word, bit 31 for a doubleword, bit 63 for a quadword, or bit 127 for a double quadword. 
Arithmetic instructions that are not specifically named as unsigned perform signed two's-complement arithmetic.

Range of Representable Values. Table 4-2 on page 149 shows the range of representable values for the integer data types.

Table 4-2. Range of Values in 128-Bit Media Integer Data Types

Data Type | Unsigned integers, Base-2 (exact) | Unsigned integers, Base-10 (approx.) | Signed integers(1), Base-2 (exact) | Signed integers(1), Base-10 (approx.)
Byte | 0 to +2^8 - 1 | 0 to 255 | -2^7 to +(2^7 - 1) | -128 to +127
Word | 0 to +2^16 - 1 | 0 to 65,535 | -2^15 to +(2^15 - 1) | -32,768 to +32,767
Doubleword | 0 to +2^32 - 1 | 0 to 4.29 * 10^9 | -2^31 to +(2^31 - 1) | -2.14 * 10^9 to +2.14 * 10^9
Quadword | 0 to +2^64 - 1 | 0 to 1.84 * 10^19 | -2^63 to +(2^63 - 1) | -9.22 * 10^18 to +9.22 * 10^18
Double Quadword | 0 to +2^128 - 1 | 0 to 3.40 * 10^38 | -2^127 to +(2^127 - 1) | -1.70 * 10^38 to +1.70 * 10^38

Note:
1. The sign bit is the most-significant bit (bit 7 for a byte, bit 15 for a word, bit 31 for a doubleword, bit 63 for a quadword, bit 127 for a double quadword).

Saturation. Saturating (also called limiting or clamping) instructions limit the value of a result to the maximum or minimum value representable by the applicable data type. Saturating versions of integer vector-arithmetic instructions operate on byte-sized and word-sized elements. These instructions--for example, PACKx, PADDSx, PADDUSx, PSUBSx, and PSUBUSx--saturate signed or unsigned data at the vector-element level when the element reaches its maximum or minimum representable value. Saturation avoids overflow or underflow errors.

The examples in Table 4-3 on page 150 illustrate saturating and non-saturating results with word operands. Saturation for other data-type sizes follows similar rules. Once saturated, the saturated value is treated like any other value of its type. For example, if 0001h is subtracted from the saturated value, 7FFFh, the result is 7FFEh.
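The saturation rule for word elements can be sketched in Python as follows. This is an illustrative model only, not a description of the hardware: the function names `saturating_add_words` and `wrapping_add_words` are hypothetical, and each function models a single 16-bit element rather than a full 128-bit vector.

```python
def saturating_add_words(a, b, signed):
    """Add two 16-bit elements and clamp the sum to the word range.

    a and b are 16-bit patterns (0..0xFFFF). Hypothetical helper that
    models one vector element of a saturating add.
    """
    if signed:
        # Reinterpret each 16-bit pattern as a two's-complement value.
        a = a - 0x10000 if a & 0x8000 else a
        b = b - 0x10000 if b & 0x8000 else b
        lowest, highest = -0x8000, 0x7FFF
    else:
        lowest, highest = 0, 0xFFFF
    total = a + b                                  # infinitely precise sum
    clamped = max(lowest, min(highest, total))     # saturate to the range
    return clamped & 0xFFFF                        # back to a 16-bit pattern


def wrapping_add_words(a, b):
    """Non-saturating add: the carry out of bit 15 is simply discarded."""
    return (a + b) & 0xFFFF


for a, b in [(0x7000, 0x2000), (0xF000, 0xF000), (0x7FFF, 0xFF00)]:
    print(
        "%04Xh + %04Xh: signed %04Xh, unsigned %04Xh, wrapped %04Xh"
        % (a, b,
           saturating_add_words(a, b, True),
           saturating_add_words(a, b, False),
           wrapping_add_words(a, b))
    )
```

The three operand pairs in the loop are taken from the saturation examples with word operands, so the signed and unsigned columns can be compared against that table.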
Table 4-3. Saturation Examples

Operation | Non-Saturated Infinitely Precise Result | Saturated Signed Result | Saturated Unsigned Result
7000h + 2000h | 9000h | 7FFFh | 9000h
7000h + 7000h | E000h | 7FFFh | E000h
F000h + F000h | 1E000h | E000h | FFFFh
9000h + 9000h | 12000h | 8000h | FFFFh
7FFFh + 0100h | 80FFh | 7FFFh | 80FFh
7FFFh + FF00h | 17EFFh | 7EFFh | FFFFh

Arithmetic instructions not specifically designated as saturating perform non-saturating, two's-complement arithmetic.

Other Fixed-Point Operands. The architecture provides specific support only for integer fixed-point operands--those in which an implied binary point is always located to the right of bit 0. Nevertheless, software may use fixed-point operands in which the implied binary point is located in any position. In such cases, software is responsible for managing the interpretation of such implied binary points, as well as any redundant sign bits that may occur during multiplication.

4.4.6 Floating-Point Data Types

The 128-bit media floating-point instructions take vector or scalar operands, depending on the instruction. The vector instructions operate in parallel on up to four, or four pairs, of single-precision floating-point values or up to two, or two pairs, of double-precision floating-point values. The scalar instructions operate on only one, or one pair, of single-precision or double-precision operands.

Floating-Point Data Types. The floating-point data types, shown in Figure 4-15 on page 151, include 32-bit single precision and 64-bit double precision. Both formats are fully compatible with the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754). The 128-bit media instructions operate internally on floating-point data types in the precision specified by each instruction.
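As a rough illustration of how one 128-bit operand can hold either four single-precision or two double-precision values, the following Python sketch packs four floats into 16 bytes and then reinterprets the same bytes in other ways. It uses only the standard `struct` module; the variable names are illustrative, and little-endian byte order is assumed.

```python
import struct

# Four single-precision values packed into one 16-byte (128-bit) operand.
operand = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

# The same 16 bytes viewed under three different interpretations.
as_singles = struct.unpack("<4f", operand)   # four 32-bit floats
as_doubles = struct.unpack("<2d", operand)   # two 64-bit floats
as_dwords = struct.unpack("<4I", operand)    # four unsigned doublewords

print(len(operand), "bytes")
print("singles:", as_singles)
print("doubles:", as_doubles)
print("dwords: ", ["%08Xh" % d for d in as_dwords])
```

Nothing in the byte string records which interpretation is intended, which mirrors the statement that hardware does not check or enforce data types and that software must supply operands of the correct type.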
[Figure 4-15. 128-Bit Media Floating-Point Data Types. Single precision: sign bit (S) in bit 31, biased exponent in bits 30-23, significand (also fraction) in bits 22-0. Double precision: sign bit (S) in bit 63, biased exponent in bits 62-52, significand (also fraction) in bits 51-0.]

Both of the floating-point data types consist of a sign (0 = positive, 1 = negative), a biased exponent (base-2), and a significand, which represents the integer and fractional parts of the number. The integer bit (also called the J bit) is implied (called a hidden integer bit). The value of an implied integer bit can be inferred from number encodings, as described in "Floating-Point Number Encodings" on page 156. The bias of the exponent is a constant which makes the exponent always positive and allows reciprocation, without overflow, of the smallest normalized number representable by that data type.

Specifically, the data types are formatted as follows:

Single-Precision Format--This format includes a 1-bit sign, an 8-bit biased exponent with a bias of 127, and a 23-bit significand. The integer bit is implied, making a total of 24 bits in the significand.
Double-Precision Format--This format includes a 1-bit sign, an 11-bit biased exponent with a bias of 1023, and a 52-bit significand. The integer bit is implied, making a total of 53 bits in the significand.

Table 4-4 on page 152 shows the range of finite values representable by the two floating-point data types.

Table 4-4. Range of Values in Normalized Floating-Point Data Types

Data Type | Range of Normalized(1) Values, Base 2 (exact) | Range of Normalized(1) Values, Base 10 (approximate)
Single Precision | 2^-126 to 2^127 * (2 - 2^-23) | 1.17 * 10^-38 to +3.40 * 10^38
Double Precision | 2^-1022 to 2^1023 * (2 - 2^-52) | 2.23 * 10^-308 to +1.79 * 10^308

Note:
1. See "Floating-Point Number Representation" on page 153 for a definition of "normalized".

For example, in the single-precision format, the largest normal number representable has an exponent of FEh and a significand of 7FFFFFh, with a numerical value of 2^127 * (2 - 2^-23). Results that overflow above the maximum representable value return either the maximum representable normalized number (see "Normalized Numbers" on page 153) or infinity, with the sign of the true result, depending on the rounding mode specified in the rounding control (RC) field of the MXCSR register. Results that underflow below the minimum representable value return either the minimum representable normalized number or a denormalized number (see "Denormalized (Tiny) Numbers" on page 154), with the sign of the true result, or a result determined by the SIMD floating-point exception handler, depending on the rounding mode and the underflow-exception mask (UM) in the MXCSR register (see "Unmasked Responses" on page 221).

Compatibility with x87 Floating-Point Data Types. The results produced by 128-bit media floating-point instructions comply fully with the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754), because these instructions represent data in the single-precision or double-precision data types throughout their operations. The x87 floating-point instructions, however, by default perform operations in the double-extended-precision format. Because of this, x87 instructions operating on the same source operands as 128-bit media floating-point instructions may return results that are slightly different in their least-significant bits.

4.4.7 Floating-Point Number Representation

A 128-bit media floating-point value can be one of five types, as follows:

Normal
Denormal (Tiny)
Zero
Infinity
Not a Number (NaN)

In common engineering and scientific usage, floating-point numbers--also called real numbers--are represented in base (radix) 10. A non-zero number consists of a sign, a normalized significand, and a signed exponent, as in:

+2.71828 e0

Both large and small numbers are representable in this notation, subject to the limits of data-type precision. For example, a million in base-10 notation appears as +1.00000 e6 and -0.0000383 is represented as -3.83000 e-5. A non-zero number can always be written in normalized form--that is, with a leading non-zero digit immediately before the decimal point. Thus, a normalized significand in base-10 notation is a number in the range [1,10). The signed exponent specifies the number of positions that the decimal point is shifted.

Unlike the common engineering and scientific usage described above, 128-bit media floating-point numbers are represented in base (radix) 2. Like its base-10 counterpart, a normalized base-2 significand is written with its leading non-zero digit immediately to the left of the radix point. In base-2 arithmetic, a non-zero digit is always a one, so the range of a binary significand is [1,2):

+1.fraction exponent

The leading non-zero digit is called the integer bit. As shown in Figure 4-15, the integer bit is omitted (and called the hidden integer bit) in the single-precision and the double-precision floating-point formats, because its implied value is always 1 in a normalized significand (0 in a denormalized significand), and the omission allows an extra bit of precision.

The following sections describe the number representations.

Normalized Numbers. Normalized floating-point numbers are the most frequent operands for 128-bit media instructions.
These are finite, non-zero, positive or negative numbers in which the integer bit is 1, the biased exponent is non-zero and non-maximum, and the fraction is any representable value. Thus, the significand is within the range of [1, 2). Whenever possible, the processor represents a floating-point result as a normalized number.

Denormalized (Tiny) Numbers. Denormalized numbers (also called tiny numbers) are smaller than the smallest representable normalized numbers. They arise through an underflow condition, when the exponent of a result lies below the representable minimum exponent. These are finite, non-zero, positive or negative numbers in which the integer bit is 0, the biased exponent is 0, and the fraction is non-zero.

The processor generates a denormalized-operand exception (DE) when an instruction uses a denormalized source operand. The processor may generate an underflow exception (UE) when an instruction produces a rounded, non-zero result that is too small to be represented as a normalized floating-point number in the destination format, and thus is represented as a denormalized number. If a result, after rounding, is too small to be represented as the minimum denormalized number, it is represented as zero. (See "Exceptions" on page 209 for specific details.)

Denormalization may correct the exponent by placing leading zeros in the significand. This may cause a loss of precision, because the number of significant bits in the fraction is reduced by the leading zeros. In the single-precision floating-point format, for example, normalized numbers have biased exponents ranging from 1 to 254 (the unbiased exponent range is from -126 to +127). A true result with an exponent of, say, -130, undergoes denormalization by right-shifting the significand by the difference between the normalized exponent and the minimum exponent, as shown in Table 4-5.
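The right shift described above can be sketched in Python. The helpers `denormalize` and `to_fixed` are hypothetical names, the significand is held as an integer with 16 fraction bits purely for illustration, and rounding is not modeled: bits shifted out are only reported so that a loss of precision is visible.

```python
def denormalize(significand, exponent, min_exponent=-126):
    """Right-shift a significand until its exponent reaches min_exponent.

    significand is an integer holding the integer bit followed by the
    fraction bits. Returns (shifted significand, exponent, lost bits).
    """
    shift = min_exponent - exponent
    if shift <= 0:
        return significand, exponent, 0        # already representable
    lost = significand & ((1 << shift) - 1)    # bits shifted out
    return significand >> shift, min_exponent, lost


def to_fixed(significand, fraction_bits=16):
    """Format the integer as 'i.ffff...' with the given fraction width."""
    bits = format(significand, "0%db" % (fraction_bits + 1))
    return bits[0] + "." + bits[1:]


true_significand = int("10011010000000000", 2)   # 1.0011010000000000
shifted, exponent, lost = denormalize(true_significand, -130)
print(to_fixed(true_significand), "exponent -130")
print(to_fixed(shifted), "exponent", exponent, "lost bits:", lost)
```

With a true exponent of -130 the shift count is 4, so four leading zeros enter the significand and the exponent becomes -126. In this particular case the bits shifted out are all zero, so no precision is lost.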
Table 4-5. Example of Denormalization

Result Type | Exponent | Significand (base 2)
True result | -130 | 1.0011010000000000
Denormalized result | -126 | 0.0001001101000000

Zero. The floating-point zero is a finite, positive or negative number in which the integer bit is 0, the biased exponent is 0, and the fraction is 0. The sign of a zero result depends on the operation being performed and the selected rounding mode. It may indicate the direction from which an underflow occurred, or it may reflect the result of a division by +∞ or -∞.

Infinity. Infinity is a positive or negative number, +∞ and -∞, in which the integer bit is 1, the biased exponent is maximum, and the fraction is 0. The infinities are the maximum numbers that can be represented in floating-point format. Negative infinity is less than any finite number and positive infinity is greater than any finite number (i.e., the affine sense).

An infinite result is produced when a non-zero, non-infinite number is divided by 0 or multiplied by infinity, or when infinity is added to infinity or to 0. Arithmetic on infinities is exact. For example, adding any finite floating-point number to +∞ gives a result of +∞. Arithmetic comparisons work correctly on infinities. Exceptions occur only when the use of an infinity as a source operand constitutes an invalid operation.

Not a Number (NaN). NaNs are non-numbers, lying outside the range of representable floating-point values. The integer bit is 1, the biased exponent is maximum, and the fraction is non-zero. NaNs are of two types:

Signaling NaN (SNaN)
Quiet NaN (QNaN)

A QNaN is a NaN with the most-significant fraction bit set to 1, and an SNaN is a NaN with the most-significant fraction bit cleared to 0.
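The five value types, together with the SNaN and QNaN distinction, can be told apart from the bit pattern alone. The following Python sketch classifies a single-precision pattern using the field rules described above (8-bit biased exponent, 23-bit fraction). The function names `classify_single` and `float_to_bits` are hypothetical, and the sketch only illustrates the encoding rules, not any processor behavior.

```python
import struct


def classify_single(bits):
    """Classify a 32-bit single-precision bit pattern by its fields."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF        # 8-bit biased exponent
    fraction = bits & 0x7FFFFF            # 23-bit fraction
    if exponent == 0xFF:                  # maximum biased exponent
        if fraction == 0:
            kind = "infinity"
        elif fraction & 0x400000:         # most-significant fraction bit set
            kind = "QNaN"
        else:
            kind = "SNaN"
    elif exponent == 0:                   # integer bit is implied to be 0
        kind = "zero" if fraction == 0 else "denormal"
    else:
        kind = "normal"
    return ("-" if sign else "+") + kind


def float_to_bits(value):
    """Return the single-precision bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


for pattern in (0x3F800000, 0x00000001, 0x80000000,
                0x7F800000, 0x7F800001, 0x7FC00000):
    print("%08Xh" % pattern, classify_single(pattern))
```

The same approach applies to double precision with an 11-bit exponent field and a 52-bit fraction field.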
When the processor encounters an SNaN as a source operand for an instruction, an invalid-operation exception (IE) occurs and a QNaN is produced as the result, if the exception is masked. In general, when the processor encounters a QNaN as a source operand for an instruction, the processor does not generate an exception but generates a QNaN as the result.

The processor never generates an SNaN as a result of a floating-point operation. When an invalid-operation exception (IE) occurs due to an SNaN operand, the invalid-operation exception mask (IM) bit determines the processor's response, as described in "SIMD Floating-Point Exception Masking" on page 218.

When a floating-point operation or exception produces a QNaN result, its value is determined by the rules in Table 4-6 on page 156.

Table 4-6. NaN Results

Source Operands (in either order) | | NaN Result(1)
QNaN | Any non-NaN floating-point value, or single-operand instructions | Value of QNaN
SNaN | Any non-NaN floating-point value, or single-operand instructions | Value of SNaN converted to a QNaN(2)
QNaN | QNaN | Value of operand 1
QNaN | SNaN | Value of operand 1
SNaN | QNaN | Value of operand 1 converted to a QNaN(2)
SNaN | SNaN | Value of operand 1 converted to a QNaN(2)
Invalid-Operation Exception (IE) occurs without QNaN or SNaN source operands | | Floating-point indefinite value(3) (a special form of QNaN)

Notes:
1. The NaN result is produced when the floating-point invalid-operation exception is masked.
2. The conversion is done by changing the most-significant fraction bit to 1.
3. See "Indefinite Values" on page 157.

4.4.8 Floating-Point Number Encodings

Supported Encodings. Table 4-7 on page 157 shows the floating-point encodings of supported numbers and non-numbers. The number categories are ordered from large to small.
In this affine ordering, positive infinity is larger than any positive normalized number, which in turn is larger than any positive denormalized number, which is larger than positive zero, and so forth. Thus, the ordinary rules of comparison apply between categories as well as within categories, so that comparison of any two numbers is well-defined.

The actual exponent field length is 8 or 11 bits, and the fraction field length is 23 or 52 bits, depending on operand precision. The single-precision and double-precision formats do not include the integer bit in the significand (the value of the integer bit can be inferred from number encodings). Exponents of both types are encoded in biased format, with respective biasing constants of 127 and 1023.

Table 4-7. Supported Floating-Point Encodings

Group | Classification | Sign | Biased Exponent(1) | Significand(2)
Positive Non-Numbers | SNaN | 0 | 111 ... 111 | 1.011 ... 111 to 1.000 ... 001
Positive Non-Numbers | QNaN | 0 | 111 ... 111 | 1.111 ... 111 to 1.100 ... 000
Positive Floating-Point Numbers | Positive Infinity (+∞) | 0 | 111 ... 111 | 1.000 ... 000
Positive Floating-Point Numbers | Positive Normal | 0 | 111 ... 110 to 000 ... 001 | 1.111 ... 111 to 1.000 ... 000
Positive Floating-Point Numbers | Positive Denormal | 0 | 000 ... 000 | 0.111 ... 111 to 0.000 ... 001
Positive Floating-Point Numbers | Positive Zero | 0 | 000 ... 000 | 0.000 ... 000
Negative Floating-Point Numbers | Negative Zero | 1 | 000 ... 000 | 0.000 ... 000
Negative Floating-Point Numbers | Negative Denormal | 1 | 000 ... 000 | 0.000 ... 001 to 0.111 ... 111
Negative Floating-Point Numbers | Negative Normal | 1 | 000 ... 001 to 111 ... 110 | 1.000 ... 000 to 1.111 ... 111
Negative Floating-Point Numbers | Negative Infinity (-∞) | 1 | 111 ... 111 | 1.000 ... 000
Negative Non-Numbers | SNaN | 1 | 111 ... 111 | 1.000 ... 001 to 1.011 ... 111
Negative Non-Numbers | QNaN(3) | 1 | 111 ... 111 | 1.100 ... 000 to 1.111 ... 111

Notes:
1. The actual exponent field length is 8 or 11 bits, depending on operand precision.
2. The "1." and "0." prefixes represent the implicit integer bit. The actual fraction field length is 23 or 52 bits, depending on operand precision.
3. The floating-point indefinite value is a QNaN with a negative sign and a significand whose value is 1.100 ... 000.

Indefinite Values. Floating-point and integer data types each have a unique encoding that represents an indefinite value. The processor returns an indefinite value when a masked invalid-operation exception (IE) occurs.

For example, if a floating-point division operation is attempted using source operands which are both zero, and IE exceptions are masked, the floating-point indefinite value is returned as the result. Or, if a floating-point-to-integer data conversion overflows its destination integer data type, and IE exceptions are masked, the integer indefinite value is returned as the result.

Table 4-8 shows the encodings of the indefinite values for each data type. For floating-point numbers, the indefinite value is a special form of QNaN. For integers, the indefinite value is the largest-magnitude negative two's-complement number, 80...00h. (This value is interpreted as the largest-magnitude negative number, except when a masked IE exception occurs, in which case it is interpreted as an indefinite value.)

Table 4-8. Indefinite-Value Encodings

Data Type | Indefinite Encoding
Floating-Point | sign bit = 1; biased exponent = 111 ... 111; significand integer bit = 1; significand fraction = 100 ... 000
Integer | sign bit = 1; integer = 000 ... 000

4.4.9 Floating-Point Rounding

Bits 14-13 of the MXCSR control and status register ("MXCSR Register" on page 140) comprise the floating-point rounding control (RC) field, which specifies how the results of floating-point computations are rounded. Rounding modes apply to most arithmetic operations. When rounding occurs, the processor generates a precision exception (PE). Rounding is not applied to operations that produce NaN results.
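As a small illustration of the field position, the following Python sketch extracts and replaces the RC field in a 32-bit MXCSR image. The helper names are hypothetical, the mode names follow the RC encodings listed in Table 4-9, and the value 1F80h is assumed here to be the reset value of MXCSR; consult the MXCSR register description for the authoritative value. The sketch manipulates an integer only and does not touch the real register.

```python
# RC encodings as listed in Table 4-9 (names used here for illustration).
RC_MODES = {
    0b00: "round to nearest",
    0b01: "round down",
    0b10: "round up",
    0b11: "round toward zero",
}

RC_SHIFT = 13            # RC occupies bits 14-13
RC_MASK = 0b11 << RC_SHIFT


def rounding_control(mxcsr):
    """Return the 2-bit RC field of an MXCSR image."""
    return (mxcsr & RC_MASK) >> RC_SHIFT


def with_rounding_control(mxcsr, rc):
    """Return a copy of the MXCSR image with the RC field replaced."""
    return (mxcsr & ~RC_MASK) | ((rc & 0b11) << RC_SHIFT)


mxcsr = 0x1F80           # assumed reset value: RC = 00
print(RC_MODES[rounding_control(mxcsr)])
truncating = with_rounding_control(mxcsr, 0b11)
print("%04Xh" % truncating, RC_MODES[rounding_control(truncating)])
```

All other bits of the image are preserved, which matters because the same register also holds the exception masks and status flags.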
The IEEE 754 standard defines the four rounding modes as shown in Table 4-9 on page 159.

Table 4-9. Types of Rounding

RC Value | Mode | Type of Rounding
00 (default) | Round to nearest | The rounded result is the representable value closest to the infinitely precise result. If equally close, the even value (with least-significant bit 0) is taken.
01 | Round down | The rounded result is closest to, but no greater than, the infinitely precise result.
10 | Round up | The rounded result is closest to, but no less than, the infinitely precise result.
11 | Round toward zero | The rounded result is closest to, but no greater in absolute value than, the infinitely precise result.

Round to nearest is the default rounding mode. It provides a statistically unbiased estimate of the true result, and is suitable for most applications. The other rounding modes are directed roundings: round up (toward +∞), round down (toward -∞), and round toward zero. Round up and round down are used in interval arithmetic, in which upper and lower bounds bracket the true result of a computation. Round toward zero takes the smaller in magnitude, that is, always truncates.

The processor produces a floating-point result defined by the IEEE standard to be infinitely precise. This result may not be representable exactly in the destination format, because only a subset of the continuum of real numbers finds exact representation in any particular floating-point format. Rounding modifies such a result to conform to the destination format, thereby making the result inexact and also generating a precision exception (PE), as described in "SIMD Floating-Point Exception Causes" on page 211.
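The four rounding rules can be sketched on a sign-and-magnitude significand in Python. The function `round_magnitude` is a hypothetical helper: it drops a given number of low-order bits from an unsigned magnitude and decides whether to increment what remains. It deliberately ignores the carry that can propagate into the exponent and does not model exception flags.

```python
def round_magnitude(magnitude, extra_bits, negative, mode):
    """Drop extra_bits low-order bits from an unsigned magnitude.

    mode is one of "nearest", "down", "up", "zero". negative gives the
    sign of the value, which matters for the two directed modes.
    """
    kept = magnitude >> extra_bits
    remainder = magnitude & ((1 << extra_bits) - 1)
    if remainder == 0:
        return kept                       # exact: no rounding needed
    half = 1 << (extra_bits - 1)
    if mode == "nearest":
        # Round to nearest; on a tie pick the even value (LSB = 0).
        if remainder > half or (remainder == half and kept & 1):
            kept += 1
    elif mode == "down":                  # toward minus infinity
        if negative:
            kept += 1
    elif mode == "up":                    # toward plus infinity
        if not negative:
            kept += 1
    elif mode != "zero":                  # "zero" simply truncates
        raise ValueError("unknown rounding mode: %r" % mode)
    return kept


# 1011b with two bits to drop lies between 10b and 11b.
for mode in ("nearest", "down", "up", "zero"):
    print(mode,
          bin(round_magnitude(0b1011, 2, False, mode)),
          bin(round_magnitude(0b1011, 2, True, mode)))
```

For a negative value the magnitude grows when rounding down and shrinks when rounding up, which is why the two directed modes swap behavior with the sign.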
Suppose, for example, the following result, with a 24-bit fraction, is to be represented in single-precision format, where "E₂ 1010" represents the biased exponent:

1.0011 0101 0000 0001 0010 0111 E₂ 1010

This result has no exact representation, because the least-significant 1 does not fit into the single-precision format, which allows for only 23 bits of fraction. The rounding control field determines the direction of rounding. Rounding introduces an error in a result that is less than one unit in the last place (ulp), that is, the least-significant bit position of the floating-point representation.

4.5 Instruction Summary--Integer Instructions

This section summarizes functions of the integer instructions in the 128-bit media instruction subset. These include integer instructions that use an XMM register for source or destination and data-conversion instructions that convert from integers to floating-point formats. For a summary of the floating-point instructions in the 128-bit media instruction subset, including data-conversion instructions that convert from floating-point to integer formats, see "Instruction Summary--Floating-Point Instructions" on page 187.

The instructions are organized here by functional group--such as data-transfer, vector arithmetic, and so on. Software running at any privilege level can use any of these instructions, if the CPUID instruction reports support for the instructions (see "Feature Detection" on page 209). More detail on individual instructions is given in the alphabetically organized "128-Bit Media Instruction Reference" in Volume 4.

4.5.1 Syntax

Each instruction has a mnemonic syntax used by assemblers to specify the operation and the operands to be used for source and destination (result) data.
The majority of 128-bit media integer instructions have the following syntax:

MNEMONIC xmm1, xmm2/mem128

Figure 4-16 shows an example of the mnemonic syntax for a packed add bytes (PADDB) instruction.

[Figure 4-16. Mnemonic Syntax for Typical Instruction. In "PADDB xmm1, xmm2/mem128", PADDB is the mnemonic, xmm1 is the first source operand and destination operand, and xmm2/mem128 is the second source operand.]

This example shows the PADDB mnemonic followed by two operands, a 128-bit XMM register operand and another 128-bit XMM register or 128-bit memory operand. In most instructions that take two operands, the first (left-most) operand is both a source operand and the destination operand. The second (right-most) operand serves only as a source. Some instructions can have one or more prefixes that modify default properties, as described in "Instruction Prefixes" on page 208.

Mnemonics. The following characters are used as prefixes in the mnemonics of integer instructions:

CVT--Convert
CVTT--Convert with truncation
P--Packed (vector)
PACK--Pack elements of 2x data size to 1x data size
PUNPCK--Unpack and interleave elements
UNPCK--Unpack and interleave elements

In addition to the above prefix characters, the following characters are used elsewhere in the mnemonics of integer instructions:

B--Byte
D--Doubleword
DQ--Double quadword
H--High
L--Low, or Left
PD--Packed double-precision floating-point
PI--Packed integer
PS--Packed single-precision floating-point
Q--Quadword
R--Right
S--Signed, or Saturation, or Shift
SD--Scalar double-precision floating-point
SI--Signed integer
SS--Scalar single-precision floating-point, or Signed saturation
U--Unsigned, or Unordered, or Unaligned
US--Unsigned saturation
W--Word
x--One or more variable characters in the mnemonic

For example, the mnemonic for the instruction that packs sixteen words into sixteen unsigned bytes is PACKUSWB. In this mnemonic, the US designates an unsigned result with saturation, and the WB designates the source as words and the result as bytes.

4.5.2 Data Transfer

The data-transfer instructions copy operands between a memory location, an XMM register, an MMX register, or a GPR. The MOV mnemonic, which stands for move, is a misnomer. A copy function is actually performed instead of a move. A new copy of the source value is created at the destination address, and the original copy remains unchanged at its source location.

Move.

MOVD--Move Doubleword or Quadword
MOVQ--Move Quadword
MOVDQA--Move Aligned Double Quadword
MOVDQU--Move Unaligned Double Quadword
MOVDQ2Q--Move Quadword to Quadword
MOVQ2DQ--Move Quadword to Quadword

The MOVD instruction copies a 32-bit or 64-bit value from a GPR or memory location to the low-order 32 or 64 bits of an XMM register, or from the low-order 32 or 64 bits of an XMM register to a 32-bit or 64-bit GPR or memory location. If the source operand is a GPR or memory location, the source is zero-extended to 128 bits in the XMM register. If the source is an XMM register, only the low-order 32 or 64 bits of the source are copied to the destination.

The MOVQ instruction copies a 64-bit value from memory to the low quadword of an XMM register, or from the low quadword of an XMM register to memory, or between the low quadwords of two XMM registers. If the source is in memory and the destination is an XMM register, the source is zero-extended to 128 bits in the XMM register.

The MOVDQA instruction copies a 128-bit value from memory to an XMM register, or from an XMM register to memory, or between two XMM registers. If either the source or destination is a memory location, the memory address must be aligned.
The MOVDQU instruction does the same, except for unaligned operands.

The MOVDQ2Q instruction copies the low-order 64-bit value in an XMM register to an MMX register. The MOVQ2DQ instruction copies a 64-bit value from an MMX register to the low-order 64 bits of an XMM register, with zero-extension to 128 bits.

Figure 4-17 on page 164 shows the capabilities of the various integer move instructions. These instructions move large amounts of data. When copying between XMM registers, or between an XMM register and memory, a move instruction can copy up to 16 bytes of data. When copying between an XMM register and an MMX or GPR register, a move instruction can copy up to 8 bytes of data.

The MOVx instructions--along with the PUNPCKx instructions--are often among the most frequently used instructions in 128-bit media integer and floating-point procedures.

The move instructions are in many respects similar to the assignment operator in high-level languages. The simplest example of their use is for initializing variables. To initialize a register to 0, however, rather than using a MOVx instruction it may be more efficient to use the PXOR instruction with identical destination and source operands.

Move Non-Temporal. The move non-temporal instructions are streaming-store instructions. They minimize pollution of the cache.

MOVNTDQ--Move Non-Temporal Double Quadword
MASKMOVDQU--Masked Move Double Quadword Unaligned

The MOVNTDQ instruction stores its second operand (a 128-bit XMM register value) into its first operand (a 128-bit memory location). MOVNTDQ indicates to the processor that its data is non-temporal, which assumes that the referenced data will be used only once and is therefore not subject to cache-related overhead (as opposed to temporal data, which assumes that the data will be accessed again soon and should be cached).
The non-temporal instructions use weakly-ordered, write-combining buffering of write data, and they minimize cache pollution. The exact method by which cache pollution is minimized depends on the hardware implementation of the instruction. For further information, see "Memory Optimization" on page 113.

[Figure 4-17. Integer Move Operations. Shows MOVDQA, MOVDQU, and MOVQ between an XMM register and an XMM register or memory; MOVD between an XMM register and a GPR or memory; MOVDQ2Q from an XMM register to an MMX register; and MOVQ2DQ from an MMX register to an XMM register.]

MASKMOVDQU is also a non-temporal instruction. It stores bytes from the first operand, as selected by the mask value in the second operand (0 = no write and 1 = write), to a memory location specified in the rDI and DS registers. The first and second operands are both XMM registers. The address may be unaligned. The instruction is useful for the handling of end cases in block copies and block fills based on streaming stores. Figure 4-18 shows the MASKMOVDQU operation.

[Figure 4-18. MASKMOVDQU Move Mask Operation. Bytes of operand 1 selected by the mask in operand 2 are stored to memory at the store address given by rDI.]

Move Mask.

PMOVMSKB--Packed Move Mask Byte

The PMOVMSKB instruction moves the most-significant bit of each byte in an XMM register to the low-order word of a 32-bit or 64-bit general-purpose register, with zero-extension.
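The PMOVMSKB gathering step can be modeled in a few lines of Python. The function name `pmovmskb` is used here only as a label for the model; the 16 register bytes are passed as a `bytes` object with index 0 holding bits 7-0 of the XMM register, and bit i of the result is taken from byte i.

```python
def pmovmskb(xmm_bytes):
    """Collect the most-significant bit of each of 16 bytes into a mask.

    xmm_bytes[0] models the least-significant byte of the XMM register.
    The result fits in 16 bits; all higher bits are zero.
    """
    if len(xmm_bytes) != 16:
        raise ValueError("expected exactly 16 bytes")
    mask = 0
    for index, byte in enumerate(xmm_bytes):
        mask |= (byte >> 7) << index      # bit 7 of byte i -> bit i
    return mask


# Signed bytes: negative values have their most-significant bit set.
values = bytes([0x01, 0x80, 0x7F, 0xFF] + [0x00] * 12)
print("%04Xh" % pmovmskb(values))
```

A zero result means that no byte had its most-significant bit set, so a single compare of the mask against zero can drive a data-dependent branch.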
PMOVMSKB is useful for extracting bits from mask patterns, or zero values from quantized data, or sign bits--resulting in a mask word that can be used for data-dependent branching. Figure 4-19 on page 166 shows the PMOVMSKB operation.

[Figure 4-19. PMOVMSKB Move Mask Operation]

4.5.3 Data Conversion

The integer data-conversion instructions convert integer operands to floating-point operands. These instructions take 128-bit integer source operands. For data-conversion instructions that take 128-bit floating-point source operands, see "Data Conversion" on page 192. For data-conversion instructions that take 64-bit source operands, see "Data Conversion" on page 250 and "Data Conversion" on page 266.

Convert Integer to Floating-Point. These instructions convert integer data types in XMM registers or memory into floating-point data types in XMM registers.

CVTDQ2PS--Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point
CVTDQ2PD--Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point

The CVTDQ2PS instruction converts four 32-bit signed integer values in the second operand to four single-precision floating-point values and writes the converted values in another XMM register. If the result of the conversion is an inexact value, the value is rounded. The CVTDQ2PD instruction is analogous to CVTDQ2PS except that it converts two 32-bit signed integer values to two double-precision floating-point values.

Convert MMX Integer to Floating-Point. These instructions convert integer data types in MMX registers or memory into floating-point data types in XMM registers.

CVTPI2PS--Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point
CVTPI2PD--Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point

The CVTPI2PS instruction converts two 32-bit signed integer values in an MMX register or a 64-bit memory location to two single-precision floating-point values and writes the converted values in the low-order 64 bits of an XMM register. The high-order 64 bits of the XMM register are not modified.

The CVTPI2PD instruction is analogous to CVTPI2PS except that it converts two 32-bit signed integer values to two double-precision floating-point values and writes the converted values in the full 128 bits of an XMM register.

Before executing a CVTPI2x instruction, software should ensure that the MMX registers are properly initialized so as to prevent conflict with their aliased use by x87 floating-point instructions. This may require clearing the MMX state, as described in "Accessing Operands in MMX™ Registers" on page 223.

For a description of 128-bit media instructions that convert in the opposite direction--floating-point to integer in MMX registers--see "Convert Floating-Point to MMX™ Integer" on page 194. For a summary of instructions that operate on MMX registers, see Chapter 5, "64-Bit Media Programming."

Convert GPR Integer to Floating-Point. These instructions convert integer data types in GPR registers or memory into floating-point data types in XMM registers.

CVTSI2SS--Convert Signed Doubleword or Quadword Integer to Scalar Single-Precision Floating-Point
CVTSI2SD--Convert Signed Doubleword or Quadword Integer to Scalar Double-Precision Floating-Point

The CVTSI2SS instruction converts a 32-bit or 64-bit signed integer value in a general-purpose register or memory location to a single-precision floating-point value and writes the converted value in the low-order 32 bits of an XMM register. The three high-order doublewords in the destination XMM register are not modified.
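Integer-to-floating-point conversion is inexact whenever the integer has more significant bits than the destination format can hold. The following scalar sketch illustrates this for a 32-bit source. It assumes the default round-to-nearest mode and a C implementation whose integer-to-float casts follow IEEE 754; the function names are illustrative and are not part of the architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for the value CVTSI2SS produces from a 32-bit source.
 * A single-precision value has a 24-bit significand, so magnitudes above
 * 2^24 may be rounded (round-to-nearest is assumed here). */
float cvtsi2ss_model(int32_t src)
{
    return (float)src;
}

/* Scalar stand-in for CVTSI2SD from a 32-bit source. A double-precision
 * value has a 53-bit significand, so every 32-bit integer is exact. */
double cvtsi2sd_model(int32_t src)
{
    double result = (double)src;
    assert((int32_t)result == src);   /* the conversion loses nothing */
    return result;
}
```

For example, 16777217 (2^24 + 1) has no exact single-precision representation, so the single-precision model returns a neighbouring representable value, while the double-precision model returns the integer unchanged.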
The CVTSI2SD instruction converts a 32-bit or 64-bit signed integer value in a general-purpose register or memory location to a double-precision floating-point value and writes the converted value in the low-order 64 bits of an XMM register. The high-order 64 bits in the destination XMM register are not modified.

4.5.4 Data Reordering

The integer data-reordering instructions pack, unpack, interleave, extract, insert, and shuffle the elements of vector operands.

Pack with Saturation. These instructions pack larger data types into smaller data types, thus halving the precision of each element in a vector operand.

PACKSSDW--Pack with Saturation Signed Doubleword to Word
PACKSSWB--Pack with Saturation Signed Word to Byte
PACKUSWB--Pack with Saturation Signed Word to Unsigned Byte

The PACKSSDW instruction converts the four signed doubleword integers in each of its two source operands (an XMM register, and another XMM register or 128-bit memory location) into signed word integers and packs the converted values into the destination operand (an XMM register). The PACKSSWB instruction does the analogous conversion between word elements in the source vectors and byte elements in the destination vector. The PACKUSWB instruction does the same as PACKSSWB except that it converts signed word integers into unsigned (rather than signed) bytes.

Figure 4-20 on page 169 shows an example of a PACKSSDW instruction. The operation merges vector elements of 2x size into vector elements of 1x size, thus reducing the precision of the vector-element data types. Any results that would otherwise overflow or underflow are saturated (clamped) at the maximum or minimum representable value, respectively, as described in "Saturation" on page 149.

[Figure 4-20. PACKSSDW Pack Operation]

Conversion from higher-to-lower precision is often needed, for example, by multiplication operations in which the higher-precision format is used for source operands in order to prevent possible overflow, and the lower-precision format is the desired format for the next operation.

Unpack and Interleave. These instructions interleave vector elements from the high or low halves of two integer source operands. They can be used to double the precision of operands.

PUNPCKHBW--Unpack and Interleave High Bytes
PUNPCKHWD--Unpack and Interleave High Words
PUNPCKHDQ--Unpack and Interleave High Doublewords
PUNPCKHQDQ--Unpack and Interleave High Quadwords
PUNPCKLBW--Unpack and Interleave Low Bytes
PUNPCKLWD--Unpack and Interleave Low Words
PUNPCKLDQ--Unpack and Interleave Low Doublewords
PUNPCKLQDQ--Unpack and Interleave Low Quadwords

The PUNPCKHBW instruction copies the eight high-order bytes from its two source operands (an XMM register, and another XMM register or 128-bit memory location) and interleaves them into the 128-bit destination operand (an XMM register). The bytes in the low-order half of the source operands are ignored. The PUNPCKHWD, PUNPCKHDQ, and PUNPCKHQDQ instructions perform analogous operations for words, doublewords, and quadwords in the source operands, packing them into interleaved words, interleaved doublewords, and interleaved quadwords in the destination operand.

The PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ, and PUNPCKLQDQ instructions are analogous to their high-element counterparts except that they take elements from the low quadword of each source vector and ignore elements in the high quadword.
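The pack and unpack operations described above can be sketched as scalar C models. This is a minimal illustration, not the instruction definitions: the function names are hypothetical, arrays are indexed with element 0 as the least-significant element, and the placement of operand 1 results in the low half of the PACKSSDW destination is stated here as an assumption based on the usual definition of the instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp one signed doubleword to the signed-word range, as PACKSSDW
 * does for each element. */
static int16_t saturate_to_word(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of PACKSSDW: words 0..3 of dst come from op1, words 4..7 from op2
 * (ordering assumed, see the text above). */
void packssdw_model(int16_t dst[8], const int32_t op1[4], const int32_t op2[4])
{
    assert(dst != NULL && op1 != NULL && op2 != NULL);
    for (int i = 0; i < 4; i++) {
        dst[i]     = saturate_to_word(op1[i]);
        dst[i + 4] = saturate_to_word(op2[i]);
    }
}

/* Model of PUNPCKLWD: interleave the four low-order words of op1 and op2.
 * Even result words come from op1, odd result words from op2.
 * In this model dst must not overlap either source array. */
void punpcklwd_model(uint16_t dst[8], const uint16_t op1[8], const uint16_t op2[8])
{
    assert(dst != NULL && op1 != NULL && op2 != NULL);
    for (int i = 0; i < 4; i++) {
        dst[2 * i]     = op1[i];
        dst[2 * i + 1] = op2[i];
    }
}
```

Calling punpcklwd_model with an all-zero op2 widens each low-order word of op1 to a zero-extended doubleword, and packssdw_model narrows doublewords back to words with clamping.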
Depending on the hardware implementation, if the source operand for PUNPCKLx and PUNPCKHx instructions is in memory, only the low 64 bits of the operand may be loaded.

Figure 4-21 shows an example of the PUNPCKLWD instruction. The elements are taken from the low half of the source operands. In this register image, elements from operand2 are placed to the left of elements from operand1.

[Figure 4-21. PUNPCKLWD Unpack and Interleave Operation]

If operand 2 is a vector consisting of all zero-valued elements, the unpack instructions perform the function of expanding vector elements of 1x size into vector elements of 2x size. Conversion from lower-to-higher precision is often needed, for example, prior to multiplication operations in which the higher-precision format is used for source operands in order to prevent possible overflow during multiplication.

If both source operands are of identical value, the unpack instructions can perform the function of duplicating adjacent elements in a vector.

The PUNPCKx instructions can be used in a repeating sequence to transpose rows and columns of an array. For example, such a sequence could begin with PUNPCKxWD and be followed by PUNPCKxDQ. These instructions can also be used to convert pixel representation from RGB format to color-plane format, or to interleave interpolation elements into a vector.

As noted above, and depending on the hardware implementation, the width of the memory access performed by the memory-operand forms of PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ, and PUNPCKLQDQ may be 64 bits, but the width of the memory access of the memory-operand forms of PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ, and PUNPCKHQDQ may be 128 bits. Thus, the alignment constraints for PUNPCKLx instructions may be less restrictive than the alignment constraints for PUNPCKHx instructions. For details, see the documentation for particular hardware implementations of the architecture.

Another advantage of using PUNPCKLx rather than PUNPCKHx--also depending on the hardware implementation--is that it may help avoid potential size mismatches if a particular hardware implementation uses load-to-store forwarding. In such cases, store data from either a quadword store or the lower quadword of a double-quadword store could be forwarded to PUNPCKLx instructions, but only store data from a double-quadword store could be forwarded to PUNPCKHx instructions.

The PUNPCKx instructions--along with the MOVx instructions--are often among the most frequently used instructions in 128-bit media integer and floating-point procedures.

Extract and Insert. These instructions copy a word element to or from a vector, in a manner specified by an immediate operand.

PEXTRW--Packed Extract Word
PINSRW--Packed Insert Word

The PEXTRW instruction extracts a 16-bit value from an XMM register, as selected by the immediate-byte operand, and writes it to the low-order word of a 32-bit or 64-bit general-purpose register, with zero-extension to 32 or 64 bits. PEXTRW is useful for loading computed values, such as table-lookup indices, into general-purpose registers where the values can be used for addressing tables in memory.

The PINSRW instruction inserts a 16-bit value from the low-order word of a general-purpose register or from a 16-bit memory location into an XMM register. The location in the destination register is selected by the immediate-byte operand. The other words in the destination register operand are not modified.
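The word extract and insert operations described above can be sketched in scalar C. The function names are hypothetical, the XMM register is modelled as eight words with index 0 as the least-significant word, and the use of only the low three bits of the immediate byte as the word selector is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of PEXTRW: return the selected word, zero-extended.
 * Only the low three bits of imm8 select the word (assumed). */
uint32_t pextrw_model(const uint16_t xmm[8], uint8_t imm8)
{
    assert(xmm != NULL);
    return xmm[imm8 & 7];
}

/* Model of PINSRW: overwrite the selected word; the other seven words
 * of the destination are left unchanged. */
void pinsrw_model(uint16_t xmm[8], uint16_t value, uint8_t imm8)
{
    assert(xmm != NULL);
    xmm[imm8 & 7] = value;
}
```

A table-lookup loop might use the extract model to pull each computed index out of the vector, read the table in memory, and use the insert model to place the looked-up value back into the same word position.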
Figure 4-22 shows the PINSRW operation.

[Figure 4-22. PINSRW Operation]

Shuffle. These instructions reorder the elements of a vector.

PSHUFD--Packed Shuffle Doublewords
PSHUFHW--Packed Shuffle High Words
PSHUFLW--Packed Shuffle Low Words

The PSHUFD instruction fills each doubleword of the first operand (an XMM register) by copying any one of the doublewords in the second operand (an XMM register or 128-bit memory location). The ordering of the shuffle can occur in one of 256 possible ways, as specified by the third operand, an immediate byte. Figure 4-23 on page 173 shows one of the 256 possible shuffle operations.

[Figure 4-23. PSHUFD Shuffle Operation]

The PSHUFHW and PSHUFLW instructions are analogous to PSHUFD, except that they fill each word of the high or low quadword, respectively, of the first operand by copying any one of the four words in the high or low quadword of the second operand. Figure 4-24 shows the PSHUFHW operation. PSHUFHW and PSHUFLW are useful, for example, in color imaging when computing alpha saturation of RGB values. In this case, PSHUFxW can replicate an alpha value in a register so that parallel comparisons with three RGB values can be performed.

[Figure 4-24. PSHUFHW Shuffle Operation]

4.5.5 Arithmetic

The integer vector-arithmetic instructions perform an arithmetic operation on the elements of two source vectors. Figure 4-25 shows a typical arithmetic operation on vectors of bytes. Such instructions perform 16 arithmetic operations in parallel.

[Figure 4-25. Arithmetic Operation on Vectors of Bytes]

Addition.

PADDB--Packed Add Bytes
PADDW--Packed Add Words
PADDD--Packed Add Doublewords
PADDQ--Packed Add Quadwords
PADDSB--Packed Add with Saturation Bytes
PADDSW--Packed Add with Saturation Words
PADDUSB--Packed Add Unsigned with Saturation Bytes
PADDUSW--Packed Add Unsigned with Saturation Words

The PADDB, PADDW, PADDD, and PADDQ instructions add each packed 8-bit (PADDB), 16-bit (PADDW), 32-bit (PADDD), or 64-bit (PADDQ) integer element in the second operand to the corresponding, same-sized integer element in the first operand and write the integer result to the corresponding, same-sized element of the destination. Figure 4-25 shows a PADDB operation. These instructions operate on both signed and unsigned integers. However, if the result overflows, the carry is ignored and only the low-order byte, word, doubleword, or quadword of each result is written to the destination. The PADDD instruction can be used together with PMADDWD (page 178) to implement dot products.

The PADDSB and PADDSW instructions add each 8-bit (PADDSB) or 16-bit (PADDSW) signed integer element in the second operand to the corresponding, same-sized signed integer element in the first operand and write the signed integer result to the corresponding, same-sized element of the destination. For each result in the destination, if the result is larger than the largest, or smaller than the smallest, representable 8-bit (PADDSB) or 16-bit (PADDSW) signed integer, the result is saturated to the largest or smallest representable value, respectively.

The PADDUSB and PADDUSW instructions perform saturating-add operations analogous to the PADDSB and PADDSW instructions, except on unsigned integer elements.

Subtraction.
PSUBB--Packed Subtract Bytes
PSUBW--Packed Subtract Words
PSUBD--Packed Subtract Doublewords
PSUBQ--Packed Subtract Quadword
PSUBSB--Packed Subtract with Saturation Bytes
PSUBSW--Packed Subtract with Saturation Words
PSUBUSB--Packed Subtract Unsigned and Saturate Bytes
PSUBUSW--Packed Subtract Unsigned and Saturate Words

The subtraction instructions perform operations analogous to the addition instructions.

The PSUBB, PSUBW, PSUBD, and PSUBQ instructions subtract each 8-bit (PSUBB), 16-bit (PSUBW), 32-bit (PSUBD), or 64-bit (PSUBQ) integer element in the second operand from the corresponding, same-sized integer element in the first operand and write the integer result to the corresponding, same-sized element of the destination. For vectors of n number of elements, the operation is:

    operand1[i] = operand1[i] - operand2[i]
    where: i = 0 to n - 1

These instructions operate on both signed and unsigned integers. However, if the result underflows, the borrow is ignored and only the low-order byte, word, doubleword, or quadword of each result is written to the destination.

The PSUBSB and PSUBSW instructions subtract each 8-bit (PSUBSB) or 16-bit (PSUBSW) signed integer element in the second operand from the corresponding, same-sized signed integer element in the first operand and write the signed integer result to the corresponding, same-sized element of the destination. For each result in the destination, if the result is larger than the largest, or smaller than the smallest, representable 8-bit (PSUBSB) or 16-bit (PSUBSW) signed integer, the result is saturated to the largest or smallest representable value, respectively.

The PSUBUSB and PSUBUSW instructions perform saturating-subtract operations analogous to the PSUBSB and PSUBSW instructions, except on unsigned integer elements.

Multiplication.
PMULHW--Packed Multiply High Signed Word
PMULLW--Packed Multiply Low Signed Word
PMULHUW--Packed Multiply High Unsigned Word
PMULUDQ--Packed Multiply Unsigned Doubleword and Store Quadword

The PMULHW instruction multiplies each 16-bit signed integer value in the first operand by the corresponding 16-bit integer in the second operand, producing a 32-bit intermediate result. The instruction then writes the high-order 16 bits of the 32-bit intermediate result of each multiplication to the corresponding word of the destination. The PMULLW instruction performs the same multiplication as PMULHW but writes the low-order 16 bits of the 32-bit intermediate result to the corresponding word of the destination.

Figure 4-26 on page 177 shows the PMULHW and PMULLW operations. The difference between the two is whether the high or low half of each intermediate-element result is copied to the destination result.

[Figure 4-26. PMULxW Multiply Operation]

The PMULHUW instruction performs the same multiplication as PMULHW but on unsigned operands. Without this instruction, it is difficult to perform unsigned integer multiplies using 128-bit media instructions. The instruction is useful in 3D rasterization, which operates on unsigned pixel values.

The PMULUDQ instruction, unlike the other PMULx instructions, preserves the full precision of results by multiplying only half of the source-vector elements.
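The high/low split that PMULHW, PMULLW, and PMULHUW apply to each 32-bit intermediate product can be sketched for a single element pair in scalar C. The function names are hypothetical, and results are returned as unsigned 16-bit patterns so that the sketch does not depend on implementation-defined signed conversions. The last helper shows that the two halves together still hold the full product, which is the precision a single PMULHW or PMULLW gives up.

```c
#include <assert.h>
#include <stdint.h>

/* Bit pattern of the high word of a signed 16x16 product (PMULHW). */
uint16_t pmulhw_model(int16_t a, int16_t b)
{
    uint32_t product = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(product >> 16);
}

/* Bit pattern of the low word of the same product (PMULLW). */
uint16_t pmullw_model(int16_t a, int16_t b)
{
    uint32_t product = (uint32_t)((int32_t)a * (int32_t)b);
    return (uint16_t)(product & 0xFFFFu);
}

/* High word of an unsigned 16x16 product (PMULHUW). */
uint16_t pmulhuw_model(uint16_t a, uint16_t b)
{
    uint32_t product = (uint32_t)a * (uint32_t)b;
    return (uint16_t)(product >> 16);
}

/* Recombine the two signed halves into the full 32-bit product bits. */
uint32_t full_product_bits(int16_t a, int16_t b)
{
    uint32_t bits = ((uint32_t)pmulhw_model(a, b) << 16) | pmullw_model(a, b);
    assert(bits == (uint32_t)((int32_t)a * (int32_t)b));
    return bits;
}
```

For instance, -2 times 3 is -6, whose 32-bit pattern is FFFF_FFFAh, so the high-word model returns FFFFh and the low-word model returns FFFAh.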
PMULUDQ multiplies the 32-bit unsigned integer values in the first (low-order) and third doublewords of the source operands, writes the full 64-bit result of the low-order multiply to the low-order quadword of the destination, and writes the corresponding result of the high-order multiply to the high-order quadword of the destination. Figure 4-27 on page 178 shows a PMULUDQ operation.

[Figure 4-27. PMULUDQ Multiply Operation]

See "Shift" on page 181 for shift instructions that can be used to perform multiplication and division by powers of 2.

Multiply-Add. This instruction multiplies the elements of two source vectors and adds their intermediate results in a single operation.

PMADDWD--Packed Multiply Words and Add Doublewords

The PMADDWD instruction multiplies each 16-bit signed value in the first operand by the corresponding 16-bit signed value in the second operand. The instruction then adds the adjacent 32-bit intermediate results of each multiplication, and writes the 32-bit result of each addition into the corresponding doubleword of the destination. For vectors of n number of source elements (src), m number of destination elements (dst), and n = 2m, the operation is:

    dst[j] = ((src1[i] * src2[i]) + (src1[i+1] * src2[i+1]))
    where: j = 0 to m - 1
           i = 2j

PMADDWD thus performs four signed multiply-adds in parallel. Figure 4-28 on page 179 shows the operation.

[Figure 4-28. PMADDWD Multiply-Add Operation]

PMADDWD can be used with one source operand (for example, a coefficient) taken from memory and the other source operand (for example, the data to be multiplied by that coefficient) taken from an XMM register. The instruction can also be used together with the PADDD instruction (page 174) to compute dot products. Scaling can be done, before or after the multiply, using a vector-shift instruction (page 181).

If all four of the 16-bit source operands used to produce a 32-bit multiply-add result have the value 8000h, the result is represented as 8000_0000h, because the most-negative 16-bit value of 8000h multiplied by itself equals 4000_0000h, and 4000_0000h added to 4000_0000h equals 8000_0000h. The result of multiplying two negative numbers should be a positive number, but 8000_0000h is the most-negative 32-bit number rather than a positive number.

Average.

PAVGB--Packed Average Unsigned Bytes
PAVGW--Packed Average Unsigned Words

The PAVGx instructions compute the rounded average of each unsigned 8-bit (PAVGB) or 16-bit (PAVGW) integer value in the first operand and the corresponding, same-sized unsigned integer in the second operand and write the result in the corresponding, same-sized element of the destination. The rounded average is computed by adding each pair of operands, adding 1 to the temporary sum, and then right-shifting the temporary sum by one bit-position. For vectors of n number of elements, the operation is:

    operand1[i] = ((operand1[i] + operand2[i]) + 1) / 2
    where: i = 0 to n - 1

The PAVGB instruction is useful for MPEG decoding, in which motion compensation performs many byte-averaging operations between and within macroblocks. In addition to speeding up these operations, PAVGB can free up registers and make it possible to unroll the averaging loops.

Sum of Absolute Differences.
PSADBW--Packed Sum of Absolute Differences of Bytes into a Word

The PSADBW instruction computes the absolute values of the differences of corresponding 8-bit unsigned integer values in the two quadword halves of both source operands, sums the differences for each quadword half, and writes the two unsigned 16-bit integer results in the destination. The sum for the high-order half is written in the least-significant word of the destination's high-order quadword, with the remaining bytes cleared to all 0s. The sum for the low-order half is written in the least-significant word of the destination's low-order quadword, with the remaining bytes cleared to all 0s. Figure 4-29 on page 181 shows the PSADBW operation.

Sums of absolute differences are useful, for example, in computing the L1 norm in motion-estimation algorithms for video compression.

[Figure 4-29. PSADBW Sum-of-Absolute-Differences Operation]

4.5.6 Shift

The vector-shift instructions are useful for scaling vector elements to higher or lower precision, packing and unpacking vector elements, and multiplying and dividing vector elements by powers of 2.

Left Logical Shift.

PSLLW--Packed Shift Left Logical Words
PSLLD--Packed Shift Left Logical Doublewords
PSLLQ--Packed Shift Left Logical Quadwords
PSLLDQ--Packed Shift Left Logical Double Quadword

The PSLLW, PSLLD, and PSLLQ instructions left-shift each of the 16-bit, 32-bit, or 64-bit values, respectively, in the first operand by the number of bits specified in the second operand. The instructions then write each shifted value into the corresponding, same-sized element of the destination. The low-order bits that are emptied by the shift operation are cleared to 0.
The first operand is an XMM register. The second operand can be an XMM register, 128-bit memory location, or immediate byte.

In integer arithmetic, left logical shifts effectively multiply unsigned operands by positive powers of 2. Thus, for vectors of n number of elements, the operation is:

    operand1[i] = operand1[i] * 2^operand2
    where: i = 0 to n - 1

The PSLLDQ instruction differs from the other three left-shift instructions because it operates on bytes rather than bits. It left-shifts the 128-bit (double quadword) value in an XMM register by the number of bytes specified in an immediate byte value.

Right Logical Shift.

PSRLW--Packed Shift Right Logical Words
PSRLD--Packed Shift Right Logical Doublewords
PSRLQ--Packed Shift Right Logical Quadwords
PSRLDQ--Packed Shift Right Logical Double Quadword

The PSRLW, PSRLD, and PSRLQ instructions right-shift each of the 16-bit, 32-bit, or 64-bit values, respectively, in the first operand by the number of bits specified in the second operand. The instructions then write each shifted value into the corresponding, same-sized element of the destination. The high-order bits that are emptied by the shift operation are cleared to 0. The first operand is an XMM register. The second operand can be an XMM register, 128-bit memory location, or immediate byte.

In integer arithmetic, right logical bit-shifts effectively divide unsigned operands by positive powers of 2, or they divide positive signed operands by positive powers of 2. Thus, for vectors of n number of elements, the operation is:

    operand1[i] = operand1[i] / 2^operand2
    where: i = 0 to n - 1

The PSRLDQ instruction differs from the other three right-shift instructions because it operates on bytes rather than bits. It right-shifts the 128-bit (double quadword) value in an XMM register by the number of bytes specified in an immediate byte value.
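The bit-wise and byte-wise logical shifts described above can be sketched in scalar C. The function names are hypothetical. Two details are assumptions drawn from the usual definition of these instructions rather than from this section: a bit count larger than 15 clears a word element, and a byte count larger than 15 clears the whole register. The register is modelled as 16 bytes with index 0 as the least-significant byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of PSLLW on one word: multiply by 2^count, keeping 16 bits. */
uint16_t psllw_model(uint16_t v, unsigned count)
{
    return (count > 15) ? 0 : (uint16_t)((uint32_t)v << count);
}

/* Model of PSRLW on one word: unsigned divide by 2^count. */
uint16_t psrlw_model(uint16_t v, unsigned count)
{
    return (count > 15) ? 0 : (uint16_t)(v >> count);
}

/* Model of PSRLDQ: shift the whole 128-bit value right by nbytes bytes,
 * filling the vacated high-order bytes with zeros. Works in place. */
void psrldq_model(uint8_t xmm[16], uint8_t nbytes)
{
    assert(xmm != NULL);
    for (unsigned i = 0; i < 16; i++) {
        xmm[i] = (i + nbytes < 16) ? xmm[i + nbytes] : 0;
    }
}
```

With a byte count of 8, the byte-shift model moves the high quadword of the register into the low quadword and clears the high quadword.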
PSRLDQ can be used, for example, to move the high 8 bytes of an XMM register to the low 8 bytes of the register. In some implementations, however, PUNPCKHQDQ may be a better choice for this operation.

Right Arithmetic Shift.

PSRAW--Packed Shift Right Arithmetic Words
PSRAD--Packed Shift Right Arithmetic Doublewords

The PSRAx instructions right-shift each of the 16-bit (PSRAW) or 32-bit (PSRAD) values in the first operand by the number of bits specified in the second operand. The instructions then write each shifted value into the corresponding, same-sized element of the destination. The high-order bits that are emptied by the shift operation are filled with the sign bit of the initial value.

In integer arithmetic, right arithmetic shifts effectively divide signed operands by positive powers of 2. Thus, for vectors of n number of elements, the operation is:

    operand1[i] = operand1[i] / 2^operand2
    where: i = 0 to n - 1

4.5.7 Compare

The integer vector-compare instructions compare two operands, and they either write a mask or they write the maximum or minimum value.

Compare and Write Mask.

PCMPEQB--Packed Compare Equal Bytes
PCMPEQW--Packed Compare Equal Words
PCMPEQD--Packed Compare Equal Doublewords
PCMPGTB--Packed Compare Greater Than Signed Bytes
PCMPGTW--Packed Compare Greater Than Signed Words
PCMPGTD--Packed Compare Greater Than Signed Doublewords

The PCMPEQx and PCMPGTx instructions compare corresponding bytes, words, or doublewords in the two source operands. The instructions then write a mask of all 1s or 0s for each compare into the corresponding, same-sized element of the destination. Figure 4-30 on page 184 shows a PCMPEQB compare operation. It performs 16 compares in parallel.

[Figure 4-30. PCMPEQB Compare Operation]

For the PCMPEQx instructions, if the compared values are equal, the result mask is all 1s. If the values are not equal, the result mask is all 0s. For the PCMPGTx instructions, if the signed value in the first operand is greater than the signed value in the second operand, the result mask is all 1s. If the value in the first operand is less than or equal to the value in the second operand, the result mask is all 0s.

By specifying the same register for both operands, PCMPEQx can be used to set the bits in an XMM register to all 1s.

Figure 4-10 on page 138 shows an example of a non-branching sequence that implements a two-way multiplexer--one that is equivalent to the following sequence of ternary operators in C or C++:

    r0 = a0 > b0 ? a0 : b0
    r1 = a1 > b1 ? a1 : b1
    r2 = a2 > b2 ? a2 : b2
    r3 = a3 > b3 ? a3 : b3
    r4 = a4 > b4 ? a4 : b4
    r5 = a5 > b5 ? a5 : b5
    r6 = a6 > b6 ? a6 : b6
    r7 = a7 > b7 ? a7 : b7

Assuming xmm0 contains the vector a, and xmm1 contains the vector b, the above C sequence can be implemented with the following assembler sequence:

    MOVDQA  xmm3, xmm0    ; copy all eight words of a
    PCMPGTW xmm3, xmm1    ; a > b ? 0xffff : 0
    PAND    xmm0, xmm3    ; a > b ? a : 0
    PANDN   xmm3, xmm1    ; a > b ? 0 : b
    POR     xmm0, xmm3    ; r = a > b ? a : b

In the above sequence, PCMPGTW, PAND, PANDN, and POR operate, in parallel, on all eight elements of the vectors.

Compare and Write Minimum or Maximum.
PMAXUB--Packed Maximum Unsigned Bytes
PMINUB--Packed Minimum Unsigned Bytes
PMAXSW--Packed Maximum Signed Words
PMINSW--Packed Minimum Signed Words

The PMAXUB and PMINUB instructions compare each of the 8-bit unsigned integer values in the first operand with the corresponding 8-bit unsigned integer values in the second operand. The instructions then write the maximum (PMAXUB) or minimum (PMINUB) of the two values for each comparison into the corresponding byte of the destination.

The PMAXSW and PMINSW instructions perform operations analogous to the PMAXUB and PMINUB instructions, except on 16-bit signed integer values.

4.5.8 Logical

The vector-logic instructions perform Boolean logic operations, including AND, OR, and exclusive OR.

And.

PAND--Packed Logical Bitwise AND
PANDN--Packed Logical Bitwise AND NOT

The PAND instruction performs a logical bitwise AND of the values in the first and second operands and writes the result to the destination.

The PANDN instruction inverts the first operand (creating a ones-complement of the operand), ANDs it with the second operand, and writes the result to the destination. Table 4-10 on page 186 shows an example.

Table 4-10. Example PANDN Bit Values

    Operand1 Bit   Operand1 Bit (Inverted)   Operand2 Bit   PANDN Result Bit
         1                    0                    1                 0
         1                    0                    0                 0
         0                    1                    1                 1
         0                    1                    0                 0

Or.

POR--Packed Logical Bitwise OR

The POR instruction performs a logical bitwise OR of the values in the first and second operands and writes the result to the destination.

Exclusive Or.

PXOR--Packed Logical Bitwise Exclusive OR

The PXOR instruction performs a logical bitwise exclusive OR of the values in the first and second operands and writes the result to the destination. PXOR can be used to clear all bits in an XMM register by specifying the same register for both operands.
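The logical instructions above are commonly combined with a compare mask to select between two vectors without a data-dependent branch. The following scalar C sketch models that combination for eight signed words. The function name is hypothetical, and the final cast back to a signed word relies on two's-complement wraparound, which mainstream compilers provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-word model of a compare-mask select:
 *   r[i] = (a[i] > b[i]) ? a[i] : b[i]
 * built from the same steps as PCMPGTW, PAND, PANDN, and POR. */
void select_greater_model(int16_t r[8], const int16_t a[8], const int16_t b[8])
{
    assert(r != NULL && a != NULL && b != NULL);
    for (int i = 0; i < 8; i++) {
        /* Compare step: all 1s if a > b, otherwise all 0s. */
        uint16_t mask = (uint16_t)-(a[i] > b[i]);
        /* AND step: keep a where the mask is set. */
        uint16_t keep_a = (uint16_t)((uint16_t)a[i] & mask);
        /* AND NOT step: invert the mask, then keep b where it was clear. */
        uint16_t keep_b = (uint16_t)((uint16_t)~mask & (uint16_t)b[i]);
        /* OR step: merge the two disjoint selections. */
        r[i] = (int16_t)(keep_a | keep_b);
    }
}
```

Because the two masked values never have bits set in the same element, the final OR simply merges them. For signed words the result matches what PMAXSW produces directly; the mask-based form is more general because any compare can supply the mask.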
4.5.9 Save and Restore State

These instructions save and restore the entire processor state for 128-bit media instructions.

Save and Restore 128-Bit, 64-Bit, and x87 State.

FXSAVE--Save XMM, MMX, and x87 State
FXRSTOR--Restore XMM, MMX, and x87 State

The FXSAVE and FXRSTOR instructions save and restore the entire 512-byte processor state for 128-bit media instructions, 64-bit media instructions, and x87 floating-point instructions. The architecture supports two memory formats for FXSAVE and FXRSTOR, a 512-byte 32-bit legacy format and a 512-byte 64-bit format. Selection of the 32-bit or 64-bit format is determined by the effective operand size for the FXSAVE and FXRSTOR instructions. For details, see "Saving Media and x87 Processor State" in Volume 2.

Save and Restore Control and Status.

STMXCSR--Store MXCSR Control/Status Register
LDMXCSR--Load MXCSR Control/Status Register

The STMXCSR and LDMXCSR instructions save and restore the 32-bit contents of the MXCSR register. For further information, see "MXCSR Register" on page 140.

4.6 Instruction Summary--Floating-Point Instructions

This section summarizes the functions of the floating-point instructions in the 128-bit media instruction subset. These include floating-point instructions that use an XMM register for source or destination and data-conversion instructions that convert from floating-point to integer formats. For a summary of the integer instructions in the 128-bit media instruction subset, including data-conversion instructions that convert from integer to floating-point formats, see "Instruction Summary--Integer Instructions" on page 160.

For a summary of the 64-bit media floating-point instructions, see "Instruction Summary--Floating-Point Instructions" on page 265. For a summary of the x87 floating-point instructions, see "Instruction Summary" on page 315.
The instructions are organized here by functional group--such as data-transfer, vector arithmetic, and so on. Software running at any privilege level can use any of these instructions, if the CPUID instruction reports support for the instructions (see "Feature Detection" on page 209). More detail on individual instructions is given in the alphabetically organized "128-Bit Media Instruction Reference" on page 1.

4.6.1 Syntax

The 128-bit media floating-point instructions have the same syntax rules as those for the 128-bit media integer instructions, described in "Syntax" on page 160. For an illustration of typical syntax, see Figure 4-16 on page 160.

4.6.2 Data Transfer

The data-transfer instructions copy operands between 32-bit, 64-bit, or 128-bit memory locations and XMM registers. The MOV mnemonic, which stands for move, is a misnomer. A copy function is actually performed instead of a move. A new copy of the source value is created at the destination address, and the original copy remains unchanged at its source location.

Move.

MOVAPS--Move Aligned Packed Single-Precision Floating-Point
MOVAPD--Move Aligned Packed Double-Precision Floating-Point
MOVUPS--Move Unaligned Packed Single-Precision Floating-Point
MOVUPD--Move Unaligned Packed Double-Precision Floating-Point
MOVHPS--Move High Packed Single-Precision Floating-Point
MOVHPD--Move High Packed Double-Precision Floating-Point
MOVLPS--Move Low Packed Single-Precision Floating-Point
MOVLPD--Move Low Packed Double-Precision Floating-Point
MOVHLPS--Move Packed Single-Precision Floating-Point High to Low
MOVLHPS--Move Packed Single-Precision Floating-Point Low to High
MOVSS--Move Scalar Single-Precision Floating-Point
MOVSD--Move Scalar Double-Precision Floating-Point

Figure 4-31 on page 189 shows the capabilities of the various floating-point move instructions.
The MOVAPx instructions copy a vector of four single-precision floating-point values (MOVAPS) or a vector of two double-precision floating-point values (MOVAPD) from the second operand to the first operand--i.e., from an XMM register or 128-bit memory location to another XMM register, or vice versa. A general-protection exception occurs if a memory operand is not aligned on a 16-byte boundary.

The MOVUPx instructions perform operations analogous to the MOVAPx instructions, except that unaligned memory operands do not cause a general-protection exception.

Figure 4-31. Floating-Point Move Operations
    XMM register or memory (source) to XMM register (destination): MOVAPS, MOVAPD, MOVUPS, MOVUPD, MOVLPS*, MOVLPD*, MOVHPS*, MOVHPD*, MOVSD, MOVSS
    XMM register (source) to XMM register or memory (destination): MOVAPS, MOVAPD, MOVUPS, MOVUPD, MOVLPS*, MOVLPD*, MOVHPS*, MOVHPD*, MOVSD, MOVSS
    XMM register (source) to XMM register (destination): MOVHLPS, MOVLHPS
    * These instructions copy data only between memory and register or vice versa, not between two registers.

The MOVHPS and MOVHPD instructions copy a vector of two single-precision floating-point values (MOVHPS) or one double-precision floating-point value (MOVHPD) from a 64-bit memory location to the high-order 64 bits of an XMM register, or from the high-order 64 bits of an XMM register to a 64-bit memory location. In the memory-to-register case, the low-order 64 bits of the destination XMM register are not modified.
The MOVLPS and MOVLPD instructions copy a vector of two single-precision floating-point values (MOVLPS) or one double-precision floating-point value (MOVLPD) from a 64-bit memory location to the low-order 64 bits of an XMM register, or from the low-order 64 bits of an XMM register to a 64-bit memory location. In the memory-to-register case, the high-order 64 bits of the destination XMM register are not modified.

The MOVHLPS instruction copies a vector of two single-precision floating-point values from the high-order 64 bits of an XMM register to the low-order 64 bits of another XMM register. The high-order 64 bits of the destination XMM register are not modified. The MOVLHPS instruction performs an analogous operation except in the opposite direction (low-order to high-order), and the low-order 64 bits of the destination XMM register are not modified.

The MOVSS instruction copies a scalar single-precision floating-point value from the low-order 32 bits of an XMM register or a 32-bit memory location to the low-order 32 bits of another XMM register, or vice versa. If the source operand is an XMM register, the high-order 96 bits of the destination XMM register are not modified. If the source operand is a 32-bit memory location, the high-order 96 bits of the destination XMM register are cleared to all 0s.

The MOVSD instruction copies a scalar double-precision floating-point value from the low-order 64 bits of an XMM register or a 64-bit memory location to the low-order 64 bits of another XMM register, or vice versa. If the source operand is an XMM register, the high-order 64 bits of the destination XMM register are not modified. If the source operand is a memory location, the high-order 64 bits of the destination XMM register are cleared to all 0s.

The above MOVSD instruction should not be confused with the same-mnemonic MOVSD (move string doubleword) instruction in the general-purpose instruction set. Assemblers distinguish the two instructions by their operand data types.
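The difference between the register-source and memory-source forms of MOVSS is easy to overlook, so a small model may help. The sketch below is plain C that mimics the documented effect on the four doublewords of the destination; the type xmm_model and both function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an XMM register as four single-precision elements.
 * f[0] is the low-order doubleword. */
typedef struct { float f[4]; } xmm_model;

/* MOVSS xmm, xmm: only the low-order 32 bits are replaced;
 * the high-order 96 bits of the destination are preserved. */
xmm_model movss_from_register(xmm_model dst, xmm_model src)
{
    dst.f[0] = src.f[0];
    return dst;
}

/* MOVSS xmm, mem32: the low-order 32 bits are loaded and
 * the high-order 96 bits of the destination are cleared to 0. */
xmm_model movss_from_memory(const float *mem)
{
    assert(mem != NULL);
    xmm_model dst = {{ *mem, 0.0f, 0.0f, 0.0f }};
    return dst;
}
```

The same pattern applies to MOVSD with two 64-bit elements in place of four 32-bit elements.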
Move Non-Temporal.

The move non-temporal instructions are streaming-store instructions. They minimize pollution of the cache.

MOVNTPS--Move Non-Temporal Packed Single-Precision Floating-Point
MOVNTPD--Move Non-Temporal Packed Double-Precision Floating-Point

The MOVNTPx instructions copy four single-precision floating-point (MOVNTPS) or two double-precision floating-point (MOVNTPD) values from an XMM register into a 128-bit memory location. These instructions indicate to the processor that their data is non-temporal, which assumes that the data they reference will be used only once and is therefore not subject to cache-related overhead (as opposed to temporal data, which assumes that the data will be accessed again soon and should be cached). The non-temporal instructions use weakly-ordered, write-combining buffering of write data, and they minimize cache pollution. The exact method by which cache pollution is minimized depends on the hardware implementation of the instruction. For further information, see "Memory Optimization" on page 113.

Move Mask.

MOVMSKPS--Extract Packed Single-Precision Floating-Point Sign Mask
MOVMSKPD--Extract Packed Double-Precision Floating-Point Sign Mask

The MOVMSKPS instruction copies the sign bits of four single-precision floating-point values in an XMM register to the four low-order bits of a 32-bit or 64-bit general-purpose register, with zero-extension. The MOVMSKPD instruction copies the sign bits of two double-precision floating-point values in an XMM register to the two low-order bits of a general-purpose register, with zero-extension. The result of either instruction is a sign-bit mask that can be used for data-dependent branching. Figure 4-32 on page 192 shows the MOVMSKPS operation.
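A minimal sketch of the MOVMSKPS sign-mask extraction, written as portable C rather than as the instruction itself, is shown here. It assumes IEEE 754 single-precision floats with the sign in bit 31; the function name movmskps_model is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of MOVMSKPS: gathers the sign bit of each of the
 * four single-precision elements into the four low-order bits of the result.
 * v[0] is the low-order element and maps to bit 0 of the mask. */
uint32_t movmskps_model(const float v[4])
{
    uint32_t mask = 0;
    assert(v != NULL);
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);  /* reinterpret the float's bit pattern */
        mask |= (bits >> 31) << i;          /* sign bit becomes mask bit i */
    }
    return mask;                            /* upper bits stay zero (zero-extension) */
}
```

Note that the mask reflects the sign bit only, so a negative zero contributes a 1 even though it compares equal to positive zero.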
Figure 4-32. MOVMSKPS Move Mask Operation

4.6.3 Data Conversion

The floating-point data-conversion instructions convert floating-point operands to integer operands.

These data-conversion instructions take 128-bit floating-point source operands. For data-conversion instructions that take 128-bit integer source operands, see "Data Conversion" on page 166. For data-conversion instructions that take 64-bit source operands, see "Data Conversion" on page 250 and "Data Conversion" on page 266.

Convert Floating-Point to Floating-Point.

These instructions convert floating-point data types in XMM registers or memory into different floating-point data types in XMM registers.

CVTPS2PD--Convert Packed Single-Precision Floating-Point to Packed Double-Precision Floating-Point
CVTPD2PS--Convert Packed Double-Precision Floating-Point to Packed Single-Precision Floating-Point
CVTSS2SD--Convert Scalar Single-Precision Floating-Point to Scalar Double-Precision Floating-Point
CVTSD2SS--Convert Scalar Double-Precision Floating-Point to Scalar Single-Precision Floating-Point

The CVTPS2PD instruction converts two single-precision floating-point values in the low-order 64 bits of the second operand (an XMM register or a 64-bit memory location) to two double-precision floating-point values in the destination operand (an XMM register).

The CVTPD2PS instruction converts two double-precision floating-point values in the second operand to two single-precision floating-point values in the low-order 64 bits of the destination. The high-order 64 bits in the destination XMM register are cleared to all 0s. If the result of the conversion is an inexact value, the value is rounded.
The CVTSS2SD instruction converts a single-precision floating-point value in the low-order 32 bits of the second operand to a double-precision floating-point value in the low-order 64 bits of the destination. The high-order 64 bits in the destination XMM register are not modified.

The CVTSD2SS instruction converts a double-precision floating-point value in the low-order 64 bits of the second operand to a single-precision floating-point value in the low-order 32 bits of the destination. The three high-order doublewords in the destination XMM register are not modified. If the result of the conversion is an inexact value, the value is rounded.

Convert Floating-Point to XMM Integer.

These instructions convert floating-point data types in XMM registers or memory into integer data types in XMM registers.

CVTPS2DQ--Convert Packed Single-Precision Floating-Point to Packed Doubleword Integers
CVTPD2DQ--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers
CVTTPS2DQ--Convert Packed Single-Precision Floating-Point to Packed Doubleword Integers, Truncated
CVTTPD2DQ--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers, Truncated

The CVTPS2DQ and CVTTPS2DQ instructions convert four single-precision floating-point values in the second operand to four 32-bit signed integer values in the destination. For the CVTPS2DQ instruction, if the result of the conversion is an inexact value, the value is rounded, but for the CVTTPS2DQ instruction such a result is truncated (rounded toward zero).

The CVTPD2DQ and CVTTPD2DQ instructions convert two double-precision floating-point values in the second operand to two 32-bit signed integer values in the destination. The high-order 64 bits in the destination XMM register are cleared to all 0s. For the CVTPD2DQ instruction, if the result of the conversion is an inexact value, the value is rounded, but for the
CVTTPD2DQ instruction such a result is truncated (rounded toward zero).

For a description of 128-bit media instructions that convert in the opposite direction--integer to floating-point--see "Convert Integer to Floating-Point" on page 166.

Convert Floating-Point to MMX™ Integer.

These instructions convert floating-point data types in XMM registers or memory into integer data types in MMX registers.

CVTPS2PI--Convert Packed Single-Precision Floating-Point to Packed Doubleword Integers
CVTPD2PI--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers
CVTTPS2PI--Convert Packed Single-Precision Floating-Point to Packed Doubleword Integers, Truncated
CVTTPD2PI--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers, Truncated

The CVTPS2PI and CVTTPS2PI instructions convert two single-precision floating-point values in the low-order 64 bits of an XMM register or a 64-bit memory location to two 32-bit signed integer values in an MMX register. For the CVTPS2PI instruction, if the result of the conversion is an inexact value, the value is rounded, but for the CVTTPS2PI instruction such a result is truncated (rounded toward zero).

The CVTPD2PI and CVTTPD2PI instructions convert two double-precision floating-point values in an XMM register or a 128-bit memory location to two 32-bit signed integer values in an MMX register. For the CVTPD2PI instruction, if the result of the conversion is an inexact value, the value is rounded, but for the CVTTPD2PI instruction such a result is truncated (rounded toward zero).

Before executing a CVTxPS2PI or CVTxPD2PI instruction, software should ensure that the MMX registers are properly initialized so as to prevent conflict with their aliased use by x87 floating-point instructions. This may require clearing the MMX state, as described in "Accessing Operands in MMX™ Registers" on page 223.
For a description of 128-bit media instructions that convert in the opposite direction--integer in MMX registers to floating-point in XMM registers--see "Convert MMX Integer to Floating-Point" on page 166. For a summary of instructions that operate on MMX registers, see Chapter 5, "64-Bit Media Programming."

Convert Floating-Point to GPR Integer.

These instructions convert floating-point data types in XMM registers or memory into integer data types in GPR registers.

CVTSS2SI--Convert Scalar Single-Precision Floating-Point to Signed Doubleword or Quadword Integer
CVTSD2SI--Convert Scalar Double-Precision Floating-Point to Signed Doubleword or Quadword Integer
CVTTSS2SI--Convert Scalar Single-Precision Floating-Point to Signed Doubleword or Quadword Integer, Truncated
CVTTSD2SI--Convert Scalar Double-Precision Floating-Point to Signed Doubleword or Quadword Integer, Truncated

The CVTSS2SI and CVTTSS2SI instructions convert a single-precision floating-point value in the low-order 32 bits of an XMM register or a 32-bit memory location to a 32-bit or 64-bit signed integer value in a general-purpose register. For the CVTSS2SI instruction, if the result of the conversion is an inexact value, the value is rounded, but for the CVTTSS2SI instruction such a result is truncated (rounded toward zero).

The CVTSD2SI and CVTTSD2SI instructions convert a double-precision floating-point value in the low-order 64 bits of an XMM register or a 64-bit memory location to a 32-bit or 64-bit signed integer value in a general-purpose register. For the CVTSD2SI instruction, if the result of the conversion is an inexact value, the value is rounded, but for the CVTTSD2SI instruction such a result is truncated (rounded toward zero).
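The practical difference between the rounding and truncating conversions can be seen in a small scalar model. The sketch below assumes the rounding mode is round-to-nearest-even (the default setting of the MXCSR rounding control) and handles only inputs that fit in a 32-bit signed integer; both function names are hypothetical, and out-of-range or NaN inputs are rejected by an assertion rather than modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the truncating form (CVTTSS2SI with a 32-bit result):
 * the fractional part is discarded, i.e. rounding toward zero. */
int32_t cvtt_f32_to_i32_model(float x)
{
    assert(x >= -2147483648.0f && x < 2147483648.0f); /* in-range inputs only */
    return (int32_t)x;
}

/* Hypothetical model of the rounding form (CVTSS2SI with a 32-bit result),
 * ASSUMING round-to-nearest-even is selected. */
int32_t cvt_f32_to_i32_model(float x)
{
    int32_t t = cvtt_f32_to_i32_model(x); /* integer part, toward zero */
    float frac = x - (float)t;            /* signed fractional remainder */

    /* Move away from zero when past the halfway point, or exactly at the
     * halfway point with an odd integer part (ties go to the even value). */
    if (frac > 0.5f || (frac == 0.5f && (t & 1))) {
        return t + 1;
    }
    if (frac < -0.5f || (frac == -0.5f && (t & 1))) {
        return t - 1;
    }
    return t;
}
```

With these models, an input of 2.7 truncates to 2 but rounds to 3, while an input of 2.5 yields 2 in both cases because the tie resolves to the even neighbor.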
For a description of 128-bit media instructions that convert in the opposite direction--integer in GPR registers to floating-point in XMM registers--see "Convert GPR Integer to Floating-Point" on page 167. For a summary of instructions that operate on GPR registers, see Chapter 3, "General-Purpose Programming."

4.6.4 Data Reordering

The floating-point data-reordering instructions unpack and interleave, or shuffle the elements of vector operands.

Unpack and Interleave.

These instructions interleave vector elements from the high or low halves of two floating-point source operands.

UNPCKHPS--Unpack High Single-Precision Floating-Point
UNPCKHPD--Unpack High Double-Precision Floating-Point
UNPCKLPS--Unpack Low Single-Precision Floating-Point
UNPCKLPD--Unpack Low Double-Precision Floating-Point

The UNPCKHPx instructions copy the high-order two single-precision floating-point values (UNPCKHPS) or one double-precision floating-point value (UNPCKHPD) in the first and second operands and interleave them into the 128-bit destination. The low-order 64 bits of the source operands are ignored.

The UNPCKLPx instructions are analogous to their high-element counterparts except that they take elements from the low quadword of each source vector and ignore elements in the high quadword.

Depending on the hardware implementation, if the source operand for UNPCKHPx or UNPCKLPx is in memory, only the low 64 bits of the operand may be loaded.

Figure 4-33 shows the UNPCKLPS instruction. The elements are taken from the low half of the source operands. In this register image, elements from operand2 are placed to the left of elements from operand1.

Figure 4-33. UNPCKLPS Unpack and Interleave Operation
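The interleaving performed by UNPCKLPS can be sketched in a few lines of portable C. This is a scalar model for illustration only; the function name unpcklps_model is hypothetical, and index 0 denotes the low-order element of each operand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scalar model of UNPCKLPS: interleaves the two low-order
 * single-precision elements of op1 and op2. In each pair, the op1 element
 * lands in the lower position and the op2 element in the higher one. */
void unpcklps_model(const float op1[4], const float op2[4], float result[4])
{
    float r[4];
    assert(op1 != NULL && op2 != NULL && result != NULL);
    r[0] = op1[0];
    r[1] = op2[0];
    r[2] = op1[1];
    r[3] = op2[1];
    /* Copy through a temporary so result may alias op1, as the
     * destination register does in the real instruction. */
    memcpy(result, r, sizeof r);
}
```

A model of UNPCKHPS would follow the same shape, reading elements 2 and 3 of each operand instead of elements 0 and 1.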
Shuffle.

These instructions reorder the elements of a vector.

SHUFPS--Shuffle Packed Single-Precision Floating-Point
SHUFPD--Shuffle Packed Double-Precision Floating-Point

The SHUFPS instruction moves any two of the four single-precision floating-point values in the first operand to the low-order quadword of the destination and moves any two of the four single-precision floating-point values in the second operand to the high-order quadword of the destination. In each case, the value of the destination is determined by a field in the immediate-byte operand.

Figure 4-34 shows the SHUFPS shuffle operation. SHUFPS is useful, for example, in color imaging when computing alpha saturation of RGB values. In this case, SHUFPS can replicate an alpha value in a register so that parallel comparisons with three RGB values can be performed.

Figure 4-34. SHUFPS Shuffle Operation

The SHUFPD instruction moves either of the two double-precision floating-point values in the first operand to the low-order quadword of the destination and moves either of the two double-precision floating-point values in the second operand to the high-order quadword of the destination.

4.6.5 Arithmetic

The floating-point vector-arithmetic instructions perform an arithmetic operation on two floating-point operands.

Addition.

ADDPS--Add Packed Single-Precision Floating-Point
ADDPD--Add Packed Double-Precision Floating-Point
ADDSS--Add Scalar Single-Precision Floating-Point
ADDSD--Add Scalar Double-Precision Floating-Point

The ADDPS instruction adds each of four single-precision floating-point values in the first operand to the corresponding single-precision floating-point values in the second operand and writes the result in the corresponding doubleword of the destination.
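Returning briefly to the SHUFPS selection rule described earlier in this section, the sketch below models how the four 2-bit fields of the immediate byte pick the source elements. It is a portable C model, not the instruction itself; the function name shufps_model is hypothetical, and the field layout used (bits 1:0 and 3:2 select from the first operand, bits 5:4 and 7:6 from the second) is the standard SHUFPS encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of SHUFPS. Index 0 is the low-order element.
 * The low quadword of the result is filled from op1 and the high quadword
 * from op2, each element chosen by a 2-bit field of imm8. */
void shufps_model(const float op1[4], const float op2[4], uint8_t imm8,
                  float result[4])
{
    float r[4];
    assert(op1 != NULL && op2 != NULL && result != NULL);
    r[0] = op1[imm8 & 3];         /* bits 1:0 select from the first operand  */
    r[1] = op1[(imm8 >> 2) & 3];  /* bits 3:2 select from the first operand  */
    r[2] = op2[(imm8 >> 4) & 3];  /* bits 5:4 select from the second operand */
    r[3] = op2[(imm8 >> 6) & 3];  /* bits 7:6 select from the second operand */
    memcpy(result, r, sizeof r);  /* temporary allows result to alias op1    */
}
```

Passing the same vector as both operands with an immediate of 0xFF replicates element 3 into every position, which corresponds to the alpha-replication use mentioned for color imaging.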
The ADDPD instruction performs an analogous operation for two double-precision floating-point values.

Figure 4-35 shows a typical arithmetic operation on vectors of floating-point single-precision elements--in this case an ADDPS instruction. The instruction performs four arithmetic operations in parallel.

Figure 4-35. ADDPS Arithmetic Operation

The ADDSS instruction adds the single-precision floating-point value in the low-order doubleword of the first operand to the single-precision floating-point value in the low-order doubleword of the second operand and writes the result in the low-order doubleword of the destination. The three high-order doublewords of the destination are not modified.

The ADDSD instruction adds the double-precision floating-point value in the low-order quadword of the first operand to the double-precision floating-point value in the low-order quadword of the second operand and writes the result in the low-order quadword of the destination. The high-order quadword of the destination is not modified.

Subtraction.

SUBPS--Subtract Packed Single-Precision Floating-Point
SUBPD--Subtract Packed Double-Precision Floating-Point
SUBSS--Subtract Scalar Single-Precision Floating-Point
SUBSD--Subtract Scalar Double-Precision Floating-Point

The SUBPS instruction subtracts each of four single-precision floating-point values in the second operand from the corresponding single-precision floating-point value in the first operand and writes the result in the corresponding doubleword of the destination. The SUBPD instruction performs an analogous operation for two double-precision floating-point values.
For vectors of n elements, the operations are:

    operand1[i] = operand1[i] - operand2[i]

where: i = 0 to n - 1

The SUBSS instruction subtracts the single-precision floating-point value in the low-order doubleword of the second operand from the corresponding single-precision floating-point value in the low-order doubleword of the first operand and writes the result in the low-order doubleword of the destination. The three high-order doublewords of the destination are not modified.

The SUBSD instruction subtracts the double-precision floating-point value in the low-order quadword of the second operand from the corresponding double-precision floating-point value in the low-order quadword of the first operand and writes the result in the low-order quadword of the destination. The high-order quadword of the destination is not modified.

Multiplication.

MULPS--Multiply Packed Single-Precision Floating-Point
MULPD--Multiply Packed Double-Precision Floating-Point
MULSS--Multiply Scalar Single-Precision Floating-Point
MULSD--Multiply Scalar Double-Precision Floating-Point

The MULPS instruction multiplies each of four single-precision floating-point values in the first operand by the corresponding single-precision floating-point value in the second operand and writes the result in the corresponding doubleword of the destination. The MULPD instruction performs an analogous operation for two double-precision floating-point values.

The MULSS instruction multiplies the single-precision floating-point value in the low-order doubleword of the first operand by the single-precision floating-point value in the low-order doubleword of the second operand and writes the result in the low-order doubleword of the destination. The three high-order doublewords of the destination are not modified.
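The contrast between the packed and scalar forms described above can be made concrete with a short C model. The sketch uses subtraction, but the same shape applies to the add, multiply, and divide instructions; the function names subps_model and subss_model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of SUBPS: all four elements are updated,
 * operand1[i] = operand1[i] - operand2[i] for i = 0 to 3. */
void subps_model(float operand1[4], const float operand2[4])
{
    assert(operand1 != NULL && operand2 != NULL);
    for (size_t i = 0; i < 4; i++) {
        operand1[i] = operand1[i] - operand2[i];
    }
}

/* Hypothetical model of SUBSS: only the low-order element is updated;
 * the three high-order doublewords of the destination are left unchanged. */
void subss_model(float operand1[4], const float operand2[4])
{
    assert(operand1 != NULL && operand2 != NULL);
    operand1[0] = operand1[0] - operand2[0];
}
```

In both models the first operand doubles as the destination, matching the two-operand form of the instructions.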
The MULSD instruction multiplies the double-precision floating-point value in the low-order quadword of the first operand by the double-precision floating-point value in the low-order quadword of the second operand and writes the result in the low-order quadword of the destination. The high-order quadword of the destination is not modified.

Division.

DIVPS--Divide Packed Single-Precision Floating-Point
DIVPD--Divide Packed Double-Precision Floating-Point
DIVSS--Divide Scalar Single-Precision Floating-Point
DIVSD--Divide Scalar Double-Precision Floating-Point

The DIVPS instruction divides each of the four single-precision floating-point values in the first operand by the corresponding single-precision floating-point value in the second operand and writes the result in the corresponding doubleword of the destination. The DIVPD instruction performs an analogous operation for two double-precision floating-point values. For vectors of n elements, the operations are:

    operand1[i] = operand1[i] / operand2[i]

where: i = 0 to n - 1

The DIVSS instruction divides the single-precision floating-point value in the low-order doubleword of the first operand by the single-precision floating-point value in the low-order doubleword of the second operand and writes the result in the low-order doubleword of the destination. The three high-order doublewords of the destination are not modified.

The DIVSD instruction divides the double-precision floating-point value in the low-order quadword of the first operand by the double-precision floating-point value in the low-order quadword of the second operand and writes the result in the low-order quadword of the destination. The high-order quadword of the destination is not modified.

If accuracy requirements allow, convert floating-point division by a constant to a multiply by the reciprocal.
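As an informal illustration of this advice, the C sketch below computes the reciprocal of the divisor once and then multiplies each element by it. The function name scale_by_reciprocal is hypothetical, and the code is a scalar model rather than a vectorized routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: divides every element of v by a constant divisor
 * by multiplying with its reciprocal, so the loop contains no division.
 * For divisors whose reciprocal is not exactly representable, results may
 * differ slightly from true division. */
void scale_by_reciprocal(float *v, size_t n, float divisor)
{
    assert(v != NULL && divisor != 0.0f);
    const float reciprocal = 1.0f / divisor; /* one division, outside the loop */
    for (size_t i = 0; i < n; i++) {
        v[i] = v[i] * reciprocal;
    }
}
```

The single division is hoisted out of the loop, which is where the saving comes from when the same constant is applied to many elements.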
Divisors that are powers of two and their reciprocals are exactly representable, and therefore do not cause an accuracy issue, except for the rare cases in which the reciprocal overflows or underflows.

Square Root.

SQRTPS--Square Root Packed Single-Precision Floating-Point
SQRTPD--Square Root Packed Double-Precision Floating-Point
SQRTSS--Square Root Scalar Single-Precision Floating-Point
SQRTSD--Square Root Scalar Double-Precision Floating-Point

The SQRTPS instruction computes the square root of each of four single-precision floating-point values in the second operand (an XMM register or 128-bit memory location) and writes the result in the corresponding doubleword of the destination. The SQRTPD instruction performs an analogous operation for two double-precision floating-point values.

The SQRTSS instruction computes the square root of the low-order single-precision floating-point value in the second operand (an XMM register or 32-bit memory location) and writes the result in the low-order doubleword of the destination. The three high-order doublewords of the destination XMM register are not modified.

The SQRTSD instruction computes the square root of the low-order double-precision floating-point value in the second operand (an XMM register or 64-bit memory location) and writes the result in the low-order quadword of the destination. The high-order quadword of the destination XMM register is not modified.

Reciprocal Square Root.

RSQRTPS--Reciprocal Square Root Packed Single-Precision Floating-Point
RSQRTSS--Reciprocal Square Root Scalar Single-Precision Floating-Point

The RSQRTPS instruction computes the approximate reciprocal of the square root of each of four single-precision floating-point values in the second operand (an XMM register or 128-bit memory location) and writes the result in the corresponding doubleword of the destination.
The RSQRTSS instruction computes the approximate reciprocal of the square root of the low-order single-precision floating-point value in the second operand (an XMM register or 32-bit memory location) and writes the result in the low-order doubleword of the destination. The three high-order doublewords in the destination XMM register are not modified.

For both RSQRTPS and RSQRTSS, the maximum relative error is less than or equal to 1.5 * 2^-12.

Reciprocal Estimation.

RCPPS--Reciprocal Packed Single-Precision Floating-Point
RCPSS--Reciprocal Scalar Single-Precision Floating-Point

The RCPPS instruction computes the approximate reciprocal of each of the four single-precision floating-point values in the second operand (an XMM register or 128-bit memory location) and writes the result in the corresponding doubleword of the destination.

The RCPSS instruction computes the approximate reciprocal of the low-order single-precision floating-point value in the second operand (an XMM register or 32-bit memory location) and writes the result in the low-order doubleword of the destination. The three high-order doublewords in the destination are not modified.

For both RCPPS and RCPSS, the maximum relative error is less than or equal to 1.5 * 2^-12.

4.6.6 Compare

The floating-point vector-compare instructions compare two operands, and they either write a mask, or they write the maximum or minimum value, or they set flags. Compare instructions can be used to avoid branches. Figure 4-10 on page 138 shows an example of using compare instructions.

Compare and Write Mask.
CMPPS--Compare Packed Single-Precision Floating-Point
CMPPD--Compare Packed Double-Precision Floating-Point
CMPSS--Compare Scalar Single-Precision Floating-Point
CMPSD--Compare Scalar Double-Precision Floating-Point

The CMPPS instruction compares each of four single-precision floating-point values in the first operand with the corresponding single-precision floating-point value in the second operand and writes the result in the corresponding 32 bits of the destination. The type of comparison is specified by the three low-order bits of the immediate-byte operand. The result of each compare is a 32-bit value of all 1s (TRUE) or all 0s (FALSE). Some compare operations that are not directly supported by the immediate-byte encodings can be implemented by swapping the contents of the source and destination operands before executing the compare.

The CMPPD instruction performs an analogous operation for two double-precision floating-point values. The CMPSS instruction performs an analogous operation for the single-precision floating-point values in the low-order 32 bits of the source operands. The three high-order doublewords of the destination are not modified. The CMPSD instruction performs an analogous operation for the double-precision floating-point values in the low-order 64 bits of the source operands. The high-order 64 bits of the destination XMM register are not modified.

Figure 4-36 on page 204 shows a CMPPD compare operation.

Figure 4-36. CMPPD Compare Operation

Compare and Write Minimum or Maximum.
- MAXPS--Maximum Packed Single-Precision Floating-Point
- MAXPD--Maximum Packed Double-Precision Floating-Point
- MAXSS--Maximum Scalar Single-Precision Floating-Point
- MAXSD--Maximum Scalar Double-Precision Floating-Point
- MINPS--Minimum Packed Single-Precision Floating-Point
- MINPD--Minimum Packed Double-Precision Floating-Point
- MINSS--Minimum Scalar Single-Precision Floating-Point
- MINSD--Minimum Scalar Double-Precision Floating-Point

The MAXPS and MINPS instructions compare each of four single-precision floating-point values in the first operand with the corresponding single-precision floating-point value in the second operand and write the maximum or minimum, respectively, of the two values in the corresponding doubleword of the destination. The MAXPD and MINPD instructions perform analogous operations on pairs of double-precision floating-point values.

The MAXSS and MINSS instructions compare the single-precision floating-point value in the low-order 32 bits of the first operand with the single-precision floating-point value in the low-order 32 bits of the second operand and write the maximum or minimum, respectively, of the two values in the low-order 32 bits of the destination. The three high-order doublewords of the destination XMM register are not modified.

The MAXSD and MINSD instructions compare the double-precision floating-point value in the low-order 64 bits of the first operand with the double-precision floating-point value in the low-order 64 bits of the second operand and write the maximum or minimum, respectively, of the two values in the low-order quadword of the destination. The high-order quadword of the destination XMM register is not modified.

The MINx and MAXx instructions are useful for clamping (saturating) values, such as color values in 3D geometry and rasterization.

Compare and Write rFLAGS.
- COMISS--Compare Ordered Scalar Single-Precision Floating-Point
- COMISD--Compare Ordered Scalar Double-Precision Floating-Point
- UCOMISS--Unordered Compare Scalar Single-Precision Floating-Point
- UCOMISD--Unordered Compare Scalar Double-Precision Floating-Point

The COMISS instruction performs an ordered compare of the single-precision floating-point value in the low-order 32 bits of the first operand with the single-precision floating-point value in the low-order 32 bits of the second operand and sets the zero flag (ZF), parity flag (PF), and carry flag (CF) bits in the rFLAGS register to reflect the result of the compare. The OF, AF, and SF bits in rFLAGS are set to zero.

The COMISD instruction performs an analogous operation on the double-precision floating-point values in the low-order 64 bits of the source operands. The UCOMISS and UCOMISD instructions perform analogous, but unordered, compare operations. Figure 4-37 on page 206 shows a COMISD compare operation.

Figure 4-37. COMISD Compare Operation (diagram: bits 63-0 of operand 1 and operand 2 are compared, and the result is written to rFLAGS)

The difference between an ordered and unordered comparison has to do with the conditions under which a floating-point invalid-operation exception (IE) occurs. In an ordered comparison (COMISS or COMISD), an IE exception occurs if either of the source operands is either type of NaN (QNaN or SNaN). In an unordered comparison, the exception occurs only if a source operand is an SNaN. For a description of NaNs, see "Floating-Point Number Representation" on page 153. For a description of exceptions, see "Exceptions" on page 209.

4.6.7 Logical

The vector-logic instructions perform Boolean logic operations, including AND, OR, and exclusive OR.

And.
- ANDPS--Logical Bitwise AND Packed Single-Precision Floating-Point
- ANDPD--Logical Bitwise AND Packed Double-Precision Floating-Point
- ANDNPS--Logical Bitwise AND NOT Packed Single-Precision Floating-Point
- ANDNPD--Logical Bitwise AND NOT Packed Double-Precision Floating-Point

The ANDPS instruction performs a logical bitwise AND of the four packed single-precision floating-point values in the first operand and the corresponding four single-precision floating-point values in the second operand and writes the result in the destination. The ANDPD instruction performs an analogous operation on two packed double-precision floating-point values. The ANDNPS and ANDNPD instructions invert the elements of the first source vector (creating a one's complement of each element), AND them with the elements of the second source vector, and write the result to the destination.

Or.

- ORPS--Logical Bitwise OR Packed Single-Precision Floating-Point
- ORPD--Logical Bitwise OR Packed Double-Precision Floating-Point

The ORPS instruction performs a logical bitwise OR of four single-precision floating-point values in the first operand and the corresponding four single-precision floating-point values in the second operand and writes the result in the destination. The ORPD instruction performs an analogous operation on pairs of double-precision floating-point values.

Exclusive Or.

- XORPS--Logical Bitwise Exclusive OR Packed Single-Precision Floating-Point
- XORPD--Logical Bitwise Exclusive OR Packed Double-Precision Floating-Point

The XORPS instruction performs a logical bitwise exclusive OR of four single-precision floating-point values in the first operand and the corresponding four single-precision floating-point values in the second operand and writes the result in the destination. The XORPD instruction performs an analogous operation on pairs of double-precision floating-point values.
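The compare-and-write-mask and logical instructions are typically combined to select between values without branching. The following is a minimal sketch in portable C that models one 32-bit element at a time rather than executing the instructions themselves. All function names are hypothetical and chosen for illustration, and the "less than" predicate is only one of the comparisons that the immediate byte can select.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer, and back. */
uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Model of one CMPPS element with a "less than" predicate:
   the result is all 1s (TRUE) or all 0s (FALSE). */
uint32_t cmp_lt_mask(float a, float b)
{
    return (a < b) ? 0xFFFFFFFFu : 0x00000000u;
}

/* Select x where the mask is all 1s, y where it is all 0s:
   (mask AND x) OR ((NOT mask) AND y). */
float select_lane(uint32_t mask, float x, float y)
{
    uint32_t from_x = mask & float_bits(x);    /* models ANDPS  */
    uint32_t from_y = ~mask & float_bits(y);   /* models ANDNPS */
    return bits_float(from_x | from_y);        /* models ORPS   */
}

/* Four-element example: out[i] receives the smaller of a[i] and b[i]. */
void select_smaller(const float a[4], const float b[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = select_lane(cmp_lt_mask(a[i], b[i]), a[i], b[i]);
}

/* Clearing the sign bit with an AND yields the absolute value. */
float abs_lane(float x)
{
    return bits_float(float_bits(x) & 0x7FFFFFFFu);
}
```

Because the mask is either all 1s or all 0s, exactly one of the two AND terms survives in each element, so the OR reproduces the selected value bit for bit.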
4.7 Instruction Effects on Flags

The STMXCSR and LDMXCSR instructions, described in "Save and Restore State" on page 186, read and write flags in the MXCSR register. For a description of the MXCSR register, see "MXCSR Register" on page 140.

The COMISS, COMISD, UCOMISS, and UCOMISD instructions, described in "Compare" on page 202, write flag bits in the rFLAGS register. For a description of the rFLAGS register, see "Flags Register" on page 37.

4.8 Instruction Prefixes

Instruction prefixes, in general, are described in "Instruction Prefixes" on page 85. The following restrictions apply to the use of instruction prefixes with 128-bit media instructions.

4.8.1 Supported Prefixes

The following prefixes can be used with 128-bit media instructions:

- Address-Size Override--The 67h prefix affects only operands in memory. The prefix is ignored by all other 128-bit media instructions.
- Operand-Size Override--The 66h prefix is used to form the opcodes of certain 128-bit media instructions. The prefix is ignored by all other 128-bit media instructions.
- Segment Overrides--The 2Eh (CS), 36h (SS), 3Eh (DS), 26h (ES), 64h (FS), and 65h (GS) prefixes affect only operands in memory. In 64-bit mode, the contents of the CS, DS, ES, and SS segment registers are ignored.
- REP--The F2h and F3h prefixes do not function as repeat prefixes for 128-bit media instructions. Instead, they are used to form the opcodes of certain 128-bit media instructions. The prefixes are ignored by all other 128-bit media instructions.
- REX--The REX prefixes affect operands that reference a GPR or XMM register when running in 64-bit mode. They allow access to the full 64-bit width of any of the 16 extended GPRs and to any of the 16 extended XMM registers.
  The REX prefix also affects the FXSAVE and FXRSTOR instructions, in which it selects between two types of 512-byte memory-image format, as described in "Saving Media and x87 Processor State" in Volume 2. The prefix is ignored by all other 128-bit media instructions.

4.8.2 Special-Use and Reserved Prefixes

The following prefixes are used as opcode bytes in some 128-bit media instructions and are reserved in all other 128-bit media instructions:

- Operand-Size Override--The 66h prefix.
- REP--The F2h and F3h prefixes.

4.8.3 Prefixes That Cause Exceptions

The following prefixes cause an exception:

- LOCK--The F0h prefix causes an invalid-opcode exception when used with 128-bit media instructions.

4.9 Feature Detection

Before executing 128-bit media instructions, software should determine whether the processor supports the technology by executing the CPUID instruction. "Feature Detection" on page 90 describes how software uses the CPUID instruction to detect feature support. For full support of the 128-bit media instructions documented here, the following features require detection:

- SSE, indicated by bit 25 of CPUID standard function 1.
- SSE2, indicated by bit 26 of CPUID standard function 1.
- FXSAVE and FXRSTOR, indicated by bit 24 of CPUID standard function 1 and extended function 8000_0001h.

Software that runs in long mode should also check for the following support:

- Long Mode, indicated by bit 29 of CPUID extended function 8000_0001h.

See "Processor Feature Identification" in Volume 2 for a full description of the CPUID instruction and its function codes.

In addition, the operating system must support the FXSAVE and FXRSTOR instructions (by having set CR4.OSFXSR = 1), and it may wish to support SIMD floating-point exceptions (by having set CR4.OSXMMEXCPT = 1). For details, see "System-Control Registers" in Volume 2.
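The feature checks can be sketched as a small decoding routine. The following C fragment is illustrative only and does not execute CPUID itself. It assumes the caller has already obtained the EDX values returned by standard function 1 and by extended function 8000_0001h (for example, through a compiler-specific helper), and it assumes that the SSE and SSE2 bits are read from standard function 1. The structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions used by this sketch (assumed to be in EDX). */
#define FEATURE_FXSR_BIT 24u  /* FXSAVE and FXRSTOR          */
#define FEATURE_SSE_BIT  25u  /* SSE                         */
#define FEATURE_SSE2_BIT 26u  /* SSE2                        */
#define FEATURE_LM_BIT   29u  /* Long Mode (extended func.)  */

/* Hypothetical summary of the features discussed in this section. */
struct media128_features {
    bool sse;
    bool sse2;
    bool fxsr;
    bool long_mode;
};

static bool bit_is_set(uint32_t reg, unsigned bit)
{
    return ((reg >> bit) & 1u) != 0;
}

/* std_edx: EDX from CPUID standard function 1.
   ext_edx: EDX from CPUID extended function 8000_0001h. */
struct media128_features decode_media128_features(uint32_t std_edx,
                                                  uint32_t ext_edx)
{
    struct media128_features f;
    f.sse       = bit_is_set(std_edx, FEATURE_SSE_BIT);
    f.sse2      = bit_is_set(std_edx, FEATURE_SSE2_BIT);
    f.fxsr      = bit_is_set(std_edx, FEATURE_FXSR_BIT);
    f.long_mode = bit_is_set(ext_edx, FEATURE_LM_BIT);
    return f;
}

/* All three processor features are needed for full support. */
bool media128_fully_supported(const struct media128_features *f)
{
    return f->sse && f->sse2 && f->fxsr;
}
```

This covers only the processor side. Application software cannot read CR4 directly, so whether the operating system has set CR4.OSFXSR and CR4.OSXMMEXCPT has to be established by other means.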
4.10 Exceptions

Types of Exceptions. 128-bit media instructions can generate two types of exceptions:

- General-Purpose Exceptions, described below in "General-Purpose Exceptions"
- SIMD Floating-Point Exception, described below in "SIMD Floating-Point Exception Causes" on page 211

Relation to x87 Exceptions. Although the 128-bit media instructions and the x87 floating-point instructions each have certain exceptions with the same names, the exception-reporting and exception-handling methods used by the two instruction subsets are distinct and independent of each other. If procedures using both types of instructions are run in the same operating environment, separate service routines should be provided for the exceptions of each type of instruction subset.

4.10.1 General-Purpose Exceptions

The sections below list general-purpose exceptions generated and not generated by 128-bit media instructions. For a summary of the general-purpose exception mechanism, see "Interrupts and Exceptions" on page 104. For details about each exception and its potential causes, see "Exceptions and Interrupts" in Volume 2.

Exceptions Generated. The 128-bit media instructions can generate the following general-purpose exceptions:

- #DB--Debug Exception (Vector 1)
- #UD--Invalid-Opcode Exception (Vector 6)
- #NM--Device-Not-Available Exception (Vector 7)
- #DF--Double-Fault Exception (Vector 8)
- #SS--Stack Exception (Vector 12)
- #GP--General-Protection Exception (Vector 13)
- #PF--Page-Fault Exception (Vector 14)
- #MF--x87 Floating-Point Exception-Pending (Vector 16)
- #AC--Alignment-Check Exception (Vector 17)
- #MC--Machine-Check Exception (Vector 18)
- #XF--SIMD Floating-Point Exception (Vector 19)

A device-not-available exception (#NM) can occur if an attempt is made to execute a 128-bit media instruction when the task switch bit (TS) of the control register (CR0) is set to 1 (CR0.TS = 1).
An invalid-opcode exception (#UD) can occur if:

- a required CPUID feature flag is not set (see "Feature Detection" on page 209), or
- an FXSAVE or FXRSTOR instruction is executed when the floating-point software-emulation (EM) bit in control register 0 is set to 1 (CR0.EM = 1), or when the operating-system FXSAVE/FXRSTOR support bit (OSFXSR) in control register 4 is cleared to 0 (CR4.OSFXSR = 0), or
- a SIMD floating-point exception occurs when the operating-system XMM exception support bit (OSXMMEXCPT) in control register 4 is cleared to 0 (CR4.OSXMMEXCPT = 0).

Only the following 128-bit media instructions, all of which can access an MMX register, can cause an #MF exception:

- Data Conversion: CVTPD2PI, CVTPS2PI, CVTPI2PD, CVTPI2PS, CVTTPD2PI, CVTTPS2PI.
- Data Transfer: MOVDQ2Q, MOVQ2DQ.

For details on the system control-register bits, see "System-Control Registers" in Volume 2. For details on the machine-check mechanism, see "Machine Check Mechanism" in Volume 2. For details on #XF exceptions, see "SIMD Floating-Point Exception Causes" on page 211.

Exceptions Not Generated. The 128-bit media instructions do not generate the following general-purpose exceptions:

- #DE--Divide-by-zero-error exception (Vector 0)
- Non-Maskable-Interrupt Exception (Vector 2)
- #BP--Breakpoint Exception (Vector 3)
- #OF--Overflow exception (Vector 4)
- #BR--Bound-range exception (Vector 5)
- Coprocessor-segment-overrun exception (Vector 9)
- #TS--Invalid-TSS exception (Vector 10)
- #NP--Segment-not-present exception (Vector 11)

For details on all general-purpose exceptions, see "Exceptions and Interrupts" in Volume 2.
4.10.2 SIMD Floating-Point Exception Causes

The SIMD floating-point exception is the logical OR of the six floating-point exceptions (IE, DE, ZE, OE, UE, PE) that are reported (signalled) in the MXCSR register's exception flags ("MXCSR Register" on page 140). Each of these six exceptions can be either masked or unmasked by software, using the mask bits in the MXCSR register.

Exception Vectors. The SIMD floating-point exception is listed above as #XF (Vector 19) but it actually causes either an #XF exception or a #UD (Vector 6) exception, if an unmasked IE, DE, ZE, OE, UE, or PE exception is reported. The choice of exception vector is determined by the operating-system XMM exception support bit (OSXMMEXCPT) in control register 4 (CR4):

- When CR4.OSXMMEXCPT = 1, a #XF exception occurs.
- When CR4.OSXMMEXCPT = 0, a #UD exception occurs.

SIMD floating-point exceptions are precise. If an exception occurs when it is masked, the processor responds in a default way that does not invoke the SIMD floating-point exception service routine. If an exception occurs when it is unmasked, the processor suspends processing of the faulting instruction precisely and invokes the exception service routine.

Exception Types and Flags. SIMD floating-point exceptions are differentiated into six types, five of which are mandated by the IEEE 754 standard. These six types and their bit-flags in the MXCSR register are shown in Table 4-11. The causes and handling of such exceptions are described below.

Table 4-11. SIMD Floating-Point Exception Flags

Exception and Mnemonic                    MXCSR Bit(1)   Comparable IEEE 754 Exception
Invalid-operation exception (IE)          0              Invalid Operation
Denormalized-operand exception (DE)       1              none
Zero-divide exception (ZE)                2              Division by Zero
Overflow exception (OE)                   3              Overflow
Underflow exception (UE)                  4              Underflow
Precision exception (PE)                  5              Inexact

Note:
1. See "MXCSR Register" on page 140 for a summary of each exception.

The sections below describe the causes for the SIMD floating-point exceptions. The pseudocode equations in these descriptions assume logical TRUE = 1 and the following definitions:

Maxnormal
    The largest normalized number that can be represented in the destination format. This is equal to the format's largest representable finite, positive or negative value. (Normal numbers are described in "Normalized Numbers" on page 153.)

Minnormal
    The smallest normalized number that can be represented in the destination format. This is equal to the format's smallest precisely representable positive or negative value with a biased exponent of 1.

Result_infinite
    A result of infinite precision, which is representable when the width of the exponent and the width of the significand are both infinite.

Result_round
    A result, after rounding, whose unbiased exponent is infinitely wide and whose significand is the width specified for the destination format. (Rounding is described in "Floating-Point Rounding" on page 158.)

Result_round,denormal
    A result, after rounding and denormalization. (Denormalization is described in "Denormalized (Tiny) Numbers" on page 154.)

Masked and unmasked responses to the exceptions are described in "SIMD Floating-Point Exception Masking" on page 218. The priority of the exceptions is described in "SIMD Floating-Point Exception Priority" on page 216.

Invalid-Operation Exception (IE). The IE exception occurs due to one of the attempted invalid operations shown in Table 4-12 on page 214.

Table 4-12. Invalid-Operation Exception (IE) Causes

Operation: Any arithmetic operation, and CVTPS2PD, CVTPD2PS, CVTSS2SD, CVTSD2SS
Condition: A source operand is an SNaN

Operation: MAXPS, MAXPD, MAXSS, MAXSD, MINPS, MINPD, MINSS, MINSD, CMPPS, CMPPD, CMPSS, CMPSD, COMISS, COMISD
Condition: A source operand is a NaN (QNaN or SNaN)

Operation: ADDPS, ADDPD, ADDSS, ADDSD
Condition: Source operands are infinities with opposite signs

Operation: SUBPS, SUBPD, SUBSS, SUBSD
Condition: Source operands are infinities with same sign

Operation: MULPS, MULPD, MULSS, MULSD
Condition: Source operands are zero and infinity

Operation: DIVPS, DIVPD, DIVSS, DIVSD
Condition: Source operands are both infinity or both zero

Operation: SQRTPS, SQRTPD, SQRTSS, SQRTSD
Condition: Source operand is less than zero (except 0 which returns 0)

Operation: Data conversion from floating-point to integer (CVTPS2PI, CVTPD2PI, CVTSS2SI, CVTSD2SI, CVTPS2DQ, CVTPD2DQ, CVTTPS2PI, CVTTPD2PI, CVTTPD2DQ, CVTTPS2DQ, CVTTSS2SI, CVTTSD2SI)
Condition: Source operand is a NaN, infinite, or not representable in destination data type

Denormalized-Operand Exception (DE). The DE exception occurs when one of the source operands of an instruction is in denormalized form, as described in "Denormalized (Tiny) Numbers" on page 154.

Zero-Divide Exception (ZE). The ZE exception occurs when an instruction attempts to divide a non-zero finite dividend by zero.

Overflow Exception (OE). The OE exception occurs when the value of a rounded floating-point result is larger than the largest representable normalized positive or negative floating-point number in the destination format. Specifically:

    OE = Result_round > Maxnormal

An overflow can occur through computation or through conversion of higher-precision numbers to lower-precision numbers.

Underflow Exception (UE). The UE exception occurs when the value of a rounded, non-zero floating-point result is too small to be represented as a normalized positive or negative floating-point number in the destination format.
Such a result is called a tiny number, associated with the "Precision Exception (PE)" described immediately below.

If UE exceptions are masked by the underflow mask (UM) bit, a UE exception occurs only if the denormalized form of the rounded result is imprecise. Specifically:

    UE = (UM=0 and (Result_round < Minnormal)) or
         (UM=1 and (Result_round,denormal != Result_infinite))

Underflows can occur, for example, by taking the reciprocal of the largest representable number, or by converting small numbers in double-precision format to a single-precision format, or simply through repeated division. The flush-to-zero (FZ) bit in the MXCSR offers additional control of underflows that are masked. See "MXCSR Register" on page 140 for details.

Precision Exception (PE). The PE exception, also called the inexact-result exception, occurs when a rounded floating-point result differs from the infinitely precise result and thus cannot be represented precisely in the destination format. This exception is caused by--among other things--rounding of underflow or overflow results according to the rounding control (RC) field in the MXCSR, as described in "Floating-Point Rounding" on page 158.

If an overflow or underflow occurs and the OE or UE exceptions are masked by the overflow mask (OM) or underflow mask (UM) bit, a PE exception occurs only if the rounded result (for OE) or the denormalized form of the rounded result (for UE) is imprecise. Specifically:

    PE = ((Result_round,denormal or Result_round) != Result_infinite) or
         (OM=1 and (Result_round > Maxnormal)) or
         (UM=1 and (Result_round,denormal < Minnormal))

Software that does not require exact results normally masks this exception.
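The masked underflow condition can be illustrated for single-precision division with a short C sketch. It is a model of the condition, not a description of the hardware, and it rests on several assumptions: both inputs are finite, the divisor is non-zero, the quotient does not overflow, flush-to-zero is disabled, and the compiler evaluates float division in single precision. The function names are hypothetical. The exactness test relies on the fact that the product of two single-precision values is exactly representable in double precision, so comparing q * b with a shows whether the stored quotient q equals the infinitely precise one.

```c
#include <float.h>
#include <stdbool.h>

/* Magnitude of x, written out to avoid a math-library dependency. */
static float magnitude(float x)
{
    return (x < 0.0f) ? -x : x;
}

/* TRUE if a / b is a non-zero result that is stored below Minnormal
   (FLT_MIN), that is, a tiny result. */
bool div_is_tiny(float a, float b)
{
    volatile float q = a / b;   /* volatile forces a single-precision store */
    return a != 0.0f && magnitude(q) < FLT_MIN;
}

/* TRUE if the stored quotient differs from the infinitely precise one.
   q * b is exact in double precision, so the comparison is exact too. */
bool div_is_inexact(float a, float b)
{
    volatile float q = a / b;
    return (double)q * (double)b != (double)a;
}

/* Masked case (UM = 1): underflow is reported only when the
   denormalized result is also imprecise. */
bool masked_underflow(float a, float b)
{
    return div_is_tiny(a, b) && div_is_inexact(a, b);
}
```

Under these assumptions, the reciprocal of the largest representable number (1.0f / FLT_MAX) is both tiny and inexact, so it satisfies the masked underflow condition. FLT_MIN / 2.0f is tiny but exactly representable as a denormalized number, so it does not.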
4.10.3 SIMD Floating-Point Exception Priority

Table 4-13 shows the priority with which the processor recognizes multiple, simultaneous SIMD floating-point exceptions and operations involving QNaN operands. Each exception type is characterized by its timing, as follows:

- Pre-Computation--an exception that is recognized before an instruction begins its operation.
- Post-Computation--an exception that is recognized after an instruction completes its operation.

For masked (but not unmasked) post-computation exceptions, a result may be written to the destination, depending on the type of exception. Operations involving QNaNs do not necessarily cause exceptions, but the processor handles them with the priority shown in Table 4-13 relative to the handling of exceptions.

Table 4-13. Priority of SIMD Floating-Point Exceptions

Priority   Exception or Operation                                            Timing
1          Invalid-operation exception (IE) when accessing SNaN operand     Pre-Computation
2          Operation involving a QNaN operand(1)                             --
3          Any other type of invalid-operation exception (IE)                Pre-Computation
           Zero-divide exception (ZE)                                        Pre-Computation
4          Denormalized operation exception (DE)                             Pre-Computation
5          Overflow exception (OE)                                           Post-Computation
           Underflow exception (UE)                                          Post-Computation
6          Precision (inexact) exception (PE)                                Post-Computation

Note:
1. Operations involving QNaN operands do not, in themselves, cause exceptions but they are handled with this priority relative to the handling of exceptions.

Figure 4-38 on page 217 shows the prioritized procedure used by the processor to detect and report SIMD floating-point exceptions. Each of the two types of exceptions--pre-computation and post-computation--is handled independently and completely in the sequence shown. If there are no unmasked exceptions, the processor responds to masked exceptions. Because of this two-step process, up to two
exceptions--one pre-computation, one post-computation--can be caused by a single instruction.

Figure 4-38. SIMD Floating-Point Detection Process (flowchart: for each exception type and each vector element, the processor tests for pre-computation exceptions and sets the MXCSR exception flags; if any are unmasked, it invokes the exception service routine; otherwise it repeats the test for post-computation exceptions; if only masked exceptions remain, it applies the default response and continues execution)

4.10.4 SIMD Floating-Point Exception Masking

The six floating-point exception flags have corresponding exception-flag masks in the MXCSR register, as shown in Table 4-14.

Table 4-14. SIMD Floating-Point Exception Masks

Exception Mask and Mnemonic                   MXCSR Bit   Comparable IEEE 754 Exception
Invalid-operation exception mask (IM)         7           Invalid Operation
Denormalized-operand exception mask (DM)      8           none
Zero-divide exception mask (ZM)               9           Division by Zero
Overflow exception mask (OM)                  10          Overflow
Underflow exception mask (UM)                 11          Underflow
Precision exception mask (PM)                 12          Inexact

Each mask bit, when set to 1, inhibits invocation of the exception handler for that exception and instead causes a default response. Thus, an unmasked exception is one that invokes its exception handler when it occurs, whereas a masked exception continues normal execution using the default response for the exception type. During power-on initialization, all exception-mask bits in the MXCSR register are set to 1 (masked).

Masked Responses. The occurrence of a masked exception does not invoke its exception handler when the exception condition occurs. Instead, the processor handles masked exceptions in a default way, as shown in Table 4-15 on page 219.
Table 4-15. Masked Responses to SIMD Floating-Point Exceptions

Invalid-operation exception (IE)

  Operation(1): Any of the following, in which one or both operands is an SNaN:
    * Addition (ADDPS, ADDPD, ADDSS, ADDSD), or
    * Subtraction (SUBPS, SUBPD, SUBSS, SUBSD), or
    * Multiplication (MULPS, MULPD, MULSS, MULSD), or
    * Division (DIVPS, DIVPD, DIVSS, DIVSD), or
    * Square-root (SQRTPS, SQRTPD, SQRTSS, SQRTSD), or
    * Data conversion of floating-point to floating-point (CVTPS2PD, CVTPD2PS, CVTSS2SD, CVTSD2SS).
  Processor Response(2): Return a QNaN, based on the rules in Table 4-6 on page 156.

  Operation(1): Any of the following:
    * Addition of infinities with opposite sign (ADDPS, ADDPD, ADDSS, ADDSD), or
    * Subtraction of infinities with same sign (SUBPS, SUBPD, SUBSS, SUBSD), or
    * Multiplication of zero by infinity (MULPS, MULPD, MULSS, MULSD), or
    * Division of zero by zero or infinity by infinity (DIVPS, DIVPD, DIVSS, DIVSD), or
    * Square-root in which the operand is non-zero negative (SQRTPS, SQRTPD, SQRTSS, SQRTSD).
  Processor Response(2): Return the floating-point indefinite value.

  Operation(1): Any of the following, in which one or both operands is a NaN:
    * Maximum or Minimum (MAXPS, MAXPD, MAXSS, MAXSD, MINPS, MINPD, MINSS, MINSD), or
    * Compare (CMPPS, CMPPD, CMPSS, CMPSD, COMISS, COMISD).
  Processor Response(2): Return second source operand.

  Operation(1): Compare, in which one or both operands is a NaN (CMPPS, CMPPD, CMPSS, CMPSD).
  Processor Response(2):
    * Compare is unordered or not-equal: Return mask of all 1s.
    * All other compares: Return mask of all 0s.

  Operation(1): Ordered or unordered scalar compare, in which one or both operands is a NaN (COMISS, COMISD, UCOMISS, UCOMISD).
  Processor Response(2): Set the zero (ZF), parity (PF), and carry (CF) flags in rFLAGS. Clear the overflow (OF), sign (SF), and auxiliary carry (AF) flags in rFLAGS.

  Operation(1): Data conversion from floating-point to integer, in which source operand is a NaN, infinity, or is larger than the representable value of the destination (CVTPS2PI, CVTPD2PI, CVTSS2SI, CVTSD2SI, CVTPS2DQ, CVTPD2DQ, CVTTPS2PI, CVTTPD2PI, CVTTPD2DQ, CVTTPS2DQ, CVTTSS2SI, CVTTSD2SI).
  Processor Response(2): Return the integer indefinite value.

Denormalized-operand exception (DE)

  Operation(1): One or both operands is denormal.
  Processor Response(2): Return the result using the denormal operand(s).

Zero-divide exception (ZE)

  Operation(1): Divide (DIVx) by zero with non-zero finite dividend.
  Processor Response(2): Return signed infinity, with sign bit = XOR of the operand sign bits.

Overflow exception (OE)

  Operation(1): Overflow when rounding mode = round to nearest.
  Processor Response(2):
    * Sign of result is positive: Return +infinity.
    * Sign of result is negative: Return -infinity.

  Operation(1): Overflow when rounding mode = round toward +infinity.
  Processor Response(2):
    * Sign of result is positive: Return +infinity.
    * Sign of result is negative: Return finite negative number with largest magnitude.

  Operation(1): Overflow when rounding mode = round toward -infinity.
  Processor Response(2):
    * Sign of result is positive: Return finite positive number with largest magnitude.
    * Sign of result is negative: Return -infinity.

  Operation(1): Overflow when rounding mode = round toward 0.
  Processor Response(2):
    * Sign of result is positive: Return finite positive number with largest magnitude.
    * Sign of result is negative: Return finite negative number with largest magnitude.

Underflow exception (UE)

  Operation(1): Inexact denormalized result.
  Processor Response(2):
    * MXCSR flush-to-zero (FZ) bit = 0: Set PE flag and return denormalized result.
    * MXCSR flush-to-zero (FZ) bit = 1: Set PE flag and return zero, with sign of true result.(3)

Precision exception (PE)

  Operation(1): Inexact normalized or denormalized result.
  Processor Response(2):
    * Without OE or UE exception: Return rounded result.
    * With masked OE or UE exception: Respond as for OE or UE exception.
    * With unmasked OE or UE exception: Respond as for OE or UE exception, and invoke SIMD exception handler.

Notes:
1. For complete details about operations, see "SIMD Floating-Point Exception Causes" on page 211.
2. In all cases, the processor sets the associated exception flag in MXCSR. For details about number representation, see "Floating-Point Number Representation" on page 153 and "Floating-Point Number Encodings" on page 156.
3. This response does not comply with the IEEE 754 standard, but it offers higher performance.

Unmasked Responses. If the processor detects an unmasked exception, it sets the associated exception flag in the MXCSR register and invokes the SIMD floating-point exception handler. The processor does not write a result or change any of the source operands for any type of unmasked exception. The exception handler must determine which exception occurred (by examining the exception flags in the MXCSR register) and take appropriate action.
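A handler's first step, examining the MXCSR exception flags against the mask bits, can be sketched in C as follows. This is an illustration under stated assumptions rather than a complete service routine: it operates on an MXCSR value passed in by the caller (for example, one stored by STMXCSR), it uses the flag positions of Table 4-11 and the mask positions of Table 4-14, and the names are hypothetical. The second function models the vector choice governed by CR4.OSXMMEXCPT, which is taken here as a plain parameter because application code cannot read CR4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Exception flags occupy MXCSR bits 0-5 (Table 4-11). */
enum {
    MXCSR_IE = 1u << 0,  /* invalid operation    */
    MXCSR_DE = 1u << 1,  /* denormalized operand */
    MXCSR_ZE = 1u << 2,  /* zero divide          */
    MXCSR_OE = 1u << 3,  /* overflow             */
    MXCSR_UE = 1u << 4,  /* underflow            */
    MXCSR_PE = 1u << 5   /* precision (inexact)  */
};

#define MXCSR_FLAGS_ALL  0x3Fu
#define MXCSR_MASK_SHIFT 7   /* masks occupy bits 7-12 (Table 4-14) */

/* Return the flags that are set while their mask bit is 0. */
uint32_t mxcsr_unmasked_flags(uint32_t mxcsr)
{
    uint32_t flags = mxcsr & MXCSR_FLAGS_ALL;
    uint32_t masks = (mxcsr >> MXCSR_MASK_SHIFT) & MXCSR_FLAGS_ALL;
    return flags & ~masks;
}

/* Vector taken for an unmasked exception: 19 (#XF) when
   CR4.OSXMMEXCPT = 1, 6 (#UD) when it is 0, or -1 when nothing
   unmasked is flagged. */
int simd_fp_vector(uint32_t mxcsr, bool osxmmexcpt)
{
    if (mxcsr_unmasked_flags(mxcsr) == 0)
        return -1;
    return osxmmexcpt ? 19 : 6;
}
```

With all six mask bits set, as they are after power-on initialization (bits 7-12, or 1F80h), no flag can appear in the unmasked set, so the default responses of Table 4-15 apply.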
In all cases of unmasked exceptions, before calling the exception handler, the processor examines the CR4.OSXMMEXCPT bit to see if it is set to 1. If it is set, the processor calls the #XF exception (vector 19). If it is cleared, the processor calls the #UD exception (vector 6). See "System-Control Registers" in Volume 2 for details.

For details about the operations that can cause unmasked exceptions, see "SIMD Floating-Point Exception Causes" on page 211 and Table 4-15.

Using NaNs in IE Diagnostic Exceptions. Both SNaNs and QNaNs can be encoded with many different values to carry diagnostic information. By means of appropriate masking and unmasking of the invalid-operation exception (IE), software can use signaling NaNs to invoke an exception handler. Within the constraints imposed by the encoding of SNaNs and QNaNs, software may freely assign the bits in the significand of a NaN. See "Not a Number (NaN)" on page 155 for format details.

For example, software can pre-load each element of an array with a signaling NaN that encodes the array index. When an application accesses an uninitialized array element, the invalid-operation exception is invoked and the service routine can identify that element. A service routine can store debug information in memory as the exceptions occur. The routine can create a QNaN that references its associated debug area in memory. As the program runs, the service routine can create a different QNaN for each error condition, so that a single test-run can identify a collection of errors.

4.11 Saving, Clearing, and Passing State

4.11.1 Saving and Restoring State

In general, system software should save and restore 128-bit media state between task switches or other interventions in the execution of 128-bit media procedures. Virtually all modern operating systems running on x86 processors--like Windows NT(R), UNIX, and OS/2--are preemptive multitasking operating systems that handle such saving and restoring of state properly across task switches, independently of hardware task-switch support. However, application procedures are also free to save and restore 128-bit media state at any time they deem useful.

Software running at any privilege level may save and restore 128-bit media state by executing the FXSAVE instruction, which saves not only 128-bit media state but also x87 floating-point state. Alternatively, software may use multiple move instructions for saving only the contents of selected 128-bit media data registers, or the STMXCSR instruction for saving the MXCSR register state. For details, see "Save and Restore State" on page 186.

4.11.2 Parameter Passing

128-bit media procedures can use MOVx instructions to pass data to other such procedures. This can be done directly, via the XMM registers, or indirectly by storing data on the procedure stack. When storing to the stack, software should use the rSP register for the memory address and, after the save, explicitly decrement rSP by 16 for each 128-bit XMM register parameter stored on the stack. Likewise, to load a 128-bit XMM register from the stack, software should increment rSP by 16 after the load. There is a choice of MOVx instructions designed for aligned and unaligned moves, as described in "Data Transfer" on page 162 and "Data Transfer" on page 187.

The processor does not check the data type of instruction operands prior to executing instructions. It only checks them at the point of execution.
For example, if the processor executes an arithmetic instruction that takes double-precision operands but is provided with single-precision operands by MOVx instructions, the processor will first convert the operands from single precision to double precision prior to executing the arithmetic operation, and the result will be correct. However, the required conversion may cause degradation of performance.

Because of this possibility of data-type mismatching between MOVx instructions used to pass parameters and the instructions in the called procedure that subsequently operate on the moved data, the calling procedure should save its own state prior to the call. The called procedure cannot determine the caller's data types, and thus it cannot optimize its choice of instructions for storing a caller's state. For further information, see the software optimization documentation for particular hardware implementations.

4.11.3 Accessing Operands in MMX(TM) Registers

Software may freely mix 128-bit media instructions (integer or floating-point) with 64-bit media instructions (integer or floating-point) and general-purpose instructions in a single procedure. There are no restrictions on transitioning from 128-bit media procedures to x87 procedures, except when a 128-bit media procedure accesses an MMX register by means of a data-transfer or data-conversion instruction. In such cases, software should separate such procedures or dynamic link libraries (DLLs) from x87 floating-point procedures or DLLs by clearing the MMX state with the EMMS instruction, as described in "Exit Media State" on page 247. For further details, see "Mixing Media Code with x87 Code" on page 278.
4.12 Performance Considerations

In addition to typical code optimization techniques, such as those affecting loops and the inlining of function calls, the following considerations may help improve the performance of application programs written with 128-bit media instructions.

These are implementation-independent performance considerations. Other considerations depend on the hardware implementation. For information about such implementation-dependent considerations and for more information about application performance in general, see the data sheets and the software-optimization guides relating to particular hardware implementations.

4.12.1 Use Small Operand Sizes

The performance advantages available with 128-bit media operations are to some extent a function of the data sizes operated upon. The smaller the data size, the more data elements can be packed into a single 128-bit vector. The parallelism of computation increases as the number of elements per vector increases.

4.12.2 Reorganize Data for Parallel Operations

Much of the performance benefit from the 128-bit media instructions comes from the parallelism inherent in vector operations. It can be advantageous to reorganize data before performing arithmetic operations so that its layout after reorganization maximizes the parallelism of the arithmetic operations.

The speed of memory access is particularly important for certain types of computation, such as graphics rendering, that depend on the regularity and locality of data-memory accesses. For example, in matrix operations, performance is high when operating on the rows of the matrix, because row bytes are contiguous in memory, but lower when operating on the columns of the matrix, because column bytes are not contiguous in memory and accessing them can result in cache misses. To improve performance for operations on such columns, the matrix should first be transposed. Such transpositions can, for example, be done using a sequence of unpacking or shuffle instructions.

4.12.3 Remove Branches

Branches can be replaced with 128-bit media instructions that simulate predicated execution or conditional moves, as described in "Branch Removal" on page 137. Figure 4-10 on page 138 shows an example of a non-branching sequence that implements a two-way multiplexer.

Where possible, break long dependency chains into several shorter dependency chains which can be executed in parallel. This is especially important for floating-point instructions because of their longer latencies.

4.12.4 Use Streaming Stores

The MOVNTDQ and MASKMOVDQU instructions store streaming (non-temporal) data to memory. These instructions indicate to the processor that the data they reference will be used only once and is therefore not subject to cache-related overhead (such as write-allocation). A typical case benefiting from streaming stores occurs when data written by the processor is never read by the processor, such as data written to a graphics frame buffer.

4.12.5 Align Data

Data alignment is particularly important for performance when data written by one instruction is read by a subsequent instruction soon after the write, or when accessing streaming (non-temporal) data. These cases may occur frequently in 128-bit media procedures.

Accesses to data stored at unaligned locations may benefit from on-the-fly software alignment or from repetition of data at different alignment boundaries, as required by different loops that process the data.

4.12.6 Organize Data for Cacheability

Pack small data structures into cache-line-size blocks. Organize frequently accessed constants and coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available memory bandwidth.

For data that will be used only once in a procedure, consider using non-cacheable memory. Accesses to such memory are not burdened by the overhead of cache protocols.

4.12.7 Prefetch Data

Media applications typically operate on large data sets. Because of this, they make intensive use of the memory bus. Memory latency can be substantially reduced--especially for data that will be used only once--by prefetching such data into various levels of the cache hierarchy. Software can use the PREFETCHx instructions very effectively in such cases, as described in "Cache and Memory Management" on page 79.

Some of the best places to use prefetch instructions are inside loops that process large amounts of data. If the loop goes through less than one cache line of data per iteration, partially unroll the loop. Try to use virtually all of the prefetched data. This usually requires unit-stride memory accesses--those in which all accesses are to contiguous memory locations. Exactly one PREFETCHx instruction per cache line must be used.

4.12.8 Use 128-Bit Media Code for Moving Data

Movements of data between memory, GPR, XMM, and MMX registers can take advantage of the parallel vector operations supported by the 128-bit media MOVx instructions. Figure 4-6 on page 134 illustrates the range of move operations available.

4.12.9 Retain Intermediate Results in XMM Registers

Keep intermediate results in the XMM registers as much as possible, especially if the intermediate results are used shortly after they have been produced. Avoid spilling intermediate results to memory and reusing them shortly thereafter. In 64-bit mode, the architecture's 16 XMM registers offer twice the number of legacy XMM registers.
4.12.10 Replace GPR Code with 128-Bit Media Code

In 64-bit mode, the AMD64 architecture provides twice the number of general-purpose registers (GPRs) as the legacy x86 architecture, thereby reducing potential pressure on GPRs. Nevertheless, general-purpose instructions do not operate in parallel on vectors of elements, as do 128-bit media instructions. Thus, 128-bit media code supports parallel operations and can perform better with algorithms and data that are organized for parallel operations.

4.12.11 Replace x87 Code with 128-Bit Media Code

One of the most useful advantages of 128-bit media instructions is the ability to intermix integer and floating-point instructions in the same procedure, using a register set that is separate from the GPR, MMX, and x87 register sets. Code written with 128-bit media floating-point instructions can operate in parallel on four times as many single-precision floating-point operands as can x87 floating-point code. This achieves potentially four times the computational work of x87 instructions that take single-precision operands. Also, the higher density of 128-bit media floating-point operands may make it possible to remove local temporary variables that would otherwise be needed in x87 floating-point code. 128-bit media code is also easier to write than x87 floating-point code, because the XMM register file is flat, rather than stack-oriented, and in 64-bit mode there are twice the number of XMM registers as x87 registers. Moreover, when integer and floating-point instructions must be used together, 128-bit media floating-point instructions avoid the potential need to save and restore state between integer operations and floating-point procedures.
5 64-Bit Media Programming

This chapter describes the 64-bit media programming model. This model includes all instructions that access the MMX(TM) registers, including the MMX and 3DNow!(TM) instructions plus some SSE and SSE2 instructions.

The 64-bit media instructions perform integer and floating-point operations primarily on vector operands (a few of the instructions take scalar operands). The MMX integer operations produce signed, unsigned, and/or saturating results. The 3DNow! floating-point operations take single-precision operands and produce saturating results without generating floating-point exceptions. The instructions that take vector operands can speed up certain types of procedures by significant factors, depending on data-element size and the regularity and locality of data accesses to memory.

The term 64-bit is used in two different contexts within the AMD64 architecture: the 64-bit media instructions, described in this chapter, and the 64-bit operating mode, described in "64-Bit Mode" on page 8.

5.1 Origins

The 64-bit media instructions were introduced in the following extensions to the legacy x86 architecture:
MMX Instructions--These are mostly integer instructions that operate primarily on vector operands in 64-bit MMX registers or memory locations.
3DNow! Instructions--These are mostly floating-point instructions that operate primarily on vector operands in MMX registers or memory locations.
SSE and SSE2 Instructions--These are the streaming SIMD extension (SSE) and SSE2 instructions. Some of them perform conversions between operands in the 64-bit MMX register set and other register sets.

For details on the extension-set origin of each instruction, see "Instruction Subsets and CPUID Feature Sets" in Volume 3.
5.2 Compatibility

64-bit media instructions can be executed in any of the architecture's operating modes. Existing MMX and 3DNow! binary programs run in legacy and compatibility modes without modification. The support provided by the AMD64 architecture for such binaries is identical to that provided by legacy x86 architectures.

To run in 64-bit mode, 64-bit media programs must be recompiled. The recompilation has no side effects on such programs, other than to make available the extended general-purpose registers and 64-bit virtual address space.

The MMX and 3DNow! instructions introduce no additional registers, status bits, or other processor state to the legacy x86 architecture. Instead, they use the x87 floating-point registers that have long been a part of most x86 architectures. Because of this, 64-bit media procedures require no special operating-system support or exception handlers. When state-saves are required between procedures, the same instructions that system software uses to save and restore x87 floating-point state also save and restore the 64-bit media-programming state.

5.3 Capabilities

The 64-bit media instructions are designed to support multimedia and communication applications that operate on vectors of small-sized data elements. For example, 8-bit and 16-bit integer data elements are commonly used for pixel information in graphics applications, and 16-bit integer data elements are used for audio sampling. The 64-bit media instructions allow multiple data elements like these to be packed into single 64-bit vector operands located in an MMX register or in memory. The instructions operate in parallel on each of the elements in these vectors. For example, 8-bit integer data can be packed in vectors of eight elements in a single 64-bit register, so that all eight byte elements are operated on simultaneously by a single instruction.
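The packing described above can be modeled in portable C. The sketch below is only an emulation for illustration, not how the hardware is programmed: it treats a 64-bit value as eight independent byte lanes and adds them lane by lane with wraparound, which is the behavior of a non-saturating packed byte add such as PADDB. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: adds eight unsigned bytes packed in a 64-bit value,
   lane by lane. Each lane wraps modulo 256, and no carry crosses from one
   lane into the next. A 64-bit media instruction does all eight lanes at
   once; the loop here only models the result. */
uint64_t packed_add_bytes(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        uint8_t sum = (uint8_t)(x + y);          /* wraps within the lane */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```

For example, adding 0102030405060708h and 1010101010101010h gives 1112131415161718h, while adding 01h to a low byte of FFh leaves that lane at 00h without disturbing the neighboring lane.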
Typical applications of the 64-bit media integer instructions include music synthesis, speech synthesis, speech recognition, audio and video compression (encoding) and decompression (decoding), 2D and 3D graphics (including 3D texture mapping), and streaming video. Typical applications of the 64-bit media floating-point instructions include digital signal processing (DSP) kernels and front-end 3D graphics algorithms, such as geometry, clipping, and lighting.

These types of applications are referred to as media applications. Such applications commonly use small data elements in repetitive loops, in which the typical operations are inherently parallel. In 256-color video applications, for example, 8-bit operands in 64-bit MMX registers can be used to compute transformations on eight pixels per instruction.

5.3.1 Parallel Operations

Most of the 64-bit media instructions perform parallel operations on vectors of operands. Vector operations are also called packed or SIMD (single-instruction, multiple-data) operations. They take operands consisting of multiple elements and operate on all elements in parallel. Figure 5-1 shows an example of an integer operation on two vectors, each containing 16-bit (word) elements. There are also 64-bit media instructions that operate on vectors of byte or doubleword elements.

Figure 5-1. Parallel Integer Operations on Elements of Vectors (operand 1 and operand 2, bits 63-0, combined element by element by op into the result, bits 63-0)

5.3.2 Data Conversion and Reordering

The 64-bit media instructions support conversions of various integer data types to floating-point data types, and vice versa. There are also instructions that reorder vector elements or change the bit-width of vector elements. For example, the unpack instructions take two vector operands and interleave their low or high elements.
Figure 5-2 on page 232 shows an unpack operation (PUNPCKLWD) that interleaves low-order elements of each source operand. If each element of operand 2 has the value zero, the operation zero-extends each element of operand 1 to twice its original width. This may be useful, for example, prior to an arithmetic operation in which the data-conversion result must be paired with another source operand containing vector elements that are twice the width of the pre-conversion (half-size) elements. There are also pack instructions that convert each element of 2x size in a pair of vectors to elements of 1x size, with saturation at maximum and minimum values.

Figure 5-2. Unpack and Interleave Operation (operand 1 and operand 2, bits 63-0, interleaved into the result, bits 63-0)

Figure 5-3 on page 233 shows a shuffle operation (PSHUFW), in which one of the operands provides vector data, and an immediate byte provides shuffle control for up to 256 permutations of the data.

Figure 5-3. Shuffle Operation (1 of 256) (operand 1 and operand 2, bits 63-0, and the result, bits 63-0)

5.3.3 Matrix Operations

Media applications often multiply and accumulate vector and matrix data. In 3D graphics applications, for example, objects are typically represented by triangles, each of whose vertices are located in 3D space by a matrix of coordinate values, and matrix transforms are performed to simulate object movement.

64-bit media integer and floating-point instructions can perform several types of matrix-vector or matrix-matrix operations, such as addition, subtraction, multiplication, and accumulation. The integer instructions can also perform multiply-accumulate operations.
Efficient matrix multiplication is further supported with instructions that can first transpose the elements of matrix rows and columns. These transpositions can make subsequent accesses to memory or cache more efficient when performing arithmetic matrix operations.

Figure 5-4 on page 234 shows a vector multiply-add instruction (PMADDWD) that multiplies vectors of 16-bit integer elements to yield intermediate results of 32-bit elements, which are then summed pair-wise to yield two 32-bit elements.

Figure 5-4. Multiply-Add Operation (the four word elements of operand 1 and operand 2, bits 63-0, are multiplied into a 128-bit intermediate result, bits 127-0, whose doubleword elements are added pair-wise into the result, bits 63-0)

The operation shown in Figure 5-4 can be used together with transpose and vector-add operations (see "Addition" on page 255) to accumulate dot product results (also called inner or scalar products), which are used in many media algorithms.

5.3.4 Saturation

Several of the 64-bit media integer instructions and most of the 64-bit media floating-point instructions produce vector results in which each element saturates independently of the other elements in the result vector. Such results are clamped (limited) to the maximum or minimum value representable by the destination data type when the true result exceeds that maximum or minimum representable value.

Saturation avoids the need for code that tests for potential overflow or underflow. Saturating data is useful for representing physical-world data, such as sound and color. It is used, for example, when combining values for pixel coloring.

5.3.5 Branch Removal

Branching is a time-consuming operation that, unlike most 64-bit media vector operations, does not exhibit parallel behavior (there is only one branch target, not multiple targets, per branch instruction). In many media applications, a branch involves selecting between only a few (often only two) cases.
Such branches can be replaced with 64-bit media vector compare and vector logical instructions that simulate predicated execution or conditional moves. Figure 5-5 shows an example of a non-branching sequence that implements a two-way multiplexer--one that is equivalent to the ternary operator "?:" in C and C++. The comparable code sequence is explained in "Compare and Write Mask" on page 262.

The sequence in Figure 5-5 begins with a vector compare instruction that compares the elements of two source operands in parallel and produces a mask vector containing elements of all 1s or 0s. This mask vector is ANDed with one source operand and ANDed-Not with the other source operand to isolate the desired elements of both operands. These results are then ORed to select the relevant elements from each operand. A similar branch-removal operation can be done using floating-point source operands.

Figure 5-5. Branch-Removal Sequence (operand 1 = a3 a2 a1 a0 and operand 2 = b3 b2 b1 b0; Compare yields the mask FFFF 0000 0000 FFFF; And yields a3 0000 0000 a0; And-Not yields 0000 b2 b1 0000; Or yields a3 b2 b1 a0)

5.3.6 Floating-Point (3DNow!(TM)) Vector Operations

Floating-point vector instructions using the MMX registers were introduced by AMD with the 3DNow! technology. These instructions take 64-bit vector operands consisting of two 32-bit single-precision floating-point numbers, shown as FP single in Figure 5-6.

Figure 5-6. Floating-Point (3DNow!(TM) Instruction) Operations (two source operands, each holding FP single elements in bits 63-32 and 31-0, combined element by element by op into a result with the same layout)

The AMD64 architecture's 3DNow!
floating-point instructions provide a unique advantage over legacy x87 floating-point instructions: They allow integer and floating-point instructions to be intermixed in the same procedure, using only the MMX registers. This avoids the need to switch between integer MMX procedures and x87 floating-point procedures--a switch that may involve time-consuming state saves and restores--while at the same time leaving the 128-bit XMM register resources free for other applications.

The 3DNow! instructions allow applications such as 3D graphics to accelerate front-end geometry, clipping, and lighting calculations. Picture and pixel data are typically integer data types, although both integer and floating-point instructions are often required to operate completely on the data. For example, software can change the viewing perspective of a 3D scene through transformation matrices by using floating-point instructions in the same procedure that contains integer operations on other aspects of the graphics data.

3DNow! programs typically perform better than x87 floating-point code, because the MMX register file is flat rather than stack-oriented and because 3DNow! instructions can operate on twice as many operands as x87 floating-point instructions. This ability to operate in parallel on twice as many floating-point values in the same register space often makes it possible to remove local temporary variables in 3DNow! code that would otherwise be needed in x87 floating-point code.

5.4 Registers

5.4.1 MMX(TM) Registers

Eight 64-bit MMX registers, mmx0-mmx7, support the 64-bit media instructions. Figure 5-7 shows these registers. They can hold operands for both vector and scalar operations on integer (MMX) and floating-point (3DNow!) data types.

Figure 5-7. 64-bit Media Registers (MMX(TM) registers mmx0 through mmx7, each with bits 63-0)

The MMX registers are mapped onto the low 64 bits of the 80-bit x87 floating-point physical data registers, FPR0-FPR7, described in "Registers" on page 287. However, the x87 stack register structure, ST(0)-ST(7), is not used by MMX instructions. The x87 tag bits, top-of-stack pointer (TOP), and high bits of the 80-bit FPR registers are changed when 64-bit media instructions are executed. For details about the x87-related actions performed by hardware during execution of 64-bit media instructions, see "Actions Taken on Executing 64-Bit Media Instructions" on page 276.

5.4.2 Other Registers

Some 64-bit media instructions that perform data transfer, data conversion or data reordering operations ("Data Transfer" on page 248, "Data Conversion" on page 250, and "Data Conversion" on page 266) can access operands in the general-purpose registers (GPRs) or XMM registers. When addressing GPRs or XMM registers in 64-bit mode, the REX instruction prefix can be used to access the extended GPRs or XMM registers, as described in "REX Prefixes" on page 89.

For a description of the GPR registers, see "Registers" on page 27. For a description of the XMM registers, see "XMM Registers" on page 139.

5.5 Operands

Operands for a 64-bit media instruction are either referenced by the instruction's opcode or included as an immediate value in the instruction encoding. Depending on the instruction, referenced operands can be located in registers or memory. The data types of these operands include vector and scalar integer, and vector floating-point.

5.5.1 Data Types

Figure 5-8 on page 239 shows the register images of the 64-bit media data types. These data types can be interpreted by instruction syntax and/or the software context as one of the following types of values:
Vector (packed) single-precision (32-bit) floating-point numbers.
Vector (packed) signed (two's-complement) integers.
Vector (packed) unsigned integers.
Scalar signed (two's-complement) integers.
Scalar unsigned integers.

Hardware does not check or enforce the data types for instructions. Software is responsible for ensuring that each operand for an instruction is of the correct data type. Software can interpret the data types in ways other than those shown in Figure 5-8 on page 239--such as bit fields or fractional numbers--but the 64-bit media instructions do not directly support such interpretations and software must handle them entirely on its own.

Figure 5-8. 64-Bit Media Data Types (panels: Vector (Packed) Single-Precision Floating-Point; Vector (Packed) Signed Integers; Vector (Packed) Unsigned Integers; Signed Integers; Unsigned Integers. Elements are bytes, words, doublewords, or a quadword within bits 63-0.)

5.5.2 Operand Sizes and Overrides

Operand sizes for 64-bit media instructions are determined by instruction opcodes. Some of these opcodes include an operand-size override prefix, but this prefix acts in a special way to modify the opcode and is considered an integral part of the opcode. The general use of the 66h operand-size override prefix described in "Instruction Prefixes" on page 85 does not apply to 64-bit media instructions.

For details on the use of operand-size override prefixes in 64-bit media instructions, see the opcodes in "64-Bit Media Instruction Reference" in Volume 5.
5.5.3 Operand Addressing

Depending on the 64-bit media instruction, referenced operands may be in registers or memory.

Register Operands. Most 64-bit media instructions can access source and destination operands located in MMX registers. A few of these instructions access the XMM or GPR registers. When addressing GPR or XMM registers in 64-bit mode, the REX instruction prefix can be used to access the extended GPR or XMM registers, as described in "Instruction Prefixes" on page 272.

The 64-bit media instructions do not access the rFLAGS register, and none of the bits in that register are affected by execution of the 64-bit media instructions.

Memory Operands. Most 64-bit media instructions can read memory for source operands, and a few of the instructions can write results to memory. "Memory Addressing" on page 16 describes the general methods and conditions for addressing memory operands.

Immediate Operands. Immediate operands are used in certain data-conversion and vector-shift instructions. Such instructions take 8-bit immediates, which provide control for the operation.

I/O Ports. I/O ports in the I/O address space cannot be directly addressed by 64-bit media instructions, and although memory-mapped I/O ports can be addressed by such instructions, doing so may produce unpredictable results, depending on the hardware implementation of the architecture. See the data sheet or software-optimization documentation for particular hardware implementations.

5.5.4 Data Alignment

Those 64-bit media instructions that access a 128-bit operand in memory incur a general-protection exception (#GP) if the operand is not aligned to a 16-byte boundary. These instructions include:
CVTPD2PI--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers.
CVTTPD2PI--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers, Truncated.
FXRSTOR--Restore XMM, MMX, and x87 State.
FXSAVE--Save XMM, MMX, and x87 State.

For other 64-bit media instructions, the architecture does not impose data-alignment requirements for accessing 64-bit media data in memory. Specifically, operands in physical memory do not need to be stored at addresses which are even multiples of the operand size, in bytes. However, the consequence of storing operands at unaligned locations is that accesses to those operands may require more processor and bus cycles than for aligned accesses. See "Data Alignment" on page 47 for details.

5.5.5 Integer Data Types

Most of the MMX instructions support operations on the integer data types shown in Figure 5-8. These instructions are summarized in "Instruction Summary--Integer Instructions" on page 245. The characteristics of these data types are described below.

Sign. Many of the 64-bit media instructions have variants for operating on signed or unsigned integers. For signed integers, the sign bit is the most-significant bit--bit 7 for a byte, bit 15 for a word, bit 31 for a doubleword, or bit 63 for a quadword. Arithmetic instructions that are not specifically named as unsigned perform signed two's-complement arithmetic.

Maximum and Minimum Representable Values. Table 5-1 on page 242 shows the range of representable values for the integer data types.

Table 5-1. Range of Values in 64-Bit Media Integer Data Types

Data-Type Interpretation              | Byte              | Word                | Doubleword                   | Quadword
Unsigned integers, base-2 (exact)     | 0 to +2^8-1       | 0 to +2^16-1        | 0 to +2^32-1                 | 0 to +2^64-1
Unsigned integers, base-10 (approx.)  | 0 to 255          | 0 to 65,535         | 0 to 4.29 * 10^9             | 0 to 1.84 * 10^19
Signed integers(1), base-2 (exact)    | -2^7 to +(2^7-1)  | -2^15 to +(2^15-1)  | -2^31 to +(2^31-1)           | -2^63 to +(2^63-1)
Signed integers(1), base-10 (approx.) | -128 to +127      | -32,768 to +32,767  | -2.14 * 10^9 to +2.14 * 10^9 | -9.22 * 10^18 to +9.22 * 10^18

Saturation.
Saturating (also called limiting or clamping) instructions limit the value of a result to the maximum or minimum value representable by the destination data type. Saturating versions of integer vector-arithmetic instructions operate on byte-sized and word-sized elements. These instructions--for example, PADDSx, PADDUSx, PSUBSx, and PSUBUSx--saturate signed or unsigned data independently for each element in a vector when the element reaches its maximum or minimum representable value. Saturation avoids overflow or underflow errors.

The examples in Table 5-2 illustrate saturating and nonsaturating results with word operands. Saturation for other data-type sizes follows similar rules. Once saturated, the saturated value is treated like any other value of its type. For example, if 0001h is subtracted from the saturated value, 7FFFh, the result is 7FFEh.

Table 5-2. Saturation Examples

Operation        Non-Saturated Infinitely   Saturated        Saturated
                 Precise Result             Signed Result    Unsigned Result
7000h + 2000h    9000h                      7FFFh            9000h
7000h + 7000h    E000h                      7FFFh            E000h
F000h + F000h    1E000h                     E000h            FFFFh
9000h + 9000h    12000h                     8000h            FFFFh
7FFFh + 0100h    80FFh                      7FFFh            80FFh
7FFFh + FF00h    17EFFh                     7EFFh            FFFFh

Arithmetic instructions not specifically designated as saturating perform non-saturating, two's-complement arithmetic.

Rounding. There is a rounding version of the integer vector-multiply instruction, PMULHRW, that multiplies pairs of signed-integer word elements and then adds 8000h to the lower word of the doubleword result, thus rounding the high-order word which is returned as the result.

Other Fixed-Point Operands. The architecture provides specific support only for integer fixed-point operands--those in which an implied binary point is located to the right of bit 0. Nevertheless, software may use fixed-point operands in which the implied binary point is located in any position.
In such cases, software is responsible for managing the interpretation of such implied binary points, as well as any redundant sign bits that may occur during multiplication.

5.5.6 Floating-Point Data Types

All 64-bit media 3DNow! instructions, except PFRCP and PFRSQRT, take 64-bit vector operands. They operate in parallel on two single-precision (32-bit) floating-point values contained in those vectors. Figure 5-9 shows the format of the vector operands. The characteristics of the single-precision floating-point data types are described below. The 64-bit floating-point media instructions are summarized in "Instruction Summary--Floating-Point Instructions" on page 265.

[Figure 5-9. 64-Bit Floating-Point (3DNow!) Vector Operand: two single-precision values, one in bits 63-32 and one in bits 31-0. In each, the sign bit (S) is bit 63 or 31, the biased exponent is bits 62-55 or 30-23, and the significand (also fraction) is bits 54-32 or 22-0.]

Single-Precision Format. The single-precision floating-point format supported by 64-bit media instructions is the same format as the normalized IEEE 754 single-precision format. This format includes a sign bit, an 8-bit biased exponent, and a 23-bit significand with one hidden integer bit for a total of 24 bits in the significand. The hidden integer bit is assumed to have a value of 1, and the significand field is also the fraction. The bias of the exponent is 127. However, the 3DNow! format does not support other aspects of the IEEE 754 standard, such as multiple rounding modes, representation of numbers other than normalized numbers, and floating-point exceptions.

Range of Representable Values and Saturation. Table 5-3 shows the range of representable values for 64-bit media floating-point data. Table 5-4 shows the exponent ranges.
The largest representable positive normal number has an exponent of FEh and a significand of 7FFFFFh, with a numerical value of 2^127 * (2 - 2^-23). The smallest representable positive normal number has an exponent of 01h and a significand of 000000h, with a numerical value of 2^-126.

Table 5-3. Range of Values in 64-Bit Media Floating-Point Data Types

Data-Type Interpretation             Doubleword                         Quadword
Floating-point  Base-2 (exact)       2^-126 to 2^127 * (2 - 2^-23)      Two single-precision
                Base-10 (approx.)    1.17 * 10^-38 to +3.40 * 10^38     floating-point doublewords

Table 5-4. 64-Bit Floating-Point Exponent Ranges

Biased Exponent    Description
FFh                Unsupported (1)
00h                Zero
01h to FEh         Normal
01h                2^(1-127), lowest possible exponent
FEh                2^(254-127), largest possible exponent

1. Unsupported numbers can be used as source operands but produce undefined results.

Results that, after rounding, overflow above the maximum-representable positive or negative number are saturated (limited or clamped) at the maximum positive or negative number. Results that underflow below the minimum-representable positive or negative number are treated as zero.

Floating-Point Rounding. In contrast to the IEEE standard, which requires four rounding modes, the 64-bit media floating-point instructions support only one rounding mode, depending on the instruction. All such instructions use round-to-nearest, except certain floating-point-to-integer conversion instructions ("Data Conversion" on page 266) which use round-to-zero.

No Support for Infinities, NaNs, and Denormals. 64-bit media floating-point instructions support only normalized numbers. They do not support infinity, NaN, and denormalized number representations. Operations on such numbers produce undefined results, and no exceptions are generated.
If all source operands are normalized numbers, these instructions never produce infinities, NaNs, or denormalized numbers as results. This aspect of 64-bit media floating-point operations does not comply with the IEEE 754 standard. Software must use only normalized operands and ensure that computations remain within valid normalized-number ranges.

No Support for Floating-Point Exceptions. The 64-bit media floating-point instructions do not generate floating-point exceptions. Software must ensure that in-range operands are provided to these instructions.

5.6 Instruction Summary--Integer Instructions

This section summarizes the functions of the integer (MMX and a few SSE and SSE2) instructions in the 64-bit media instruction subset. These include integer instructions that use an MMX register for source or destination and data-conversion instructions that convert from integers to floating-point formats. For a summary of the floating-point instructions in the 64-bit media instruction subset, including data-conversion instructions that convert from floating-point to integer formats, see "Instruction Summary--Floating-Point Instructions" on page 265.

The instructions are organized here by functional group--such as data-transfer, vector arithmetic, and so on. Software running at any privilege level can use any of these instructions, if the CPUID instruction reports support for the instructions (see "Feature Detection" on page 273). More detail on individual instructions is given in the alphabetically organized "64-Bit Media Instruction Reference" in Volume 5.

5.6.1 Syntax

Each instruction has a mnemonic syntax used by assemblers to specify the operation and the operands to be used for source and destination (result) data.
The majority of 64-bit media integer instructions have the following syntax:

    MNEMONIC mmx1, mmx2/mem64

Figure 5-10 shows an example of the mnemonic syntax for a packed add bytes (PADDB) instruction.

[Figure 5-10. Mnemonic Syntax for Typical Instruction: in "PADDB mmx1, mmx2/mem64", PADDB is the mnemonic, mmx1 is the first source operand and destination operand, and mmx2/mem64 is the second source operand.]

This example shows the PADDB mnemonic followed by two operands, a 64-bit MMX register operand and another 64-bit MMX register or 64-bit memory operand. In most instructions that take two operands, the first (left-most) operand is both a source operand and the destination operand. The second (right-most) operand serves only as a source. Some instructions can have one or more prefixes that modify default properties, as described in "Instruction Prefixes" on page 272.

Mnemonics. The following characters are used as prefixes in the mnemonics of integer instructions:

CVT--Convert
CVTT--Convert with truncation
P--Packed (vector)
PACK--Pack elements of 2x data size to 1x data size
PUNPCK--Unpack and interleave elements

In addition to the above prefix characters, the following characters are used elsewhere in the mnemonics of integer instructions:

B--Byte
D--Doubleword
DQ--Double quadword
ID--Integer doubleword
IW--Integer word
PD--Packed double-precision floating-point
PI--Packed integer
PS--Packed single-precision floating-point
Q--Quadword
S--Signed
SS--Signed saturation
U--Unsigned
US--Unsigned saturation
W--Word
x--One or more variable characters in the mnemonic

For example, the mnemonic for the instruction that packs four words into eight unsigned bytes is PACKUSWB. In this mnemonic, the PACK designates 2x-to-1x conversion of vector elements, the US designates unsigned results with saturation, and the WB designates vector elements of the source as words and those of the result as bytes.
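The way the PACKUSWB mnemonic decomposes (PACK, US, WB) can be illustrated with a scalar model in C. This is only a sketch of the behavior described above, not the hardware implementation: the function names are hypothetical, index 0 is assumed to be the least-significant element, and the placement of the first operand's results in the low-order half of the destination is an assumption taken from the instruction reference rather than from this section.

```c
#include <stdint.h>

/* "US": saturate one signed word to the unsigned byte range 0..255. */
uint8_t sat_word_to_ubyte(int16_t w)
{
    if (w < 0)   return 0;     /* below the unsigned minimum */
    if (w > 255) return 255;   /* above the unsigned maximum */
    return (uint8_t)w;
}

/* "PACK" + "WB": word elements in, byte elements out.
   op1 fills result bytes 0..3, op2 fills result bytes 4..7 (assumed layout). */
void packuswb_model(const int16_t op1[4], const int16_t op2[4],
                    uint8_t result[8])
{
    for (int i = 0; i < 4; i++) {
        result[i]     = sat_word_to_ubyte(op1[i]);
        result[i + 4] = sat_word_to_ubyte(op2[i]);
    }
}
```

Negative words clamp to 0 and words above 255 clamp to 255, so each result byte is a valid unsigned value regardless of the input.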
5.6.2 Exit Media State

The exit media state instructions are used to isolate the use of processor resources between 64-bit media instructions and x87 floating-point instructions.

EMMS--Exit Media State
FEMMS--Fast Exit Media State

These instructions initialize the contents of the x87 floating-point stack registers--called clearing the MMX state. Software should execute one of these instructions before leaving a 64-bit media procedure.

The EMMS and FEMMS instructions both clear the MMX state, as described in "Mixing Media Code with x87 Code" on page 278. The instructions differ in one respect: FEMMS leaves the data in the x87 stack registers undefined. By contrast, EMMS leaves the data in each such register as it was defined by the last x87 or 64-bit media instruction that wrote to the register. The FEMMS instruction is supported for backward-compatibility. Software that must be compatible with both AMD and non-AMD processors should use the EMMS instruction.

5.6.3 Data Transfer

The data-transfer instructions copy operands between a 32-bit or 64-bit memory location, an MMX register, an XMM register, or a GPR. The MOV mnemonic, which stands for move, is a misnomer. A copy function is actually performed instead of a move.

Move.

MOVD--Move Doubleword
MOVQ--Move Quadword
MOVDQ2Q--Move Double Quadword to Quadword
MOVQ2DQ--Move Quadword to Double Quadword

The MOVD instruction copies a 32-bit or 64-bit value from a general-purpose register (GPR) or memory location to an MMX register, or from an MMX register to a GPR or memory location. If the source operand is 32 bits and the destination operand is 64 bits, the source is zero-extended to 64 bits in the destination. If the source is 64 bits and the destination is 32 bits, only the low-order 32 bits of the source are copied to the destination.
The MOVQ instruction copies a 64-bit value from an MMX register or 64-bit memory location to another MMX register, or from an MMX register to another MMX register or 64-bit memory location.

The MOVDQ2Q instruction copies the low-order 64-bit value in an XMM register to an MMX register.

The MOVQ2DQ instruction copies a 64-bit value from an MMX register to the low-order 64 bits of an XMM register, with zero-extension to 128 bits.

The MOVD and MOVQ instructions--along with the PUNPCKx instructions--are often among the most frequently used instructions in 64-bit media procedures (both integer and floating-point). The move instructions are similar to the assignment operator in high-level languages.

Move Non-Temporal. The move non-temporal instructions are called streaming-store instructions. They minimize pollution of the cache. The assumption is that the data they reference will be used only once, and is therefore not subject to cache-related overhead such as write-allocation. For further information, see "Memory Optimization" on page 113.

MOVNTQ--Move Non-Temporal Quadword
MASKMOVQ--Mask Move Quadword

The MOVNTQ instruction stores a 64-bit MMX register value into a 64-bit memory location. The MASKMOVQ instruction stores bytes from the first operand, as selected by the mask value (most-significant bit of each byte) in the second operand, to a memory location specified in the rDI and DS registers. The first operand is an MMX register, and the second operand is another MMX register. The size of the store is determined by the effective address size. Figure 5-11 shows the MASKMOVQ operation.

[Figure 5-11. MASKMOVQ Move Mask Operation: the most-significant bit of each byte in operand 2 selects which bytes of operand 1 are stored to the memory location addressed by rDI.]

The MOVNTQ and MASKMOVQ instructions use weakly-ordered, write-combining buffering of write data, and they minimize cache pollution.
The exact method by which cache pollution is minimized depends on the hardware implementation of the instruction. For further information, see "Memory Optimization" on page 113.

A typical case benefiting from streaming stores occurs when data written by the processor is never read by the processor, such as data written to a graphics frame buffer. MASKMOVQ is useful for the handling of end cases in block copies and block fills based on streaming stores.

Move Mask.

PMOVMSKB--Packed Move Mask Byte

The PMOVMSKB instruction moves the most-significant bit of each byte in an MMX register to the low-order byte of a 32-bit or 64-bit general-purpose register, with zero-extension. It is useful for extracting bits from a mask, or extracting zero-point values from quantized data such as signal samples, resulting in a byte that can be used for data-dependent branching.

5.6.4 Data Conversion

The integer data-conversion instructions convert operands from integer formats to floating-point formats. They take 64-bit integer source operands. For data-conversion instructions that take 32-bit and 64-bit floating-point source operands, see "Data Conversion" on page 266. For data-conversion instructions that take 128-bit source operands, see "Data Conversion" on page 166 and "Data Conversion" on page 192.

Convert Integer to Floating-Point. These instructions convert integer data types into floating-point data types.
CVTPI2PS--Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point
CVTPI2PD--Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point
PI2FW--Packed Integer to Floating-Point Word Conversion
PI2FD--Packed Integer to Floating-Point Doubleword Conversion

The CVTPI2Px instructions convert two 32-bit signed integer values in the second operand (an MMX register or 64-bit memory location) to two single-precision (CVTPI2PS) or double-precision (CVTPI2PD) floating-point values. The instructions then write the converted values into the low-order 64 bits of an XMM register (CVTPI2PS) or the full 128 bits of an XMM register (CVTPI2PD). The CVTPI2PS instruction does not modify the high-order 64 bits of the XMM register.

The PI2Fx instructions are 3DNow! instructions. They convert two 16-bit (PI2FW) or 32-bit (PI2FD) signed integer values in the second operand to two single-precision floating-point values. The instructions then write the converted values into the destination. If a PI2FD conversion produces an inexact value, the value is truncated (rounded toward zero).

5.6.5 Data Reordering

The integer data-reordering instructions pack, unpack, interleave, extract, insert, shuffle, and swap the elements of vector operands.

Pack with Saturation. These instructions pack 2x-sized data types into 1x-sized data types, thus halving the precision of each element in a vector operand.

PACKSSDW--Pack with Saturation Signed Doubleword to Word
PACKSSWB--Pack with Saturation Signed Word to Byte
PACKUSWB--Pack with Saturation Signed Word to Unsigned Byte

The PACKSSDW instruction converts each 32-bit signed integer in its two source operands (an MMX register or 64-bit memory location and another MMX register) into a 16-bit signed integer and packs the converted values into the destination MMX register.
The PACKSSWB instruction does the analogous operation between word elements in the source vectors and byte elements in the destination vector. The PACKUSWB instruction does the same as PACKSSWB except that it converts word integers into unsigned (rather than signed) bytes.

Figure 5-12 on page 252 shows an example of a PACKSSDW instruction. The operation merges vector elements of 2x size (doubleword-size) into vector elements of 1x size (word-size), thus reducing the precision of the vector-element data types. Any results that would otherwise overflow or underflow are saturated (clamped) at the maximum or minimum representable value, respectively, as described in "Saturation" on page 242.

[Figure 5-12. PACKSSDW Pack Operation: the doubleword elements of operand 1 and operand 2 are packed into the word elements of the 64-bit result.]

Conversion from higher-to-lower precision may be needed, for example, after an arithmetic operation which requires the higher-precision format to prevent possible overflow, but which requires the lower-precision format for a subsequent operation.

Unpack and Interleave. These instructions interleave vector elements from the high or low half of two source operands. They can be used to double the precision of operands.

PUNPCKHBW--Unpack and Interleave High Bytes
PUNPCKHWD--Unpack and Interleave High Words
PUNPCKHDQ--Unpack and Interleave High Doublewords
PUNPCKLBW--Unpack and Interleave Low Bytes
PUNPCKLWD--Unpack and Interleave Low Words
PUNPCKLDQ--Unpack and Interleave Low Doublewords

The PUNPCKHBW instruction unpacks the four high-order bytes from its two source operands and interleaves them into the bytes in the destination operand. The bytes in the low-order half of the source operands are ignored.
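The interleave performed by PUNPCKHBW can be sketched as a scalar model in C. This is an illustration under stated assumptions rather than a definitive implementation: the function name is hypothetical, index 0 is taken to be the least-significant byte, and each element from the first operand is assumed to land in the lower-order position of its pair, with the second operand's element above it.

```c
#include <stdint.h>

/* Scalar model of PUNPCKHBW: interleave the four high-order bytes
   (indices 4..7) of op1 and op2. Bytes 0..3 of both operands are ignored. */
void punpckhbw_model(const uint8_t op1[8], const uint8_t op2[8],
                     uint8_t result[8])
{
    for (int i = 0; i < 4; i++) {
        result[2 * i]     = op1[4 + i];  /* lower byte of each pair */
        result[2 * i + 1] = op2[4 + i];  /* upper byte of each pair */
    }
}
```

When op2 is all zeros, every byte pair in the result reads as a word holding the zero-extended byte from op1, which is how the unpack instructions double the precision of operands.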
The PUNPCKHWD and PUNPCKHDQ instructions perform analogous operations for words and doublewords in the source operands, packing them into interleaved words and interleaved doublewords in the destination operand.

The PUNPCKLBW, PUNPCKLWD, and PUNPCKLDQ instructions are analogous to their high-element counterparts except that they take elements from the low doubleword of each source vector and ignore elements in the high doubleword. If the source operand for PUNPCKLx instructions is in memory, only the low 32 bits of the operand are loaded.

Figure 5-13 shows an example of the PUNPCKLWD instruction. The elements are taken from the low half of the source operands. In this register image, elements from operand2 are placed to the left of elements from operand1.

[Figure 5-13. PUNPCKLWD Unpack and Interleave Operation: the low-order words of operand 1 and operand 2 are interleaved into the 64-bit result.]

If one of the two source operands is a vector consisting of all zero-valued elements, the unpack instructions perform the function of expanding vector elements of 1x size into vector elements of 2x size (for example, word-size to doubleword-size). If both source operands are of identical value, the unpack instructions can perform the function of duplicating adjacent elements in a vector.

The PUNPCKx instructions--along with MOVD and MOVQ--are among the most frequently used instructions in 64-bit media procedures (both integer and floating-point).

Extract and Insert. These instructions copy a word element from a vector, in a manner specified by an immediate operand.

PEXTRW--Packed Extract Word
PINSRW--Packed Insert Word

The PEXTRW instruction extracts a 16-bit value from an MMX register, as selected by the immediate-byte operand, and writes it to the low-order word of a 32-bit or 64-bit general-purpose register, with zero-extension to 32 or 64 bits. PEXTRW is useful for loading computed values, such as table-lookup indices, into general-purpose registers where the values can be used for addressing tables in memory.

The PINSRW instruction inserts a 16-bit value from the low-order word of a 32-bit or 64-bit general-purpose register or a 16-bit memory location into an MMX register. The location in the destination register is selected by the immediate-byte operand. The other words in the destination register operand are not modified.

Shuffle and Swap. These instructions reorder the elements of a vector.

PSHUFW--Packed Shuffle Words
PSWAPD--Packed Swap Doubleword

The PSHUFW instruction moves any one of the four words in its second operand (an MMX register or 64-bit memory location) to specified word locations in its first operand (another MMX register). The ordering of the shuffle can occur in any of 256 possible ways, as specified by the immediate-byte operand. Figure 5-14 shows one of the 256 possible shuffle operations. PSHUFW is useful, for example, in color imaging when computing alpha saturation of RGB values. In this case, PSHUFW can replicate an alpha value in a register so that parallel comparisons with three RGB values can be performed.

[Figure 5-14. PSHUFW Shuffle Operation: words of operand 2 are moved to selected word locations in the 64-bit result.]

The PSWAPD instruction swaps (reverses) the order of two 32-bit values in the second operand and writes each swapped value in the corresponding doubleword of the destination. Figure 5-15 shows a swap operation.
PSWAPD is useful, for example, in complex-number multiplication in which the elements of one source operand must be swapped (see "Accumulation" on page 268 for details). PSWAPD supports independent source and result operands so that it can also perform a load function.

[Figure 5-15. PSWAPD Swap Operation: the two doublewords of operand 2 are written in reversed order to the 64-bit result.]

5.6.6 Arithmetic

The integer vector-arithmetic instructions perform an arithmetic operation on the elements of two source vectors. Arithmetic instructions that are not specifically named as unsigned perform signed two's-complement arithmetic.

Addition.

PADDB--Packed Add Bytes
PADDW--Packed Add Words
PADDD--Packed Add Doublewords
PADDQ--Packed Add Quadwords
PADDSB--Packed Add with Saturation Bytes
PADDSW--Packed Add with Saturation Words
PADDUSB--Packed Add Unsigned with Saturation Bytes
PADDUSW--Packed Add Unsigned with Saturation Words

The PADDB, PADDW, PADDD, and PADDQ instructions add each 8-bit (PADDB), 16-bit (PADDW), 32-bit (PADDD), or 64-bit (PADDQ) integer element in the second operand to the corresponding, same-sized integer element in the first operand. The instructions then write the integer result of each addition to the corresponding, same-sized element of the destination. These instructions operate on both signed and unsigned integers. However, if the result overflows, only the low-order byte, word, doubleword, or quadword of each result is written to the destination. The PADDD instruction can be used together with PMADDWD (page 258) to implement dot products.

The PADDSB and PADDSW instructions perform additions analogous to the PADDB and PADDW instructions, except with saturation.
For each result in the destination, if the result is larger than the largest, or smaller than the smallest, representable 8-bit (PADDSB) or 16-bit (PADDSW) signed integer, the result is saturated to the largest or smallest representable value, respectively.

The PADDUSB and PADDUSW instructions perform saturating additions analogous to the PADDSB and PADDSW instructions, except on unsigned integer elements.

Subtraction.

PSUBB--Packed Subtract Bytes
PSUBW--Packed Subtract Words
PSUBD--Packed Subtract Doublewords
PSUBQ--Packed Subtract Quadword
PSUBSB--Packed Subtract with Saturation Bytes
PSUBSW--Packed Subtract with Saturation Words
PSUBUSB--Packed Subtract Unsigned and Saturate Bytes
PSUBUSW--Packed Subtract Unsigned and Saturate Words

The subtraction instructions perform operations analogous to the addition instructions.

The PSUBB, PSUBW, PSUBD, and PSUBQ instructions subtract each 8-bit (PSUBB), 16-bit (PSUBW), 32-bit (PSUBD), or 64-bit (PSUBQ) integer element in the second operand from the corresponding, same-sized integer element in the first operand. The instructions then write the integer result of each subtraction to the corresponding, same-sized element of the destination. These instructions operate on both signed and unsigned integers. However, if the result underflows, only the low-order byte, word, doubleword, or quadword of each result is written to the destination.

The PSUBSB and PSUBSW instructions perform subtractions analogous to the PSUBB and PSUBW instructions, except with saturation. For each result in the destination, if the result is larger than the largest, or smaller than the smallest, representable 8-bit (PSUBSB) or 16-bit (PSUBSW) signed integer, the result is saturated to the largest or smallest representable value, respectively.
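The difference between the wrapping and saturating forms can be seen in a per-element C model of word addition. This is a minimal sketch, assuming one 16-bit element at a time; the function names are hypothetical and are not instruction mnemonics. Each function computes the sum at wider precision and then applies the rule described above. The subtract instructions would follow the same pattern.

```c
#include <stdint.h>

/* PADDW-style element: keep only the low-order 16 bits of the sum. */
uint16_t padd_word(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a + (uint32_t)b);
}

/* PADDSW-style element: clamp the exact sum to the signed word range. */
int16_t padds_word(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;   /* 7FFFh */
    if (sum < INT16_MIN) return INT16_MIN;   /* 8000h */
    return (int16_t)sum;
}

/* PADDUSW-style element: clamp the exact sum to the unsigned word range. */
uint16_t paddus_word(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return (sum > UINT16_MAX) ? (uint16_t)UINT16_MAX : (uint16_t)sum;
}
```

For instance, adding F000h to F000h gives E000h in both the wrapping and the signed-saturating form (the exact signed sum, -8192, is representable), but FFFFh in the unsigned-saturating form.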
The PSUBUSB and PSUBUSW instructions perform saturating subtractions analogous to the PSUBSB and PSUBSW instructions, except on unsigned integer elements.

Multiplication.

PMULHW--Packed Multiply High Signed Word
PMULLW--Packed Multiply Low Signed Word
PMULHRW--Packed Multiply High Rounded Word
PMULHUW--Packed Multiply High Unsigned Word
PMULUDQ--Packed Multiply Unsigned Doubleword and Store Quadword

The PMULHW instruction multiplies each 16-bit signed integer value in the first operand by the corresponding 16-bit integer in the second operand, producing a 32-bit intermediate result. The instruction then writes the high-order 16 bits of the 32-bit intermediate result of each multiplication to the corresponding word of the destination. The PMULLW instruction performs the same multiplication as PMULHW but writes the low-order 16 bits of the 32-bit intermediate result to the corresponding word of the destination.

The PMULHRW instruction performs the same multiplication as PMULHW but with rounding. After the multiplication, PMULHRW adds 8000h to the lower word of the doubleword result, thus rounding the high-order word which is returned as the result.

The PMULHUW instruction performs the same multiplication as PMULHW but on unsigned operands. The instruction is useful in 3D rasterization, which operates on unsigned pixel values.

The PMULUDQ instruction, unlike the other PMULx instructions, preserves the full precision of the result. It multiplies 32-bit unsigned integer values in the first and second operands and writes the full 64-bit result to the destination.

See "Shift" on page 260 for shift instructions that can be used to perform multiplication and division by powers of 2.

Multiply-Add.
PMADDWD--Packed Multiply Words and Add Doublewords

The PMADDWD instruction multiplies each 16-bit signed value in the first operand by the corresponding 16-bit signed value in the second operand. The instruction then adds the adjacent 32-bit intermediate results of each multiplication, and writes the 32-bit result of each addition into the corresponding doubleword of the destination. PMADDWD thus performs two signed (16 x 16 = 32) + (16 x 16 = 32) multiply-adds in parallel. Figure 5-16 on page 259 shows the PMADDWD operation.

The only case in which overflow can occur is when all four of the 16-bit source operands used to produce a 32-bit multiply-add result have the value 8000h. In this case, the result returned is 8000_0000h, because the maximum negative 16-bit value of 8000h multiplied by itself equals 4000_0000h, and 4000_0000h added to 4000_0000h equals 8000_0000h. The result of multiplying two negative numbers should be a positive number, but 8000_0000h is the maximum possible 32-bit negative number rather than a positive number.

[Figure 5-16. PMADDWD Multiply-Add Operation: corresponding words of operand 1 and operand 2 are multiplied into four intermediate products, and adjacent products are added to form the two doublewords of the 64-bit result.]

PMADDWD can be used with one source operand (for example, a coefficient) taken from memory and the other source operand (for example, the data to be multiplied by that coefficient) taken from an MMX register. The instruction can also be used together with the PADDD instruction (page 255) to compute dot products, such as those required for finite impulse response (FIR) filters, one of the commonly used DSP algorithms. Scaling can be done, before or after the multiply, using a vector-shift instruction (page 260).

For floating-point multiplication operations, see the PFMUL instruction on page 268.
For floating-point accumulation operations, see the PFACC, PFNACC, and PFPNACC instructions on page 268.

Average.

PAVGB--Packed Average Unsigned Bytes
PAVGW--Packed Average Unsigned Words
PAVGUSB--Packed Average Unsigned Packed Bytes

The PAVGx instructions compute the rounded average of each unsigned 8-bit (PAVGB) or 16-bit (PAVGW) integer value in the first operand and the corresponding, same-sized unsigned integer in the second operand. The instructions then write each average in the corresponding, same-sized element of the destination. The rounded average is computed by adding each pair of operands, adding 1 to the temporary sum, and then right-shifting the temporary sum by one bit.

The PAVGB instruction is useful for MPEG decoding, in which motion compensation performs many byte-averaging operations between and within macroblocks. In addition to speeding up these operations, PAVGB can free up registers and make it possible to unroll the averaging loops.

The PAVGUSB instruction (a 3DNow! instruction) performs a function identical to the PAVGB instruction, described on page 259, although the two instructions have different opcodes.

Sum of Absolute Differences.

PSADBW--Packed Sum of Absolute Differences of Bytes into a Word

The PSADBW instruction computes the absolute values of the differences of corresponding 8-bit unsigned integer values in the first and second operands. The instruction then sums the differences and writes an unsigned 16-bit integer result in the low-order word of the destination. The remaining bytes in the destination are cleared to all 0s.

Sums of absolute differences are used to compute the L1 norm in motion-estimation algorithms for video compression.
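The sum-of-absolute-differences operation can be modeled in a few lines of C. This is a sketch under assumptions, not the hardware definition: the function name is hypothetical, each 64-bit operand is passed as a plain integer, and the byte elements are assumed to be treated as unsigned values.

```c
#include <stdint.h>

/* Scalar model of PSADBW on two 64-bit operands. The sum of eight
   absolute byte differences is at most 8 * 255 = 2040, so it fits in
   the low-order word and the upper 48 bits of the result stay zero. */
uint64_t psadbw_model(uint64_t op1, uint64_t op2)
{
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        unsigned x = (unsigned)((op1 >> (8 * i)) & 0xFF);
        unsigned y = (unsigned)((op2 >> (8 * i)) & 0xFF);
        sum += (x > y) ? (x - y) : (y - x);   /* |x - y| without going negative */
    }
    return sum;
}
```

In a motion-estimation loop, a model like this would be applied to each row of a candidate block and the per-row sums accumulated to give the L1 distance between blocks.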
5.6.7 Shift

The vector-shift instructions are useful for scaling vector elements to higher or lower precision, packing and unpacking vector elements, and multiplying and dividing vector elements by powers of 2.

Left Logical Shift.

PSLLW--Packed Shift Left Logical Words
PSLLD--Packed Shift Left Logical Doublewords
PSLLQ--Packed Shift Left Logical Quadwords

The PSLLx instructions left-shift each of the 16-bit (PSLLW), 32-bit (PSLLD), or 64-bit (PSLLQ) values in the first operand by the number of bits specified in the second operand. The instructions then write each shifted value into the corresponding, same-sized element of the destination. The first and second operands are either an MMX register and another MMX register or 64-bit memory location, or an MMX register and an immediate-byte value. The low-order bits that are emptied by the shift operation are cleared to 0.

In integer arithmetic, left logical shifts effectively multiply unsigned operands by positive powers of 2.

Right Logical Shift.

PSRLW--Packed Shift Right Logical Words
PSRLD--Packed Shift Right Logical Doublewords
PSRLQ--Packed Shift Right Logical Quadwords

The PSRLx instructions right-shift each of the 16-bit (PSRLW), 32-bit (PSRLD), or 64-bit (PSRLQ) values in the first operand by the number of bits specified in the second operand. The instructions then write each shifted value into the corresponding, same-sized element of the destination. The first and second operands are either an MMX register and another MMX register or 64-bit memory location, or an MMX register and an immediate-byte value. The high-order bits that are emptied by the shift operation are cleared to 0.

In integer arithmetic, right logical shifts effectively divide unsigned operands or positive signed operands by positive powers of 2.
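The relationship between logical shifts and powers of 2 can be sketched per element in C. The function names below are hypothetical, and the model covers a single 16-bit element. Returning 0 for shift counts above 15 is an assumption taken from the instruction reference, not something stated in this section.

```c
#include <stdint.h>

/* PSLLW-style element: emptied low-order bits are cleared to 0, and bits
   shifted out of the top of the word are discarded. */
uint16_t psllw_lane(uint16_t value, unsigned count)
{
    if (count > 15) return 0;            /* assumed behavior for large counts */
    return (uint16_t)((uint32_t)value << count);
}

/* PSRLW-style element: emptied high-order bits are cleared to 0
   (no sign fill, unlike an arithmetic right shift). */
uint16_t psrlw_lane(uint16_t value, unsigned count)
{
    if (count > 15) return 0;            /* assumed behavior for large counts */
    return (uint16_t)(value >> count);
}
```

Shifting 3 left by 2 gives 12 (multiplication by 4), and shifting 100 right by 2 gives 25 (division by 4). Shifting 8000h right by 1 gives 4000h, which is the correct quotient only if the element is interpreted as unsigned.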
PSRLQ can be used to move the high 32 bits of an MMX register to the low 32 bits of the register.

Right Arithmetic Shift.
PSRAW--Packed Shift Right Arithmetic Words
PSRAD--Packed Shift Right Arithmetic Doublewords

The PSRAx instructions right-shift each of the 16-bit (PSRAW) or 32-bit (PSRAD) values in the first operand by the number of bits specified in the second operand. The instructions then write each shifted value into the corresponding, same-sized element of the destination. The high-order bits that are emptied by the shift operation are filled with the sign bit of the initial value.

In integer arithmetic, right arithmetic shifts effectively divide signed operands by positive powers of 2.

5.6.8 Compare

The integer vector-compare instructions compare two operands, and they either write a mask or they write the maximum or minimum value.

Compare and Write Mask.
PCMPEQB--Packed Compare Equal Bytes
PCMPEQW--Packed Compare Equal Words
PCMPEQD--Packed Compare Equal Doublewords
PCMPGTB--Packed Compare Greater Than Signed Bytes
PCMPGTW--Packed Compare Greater Than Signed Words
PCMPGTD--Packed Compare Greater Than Signed Doublewords

The PCMPEQx and PCMPGTx instructions compare corresponding bytes, words, or doublewords in the first and second operands. The instructions then write a mask of all 1s or 0s for each compare into the corresponding, same-sized element of the destination.

For the PCMPEQx instructions, if the compared values are equal, the result mask is all 1s. If the values are not equal, the result mask is all 0s. For the PCMPGTx instructions, if the signed value in the first operand is greater than the signed value in the second operand, the result mask is all 1s. If the value in the first operand is less than or equal to the value in the second operand, the result mask is all 0s.
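The mask semantics can be modelled in a few lines. The Python sketch below is a simplified model, assuming a 64-bit operand is held as a list of four 16-bit values. The helper names to_signed16 and select_by_mask are hypothetical and exist only for this illustration. One common use of such masks is branch-free selection between two vectors, which select_by_mask shows using plain AND, AND-NOT, and OR operations.

```python
def to_signed16(w):
    """Interpret a 16-bit pattern as a two's-complement signed value."""
    return w - 0x10000 if w & 0x8000 else w


def pcmpeqw(a, b):
    """Model PCMPEQW: all 1s where elements are equal, else all 0s."""
    return [0xFFFF if x == y else 0x0000 for x, y in zip(a, b)]


def pcmpgtw(a, b):
    """Model PCMPGTW: all 1s where signed a > signed b, else all 0s."""
    return [0xFFFF if to_signed16(x) > to_signed16(y) else 0x0000
            for x, y in zip(a, b)]


def select_by_mask(mask, a, b):
    """Pick a where the mask is all 1s and b where it is all 0s."""
    return [(x & m) | (y & ~m & 0xFFFF) for m, x, y in zip(mask, a, b)]
```

Feeding the result of pcmpgtw(a, b) into select_by_mask(mask, a, b) produces the element-wise signed maximum of a and b without any conditional branch.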
PCMPEQx can be used to set the bits in an MMX register to all 1s by specifying the same register for both operands.

Figure 5-5 on page 235 shows an example of a non-branching sequence that implements a two-way multiplexer, one that is equivalent to the following sequence of ternary operators in C or C++:

    r0 = a0 > b0 ? a0 : b0
    r1 = a1 > b1 ? a1 : b1
    r2 = a2 > b2 ? a2 : b2
    r3 = a3 > b3 ? a3 : b3

Assuming mmx0 contains a, and mmx1 contains b, the above C sequence can be implemented with the following assembler sequence:

    MOVQ    mmx3, mmx0
    PCMPGTW mmx3, mmx1    ; a > b ? 0xffff : 0
    PAND    mmx0, mmx3    ; a > b ? a : 0
    PANDN   mmx3, mmx1    ; a > b ? 0 : b
    POR     mmx0, mmx3    ; r = a > b ? a : b

In the above sequence, PCMPGTW, PAND, PANDN, and POR operate, in parallel, on all four elements of the vectors.

Compare and Write Minimum or Maximum.
PMAXUB--Packed Maximum Unsigned Bytes
PMINUB--Packed Minimum Unsigned Bytes
PMAXSW--Packed Maximum Signed Words
PMINSW--Packed Minimum Signed Words

The PMAXUB and PMINUB instructions compare each of the 8-bit unsigned integer values in the first operand with the corresponding 8-bit unsigned integer values in the second operand. The instructions then write the maximum (PMAXUB) or minimum (PMINUB) of the two values for each comparison into the corresponding byte of the destination.

The PMAXSW and PMINSW instructions perform operations analogous to the PMAXUB and PMINUB instructions, except on 16-bit signed integer values.

5.6.9 Logical

The vector-logic instructions perform Boolean logic operations, including AND, OR, and exclusive OR.

And.
PAND--Packed Logical Bitwise AND
PANDN--Packed Logical Bitwise AND NOT

The PAND instruction performs a bitwise logical AND of the values in the first and second operands and writes the result to the destination.
The PANDN instruction inverts the first operand (creating a one's complement of the operand), ANDs it with the second operand, and writes the result to the destination. Table 5-5 on page 264 shows an example.

Table 5-5. Example PANDN Bit Values

    Operand1 Bit   Operand1 Bit (Inverted)   Operand2 Bit   PANDN Result Bit
    1              0                         1              0
    1              0                         0              0
    0              1                         1              1
    0              1                         0              0

PAND can be used with the value 7FFFFFFF7FFFFFFFh to compute the absolute value of the elements of a 64-bit media floating-point vector operand. This method is equivalent to the x87 FABS (floating-point absolute value) instruction.

Or.
POR--Packed Logical Bitwise OR

The POR instruction performs a bitwise logical OR of the values in the first and second operands and writes the result to the destination.

Exclusive Or.
PXOR--Packed Logical Bitwise Exclusive OR

The PXOR instruction performs a bitwise logical exclusive OR of the values in the first and second operands and writes the result to the destination. PXOR can be used to clear all bits in an MMX register by specifying the same register for both operands. PXOR can also be used with the value 8000000080000000h to change the sign bits of the elements of a 64-bit media floating-point vector operand. This method is equivalent to the x87 floating-point change sign (FCHS) instruction.

5.6.10 Save and Restore State

These instructions save and restore the processor state for 64-bit media instructions.

Save and Restore 64-Bit Media and x87 State.
FSAVE--Save x87 and MMX State
FNSAVE--Save No-Wait x87 and MMX State
FRSTOR--Restore x87 and MMX State

These instructions save and restore the entire processor state for x87 floating-point instructions and 64-bit media instructions.
The instructions save and restore either 94 or 108 bytes of data, depending on the effective operand size.

Assemblers issue FSAVE as an FWAIT instruction followed by an FNSAVE instruction. Thus, FSAVE (but not FNSAVE) reports pending unmasked x87 floating-point exceptions before saving the state. After saving the state, the processor initializes the x87 state by performing the equivalent of an FINIT instruction.

Save and Restore 128-Bit, 64-Bit, and x87 State.
FXSAVE--Save XMM, MMX, and x87 State
FXRSTOR--Restore XMM, MMX, and x87 State

The FXSAVE and FXRSTOR instructions save and restore the entire 512-byte processor state for 128-bit media instructions, 64-bit media instructions, and x87 floating-point instructions. The architecture supports two memory formats for FXSAVE and FXRSTOR, a 512-byte 32-bit legacy format and a 512-byte 64-bit format. Selection of the 32-bit or 64-bit format is determined by the effective operand size for the FXSAVE and FXRSTOR instructions. For details, see "Saving Media and x87 Processor State" in Volume 2.

FXSAVE and FXRSTOR execute faster than FSAVE/FNSAVE and FRSTOR. However, unlike FSAVE and FNSAVE, FXSAVE does not initialize the x87 state, and like FNSAVE it does not report pending unmasked x87 floating-point exceptions. For details, see "Saving and Restoring State" on page 279.

5.7 Instruction Summary--Floating-Point Instructions

This section summarizes the functions of the floating-point (3DNow! and a few SSE and SSE2) instructions in the 64-bit media instruction subset. These include floating-point instructions that use an MMX register for source or destination and data-conversion instructions that convert from floating-point to integer formats. For a summary of the integer instructions in the 64-bit media instruction subset, including data-conversion instructions that convert from integer to floating-point formats, see "Instruction Summary--Integer Instructions" on page 245.

For a summary of the 128-bit media floating-point instructions, see "Instruction Summary--Floating-Point Instructions" on page 187. For a summary of the x87 floating-point instructions, see "Instruction Summary" on page 315.

The instructions are organized here by functional group, such as data-transfer, vector arithmetic, and so on. Software running at any privilege level can use any of these instructions, if the CPUID instruction reports support for the instructions (see "Feature Detection" on page 273). More detail on individual instructions is given in the alphabetically organized "64-Bit Media Instruction Reference" in Volume 5.

5.7.1 Syntax

The 64-bit media floating-point instructions have the same syntax rules as those for the 64-bit media integer instructions, described in "Syntax" on page 246, except that the mnemonics of most floating-point instructions begin with the following prefix:

PF--Packed floating-point

5.7.2 Data Conversion

These data-conversion instructions convert operands from floating-point to integer formats. The instructions take 32-bit or 64-bit floating-point source operands. For data-conversion instructions that take 64-bit integer source operands, see "Data Conversion" on page 250. For data-conversion instructions that take 128-bit source operands, see "Data Conversion" on page 166 and "Data Conversion" on page 192.

Convert Floating-Point to Integer.
CVTPS2PI--Convert Packed Single-Precision Floating-Point to Packed Doubleword Integers
CVTTPS2PI--Convert Packed Single-Precision Floating-Point to Packed Doubleword Integers, Truncated
CVTPD2PI--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers
CVTTPD2PI--Convert Packed Double-Precision Floating-Point to Packed Doubleword Integers, Truncated
PF2IW--Packed Floating-Point to Integer Word Conversion
PF2ID--Packed Floating-Point to Integer Doubleword Conversion

The CVTPS2PI and CVTTPS2PI instructions convert two single-precision (32-bit) floating-point values in the second operand (the low-order 64 bits of an XMM register or a 64-bit memory location) to two 32-bit signed integers, and write the converted values into the first operand (an MMX register). For the CVTPS2PI instruction, if the conversion result is an inexact value, the value is rounded as specified in the rounding control (RC) field of the MXCSR register ("MXCSR Register" on page 140), but for the CVTTPS2PI instruction such a result is truncated (rounded toward zero).

The CVTPD2PI and CVTTPD2PI instructions perform conversions analogous to CVTPS2PI and CVTTPS2PI but for two double-precision (64-bit) floating-point values.

The 3DNow! PF2IW instruction converts two single-precision floating-point values in the second operand (an MMX register or a 64-bit memory location) to two 16-bit signed integer values, sign-extended to 32 bits, and writes the converted values into the first operand (an MMX register). The 3DNow! PF2ID instruction converts two single-precision floating-point values in the second operand to two 32-bit signed integer values, and writes the converted values into the first operand. If the result of either conversion is an inexact value, the value is truncated (rounded toward zero).
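The difference between a rounded and a truncated conversion is easiest to see on values that lie exactly halfway between two integers. The Python sketch below is only a model of that distinction. It assumes the MXCSR rounding-control field selects round-to-nearest (even), which is one of several possible settings, and it assumes the inputs are finite and within the 32-bit signed range. The function names are hypothetical.

```python
import math


def cvtps2pi_nearest(values):
    """Model CVTPS2PI with RC assumed to be round-to-nearest-even.

    Python's round() on a float rounds halfway cases to the even integer,
    which matches that rounding mode for in-range inputs.
    """
    return [round(v) for v in values]


def cvttps2pi(values):
    """Model CVTTPS2PI: inexact results are truncated toward zero."""
    return [math.trunc(v) for v in values]
```

For the input pair (2.5, -2.9), the rounded conversion under this assumption gives (2, -3), while the truncating conversion gives (2, -2).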
As described in "Floating-Point Data Types" on page 243, PF2IW and PF2ID do not fully comply with the IEEE-754 standard. Conversion of some source operands of the C type float (IEEE-754 single-precision), specifically NaNs, infinities, and denormals, is not supported. Attempts to convert such source operands produce undefined results, and no exceptions are generated.

5.7.3 Arithmetic

The floating-point vector-arithmetic instructions perform an arithmetic operation on two floating-point operands. For a description of 3DNow! instruction saturation on overflow and underflow conditions, see "Floating-Point Data Types" on page 243.

Addition.
PFADD--Packed Floating-Point Add

The PFADD instruction adds each single-precision floating-point value in the first operand (an MMX register) to the corresponding single-precision floating-point value in the second operand (an MMX register or 64-bit memory location). The instruction then writes the result of each addition into the corresponding doubleword of the destination.

Subtraction.
PFSUB--Packed Floating-Point Subtract
PFSUBR--Packed Floating-Point Subtract Reverse

The PFSUB instruction subtracts each single-precision floating-point value in the second operand from the corresponding single-precision floating-point value in the first operand. The instruction then writes the result of each subtraction into the corresponding doubleword of the destination.

The PFSUBR instruction performs a subtraction that is the reverse of the PFSUB instruction. It subtracts each value in the first operand from the corresponding value in the second operand. The provision of both the PFSUB and PFSUBR instructions allows software to choose which source operand to overwrite during a subtraction.

Multiplication.
PFMUL--Packed Floating-Point Multiply

The PFMUL instruction multiplies each of the two single-precision floating-point values in the first operand by the corresponding single-precision floating-point value in the second operand and writes the result of each multiplication into the corresponding doubleword of the destination.

Division.

For a description of floating-point division techniques, see "Reciprocal Estimation" on page 270. Division is equivalent to multiplication of the dividend by the reciprocal of the divisor.

Accumulation.
PFACC--Packed Floating-Point Accumulate
PFNACC--Packed Floating-Point Negative Accumulate
PFPNACC--Packed Floating-Point Positive-Negative Accumulate

The PFACC instruction adds the two single-precision floating-point values in the first operand and writes the result into the low-order doubleword of the destination, and it adds the two single-precision values in the second operand and writes the result into the high-order doubleword of the destination. Figure 5-17 illustrates the operation.

[Figure 5-17. PFACC Accumulate Operation]

The PFNACC instruction subtracts the first operand's high-order single-precision floating-point value from its low-order single-precision floating-point value and writes the result into the low-order doubleword of the destination, and it subtracts the second operand's high-order single-precision floating-point value from its low-order single-precision floating-point value and writes the result into the high-order doubleword of the destination.
The PFPNACC instruction subtracts the first operand's high-order single-precision floating-point value from its low-order single-precision floating-point value and writes the result into the low-order doubleword of the destination, and it adds the two single-precision values in the second operand and writes the result into the high-order doubleword of the destination.

PFPNACC is useful in complex-number multiplication, in which mixed positive-negative accumulation must be performed. Assuming that complex numbers are represented as two-element vectors (one element is the real part, the other element is the imaginary part), there is a need to swap the elements of one source operand to perform the multiplication, and there is a need for mixed positive-negative accumulation to complete the parallel computation of real and imaginary results. The PSWAPD instruction can swap elements of one source operand and the PFPNACC instruction can perform the mixed positive-negative accumulation to complete the computation.

Reciprocal Estimation.
PFRCP--Packed Floating-Point Reciprocal Approximation
PFRCPIT1--Packed Floating-Point Reciprocal, Iteration 1
PFRCPIT2--Packed Floating-Point Reciprocal or Reciprocal Square Root, Iteration 2

The PFRCP instruction computes the approximate reciprocal of the single-precision floating-point value in the low-order 32 bits of the second operand and writes the result into both doublewords of the first operand.

The PFRCPIT1 instruction performs the first intermediate step in the Newton-Raphson iteration to refine the reciprocal approximation produced by the PFRCP instruction. The first operand contains the input to a previous PFRCP instruction, and the second operand contains the result of the same PFRCP instruction.
The PFRCPIT2 instruction performs the second and final step in the Newton-Raphson iteration to refine the reciprocal approximation produced by the PFRCP instruction or the reciprocal square-root approximation produced by the PFRSQRT instruction. The first operand contains the result of a previous PFRCPIT1 or PFRSQIT1 instruction, and the second operand contains the result of a PFRCP or PFRSQRT instruction.

The PFRCP instruction can be used together with the PFRCPIT1 and PFRCPIT2 instructions to increase the accuracy of a single-precision significand.

Reciprocal Square Root.
PFRSQRT--Packed Floating-Point Reciprocal Square Root Approximation
PFRSQIT1--Packed Floating-Point Reciprocal Square Root, Iteration 1

The PFRSQRT instruction computes the approximate reciprocal square root of the single-precision floating-point value in the low-order 32 bits of the second operand and writes the result into each doubleword of the first operand. The second operand is a single-precision floating-point value with a 24-bit significand. The result written to the first operand is accurate to 15 bits. Negative operands are treated as positive operands for purposes of reciprocal square-root computation, with the sign of the result the same as the sign of the source operand.

The PFRSQIT1 instruction performs the first step in the Newton-Raphson iteration to refine the reciprocal square-root approximation produced by the PFRSQRT instruction. The first operand contains the input to a previous PFRSQRT instruction, and the second operand contains the square of the result of the same PFRSQRT instruction.

The PFRSQRT instruction can be used together with the PFRSQIT1 instruction and the PFRCPIT2 instruction (described in "Reciprocal Estimation" on page 270) to increase the accuracy of a single-precision significand.
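The text above names the Newton-Raphson iteration but does not spell out its arithmetic. The Python sketch below shows the textbook update formulas for a reciprocal and a reciprocal square root, as a way to see why one refinement pass sharply improves a coarse estimate. It does not model how the work is divided between PFRCPIT1, PFRSQIT1, and PFRCPIT2, it uses Python double-precision floats rather than single precision, and the function names are hypothetical.

```python
def refine_reciprocal(b, x0):
    """One Newton-Raphson step toward 1/b, starting from estimate x0.

    Update rule: x1 = x0 * (2 - b * x0).
    """
    return x0 * (2.0 - b * x0)


def refine_rsqrt(b, x0):
    """One Newton-Raphson step toward 1/sqrt(b), starting from estimate x0.

    Update rule: x1 = x0 * (3 - b * x0 * x0) / 2.
    """
    return x0 * (3.0 - b * x0 * x0) / 2.0
```

Each step roughly doubles the number of correct bits in the estimate, which is why a single refinement of a coarse hardware approximation can be enough to approach full single-precision accuracy. A quotient a / b can then be formed as a multiplied by the refined reciprocal of b.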
5.7.4 Compare

The floating-point vector-compare instructions compare two operands, and they either write a mask or they write the maximum or minimum value.

Compare and Write Mask.
PFCMPEQ--Packed Floating-Point Compare Equal
PFCMPGT--Packed Floating-Point Compare Greater Than
PFCMPGE--Packed Floating-Point Compare Greater or Equal

The PFCMPx instructions compare each of the two single-precision floating-point values in the first operand with the corresponding single-precision floating-point value in the second operand. The instructions then write the result of each comparison into the corresponding doubleword of the destination. If the comparison test (equal, greater than, greater or equal) is true, the result is a mask of all 1s. If the comparison test is false, the result is a mask of all 0s.

Compare and Write Minimum or Maximum.
PFMAX--Packed Floating-Point Maximum
PFMIN--Packed Floating-Point Minimum

The PFMAX and PFMIN instructions compare each of the two single-precision floating-point values in the first operand with the corresponding single-precision floating-point value in the second operand. The instructions then write the maximum (PFMAX) or minimum (PFMIN) of the two values for each comparison into the corresponding doubleword of the destination.

The PFMIN and PFMAX instructions are useful for clamping, such as color clamping in 3D geometry and rasterization. They can also be used to avoid branching.

5.8 Instruction Effects on Flags

The 64-bit media instructions do not read or write any flags in the rFLAGS register, nor do they write any exception-status flags in the x87 status-word register, nor is their execution dependent on any mask bits in the x87 control-word register. The only x87 state affected by the 64-bit media instructions is described in "Actions Taken on Executing 64-Bit Media Instructions" on page 276.
5.9 Instruction Prefixes

Instruction prefixes, in general, are described in "Instruction Prefixes" on page 85. The following restrictions apply to the use of instruction prefixes with 64-bit media instructions.

5.9.1 Supported Prefixes

The following prefixes can be used with 64-bit media instructions:

Address-Size Override--The 67h prefix affects only operands in memory. The prefix is ignored by all other 64-bit media instructions.
Operand-Size Override--The 66h prefix is used to form the opcodes of certain 64-bit media instructions. The prefix is ignored by all other 64-bit media instructions.
Segment Overrides--The 2Eh (CS), 36h (SS), 3Eh (DS), 26h (ES), 64h (FS), and 65h (GS) prefixes affect only operands in memory. In 64-bit mode, the contents of the CS, DS, ES, and SS segment registers are ignored.
REP--The F2h and F3h prefixes do not function as repeat prefixes for 64-bit media instructions. Instead, they are used to form the opcodes of certain 64-bit media instructions. The prefixes are ignored by all other 64-bit media instructions.
REX--The REX prefixes affect operands that reference a GPR or XMM register when running in 64-bit mode. They allow access to the full 64-bit width of any of the 16 extended GPRs and to any of the 16 extended XMM registers. The REX prefix also affects the FXSAVE and FXRSTOR instructions, in which it selects between two types of 512-byte memory-image format, as described in "Saving Media and x87 Processor State" in Volume 2. The prefix is ignored by all other 64-bit media instructions.

5.9.2 Special-Use and Reserved Prefixes

The following prefixes are used as opcode bytes in some 64-bit media instructions and are reserved in all other 64-bit media instructions:

Operand-Size Override--The 66h prefix.
REP--The F2h and F3h prefixes.
5.9.3 Prefixes That Cause Exceptions

The following prefixes cause an exception:

LOCK--The F0h prefix causes an invalid-opcode exception when used with 64-bit media instructions.

5.10 Feature Detection

Before executing 64-bit media instructions, software should determine whether the processor supports the technology by executing the CPUID instruction. "Feature Detection" on page 90 describes how software uses the CPUID instruction to detect feature support. For full support of the 64-bit media instructions documented here, the following features require detection:

MMX instructions, indicated by bit 23 of CPUID standard function 1 and extended function 8000_0001h.
3DNow! instructions, indicated by bit 31 of CPUID extended function 8000_0001h.
MMX extensions, indicated by bit 22 of CPUID extended function 8000_0001h.
3DNow! extensions, indicated by bit 30 of CPUID extended function 8000_0001h.
SSE instructions, indicated by bit 25 of CPUID standard function 1.
SSE2 instruction extensions, indicated by bit 26 of CPUID standard function 1.

Software may also wish to check for the following support, because the FXSAVE and FXRSTOR instructions execute faster than FSAVE and FRSTOR:

FXSAVE and FXRSTOR, indicated by bit 24 of CPUID standard function 1 and extended function 8000_0001h.

Software that runs in long mode should also check for the following support:

Long Mode, indicated by bit 29 of CPUID extended function 8000_0001h.

See "Processor Feature Identification" in Volume 2 for a full description of the CPUID instruction and its function codes.

If the FXSAVE and FXRSTOR instructions are to be used, the operating system must support these instructions by having set CR4.OSFXSR = 1.
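Testing the feature bits is a matter of masking the values that CPUID returns. Python cannot execute CPUID directly, so the sketch below assumes the caller has already obtained the two feature words (taken here to be the EDX results of standard function 1 and extended function 8000_0001h, which is an assumption about which register carries the bits) and passes them in as integers. The function name, parameter names, and dictionary keys are hypothetical, and only the MMX, 3DNow!, FXSAVE/FXRSTOR, and long-mode bits are decoded.

```python
def decode_media_features(std_fn1_edx, ext_fn8000_0001_edx):
    """Decode selected 64-bit media feature bits from two CPUID results."""

    def bit(value, n):
        # True when bit n of value is set.
        return bool((value >> n) & 1)

    return {
        # Bit 23 is reported in both the standard and extended functions.
        "mmx": bit(std_fn1_edx, 23) or bit(ext_fn8000_0001_edx, 23),
        # Bit 24 is likewise reported in both functions.
        "fxsave_fxrstor": bit(std_fn1_edx, 24) or bit(ext_fn8000_0001_edx, 24),
        "mmx_extensions": bit(ext_fn8000_0001_edx, 22),
        "3dnow": bit(ext_fn8000_0001_edx, 31),
        "3dnow_extensions": bit(ext_fn8000_0001_edx, 30),
        "long_mode": bit(ext_fn8000_0001_edx, 29),
    }
```

A caller would typically refuse to take a 3DNow! code path unless both the "mmx" and "3dnow" entries are true.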
If the MMX floating-point-to-integer data-conversion instructions (CVTPS2PI, CVTTPS2PI, CVTPD2PI, or CVTTPD2PI) are used, the operating system must support the FXSAVE and FXRSTOR instructions and SIMD floating-point exceptions (by having set CR4.OSXMMEXCPT = 1). For details, see "System-Control Registers" in Volume 2.

5.11 Exceptions

64-bit media instructions can generate two types of exceptions:

General-Purpose Exceptions, described below in "General-Purpose Exceptions"
x87 Floating-Point Exceptions (#MF), described in "x87 Floating-Point Exceptions (#MF)" on page 276

All exceptions that occur while executing 64-bit media instructions can be handled by legacy exception handlers used for general-purpose instructions and x87 floating-point instructions.

5.11.1 General-Purpose Exceptions

The sections below list the general-purpose exceptions generated and not generated by 64-bit media instructions. For a summary of the general-purpose exception mechanism, see "Interrupts and Exceptions" on page 104. For details about each exception and its potential causes, see "Exceptions and Interrupts" in Volume 2.

Exceptions Generated. The 64-bit media instructions can generate the following general-purpose exceptions:

#DB--Debug Exception (Vector 1)
#UD--Invalid-Opcode Exception (Vector 6)
#DF--Double-Fault Exception (Vector 8)
#SS--Stack Exception (Vector 12)
#GP--General-Protection Exception (Vector 13)
#PF--Page-Fault Exception (Vector 14)
#MF--x87 Floating-Point Exception-Pending (Vector 16)
#AC--Alignment-Check Exception (Vector 17)
#MC--Machine-Check Exception (Vector 18)
#XF--SIMD Floating-Point Exception (Vector 19)--Only by the CVTPS2PI, CVTTPS2PI, CVTPD2PI, and CVTTPD2PI instructions.
An invalid-opcode exception (#UD) can occur if a required CPUID feature flag is not set (see "Feature Detection" on page 273), or if an attempt is made to execute a 64-bit media instruction and the operating system has set the floating-point software-emulation (EM) bit in control register 0 to 1 (CR0.EM = 1).

For details on the system control-register bits, see "System-Control Registers" in Volume 2. For details on the machine-check mechanism, see "Machine Check Mechanism" in Volume 2. For details on #MF exceptions, see "x87 Floating-Point Exceptions (#MF)" on page 276.

Exceptions Not Generated. The 64-bit media instructions do not generate the following general-purpose exceptions:

#DE--Divide-By-Zero-Error Exception (Vector 0)
Non-Maskable-Interrupt Exception (Vector 2)
#BP--Breakpoint Exception (Vector 3)
#OF--Overflow Exception (Vector 4)
#BR--Bound-Range Exception (Vector 5)
#NM--Device-Not-Available Exception (Vector 7)
Coprocessor-Segment-Overrun Exception (Vector 9)
#TS--Invalid-TSS Exception (Vector 10)
#NP--Segment-Not-Present Exception (Vector 11)

For details on all general-purpose exceptions, see "Exceptions and Interrupts" in Volume 2.

5.11.2 x87 Floating-Point Exceptions (#MF)

The 64-bit media instructions do not generate x87 floating-point (#MF) exceptions as a consequence of their own computations. However, an #MF exception can occur during the execution of a 64-bit media instruction, due to a prior x87 floating-point instruction. Specifically, if an unmasked x87 floating-point exception is pending at the instruction boundary of the next 64-bit media instruction, the processor asserts the FERR# output signal. For details about the x87 floating-point exceptions and the FERR# output signal, see "x87 Floating-Point Exception Causes" on page 338.
5.12 Actions Taken on Executing 64-Bit Media Instructions

The MMX registers are mapped onto the low 64 bits of the 80-bit x87 floating-point physical registers, FPR0-FPR7, described in "Registers" on page 287. The MMX instructions do not use the x87 stack-addressing mechanism. However, 64-bit media instructions write certain values in the x87 top-of-stack pointer, tag bits, and high bits of the FPR0-FPR7 data registers. Specifically, the processor performs the following x87-related actions atomically with the execution of 64-bit media instructions:

Top-Of-Stack Pointer (TOP)--The processor clears the x87 top-of-stack pointer (bits 13-11 in the x87 status word register) to all 0s during the execution of every 64-bit media instruction, causing it to point to the mmx0 register.
Tag Bits--During the execution of every 64-bit media instruction, except the EMMS and FEMMS instructions, the processor changes the tag state for all eight MMX registers to full, as described below. In the case of EMMS and FEMMS, the processor changes the tag state for all eight MMX registers to empty, thus initializing the stack for an x87 floating-point procedure.
Bits 79-64--During the execution of every 64-bit media instruction that writes a result to an MMX register, the processor writes the result data to a 64-bit MMX register (the low 64 bits of the associated 80-bit x87 floating-point physical register) and sets the exponent and sign bits (the high 16 bits of the associated 80-bit x87 floating-point physical register) to all 1s. In the x87 environment, setting the high 16 bits to all 1s indicates that the contents of the low 64 bits are not finite numbers. Such a designation prevents an x87 floating-point instruction from interpreting the data as a finite x87 floating-point number.
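The three side effects listed above can be summarized with a toy state model. The Python sketch below is purely illustrative: the dictionary layout, the field names, and the function names are invented for this example, and the model tracks only the top-of-stack pointer, the internal full/empty tag state, and the 80-bit contents of FPR0-FPR7.

```python
MASK64 = (1 << 64) - 1


def new_x87_state(top=5):
    """Create a toy x87 state with all registers empty (hypothetical layout)."""
    return {"top": top, "tag_full": [False] * 8, "fpr": [0] * 8}


def media_write(state, reg, value64):
    """Model a 64-bit media instruction that writes MMX register reg."""
    state["top"] = 0                # TOP is cleared to all 0s
    state["tag_full"] = [True] * 8  # all eight tags become full
    # Low 64 bits hold the result; bits 79-64 are set to all 1s.
    state["fpr"][reg] = (0xFFFF << 64) | (value64 & MASK64)
    return state


def emms(state):
    """Model EMMS or FEMMS: clear TOP and mark all eight tags empty."""
    state["top"] = 0
    state["tag_full"] = [False] * 8
    return state
```

In this model, running media_write and then emms leaves the register data in place but marks every register empty, which is the condition an x87 procedure expects on entry.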
The rest of the x87 floating-point processor state (the entire x87 control-word register, the remaining fields of the status-word register, and the error pointers, namely the instruction pointer, data pointer, and last opcode register) is not affected by the execution of 64-bit media instructions.

The 2-bit tag fields defined by the x87 architecture for each x87 data register, and stored in the x87 tag-word register (also called the floating-point tag word, or FTW), characterize the contents of the MMX registers. The tag bits are visible to software only after an FSAVE or FNSAVE (but not FXSAVE) instruction, as described in "Saving Media and x87 Processor State" in Volume 2. Internally, however, the processor maintains only a one-bit representation of each 2-bit tag field. This single bit indicates whether the associated register is empty or full. Table 5-6 shows the mapping between the 1-bit internal tag (which is referred to in this chapter by its empty or full state) and the 2-bit architectural tag.

Table 5-6. Mapping Between Internal and Software-Visible Tag Bits

    Architectural State                            Binary Value   Internal State (Note 1)
    Valid                                          00             Full (0)
    Zero                                           01             Full (0)
    Special (NaN, infinity, denormal) (Note 2)     10             Full (0)
    Empty                                          11             Empty (1)

Notes:
1. For a more detailed description of this mapping, see "Deriving FSAVE Tag Field from FXSAVE Tag Field" in Volume 2.
2. The 64-bit media floating-point (3DNow!) instructions do not support NaNs, infinities, and denormals.

When the processor executes an FSAVE or FNSAVE (but not FXSAVE) instruction, it changes the internal 1-bit tag state to its 2-bit architectural tag by reading the data in all 80 bits of the physical data registers and using the mapping in Table 5-6.
For example, if the value in the high 16 bits of the 80-bit physical register indicates a NaN, the two tag bits for that register are changed to a binary value of 10 before the x87 status word is written to memory.

The tag bits have no effect on the execution of 64-bit media instructions or their interpretation of the contents of the MMX registers. However, the converse is not true: execution of 64-bit media instructions that write to an MMX register alters the tag bits and thus may affect execution of subsequent x87 floating-point instructions.

For a more detailed description of the mapping shown in Table 5-6, see "Deriving FSAVE Tag Field from FXSAVE Tag Field" in Volume 2 and its accompanying text.

5.13 Mixing Media Code with x87 Code

5.13.1 Mixing Code

Software may freely mix 64-bit media instructions (integer or floating-point) with 128-bit media instructions (integer or floating-point) and general-purpose instructions in a single procedure. However, before transitioning from a 64-bit media procedure (or a 128-bit media procedure that accesses an MMX register) to an x87 procedure, or to software that may eventually branch to an x87 procedure, software should clear the MMX state, as described immediately below.

5.13.2 Clearing MMX State

Software should separate 64-bit media procedures, 128-bit media procedures, or dynamic link libraries (DLLs) that access MMX registers from x87 floating-point procedures or DLLs by clearing the MMX state with the EMMS or FEMMS instruction before leaving a 64-bit media procedure, as described in "Exit Media State" on page 247.

The 64-bit media instructions and x87 floating-point instructions interpret the contents of their aliased MMX and x87 registers differently.
Because of this, software should not exchange register data between 64-bit media and x87 floatingpoint procedures, or use conditional branches at the end of loops that might jump to code of the other type. Software must not rely on the contents of the aliased MMX and x87 registers across such code-type transitions. If a transition to an x87 procedure occurs from a 64-bit media procedure that does not clear the MMX state, the x87 stack may overflow. 5.14 State-Saving In general, system software should save and restore MMXTM and x87 state between task switches or other interventions in the execution of 64-bit media procedures. Virtually all modern operating systems running on x86 processors--including such systems as Windows NTTM, UNIX, and OS/2--are preemptive multitasking operating systems that handle such saving and restoring of state properly across task switches, independently of hardware task-switch support. No changes are needed to the x87 register-saving performed by 32-bit operating systems, exception handlers, or device drivers. The same support provided in a 32-bit operating system's device-not-available (#NM) exception handler by any of the x87register save/restore instructions described below also supports saving and restoring the MMX registers. However, application procedures are also free to save and restore MMX and x87 state at any time they deem useful. 5.14.1 Saving and Restoring State Chapter 5: 64-Bit Media Programming 279 AMD 64-Bit Technology 24593--Rev. 3.09--September 2003 5.14.2 State-Saving Instructions Software running at any privilege level may save and restore 64bit media and x87 state by executing the FSAVE, FNSAVE, or FXSAVE instruction. Alternatively, software may use move instructions for saving only the contents of the MMX registers, rather than the complete 64-bit media and x87 state. For example, when saving MMX register values, use eight MOVQ instructions. FSAVE/FNSAVE and FRSTOR Instructions. 
The FSAVE, FNSAVE, and FRSTOR instructions are described in "Save and Restore 64-Bit Media and x87 State" on page 264. After saving state with FSAVE or FNSAVE, the tag bits for all MMX and x87 registers are changed to empty, which makes the registers available for a new procedure. Thus, FSAVE and FNSAVE also perform the state-clearing function of EMMS or FEMMS.

FXSAVE and FXRSTOR Instructions. The FXSAVE and FXRSTOR instructions are described in "Save and Restore 128-Bit, 64-Bit, and x87 State" on page 265. The FXSAVE and FXRSTOR instructions execute faster than FSAVE/FNSAVE and FRSTOR because they do not save and restore the x87 error pointers (described in "Pointers and Opcode State" on page 297) except in the relatively rare cases in which the exception-summary (ES) bit in the x87 status word (register image for FXSAVE, memory image for FXRSTOR) is set to 1, indicating that an unmasked x87 exception has occurred.

Unlike FSAVE and FNSAVE, however, FXSAVE does not alter the tag bits (thus, it does not perform the state-clearing function of EMMS or FEMMS). The state of the saved MMX and x87 registers is retained, thus indicating that the registers may still be valid (or whatever other value the tag bits indicated prior to the save). To invalidate the contents of the MMX and x87 registers after FXSAVE, software must explicitly execute an FINIT instruction. Also, FXSAVE (like FNSAVE) and FXRSTOR do not check for pending unmasked x87 floating-point exceptions. An FWAIT instruction can be used for this purpose.

For details about the FXSAVE and FXRSTOR memory formats, see "Saving Media and x87 Processor State" in Volume 2.

5.15 Performance Considerations

In addition to typical code optimization techniques, such as those affecting loops and the inlining of function calls, the following considerations may help improve the performance of application programs written with 64-bit media instructions.

These are implementation-independent performance considerations. Other considerations depend on the hardware implementation. For information about such implementation-dependent considerations and for more information about application performance in general, see the data sheets and the software-optimization guides relating to particular hardware implementations.

5.15.1 Use Small Operand Sizes

The performance advantage available with 64-bit media operations is to some extent a function of the data sizes operated upon. The smaller the data size, the more data elements can be packed into a single 64-bit vector. The parallelism of computation increases as the number of elements per vector increases.

5.15.2 Reorganize Data for Parallel Operations

Much of the performance benefit from the 64-bit media instructions comes from the parallelism inherent in vector operations. It can be advantageous to reorganize data before performing arithmetic operations so that its layout after reorganization maximizes the parallelism of the arithmetic operations.

The speed of memory access is particularly important for certain types of computation, such as graphics rendering, that depend on the regularity and locality of data-memory accesses. For example, in matrix operations, performance is high when operating on the rows of the matrix, because row bytes are contiguous in memory, but lower when operating on the columns of the matrix, because column bytes are not contiguous in memory and accessing them can result in cache misses. To improve performance for operations on such columns, the matrix should first be transposed. Such transpositions can, for example, be done using a sequence of unpacking or shuffle instructions.

5.15.3 Remove Branches

Branches can be replaced with 64-bit media instructions that simulate predicated execution or conditional moves, as described in "Branch Removal" on page 234. Where possible, break long dependency chains into several shorter dependency chains which can be executed in parallel. This is especially important for floating-point instructions because of their longer latencies.

5.15.4 Align Data

Data alignment is particularly important for performance when data written by one instruction is read by a subsequent instruction soon after the write, or when accessing streaming (non-temporal) data--data that will not be reused and therefore should not be cached. These cases may occur frequently in 64-bit media procedures.

Accesses to data stored at unaligned locations may benefit from on-the-fly software alignment or from repetition of data at different alignment boundaries, as required by different loops that process the data.

5.15.5 Organize Data for Cacheability

Pack small data structures into cache-line-size blocks. Organize frequently accessed constants and coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available memory bandwidth.

For data that will be used only once in a procedure, consider using non-cacheable memory. Accesses to such memory are not burdened by the overhead of cache protocols.

5.15.6 Prefetch Data

Media applications typically operate on large data sets. Because of this, they make intensive use of the memory bus.
Memory latency can be substantially reduced--especially for data that will be used only once--by prefetching such data into various levels of the cache hierarchy. Software can use the PREFETCHx instructions very effectively in such cases, as described in "Cache and Memory Management" on page 79.

Some of the best places to use prefetch instructions are inside loops that process large amounts of data. If the loop goes through less than one cache line of data per iteration, partially unroll the loop to obtain multiple iterations of the loop within a cache line. Try to use virtually all of the prefetched data. This usually requires unit-stride memory accesses--those in which all accesses are to contiguous memory locations.

5.15.7 Retain Intermediate Results in MMX Registers

Keep intermediate results in the MMX registers as much as possible, especially if the intermediate results are used shortly after they have been produced. Avoid spilling intermediate results to memory and reusing them shortly thereafter.

6 x87 Floating-Point Programming

This chapter describes the x87 floating-point programming model. This model supports all aspects of the legacy x87 floating-point model and complies with the IEEE 754 and 854 standards for binary floating-point arithmetic. In hardware implementations of the AMD64 architecture, support for specific features of the x87 programming model is indicated by the CPUID feature bits, as described in "Feature Detection" on page 336.

6.1 Overview

6.1.1 Origins

In 1979, AMD introduced the first floating-point coprocessor for microprocessors--the AM9511 arithmetic circuit. This coprocessor performed 32-bit floating-point operations under microprocessor control.
In 1980, AMD introduced the AM9512, which performed 64-bit floating-point operations. These coprocessors were second-sourced as the 8231 and 8232 coprocessors. Before then, programmers working with general-purpose microprocessors had to use much slower, vendor-supplied software libraries for their floating-point needs.

In 1985, the Institute of Electrical and Electronics Engineers published the IEEE Standard for Binary Floating-Point Arithmetic, also referred to as the ANSI/IEEE Std 754-1985 standard, or IEEE 754. This standard defines the data types, operations, and exception-handling methods that are the basis for the x87 floating-point technology implemented in the legacy x86 architecture. In 1987, the IEEE published a more general radix-independent version of that standard, called the ANSI/IEEE Std 854-1987 standard, or IEEE 854 for short. The AMD64 architecture complies with both the IEEE 754 and IEEE 854 standards.

6.1.2 Compatibility

x87 floating-point instructions can be executed in any of the architecture's operating modes. Existing x87 binary programs run in legacy and compatibility modes without modification. The support provided by the AMD64 architecture for such binaries is identical to that provided by legacy x86 architectures.

To run in 64-bit mode, x87 floating-point programs must be recompiled. The recompilation has no side effects on such programs, other than to make available the extended general-purpose registers and 64-bit virtual address space.

6.2 Capabilities

Floating-point software is typically written to manipulate numbers that are very large or very small, that require a high degree of precision, or that result from complex mathematical operations such as transcendentals.
Applications that take advantage of floating-point operations include geometric calculations for graphics acceleration, scientific, statistical, and engineering applications, and process control.

The advantages of using x87 floating-point instructions include:

- Representation of all numbers in common IEEE-754/854 formats, ensuring repeatability of results across all platforms that conform to IEEE-754/854 standards.
- Availability of separate floating-point registers. Depending on the hardware implementation of the architecture, this may allow execution of x87 floating-point instructions in parallel with execution of general-purpose and 128-bit media instructions.
- Instructions that compute absolute value, change-of-sign, round-to-integer, partial remainder, and square root.
- Instructions that compute transcendental values, including 2^x - 1, cosine, partial arc tangent, partial tangent, sine, sine with cosine, y * log2(x), and y * log2(x + 1). The cosine, partial arc tangent, sine, and sine with cosine instructions use angular values expressed in radians for operands and results.
- Instructions that load common constants, such as log2(e), log2(10), log10(2), loge(2), Pi, 1, and 0.

x87 instructions operate on data in three floating-point formats--32-bit single-precision, 64-bit double-precision, and 80-bit double-extended-precision (sometimes called extended precision)--as well as integer and 80-bit packed-BCD formats.

x87 instructions carry out all computations using the 80-bit double-extended-precision format. When an x87 instruction reads a number from memory in 80-bit double-extended-precision format, the number can be used directly in computations, without conversion. When an x87 instruction reads a number in a format other than double-extended-precision format, the processor first converts the number into double-extended-precision format.
The processor can convert numbers back to specific formats, or leave them in double-extended-precision format when writing them to memory.

Most x87 operations for addition, subtraction, multiplication, and division specify two source operands, the first of which is replaced by the result. Instructions for subtraction and division have reverse forms which swap the ordering of operands.

6.3 Registers

Operands for the x87 instructions are located in the x87 registers or memory. Figure 6-1 shows an overview of the x87 registers.

[Figure 6-1. x87 Registers: x87 data registers fpr0-fpr7 (bits 79-0); instruction pointer (rIP) and data pointer (rDP) (bits 63-0); control word, status word, and tag word (bits 15-0); opcode (bits 10-0)]

These registers include eight 80-bit data registers, three 16-bit registers that hold the x87 control word, status word, and tag word, two 64-bit registers that hold instruction and data pointers, and an 11-bit register that holds a permutation of an x87 opcode.

6.3.1 x87 Data Registers

Figure 6-2 shows the eight 80-bit data registers in more detail. Typically, x87 instructions reference these registers as a stack. x87 instructions store operands only in these 80-bit registers or in memory. They do not (with two exceptions) access the GPR registers, and they do not access the XMM registers.

[Figure 6-2. x87 Physical and Stack Registers: the TOP field (bits 13-11 of the x87 status word) selects the physical register that is ST(0); as drawn, ST(0)-ST(5) map onto fpr2-fpr7 and ST(6)-ST(7) map onto fpr0-fpr1]

Stack Organization. The bank of eight physical data registers, FPR0-FPR7, is organized internally as a stack, ST(0)-ST(7). The stack functions like a circular modulo-8 buffer. The stack top can be set by software to start at any register position in the bank.
Many instructions access the top of stack as well as individual registers relative to the top of stack.

Stack Pointer. Bits 13-11 of the x87 status word ("x87 Status Word Register" on page 289) are the top-of-stack pointer (TOP). The TOP specifies the mapping of the stack registers onto the physical registers. The TOP contains the physical-register index of the location of the top of stack, ST(0). Instructions that load operands from memory into an x87 register first decrement the stack pointer and then copy the operand (often with conversion to the double-extended-precision format) from memory into the decremented top-of-stack register. Instructions that store an operand from an x87 register to memory and pop the stack copy the operand (often with conversion from the double-extended-precision format) in the top-of-stack register to memory and then increment the stack pointer.

Figure 6-2 shows the mapping between stack registers and physical registers when the TOP has the value 2. Modulo-8 wraparound addressing is used. Pushing a new element onto this stack--for example with the FLDZ (floating-point load +0.0) instruction--decrements the TOP to 1, so that ST(0) refers to FPR1, and the new top-of-stack is loaded with +0.0.

The architecture provides alternative versions of many instructions that either modify or do not modify the TOP as a side effect. For example, FADDP (floating-point add and pop) behaves exactly like FADD (floating-point add), except that it pops the stack after completion. Programs that use the x87 registers as a flat register file rather than as a stack would use non-popping versions of instructions to ensure that the TOP remains unchanged. However, loads (pushes) without corresponding pops can cause the stack to overflow, which occurs when a value is pushed or loaded into an x87 register that is not empty (as indicated by the register's tag bits).
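The modulo-8 mapping and the overflow condition described above can be modeled with a short Python sketch. The class `X87Stack` and its method names are hypothetical and exist only for illustration. Raising a Python exception is a stand-in for the stack-fault reporting that the processor performs, and the model does not attempt to reproduce what the hardware writes to the register file when a fault occurs.

```python
class X87Stack:
    """Toy model of physical registers FPR0-FPR7 and the 3-bit TOP pointer."""

    def __init__(self, top=0):
        self.fpr = [None] * 8   # None stands for a register tagged empty
        self.top = top % 8

    def phys(self, i):
        """Index of the physical register that ST(i) currently maps to."""
        return (self.top + i) % 8

    def push(self, value):
        """Load: decrement TOP modulo 8, then write the new ST(0)."""
        new_top = (self.top - 1) % 8
        if self.fpr[new_top] is not None:
            raise OverflowError("stack overflow: target register is not empty")
        self.top = new_top
        self.fpr[new_top] = value

    def pop(self):
        """Store-and-pop: read ST(0), tag it empty, increment TOP modulo 8."""
        value = self.fpr[self.top]
        if value is None:
            raise IndexError("stack underflow: ST(0) is empty")
        self.fpr[self.top] = None
        self.top = (self.top + 1) % 8
        return value


stack = X87Stack(top=2)                       # the situation drawn in Figure 6-2
print([stack.phys(i) for i in range(8)])      # ST(0)..ST(7) -> FPR indices
stack.push(0.0)                               # like FLDZ: TOP goes from 2 to 1
print(stack.top, stack.phys(0))
```

With TOP = 2 the first line printed is `[2, 3, 4, 5, 6, 7, 0, 1]`, and after the push both TOP and the physical index of ST(0) are 1, matching the FLDZ example in the text. Eight pushes without a pop fill every register, so a ninth push raises the model's overflow error.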
To prevent overflow, the FXCH (floating-point exchange) instruction can be used to access stack registers, giving the appearance of a flat register file, but all x87 programs must be aware of the register file's stack organization.

The FINCSTP and FDECSTP instructions can be used to increment and decrement, respectively, the TOP, modulo-8, allowing the stack top to wrap around to the bottom of the eight-register file when incremented beyond the top of the file, or to wrap around to the top of the register file when decremented beyond the bottom of the file. Neither the x87 tag word nor the contents of the floating-point stack itself is updated when these instructions are used.

6.3.2 x87 Status Word Register

The 16-bit x87 status word register contains information about the state of the floating-point unit, including the top-of-stack pointer (TOP), four condition-code bits, exception-summary flag, stack-fault flag, and six x87 floating-point exception flags. Figure 6-3 on page 290 shows the format of this register. All bits can be read and written; however, values written to the B and ES bits (bits 15 and 7) are ignored.

The FRSTOR and FXRSTOR instructions load the status word from memory. The FSTSW, FNSTSW, FSAVE, FNSAVE, FXSAVE, FSTENV, and FNSTENV instructions store the status word to memory. The FCLEX and FNCLEX instructions clear the exception flags. The FINIT and FNINIT instructions clear all bits in the status word.
  Bits    Symbol   Description
  15      B        x87 Floating-Point Unit Busy
  14      C3       Condition Code
  13-11   TOP      Top of Stack Pointer (000 = FPR0 ... 111 = FPR7)
  10      C2       Condition Code
  9       C1       Condition Code
  8       C0       Condition Code
  7       ES       Exception Status
  6       SF       Stack Fault
  x87 Exception Flags:
  5       PE       Precision Exception
  4       UE       Underflow Exception
  3       OE       Overflow Exception
  2       ZE       Zero-Divide Exception
  1       DE       Denormalized Operation Exception
  0       IE       Invalid Operation Exception

Figure 6-3. x87 Status Word Register

The bits in the x87 status word are defined immediately below, starting with bit 0. The six exception flags (IE, DE, ZE, OE, UE, PE) plus the stack fault (SF) flag are sticky bits. Once set by the processor, such a bit remains set until software clears it. For details about the causes of x87 exceptions indicated by bits 6-0, see "x87 Floating-Point Exception Causes" on page 338. For details about the masking of x87 exceptions, see "x87 Floating-Point Exception Masking" on page 344.

Invalid-Operation Exception (IE). Bit 0. The processor sets this bit to 1 when an invalid-operation exception occurs. These exceptions are caused by many types of errors, such as an invalid operand or a stack fault. When a stack fault causes an IE exception, the stack fault (SF) exception bit is also set.

Denormalized-Operand Exception (DE). Bit 1. The processor sets this bit to 1 when one of the source operands of an instruction is in denormalized form. (See "Denormalized (Tiny) Numbers" on page 306.)

Zero-Divide Exception (ZE). Bit 2. The processor sets this bit to 1 when a non-zero number is divided by zero.

Overflow Exception (OE). Bit 3. The processor sets this bit to 1 when the absolute value of a rounded result is larger than the largest representable normalized floating-point number for the destination format. (See "Normalized Numbers" on page 306.)

Underflow Exception (UE). Bit 4. The processor sets this bit to 1 when the absolute value of a rounded non-zero result is too small to be represented as a normalized floating-point number for the destination format. (See "Normalized Numbers" on page 306.)

The underflow exception has an unusual behavior. When the exception is masked by the UM bit (bit 4 of the x87 control word), the processor reports a UE exception only if the UE occurs together with a precision exception (PE).

Precision Exception (PE). Bit 5. The processor sets this bit to 1 when a floating-point result, after rounding, differs from the infinitely precise result and thus cannot be represented exactly in the specified destination format. The PE exception is also called the inexact-result exception.

Stack Fault (SF). Bit 6. The processor sets this bit to 1 when a stack overflow (due to a push or load into a non-empty stack register) or stack underflow (due to referencing an empty stack register) occurs in the x87 stack-register file. When either of these conditions occurs, the processor also sets the invalid-operation exception (IE) flag, and the processor distinguishes overflow from underflow by writing the condition-code 1 (C1) bit (C1 = 1 for overflow, C1 = 0 for underflow). Unlike the flags for the other x87 exceptions, the SF flag does not have a corresponding mask bit in the x87 control word.

If, subsequent to the instruction that caused the SF bit to be set, a second invalid-operation exception (IE) occurs due to an invalid operand in an arithmetic instruction (i.e., not a stack fault), and if software has not cleared the SF bit between the two instructions, the SF bit will remain set.

Exception Status (ES). Bit 7. The processor calculates the value of this bit at each instruction boundary and sets the bit to 1 when one or more unmasked floating-point exceptions occur. If the ES bit has already been set by the action of some prior instruction, the processor invokes the #MF exception handler when the next non-control x87 or 64-bit media instruction is executed. (See "Control" on page 331 for a definition of control instructions.)

The ES bit can be written, but the written value is ignored. Like the SF bit, the ES bit does not have a corresponding mask bit in the x87 control word.

Top-of-Stack Pointer (TOP). Bits 13-11. The TOP contains the physical register index of the location of the top of stack, ST(0). It thus specifies the mapping of the x87 stack registers, ST(0)-ST(7), onto the x87 physical registers, FPR0-FPR7. The processor changes the TOP during any instruction that pushes or pops the stack. For details on how the stack works, see "Stack Organization" on page 288.

Condition Codes (C3-C0). Bits 14 and 10-8. The processor sets these bits according to the result of arithmetic, compare, and other instructions. In certain cases, other status-word flags can be used together with the condition codes to determine the result of an operation, including stack overflow, stack underflow, sign, least-significant quotient bits, last-rounding direction, and out-of-range operand. For details on how each instruction sets the condition codes, see "x87 Floating-Point Instruction Reference" in Volume 5.

x87 Floating-Point Unit Busy (B). Bit 15. The processor sets the value of this bit equal to the calculated value of the ES bit, bit 7. This bit can be written, but the written value is ignored. The bit is included only for backward-compatibility with the 8087 coprocessor, in which it indicates that the coprocessor is busy.

For further details about the x87 floating-point exceptions, see "x87 Floating-Point Exception Causes" on page 338.
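As a quick way to check one's reading of the status-word layout, the following Python sketch splits a 16-bit status-word image (for example, one stored by FNSTSW) into the fields defined above. The function names `decode_status_word` and `stack_fault_kind` and the dictionary layout are illustrative assumptions; only the bit positions come from Figure 6-3.

```python
# Bit positions of the sticky flags, taken from Figure 6-3.
FLAG_BITS = {"IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5, "SF": 6}


def decode_status_word(fsw):
    """Split a 16-bit x87 status-word image into its fields."""
    fsw &= 0xFFFF
    return {
        "flags": [name for name, bit in FLAG_BITS.items() if (fsw >> bit) & 1],
        "ES": (fsw >> 7) & 1,
        "C0": (fsw >> 8) & 1,
        "C1": (fsw >> 9) & 1,
        "C2": (fsw >> 10) & 1,
        "TOP": (fsw >> 11) & 0b111,
        "C3": (fsw >> 14) & 1,
        "B": (fsw >> 15) & 1,
    }


def stack_fault_kind(fsw):
    """Classify a stack fault from SF (bit 6) and C1 (bit 9).

    Only meaningful when the image was stored right after the faulting
    instruction, because SF is sticky and C1 is rewritten by later
    instructions.
    """
    if not (fsw >> 6) & 1:
        return None
    return "overflow" if (fsw >> 9) & 1 else "underflow"


image = 0x3A41   # hypothetical image: IE and SF set, C1 = 1, TOP = 7
print(decode_status_word(image))
print(stack_fault_kind(image))
```

For the hypothetical image 3A41h, the decoder reports the IE and SF flags, C1 = 1, and TOP = 7, and the fault is classified as an overflow.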
6.3.3 x87 Control Word Register

The 16-bit x87 control word register allows software to manage certain x87 processing options, including rounding, precision, and masking of the six x87 floating-point exceptions (any of which is reported as an #MF exception). Figure 6-4 shows the format of the control word. All bits, except reserved bits, can be read and written.

The FLDCW, FRSTOR, and FXRSTOR instructions load the control word from memory. The FSTCW, FNSTCW, FSAVE, FNSAVE, and FXSAVE instructions store the control word to memory. The FINIT and FNINIT instructions initialize the control word with the value 037Fh, which specifies round-to-nearest, all exceptions masked, and double-extended precision (64-bit significand).

  Bits    Symbol   Description
  15-13            Reserved
  12      Y        Infinity Bit (80287 compatibility)
  11-10   RC       Rounding Control
  9-8     PC       Precision Control
  7-6              Reserved
  #MF Exception Masks:
  5       PM       Precision Exception Mask
  4       UM       Underflow Exception Mask
  3       OM       Overflow Exception Mask
  2       ZM       Zero-Divide Exception Mask
  1       DM       Denormalized Operation Exception Mask
  0       IM       Invalid Operation Exception Mask

  Rounding-Control (RC) Specification
  00b = Round to nearest (default)
  01b = Round down
  10b = Round up
  11b = Round toward zero

  Precision-Control (PC) Specification
  00b = Single Precision
  01b = reserved
  10b = Double Precision
  11b = Double-Extended Precision (default)

Figure 6-4. x87 Control Word Register

Starting from bit 0, the bits are:

Exception Masks (PM, UM, OM, ZM, DM, IM). Bits 5-0. Software can set these bits to mask, or clear these bits to unmask, the corresponding six types of x87 floating-point exceptions (PE, UE, OE, ZE, DE, IE), which are reported in the x87 status word as described in "x87 Status Word Register" on page 289. A bit masks its exception type when set to 1, and unmasks it when cleared to 0.
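To make the field layout concrete, the following Python sketch decodes a control-word image such as the FINIT default of 037Fh and shows how the RC field can be replaced without disturbing the other bits. The names `decode_control_word` and `with_rounding` are hypothetical helpers written for this illustration. They operate on a plain integer and do not touch the processor's control word; loading a modified value would still require FLDCW.

```python
# Field encodings as listed in Figure 6-4.
ROUNDING = {0b00: "round to nearest", 0b01: "round down",
            0b10: "round up", 0b11: "round toward zero"}
PRECISION = {0b00: "single", 0b01: "reserved",
             0b10: "double", 0b11: "double-extended"}
MASK_BITS = {"IM": 0, "DM": 1, "ZM": 2, "OM": 3, "UM": 4, "PM": 5}


def decode_control_word(fcw):
    """Split a 16-bit x87 control-word image into its fields."""
    fcw &= 0xFFFF
    return {
        "masked": [name for name, bit in MASK_BITS.items() if (fcw >> bit) & 1],
        "PC": PRECISION[(fcw >> 8) & 0b11],
        "RC": ROUNDING[(fcw >> 10) & 0b11],
        "Y": (fcw >> 12) & 1,
    }


def with_rounding(fcw, rc):
    """Return a copy of fcw whose RC field (bits 11-10) is set to rc."""
    cleared = fcw & 0xFFFF & ~(0b11 << 10)
    return cleared | ((rc & 0b11) << 10)


default = 0x037F                                 # value written by FINIT/FNINIT
print(decode_control_word(default))
print(hex(with_rounding(default, 0b11)))         # select round toward zero
```

Decoding 037Fh yields all six exceptions masked, double-extended precision, and round to nearest, which agrees with the description of the FINIT default above. Selecting round toward zero changes only bits 11-10 and gives 0F7Fh.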
Masking a type of exception causes the processor to handle all subsequent instances of the exception type in a default way. Unmasking the exception type causes the processor to branch to the #MF exception service routine when an exception occurs. For details about the processor's responses to masked and unmasked exceptions, see "x87 Floating-Point Exception Causes" on page 338.

Precision Control (PC). Bits 9-8. Software can set this field to specify the precision of x87 floating-point calculations, as shown in Table 6-1. Details on each precision are given in "Data Types" on page 300. The default precision is double-extended precision. Precision control affects only the FADDx, FSUBx, FMULx, FDIVx, and FSQRT instructions. For further details on precision, see "Precision" on page 313.

Table 6-1. Precision Control (PC) Summary

  PC Value (binary)   Data Type
  00                  Single precision
  01                  reserved
  10                  Double precision
  11                  Double-extended precision (default)

Rounding Control (RC). Bits 11-10. Software can set this field to specify how the results of x87 instructions are to be rounded. Table 6-2 on page 295 lists the four rounding modes, which are defined by the IEEE 754 standard.

Table 6-2. Types of Rounding

  RC Value       Mode                Type of Rounding
  00 (default)   Round to nearest    The rounded result is the representable value closest to the
                                     infinitely precise result. If equally close, the even value
                                     (with least-significant bit 0) is taken.
  01             Round down          The rounded result is closest to, but no greater than, the
                                     infinitely precise result.
  10             Round up            The rounded result is closest to, but no less than, the
                                     infinitely precise result.
  11             Round toward zero   The rounded result is closest to, but no greater in absolute
                                     value than, the infinitely precise result.

Round-to-nearest is the default rounding mode.
It provides a statistically unbiased estimate of the true result, and is suitable for most applications. Rounding modes apply to all arithmetic operations except comparison and remainder. They have no effect on operations that produce not-a-number (NaN) results. For further details on rounding, see "Rounding" on page 314. Infinity Bit (Y). Bit 12. This bit is obsolete. It can be read and written, but the value has no meaning. On pre-386 processor implementations, the bit specified the affine (Y = 1) or projective (Y = 0) infinity. The AMD64 architecture uses only the affine infinity, which specifies distinct positive and negative infinity values. 6.3.4 x87 Tag Word Register The x87 tag word register contains a 2-bit tag field for each x87 physical data register. These tag fields characterize the register's data. Figure 6-5 on page 296 shows the format of the tag word. Chapter 6: x87 Floating-Point Programming 295 AMD 64-Bit Technology 24593--Rev. 3.09--September 2003 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 TAG TAG TAG TAG TAG TAG TAG TAG (FPR7) (FPR6) (FPR5) (FPR4) (FPR3) (FPR2) (FPR1) (FPR1) Tag Values 00 = Valid 01 = Zero 10 = Special 11 = Empty Figure 6-5. x87 Tag Word Register In the memory image saved by the instructions described in "x87 Environment" on page 298, each x87 physical data register has two tag bits with are encoded according to the Tag Values shown in Figure 6-5. Internally, the hardware may maintain only a single bit that indicates whether the associated register is empty or full. The mapping between such a 1-bit internal tag and the 2-bit software-visible architectural representation saved in memory is shown in Table 6-3. In such a mapping, whenever software saves the tag word, the processor expands the internal 1-bit tag state to the 2-bit architectural representation by examining the contents of the x87 registers, as described in "128-Bit, 64-Bit, and x87 Programming" in Volume 2. Table 6-3. 
Mapping Between Internal and Software-Visible Tag Bits Architectural State (Software-Visible) State Valid Zero Special (NaN, infinity, denormal, or unsupported) Empty Bit Value 00 01 10 11 Empty Full Hardware State The FINIT and FNINIT instructions write the tag word so that it specifies all floating-point registers as empty. Execution of 64bit media instructions that write to an MMXTM register alter the tag bits by setting all the registers to full, and thus they may 296 Chapter 6: x87 Floating-Point Programming 24593--Rev. 3.09--September 2003 AMD 64-Bit Technology affect execution of subsequent x87 floating-point instructions. For details, see "Mixing Media Code with x87 Code" on page 278. 6.3.5 Pointers and Opcode State The x87 instruction pointer, instruction opcode, and data pointer are part of the x87 environment (non-data processor state) that is loaded and stored by the instructions described in "x87 Environment" on page 298. Figure 6-6 illustrates the pointer and opcode state. Execution of all x87 instructions-- except control instructions (see "Control" on page 331)-- causes the processor to store this state in hardware. For convenience, the pointer and opcode state is illustrated here as registers. However, the manner of storing this state in hardware depends on the hardware implementation. The AMD64 architecture specifies only the software-visible state that is saved in memory. (See "Media and x87 Processor State" in Volume 2 for details of the memory images.) Instruction Pointer (rIP) Data Pointer 63 Opcode 10 0 513-138.eps Figure 6-6. x87 Pointers and Opcode State Last x87 Instruction Pointer. Th e c o n t e n t s o f t h e 6 4 - b i t l a s t instruction pointer depends on the operating mode, as follows: 64-Bit Mode--The pointer contains the 64-bit RIP offset of the last non-control x87 instruction executed (see "Control" on page 331 for a definition of control instructions). The 16bit code-segment (CS) selector is not saved. 
(It is the operating system's responsibility to ensure that the 64-bit state-restoration is executed in the same code segment as the preceding 64-bit state-store.)
- Legacy Protected Mode, Legacy Virtual-8086 Mode, and Compatibility Mode--The pointer contains the 16-bit code-segment (CS) selector and the 16-bit or 32-bit eIP offset, depending on the effective operand size, of the last non-control x87 instruction executed.
- Legacy Real Mode--The pointer contains the 16-bit or 32-bit eIP offset of the last non-control x87 instruction executed.

The FINIT and FNINIT instructions clear all bits in this pointer.

Last x87 Opcode. The 11-bit instruction opcode holds a permutation of the two-byte instruction opcode from the last non-control x87 floating-point instruction executed by the processor. The opcode field is formed as follows:
- Opcode Field[10:8] = First x87-opcode byte[2:0].
- Opcode Field[7:0] = Second x87-opcode byte[7:0].

For example, the x87 opcode D9 F8 (floating-point partial remainder) is stored as 001_1111_1000b. The low-order three bits of the first opcode byte, D9 (1101_1001b), are stored in bits 10-8. The second opcode byte, F8 (1111_1000b), is stored in bits 7-0. The high-order five bits of the first opcode byte (1101_1b) are not needed because they are identical for all x87 instructions.

Last x87 Data Pointer. The contents of the 64-bit data pointer depend on the operating mode, as follows:
- 64-Bit Mode--The pointer contains the 64-bit offset of the last memory operand accessed by the last non-control x87 instruction executed.
- Legacy Protected Mode, Legacy Virtual-8086 Mode, and Compatibility Mode--The pointer contains the 16-bit data-segment (DS) selector and the 16-bit or 32-bit offset of the last memory operand accessed by the last non-control x87 instruction executed.
- Legacy Real Mode--The pointer contains the 16-bit or 32-bit offset of the last memory operand accessed by the last non-control x87 instruction executed.

The FINIT and FNINIT instructions clear all bits in this pointer.

6.3.6 x87 Environment

The x87 environment--or non-data processor state--includes the following processor state:
- x87 control word register (FCW)
- x87 status word register (FSW)
- x87 tag word (FTW)
- last x87 instruction pointer
- last x87 data pointer
- last x87 opcode

Table 6-4 lists the x87 instructions that can access this x87 processor state.

Table 6-4. Instructions that Access the x87 Environment

Instruction   Description                                State Accessed
FINIT         Floating-Point Initialize                  Entire Environment
FNINIT        Floating-Point No-Wait Initialize          Entire Environment
FNSAVE        Floating-Point No-Wait Save State          Entire Environment
FRSTOR        Floating-Point Restore State               Entire Environment
FSAVE         Floating-Point Save State                  Entire Environment
FLDCW         Floating-Point Load x87 Control Word       x87 Control Word
FNSTCW        Floating-Point No-Wait Store Control Word  x87 Control Word
FSTCW         Floating-Point Store Control Word          x87 Control Word
FNSTSW        Floating-Point No-Wait Store Status Word   x87 Status Word
FSTSW         Floating-Point Store Status Word           x87 Status Word
FLDENV        Floating-Point Load x87 Environment        Environment, Not Including x87 Data Registers
FNSTENV       Floating-Point No-Wait Store Environment   Environment, Not Including x87 Data Registers
FSTENV        Floating-Point Store Environment           Environment, Not Including x87 Data Registers

For details on how the x87 environment is stored in memory, see "Media and x87 Processor State" in Volume 2.

6.3.7 Floating-Point Emulation (CR0.EM)

The operating system can set the floating-point software-emulation (EM) bit in control register 0 (CR0) to 1 to allow software emulation of x87 instructions. If the operating system has set CR0.EM = 1, the processor does not execute x87 instructions. Instead, a device-not-available exception (#NM) occurs whenever an attempt is made to execute such an instruction, except that setting CR0.EM to 1 does not cause an #NM exception when the WAIT or FWAIT instruction is executed. For details, see "System-Control Registers" in Volume 2.

6.4 Operands

6.4.1 Operand Addressing

Operands for x87 instructions are referenced by the opcodes. Operands can be located either in x87 registers or memory. Immediate operands are not used in x87 floating-point instructions, and I/O ports cannot be directly addressed by x87 floating-point instructions.

Memory Operands. Most x87 floating-point instructions can take source operands from memory, and a few of the instructions can write results to memory. The following sections describe the methods and conditions for addressing memory operands:
- "Memory Addressing" on page 16 describes the general methods and conditions for addressing memory operands.
- "Instruction Prefixes" on page 335 describes the use of address-size instruction overrides by 64-bit media instructions.

Register Operands. Most x87 floating-point instructions can read source operands from and write results to x87 registers. Most instructions access the ST(0)-ST(7) register stack. For a few instructions, the register types also include the x87 control word register, the x87 status word register, and (for FSTSW and FNSTSW) the AX general-purpose register.

6.4.2 Data Types

Figure 6-7 on page 301 shows register images of the x87 data types. These include three scalar floating-point formats (80-bit double-extended-precision, 64-bit double-precision, and 32-bit single-precision), three scalar signed-integer formats (quadword, doubleword, and word), and an 80-bit packed binary-coded decimal (BCD) format.
Although Figure 6-7 shows register images of the data types, the three signed-integer data types can exist only in memory. All data types are converted into an 80-bit format when they are loaded into an x87 register.

[Figure 6-7. x87 Data Types: register images of the floating-point formats (double-extended precision, double precision, single precision), the signed-integer formats (quadword, doubleword, word), and the packed-decimal BCD format]

Floating-Point Data Types. The floating-point data types, shown in Figure 6-8 on page 302, include 32-bit single precision, 64-bit double precision, and 80-bit double-extended precision. The default precision is double-extended precision, and all operands loaded into registers are converted into double-extended precision format. All x87 instructions (except FADDx, FSUBx, FSUBRx, FMULx, FDIVx, FDIVRx, and FSQRT) operate on register values in double-extended precision format. The FADDx, FSUBx, FSUBRx, FMULx, FDIVx, FDIVRx, and FSQRT instructions operate on floating-point data types in the precision specified by the precision control (PC) field in the x87 control word.

All three floating-point formats are compatible with the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754 and 854), except for the rounding effects caused by the processor's internal representation of values in double-extended-precision format.

Format                      Sign (S)   Biased Exponent   Integer Bit (I)   Fraction
Single Precision            bit 31     bits 30-23        implied           bits 22-0
Double Precision            bit 63     bits 62-52        implied           bits 51-0
Double-Extended Precision   bit 79     bits 78-64        bit 63            bits 62-0

In the single-precision and double-precision formats the fraction field is also called the significand. In the double-extended-precision format the significand consists of the integer bit and the fraction.

Figure 6-8. x87 Floating-Point Data Types

All of the floating-point data types consist of a sign (0 = positive, 1 = negative), a biased exponent (base-2), and a significand, which represents the integer and fractional parts of the number. The integer bit (also called the J bit) is either implied (called a hidden integer bit) or explicit, depending on the data type. The value of an implied integer bit can be inferred from number encodings, as described in "Number Encodings" on page 308. The bias of the exponent is a constant which makes the exponent always positive and allows reciprocation, without overflow, of the smallest normalized number representable by that data type.

Specifically, the data types are formatted as follows:
- Single-Precision Format--This format includes a 1-bit sign, an 8-bit biased exponent whose bias is 127, and a 23-bit significand. The integer bit is implied, making a total of 24 bits in the significand.
- Double-Precision Format--This format includes a 1-bit sign, an 11-bit biased exponent whose bias is 1023, and a 52-bit significand. The integer bit is implied, making a total of 53 bits in the significand.
- Double-Extended-Precision Format--This format includes a 1-bit sign, a 15-bit biased exponent whose bias is 16,383, and a 64-bit significand, which includes one explicit integer bit.

Table 6-5 shows the range of finite values representable by the three x87 floating-point data types.

Table 6-5. Range of Finite Floating-Point Values

                                        Range of Finite Values (1)
Data Type                   Precision   Base 2                               Base 10
Single Precision            24 bits     2^-126 to 2^127 * (2 - 2^-23)        1.17 * 10^-38 to +3.40 * 10^38
Double Precision            53 bits     2^-1022 to 2^1023 * (2 - 2^-52)      2.23 * 10^-308 to +1.79 * 10^308
Double-Extended Precision   64 bits     2^-16382 to 2^16383 * (2 - 2^-63)    3.37 * 10^-4932 to +1.18 * 10^4932

Note:
1. See "Number Representation" on page 305.
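The single-precision layout described above (1-bit sign, 8-bit biased exponent with a bias of 127, 23-bit fraction) can be illustrated with a minimal Python sketch. The function names `decode_single` and `largest_single_normal` are hypothetical helpers introduced only for this illustration; they are not part of the architecture or of any library.

```python
import struct

SINGLE_BIAS = 127  # exponent bias of the single-precision format


def decode_single(value: float):
    """Split a value, stored as IEEE single precision, into its three fields.

    Returns (sign, biased_exponent, fraction). The hidden integer bit is
    not part of the stored image, so it does not appear in the result.
    """
    # Pack as a 32-bit float, then reinterpret the same bytes as an integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    sign = bits >> 31                      # bit 31
    biased_exponent = (bits >> 23) & 0xFF  # bits 30-23
    fraction = bits & 0x7FFFFF             # bits 22-0
    return sign, biased_exponent, fraction


def largest_single_normal() -> float:
    """Upper bound of the single-precision range listed in Table 6-5."""
    return 2.0 ** 127 * (2 - 2.0 ** -23)


if __name__ == "__main__":
    sign, exp, frac = decode_single(largest_single_normal())
    print(sign, hex(exp), hex(frac), "unbiased exponent:", exp - SINGLE_BIAS)
```

The same field extraction applies to the double-precision format with an 11-bit exponent, a 52-bit fraction, and a bias of 1023.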
For example, in the single-precision format, the largest normal number representable has an exponent of FEh and a significand of 7FFFFFh, with a numerical value of 2^127 * (2 - 2^-23). Results that overflow above the maximum representable value return either the maximum representable normalized number (see "Normalized Numbers" on page 306) or infinity, with the sign of the true result, depending on the rounding mode specified in the rounding control (RC) field of the x87 control word. Results that underflow below the minimum representable value return either the minimum representable normalized number or a denormalized number (see "Denormalized (Tiny) Numbers" on page 306), with the sign of the true result, or a result determined by the x87 exception handler, depending on the rounding mode, precision mode, and underflow-exception mask (UM) in the x87 control word (see "Unmasked Responses" on page 348).

Integer Data Type. The integer data types, shown in Figure 6-7 on page 301, include two's-complement 16-bit word, 32-bit doubleword, and 64-bit quadword. These data types are used in x87 instructions that convert signed integer operands into floating-point values. The integers can be loaded from memory into x87 registers and stored from x87 registers into memory. The data types cannot be moved between x87 registers and other registers.

For details on the format and number-representation of the integer data types, see "Data Types" on page 41.

Packed-Decimal Data Type. The 80-bit packed-decimal data type, shown in Figure 6-9, represents an 18-digit decimal integer using the binary-coded decimal (BCD) format. Each of the 18 digits is a 4-bit representation of an integer. The 18 digits use a total of 72 bits. The next-higher seven bits in the 80-bit format are reserved (ignored on loads, zeros on stores). The high bit (bit 79) is a sign bit.
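As an illustration of the packed-decimal layout just described, the following Python sketch builds and reads the 80-bit memory image. It assumes the least-significant decimal digit occupies bits 3-0 and that the image is stored little-endian, as is usual for x86 memory operands. The function names are hypothetical, and the sketch does not model FBLD or FBSTP rounding or exception behavior.

```python
def encode_packed_bcd(value: int) -> bytes:
    """Encode a signed integer as a 10-byte packed-decimal image."""
    if abs(value) >= 10 ** 18:
        raise ValueError("packed decimal holds at most 18 digits")
    image = 0
    magnitude = abs(value)
    for position in range(18):
        # Peel off one decimal digit at a time, lowest digit first.
        magnitude, digit = divmod(magnitude, 10)
        image |= digit << (4 * position)   # 4 bits per digit, bits 71-0
    if value < 0:
        image |= 1 << 79                   # bit 79 is the sign bit
    # Bits 78-72 stay zero, matching "zeros on stores".
    return image.to_bytes(10, "little")


def decode_packed_bcd(image: bytes) -> int:
    """Decode a 10-byte packed-decimal image (no validation of digits > 9)."""
    raw = int.from_bytes(image, "little")
    magnitude = 0
    for position in reversed(range(18)):
        magnitude = magnitude * 10 + ((raw >> (4 * position)) & 0xF)
    # Bits 78-72 are ignored, matching "ignored on loads".
    return -magnitude if raw >> 79 else magnitude


if __name__ == "__main__":
    image = encode_packed_bcd(-1234)
    print(image.hex(), decode_packed_bcd(image))
```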
Bits    Description
79      Sign Bit
78-72   Ignored on Load, Zeros on Store
71-0    Precision -- 18 Digits, 72 Bits Used, 4-Bits/Digit

Figure 6-9. x87 Packed Decimal Data Type

Two x87 instructions operate on the packed-decimal data type. The FBLD (floating-point load binary-coded decimal) and FBSTP (floating-point store binary-coded decimal integer and pop) instructions push and pop, respectively, a packed-decimal memory operand between the floating-point stack and memory. FBLD converts the value being pushed to a double-extended-precision floating-point value. FBSTP rounds the value being popped to an integer.

For details on the format and use of 4-bit BCD integers, see "Binary-Coded-Decimal (BCD) Digits" on page 43.

6.4.3 Number Representation

Of the following types of floating-point values, six are supported by the architecture and three are not supported:

Supported Values
- Normal
- Denormal (Tiny)
- Pseudo-Denormal
- Zero
- Infinity
- Not a Number (NaN)

Unsupported Values
- Unnormal
- Pseudo-Infinity
- Pseudo-NaN

The supported values can be used as operands in x87 floating-point instructions. The unsupported values cause an invalid-operation exception (IE) when used as operands.

In common engineering and scientific usage, floating-point numbers--also called real numbers--are represented in base (radix) 10. A non-zero number consists of a sign, a normalized significand, and a signed exponent, as in:

+2.71828 e0

Both large and small numbers are representable in this notation, subject to the limits of data-type precision. For example, a million in base-10 notation appears as +1.00000 e6 and -0.0000383 is represented as -3.83000 e-5. A non-zero number can always be written in normalized form--that is, with a leading non-zero digit immediately before the decimal point. Thus, a normalized significand in base-10 notation is a number in the range [1,10).
The signed exponent specifies the number of positions that the decimal point is shifted.

Unlike the common engineering and scientific usage described above, x87 floating-point numbers are represented in base (radix) 2. Like its base-10 counterpart, a normalized base-2 significand is written with its leading non-zero digit immediately to the left of the radix point. In base-2 arithmetic, a non-zero digit is always a one, so the range of a binary significand is [1,2):

+1.fraction exponent

The leading non-zero digit is called the integer bit, and in the x87 double-extended-precision floating-point format this integer bit is explicit, as shown in Figure 6-8. In the x87 single-precision and the double-precision floating-point formats, the integer bit is simply omitted (and called the hidden integer bit), because its implied value is always 1 in a normalized significand (0 in a denormalized significand), and the omission allows an extra bit of precision.

The following sections describe the supported number representations.

Normalized Numbers. Normalized floating-point numbers are the most frequent operands for x87 instructions. These are finite, non-zero, positive or negative numbers in which the integer bit is 1, the biased exponent is non-zero and non-maximum, and the fraction is any representable value. Thus, the significand is within the range of [1, 2). Whenever possible, the processor represents a floating-point result as a normalized number.

Denormalized (Tiny) Numbers. Denormalized numbers (also called tiny numbers) are smaller than the smallest representable normalized numbers. They arise through an underflow condition, when the exponent of a result lies below the representable minimum exponent.
These are finite, non-zero, positive or negative numbers in which the integer bit is 0, the biased exponent is 0, and the fraction is non-zero.

The processor generates a denormalized-operand exception (DE) when an instruction uses a denormalized source operand. The processor may generate an underflow exception (UE) when an instruction produces a rounded, non-zero result that is too small to be represented as a normalized floating-point number in the destination format, and thus is represented as a denormalized number. If a result, after rounding, is too small to be represented as the minimum denormalized number, it is represented as zero. (See "Exceptions" on page 337 for specific details.)

Denormalization may correct the exponent by placing leading zeros in the significand. This may cause a loss of precision, because the number of significant bits in the fraction is reduced by the leading zeros. In the single-precision floating-point format, for example, normalized numbers have biased exponents ranging from 1 to 254 (the unbiased exponent range is from -126 to +127). A true result with an exponent of, say, -130, undergoes denormalization by right-shifting the significand by the difference between the normalized exponent and the minimum exponent, as shown in Table 6-6.

Table 6-6. Example of Denormalization

Result Type           Exponent   Significand (base 2)
True result           -130       1.0011010000000000
Denormalized result   -126       0.0001001101000000

Pseudo-Denormalized Numbers. Pseudo-denormalized numbers are positive or negative numbers in which the integer bit is 1, the biased exponent is 0, and the fraction is any value. The processor accepts pseudo-denormal source operands but it does not produce pseudo-denormal results.
When a pseudo-denormal number is used as a source operand, the processor treats the arithmetic value of its biased exponent as 1 rather than 0, and the processor generates a denormalized-operand exception (DE).

Zero. The floating-point zero is a finite, positive or negative number in which the integer bit is 0, the biased exponent is 0, and the fraction is 0. The sign of a zero result depends on the operation being performed and the selected rounding mode. It may indicate the direction from which an underflow occurred, or it may reflect the result of a division by +∞ or -∞.

Infinity. Infinity is a positive or negative number, +∞ and -∞, in which the integer bit is 1, the biased exponent is maximum, and the fraction is 0. The infinities are the maximum numbers that can be represented in floating-point format. Negative infinity is less than any finite number and positive infinity is greater than any finite number (i.e., the affine sense).

An infinite result is produced when a non-zero, non-infinite number is divided by 0 or multiplied by infinity, or when infinity is added to infinity or to 0. Arithmetic on infinities is exact. For example, adding any floating-point number to +∞ gives a result of +∞. Arithmetic comparisons work correctly on infinities. Exceptions occur only when the use of an infinity as a source operand constitutes an invalid operation.

Not a Number (NaN). NaNs are non-numbers, lying outside the range of representable floating-point values. The integer bit is 1, the biased exponent is maximum, and the fraction is non-zero. NaNs are of two types:
- Signaling NaN (SNaN)
- Quiet NaN (QNaN)

A QNaN is a NaN with the most-significant fraction bit set to 1, and an SNaN is a NaN with the most-significant fraction bit cleared to 0.
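The QNaN and SNaN definitions above can be sketched in Python for the single-precision encoding, where the exponent occupies bits 30-23 and the most-significant fraction bit is bit 22. The sketch works on raw 32-bit patterns rather than on Python floats, because converting through a float may quiet a signaling NaN on some platforms. The names `classify_single` and `to_quiet` are hypothetical and introduced only for this illustration.

```python
EXPONENT_MASK = 0xFF        # 8-bit biased exponent, after shifting
FRACTION_MASK = 0x7FFFFF    # 23-bit fraction
QUIET_BIT = 0x400000        # most-significant fraction bit (bit 22)


def classify_single(bits: int) -> str:
    """Classify a 32-bit single-precision pattern by its exponent and fraction."""
    exponent = (bits >> 23) & EXPONENT_MASK
    fraction = bits & FRACTION_MASK
    if exponent != EXPONENT_MASK:
        return "finite"          # zero, denormal, or normal
    if fraction == 0:
        return "infinity"        # maximum exponent, zero fraction
    # Maximum exponent with a non-zero fraction is a NaN.
    return "QNaN" if fraction & QUIET_BIT else "SNaN"


def to_quiet(bits: int) -> int:
    """Convert an SNaN pattern to a QNaN by setting the top fraction bit."""
    return bits | QUIET_BIT


if __name__ == "__main__":
    for pattern in (0x7FC00000, 0x7F800001, 0x7F800000, 0x3F800000):
        print(f"{pattern:08X}", classify_single(pattern))
```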
When the processor encounters an SNaN as a source operand for an instruction, an invalid-operation exception (IE) occurs and a QNaN is produced as the result, if the exception is masked. In general, when the processor encounters a QNaN as a source operand for an instruction--in an instruction other than FxCOMx, FISTx, or FSTx--the processor does not generate an exception but generates a QNaN as the result.

The processor never generates an SNaN as a result of a floating-point operation. When an invalid-operation exception (IE) occurs due to an SNaN operand, the invalid-operation exception mask (IM) bit determines the processor's response, as described in "x87 Floating-Point Exception Masking" on page 344.

When a floating-point operation or exception produces a QNaN result, its value is derived from the source operands according to the rules shown in Table 6-7 on page 309.

Table 6-7. NaN Results from NaN Source Operands

Source Operand (in either order) (1)                                       NaN Result (2)
QNaN   Any non-NaN floating-point value (or single-operand instruction)    Value of QNaN
SNaN   Any non-NaN floating-point value (or single-operand instruction)    Value of SNaN, converted to a QNaN (3)
QNaN   QNaN                                                                Value of QNaN with the larger significand (4)
QNaN   SNaN                                                                Value of QNaN
SNaN   QNaN                                                                Value of QNaN
SNaN   SNaN                                                                Value of SNaN with the larger significand (4)

Notes:
1. This table does not include NaN source operands used in FxCOMx, FISTx, or FSTx instructions.
2. A NaN result is produced when the floating-point invalid-operation exception is masked.
3. The conversion is done by changing the most-significant fraction bit to 1.
4. If the significands of the source operands are equal but their signs are different, the NaN result is undefined.

6.4.4 Number Encodings

Supported Encodings. Table 6-8 on page 310 shows the floating-point encodings of supported numbers and non-numbers. The number categories are ordered from large to small. In this affine ordering, positive infinity is larger than any positive normalized number, which in turn is larger than any positive denormalized number, which is larger than positive zero, and so forth. Thus, the ordinary rules of comparison apply between categories as well as within categories, so that comparison of any two numbers is well-defined.

The actual exponent field length is 8, 11, or 15 bits, and the fraction field length is 23, 52, or 63 bits, depending on operand precision.

The single-precision and double-precision formats do not include the integer bit in the significand (the value of the integer bit can be inferred from number encodings). The double-extended-precision format explicitly includes the integer in bit 63 and places the most-significant fraction bit in bit 62. Exponents of all three types are encoded in biased format, with respective biasing constants of 127, 1023, and 16,383.

Table 6-8. Supported Floating-Point Encodings

Classification                      Sign   Biased Exponent (1)          Significand (2)
Positive Non-Numbers
  SNaN                              0      111 ... 111                  1.011 ... 111 to 1.000 ... 001
  QNaN                              0      111 ... 111                  1.111 ... 111 to 1.100 ... 000
Positive Floating-Point Numbers
  Positive Infinity (+∞)            0      111 ... 111                  1.000 ... 000
  Positive Normal                   0      111 ... 110 to 000 ... 001   1.111 ... 111 to 1.000 ... 000
  Positive Pseudo-Denormal (3)      0      000 ... 000                  1.111 ... 111 to 1.000 ... 001
  Positive Denormal                 0      000 ... 000                  0.111 ... 111 to 0.000 ... 001
  Positive Zero                     0      000 ... 000                  0.000 ... 000
Negative Floating-Point Numbers
  Negative Zero                     1      000 ... 000                  0.000 ... 000
  Negative Denormal                 1      000 ... 000                  0.000 ... 001 to 0.111 ... 111
  Negative Pseudo-Denormal (3)      1      000 ... 000                  1.000 ... 001 to 1.111 ... 111
  Negative Normal                   1      000 ... 001 to 111 ... 110   1.000 ... 000 to 1.111 ... 111
  Negative Infinity (-∞)            1      111 ... 111                  1.000 ... 000
Negative Non-Numbers
  SNaN                              1      111 ... 111                  1.000 ... 001 to 1.011 ... 111
  QNaN (4)                          1      111 ... 111                  1.100 ... 000 to 1.111 ... 111

Notes:
1. The actual exponent field length is 8, 11, or 15 bits, depending on operand precision.
2. The "1." and "0." prefixes represent the implicit or explicit integer bit. The actual fraction field length is 23, 52, or 63 bits, depending on operand precision.
3. Pseudo-denormals can only occur in double-extended-precision format, because they require an explicit integer bit.
4. The floating-point indefinite value is a QNaN with a negative sign and a significand whose value is 1.100 ... 000.

Unsupported Encodings. Table 6-9 on page 312 shows the encodings of unsupported values. These values can exist only in the double-extended-precision format, because they require an explicit integer bit. The processor does not generate them as results, and they cause an invalid-operation exception (IE) when used as source operands.

Indefinite Values.
Floating-point, integer, and packed-decimal data types each have a unique encoding that represents an indefinite value. The processor returns an indefinite value when a masked invalid-operation exception (IE) occurs.

For example, if a floating-point arithmetic operation is attempted using a source operand which is in an unsupported format, and IE exceptions are masked, the floating-point indefinite value is returned as the result. Or, if an integer store instruction overflows its destination data type, and IE exceptions are masked, the integer indefinite value is returned as the result.

Table 6-9. Unsupported Floating-Point Encodings

Classification             Sign   Biased Exponent (1)          Significand (2)
Positive Pseudo-NaN        0      111 ... 111                  0.111 ... 111 to 0.000 ... 001
Positive Pseudo-Infinity   0      111 ... 111                  0.000 ... 000
Positive Unnormal          0      111 ... 110 to 000 ... 001   0.111 ... 111 to 0.000 ... 000
Negative Unnormal          1      000 ... 001 to 111 ... 110   0.000 ... 000 to 0.111 ... 111
Negative Pseudo-Infinity   1      111 ... 111                  0.000 ... 000
Negative Pseudo-NaN        1      111 ... 111                  0.000 ... 001 to 0.111 ... 111

Notes:
1. The actual exponent field length is 15 bits.
2. The "0." prefix represents the explicit integer bit. The actual fraction field length is 63 bits.

Table 6-10 on page 313 shows the encodings of the indefinite values for each data type. For floating-point numbers, the indefinite value is a special form of QNaN. For integers, the indefinite value is the largest representable negative two's-complement number, 80...00h. (This value is interpreted as the largest representable negative number, except when a masked IE exception occurs, in which case it is interpreted as an indefinite value.) For packed-decimal numbers, the indefinite value has no other meaning than indefinite.

Table 6-10. Indefinite-Value Encodings

Data Type        Indefinite Encoding
Floating-Point   - sign bit = 1
                 - biased exponent = 111 ... 111
                 - significand integer bit = 1
                 - significand fraction = 100 ... 000
Integer          - sign bit = 1
                 - integer = 000 ... 000
Packed-Decimal   - bit 79 (sign bit) = 1
                 - bits 78-72 = 1111111
                 - bits 71-64 = 11111111
                 - bits 63-0 = any value

6.4.5 Precision

Bits 9-8 of the x87 control word ("x87 Control Word Register" on page 293) comprise the precision control (PC) field, which specifies the precision of floating-point calculations for the FADDx, FSUBx, FMULx, FDIVx, and FSQRT instructions, as shown in Table 6-11.

Table 6-11. Precision Control Field (PC) Values and Bit Precision

PC Field   Data Type                   Precision (bits)
00         Single precision            24 (1)
01         reserved
10         Double precision            53 (1)
11         Double-extended precision   64

Note:
1. The single-precision and double-precision bit counts include the implied integer bit.

The default precision is double-extended-precision. Selecting double-precision or single-precision reduces the size of the significand to 53 bits or 24 bits, respectively, to satisfy the IEEE standard for these floating-point types. This allows exact replication, on different IEEE-compliant processors, of calculations done using these lower-precision data types. When using reduced precision, rounding clears the unused bits on the right of the significand to 0s.

6.4.6 Rounding

Bits 11-10 of the x87 control word ("x87 Control Word Register" on page 293) comprise the rounding control (RC) field, which specifies how the results of x87 floating-point computations are rounded. Rounding modes apply to most arithmetic operations but not to comparison or remainder. They have no effect on operations that produce NaN results.

The IEEE 754 standard defines the four rounding modes as shown in Table 6-12.

Table 6-12.
RC Value 00 (default) 01 10 11 Types of Rounding Mode Round to nearest Type of Rounding The rounded result is the representable value closest to the infinitely precise result. If equally close, the even value (with least-significant bit 0) is taken. The rounded result is closest to, but no greater than, the infinitely precise result. The rounded result is closest to, but no less than, the infinitely precise result. The rounded result is closest to, but no greater in absolute value than, the infinitely precise result. Round down Round up Round toward zero Round to nearest is the default rounding mode. It provides a statistically unbiased estimate of the true result, and is suitable for most applications. The other rounding modes are directed roundings: round up (toward +), round down (toward -), and round toward zero. Round up and round down are used in interval arithmetic, in which upper and lower bounds bracket the true result of a computation. Round toward zero takes the smaller in magnitude, that is, always truncates. The processor produces a floating-point result defined by the IEEE standard to be infinitely precise. This result may not be representable exactly in the destination format, because only a s u b s e t o f t h e c o n t i nu u m o f re a l nu m b e rs f i n d s ex a c t representation in any particular floating-point format. Rounding modifies such a result to conform to the destination format, thereby making the result inexact and also generating a precision exception (PE), as described in "x87 Floating-Point Exception Causes" on page 338. 314 Chapter 6: x87 Floating-Point Programming 24593--Rev. 
3.09--September 2003 AMD 64-Bit Technology Suppose, for example, the following 24-bit result is to be represented in single-precision format, where "E 2 1010" represents the biased exponent: 1.0011 0101 0000 0001 0010 0111 E2 1010 This result has no exact representation, because the leastsignificant 1 does not fit into the single-precision format, which allows for only 23 bits of fraction. The rounding control field determines the direction of rounding. Rounding introduces an error in a result that is less than one unit in the last place (ulp), that is, the least-significant bit position of the floating-point representation. 6.5 Instruction Summary This section summarizes the functions of the x87 floating-point instructions. The instructions are organized here by functional group--such as data-transfer, arithmetic, and so on. More detail on individual instructions is given in the alphabetically organized "x87 Floating-Point Instruction Reference" in Volume 5. Software running at any privilege level can use any of these instructions, if the CPUID instruction reports support for the instructions (see "Feature Detection" on page 336). Most x87 instructions take floating-point data types for both their source and destination operands, although some x87 data-conversion instructions take integer formats for their source or destination operands. 6.5.1 Syntax Each instruction has a mnemonic syntax used by assemblers to specify the operation and the operands to be used for source and destination (result) data. Many of x87 instructions have the following syntax: MNEMONIC st(j), st(i) Figure 6-10 on page 316 shows an example of the mnemonic syntax for a floating-point add (FADD) instruction. Chapter 6: x87 Floating-Point Programming 315 AMD 64-Bit Technology 24593--Rev. 3.09--September 2003 FADD st(0), st(i) Mnemonic First Source Operand and Destination Operand Second Source Operand 513-146.eps Figure 6-10. 
Mnemonic Syntax for Typical Instruction

This example shows the FADD mnemonic followed by two operands, both of which are 80-bit stack-register operands. Most instructions take source operands from an x87 stack register and/or memory and write their results to a stack register or memory. Only two of the instructions (FSTSW and FNSTSW) can access a general-purpose register (GPR), and none access the 128-bit media (XMM) registers. Although the MMX registers map to the x87 registers, the contents of the MMX registers cannot be accessed meaningfully using x87 instructions.

Instructions can have one or more prefixes that modify default operand properties. These prefixes are summarized in "Instruction Prefixes" on page 85.

Mnemonics. The following character is used as a prefix in the mnemonics of x87 instructions:

F--x87 Floating-point

In addition to the above prefix character, the following characters are used elsewhere in the mnemonics of x87 instructions:

B--Below, or BCD
BE--Below or Equal
CMOV--Conditional Move
c--Variable condition
E--Equal
I--Integer
LD--Load
N--No Wait
NB--Not Below
NBE--Not Below or Equal
NE--Not Equal
NU--Not Unordered
P--Pop
PP--Pop Twice
R--Reverse
ST--Store
U--Unordered
x--One or more variable characters in the mnemonic

For example, the mnemonic for the store instruction that stores the top-of-stack and pops the stack is FSTP. In this mnemonic, the F means a floating-point instruction, the ST means a store, and the P means pop the stack.

6.5.2 Data Transfer and Conversion

The data transfer and conversion instructions copy data--in some cases with data conversion--between x87 stack registers and memory or between stack positions.

Load or Store Floating-Point.
FLD--Floating-Point Load
FST--Floating-Point Store Stack Top
FSTP--Floating-Point Store Stack Top and Pop

The FLD instruction pushes the source operand onto the top-of-stack, ST(0). The source operand may be a single-precision, double-precision, or double-extended-precision floating-point value in memory or the contents of a specified stack position, ST(i).

The FST instruction copies the value at the top-of-stack, ST(0), to a specified stack position, ST(i), or to a 32-bit or 64-bit memory location. If the destination is a memory location, the value copied is first converted to a single-precision or double-precision floating-point value. If the top-of-stack value is a single-precision or double-precision value, FSTP converts it according to the rounding control (RC) field of the x87 control word. If the top-of-stack value is a NaN or an infinity, FST truncates the stack-top exponent and significand to fit the destination size.

The FSTP instruction is similar to FST, except that FSTP can also store to an 80-bit memory location and it pops the stack after the store. FSTP can be used to clean up the x87 stack at the end of an x87 procedure by removing one register of preloaded data from the stack.

Convert and Load or Store Integer.

FILD--Floating-Point Load Integer
FIST--Floating-Point Integer Store
FISTP--Floating-Point Integer Store and Pop

The FILD instruction converts the 16-bit, 32-bit, or 64-bit source signed integer in memory into a double-extended-precision floating-point value and pushes the result onto the top-of-stack, ST(0).

The FIST instruction converts and rounds the source value in the top-of-stack, ST(0), to a signed integer and copies it to the specified 16-bit or 32-bit memory location. The source may be any floating-point data type, including a single-precision, double-precision, or double-extended-precision floating-point value.
The type of rounding is determined by the rounding control (RC) field of the x87 control word. The default is round-to-nearest.

The FISTP instruction is similar to FIST, except that FISTP can also store the result to a 64-bit memory location and it pops ST(0) after the store.

Convert and Load or Store BCD.

FBLD--Floating-Point Load Binary-Coded Decimal
FBSTP--Floating-Point Store Binary-Coded Decimal Integer and Pop

The FBLD and FBSTP instructions, respectively, push and pop an 80-bit packed BCD memory value on and off the top-of-stack, ST(0). FBLD first converts the value being pushed to a double-extended-precision floating-point value. FBSTP rounds the value being popped to an integer, using the rounding mode specified by the RC field, and converts the value to an 80-bit packed BCD value. Thus, no FRNDINT (round-to-integer) instruction is needed prior to FBSTP.

Conditional Move.

FCMOVB--Floating-Point Conditional Move If Below
FCMOVBE--Floating-Point Conditional Move If Below or Equal
FCMOVE--Floating-Point Conditional Move If Equal
FCMOVNB--Floating-Point Conditional Move If Not Below
FCMOVNBE--Floating-Point Conditional Move If Not Below or Equal
FCMOVNE--Floating-Point Conditional Move If Not Equal
FCMOVNU--Floating-Point Conditional Move If Not Unordered
FCMOVU--Floating-Point Conditional Move If Unordered

The FCMOVcc instructions copy the contents of a specified stack position, ST(i), to the top-of-stack, ST(0), if the specified rFLAGS condition is met. Table 6-13 specifies the flag combinations for each conditional move.

Table 6-13.
rFLAGS Conditions for FCMOVcc

Condition            Mnemonic   rFLAGS Register State
Below                B          Carry flag is set (CF = 1)
Below or Equal       BE         Either carry flag or zero flag is set (CF = 1 or ZF = 1)
Equal                E          Zero flag is set (ZF = 1)
Not Below            NB         Carry flag is not set (CF = 0)
Not Below or Equal   NBE        Neither carry flag nor zero flag is set (CF = 0, ZF = 0)
Not Equal            NE         Zero flag is not set (ZF = 0)
Not Unordered        NU         Parity flag is not set (PF = 0)
Unordered            U          Parity flag is set (PF = 1)

Exchange.

FXCH--Floating-Point Exchange

The FXCH instruction exchanges the contents of a specified stack position, ST(i), with the top-of-stack, ST(0). The top-of-stack pointer is left unchanged. In the form of the instruction that specifies no operand, the contents of ST(1) and ST(0) are exchanged.

Extract.

FXTRACT--Floating-Point Extract Exponent and Significand

The FXTRACT instruction copies the unbiased exponent of the original value in the top-of-stack, ST(0), and writes it as a floating-point value to ST(1), then copies the significand and sign of the original value in the top-of-stack and writes it as a floating-point value with an exponent of zero to the top-of-stack, ST(0).

6.5.3 Load Constants

Load 0, 1, or Pi.

FLDZ--Floating-Point Load +0.0
FLD1--Floating-Point Load +1.0
FLDPI--Floating-Point Load Pi

The FLDZ, FLD1, and FLDPI instructions, respectively, push the floating-point constant value, +0.0, +1.0, and Pi (3.141592653...), onto the top-of-stack, ST(0).

Load Logarithm.

FLDL2E--Floating-Point Load Log2 e
FLDL2T--Floating-Point Load Log2 10
FLDLG2--Floating-Point Load Log10 2
FLDLN2--Floating-Point Load Ln 2

The FLDL2E, FLDL2T, FLDLG2, and FLDLN2 instructions, respectively, push the floating-point constant value, log2(e), log2(10), log10(2), and ln(2), onto the top-of-stack, ST(0).
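The load-constant instructions above can be mirrored with a small Python table. This is only a sketch: the dictionary X87_CONSTANTS and the helper log10_via_log2 are names invented for this example, and the values are double-precision approximations rather than the double-extended-precision values the processor pushes.

```python
import math

# Double-precision stand-ins for the constants each instruction pushes.
X87_CONSTANTS = {
    "FLDZ": 0.0,                     # +0.0
    "FLD1": 1.0,                     # +1.0
    "FLDPI": math.pi,                # Pi
    "FLDL2E": math.log2(math.e),     # log2(e)
    "FLDL2T": math.log2(10.0),       # log2(10)
    "FLDLG2": math.log10(2.0),       # log10(2)
    "FLDLN2": math.log(2.0),         # ln(2)
}


def log10_via_log2(x: float) -> float:
    """Change of base: log10(x) = log10(2) * log2(x)."""
    return X87_CONSTANTS["FLDLG2"] * math.log2(x)


if __name__ == "__main__":
    for mnemonic, value in X87_CONSTANTS.items():
        print(f"{mnemonic:7s} {value!r}")
```

The logarithm constants come in reciprocal pairs (log2(e) with ln(2), log2(10) with log10(2)), which is what makes them convenient for converting a base-2 logarithm to another base.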
6.5.4 Arithmetic

The arithmetic instructions support addition, subtraction, multiplication, division, change-sign, round, round to integer, partial remainder, and square root. In most arithmetic operations, one of the source operands is the top-of-stack, ST(0). The other source operand can be another stack entry, ST(i), or a floating-point or integer operand in memory.

The non-commutative operations of subtraction and division have two forms, the direct FSUB and FDIV, and the reverse FSUBR and FDIVR. FSUB, for example, subtracts the right operand from the left operand, and writes the result to the left operand. FSUBR subtracts the left operand from the right operand, and writes the result to the left operand. The FADD and FMUL operations have no reverse counterparts.

Addition.

FADD--Floating-Point Add
FADDP--Floating-Point Add and Pop
FIADD--Floating-Point Add Integer to Stack Top

The FADD instruction syntax has forms that include one or two explicit source operands. In the one-operand form, the instruction reads a 32-bit or 64-bit floating-point value from memory, converts it to the double-extended-precision format, adds it to ST(0), and writes the result to ST(0). In the two-operand form, the instruction adds both source operands from stack registers and writes the result to the first operand.

The FADDP instruction syntax has forms that include zero or two explicit source operands. In the zero-operand form, the instruction adds ST(0) to ST(1), writes the result to ST(1), and pops the stack. In the two-operand form, the instruction adds both source operands from stack registers, writes the result to the first operand, and pops the stack.

The FIADD instruction reads a 16-bit or 32-bit integer value from memory, converts it to the double-extended-precision format, adds it to ST(0), and writes the result to ST(0).

Subtraction.
FSUB--Floating-Point Subtract
FSUBP--Floating-Point Subtract and Pop
FISUB--Floating-Point Integer Subtract
FSUBR--Floating-Point Subtract Reverse
FSUBRP--Floating-Point Subtract Reverse and Pop
FISUBR--Floating-Point Integer Subtract Reverse

The FSUB instruction syntax has forms that include one or two explicit source operands. In the one-operand form, the instruction reads a 32-bit or 64-bit floating-point value from memory, converts it to the double-extended-precision format, subtracts it from ST(0), and writes the result to ST(0). In the two-operand form, both source operands are located in stack registers. The instruction subtracts the second operand from the first operand and writes the result to the first operand.

The FSUBP instruction syntax has forms that include zero or two explicit source operands. In the zero-operand form, the instruction subtracts ST(0) from ST(1), writes the result to ST(1), and pops the stack. In the two-operand form, both source operands are located in stack registers. The instruction subtracts the second operand from the first operand, writes the result to the first operand, and pops the stack.

The FISUB instruction reads a 16-bit or 32-bit integer value from memory, converts it to the double-extended-precision format, subtracts it from ST(0), and writes the result to ST(0).

The FSUBR and FSUBRP instructions perform the same operations as FSUB and FSUBP, respectively, except that the source operands are reversed. Instead of subtracting the second operand from the first operand, FSUBR and FSUBRP subtract the first operand from the second operand.

Multiplication.
FMUL--Floating-Point Multiply
FMULP--Floating-Point Multiply and Pop
FIMUL--Floating-Point Integer Multiply

The FMUL instruction syntax has forms that include one or two explicit source operands that may be single-precision or double-precision floating-point values or 16-bit or 32-bit integer values. In the one-operand form, the instruction reads a value from memory, multiplies ST(0) by the memory operand, and writes the result to ST(0). In the two-operand form, both source operands are located in stack registers. The instruction multiplies the first operand by the second operand and writes the result to the first operand.

The FMULP instruction syntax has forms that include zero or two explicit source operands. In the zero-operand form, the instruction multiplies ST(1) by ST(0), writes the result to ST(1), and pops the stack. In the two-operand form, both source operands are located in stack registers. The instruction multiplies the first operand by the second operand, writes the result to the first operand, and pops the stack.

The FIMUL instruction reads a 16-bit or 32-bit integer value from memory, converts it to the double-extended-precision format, multiplies ST(0) by the memory operand, and writes the result to ST(0).

Division.

FDIV--Floating-Point Divide
FDIVP--Floating-Point Divide and Pop
FIDIV--Floating-Point Integer Divide
FDIVR--Floating-Point Divide Reverse
FDIVRP--Floating-Point Divide Reverse and Pop
FIDIVR--Floating-Point Integer Divide Reverse

The FDIV instruction syntax has forms that include one or two explicit source operands that may be single-precision or double-precision floating-point values or 16-bit or 32-bit integer values. In the one-operand form, the instruction reads a value from memory, divides ST(0) by the memory operand, and writes the result to ST(0). In the two-operand form, both source operands are located in stack registers.
The instruction divides the first operand by the second operand and writes the result to the first operand.

The FDIVP instruction syntax has forms that include zero or two explicit source operands. In the zero-operand form, the instruction divides ST(1) by ST(0), writes the result to ST(1), and pops the stack. In the two-operand form, both source operands are located in stack registers. The instruction divides the first operand by the second operand, writes the result to the first operand, and pops the stack.

The FIDIV instruction reads a 16-bit or 32-bit integer value from memory, converts it to the double-extended-precision format, divides ST(0) by the memory operand, and writes the result to ST(0).

The FDIVR and FDIVRP instructions perform the same operations as FDIV and FDIVP, respectively, except that the source operands are reversed. Instead of dividing the first operand by the second operand, FDIVR and FDIVRP divide the second operand by the first operand.

Change Sign.

FABS--Floating-Point Absolute Value
FCHS--Floating-Point Change Sign

The FABS instruction changes the top-of-stack value, ST(0), to its absolute value by clearing its sign bit to 0. The top-of-stack value is always positive following execution of the FABS instruction. The FCHS instruction complements the sign bit of ST(0). For example, if ST(0) was +0.0 before the execution of FCHS, it is changed to -0.0.

Round.

FRNDINT--Floating-Point Round to Integer

The FRNDINT instruction rounds the top-of-stack value, ST(0), to an integer value, although the value remains in double-extended-precision floating-point format. Rounding takes place according to the setting of the rounding control (RC) field in the x87 control word.

Partial Remainder.
FPREM--Floating-Point Partial Remainder
FPREM1--Floating-Point Partial Remainder

The FPREM instruction returns the remainder obtained by dividing ST(0) by ST(1) and stores it in ST(0). If the exponent difference between ST(0) and ST(1) is less than 64, all integer bits of the quotient are calculated, guaranteeing that the remainder returned is less in magnitude than the divisor in ST(1). If the exponent difference is equal to or greater than 64, only a subset of the integer quotient bits, numbering between 32 and 63, are calculated and a partial remainder is returned. FPREM can be repeated on a partial remainder until reduction is complete. It can be used to bring the operands of transcendental functions into their proper range. FPREM is supported for software written for early x87 coprocessors. Unlike the FPREM1 instruction, FPREM does not calculate the partial remainder as specified in IEEE Standard 754.

The FPREM1 instruction works like FPREM, except that the FPREM1 quotient is rounded using round-to-nearest mode, whereas FPREM truncates the quotient.

Square Root.

FSQRT--Floating-Point Square Root

The FSQRT instruction replaces the contents of the top-of-stack, ST(0), with its square root.

6.5.5 Transcendental Functions

The transcendental instructions compute trigonometric functions, inverse trigonometric functions, logarithmic functions, and exponential functions.

Trigonometric Functions.

FSIN--Floating-Point Sine
FCOS--Floating-Point Cosine
FSINCOS--Floating-Point Sine and Cosine
FPTAN--Floating-Point Partial Tangent
FPATAN--Floating-Point Partial Arctangent

The FSIN instruction replaces the contents of the top-of-stack, ST(0), with its sine, in radians.

The FCOS instruction replaces the contents of the top-of-stack, ST(0), with its cosine, in radians.
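Either remainder instruction described above can bring a large angle into range before a sine or cosine is taken. The Python sketch below uses math.fmod as a stand-in for the truncated quotient of FPREM and math.remainder (Python 3.7 or later) for the round-to-nearest quotient of FPREM1. The function names are invented for this example, and the sketch does not model partial reduction or the C2 condition code.

```python
import math


def fprem_style(dividend: float, divisor: float) -> float:
    """Remainder with the quotient truncated toward zero (FPREM-like)."""
    return math.fmod(dividend, divisor)


def fprem1_style(dividend: float, divisor: float) -> float:
    """Remainder with the quotient rounded to nearest, ties to even (FPREM1-like)."""
    return math.remainder(dividend, divisor)


if __name__ == "__main__":
    # 7 / 2 = 3.5: truncation gives quotient 3, round-to-nearest-even gives 4.
    print(fprem_style(7.0, 2.0), fprem1_style(7.0, 2.0))

    # Both reductions leave the sine of the angle unchanged up to rounding error.
    angle = 100.0
    two_pi = 2.0 * math.pi
    print(math.sin(angle),
          math.sin(fprem_style(angle, two_pi)),
          math.sin(fprem1_style(angle, two_pi)))
```

The truncating form keeps the sign of the dividend, while the round-to-nearest form returns a value whose magnitude is at most half the divisor.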
The FSINCOS instruction computes both the sine and cosine of ST(0), in radians, and writes the sine to ST(0) and pushes the cosine onto the stack. Frequently, a piece of code that needs to compute the sine of an argument also needs to compute the cosine of that same argument. In such cases, use the FSINCOS instruction to compute both functions concurrently, which is faster than using separate FSIN and FCOS instructions.

The FPTAN instruction replaces the contents of the top-of-stack, ST(0), with its tangent, in radians, and pushes the value 1.0 onto the stack.

The FPATAN instruction computes θ = arctan(Y/X), in which X is located in ST(0) and Y in ST(1). The result, θ, is written over Y in ST(1), and the stack is popped.

FSIN, FCOS, FSINCOS, and FPTAN are architecturally restricted in their argument range. Only arguments with a magnitude of less than or equal to 2^63 can be evaluated. If the argument is out of range, the C2 condition-code bit in the x87 status word is set to 1, and the argument is returned as the result. If software detects an out-of-range argument, the FPREM or FPREM1 instruction can be used to reduce the magnitude of the argument before using the FSIN, FCOS, FSINCOS, or FPTAN instruction again.

Logarithmic Functions.

F2XM1--Floating-Point Compute 2^x - 1
FSCALE--Floating-Point Scale
FYL2X--Floating-Point y * log2(x)
FYL2XP1--Floating-Point y * log2(x + 1)

The F2XM1 instruction computes Y = 2^X - 1. X is located in ST(0) and must fall between -1 and +1. Y replaces X in ST(0). If ST(0) is out of range, the instruction returns an undefined result but no x87 status-word exception bits are affected.

The FSCALE instruction replaces ST(0) with ST(0) times 2^n, where n is the value in ST(1) truncated to an integer. This provides a fast method of multiplying by integral powers of 2.

The FYL2X instruction computes Z = Y * log2(X).
X is located in ST(0) and Y is located in ST(1). X must be greater than 0. The result, Z, replaces Y in ST(1), which becomes the new top-of-stack because X is popped off the stack.

The FYL2XP1 instruction computes Z = Y * log2(X + 1). X is located in ST(0) and must be in the range 0 < |X| < (1 - 2^(1/2) / 2). Y is taken from ST(1). The result, Z, replaces Y in ST(1), which becomes the new top-of-stack because X is popped off the stack.

Accuracy of Transcendental Results. x87 computations are carried out in double-extended-precision format, so that the transcendental functions provide results accurate to within one unit in the last place (ulp) for each of the floating-point data types.

Argument Reduction Using Pi. The FPREM and FPREM1 instructions can be used to reduce an argument of a trigonometric function by a multiple of Pi. The following example shows a reduction by 2π:

sin(n*2π + x) = sin(x) for all integral n

In this example, the range is 0 ≤ x < 2π in the case of FPREM or -π ≤ x ≤ π in the case of FPREM1. Negative arguments are reduced by repeatedly subtracting -2π. See "Partial Remainder" on page 324 for details of the instructions.

6.5.6 Compare and Test

The compare-and-test instructions set and clear flags in the rFLAGS register to indicate the relationship between two operands (less, equal, greater, or unordered).

Floating-Point Ordered Compare.

FCOM--Floating-Point Compare
FCOMP--Floating-Point Compare and Pop
FCOMPP--Floating-Point Compare and Pop Twice
FCOMI--Floating-Point Compare and Set Flags
FCOMIP--Floating-Point Compare and Set Flags and Pop

The FCOM instruction syntax has forms that include zero or one explicit source operands. In the zero-operand form, the instruction compares ST(1) with ST(0) and writes the x87 status-word condition codes accordingly.
In the one-operand form, the instruction reads a 32-bit or 64-bit value from memory, compares it with ST(0), and writes the x87 condition codes accordingly.

The FCOMP instruction performs the same operation as FCOM but also pops ST(0) after writing the condition codes.

The FCOMPP instruction performs the same operation as FCOM but also pops both ST(0) and ST(1). FCOMPP can be used to initialize the x87 stack at the end of an x87 procedure by removing two registers of preloaded data from the stack.

The FCOMI instruction compares the contents of ST(0) with the contents of another stack register and writes the ZF, PF, and CF flags in the rFLAGS register as shown in Table 6-14 on page 328. If no source is specified, ST(0) is compared to ST(1). If ST(0) or the source operand is a NaN or in an unsupported format, the flags are set to indicate an unordered condition.

The FCOMIP instruction performs the same comparison as FCOMI but also pops ST(0) after writing the rFLAGS bits.

Table 6-14. rFLAGS Values for FCOMI Instruction

Flag   ST(0) > ST(i)   ST(0) < ST(i)   ST(0) = ST(i)   Unordered
ZF     0               0               1               1
PF     0               0               0               1
CF     0               1               0               1

For comparison-based branches, the combination of FCOMI and FCMOVcc is faster than the classical method of using FxSTSW AX to copy condition codes through the AX register to the rFLAGS register, where they can provide branch direction for conditional operations.

The FCOMx instructions perform ordered compares, as opposed to the FUCOMx instructions. See the description of ordered vs. unordered compares immediately below.

Floating-Point Unordered Compare.
FUCOM--Floating-Point Unordered Compare
FUCOMP--Floating-Point Unordered Compare and Pop
FUCOMPP--Floating-Point Unordered Compare and Pop Twice
FUCOMI--Floating-Point Unordered Compare and Set Flags
FUCOMIP--Floating-Point Unordered Compare and Set Flags and Pop

The FUCOMx instructions perform the same operations as the FCOMx instructions, except that the FUCOMx instructions generate an invalid-operation exception (IE) only if any operand is an unsupported data type or a signaling NaN (SNaN), whereas the ordered-compare FCOMx instructions generate an invalid-operation exception if any operand is an unsupported data type or any type of NaN. For a description of NaNs, see "Number Representation" on page 305.

Integer Compare.

FICOM--Floating-Point Integer Compare
FICOMP--Floating-Point Integer Compare and Pop

The FICOM instruction reads a 16-bit or 32-bit integer value from memory, compares it with ST(0), and writes the condition codes in the same way as the FCOM instruction.

The FICOMP instruction performs the same operations as FICOM but also pops ST(0).

Test.

FTST--Floating-Point Test with Zero

The FTST instruction compares ST(0) with zero and writes the condition codes in the same way as the FCOM instruction.

Classify.

FXAM--Floating-Point Examine

The FXAM instruction determines the type of value in ST(0) and sets the condition codes accordingly, as shown in Table 6-15 on page 330.

Table 6-15. Condition-Code Settings for FXAM

C3   C2   C0   C1 (1)   Meaning
0    0    0    0        +unsupported
0    0    0    1        -unsupported
0    0    1    0        +NAN
0    0    1    1        -NAN
0    1    0    0        +normal
0    1    0    1        -normal
0    1    1    0        +infinity
0    1    1    1        -infinity
1    0    0    0        +0
1    0    0    1        -0
1    0    1    0        +empty
1    0    1    1        -empty
1    1    0    0        +denormal
1    1    0    1        -denormal

Note:
1. C1 is the sign of ST(0).
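The FXAM encoding in Table 6-15 lends itself to a small lookup. The Python sketch below decodes a stored status word into the table's "Meaning" column. It assumes the usual x87 status-word bit positions for the condition codes (C0 in bit 8, C1 in bit 9, C2 in bit 10, C3 in bit 14), which this section does not restate, and the names FXAM_CLASSES and decode_fxam are invented for the example.

```python
# (C3, C2, C0) -> class name, taken from Table 6-15.
FXAM_CLASSES = {
    (0, 0, 0): "unsupported",
    (0, 0, 1): "NAN",
    (0, 1, 0): "normal",
    (0, 1, 1): "infinity",
    (1, 0, 0): "0",
    (1, 0, 1): "empty",
    (1, 1, 0): "denormal",
}


def decode_fxam(status_word: int) -> str:
    """Return the Table 6-15 meaning for a status word stored after FXAM.

    Assumed bit positions: C0 = bit 8, C1 = bit 9, C2 = bit 10, C3 = bit 14.
    """
    c0 = (status_word >> 8) & 1
    c1 = (status_word >> 9) & 1
    c2 = (status_word >> 10) & 1
    c3 = (status_word >> 14) & 1
    kind = FXAM_CLASSES.get((c3, c2, c0))
    if kind is None:
        raise ValueError("condition-code combination not listed in Table 6-15")
    sign = "-" if c1 else "+"   # C1 carries the sign of ST(0)
    return sign + kind


if __name__ == "__main__":
    for word in (0x0400, 0x0500, 0x4200, 0x4100):
        print(f"{word:#06x} -> {decode_fxam(word)}")
```

Software would typically obtain the status word with FSTSW or FNSTSW, described under "Save x87 Status Word", before decoding it this way.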
6.5.7 Stack Management

The stack management instructions move the x87 top-of-stack pointer (TOP) and clear the contents of stack registers.

Stack Control.

FDECSTP--Floating-Point Decrement Stack-Top Pointer
FINCSTP--Floating-Point Increment Stack-Top Pointer

The FINCSTP and FDECSTP instructions increment and decrement, respectively, the TOP, modulo-8. Neither the x87 tag word nor the contents of the floating-point stack itself is updated.

Clear State.

FFREE--Free Floating-Point Register

The FFREE instruction frees a specified stack register by setting the x87 tag-word bits for the register to all 1s, indicating empty. Neither the stack-register contents nor the stack pointer is modified by this instruction.

6.5.8 No Operation

This instruction uses processor cycles but generates no result.

FNOP--Floating-Point No Operation

The FNOP instruction has no operands and writes no result. Its purpose is simply to delay execution of a sequence of instructions.

6.5.9 Control

The control instructions are used to initialize, save, and restore x87 processor state and to manage x87 exceptions.

Initialize.

FINIT--Floating-Point Initialize
FNINIT--Floating-Point No-Wait Initialize

The FINIT and FNINIT instructions set all bits in the x87 control-word, status-word, and tag word registers to their default values. Assemblers issue FINIT as an FWAIT instruction followed by an FNINIT instruction. Thus, FINIT (but not FNINIT) reports pending unmasked x87 floating-point exceptions before performing the initialization.

Both FINIT and FNINIT write the control word with its initialization value, 037Fh, which specifies round-to-nearest, all exceptions masked, and double-extended-precision. The tag word indicates that the floating-point registers are empty. The status word and the four condition-code bits are cleared to 0.
The x87 pointers and opcode state ("Pointers and Opcode State" on page 297) are all cleared to 0.

The FINIT instruction should be used when pending x87 floating-point exceptions are being reported (unmasked). The no-wait instruction, FNINIT, should be used when pending x87 floating-point exceptions are not being reported (masked).

Wait for Exceptions.

FWAIT or WAIT--Wait for Unmasked x87 Floating-Point Exceptions

The FWAIT and WAIT instructions are synonyms. The instruction forces the processor to test for and handle any pending, unmasked x87 floating-point exceptions.

Clear Exceptions.

FCLEX--Floating-Point Clear Flags
FNCLEX--Floating-Point No-Wait Clear Flags

These instructions clear the status-word exception flags, stack-fault flag, and busy flag. They leave the four condition-code bits undefined.

Assemblers issue FCLEX as an FWAIT instruction followed by an FNCLEX instruction. Thus, FCLEX (but not FNCLEX) reports pending unmasked x87 floating-point exceptions before clearing the exception flags.

The FCLEX instruction should be used when pending x87 floating-point exceptions are being reported (unmasked). The no-wait instruction, FNCLEX, should be used when pending x87 floating-point exceptions are not being reported (masked).

Save and Restore x87 Control Word.

FLDCW--Floating-Point Load x87 Control Word
FSTCW--Floating-Point Store Control Word
FNSTCW--Floating-Point No-Wait Store Control Word

These instructions load or store the x87 control-word register as a 2-byte value from or to a memory location.

The FLDCW instruction loads a control word. If the loaded control word unmasks any pending x87 floating-point exceptions, these exceptions are reported when the next non-control x87 or 64-bit media instruction is executed.

Assemblers issue FSTCW as an FWAIT instruction followed by an FNSTCW instruction.
Thus, FSTCW (but not FNSTCW) reports pending unmasked x87 floating-point exceptions before storing the control word.

The FSTCW instruction should be used when pending x87 floating-point exceptions are being reported (unmasked). The no-wait instruction, FNSTCW, should be used when pending x87 floating-point exceptions are not being reported (masked).

Save x87 Status Word.

FSTSW--Floating-Point Store Status Word
FNSTSW--Floating-Point No-Wait Store Status Word

These instructions store the x87 status word either at a specified 2-byte memory location or in the AX register. The second form, FxSTSW AX, is used in older code to copy condition codes through the AX register to the rFLAGS register, where they can be used for conditional branching using general-purpose instructions. However, the combination of FCOMI and FCMOVcc provides a faster method of conditional branching.

Assemblers issue FSTSW as an FWAIT instruction followed by an FNSTSW instruction. Thus, FSTSW (but not FNSTSW) reports pending unmasked x87 floating-point exceptions before storing the status word.

The FSTSW instruction should be used when pending x87 floating-point exceptions are being reported (unmasked). The no-wait instruction, FNSTSW, should be used when pending x87 floating-point exceptions are not being reported (masked).

Save and Restore x87 Environment.

FLDENV--Floating-Point Load x87 Environment
FNSTENV--Floating-Point No-Wait Store Environment
FSTENV--Floating-Point Store Environment

These instructions load or store the entire x87 environment (non-data processor state) as a 14-byte or 28-byte block, depending on effective operand size, from or to memory.

When executing FLDENV, if any exception flags are set in the new status word and these exceptions are unmasked in the control word, a floating-point exception occurs when the next non-control x87 or 64-bit media instruction is executed.
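Whether a pending exception is reported depends on the mask bits in the control word that FLDCW or FLDENV loads. The Python sketch below decodes a 16-bit control word, using the initialization value 037Fh as the example input. It assumes the usual x87 control-word layout, which this section does not restate: exception masks IM, DM, ZM, OM, UM, and PM in bits 0 through 5, precision control in bits 9-8, and rounding control in bits 11-10. The name decode_control_word and the dictionary keys are invented for the example.

```python
# Assumed layout: masks in bits 0-5, PC in bits 9-8, RC in bits 11-10.
MASK_NAMES = ("IM", "DM", "ZM", "OM", "UM", "PM")

PC_MODES = {
    0b00: "single precision",
    0b01: "reserved",
    0b10: "double precision",
    0b11: "double-extended precision",
}

RC_MODES = {
    0b00: "round to nearest",
    0b01: "round down",
    0b10: "round up",
    0b11: "round toward zero",
}


def decode_control_word(control_word: int) -> dict:
    """Split a 16-bit x87 control word into its main fields."""
    masked = [name for bit, name in enumerate(MASK_NAMES)
              if (control_word >> bit) & 1]
    return {
        "masked": masked,
        "precision": PC_MODES[(control_word >> 8) & 0b11],
        "rounding": RC_MODES[(control_word >> 10) & 0b11],
    }


if __name__ == "__main__":
    # 037Fh is the value written by FINIT and FNINIT.
    print(decode_control_word(0x037F))
```

Decoding 037Fh this way reproduces the description of the initialization value: round-to-nearest, all exceptions masked, and double-extended precision.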
Assemblers issue FSTENV as an FWAIT instruction followed by an FNSTENV instruction. Thus, FSTENV (but not FNSTENV) reports pending unmasked x87 floating-point exceptions before storing the status word.

The x87 environment includes the x87 control word register, x87 status word register, x87 tag word, last x87 instruction pointer, last x87 data pointer, and last x87 opcode. See "Media and x87 Processor State" in Volume 2 for details on how the x87 environment is stored in memory.

Save and Restore x87 and 64-Bit Media State.

FSAVE--Save x87 and MMX State.
FNSAVE--Save No-Wait x87 and MMX State.
FRSTOR--Restore x87 and MMX State.

These instructions save and restore the entire processor state for x87 floating-point instructions and 64-bit media instructions. The instructions save and restore either 94 or 108 bytes of data, depending on the effective operand size.

Assemblers issue FSAVE as an FWAIT instruction followed by an FNSAVE instruction. Thus, FSAVE (but not FNSAVE) reports pending unmasked x87 floating-point exceptions before saving the state. After saving the state, the processor initializes the x87 state by performing the equivalent of an FINIT instruction. For details, see "State-Saving" on page 351.

Save and Restore x87, 128-Bit, and 64-Bit State.

FXSAVE--Save XMM, MMX, and x87 State.
FXRSTOR--Restore XMM, MMX, and x87 State.

The FXSAVE and FXRSTOR instructions save and restore the entire 512-byte processor state for 128-bit media instructions, 64-bit media instructions, and x87 floating-point instructions. The architecture supports two memory formats for FXSAVE and FXRSTOR, a 512-byte 32-bit legacy format and a 512-byte 64-bit format. Selection of the 32-bit or 64-bit format is determined by the effective operand size for the FXSAVE and FXRSTOR instructions.
For details, see "Saving Media and x87 Processor State" in Volume 2.

FXSAVE and FXRSTOR execute faster than FSAVE/FNSAVE and FRSTOR. However, unlike FSAVE and FNSAVE, FXSAVE does not initialize the x87 state, and like FNSAVE it does not report pending unmasked x87 floating-point exceptions. For details, see "State-Saving" on page 351.

6.6 Instruction Effects on rFLAGS

The rFLAGS register is described in "Flags Register" on page 37. Table 6-16 summarizes the effect that x87 floating-point instructions have on individual flags within the rFLAGS register. Only instructions that access the rFLAGS register are shown--all other x87 instructions have no effect on rFLAGS.

The following codes are used within the table:
Mod--The flag is modified.
Tst--The flag is tested.
A hyphen (-) indicates the flag is not affected by the instruction.

Table 6-16. Instruction Effects on rFLAGS

Instruction Mnemonic | RF (16) | NT (14) | OF (11) | DF (10) | IF (9) | TF (8) | SF (7) | ZF (6) | AF (4) | PF (2) | CF (0)
FCMOVcc | - | - | - | - | - | - | - | Tst | - | Tst | Tst
FCOMI, FCOMIP, FUCOMI, FUCOMIP | - | - | - | - | - | - | - | Mod | - | Mod | Mod

6.7 Instruction Prefixes

Instruction prefixes, in general, are described in "Instruction Prefixes" on page 85. The following restrictions apply to the use of instruction prefixes with x87 instructions.

Supported Prefixes. The following prefixes can be used with x87 instructions:
Operand-Size Override--The 66h prefix affects only the FLDENV, FSTENV, FNSTENV, FSAVE, FNSAVE, and FRSTOR instructions, in which it selects between a 16-bit and 32-bit memory-image format. The prefix is ignored by all other x87 instructions.
Address-Size Override--The 67h prefix affects only operands in memory, for which it selects between 16-bit and 32-bit addresses. The prefix is ignored by all other x87 instructions.
Segment Overrides--The 2Eh (CS), 36h (SS), 3Eh (DS), 26h (ES), 64h (FS), and 65h (GS) prefixes specify a segment. They affect only operands in memory. In 64-bit mode, the CS, DS, ES, and SS segment overrides are ignored.
REX--The REX prefix affects only the FXSAVE and FXRSTOR instructions, in which it selects between two types of 512-byte memory-image format, as described in "Saving Media and x87 Processor State" in Volume 2. The prefix is ignored by all other x87 instructions.

Ignored Prefixes. The following prefixes are ignored by x87 instructions:
REP--The F3h and F2h prefixes.

Prefixes That Cause Exceptions. The following prefixes cause an exception:
LOCK--The F0h prefix causes an invalid-opcode exception when used with x87 instructions.

6.8 Feature Detection

Before executing x87 floating-point instructions, software should determine if the processor supports the technology by executing the CPUID instruction. "Feature Detection" on page 90 describes how software uses the CPUID instruction to detect feature support. For full support of the x87 floating-point features, the following features must be present:
On-Chip Floating-Point Unit, indicated by bit 0 of CPUID standard function 1 and CPUID extended function 8000_0001h.
CMOVcc (conditional moves), indicated by bit 15 of CPUID standard function 1 and CPUID extended function 8000_0001h. This bit indicates support for x87 floating-point conditional moves (FCMOVcc) whenever the On-Chip Floating-Point Unit bit (bit 0) is also set.

Software may also wish to check for the following support, because the FXSAVE and FXRSTOR instructions execute faster than FSAVE and FRSTOR:
FXSAVE and FXRSTOR, indicated by bit 24 of CPUID standard function 1 and extended function 8000_0001h.
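The feature bits described in this section can be tested with simple shifts and masks once the CPUID results are in hand. The Python sketch below does not execute CPUID; it takes the returned feature words as plain integers. It assumes the feature flags are reported in the EDX register for both functions (this section does not name the register), and the function and key names are hypothetical. The long-mode bit (bit 29 of extended function 8000_0001h) is included for software that runs in long mode.

```python
FPU_BIT = 0         # On-Chip Floating-Point Unit
CMOV_BIT = 15       # CMOVcc; implies FCMOVcc only when the FPU bit is also set
FXSR_BIT = 24       # FXSAVE and FXRSTOR
LONG_MODE_BIT = 29  # Long Mode (extended function 8000_0001h only)


def bit(value: int, n: int) -> bool:
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def x87_support(std_edx: int, ext_edx: int) -> dict:
    """Decode feature words assumed to come from CPUID function 1 (std_edx)
    and extended function 8000_0001h (ext_edx)."""
    fpu = bit(std_edx, FPU_BIT)
    return {
        "x87": fpu,
        "fcmovcc": fpu and bit(std_edx, CMOV_BIT),
        "fxsave_fxrstor": bit(std_edx, FXSR_BIT),
        "long_mode": bit(ext_edx, LONG_MODE_BIT),
    }


# Hypothetical sample values, not read from real hardware.
sample_std = (1 << FPU_BIT) | (1 << CMOV_BIT) | (1 << FXSR_BIT)
sample_ext = 1 << LONG_MODE_BIT
print(x87_support(sample_std, sample_ext))
```

Note that the CMOVcc bit alone is not treated as FCMOVcc support; the sketch requires the floating-point unit bit as well, as the text above specifies.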
Software that runs in long mode should also check for the following support:
Long Mode, indicated by bit 29 of CPUID extended function 8000_0001h.

See "Processor Feature Identification" in Volume 2 for a full description of the CPUID instruction and its function codes.

6.9 Exceptions

Types of Exceptions. x87 instructions can generate two types of exceptions:
General-Purpose Exceptions, described below in "General-Purpose Exceptions"
x87 Floating-Point Exceptions (#MF), described in "x87 Floating-Point Exception Causes" on page 338

Relation to 128-Bit Media Exceptions. Although the x87 floating-point instructions and the 128-bit media instructions each have certain exceptions with the same names, the exception-reporting and exception-handling methods used by the two instruction subsets are distinct and independent of each other. If procedures using both types of instructions are run in the same operating environment, separate service routines should be provided for the exceptions of each type of instruction subset.

6.9.1 General-Purpose Exceptions

The sections below list general-purpose exceptions generated and not generated by x87 floating-point instructions. For a summary of the general-purpose exception mechanism, see "Interrupts and Exceptions" on page 104. For details about each exception and its potential causes, see "Exceptions and Interrupts" in Volume 2.

Exceptions Generated. x87 instructions can generate the following general-purpose exceptions:
#DB--Debug Exception (Vector 1)
#BP--Breakpoint Exception (Vector 3)
#UD--Invalid-Opcode Exception (Vector 6)
#NM--Device-Not-Available Exception (Vector 7)
#DF--Double-Fault Exception (Vector 8)
#SS--Stack Exception (Vector 12)
#GP--General-Protection Exception (Vector 13)
#PF--Page-Fault Exception (Vector 14)
#MF--x87 Floating-Point Exception-Pending (Vector 16)
#AC--Alignment-Check Exception (Vector 17)
#MC--Machine-Check Exception (Vector 18)

For details on #MF exceptions, see "x87 Floating-Point Exception Causes" below.

Exceptions Not Generated. x87 instructions do not generate the following general-purpose exceptions:
#DE--Divide-by-zero-error exception (Vector 0)
Non-Maskable-Interrupt Exception (Vector 2)
#OF--Overflow exception (Vector 4)
#BR--Bound-range exception (Vector 5)
Coprocessor-segment-overrun exception (Vector 9)
#TS--Invalid-TSS exception (Vector 10)
#NP--Segment-not-present exception (Vector 11)
#MC--Machine-check exception (Vector 18)
#XF--SIMD floating-point exception (Vector 19)

For details on all general-purpose exceptions, see "Exceptions and Interrupts" in Volume 2.

6.9.2 x87 Floating-Point Exception Causes

The x87 floating-point exception-pending (#MF) exception listed above in "General-Purpose Exceptions" is actually the logical OR of six exceptions that can be caused by x87 floating-point instructions. Each of the six exceptions has a status flag in the x87 status word and a mask bit in the x87 control word. A seventh exception, stack fault (SF), is reported together with one of the six maskable exceptions and does not have a mask bit.

If an #MF exception occurs when its mask bit is set to 1 (masked), the processor responds in a default way that does not invoke the #MF exception service routine. If an exception occurs when its mask bit is cleared to 0 (unmasked), the processor suspends processing of the faulting instruction (if possible) and, at the boundary of the next non-control x87 or 64-bit media instruction (see "Control" on page 331), determines that an
unmasked exception is pending--by checking the exception status (ES) flag in the x87 status word--and invokes the #MF exception service routine.

#MF Exception Types and Flags. The #MF exceptions are of six types, five of which are mandated by the IEEE 754 standard. These six types and their bit-flags in the x87 status word are shown in Table 6-17. A stack fault (SF) exception is always accompanied by an invalid-operation exception (IE). A summary of each exception type is given in "x87 Status Word Register" on page 289.

Table 6-17. x87 Floating-Point (#MF) Exception Flags

Exception and Mnemonic | x87 Status-Word Bit (note 1) | Comparable IEEE 754 Exception
Invalid-operation exception (IE) | 0 | Invalid Operation
Invalid-operation exception (IE) with stack fault (SF) exception | 0 and 6 | none
Denormalized-operand exception (DE) | 1 | none
Zero-divide exception (ZE) | 2 | Division by Zero
Overflow exception (OE) | 3 | Overflow
Underflow exception (UE) | 4 | Underflow
Precision exception (PE) | 5 | Inexact

Note:
1. See "x87 Status Word Register" on page 289 for a summary of each exception.

The sections below describe the causes for the #MF exceptions. Masked and unmasked responses to the exceptions are described in "x87 Floating-Point Exception Masking" on page 344. The priority of #MF exceptions is described in "x87 Floating-Point Exception Priority" on page 342.

Invalid-Operation Exception (IE). The IE exception occurs due to one of the attempted operations shown in Table 6-18 on page 340. An IE exception may also be accompanied by a stack fault (SF) exception. See "Stack Fault (SF)" on page 341.

Table 6-18. Invalid-Operation Exception (IE) Causes

Arithmetic (IE exception)

Operation | Condition
Any Arithmetic Operation | A source operand is an SNaN, or a source operand is an unsupported data type (pseudo-NaN, pseudo-infinity, pseudo-denormal, or unnormal)
FADD, FADDP | Source operands are infinities with opposite signs
FSUB, FSUBP, FSUBR, FSUBRP | Source operands are infinities with same sign
FMUL, FMULP | Source operands are zero and infinity
FDIV, FDIVP, FDIVR, FDIVRP | Source operands are both infinities or both zeros
FSQRT | Source operand is less than zero (except -0, which returns -0)
FYL2X | Source operand is less than zero (except 0, which returns a signed infinity)
FYL2XP1 | Source operand is less than minus one
FCOS, FPTAN, FSIN, FSINCOS | Source operand is infinity
FCOM, FCOMP, FCOMPP, FCOMI, FCOMIP | A source operand is a QNaN
FUCOM, FUCOMP, FUCOMPP, FUCOMI, FUCOMIP | A source operand is an SNaN
FPREM, FPREM1 | Dividend is infinity or divisor is zero
FIST, FISTP | Source operand overflows destination data type
FXCH | A source register is specified empty by its tag bits
FBSTP | Source operand overflows packed BCD data type

Stack (IE and SF exceptions)

Condition: Stack overflow or underflow (note 1)

Note:
1. The processor sets condition code C1 = 1 for overflow, C1 = 0 for underflow.

Denormalized-Operand Exception (DE). The DE exception occurs in any of the following cases:
Denormalized Operand (any precision)--An arithmetic instruction uses an operand of any precision that is in denormalized form, as described in "Denormalized (Tiny) Numbers" on page 306.
Denormalized Single-Precision or Double-Precision Load--An instruction loads a single-precision or double-precision (but not double-extended-precision) operand, which is in denormalized form, into an x87 register.

Zero-Divide Exception (ZE).
The ZE exception occurs when:
Divisor is Zero--An FDIV, FDIVP, FDIVR, FDIVRP, FIDIV, or FIDIVR instruction attempts to divide zero into a non-zero finite dividend.
Source Operand is Zero--An FYL2X or FXTRACT instruction uses a source operand that is zero.

Overflow Exception (OE). The OE exception occurs when the value of a rounded floating-point result is larger than the largest representable normalized positive or negative floating-point number in the destination format, as shown in Table 6-5 on page 303. An overflow can occur through computation or through conversion of higher-precision numbers to lower-precision numbers. See "Precision" on page 313. Integer and BCD overflow is reported via the invalid-operation exception.

Underflow Exception (UE). The UE exception occurs when the value of a rounded, non-zero floating-point result is too small to be represented as a normalized positive or negative floating-point number in the destination format, as shown in Table 6-5 on page 303. Integer and BCD underflow is reported via the invalid-operation exception.

Precision Exception (PE). The PE exception, also called the inexact-result exception, occurs when a floating-point result, after rounding, differs from the infinitely precise result and thus cannot be represented exactly in the specified destination format. Software that does not require exact results normally masks this exception. See "Precision" on page 313 and "Rounding" on page 314.

Stack Fault (SF). The SF exception occurs when a stack overflow (due to a push or load into a non-empty stack register) or stack underflow (due to referencing an empty stack register) occurs in the x87 stack-register file. The empty and non-empty conditions are shown in Table 6-3 on page 296. When either of these conditions occurs, the processor also sets the invalid-operation exception (IE) flag, and it sets or clears the condition-code 1 (C1) bit to indicate the direction of the stack fault (C1 = 1 for overflow, C1 = 0 for underflow). Unlike the flags for the other x87 exceptions, the SF flag does not have a corresponding mask bit in the x87 control word.

6.9.3 x87 Floating-Point Exception Priority

Table 6-19 shows the priority with which the processor recognizes multiple, simultaneous x87 floating-point exceptions and operations involving QNaN operands. Each exception type is characterized by its timing, as follows:
Pre-Computation--an exception that is recognized before an instruction begins its operation.
Post-Computation--an exception that is recognized after an instruction completes its operation.

For post-computation exceptions, a result may be written to the destination, depending on the type of exception and whether the destination is a register or memory location. Operations involving QNaNs do not necessarily cause exceptions, but the processor handles them with the priority shown in Table 6-19 on page 343 relative to the handling of exceptions.

Table 6-19. Priority of x87 Floating-Point Exceptions

Priority | Exception or Operation | Timing
1 | Invalid-operation exception (IE) with stack fault (SF) due to underflow | Pre-Computation
2 | Invalid-operation exception (IE) with stack fault (SF) due to overflow | Pre-Computation
3 | Invalid-operation exception (IE) when accessing unsupported data type | Pre-Computation
4 | Invalid-operation exception (IE) when accessing SNaN operand | Pre-Computation
5 | Operation involving a QNaN operand (note 1) | --
6 | Any other type of invalid-operation exception (IE) | Pre-Computation
6 | Zero-divide exception (ZE) | Pre-Computation
7 | Denormalized operation exception (DE) | Pre-Computation
8 | Overflow exception (OE) | Post-Computation
8 | Underflow exception (UE) | Post-Computation
9 | Precision (inexact) exception (PE) | Post-Computation

Note:
1. Operations involving QNaN operands do not, in themselves, cause exceptions but they are handled with this priority relative to the handling of exceptions.

For exceptions that occur before the associated operation (pre-computation, as shown in Table 6-19), if an unmasked exception occurs, the processor suspends processing of the faulting instruction but it waits until the boundary of the next non-control x87 or 64-bit media instruction to be executed before invoking the associated exception service routine. During this delay, non-x87 instructions may overwrite the faulting x87 instruction's source or destination operands in memory. If that occurs, the x87 service routine may be unable to perform its job.

To prevent such problems, analyze x87 procedures for potential exception-causing situations and insert a WAIT or other safe x87 instruction immediately after any x87 instruction that may cause a problem.

6.9.4 x87 Floating-Point Exception Masking

The six floating-point exception flags in the x87 status word have corresponding exception-flag masks in the x87 control word, as shown in Table 6-20.
Table 6-20. x87 Floating-Point (#MF) Exception Masks

Exception Mask and Mnemonic | x87 Control-Word Bit (note 1)
Invalid-operation exception mask (IM) | 0
Denormalized-operand exception mask (DM) | 1
Zero-divide exception mask (ZM) | 2
Overflow exception mask (OM) | 3
Underflow exception mask (UM) | 4
Precision exception mask (PM) | 5

Note:
1. See "x87 Status Word Register" on page 289 for a summary of each exception.

Each mask bit, when set to 1, inhibits invocation of the #MF exception handler and instead continues normal execution using the default response for the exception type. During initialization with FINIT or FNINIT, all exception-mask bits in the x87 control word are set to 1 (masked). At processor reset, all exception-mask bits are cleared to 0 (unmasked).

Masked Responses. The occurrence of a masked exception does not invoke its exception handler when the exception condition occurs. Instead, the processor handles masked exceptions in a default way, as shown in Table 6-21 on page 345.

Table 6-21. Masked Responses to x87 Floating-Point Exceptions

Invalid-operation exception (IE) (note 2)

Type of Operation (note 1) | Processor Response
Any Arithmetic Operation: Source operand is an SNaN | Set IE flag, and return a QNaN value.
Any Arithmetic Operation: Source operand is an unsupported data type; or FADD, FADDP: Source operands are infinities with opposite signs; or FSUB, FSUBP, FSUBR, FSUBRP: Source operands are infinities with same sign; or FMUL, FMULP: Source operands include zero and infinity; or FDIV, FDIVP, FDIVR, FDIVRP: Source operands are both infinities or both are zeros; or FSQRT: Source operand is less than zero (except -0, which returns -0); or FYL2X: Source operand is less than zero (except 0, which returns a signed infinity); or FYL2XP1: Source operand is less than minus one | Set IE flag, and return the floating-point indefinite value (note 3).
FCOS, FPTAN, FSIN, FSINCOS: Source operand is infinity; or FPREM, FPREM1: Dividend is infinity or divisor is 0 | Set IE flag, return the floating-point indefinite value (note 3), and clear condition code C2 to 0.
FCOM, FCOMP, or FCOMPP: One or both operands is a NaN; or FUCOM, FUCOMP, or FUCOMPP: One or both operands is an SNaN | Set IE flag, and set C3-C0 condition codes to reflect the result.
FCOMI or FCOMIP: One or both operands is a NaN; or FUCOMI or FUCOMIP: One or both operands is an SNaN | Set IE flag, and write the zero (ZF), parity (PF), and carry (CF) flags in rFLAGS according to the result.
FIST, FISTP: Source operand overflows destination data type | Set IE flag, and return the integer indefinite value (note 3).
FXCH: A source register is specified empty by its tag bits | Set IE flag, and perform exchange using floating-point indefinite value (note 3) as content for empty register(s).
FBSTP: Source operand overflows packed BCD data type | Set IE flag, and return the packed-decimal indefinite value (note 3).

Denormalized-operand exception (DE)

Processor Response: Set DE flag, and return the result using the denormal operand(s).

Zero-divide exception (ZE)

Type of Operation (note 1) | Processor Response
FDIV, FDIVP, FDIVR, FDIVRP, FIDIV, or FIDIVR: Divisor is 0 | Set ZE flag, and return signed infinity with sign bit = XOR of the operand sign bits.
FYL2X: ST(0) is 0 and ST(1) is a non-zero floating-point value | Set ZE flag, and return signed infinity with sign bit = complement of sign bit for ST(1) operand.
FXTRACT: Source operand is 0 | Set ZE flag, write ST(0) = 0 with sign of operand, and write ST(1) = -infinity.

Overflow exception (OE)

Type of Operation (note 1) | Processor Response
Round to nearest | If sign of result is positive, set OE flag, and return +infinity. If sign of result is negative, set OE flag, and return -infinity.
Round toward +infinity | If sign of result is positive, set OE flag, and return +infinity. If sign of result is negative, set OE flag, and return finite negative number with largest magnitude.
Round toward -infinity | If sign of result is positive, set OE flag, and return finite positive number with largest magnitude. If sign of result is negative, set OE flag, and return -infinity.
Round toward 0 | If sign of result is positive, set OE flag and return finite positive number with largest magnitude. If sign of result is negative, set OE flag and return finite negative number with largest magnitude.

Underflow exception (UE)

Processor Response: If result is both denormal (tiny) and inexact, set UE flag and return denormalized result. If result is denormal (tiny) but not inexact, return denormalized result but do not set UE flag.

Precision exception (PE)

Type of Operation (note 1) | Processor Response
Without overflow or underflow | Set PE flag, return rounded result, write C1 condition code to specify round-up (C1 = 1) or not round-up (C1 = 0).
With masked overflow or underflow | Set PE flag and respond as for the OE or UE exceptions.
With unmasked overflow or underflow for register destination | Set PE flag, respond as for the OE or UE exception, and call OE or UE service routine.
With unmasked overflow or underflow for memory destination | Ignore PE exception, and assert FERR# as for an unmasked exception. The destination and the TOP are not changed.

Notes:
1. See "Instruction Summary" on page 315 for the types of instructions.
2. Includes invalid-operation exception (IE) together with stack fault (SF).
3. See "Indefinite Values" on page 311.

Unmasked Responses. The processor handles unmasked exceptions as shown in Table 6-22.

Table 6-22. Unmasked Responses to x87 Floating-Point Exceptions

Exception and Mnemonic | Type of Operation | Processor Response (note 1)
Invalid-operation exception (IE); Invalid-operation exception (IE) with stack fault (SF) | -- | Set IE and ES flags, and call the #MF service routine. The destination and the TOP are not changed.
Denormalized-operand exception (DE) | -- | Set DE and ES flags, and call the #MF service routine. The destination and the TOP are not changed.
Zero-divide exception (ZE) | -- | Set ZE and ES flags, and call the #MF service routine. The destination and the TOP are not changed.
Overflow exception (OE) | Destination is memory | Set OE and ES flags, and call the #MF service routine. The destination and the TOP are not changed.
Overflow exception (OE) | Destination is an x87 register | Divide true result by 2^24576; round significand according to PC precision control and RC rounding control (or round to double-extended precision for instructions not observing PC precision control); write C1 condition code according to rounding (C1 = 1 for round up, C1 = 0 for round toward zero); write result to destination; pop or push stack if specified by the instruction; set OE and ES flags, and call the #MF service routine.
Underflow exception (UE) | Destination is memory | Set UE and ES flags, and call the #MF service routine. The destination and the TOP are not changed.
Underflow exception (UE) | Destination is an x87 register | Multiply true result by 2^24576; round significand according to PC precision control and RC rounding control (or round to double-extended precision for instructions not observing PC precision control); write C1 condition code according to rounding (C1 = 1 for round up, C1 = 0 for round toward zero); write result to destination; pop or push stack if specified by the instruction; set UE and ES flags, and call the #MF service routine.
Precision exception (PE) | Without overflow or underflow | Set PE and ES flags, return rounded result, write C1 condition code to specify round-up (C1 = 1) or not round-up (C1 = 0), and call the #MF service routine.
Precision exception (PE) | With masked overflow or underflow; with unmasked overflow or underflow for register destination | Set PE and ES flags, respond as for the OE or UE exception, and call the #MF service routine.
Precision exception (PE) | With unmasked overflow or underflow for memory destination | Do not set PE flag, and set ES flag. The destination and the TOP are not changed.

Note:
1. For all unmasked exceptions, the processor's response also includes assertion of the FERR# output signal at the completion of the instruction that caused the exception.

FERR# and IGNNE# Signals. In all unmasked-exception responses, the processor also asserts the FERR# output signal at the completion of the instruction that caused the exception. The exception is serviced at the boundary of the next non-control x87 or 64-bit media instruction following the instruction that caused the exception. (See "Control" on page 331 for a definition of control instructions.)
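The register-destination responses to unmasked overflow and underflow in Table 6-22 scale the true result by 2^24576 before writing it. The Python sketch below uses exact rational arithmetic to illustrate one consequence of that factor: even the product of the two largest finite double-extended values lands back within range after scaling, and multiplying by the same factor recovers the true result. It ignores the rounding step, assumes the largest finite double-extended value is (2^64 - 1) x 2^(16383 - 63), and uses hypothetical function names; it is not a description of what a real service routine must do.

```python
from fractions import Fraction

BIAS_ADJUST = 2 ** 24576
# Assumed largest finite double-extended value:
# 64-bit significand of all ones, maximum unbiased exponent 16383.
MAX_EXTENDED = (2 ** 64 - 1) * 2 ** (16383 - 63)


def scale_overflowed(true_result) -> Fraction:
    """Model the unmasked-overflow response: divide the true result by 2^24576."""
    return Fraction(true_result) / BIAS_ADJUST


def recover(scaled: Fraction) -> Fraction:
    """Undo the scaling, as a service routine could do in wider arithmetic."""
    return scaled * BIAS_ADJUST


worst_case = MAX_EXTENDED * MAX_EXTENDED  # far beyond the representable range
scaled = scale_overflowed(worst_case)
print(scaled <= MAX_EXTENDED)         # True: the scaled value is back in range
print(recover(scaled) == worst_case)  # True: no information lost by scaling
```

The underflow case is symmetric: the true result is multiplied by 2^24576 and a routine would divide by the same factor to recover it.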
System software controls x87 floating-point exception reporting using the numeric error (NE) bit in control register 0 (CR0), as follows:
If CR0.NE = 1, internal processor control over x87 floating-point exception reporting is enabled. In this case, an #MF exception occurs when the FERR# output signal is asserted. It is recommended that system software set NE to 1. This enables optimal performance in handling x87 floating-point exceptions.
If CR0.NE = 0, internal processor control of x87 floating-point exceptions is disabled and the external IGNNE# input signal controls whether x87 floating-point exceptions are ignored, as follows:
- When IGNNE# is 1, x87 floating-point exceptions are ignored.
- When IGNNE# is 0, x87 floating-point exceptions are reported by setting the FERR# output signal to 1. External logic can use the FERR# signal as an external interrupt.

Using NaNs in IE Diagnostic Exceptions. Both SNaNs and QNaNs can be encoded with many different values to carry diagnostic information. By means of appropriate masking and unmasking of the invalid-operation exception (IE), software can use signaling NaNs to invoke an exception handler. Within the constraints imposed by the encoding of SNaNs and QNaNs, software may freely assign the bits in the significand of a NaN. See the section "Not a Number (NaN)" on page 308 for format details.

For example, software can pre-load each element of an array with a signaling NaN that encodes the array index. When an application accesses an uninitialized array element, the invalid-operation exception is invoked and the service routine can identify that element. A service routine can store debug information in memory as the exceptions occur. The routine can create a QNaN that references its associated debug area in memory. As the program runs, the service routine can create a different QNaN for each error condition, so that a single test run can identify a collection of errors.
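The array-index technique just described can be sketched at the bit level. The Python model below works on 64-bit integer bit patterns only and never converts them to floating-point values. It assumes the IEEE 754 double-precision layout (11-bit exponent of all ones, most-significant fraction bit clear for an SNaN and set for a QNaN); the payload scheme, which stores the index plus one so that the fraction is never zero, and all function names are hypothetical choices for illustration.

```python
EXPONENT_ALL_ONES = 0x7FF << 52  # double-precision exponent field, all ones
QUIET_BIT = 1 << 51              # most-significant fraction bit
PAYLOAD_MASK = QUIET_BIT - 1     # remaining 51 fraction bits


def snan_for_index(index: int) -> int:
    """Bit pattern of a signaling NaN whose payload encodes an array index."""
    payload = index + 1  # a zero fraction would encode infinity, not a NaN
    if not 0 < payload <= PAYLOAD_MASK:
        raise ValueError("index does not fit in the NaN payload")
    return EXPONENT_ALL_ONES | payload


def is_snan(bits: int) -> bool:
    """True if bits is a NaN pattern with the quiet bit clear."""
    return ((bits & EXPONENT_ALL_ONES) == EXPONENT_ALL_ONES
            and (bits & QUIET_BIT) == 0
            and (bits & PAYLOAD_MASK) != 0)


def index_from_nan(bits: int) -> int:
    """Recover the array index stored by snan_for_index."""
    return (bits & PAYLOAD_MASK) - 1


def quieted(bits: int) -> int:
    """Same payload as a QNaN, for example for a handler to hand back."""
    return bits | QUIET_BIT


pattern = snan_for_index(41)
print(hex(pattern))             # 0x7ff000000000002a
print(is_snan(pattern))         # True
print(index_from_nan(pattern))  # 41
print(is_snan(quieted(pattern)))  # False
```

A service routine built along these lines could read the faulting operand, call something like index_from_nan to identify the uninitialized element, and record it in its debug area.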
6.10 State-Saving

In general, system software should save and restore x87 state between task switches or other interventions in the execution of x87 floating-point procedures. Virtually all modern operating systems running on x86 processors--like Windows NT(R), UNIX, and OS/2--are preemptive multitasking operating systems that handle such saving and restoring of state properly across task switches, independently of hardware task-switch support. However, application procedures are also free to save and restore x87 state at any time they deem useful.

6.10.1 State-Saving Instructions

FSAVE/FNSAVE and FRSTOR Instructions. Application software can save and restore the x87 state by executing the FSAVE (or FNSAVE) and FRSTOR instructions. Alternatively, software may use multiple FxSTx (floating-point store stack top) instructions for saving only the contents of the x87 data registers, rather than the complete x87 state.

The FSAVE instruction stores the state, but only after handling any pending unmasked x87 floating-point exceptions, whereas the FNSAVE instruction skips the handling of these exceptions. The state of all x87 data registers is saved, as well as all x87 environment state (the x87 control word register, status word register, tag word, instruction pointer, data pointer, and last opcode register). After saving this state, the tag bits for all x87 registers are changed to empty and thus available for a new procedure.

FXSAVE and FXRSTOR Instructions. Application software can save and restore the 128-bit media state, 64-bit media state, and x87 floating-point state by executing the FXSAVE and FXRSTOR instructions.
The FXSAVE and FXRSTOR instructions execute faster than FSAVE/FNSAVE and FRSTOR because they do not save and restore the x87 pointers (last instruction pointer, last data pointer, and last opcode, described in "Pointers and Opcode State" on page 297) except in the relatively rare cases in which the exception-summary (ES) bit in the x87 status word (the ES register image for FXSAVE, or the ES memory image for FXRSTOR) is set to 1, indicating that an unmasked x87 exception has occurred.

Unlike FSAVE and FNSAVE, however, FXSAVE does not alter the tag bits. The state of the saved x87 data registers is retained, thus indicating that the registers may still be valid (or whatever other value the tag bits indicated prior to the save). To invalidate the contents of the x87 data registers after FXSAVE, software must explicitly execute an FINIT instruction. Also, FXSAVE (like FNSAVE) and FXRSTOR do not check for pending unmasked x87 floating-point exceptions. An FWAIT instruction can be used for this purpose.

The architecture supports two memory formats for FXSAVE and FXRSTOR, a 512-byte 32-bit legacy format and a 512-byte 64-bit format, used in 64-bit mode. Selection of the 32-bit or 64-bit format is determined by the effective operand size for the FXSAVE and FXRSTOR instructions. For details, see "Saving Media and x87 Processor State" in Volume 2.

6.11 Performance Considerations

In addition to typical code optimization techniques, such as those affecting loops and the inlining of function calls, the following considerations may help improve the performance of application programs written with x87 floating-point instructions.

These are implementation-independent performance considerations. Other considerations depend on the hardware implementation.
For information about such implementation-dependent considerations and for more information about application performance in general, see the data sheets and the software-optimization guides relating to particular hardware implementations.

6.11.1 Replace x87 Code with 128-Bit Media Code

Code written with 128-bit media floating-point instructions can operate in parallel on four times as many single-precision floating-point operands as can x87 floating-point code. This achieves potentially four times the computational work of x87 instructions that use single-precision operands. Also, the higher density of 128-bit media floating-point operands may make it possible to remove local temporary variables that would otherwise be needed in x87 floating-point code. 128-bit media code is easier to write than x87 floating-point code, because the XMM register file is flat rather than stack-oriented, and, in 64-bit mode, there are twice the number of XMM registers as x87 registers.

6.11.2 Use FCOMI-FCMOVx Branching

Depending on the hardware implementation of the architecture, the combination of FCOMI and FCMOVcc is often faster than the classical approach using FxSTSW AX instructions for comparison-based branches that depend on the condition codes for branch direction, because FNSTSW AX is often a serializing instruction.

6.11.3 Use FSINCOS Instead of FSIN and FCOS

Frequently, a piece of code that needs to compute the sine of an argument also needs to compute the cosine of that same argument. In such cases, use the FSINCOS instruction to compute both trigonometric functions concurrently, which is faster than using separate FSIN and FCOS instructions to accomplish the same task.
6.11.4 Break Up Dependency Chains

Parallelism can be increased by breaking up dependency chains or by evaluating multiple dependency chains simultaneously (explicitly switching execution between them). Depending on the hardware implementation of the architecture, the FXCH instruction may prove faster than FST/FLD pairs for switching execution between dependency chains.

Index

Symbols
#AC exception ........................................... 107
#BP exception ........................................... 106
#BR exception ........................................... 107
#DB exception ........................................... 106
#DE exception........................................... 106
#DF exception ........................................... 107
#GP exception ........................................... 107
#MC exception .......................................... 107
#MF exception .......................... 107, 276, 292
#NM exception .......................................... 107
#NP exception ........................................... 107
#OF exception ........................................... 107
#PF exception ........................................... 107
#SS exception............................................ 107
#TS exception............................................ 107
#UD exception .................................. 107, 212
#XF exception ................................... 107, 212

Numerics
128-bit media programming..................... 127
16-bit mode................................................. xxi
32-bit mode................................................. xxi
3DNow!™ instructions ............................. 229
64-bit media programming....................... 229
64-bit mode.............................................
xxi, 8 A AAA instruction.................................... 56, 83 AAD instruction.................................... 56, 83 AAM instruction ................................... 56, 83 AAS instruction .................................... 56, 83 aborts ......................................................... 106 absolute address ......................................... 19 ADC instruction .......................................... 59 ADD instruction.......................................... 59 addition........................................................ 59 ADDPD instruction................................... 198 ADDPS instruction ................................... 198 addressing absolute address ...................................... 19 address size .................................. 21, 81, 87 branch address ......................................... 82 canonical form ......................................... 18 complex address....................................... 19 effective address ...................................... 18 I/O ports .......................................... 147, 240 IP-relative ........................................... 19, 22 linear................................................... 13, 15 memory..................................................... 16 operands................................... 46, 147, 240 PC-relative ......................................... 19, 22 RIP-relative.................................... xxvii, 22 stack address ........................................... 20 string address .......................................... 20 virtual................................................. 13, 15 x87 stack................................................. 289 ADDSD instruction .................................. 198 ADDSS instruction................................... 198 AF bit .......................................................... 39 affine ordering ................................. 
156, 308 AH register ........................................... 29, 30 AL register ............................................ 29, 30 alignment 128-bit media ................................. 148, 225 64-bit media ........................................... 241 general-purpose............................... 47, 125 AND instruction ......................................... 67 ANDNPD instruction ............................... 206 ANDNPS instruction................................ 206 ANDPD instruction .................................. 206 ANDPS instruction................................... 206 arithmetic instructions ..... 59, 174, 197, 255, 267, 320 ARPL instruction ....................................... 84 array bounds ............................................... 67 ASCII adjust instructions.......................... 56 auxiliary carry flag..................................... 39 AX register ........................................... 29, 30 B B bit ........................................................... 292 BCD data type .......................................... 304 BCD digits ................................................... 43 BH register............................................ 29, 30 biased exponent ........ xxi, 151, 157, 302, 310 binary-coded-decimal (BCD) digits .......... 43 bit scan instructions................................... 65 bit strings .................................................... 44 bit test instructions.................................... 65 BL register ............................................ 29, 30 BOUND instruction.............................. 67, 83 BP register ............................................ 29, 30 BPL register ................................................ 30 branch removal ................. 137, 184, 234, 262 branch-address displacements .................. 82 branches .............................. 93, 102, 125, 224 Index 355 AMD 64-Bit Technology 24593--Rev. 
3.09--September 2003 BSF instruction ........................................... 65 BSR instruction........................................... 65 BSWAP instruction ..................................... 57 BT instruction ............................................. 65 BTC instruction........................................... 65 BTR instruction .......................................... 65 BTS instruction ........................................... 65 busy (B) bit ................................................ 292 BX register ............................................ 29, 30 byte ordering......................................... 16, 57 byte registers............................................... 33 C C3-C0 bits ................................................. 292 cache .......................................................... 117 cachability .............................................. 225 coherency................................................ 120 line .......................................................... 120 management..................................... 79, 122 pollution ................................................. 121 prefetching ............................................. 122 stale lines................................................ 124 cache management instructions................ 79 CALL instruction ............................ 72, 83, 96 caller-save parameter passing ................. 223 canonical address form .............................. 18 carry flag ..................................................... 39 CBW instruction.......................................... 54 CDQ instruction .......................................... 54 CDQE instruction ....................................... 54 CF bit ........................................................... 39 CH register ............................................ 29, 30 CL register............................................. 
29, 30 clamping .................................................... 149 CLC instruction........................................... 75 CLD instruction .......................................... 75 clearing the MMX state ........... 223, 247, 279 CLFLUSH instruction ................................ 79 CLI instruction............................................ 76 CMC instruction.......................................... 75 CMOVcc instructions.................................. 49 CMP instruction.......................................... 64 CMPPD instruction................................... 203 CMPPS instruction ................................... 203 CMPS instruction........................................ 68 CMPSB instruction ..................................... 68 CMPSD instruction............................. 68, 203 CMPSQ instruction..................................... 68 CMPSS instruction ................................... 203 CMPSW instruction .................................... 68 CMPXCHG instruction............................... 78 CMPXCHG8B instruction .......................... 78 COMISD instruction ................................ 205 COMISS instruction ................................. 205 commit............................................... xxii, 113 compare instructions 64, 183, 202, 262, 271, 327 compatibility mode .............................. xxii, 9 complex address ......................................... 19 condition codes (C3-C0).......................... 292 conditional moves .............................. 49, 319 constants ................................................... 320 control instructions (x87) ........................ 331 control transfers ............................. 19, 69, 93 control word.............................................. 293 CPUID instruction ....... 79, 90, 209, 273, 336 CQO instruction ......................................... 54 CR0.EM bit ............................................... 
299 CVTDQ2PD instruction ........................... 166 CVTDQ2PS instruction............................ 166 CVTPD2DQ instruction ........................... 193 CVTPD2PI instruction..................... 194, 266 CVTPD2PS instruction ............................ 192 CVTPI2PD instruction..................... 167, 250 CVTPI2PS instruction ..................... 166, 250 CVTPS2DQ instruction............................ 193 CVTPS2PD instruction ............................ 192 CVTPS2PI instruction ..................... 194, 266 CVTSD2SI instruction ............................. 195 CVTSD2SS instruction ............................ 192 CVTSI2SD instruction ............................. 167 CVTSI2SS instruction.............................. 167 CVTSS2SD instruction ............................ 192 CVTSS2SI instruction.............................. 195 CVTTPD2DQ instruction......................... 193 CVTTPD2PI instruction .................. 194, 266 CVTTPS2DQ instruction ......................... 193 CVTTPS2PI instruction ................... 194, 266 CVTTSD2SI instruction........................... 195 CVTTSS2SI instruction ........................... 195 CWD instruction......................................... 54 CWDE instruction ...................................... 54 CX register............................................ 29, 30 D DAA instruction ................................... 56, 83 DAS instruction .................................... 56, 83 data conversion instructions .... 54, 166, 192, 250, 266, 317 data reordering instructions ... 168, 195, 251 data transfer instructions. 49, 162, 187, 248, 317 data types 128-bit media ......................................... 145 356 Index 24593--Rev. 3.09--September 2003 AMD 64-Bit Technology 128-bit media floating-point ................. 156 64-bit media............................................ 238 general-purpose ....................................... 41 mismatched ............................................ 
223 x87 ................................................... 300, 308 DAZ bit ...................................................... 143 DE bit......................... 142, 143, 214, 291, 340 DEC instruction .................................... 61, 84 decimal adjust instructions ....................... 56 decrement.................................................... 61 default address size .................................... 21 default operand size ................................... 33 denormalized numbers..................... 154, 306 denormalized-operand exception (DE) . 214, 340 dependencies ............................................ 125 DF bit ........................................................... 40 DH register............................................ 29, 30 DI register ............................................. 29, 30 DIL register ................................................. 30 direct far jump...................................... 70, 73 direct referencing ..................................... xxii direction flag............................................... 40 displacements ............................... xxii, 20, 82 DIV instruction ........................................... 60 division ........................................................ 60 DIVPD instruction .................................... 200 DIVPS instruction..................................... 200 DIVSD instruction .................................... 200 DIVSS instruction..................................... 200 DL register ............................................ 29, 30 DM bit ................................................ 143, 294 dot product ........................................ 136, 234 double quadword .................................... xxiii double-extended-precision format .......... 303 double-precision format ................... 151, 303 doubleword................................................ 
xxii DX register ............................................ 29, 30 E EAX register ......................................... 29, 30 eAX-eSP register.................................. xxviii EBP register .......................................... 29, 30 EBX register.......................................... 29, 30 ECX register.......................................... 29, 30 EDI register........................................... 29, 30 EDX register ......................................... 29, 30 effective address................................... 18, 58 effective address size ............................. xxiii effective operand size ...................... xxiii, 45 EFLAGS register................................... 29, 37 eFLAGS register ...................................... xxix EIP register................................................. 25 eIP register .............................................. xxix element .................................................... xxiii EM bit........................................................ 299 EMMS instruction ............................ 247, 279 empty................................................. 277, 296 emulation (EM) bit .................................. 299 endian byte-ordering .................. xxxi, 16, 57 ENTER instruction .................................... 52 environment x87 .................................................. 298, 333 ES bit......................................................... 292 ESI register ........................................... 29, 30 ESP register .......................................... 29, 30 exception status (ES) bit ......................... 292 exceptions ........................................ xxiii, 104 #MF causes .................................... 276, 338 #XF causes ............................................. 211 128-bit media ......................................... 209 64-bit media ........................................... 
274 denormalized-operand (DE)......... 214, 340 general-purpose..................................... 104 inexact-result ................................. 215, 341 invalid-operation (IE) ................... 213, 339 masked responses.......................... 218, 344 masking .......................................... 218, 344 overflow (OE)................................. 214, 341 post-computation........................... 216, 342 precision (PE) ................................ 215, 341 pre-computation ............................ 216, 342 priority ........................................... 216, 342 SIMD floating-point causes .................. 211 stack fault (SF) ...................................... 341 underflow (UE).............................. 215, 341 unmasked responses ............................. 348 x87 .......................................................... 337 zero-divide (ZE)............................. 214, 341 exit media state........................................ 247 explicit integer bit ................................... 302 exponent .................... xxi, 151, 157, 302, 310 extended functions .................................... 91 external interrupts................................... 105 extract instructions .................. 171, 253, 320 F F2XM1 instruction ................................... 326 FABS instruction ...................................... 324 FADD instruction ..................................... 321 FADDP instruction................................... 321 far calls........................................................ 97 far jumps..................................................... 95 far returns ................................................. 100 Index 357 AMD 64-Bit Technology 24593--Rev. 3.09--September 2003 fault............................................................ 105 FBLD instruction ...................................... 318 FBSTP instruction .................................... 
318 FCMOVcc instructions ............................. 319 FCOM instruction ..................................... 327 FCOMI instruction.................................... 327 FCOMIP instruction ................................. 327 FCOMP instruction................................... 327 FCOMPP instruction ................................ 327 FCOS instruction ...................................... 325 FCW register ............................................. 298 FDECSTP instruction............................... 330 FDIV instruction....................................... 323 FDIVP instruction .................................... 323 FDIVR instruction .................................... 323 FDIVRP instruction.................................. 323 feature detection ........................................ 90 FEMMS instruction .......................... 247, 279 FERR# output signal................................ 349 FFREE instruction ................................... 331 FICOM instruction.................................... 329 FICOMP instruction ................................. 329 FIDIV instruction ..................................... 323 FIMUL instruction.................................... 323 FINCSTP instruction ................................ 330 FINIT instruction...................................... 331 FIST instruction........................................ 318 FISTP instruction ..................................... 318 FISUB instruction..................................... 322 flags instructions ........................................ 75 FLAGS register ..................................... 29, 37 FLD instruction......................................... 317 FLD1 instruction....................................... 320 FLDL2E instruction.................................. 320 FLDL2T instruction.................................. 320 FLDLG2 instruction ................................. 320 FLDLN2 instruction ................................. 
320 FLDPI instruction..................................... 320 FLDZ instruction ...................................... 320 floating-point data types 128-bit media.......................................... 150 3DNow! TM ............................................... 243 64-bit media............................................ 243 x87 ........................................................... 301 flush ......................................................... xxiii flush-to-zero (FZ) bit ................................ 144 FMUL instruction ..................................... 322 FMULP instruction................................... 322 FNINIT instruction ................................... 331 FNOP instruction...................................... 331 FNSAVE instruction ......... 264, 265, 280, 334 FPATAN instruction ................................. 325 FPR0-FPR7 registers............................... 288 FPREM instruction .................................. 324 FPREM1 instruction ................................ 325 FPTAN instruction ................................... 325 FPU control word ..................................... 293 FPU status word ....................................... 289 FRNDINT instruction .............................. 324 FRSTOR instruction ................ 265, 280, 334 FS register .................................................. 20 FSAVE instruction ........... 264, 265, 280, 334 FSCALE instruction................................. 326 FSIN instruction....................................... 325 FSINCOS instruction ............................... 325 FST instruction......................................... 317 FSTP instruction ...................................... 318 FSUB instruction...................................... 322 FSUBP instruction ................................... 322 FSUBR instruction................................... 322 FSUBRP instruction ................................ 
322 FSW register ............................................. 299 FTST instruction ...................................... 329 FTW register............................................. 299 FUCOMx instructions.............................. 328 full ..................................................... 277, 296 FXAM instruction .................................... 329 FXCH instruction..................................... 320 FXRSTOR instruction ..... 186, 265, 280, 334 FXSAVE instruction ........ 186, 265, 280, 334 FXTRACT instruction.............................. 320 FYL2X instruction ................................... 326 FYL2XP1 instruction............................... 326 FZ bit......................................................... 144 G general-purpose programming.................. 27 general-purpose registers (GPRs)............. 27 GPR registers.............................................. 27 GS register .................................................. 20 H hidden integer bit ............ 151, 153, 302, 306 I I/O .............................................................. 109 address space......................................... 110 addresses.......................................... 77, 110 instructions .............................................. 76 memory-mapped.................................... 111 ports.................................. 77, 110, 147, 240 privilege level ........................................ 112 IDIV instruction ......................................... 60 IE bit.................................. 142, 213, 291, 339 358 Index 24593--Rev. 3.09--September 2003 AMD 64-Bit Technology IEEE 754 Standard ........... 143, 152, 285, 301 IGN ............................................................ xxiv IGNNE# input signal ................................ 350 IM bit ................................................. 143, 294 immediate operands....................... 20, 46, 82 implied integer bit............................ 
151, 302 IMUL instruction ........................................ 60 IN instruction .............................................. 77 INC instruction ..................................... 61, 84 increment .................................................... 61 indefinite value floating-point.................................. 157, 311 integer............................................. 157, 311 packed-decimal ...................................... 311 indirect ..................................................... xxiv inexact-result exception... 143, 215, 291, 341 infinity ............................................... 155, 307 infinity bit (Y)........................................... 295 initialization MSCSR register ..................................... 141 x87 control word .................................... 293 XMM registers........................................ 140 inner product .................................... 136, 234 input/output (I/O) ..................................... 109 INS instruction............................................ 77 INSB instruction ......................................... 77 INSD instruction ......................................... 77 insert instructions............................. 171, 253 instruction pointer...................................... 24 instruction prefixes 128-bit media.......................................... 208 64-bit media............................................ 272 general-purpose ....................................... 85 x87 ........................................................... 335 instruction set ............................................... 4 instruction-relative address ....................... 19 instructions 128-bit media.................................. 127, 187 64-bit media.................................... 245, 265 floating-point.......... 130, 187, 236, 265, 285 general-purpose ....................................... 
48 I/O............................................................ 117 invalid in 64-bit mode.............................. 83 locked...................................................... 117 memory ordering ................................... 115 prefixes ............................. 85, 208, 272, 335 serializing ............................................... 116 x87 ........................................................... 315 INSW instruction ........................................ 77 INT instruction............................................ 74 integer bit.......................... 151, 153, 302, 306 integer data types 128-bit media ......................................... 148 64-bit media ........................................... 241 general-purpose....................................... 41 x87 .......................................................... 304 interleave instructions..................... 196, 252 interrupt vector ........................................ 105 interrupts and exceptions ................. 74, 104 INTO instruction .................................. 74, 83 invalid-operation exception (IE)..... 213, 339 IOPL .......................................................... 112 IP register ................................................... 25 IP-relative addressing .......................... 19, 22 IRET instruction ........................................ 74 IRETD instruction...................................... 74 IRETQ instruction...................................... 74 J J bit.................................................... 151, 302 Jcc instructions .................................... 70, 95 JMP instruction.................................... 69, 83 L LAHF instruction ....................................... 76 last data pointer ....................................... 298 last instruction pointer............................ 297 last opcode ................................................ 
298 LDMXCSR instruction............................. 187 LDS instruction .................................... 58, 83 LEA instruction.......................................... 58 LEAVE instruction..................................... 52 legacy mode ......................................... xxiv, 9 legacy x86 ................................................ xxiv LES instruction .................................... 58, 83 LFENCE instruction .................................. 79 LFS instruction........................................... 58 LGS instruction .......................................... 58 limiting...................................................... 149 linear address ....................................... 13, 15 LOCK prefix ............................................... 88 LODS instruction ....................................... 69 LODSB instruction ..................................... 69 LODSD instruction..................................... 69 LODSQ instruction..................................... 69 LODSW instruction.................................... 69 logarithmic functions............................... 326 logarithms ................................................. 320 logical instructions............. 67, 185, 206, 263 logical shift ................................................. 63 long mode............................................. xxiv, 7 LOOPcc instructions .................................. 72 LSB ............................................................ xxv lsb .............................................................. xxv Index 359 AMD 64-Bit Technology 24593--Rev. 3.09--September 2003 LSS instruction ........................................... 58 M mask ........................................... xxv, 143, 294 masked responses ............................. 218, 344 MASKMOVDQU instruction............ 163, 249 matrix operations ............................. 135, 233 MAXPD instruction .................................. 
204 MAXPS instruction................................... 204 MAXSD instruction .................................. 204 MAXSS instruction................................... 204 MBZ............................................................ xxv memory addressing ................................................ 16 hierarchy................................................. 117 management....................................... 14, 79 model ........................................................ 11 optimization ........................................... 113 ordering .................................................. 113 physical..................................................... 13 segmented ................................................ 12 virtual ....................................................... 11 weakly ordered....................................... 112 memory management instructions ........... 79 memory-mapped I/O ........................... 76, 111 MFENCE instruction.................................. 79 MINPD instruction ................................... 204 MINPS instruction .................................... 204 MINSD instruction.................................... 204 MINSS instruction .................................... 204 MMX registers .......................................... 237 MMXTM instructions ................................. 229 modes 16-bit ........................................................ xxi 32-bit ........................................................ xxi 64-bit .................................................... xxi, 8 compatibility ...................................... xxii, 9 legacy ................................................. xxiv, 9 long..................................................... xxiv, 7 mode switches .......................................... 34 operating ................................................ 3, 7 protected ............................... 
xxvi, 9, 16, 93 real ......................................................... xxvi real mode............................................ 10, 16 virtual-8086 .............................. xxviii, 9, 16 moffset ....................................................... xxv MOV instruction ......................................... 49 MOV segReg instruction ............................ 58 MOVAPD instruction................................ 188 MOVAPS instruction ................................ 188 MOVD instruction....................... 49, 162, 248 MOVDQ2Q instruction ..................... 162, 248 MOVDQA instruction .............................. 162 MOVDQU instruction .............................. 162 MOVHLPS instruction............................. 188 MOVHPD instruction............................... 188 MOVHPS instruction ............................... 188 MOVLHPS instruction............................. 188 MOVLPD instruction ............................... 188 MOVLPS instruction ................................ 188 MOVMSKPD instruction ................... 55, 191 MOVMSKPS instruction .................... 55, 191 MOVNTDQ instruction .................... 163, 249 MOVNTI instruction .................................. 49 MOVNTPD instruction ............................ 191 MOVNTPS instruction ............................. 191 MOVNTQ instruction ............................... 249 MOVQ instruction ............................ 162, 248 MOVQ2DQ instruction .................... 162, 248 MOVS instruction....................................... 68 MOVSB instruction .................................... 68 MOVSD instruction ............................ 68, 188 MOVSQ instruction .................................... 68 MOVSS instruction .................................. 188 MOVSW instruction ................................... 68 MOVSX instruction .................................... 49 MOVUPD instruction............................... 
188 MOVUPS instruction ............................... 188 MOVZX instruction.................................... 49 MSB ........................................................... xxv msb ............................................................ xxv MSR .......................................................... xxix MUL instruction......................................... 60 MULPD instruction.................................. 199 MULPS instruction .................................. 199 MULSD instruction.................................. 199 MULSS instruction .................................. 199 multiplication ............................................. 60 multiply-add...................................... 136, 233 N NaN.................................................... 155, 308 near branches ........................................... 103 near calls..................................................... 97 near jumps .................................................. 95 near returns .............................................. 100 NEG instruction ......................................... 59 NMI interrupt ........................................... 106 non-temporal data .................................... 121 non-temporal moves ................. 163, 191, 249 non-temporal stores ......................... 124, 225 NOP instruction.......................................... 80 normalized numbers ........ 153, 154, 305, 306 not a number (NaN) ......................... 155, 308 NOT instruction .......................................... 67 number encodings 128-bit media floating-point ................. 156 x87 ........................................................... 308 number representation 128-bit media floating-point ................. 153 64-bit media floating-point ................... 243 x87 floating-point................................... 
305 O octword ..................................................... xxvi OE bit................................. 143, 214, 291, 341 OF bit ........................................................... 40 offset ......................................................... xxvi OM bit ................................................ 143, 294 opcode........................................................ 298 operand size 33, 44, 82, 84, 87, 125, 224, 281 operands 128-bit media.......................................... 145 64-bit media............................................ 238 addressing ................................................ 46 general-purpose ....................................... 41 x87 ........................................................... 300 operating modes ....................................... 3, 7 OR instruction............................................. 67 ordered compare............................... 206, 328 ORPD instruction ..................................... 207 ORPS instruction ...................................... 207 OSXMMEXCPT bit................................... 212 OUT instruction .......................................... 77 OUTS instruction........................................ 77 OUTSB instruction ..................................... 77 OUTSD instruction ..................................... 77 OUTSW instruction .................................... 77 overflow .................................................... xxvi overflow exception (OE) .................. 214, 341 overflow flag................................................ 40 P pack instructions .............................. 168, 251 packed....................................... xxvi, 129, 231 packed BCD digits ...................................... 43 packed-decimal data type ........................ 304 PACKSSDW instruction ................... 168, 251 PACKSSWB instruction.................... 168, 251 PACKUSWB instruction ................... 
168, 251 PADDB instruction ........................... 174, 255 PADDD instruction ........................... 174, 255 PADDQ instruction ........................... 174, 255 PADDSB instruction ......................... 174, 256 PADDSW instruction........................ 174, 256 PADDUSB instruction .............................. 174 PADDUSW instruction ............................. 174 PADDW instruction.......................... 174, 255 PAND instruction ............................. 185, 263 PANDN instruction .......................... 185, 263 parallel operations ........................... 128, 231 parameter passing.................................... 223 parity flag ................................................... 39 partial remainder ..................................... 324 PAVGB instruction ........................... 179, 259 PAVGUSB instruction .............................. 260 PAVGW instruction .......................... 179, 259 PC field ............................................. 294, 313 PCMPEQB instruction ..................... 183, 262 PCMPEQD instruction..................... 183, 262 PCMPEQW instruction.................... 183, 262 PCMPGTB instruction ..................... 183, 262 PCMPGTD instruction..................... 183, 262 PCMPGTW instruction .................... 183, 262 PC-relative addressing......................... 19, 22 PE bit................................. 143, 215, 291, 341 performance considerations 128-bit media ......................................... 224 64-bit media ........................................... 281 general-purpose..................................... 124 x87 .......................................................... 352 PEXTRW instruction ....................... 171, 253 PF bit........................................................... 39 PF2ID instruction..................................... 266 PF2IW instruction.................................... 
266 PFACC instruction ................................... 269 PFADD instruction................................... 267 PFCMPEQ instruction ............................. 271 PFCMPGE instruction ............................. 271 PFCMPGT instruction ............................. 271 PFMAX instruction.................................. 271 PFMIN instruction ................................... 271 PFMUL instruction .................................. 268 PFNACC instruction ................................ 269 PFPNACC instruction .............................. 269 PFRCP instruction ................................... 270 PFRCPIT1 instruction ............................. 270 PFRCPIT2 instruction ............................. 270 PFRSQIT1 instruction ............................. 271 PFRSQRT instruction .............................. 271 PFSUB instruction ................................... 268 PFSUBR instruction ................................ 268 physical memory ........................................ 13 Pi........................................................ 320, 326 PI2FD instruction..................................... 250 PI2FW instruction.................................... 250 PINSRW instruction......................... 171, 253 PM bit................................................ 143, 294 PMADDWD instruction.................... 178, 258 PMAXSW instruction ....................... 185, 263 PMAXUB instruction ....................... 185, 263 PMINSW instruction ........................ 185, 263 PMINUB instruction......................... 185, 263 PMOVMSKB instruction .................. 165, 250 PMULHRW instruction............................ 257 PMULHUW instruction ................... 176, 257 PMULHW instruction ...................... 176, 257 PMULLW instruction ....................... 176, 257 PMULUDQ instruction .................... 
176, 258 pointers........................................................ 22 POP instruction..................................... 52, 83 POP segReg instruction ............................. 58 POPA instruction .................................. 52, 83 POPAD instruction ............................... 52, 83 POPF instruction ........................................ 75 POPFD instruction ..................................... 75 POPFQ instruction ..................................... 75 POR instruction ................................ 186, 264 post-computation exceptions........... 216, 342 precision control (PC) field ............. 294, 313 precision exception (PE).................. 215, 341 pre-computation exceptions ............ 216, 342 PREFETCH instruction ..................... 79, 123 prefetching ................................ 122, 125, 225 PREFETCHlevel instruction ............. 79, 122 PREFETCHNTA instruction.................... 123 PREFETCHT0 instruction ....................... 123 PREFETCHT1 instruction ....................... 123 PREFETCHT2 instruction ....................... 123 PREFETCHW instruction.................. 79, 123 prefixes 128-bit media.......................................... 208 64-bit media............................................ 272 general-purpose ....................................... 85 REX........................................................... 30 x87 ........................................................... 335 priority of exceptions ....................... 216, 342 privilege level ..................................... 93, 108 procedure calls............................................ 96 procedure stack........................................... 94 processor features ...................................... 90 processor identification ............................. 79 processor vendor......................................... 91 processor version ........................................ 
91 program order ........................................... 113 programming model 128-bit media.......................................... 127 64-bit media............................................ 229 general-purpose ....................................... 27 x87 .......................................................... 285 protected mode ............................. xxvi, 9, 16 PSADBW instruction ....................... 180, 260 pseudo-denormalized numbers ............... 307 pseudo-infinity ......................................... 305 pseudo-NaN............................................... 305 PSHUFD instruction................................ 172 PSHUFHW instruction ............................ 172 PSHUFLW instruction ............................. 172 PSHUFW instruction ............................... 254 PSLLD instruction ........................... 181, 260 PSLLDQ instruction................................. 181 PSLLQ instruction ........................... 181, 260 PSLLW instruction ........................... 181, 260 PSRAD instruction........................... 183, 261 PSRAW instruction .......................... 183, 261 PSRLD instruction ........................... 182, 261 PSRLDQ instruction ................................ 182 PSRLQ instruction ........................... 182, 261 PSRLW instruction........................... 182, 261 PSUBB instruction ........................... 175, 256 PSUBD instruction........................... 175, 256 PSUBQ instruction........................... 175, 256 PSUBSB instruction ......................... 175, 257 PSUBSW instruction........................ 175, 257 PSUBUSB instruction .............................. 175 PSUBUSW instruction............................. 175 PSUBW instruction .......................... 175, 256 PSWAPD instruction................................ 255 PUNPCKHBW instruction............... 169, 252 PUNPCKHDQ instruction ....................... 
169 PUNPCKHQDQ instruction .................... 169 PUNPCKHWD instruction ...................... 169 PUNPCKLBW instruction ............... 169, 252 PUNPCKLDQ instruction................ 169, 252 PUNPCKLQDQ instruction..................... 169 PUNPCKLWD instruction ............... 169, 252 PUSH instruction ................................. 52, 83 PUSHA instruction .............................. 52, 84 PUSHAD instruction ........................... 52, 84 PUSHF instruction..................................... 75 PUSHFD instruction.................................. 75 PUSHFQ instruction.................................. 75 PXOR instruction............................. 186, 264 Q QNaN ................................................. 155, 308 quadword ................................................. xxvi quiet NaN (QNaN)............................ 155, 308 R R8B-R15B registers ................................... 30 R8D-R15D registers................................... 30 r8-r15 ........................................................ xxix R8-R15 registers......................................... 30 R8W-R15W registers ................................. 30 range of values 128-bit media.......................................... 151 64-bit media.................................... 241, 244 x87 ........................................................... 303 RAX register ............................................... 30 rAX-rSP..................................................... xxx RAZ ........................................................... xxvi RBP register ................................................ 30 rBP register ................................................. 23 RBX register................................................ 30 RC field ..................................... 144, 294, 314 RCL instruction .......................................... 
62 RCPPS instruction.................................... 202 RCPSS instruction .................................... 202 RCR instruction .......................................... 62 RCX register ............................................... 30 RDI register................................................. 30 RDX register ............................................... 30 read order .................................................. 114 real address mode. See real mode real mode...................................... xxvi, 10, 16 real numbers ..................................... 153, 305 reciprocal estimation........................ 202, 270 reciprocal square root ...................... 202, 270 register extensions ............................... 1, 3, 8 registers ......................................................... 3 128-bit media.......................................... 139 64-bit media............................................ 237 eAX-eSP ............................................. xxviii eFLAGS.................................................. xxix EIP............................................................. 25 eIP .......................................................... xxix extensions............................................... 1, 3 IP ............................................................... 25 MMX ....................................................... 237 r8-r15 ..................................................... xxix rAX-rSP.................................................. xxx rFLAGS ................................................... xxx RIP ............................................................ 25 rIP...................................................... xxx, 25 segment..................................................... 20 x87 control word .................................... 293 x87 last data pointer.............................. 298 x87 last opcode....................................... 
298 x87 last-instruction pointer .................. 297 x87 physical............................................ 288 x87 stack ................................................. 288 x87 status word ...................................... 289 x87 tag word........................................... 295 XMM....................................................... 139 relative ..................................................... xxvi remainder ................................................. 324 REP prefix .................................................. 89 REPE prefix................................................ 89 repeat prefixes ................................... 89, 126 REPNE prefix ............................................. 89 REPNZ prefix ............................................. 89 REPZ prefix ................................................ 89 reset ........................................................... 141 restoring state .......................................... 351 RET instruction.................................... 74, 99 revision history ......................................... xvii REX prefixes .................................... 8, 30, 89 RFLAGS register .................................. 30, 37 rFLAGS register ....................................... xxx RIP register .......................................... 25, 30 rIP register.......................................... xxx, 25 RIP-relative addressing ................... xxvii, 22 ROL instruction.......................................... 62 ROR instruction ......................................... 62 rotate instructions...................................... 61 rounding 128-bit media ................................. 144, 158 64-bit media ................................... 243, 245 x87 .......................................... 294, 314, 324 rounding control (RC) field..... 144, 294, 314 RSI register................................................. 
30 RSP register.......................................... 30, 96 rSP register ................................................. 23 RSQRTPS instruction .............................. 202 RSQRTSS instruction .............................. 202 S SAHF instruction ....................................... 76 SAL instruction .......................................... 62 SAR instruction.......................................... 62 saturation 128-bit media ......................................... 149 64-bit media ........................................... 242 saving state ............... 186, 222, 264, 279, 351 SBB instruction........................................... 59 scalar product ................................... 136, 234 SCAS instruction ........................................ 68 SCASB instruction ..................................... 68 SCASD instruction ..................................... 68 SCASQ instruction ..................................... 68 SCASW instruction .................................... 68 scientific programming............................ 129 segment override........................................ 88 segment registers ....................................... 20 segmented memory .................................... 12 self-modifying code .................................. 121 semaphore instructions.............................. 78 set ............................................................. xxvii SETcc instructions...................................... 65 SF bit ........................................... 40, 291, 341 SFENCE instruction ................................... 79 shift instructions......................... 61, 181, 260 SHL instruction........................................... 62 SHLD instruction........................................ 63 SHR instruction .......................................... 63 SHRD instruction ....................................... 
63 shuffle instructions................... 172, 197, 254 SHUFPD instruction ................................ 197 SHUFPS instruction ................................. 197 SI register .............................................. 29, 30 sign ............................. 148, 157, 241, 310, 324 sign extension ............................................. 54 sign flag ....................................................... 40 sign masks ................................................... 55 signaling NaN (SNaN) ...................... 155, 308 significand ......................... 151, 157, 302, 310 SIL register.................................................. 30 SIMD floating-point exceptions .............. 211 SIMD operations ............................... 129, 231 single-instruction, multiple-data (SIMD) ... 5 single-precision format............. 151, 243, 302 SNaN .................................................. 155, 308 software interrupts ............................. 74, 105 SP register ............................................. 29, 30 spatial locality........................................... 121 speculative execution............................... 114 SPL register................................................. 30 SQRTPD instruction................................. 201 SQRTPS instruction ................................. 201 SQRTSD instruction ................................. 201 SQRTSS instruction.................................. 201 square root ................................ 201, 270, 325 SSE ........................................................... xxvii SSE instructions................................ 127, 229 SSE-2 ........................................................ xxvii SSE-2 instructions............................. 127, 229 ST(0)-ST(7) registers ............................... 288 stack ..................................................... 94, 222 address ...................................................... 
20 allocation ................................................ 126 frame................................................... 23, 52 operand size ............................................. 95 operations................................................. 52 pointer ................................................ 23, 94 x87 stack fault ........................................ 341 x87 stack management ......................... 330 x87 stack overflow................................. 341 x87 stack underflow .............................. 341 stack fault (SF) exceptions...................... 341 standard functions ..................................... 91 state saving ............... 186, 222, 264, 279, 351 status word................................................ 289 STC instruction .......................................... 75 STD instruction .......................................... 75 STI instruction............................................ 76 sticky bits ................................ xxvii, 142, 290 STMXCSR instruction ............................. 187 STOS instruction ........................................ 69 STOSB instruction...................................... 69 STOSD instruction ..................................... 69 STOSQ instruction ..................................... 69 STOSW instruction .................................... 69 streaming store......... 133, 163, 191, 225, 249 string address ............................................. 20 string instructions ................................ 68, 77 strings.......................................................... 44 SUB instruction .......................................... 59 SUBPD instruction................................... 199 SUBPS instruction ................................... 199 SUBSD instruction ................................... 199 SUBSS instruction.................................... 199 subtraction .................................................. 
59 sum of absolute differences .................... 260 swap instructions...................................... 254 SYSCALL instruction ........................ 81, 102 SYSENTER instruction ............... 80, 84, 102 SYSEXIT instruction ................... 80, 84, 102 SYSRET instruction........................... 81, 102 system call and return instructions.. 80, 102 T tag bits............................................... 276, 295 tag word..................................................... 295 task switch .................................................. 99 task-state segment (TSS)........................... 99 temporal locality ...................................... 121 TEST instruction ........................................ 64 test instructions.................................. 64, 327 tiny numbers............. 154, 214, 215, 306, 340 TOP field........................................... 288, 292 top-of-stack pointer (TOP)....... 276, 288, 292 transcendental instructions .................... 325 trap ............................................................ 106 trigonometric functions........................... 325 TSS........................................................... xxvii U UCOMISD instruction .............................. 205 UCOMISS instruction............................... 205 UE bit ................................ 143, 215, 291, 341 ulp ...................................................... 159, 315 UM bit................................................ 143, 294 underflow ........................................ xxvii, 341 underflow exception (UE) ............... 215, 341 unit in the last place (ulp) ............... 159, 315 unmask .............................................. 143, 294 unmasked responses......................... 221, 348 unnormal numbers ................................... 
305 unordered compare .......................... 206, 328 unpack instructions .................. 169, 196, 252 UNPCKHPD instruction .......................... 196 UNPCKHPS instruction ........................... 196 UNPCKLPD instruction ........................... 196 UNPCKLPS instruction............................ 196 unsupported number types...................... 305 V vector ............................................. xxviii, 105 vector operations .............................. 129, 231 virtual address ...................................... 13, 15 virtual memory ........................................... 11 virtual-8086 mode ....................... xxviii, 9, 16 W weakly ordered memory........................... 112 write buffers.............................................. 119 write combining ........................................ 115 write order................................................. 115 X x87 control word register ......................... 293 x87 environment ............................... 298, 333 x87 floating-point programming.............. 285 x87 status word register ........................... 289 x87 tag word register................................ 295 XADD instruction ....................................... 78 XCHG instruction ....................................... 78 XLAT instruction ........................................ 55 XMM registers .......................................... 139 XOR instruction.......................................... 67 XORPD instruction................................... 207 XORPS instruction ................................... 207 Y Y bit ........................................................... 295 Z ZE bit ................................. 143, 214, 291, 341 zero..................................................... 155, 307 zero flag ....................................................... 40 zero-divide exception (ZE) .............. 214, 341 zero-extension................................. 
20, 33, 83 ZF bit........................................................... 40 ZM bit................................................ 143, 294 |