#LyX 1.3 created this file. For more info see http://www.lyx.org/
\lyxformat 221
\textclass article
\language american
\inputencoding auto
\fontscheme default
\graphics default
\paperfontsize default
\spacing single 
\papersize a4paper
\paperpackage a4wide
\use_geometry 0
\use_amsmath 0
\use_natbib 0
\use_numerical_citations 0
\paperorientation portrait
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\quotes_times 2
\papercolumns 1
\papersides 1
\paperpagestyle default

\layout Title
\added_space_bottom 10cm 

HOWTO MMX/SSE/SSE2
\layout Standard
\align center 

Y. Srinivasulu IIT Delhi
\layout Standard
\pagebreak_bottom \align center 

Gaurav Gupta IIT Guwahati
\layout Standard
\pagebreak_bottom 

\begin_inset LatexCommand \tableofcontents{}

\end_inset 


\layout Section

Introduction
\layout Standard

MMX, SSE, and SSE2 are all Single Instruction, Multiple Data (SIMD) instruction set extensions.
\layout Subsection

MMX
\layout Standard

MMX is commonly read as Multi-Media eXtensions. The MMX technology is designed to accelerate multimedia and communications applications by adding new instructions and data types. It exploits the parallelism inherent in many multimedia and communications algorithms, yet maintains full compatibility with existing operating systems and applications. 
A wide range of software applications, including graphics, MPEG video, music synthesis, speech compression and recognition, image processing, games, video conferencing and more, share several fundamental characteristics: 
\layout Standard

* small integer data types (for example: 8-bit pixels, 16-bit audio samples) 
\layout Standard

* small, highly repetitive loops 
\layout Standard

* frequent multiplies and accumulates 
\layout Standard

* compute-intensive algorithms 
\layout Standard

* highly parallel operations 
\layout Standard

The MMX technology is designed as a set of general-purpose integer instructions that can be applied to the needs of a wide range of multimedia and communications applications. 
\layout Standard

The highlights of the technology are:
\layout Standard

* Single Instruction, Multiple Data (SIMD) technique 
\layout Standard

* 57 new instructions 
\layout Standard

* eight 64-bit MMX registers 
\layout Standard

* 4 new data types 
\layout Standard

MMX technology introduces four new data types: three packed data types and a new 64-bit entity. Each element within the packed data types is an independent fixed-point integer. The architecture does not specify the position of the fixed point within the elements; it is up to the developer to control its position within each element throughout the calculation. This places a burden on the developer, but it also leaves a large amount of flexibility to choose and change the precision of fixed-point numbers during the course of the application in order to fully control the dynamic range of values. 
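\layout Standard

The point about developer-controlled fixed-point placement can be illustrated with a minimal sketch in plain C. It assumes a Q8.8 layout (8 integer bits, 8 fraction bits) for each 16-bit word; that format, and all helper names below, are chosen for illustration and are not part of MMX. The packed function is a scalar, lane-by-lane model of an operation that a SIMD instruction would apply to all four words at once.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed format for this sketch: Q8.8, i.e. 8 fraction bits per 16-bit word.
   The hardware knows nothing about this choice; only the code below does. */
enum { FRAC_BITS = 8 };

/* Convert a small integer to Q8.8 (hypothetical helper). */
static int16_t q8_8_from_int(int v)
{
    assert(v >= -128 && v <= 127);            /* must fit in 8 integer bits */
    return (int16_t)(v * (1 << FRAC_BITS));
}

/* Multiply two Q8.8 values. The 32-bit product carries 16 fraction bits,
   so it is shifted right by FRAC_BITS to return to Q8.8. The shift of a
   negative value is assumed to be arithmetic, as it is with GCC. */
static int16_t q8_8_mul(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;
    return (int16_t)(wide >> FRAC_BITS);
}

/* Scalar model of a packed-word operation: the same multiply is applied
   independently to each of the four 16-bit lanes. */
static void q8_8_mul_packed(const int16_t a[4], const int16_t b[4],
                            int16_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = q8_8_mul(a[i], b[i]);
}
```

For example, 1.5 is stored as 384 and 2.0 as 512, and q8_8_mul(384, 512) yields 768, which is 3.0 in Q8.8. Choosing more fraction bits would trade integer range for precision without any change to the hardware data type.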
\layout Standard

The four MMX technology data types are: 
\layout Standard

* Packed byte -- 8 bytes packed into one 64-bit quantity 
\layout Standard

* Packed word -- 4 16-bit words packed into one 64-bit quantity 
\layout Standard

* Packed doubleword -- 2 32-bit doublewords packed into one 64-bit quantity 
\layout Standard

* Quadword -- one 64-bit quantity 
\layout Standard

The MMX technology is integrated into the Intel x86 architecture in a way that maintains full compatibility with existing operating systems. This is achieved by aliasing the MMX registers and state onto the x87 floating-point registers and state. Therefore, no new registers or state are added to support MMX technology, and the operating system uses its standard mechanisms for saving and restoring the floating-point state to save and restore the MMX state. Aliasing the MMX state onto the floating-point state does not prevent an application from using both MMX routines and floating-point routines, but the developer cannot freely interleave MMX and floating-point instructions: the EMMS instruction must be executed after MMX code before floating-point code runs.
\layout Subsection

Streaming SIMD Extensions (SSE)
\layout Standard

The Intel Streaming SIMD Extensions (SSE) comprise a set of extensions to the Intel x86 architecture that enhance the performance of advanced media and communication applications in the following four ways:
\layout Standard

1. Eight new 128-bit SIMD floating-point registers that can be directly addressed.
\layout Standard

2. 50 new instructions that work on packed floating-point data.
\layout Standard

3. 8 new instructions designed to control cacheability of all MMX and 32-bit x86 data types, including the ability to stream data to memory without polluting the caches, and to prefetch data before it is actually used.
\layout Standard

4. 12 new instructions that extend the MMX instruction set.
\layout Standard

This set enables the programmer to develop algorithms that mix packed single-precision floating-point data and packed integer data, using SSE and MMX instructions respectively. 
\layout Standard

Intel SSE provides eight 128-bit registers, each of which can be directly addressed using the register names XMM0 to XMM7. Each register holds four 32-bit single-precision floating-point numbers, numbered 0 through 3. MMX registers are mapped onto the floating-point registers, so the EMMS instruction is required to pass from MMX code to x87 floating-point code. The SIMD floating-point registers, by contrast, form a separate register file, so MMX or floating-point instructions can be mixed with SSE instructions without executing a special instruction such as EMMS. On the downside, the new registers require support from the operating system, since they must be saved and restored when switching tasks. There is a new control/status register, MXCSR, that is used to mask and unmask numerical exception handling, to set rounding modes, to set flush-to-zero mode, and to view status flags. SSE instructions operate in parallel on either all pairs or only the least significant pair of elements in the packed data operands. Packed instructions (with the PS suffix) operate on all four pairs of elements of the two operands, while scalar instructions (with the SS suffix) operate only on the least significant pair. For scalar operations, the three upper components of the first operand are passed through to the destination.
\layout Subsection

Streaming SIMD Extensions 2 (SSE2)
\layout Standard
\pagebreak_bottom 

This is a new set of SIMD instructions that improves the capabilities of both the MMX and SSE instruction sets. The key benefits of SSE2 are that MMX instructions can work on 128-bit data blocks and that SSE instructions now support 64-bit floating-point values. 
It also supports accesses to 16-byte-aligned memory, which can be faster than accesses to unaligned memory.
\layout Section

Writing Assembly Inline and Compiling
\layout Subsection

Rules
\layout Standard

We can write assembly instructions, such as those belonging to MMX, directly in a C program.
\layout Standard

The syntax is 
\layout Standard


\emph on 
asm("<asm instr1> ; <asm instr2> ; ... <asm instrn>":
\layout Standard


\emph on 
"=<output_var1 format>" (<output_var1 name>), ... "=<output_vark format>" (<output_vark name>):
\layout Standard


\emph on 
"<input_var1 format>" (<input_var1 name>), ... "<input_vark format>" (<input_vark name>));
\layout Standard

eg: 
\emph on 
asm ("movdqu %%xmm0, %0" :"=m"(output));
\layout Standard

An asm instruction looks like this:
\layout Standard


\emph on 
instruction_name operand1, operand2, operand3
\layout Subsubsection

Register naming
\layout Standard

Register names are prefixed by %. That is, if eax has to be used, it should be written as %eax. Inside an asm statement that has operands, the % must be doubled (%%eax), because a single % introduces an operand reference such as %0.
\layout Subsubsection

Immediate operand 
\layout Standard

An immediate operand is specified by using $.
\layout Standard

movl $0xffff, %eax -- moves the value 0xffff into the eax register. 
\layout Subsubsection

Source and destination ordering 
\layout Standard


\emph on 
In any instruction, the source comes first and the destination follows. This differs from Intel syntax, where the source comes after the destination. The instruction documentation given later in this report follows Intel syntax, so to use those instructions in gcc, reverse the order of the operands.
\layout Standard

mov %eax, %ebx -- transfers the contents of eax to ebx. 
\layout Subsubsection

Size of operand 
\layout Standard

The instructions are suffixed by b, w, or l, depending on whether the operand is a byte, word, or long. This is not mandatory; the assembler tries to infer the appropriate suffix from the operands. 
But specifying the suffixes explicitly improves readability and eliminates the possibility of an incorrect guess.
\layout Standard

movb %al, %bl -- Byte move 
\layout Standard

movw %ax, %bx -- Word move 
\layout Standard

movl %eax, %ebx -- Longword move 
\layout Subsubsection

A Sample program
\layout Standard

int main(void)
\newline 
{ int x = 10, y; 
\newline 
asm ("movl %1, %%eax;" 
\newline 
"movl %%eax, %0;" 
\newline 
:"=r"(y) /* y is output operand */ 
\newline 
:"r"(x) /* x is input operand */ 
\newline 
:"%eax"); /* %eax is clobbered register */
\newline 
}
\layout Subsubsection

Memory operand constraint "m"
\layout Standard

When an operand is in memory, any operation performed on it acts directly on the memory location. Memory constraints are most useful when a C variable needs to be updated inside "asm" and you do not want to use a register to hold its value. For example, the value of idtr is stored in the memory location loc. Because the instruction writes to loc, it is declared as an output operand:
\layout Standard

eg: 
\emph on 
asm("sidt %0
\backslash 
n" : "=m"(loc)); 
\layout Subsubsection

Register constraint "r"
\layout Standard

With a register constraint, the value is loaded from memory into a register, the operations are performed on the register, and the result is stored back to the memory location. Register constraints are usually used only when an instruction requires a register operand or when they significantly speed up the code. 
GCC allocates the registers and updates the values of the variables.
\layout Standard


\emph on 
asm ("movl %1, %%eax;" "movl %%eax, %0;"
\layout Standard


\emph on 
:"=r"(y) /* y is output operand */ 
\layout Standard


\emph on 
:"r"(x) /* x is input operand */
\layout Standard


\emph on 
:"%eax"); /* %eax is clobbered register */ 
\layout Subsubsection

Compiling
\layout Standard

Files containing inline assembly code should be compiled with the following options:
\layout Standard

1. For MMX: 
\emph on 
gcc test.c -mmmx -o test
\layout Standard

2. For SSE: 
\emph on 
gcc test.c -msse -o test
\layout Standard
\pagebreak_bottom 

3. For SSE2: 
\emph on 
gcc test.c -msse2 -o test
\layout Section

Instruction Set 
\layout Standard

Definitions:
\layout Standard

Latency: The number of clock cycles required for the execution core to complete the execution of all of the µops that form an IA-32 instruction. 
\layout Standard

Throughput: The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many IA-32 instructions, the throughput of an instruction can be significantly less than its latency.
\layout Standard

The latency and throughput data given below are specific to Intel Pentium 4 and Intel Xeon processors. 
\layout Standard

The usage is given in Intel syntax, which means that to use the instructions with gcc, the order of the operands has to be reversed.
\layout Standard

The instructions can be broadly classified as:
\layout Standard

* Arithmetic instructions
\layout Standard

* Comparison instructions 
\layout Standard

* Conversion instructions 
\layout Standard

* Logical instructions
\layout Standard

* Shift instructions
\layout Standard

* Data transfer instructions
\layout Subsection

Arithmetic instructions
\layout Subsubsection

Addition
\layout Standard

MMX
\layout Standard

Usage: instruction destination, source 
\layout Standard

Destination: MMX register. 
\layout Standard

Source: MMX register or 64-bit memory operand.
\layout Itemize

PADDB: Packed Add with Wrap-around on Byte
\layout Standard

Latency : 2 Throughput : 1
\layout Standard

Purpose:
\layout Standard

PADDB adds the bytes of the source operand to the bytes of the destination operand and writes the results to the destination. When a result is too large to be represented in a packed byte (overflow), it wraps around and the lower 8 bits are written to the destination register.
\layout Itemize

PADDW: Packed Add with Wrap-around on Word
\layout Standard

Latency : 2 Throughput : 1
\layout Standard

Purpose:
\layout Standard

PADDW adds the words of the source operand to the words of the destination operand and writes the results to the destination. When a result is too large to be represented in a packed word (overflow), it wraps around and the lower 16 bits are written to the destination register.
\layout Itemize

PADDD: Packed Add with Wrap-around on Doubleword
\layout Standard

Latency : 2 Throughput : 1
\layout Standard

Purpose:
\layout Standard

PADDD adds the doublewords of the source operand to the doublewords of the destination operand and writes the results to the destination.
\layout Standard

When a result is too large to be represented in a packed doubleword (overflow), it wraps around and the lower 32 bits are written to the destination register. 
\layout Standard


\begin_inset Graphics
	filename images_mmx/NewPADD.gif
	scale 55
	keepAspectRatio

\end_inset 


\layout Itemize

PADDSB: Packed Add Signed with Saturation on Byte
\layout Standard

Latency : 2 Throughput : 1
\layout Standard

Purpose:
\layout Standard

PADDSB adds the signed bytes of the source operand to the signed bytes of the destination operand and writes the results to the destination MMX register. 
If a result falls outside the range of a signed byte, it is saturated: to 0x7F on overflow and to 0x80 on underflow.
\layout Itemize

PADDSW: Packed Add Signed with Saturation on Word 
\layout Standard

Latency : 2 Throughput : 1
\layout Standard

Purpose:
\layout Standard

PADDSW adds the signed words of the source operand to the signed words of the destination operand and writes the results to the destination MMX register.
\layout Standard

If a result falls outside the range of a signed word, it is saturated: to 0x7FFF on overflow and to 0x8000 on underflow.
\layout Itemize

PADDUSB: Packed Add Unsigned with Saturation on Byte
\layout Standard

Latency : 2 Throughput : 1
\layout Standard

Purpose:
\layout Standard

PADDUSB adds the unsigned bytes of the source operand to the unsigned bytes of the destination operand and stores the results in the destination. If a result is larger than the range of an unsigned byte (overflow), it is saturated to 0xFF. Because the sum of two unsigned bytes is never negative, no underflow case arises for this instruction.
\layout Itemize
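\layout Standard

The difference between wrap-around and saturating addition described above can be sketched as a scalar model in plain C. This is not the hardware instruction and does not use inline assembly; the function names are hypothetical, and each function simply applies the documented per-byte rule to the 8 lanes of a packed-byte value, with dst updated in place as the destination operand would be.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp v into [lo, hi] (hypothetical helper for the saturating models). */
static int clamp_int(int v, int lo, int hi)
{
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Model of PADDB: wrap-around, only the low 8 bits of each sum are kept. */
static void paddb_model(uint8_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Model of PADDSB: signed saturation, each sum is clamped to
   [-128, 127], i.e. 0x80 on underflow and 0x7F on overflow. */
static void paddsb_model(int8_t dst[8], const int8_t src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (int8_t)clamp_int(dst[i] + src[i], INT8_MIN, INT8_MAX);
}

/* Model of PADDUSB: unsigned saturation, each sum is clamped to 0xFF. */
static void paddusb_model(uint8_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (uint8_t)clamp_int(dst[i] + src[i], 0, UINT8_MAX);
}
```

With a lane holding 250 and a source lane holding 10, paddb_model leaves 4 in that lane (260 modulo 256), while paddusb_model leaves 255. With signed lanes 120 and 10, paddsb_model leaves 127.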