SSE5
SSE5 (short for Streaming SIMD Extensions version 5) was an instruction set extension proposed by AMD on 30 August 2007 as a supplement to the 128-bit SSE core instructions in the AMD64 architecture.
AMD chose not to implement SSE5 as originally proposed. In May 2009, AMD replaced SSE5 with three smaller instruction set extensions named XOP, FMA4, and F16C, which retain the proposed functionality of SSE5 but encode the instructions differently for better compatibility with Intel's proposed AVX instruction set.
The three SSE5-derived instruction sets were introduced in the Bulldozer processor core, released in October 2011 on a 32 nm process.[1]
Compatibility
AMD's SSE5 extension bundle does not include the full set of Intel's SSE4 instructions, making it a competitor to SSE4 rather than a successor.
This complicates software development. It is recommended practice for a program to test for the presence of instruction set extensions by means of the CPUID instruction before entering a code path which depends upon those instructions to function correctly. For maximum portability, an optimized application will require three code paths: a base code path for compatibility with older processors (from either vendor), a separately optimized Intel code path exploiting SSE4 or AVX, and a separately optimized AMD code path exploiting SSE5.
Due to this proliferation, benchmarks between Intel and AMD processors increasingly reflect the cleverness or implementation quality of the divergent code paths rather than the strength of the underlying platform.
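The three-path dispatch described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `code_path_t`, `select_code_path`, and `detect_code_path` are hypothetical, the "AMD path" is keyed on the XOP and FMA4 feature bits (the shipped descendants of SSE5, reported in ECX of CPUID leaf 0x80000001), and the "Intel path" is keyed on the SSE4.1 and AVX bits in ECX of CPUID leaf 1. The hardware query uses `__get_cpuid` from the GCC/Clang `<cpuid.h>` header and is compiled only on x86 targets.

```c
#include <stdint.h>

/* Hypothetical labels for the three code paths discussed in the text. */
typedef enum { PATH_BASE, PATH_INTEL, PATH_AMD } code_path_t;

/* Feature bits in ECX of CPUID leaf 1. */
#define CPUID1_ECX_SSE41 (1u << 19)
#define CPUID1_ECX_AVX   (1u << 28)

/* Feature bits in ECX of CPUID leaf 0x80000001 (AMD extended features). */
#define CPUIDX_ECX_XOP   (1u << 11)
#define CPUIDX_ECX_FMA4  (1u << 16)

/* Pure decision function: takes raw ECX values so it can be tested
   without executing CPUID. The AMD path is checked first because a
   processor such as Bulldozer reports AVX as well as XOP and FMA4. */
code_path_t select_code_path(uint32_t leaf1_ecx, uint32_t ext_leaf1_ecx)
{
    const uint32_t amd_bits = CPUIDX_ECX_XOP | CPUIDX_ECX_FMA4;

    if ((ext_leaf1_ecx & amd_bits) == amd_bits)
        return PATH_AMD;
    if (leaf1_ecx & (CPUID1_ECX_SSE41 | CPUID1_ECX_AVX))
        return PATH_INTEL;
    return PATH_BASE;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Queries the running processor. Simplification: real code that uses
   AVX, XOP, or FMA4 must also confirm that the operating system saves
   the YMM state (OSXSAVE bit plus XGETBV), which is omitted here. */
code_path_t detect_code_path(void)
{
    unsigned int eax, ebx, ecx, edx;
    uint32_t leaf1_ecx = 0, ext_leaf1_ecx = 0;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        leaf1_ecx = ecx;
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        ext_leaf1_ecx = ecx;
    return select_code_path(leaf1_ecx, ext_leaf1_ecx);
}
#endif
```

A program would typically call `detect_code_path()` once at start-up and store the result, for example as a function pointer to the chosen kernel, so that the CPUID query is not repeated inside hot loops.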
SSE5 enhancements
The proposed SSE5 instruction set consisted of 170 instructions (including 46 base instructions), many of which were designed to improve single-threaded performance. Some SSE5 instructions are 3-operand instructions, the use of which was expected to increase the average number of instructions per cycle achievable by x86 code.[2] Selected new instructions include:[3]
- Fused multiply–accumulate (FMACxx) instructions
- Integer multiply–accumulate (IMAC, IMADC) instructions
- Permutation (PPERM, PERMPx) and conditional move (PCMOV) instructions
- Precision control, rounding, and conversion instructions
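To make two of the listed instruction classes concrete, the following scalar C sketch models a bitwise conditional move in the style of PCMOV and a lane-wise integer multiply–accumulate. It is an illustration only: the function names `pcmov_model` and `imac_model` are hypothetical, and the multiply–accumulate uses plain wrapping 32-bit arithmetic rather than the exact lane widths or saturation behavior of any particular SSE5 opcode. In both functions the destination is separate from the sources, which is the point of the multi-operand forms: no input has to be overwritten or copied beforehand.

```c
#include <stdint.h>

/* Bitwise conditional move: each result bit is taken from src1 where
   the selector bit is 1 and from src2 where the selector bit is 0. */
uint64_t pcmov_model(uint64_t src1, uint64_t src2, uint64_t selector)
{
    return (src1 & selector) | (src2 & ~selector);
}

/* Lane-wise multiply-accumulate over four 32-bit lanes:
   dst[i] = a[i] * b[i] + acc[i], with unsigned wrap-around.
   dst is a separate operand, so a, b, and acc are left unchanged. */
void imac_model(uint32_t dst[4], const uint32_t a[4],
                const uint32_t b[4], const uint32_t acc[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] * b[i] + acc[i];
}
```

With a two-operand instruction set, the same multiply–accumulate would need an extra register copy whenever the accumulator must be preserved, because the destination doubles as one of the sources.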
AMD claimed that SSE5 would provide dramatic performance improvements, particularly in high-performance computing (HPC), multimedia, and computer security applications, including a 5x performance gain for Advanced Encryption Standard (AES) encryption and a 30% performance gain for the discrete cosine transform (DCT) used to process video streams.[2]
For more detailed information, see the instruction sets into which SSE5 was subsequently divided:
- XOP: A revision of most of the SSE5 instruction set.
- FMA4: Floating-point vector multiply–accumulate.
- F16C: Half-precision floating-point conversion.
2009 revision
The SSE5 specification included a proposed extension to the general coding scheme of x86 instructions in order to allow instructions to have more than two operands. In 2008, Intel announced its planned AVX instruction set, which proposed a different way of coding instructions with more than two operands. The two proposed coding schemes, SSE5 and AVX, are mutually incompatible, and the AVX scheme has certain advantages over the SSE5 scheme: most importantly, AVX has plenty of space for future extensions, including larger vector sizes.
In May 2009, AMD published a revised specification for the planned future instructions. This revision changes the coding scheme to make it compatible with the AVX scheme, but with a differing prefix byte in order to avoid overlap between instructions introduced by AMD and instructions introduced by Intel.
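The role of the differing prefix byte can be sketched with a small classifier. The sketch relies on encoding details that the text above does not state: VEX-encoded (AVX) instructions begin with byte C5h (two-byte form) or C4h (three-byte form), while XOP-encoded instructions begin with 8Fh. Because 8Fh is also the opcode of a legacy POP instruction, an XOP prefix is recognized only when the 5-bit map-select field in the following byte is 8 or greater. The names `encoding_t` and `classify_encoding` are hypothetical, and the function assumes 64-bit mode with `code` pointing at the first byte after any legacy prefixes.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { ENC_LEGACY, ENC_VEX2, ENC_VEX3, ENC_XOP } encoding_t;

/* Classifies an instruction by its leading byte. Assumes 64-bit mode,
   where C4h and C5h are not valid legacy opcodes. */
encoding_t classify_encoding(const uint8_t *code, size_t len)
{
    if (len == 0)
        return ENC_LEGACY;

    switch (code[0]) {
    case 0xC5:
        return ENC_VEX2;            /* two-byte VEX prefix (Intel AVX) */
    case 0xC4:
        return ENC_VEX3;            /* three-byte VEX prefix (Intel AVX) */
    case 0x8F:
        /* XOP only if map_select (low 5 bits of the next byte) >= 8;
           smaller values decode as the legacy POP r/m instruction. */
        if (len >= 2 && (code[1] & 0x1F) >= 8)
            return ENC_XOP;
        return ENC_LEGACY;
    default:
        return ENC_LEGACY;
    }
}
```

Since the two vendors' prefixes use distinct leading bytes, an opcode added by one vendor in its own prefix space cannot collide with an opcode added by the other.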
The revised instruction set no longer carries the name SSE5, which has been criticized for being misleading, but most of the instructions in the new revision are functionally identical to those in the original SSE5 specification; only the way the instructions are coded differs. The planned additions to the AMD instruction set consist of three subsets:
- XOP: Integer vector multiply–accumulate instructions, integer vector horizontal addition, integer vector compare, shift and rotate instructions, byte permutation and conditional move instructions, floating point fraction extraction.
- FMA4: Floating-point vector multiply–accumulate.
- F16C: Half-precision floating-point conversion.
These new instruction sets include support for a future extension of the vector size from 128 bits to 256 bits. It is unclear from these preliminary specifications whether the Bulldozer processor will support 256-bit vector registers (YMM registers).[4]