Simultaneous multithreading

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

Details

The name multithreading is obfuscating, because not only multiple threads can be executed simultaneously on one CPU core, but also multiple tasks (with different Page tables, different Task state segments, different Protection rings, different IO permissions, ...). Although running on the same core, they are completely separated from each other. Multithreading is similar in concept to preemptive multitasking but is implemented at the thread level of execution in modern superscalar processors.

Simultaneous multithreading (SMT) is one of the two main implementations of multithreading, the other form being temporal multithreading. In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executed in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads can be decided by the chip designers. Two concurrent threads per CPU core are common, but some processors support eight concurrent threads per core.

Because the technique is really an efficiency solution and there is inevitable increased conflict on shared resources, measuring or agreeing on the effectiveness of the solution can be difficult. However, measured energy efficiency of SMT with parallel native and managed workloads on historical 130 nm to 32 nm Intel SMT (Hyper-Threading) implementations found that in 45 nm and 32 nm implementations, SMT is extremely energy efficient, even with inorder Atom processors [ASPLOS'11]. In modern systems, SMT effectively exploits concurrency with very little additional dynamic power. That is, even when performance gains are minimal the power consumption savings can be considerable.^{[citation needed]}

Some researchers have shown that the extra threads can be used to proactively seed a shared resource like a cache, to improve the performance of another single thread, and claim this shows that SMT is not just an efficiency solution. Others use SMT to provide redundant computation, for some level of error detection and recovery.

However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing throughput of computations per amount of hardware used.

Taxonomy

In processor design, there are two ways to increase on-chip parallelism with less resource requirements: one is superscalar technique which tries to exploit instruction level parallelism (ILP); the other is multithreading approach exploiting thread level parallelism (TLP).

Superscalar means executing multiple instructions at the same time while thread-level parallelism (TLP) executes instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:

Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grain multithreading or coarse-grain multithreading depending on the frequency of interleaved issues. Fine-grain multithreading—such as in a barrel processor—issues instructions for different threads after every cycle, while coarse-grain multithreading only switches to issue instructions from another thread when the current executing thread causes some long latency events (like page fault etc.). Coarse-grain multithreading is more common for less context switch between threads. For example, Intel's Montecito processor uses coarse-grain multithreading, while Sun's UltraSPARC T1 uses fine-grain multithreading. For those processors that have only one pipeline per core, interleaved multithreading is the only possible way, because it can issue at most one instruction per cycle.
Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so.
Chip-level multiprocessing (CMP or multicore): integrates two or more processors into one chip, each executing threads independently.
Any combination of multithreaded/SMT/CMP.

The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with fine-grain multithreading technique instead of simultaneous multithreading because each core can only issue one instruction at a time.

Historical implementations

While multithreading CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968 as part of the ACS-360 project.^[1] The first major commercial microprocessor developed with SMT was the Alpha 21464 (EV8). This microprocessor was developed by DEC in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Henry Levy of the University of Washington. The microprocessor was never released, since the Alpha line of microprocessors was discontinued shortly before HP acquired Compaq which had in turn acquired DEC. Dean Tullsen's work was also used to develop the Hyper-threading (Hyper-threading technology or HTT) versions of the Intel Pentium 4 microprocessors, such as the "Northwood" and "Prescott".

Modern commercial implementations

The Intel Pentium 4 was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06 GHz model released in 2002, and since introduced into a number of their processors. Intel calls the functionality Hyper-threading, and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement ^[2] compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent; however, when running two programs that require full attention of the processor it can actually seem like one or both of the programs slows down slightly when Hyper-threading is turned on.^[3] This is due to the replay system of the Pentium 4 tying up valuable execution resources, increasing contention for resources such as bandwidth, caches, TLBs, re-order buffer entries, equalizing the processor resources between the two programs which adds a varying amount of execution time. The Pentium 4 Prescott core gained a replay queue, which reduces execution time needed for the replay system. This is enough to completely overcome that performance hit.^[4]

The latest^[when?] MIPS architecture designs include an SMT system known as "MIPS MT".^[5] MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI, a Cupertino-based startup, is the first MIPS vendor to provide a processor SOC based on eight cores, each of which runs four threads. The threads can be run in fine-grain mode where a different thread can be executed each cycle. The threads can also be assigned priorities.

IBM The Blue Gene/Q has 4-way SMT.

The IBM POWER5, announced in May 2004, comes as either a dual core dual-chip module (DCM), or quad-core or oct-core multi-chip module (MCM), with each core including a two-thread SMT engine. IBM's implementation is more sophisticated than the previous ones, because it can assign a different priority to the various threads, is more fine-grained, and the SMT engine can be turned on and off dynamically, to better execute those workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading. In 2010, IBM released systems based on the POWER7 processor with eight cores with each having four Simultaneous Intelligent Threads. This switches the threading mode between one thread, two threads or four threads depending on the number of process threads being scheduled at the time. This optimizes the use of the core for minimum response time or maximum throughput. IBM POWER8 has 8 intelligent simultaneous threads per core (SMT8).

IBM z13 has two threads per core (SMT-2).

Although many people reported that Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the now defunct processor codenamed "Rock" (originally announced in 2005, but after many delays cancelled in 2009) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara is not actually using SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara has eight cores, but each core has only one pipeline, so actually it uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor. Sun Microsystems' Rock processor is different, it has more complex cores that have more than one pipeline.

The Oracle Corporation Sparc T3 has eight fine-grained threads per core, Sparc T4, Sparc T5, Sparc M5 and M6 have eight fine-grained threads per core of which two can be executed simultaneously.

Fujitsu Sparc64 VI has coarse-grained Vertical Multithreading(VMT)Sparc VII and newer have 2-way SMT.

Intel Itanium Montecito used coarse-grained multithreading and Tukwila and newer use 2-way SMT (with Dual-domain multithreading).

Intel Xeon Phi has 4-way SMT (with Time-multiplexed multithreading) with hardware based threads which can't be disabled unlike regular Hyperthreading.

The Intel Atom, released in 2008, is the first Intel product to feature 2-way SMT (marketed as Hyper-Threading) without supporting instruction reordering, speculative execution, or register renaming. Intel reintroduced Hyper-Threading with the Nehalem microarchitecture, after its absence on the Core microarchitecture.

AMD Bulldozer microarchitecture FlexFPU and Shared L2 cache are multithreaded but integer cores in module are single threaded, so it is only a partial SMT implementation.^[6]

Disadvantages

Depending on the design and architecture of the processor, simultaneous multithreading can decrease performance if any of the shared resources are bottlenecks for performance.^[7] Critics argue that it is a considerable burden to put on software developers that they have to test whether simultaneous multithreading is good or bad for their application in various situations and insert extra logic to turn it off if it decreases performance. Current operating systems lack convenient API calls for this purpose and for preventing processes with different priority from taking resources from each other.^[8]

There is also a security concern with certain simultaneous multithreading implementations. Intel's hyperthreading implementation has a vulnerability through which it is possible for one application to steal a cryptographic key from another application running in the same processor by monitoring its cache use.^[9]

References

↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ CPU performance evaluation Pentium 4 2.8 and 3.0
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ http://cdn3.wccftech.com/wp-content/uploads/2013/07/AMD-Steamroller-vs-Bulldozer.jpg
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ How good is hyperthreading?
↑ Hyper-Threading Considered Harmful

General

LE Shar and ES Davidson, "A Multiminiprocessor System Implemented through Pipelining", Computer Feb 1974
D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," In 22nd Annual International Symposium on Computer Architecture, June, 1995
D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and R.L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," In 23rd Annual International Symposium on Computer Architecture, May, 1996
H. Esmaeilzadeh, T. Cao, X. Yang, S.M. Blackburn, and K.S. McKinley, "Looking Back on the Language and Hardware Revolutions: Measured Power, Performance, and Scaling," In the ACM International Conference on Architectural Support for Programming Languages and Operating Systems, March 2011.

External links

Lua error in package.lua at line 80: module 'strict' not found.

[1] Lua error in package.lua at line 80: module 'strict' not found.

[2] Lua error in package.lua at line 80: module 'strict' not found.

[3] CPU performance evaluation Pentium 4 2.8 and 3.0

[4] Lua error in package.lua at line 80: module 'strict' not found.

[5] Lua error in package.lua at line 80: module 'strict' not found.

[6] ttp://cdn3.wccftech.com/wp-content/uploads/2013/07/AMD-Steamroller-vs-Bulldozer.jpg

[7] Lua error in package.lua at line 80: module 'strict' not found.

[8] How good is hyperthreading?

[9] Hyper-Threading Considered Harmful

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

v t e CPU technologies
Architecture	Von Neumann Harvard (Modified Harvard) Dataflow TTA
Instruction set	ASIP CISC RISC EDGE EPIC MISC OISC VLIW NISC ZISC TRIPS Comparison
Word size	1-bit 4-bit 8-bit 9-bit 10-bit 12-bit 15-bit 16-bit 18-bit 22-bit 24-bit 25-bit 26-bit 27-bit 31-bit 32-bit 33-bit 34-bit 36-bit 39-bit 40-bit 48-bit 50-bit 60-bit 64-bit 128-bit 256-bit 512-bit variable
Execution	Instruction pipelining Bubble Operand forwarding Out-of-order execution Register renaming Speculative execution Branch predictor Memory dependence prediction Hazards
Parallel level	Bit Bit-serial Word Instruction Scalar Superscalar Task Thread Process Data Vector Memory
Multithreading	Temporal Simultaneous Preemptive Cooperative
Flynn's taxonomy	SISD SIMD MISD MIMD SPMD Addressing mode
Types	Digital signal processor (DSP) GPGPU Microcontroller Physics processing unit System on a chip (SoC) Cellular
Components	Address generation unit (AGU) Arithmetic logic unit (ALU) Barrel shifter Floating-point unit (FPU) Back-side bus Multiplexer Demultiplexer Registers Memory management unit (MMU) Translation lookaside buffer (TLB) Cache Register file Microcode Control unit Clock rate
Power management	APM ACPI Dynamic frequency scaling Dynamic voltage scaling Clock gating
CPU hardware security	NX bit Hardware restriction (firmware) Trusted Execution Technology Secure cryptoprocessor Hardware security module Hengzhi chip

v t e Parallel computing
General	Distributed computing Cloud computing High-performance computing
Levels	Bit Instruction Task Data Memory
Multithreading	Temporal Simultaneous Preemptive Cooperative
Theory	PRAM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window
Coordination	Multiprocessing Memory coherency Cache coherency Cache invalidation Barrier Synchronization Application checkpointing
Programming	Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD MISD MIMD Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Grid computer
APIs	POSIX Threads OpenMP OpenCL OpenHMPP OpenACC MPI PVM UPC TBB Boost.Thread Global Arrays Ateji PX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP PLINQ TPL
Problems	Embarrassingly parallel Software lockout Scalability Race condition Deadlock Livelock Starvation Deterministic algorithm Parallel slowdown
Category: parallel computing Media related to Lua error in package.lua at line 80: module 'strict' not found. at Wikimedia Commons

Simultaneous multithreading

Contents

Details

Taxonomy

Historical implementations

Modern commercial implementations

Disadvantages

See also

References

External links

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools