[Figure: Basic five-stage pipeline. In the best case, this pipeline can sustain a completion rate of one instruction per cycle.]
One of the simplest methods used to accomplish increased
parallelism is to begin the first steps of instruction fetching and decoding
before the prior instruction finishes executing. This is the simplest form of a
technique known as instruction pipelining, and is utilized in almost all modern
general-purpose CPUs. Pipelining allows more than one instruction to be in flight at any given time by breaking the execution pathway down into discrete stages.
This separation can be compared to an assembly line, in which an instruction is
made more complete at each stage until it exits the execution pipeline and is
retired.
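The effect can be illustrated in software. The following Python sketch is a toy model, not a description of any real CPU's circuitry (the stage names match the classic five-stage design, but the function and data layout are purely illustrative): each in-flight instruction advances one stage per clock cycle, so once the pipeline fills, five instructions are being worked on at once and one completes every cycle.

    # Toy model of a five-stage pipeline: every in-flight instruction
    # advances one stage per cycle, so up to five are in flight at once.
    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def simulate(instructions):
        pending = list(instructions)
        in_flight = []                     # [instruction, stage_index] pairs
        cycle = 0
        while pending or in_flight:
            cycle += 1
            for entry in in_flight:        # advance everything one stage
                entry[1] += 1
            if pending:                    # fetch the next instruction
                in_flight.append([pending.pop(0), 0])
            print(f"cycle {cycle}:",
                  ", ".join(f"{op}:{STAGES[s]}" for op, s in in_flight))
            # retire instructions that have completed the final stage (WB)
            in_flight = [e for e in in_flight if e[1] < len(STAGES) - 1]

    simulate(["add", "sub", "mul", "load", "store"])

Five instructions finish in nine cycles rather than the twenty-five a non-pipelined design would need, approaching one instruction per cycle as the run lengthens.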
Pipelining does, however, introduce the possibility of a situation where the result of the previous operation is needed to complete the next operation, a condition often termed a data dependency conflict (or data hazard). To cope with
this, additional care must be taken to check for these sorts of conditions and
delay a portion of the instruction pipeline if this occurs. Naturally,
accomplishing this requires additional circuitry, so pipelined processors are
more complex than subscalar ones (though not very significantly so). A pipelined
processor can become very nearly scalar, inhibited only by pipeline stalls (an
instruction spending more than one clock cycle in a stage).
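In software terms, the interlock amounts to a check of the sort sketched below. This is a simplified model assuming a fixed result latency, with illustrative names; real hazard-detection logic is wired into the pipeline itself. The rule is that an instruction may not issue while a register it reads is still being written.

    # Simplified in-order issue with hazard detection: stall (insert
    # bubbles) while any source register is still being computed.
    def issue_order(program, latency=3):
        busy = {}                          # dest register -> cycle when ready
        cycle = 0
        for dest, srcs in program:
            while any(busy.get(r, 0) > cycle for r in srcs):
                cycle += 1
                print(f"cycle {cycle}: stall")
            cycle += 1
            busy[dest] = cycle + latency   # result ready `latency` cycles on
            print(f"cycle {cycle}: issue {dest} <- {srcs}")

    # r2 reads r1, so its issue is delayed until r1's result is ready.
    issue_order([("r1", []), ("r2", ["r1"]), ("r3", [])])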
[Figure: Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed.]
Further improvement upon the idea of instruction pipelining
led to the development of a method that decreases the idle time of CPU
components even further. Designs that are said to be superscalar include a long
instruction pipeline and multiple identical execution units.[Huynh 2003] In a
superscalar pipeline, multiple instructions are read and passed to a dispatcher,
which decides whether or not the instructions can be executed in parallel
(simultaneously). If so, they are dispatched to available execution units so that several instructions can execute at once.
In general, the more instructions a superscalar CPU is able to dispatch
simultaneously to waiting execution units, the more instructions will be
completed in a given cycle.
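A dual-issue dispatcher can be sketched as follows. This is a toy model using the same (destination, sources) representation as the earlier sketch; real dispatchers examine many more conditions, such as which execution units are free. Two adjacent instructions issue together only when the second does not depend on the first.

    # Toy dual-issue dispatch: pair an instruction with its successor
    # when the successor neither reads nor writes its destination.
    def dispatch(program):
        i, cycle = 0, 0
        while i < len(program):
            cycle += 1
            group = [program[i]]
            nxt = program[i + 1] if i + 1 < len(program) else None
            if nxt and program[i][0] not in (nxt[0], *nxt[1]):
                group.append(nxt)          # independent: issue both at once
            i += len(group)
            print(f"cycle {cycle}: issue {[g[0] for g in group]}")

    dispatch([("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r3"])])

Here r2 depends on r1 and must issue alone, but r2 and r3 are independent and issue as a pair, so four instructions issue in three cycles instead of four.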
Most of the difficulty in the design of a superscalar CPU
architecture lies in creating an effective dispatcher. The dispatcher needs to
be able to quickly and correctly determine whether instructions can be executed
in parallel, as well as dispatch them in such a way as to keep as many execution
units busy as possible. This requires that the instruction pipeline be filled as
often as possible and gives rise to the need in superscalar architectures for
significant amounts of CPU cache. It also makes hazard-avoiding techniques like
branch prediction, speculative execution, and out-of-order execution crucial to
maintaining high levels of performance. By attempting to predict which branch
(or path) a conditional instruction will take, the CPU can minimize the number
of times that the entire pipeline must wait until a conditional instruction is
completed. Speculative execution often provides modest performance increases by
executing portions of code that may or may not be needed after a conditional
operation completes. Out-of-order execution rearranges, to a degree, the order in
which instructions are executed to reduce delays due to data dependencies.
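Branch prediction in particular lends itself to a compact illustration. The sketch below implements the classic two-bit saturating counter scheme, one of many real predictor designs (the class name and the assumed branch pattern are illustrative): two consecutive mispredictions are required to flip a stable prediction, so a loop branch that is almost always taken is predicted well.

    # Two-bit saturating counter: states 0-1 predict not-taken,
    # states 2-3 predict taken; each outcome nudges the state by one.
    class TwoBitPredictor:
        def __init__(self):
            self.state = 2                 # start weakly "taken"

        def predict(self):
            return self.state >= 2         # True means "branch taken"

        def update(self, taken):
            self.state = (min(3, self.state + 1) if taken
                          else max(0, self.state - 1))

    p = TwoBitPredictor()
    hits = 0
    for taken in [True] * 9 + [False]:     # loop taken 9 times, then exits
        hits += (p.predict() == taken)
        p.update(taken)
    print(f"{hits}/10 predictions correct")   # 9/10 for this pattern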
In the case where a portion of the CPU is superscalar and
part is not, the part which is not suffers a performance penalty due to
scheduling stalls. The original Intel Pentium (P5) had two superscalar ALUs
which could accept one instruction per clock each, but its FPU could not accept
one instruction per clock. Thus the P5 was integer superscalar but not floating
point superscalar. Intel's successor to the Pentium architecture, P6, added
superscalar capabilities to its floating point features, and therefore afforded
a significant increase in floating point instruction performance.
Both simple pipelining and superscalar design increase a CPU's instruction-level parallelism (ILP) by allowing a single processor to complete execution of instructions at rates surpassing one instruction per cycle (IPC).[10] Most modern CPU designs
are at least somewhat superscalar, and nearly all general purpose CPUs designed
in the last decade are superscalar. In later years, some of the emphasis in designing high-ILP computers has moved out of the CPU's hardware and into
its software interface, or instruction set architecture (ISA). The strategy of the very long instruction word (VLIW) causes some ILP to be expressed directly by the software, reducing the
amount of work the CPU must perform to boost ILP and thereby reducing the
design's complexity.
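A toy sketch of the VLIW idea, using the same instruction representation as the earlier examples: here the "compiler" packs independent operations into fixed-width bundles ahead of time, so the hardware can execute each bundle's slots in lockstep without doing its own dependency checking (the function name and two-slot width are illustrative).

    # Pack independent operations into fixed-width VLIW-style bundles;
    # each inner list stands for one very long instruction word.
    def pack_bundles(program, width=2):
        bundles, current, written = [], [], set()
        for dest, srcs in program:
            if any(r in written for r in srcs) or len(current) == width:
                bundles.append(current)    # close the bundle, start a new one
                current, written = [], set()
            current.append((dest, srcs))
            written.add(dest)
        if current:
            bundles.append(current)
        return bundles

    # r3 depends on r1, so it is forced into the next bundle.
    print(pack_bundles([("r1", []), ("r2", []), ("r3", ["r1"]), ("r4", [])]))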
Thread-level parallelism
Another strategy for achieving performance is to execute multiple programs or threads in parallel. This area of research is known as parallel computing. In Flynn's taxonomy, this strategy is known as multiple instruction, multiple data (MIMD).
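From the software side, MIMD simply means several independent instruction streams operating on their own data, as in this minimal Python sketch. (Note that CPython's global interpreter lock serializes pure-Python bytecode, so truly parallel speedup requires processes or native code; only the structure matters here.)

    import threading

    def worker(name, data):
        # each thread runs its own instruction stream on its own data
        print(f"{name}: sum = {sum(data)}")

    threads = [threading.Thread(target=worker, args=(f"t{i}", range(i, i + 50)))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()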
One technology used for this purpose was multiprocessing
(MP). The earliest form of this technology is known as symmetric
multiprocessing (SMP), where a small number of CPUs share a coherent view of
their memory system. In this scheme, each CPU has additional hardware to
maintain a constantly up-to-date view of memory. By avoiding stale views of
memory, the CPUs can cooperate on the same program and programs can migrate from
one CPU to another. To increase the number of cooperating CPUs beyond a handful,
schemes such as non-uniform memory access (NUMA) and directory-based coherence
protocols were introduced in the 1990s. SMP systems are limited to a small number of CPUs, while NUMA systems have been built with thousands of processors.
Initially, multiprocessing was built using multiple discrete CPUs and boards to
implement the interconnect between the processors. When the processors and their
interconnect are all implemented on a single silicon chip, the technology is
known as a multi-core microprocessor.
It was later recognized that finer-grain parallelism existed within a single program. A single program might have several threads (or functions) that could be executed separately or in parallel. Some of the earliest
examples of this technology implemented input/output processing such as direct
memory access as a separate thread from the computation thread. A more general
approach to this technology was introduced in the 1970s when systems were
designed to run multiple computation threads in parallel. This technology is
known as multi-threading (MT). This approach is considered more cost-effective
than multiprocessing, as only a small number of components within a CPU are
replicated in order to support MT as opposed to the entire CPU in the case of
MP. In MT, the execution units and the memory system including the caches are
shared among multiple threads. The downside of MT is that the hardware support
for multithreading is more visible to software than that of MP, and thus supervisor software like operating systems has to undergo larger changes to support MT. One type of MT that was implemented is known as block multithreading, where one thread is executed until it is stalled waiting for data to return from external memory. In this scheme, the CPU would then quickly switch to another thread which is ready to run, the switch often done in one CPU clock cycle, as in the UltraSPARC T1. Another type of MT is known as
simultaneous multithreading, where instructions of multiple threads are executed
in parallel within one CPU clock cycle.
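Block multithreading, as described above, can be modeled with a few lines of Python. This is a toy scheduler, not the pipeline logic of any real chip; in actual hardware a stalled thread is resumed only once its memory access has completed.

    # Toy block-multithreading scheduler: run a thread until it reports a
    # stall (a simulated cache miss), then switch to another ready thread.
    def run(threads):
        ready, cycle = list(threads), 0
        while ready:
            tid, body = ready.pop(0)
            try:
                while True:
                    cycle += 1
                    if next(body) == "stall":
                        print(f"cycle {cycle}: thread {tid} stalls, switch")
                        ready.append((tid, body))   # requeue for later
                        break
            except StopIteration:
                print(f"cycle {cycle}: thread {tid} finished")

    def body():                            # yields work, then a fake miss
        for _ in range(2):
            yield "compute"
            yield "stall"

    run([(0, body()), (1, body())])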
For several decades, from the 1970s to the early 2000s, the
focus in designing high performance general purpose CPUs was largely on
achieving high ILP through technologies such as pipelining, caches, superscalar
execution, out-of-order execution, etc. This trend culminated in large,
power-hungry CPUs such as the Intel Pentium 4. By the early 2000s, CPU designers
were thwarted from achieving higher performance from ILP techniques due to the
growing disparity between CPU operating frequencies and main memory operating
frequencies as well as escalating CPU power dissipation owing to more esoteric
ILP techniques.
CPU designers then borrowed ideas from commercial computing
markets such as transaction processing, where the aggregate performance of
multiple programs, also known as throughput computing, was more important than
the performance of a single thread or program.
This reversal of emphasis is evidenced by the proliferation
of dual- and multiple-core CMP (chip-level multiprocessing) designs and, notably, Intel's newer designs resembling its less superscalar P6 architecture. Late
designs in several processor families exhibit CMP, including the x86-64 Opteron
and Athlon 64 X2, the SPARC UltraSPARC T1, IBM POWER4 and POWER5, as well as
several video game console CPUs like the Xbox 360's triple-core PowerPC design,
and the PlayStation 3's 8-core Cell microprocessor.
Data parallelism
A less common but increasingly important paradigm of CPUs
(and indeed, computing in general) deals with data parallelism. The processors
discussed earlier are all referred to as some type of scalar device.[11] As the
name implies, vector processors deal with multiple pieces of data in the context
of one instruction. This contrasts with scalar processors, which deal with one
piece of data for every instruction. Using Flynn's taxonomy, these two schemes
of dealing with data are generally referred to as SISD (single instruction,
single data) and SIMD (single instruction, multiple data), respectively. The
great utility in creating CPUs that deal with vectors of data lies in optimizing
tasks that tend to require the same operation (for example, a sum or a dot
product) to be performed on a large set of data. Some classic examples of these
types of tasks are multimedia applications (images, video, and sound), as well
as many types of scientific and engineering tasks. Whereas a scalar CPU must
complete the entire process of fetching, decoding, and executing each
instruction and value in a set of data, a vector CPU can perform a single
operation on a comparatively large set of data with one instruction. Of course,
this is only possible when the application tends to require many steps which
apply one operation to a large set of data.
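The contrast can be seen from software using NumPy as a stand-in (an illustration of the programming model rather than of any particular instruction set): one vectorized expression applies the same operation across an entire array, much as one SIMD instruction operates on a whole vector register, while the scalar style repeats per-element overhead.

    import numpy as np

    a = np.arange(1_000_000, dtype=np.float32)
    b = np.ones_like(a)

    # Scalar style: one fetch/decode/execute sequence per element.
    scalar = [float(a[i] + b[i]) for i in range(len(a))]

    # Vector style: a single operation expressed over the whole data set;
    # NumPy typically dispatches this to SIMD-capable native loops.
    vector = a + b

    assert scalar[:3] == list(vector[:3])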
Most early vector CPUs, such as the Cray-1, were associated
almost exclusively with scientific research and cryptography applications.
However, as multimedia has largely shifted to digital formats, the need for some form of SIMD in general-purpose CPUs has become significant. Shortly after the inclusion of floating point execution units became commonplace in general-purpose processors, specifications for and implementations of SIMD execution units also began to appear for general-purpose CPUs. Some of these
early SIMD specifications like HP's Multimedia Acceleration eXtensions (MAX) and
Intel's MMX were integer-only. This proved to be a significant impediment for
some software developers, since many of the applications that benefit from SIMD
primarily deal with floating point numbers. Over time, these early designs
were refined and remade into some of the common, modern SIMD specifications,
which are usually associated with one ISA. Some notable modern examples are
Intel's SSE and the PowerPC-related AltiVec (also known as VMX).