The fundamental operation of most CPUs, regardless of the
physical form they take, is to execute a sequence of stored instructions called
a program. The program is represented by a series of numbers that are kept in
some kind of computer memory. There are four steps that nearly all CPUs use in
their operation: fetch, decode, execute, and writeback.
The first step, fetch, involves retrieving an instruction
(which is represented by a number or sequence of numbers) from program memory.
The location in program memory is determined by a program counter (PC), which
stores a number that identifies the current position in the program. In other
words, the program counter keeps track of the CPU's place in the current
program. After an instruction is fetched, the PC is incremented by the length of
the instruction word in terms of memory units.[2] Often the instruction to be
fetched must be retrieved from relatively slow memory, causing the CPU to stall
while waiting for the instruction to be returned. This issue is largely
addressed in modern processors by caches and pipeline architectures (see below).
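The fetch step described above can be sketched in a few lines of Python. The memory contents, the one-byte instruction width, and the variable names are illustrative assumptions, not features of any real ISA:

```python
# Toy fetch step: program memory is a list of byte values, and each
# instruction word occupies exactly one memory unit (an assumption).
memory = [0x12, 0x34, 0x56]  # hypothetical encoded instructions
pc = 0                       # program counter

def fetch():
    """Return the instruction at the PC and advance the PC by one word."""
    global pc
    instruction = memory[pc]
    pc += 1  # increment by the instruction length (1 memory unit here)
    return instruction

first = fetch()   # reads memory[0]; pc becomes 1
second = fetch()  # reads memory[1]; pc becomes 2
```

On a machine with variable-length instructions, the increment would be the length of the word just fetched rather than a constant.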
The instruction that the CPU fetches from memory is used to
determine what the CPU is to do. In the decode step, the instruction is broken
up into parts that have significance to other portions of the CPU. The way in
which the numerical instruction value is interpreted is defined by the CPU's
instruction set architecture (ISA).[3] Often, one group of numbers in the
instruction, called the opcode, indicates which operation to perform. The
remaining parts of the number usually provide information required for that
instruction, such as operands for an addition operation. Such operands may be
given as a constant value (called an immediate value), or as a place to locate a
value: a register or a memory address, as determined by some addressing mode. In
older designs the portions of the CPU responsible for instruction decoding were
unchangeable hardware devices. However, in more abstract and complicated CPUs
and ISAs, a microprogram is often used to assist in translating instructions
into various configuration signals for the CPU. This microprogram is sometimes
rewritable so that it can be modified to change the way the CPU decodes
instructions even after it has been manufactured.
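The splitting of an instruction into opcode and operand fields can be illustrated with bit masking. The 16-bit word layout below (a 4-bit opcode followed by three 4-bit register fields) is an invented example, not a real instruction set:

```python
# Toy decode step for a hypothetical 16-bit instruction word laid out as
# [opcode:4][dest register:4][operand A:4][operand B:4].
def decode(word):
    opcode = (word >> 12) & 0xF  # which operation to perform
    dest   = (word >> 8) & 0xF   # destination register number
    src_a  = (word >> 4) & 0xF   # first operand register
    src_b  = word & 0xF          # second operand register
    return opcode, dest, src_a, src_b

# 0x1234 decodes to opcode 1, destination r2, operands r3 and r4.
fields = decode(0x1234)
```

A hardwired decoder performs this same field extraction with fixed wiring; a microprogrammed decoder instead looks the opcode up in a (possibly rewritable) control store.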
After the fetch and decode steps, the execute step is
performed. During this step, various portions of the CPU are connected so they
can perform the desired operation. If, for instance, an addition operation was
requested, an arithmetic logic unit (ALU) will be connected to a set of inputs
and a set of outputs. The inputs provide the numbers to be added, and the
outputs will contain the final sum. The ALU contains the circuitry to perform
simple arithmetic and logical operations on the inputs (like addition and
bitwise operations). If the addition operation produces a result too large for
the CPU to handle, an arithmetic overflow flag in a flags register may also be
set.
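The execute step for an addition can be modeled as below. This is a simplified unsigned-carry model of an 8-bit ALU; real ISAs distinguish carry from signed overflow, so the flag semantics here are an assumption:

```python
# Toy 8-bit ALU add: if the true sum does not fit in eight bits, the
# result wraps around and an overflow flag is set.
def alu_add(a, b, bits=8):
    total = a + b
    mask = (1 << bits) - 1
    result = total & mask     # keep only the low 8 bits
    overflow = total > mask   # flag set when the sum is too large
    return result, overflow

res, flag = alu_add(200, 100)  # 300 does not fit in 8 bits
```

Here `res` wraps to 44 and `flag` is set, mirroring the flags-register behavior described above.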
The final step, writeback, simply "writes back" the results
of the execute step to some form of memory. Very often the results are written
to some internal CPU register for quick access by subsequent instructions. In
other cases results may be written to slower, but cheaper and larger, main
memory. Some types of instructions manipulate the program counter rather than
directly produce result data. These are generally called "jumps" and facilitate
behavior like loops, conditional program execution (through the use of a
conditional jump), and functions in programs.[4] Many instructions will also
change the state of digits in a "flags" register. These flags can be used to
influence how a program behaves, since they often indicate the outcome of
various operations. For example, one type of "compare" instruction considers two
values and sets a number in the flags register according to which one is
greater. This flag could then be used by a later jump instruction to determine
program flow.
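The interaction between a compare instruction, the flags register, and a later conditional jump can be sketched as follows; the flag name and the one-flag register are illustrative assumptions:

```python
# Toy compare-and-branch: "compare" records an outcome in a flags
# register, and a later conditional jump reads that flag to decide
# where the program counter goes next.
flags = {"greater": False}

def compare(a, b):
    flags["greater"] = a > b  # sets a flag rather than producing a result

def jump_if_greater(pc, target):
    """Return the next PC: the jump target if the flag is set, else pc + 1."""
    return target if flags["greater"] else pc + 1

compare(7, 3)                      # sets the "greater" flag
next_pc = jump_if_greater(10, 42)  # flag is set, so control moves to 42
```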
After the execution of the instruction and writeback of the
resulting data, the entire process repeats, with the next instruction cycle
normally fetching the next-in-sequence instruction because of the incremented
value in the program counter. If the completed instruction was a jump, the
program counter will be modified to contain the address of the instruction that
was jumped to, and program execution continues normally. In more complex CPUs
than the one described here, multiple instructions can be fetched, decoded, and
executed simultaneously. This section describes what is generally referred to as
the "Classic RISC pipeline," which in fact is quite common among the simple CPUs
used in many electronic devices (often called microcontrollers). It largely
ignores the important role of the CPU cache, and therefore the memory-access
stage of the pipeline.
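The complete repeating cycle can be condensed into one loop. The two-field encoding below (opcode in the high nibble, immediate in the low nibble) and the tiny three-opcode repertoire are assumptions made for the sketch, not any real machine:

```python
# Minimal instruction cycle: fetch, decode, execute, writeback, repeat.
# Opcode 0x0 adds the immediate to an accumulator, opcode 0x1 jumps to
# the immediate address, and opcode 0xF halts (for the sketch only).
def run(memory):
    pc, acc = 0, 0
    while True:
        word = memory[pc]                    # fetch
        opcode, imm = word >> 4, word & 0xF  # decode
        if opcode == 0x0:                    # execute: add immediate
            acc += imm
            pc += 1                          # normal sequential advance
        elif opcode == 0x1:                  # execute: jump writes the PC
            pc = imm
        elif opcode == 0xF:
            return acc                       # halt
        else:
            raise ValueError("unknown opcode")

# add 3, add 4, halt: the accumulator ends at 7
result = run([0x03, 0x04, 0xF0])
```

Note how the jump opcode manipulates the program counter directly, while the add opcode writes its result back and lets the PC advance in sequence.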
Design and implementation
Integer range
The way a CPU represents numbers is a design choice that
affects the most basic ways in which the device functions. Some early digital
computers used an electrical model of the common decimal (base ten) numeral
system to represent numbers internally. A few other computers have used more
exotic numeral systems like ternary (base three). Nearly all modern CPUs
represent numbers in binary form, with each digit being represented by some
two-valued physical quantity such as a "high" or "low" voltage.
Related to number representation is the size and precision
of numbers that a CPU can represent. In the case of a binary CPU, a bit refers
to one significant place in the numbers a CPU deals with. The number of bits (or
numeral places) a CPU uses to represent numbers is often called "word size",
"bit width", "data path width", or "integer precision" when dealing with
strictly integer numbers (as opposed to floating point). This number differs
between architectures, and often within different parts of the very same CPU.
For example, an 8-bit CPU deals with a range of numbers that can be represented
by eight binary digits (each digit having two possible values), that is, 2^8 or
256 discrete numbers. In effect, integer size sets a hardware limit on the range
of integers the software run by the CPU can utilize.
Integer range can also affect the number of locations in
memory the CPU can address (locate). For example, if a binary CPU uses 32 bits
to represent a memory address, and each memory address represents one octet (8
bits), the maximum quantity of memory that CPU can address is 2^32 octets, or 4
GiB. This is a very simple view of CPU address space, and many designs use more
complex addressing methods like paging in order to locate more memory than their
integer range would allow with a flat address space.
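The range and address-space figures above follow directly from the bit widths, since an n-bit field can distinguish 2^n values:

```python
# An 8-bit word distinguishes 2**8 values; a 32-bit byte address
# reaches 2**32 octets, which is 4 GiB (1 GiB = 2**30 bytes).
values_8bit = 2 ** 8              # 256 discrete numbers
address_space = 2 ** 32           # addressable octets with 32-bit addresses
gib = address_space // (2 ** 30)  # the same quantity expressed in GiB
```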
Higher levels of integer range require more structures to
deal with the additional digits, and therefore more complexity, size, power
usage, and general expense. It is not at all uncommon, therefore, to see 4- or
8-bit microcontrollers used in modern applications, even though CPUs with much
higher range (such as 16, 32, 64, even 128-bit) are available. The simpler
microcontrollers are usually cheaper, use less power, and therefore dissipate
less heat, all of which can be major design considerations for electronic
devices. However, in higher-end applications, the benefits afforded by the extra
range (most often the additional address space) are more significant and often
affect design choices. To gain some of the advantages afforded by both lower and
higher bit lengths, many CPUs are designed with different bit widths for
different portions of the device. For example, the IBM System/370 used a CPU
that was primarily 32-bit, but it used 128-bit precision inside its floating
point units to facilitate greater accuracy and range in floating point numbers
(Amdahl et al. 1964). Many later CPU designs use similar mixed bit width,
especially when the processor is meant for general-purpose usage where a
reasonable balance of integer and floating point capability is required.
Clock rate
Most CPUs, and indeed most sequential logic devices, are
synchronous in nature.[7] That is, they are designed and operate on assumptions
about a synchronization signal. This signal, known as a clock signal, usually
takes the form of a periodic square wave. By calculating the maximum time that
electrical signals can move in various branches of a CPU's many circuits, the
designers can select an appropriate period for the clock signal.
This period must be longer than the amount of time it takes
for a signal to move, or propagate, in the worst-case scenario. In setting the
clock period to a value well above the worst-case propagation delay, it is
possible to design the entire CPU and the way it moves data around the "edges"
of the rising and falling clock signal. This has the advantage of simplifying
the CPU significantly, both from a design perspective and a component-count
perspective. However, it also carries the disadvantage that the entire CPU must
wait on its slowest elements, even though some portions of it are much faster.
This limitation has largely been compensated for by various methods of
increasing CPU parallelism (see below).
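The timing rule described above can be stated numerically: the clock period must exceed the worst-case propagation delay through any path, so the single slowest path caps the clock rate of the whole CPU. The delay figures and the 10% margin below are made-up values for illustration:

```python
# Hypothetical worst-case propagation delays through several circuit
# branches, in nanoseconds.
path_delays_ns = [0.8, 1.2, 2.5, 1.9]
worst_case = max(path_delays_ns)  # the slowest path limits the whole CPU

period_ns = worst_case * 1.1      # clock period with a safety margin
max_clock_ghz = 1 / period_ns     # frequency is the reciprocal of period
```

Shortening the 2.5 ns path (and no other) would raise the attainable clock rate, which is why the text notes that the entire CPU "must wait on its slowest elements."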
However, architectural improvements alone do not solve all
of the drawbacks of globally synchronous CPUs. For example, a clock signal is
subject to the delays of any other electrical signal. Higher clock rates in
increasingly complex CPUs make it more difficult to keep the clock signal in
phase (synchronized) throughout the entire unit. This has led many modern CPUs
to require multiple identical clock signals to be provided in order to avoid
delaying a single signal significantly enough to cause the CPU to malfunction.
Another major issue as clock rates increase dramatically is the amount of heat
that is dissipated by the CPU. The constantly changing clock causes many
components to switch regardless of whether they are being used at that time. In
general, a component that is switching uses more energy than an element in a
static state. Therefore, as clock rate increases, so does heat dissipation,
causing the CPU to require more effective cooling solutions.
One method of dealing with the switching of unneeded
components is called clock gating, which involves turning off the clock signal
to unneeded components (effectively disabling them). However, this is often
regarded as difficult to implement and therefore does not see common usage
outside of very low-power designs.[8] Another method of addressing some of the
problems with a global clock signal is the removal of the clock signal
altogether. While removing the global clock signal makes the design process
considerably more complex in many ways, asynchronous (or clockless) designs
carry marked advantages in power consumption and heat dissipation in comparison
with similar synchronous designs. While somewhat uncommon, entire asynchronous
CPUs have been built without utilizing a global clock signal. Two notable
examples of this are the ARM-compliant AMULET and the MIPS R3000-compatible
MiniMIPS. Rather than totally removing the clock signal, some CPU designs allow
certain portions of the device to be asynchronous, such as using asynchronous
ALUs in conjunction with superscalar pipelining to achieve some arithmetic
performance gains. While it is not altogether clear whether totally asynchronous
designs can perform at a comparable or better level than their synchronous
counterparts, it is evident that they do at least excel in simpler math
operations. This, combined with their excellent power consumption and heat
dissipation properties, makes them very suitable for embedded computers (Garside
et al. 1999).
Parallelism
The description of the basic operation of a CPU offered in
the previous section describes the simplest form that a CPU can take. This type
of CPU, usually referred to as subscalar, operates on and executes one
instruction on one or two pieces of data at a time.
This process gives rise to an inherent inefficiency in
subscalar CPUs. Since only one instruction is executed at a time, the entire CPU
must wait for that instruction to complete before proceeding to the next
instruction. As a result the subscalar CPU gets "hung up" on instructions which
take more than one clock cycle to complete execution. Even adding a second
execution unit (see below) does not improve performance much; rather than one
pathway being hung up, now two pathways are hung up and the number of unused
transistors is increased. This design, wherein the CPU's execution resources can
operate on only one instruction at a time, can only possibly reach scalar
performance (one instruction per clock). However, the performance is nearly
always subscalar (less than one instruction per cycle).
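The scalar ceiling can be quantified as instructions per cycle (IPC): a single-issue CPU retires at most one instruction per cycle, and any multi-cycle instruction drags the achieved rate below one. The cycle counts below are illustrative assumptions:

```python
# Five instructions, two of which stall the pipeline for several cycles.
cycle_counts = [1, 1, 3, 1, 4]               # cycles each instruction took
ipc = len(cycle_counts) / sum(cycle_counts)  # 5 instructions in 10 cycles
```

Here the achieved IPC is 0.5, i.e. subscalar performance, even though the hardware's theoretical limit is 1.0.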
Attempts to achieve scalar and better performance have
resulted in a variety of design methodologies that cause the CPU to behave less
linearly and more in parallel. When referring to parallelism in CPUs, two terms
are generally used to classify these design techniques. Instruction level
parallelism (ILP) seeks to increase the rate at which instructions are executed
within a CPU (that is, to increase the utilization of on-die execution
resources), and thread level parallelism (TLP) aims to increase the number
of threads (effectively individual programs) that a CPU can execute
simultaneously. The methodologies differ both in the ways they are
implemented and in the relative effectiveness they afford in increasing a
CPU's performance for an application.