Pipeline processor¶
Learning Objectives: Understanding the principles of processor pipelining and resolving problems caused by instruction and data dependencies. Pipeline performance.
In previous material, we introduced the sequential processor and its weaknesses. To enhance processor performance, the concept of pipelining was introduced. In a pipeline, all processor subsystems are active at the same time, each performing a different phase of successive instructions. As the execution of one instruction advances from one phase to the next, the following instruction moves into the phase it has just vacated.
Below is an example of a pipeline in the y86 processor. The letters refer to the different phases of instruction execution: Fetch, Decode, Execute, Memory, and Write Back. Note that the PC update phase is no longer present; an explanation follows shortly.
With pipelined processors, the efficiency/performance of execution increases because the clock cycle length is determined by the longest subsystem phase, rather than the longest instruction execution time. In the figure, three instructions (instruction_1 + instruction_2 + instruction_3) would take 15 (3 x 5) time units in a sequential processor. With a pipeline like the one shown, they can be executed in just 7 time units! This leads to a significant reduction in program execution time.
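The arithmetic above can be sketched in a few lines of Python (a toy model, assuming every phase takes one time unit and the pipeline never stalls):

```python
def sequential_time(instructions: int, stages: int) -> int:
    # Each instruction runs through all stages before the next one starts.
    return instructions * stages

def pipelined_time(instructions: int, stages: int) -> int:
    # The first instruction fills the pipeline (`stages` time units);
    # after that, one instruction completes per time unit.
    return stages + (instructions - 1)

print(sequential_time(3, 5))  # 15 time units
print(pipelined_time(3, 5))   # 7 time units
```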
Example: Comparing the `load word` instruction (loading a value from memory into a register) in a sequential and pipelined MIPS processor. With pipelining, the clock cycle time is reduced to a quarter of its original duration: 800 ps -> 200 ps. The benefit is evident: the pipelined processor delivers instruction results four times as often, one every roughly 200 ps, compared to the sequential processor's one every 800 ps.
But Wait...¶
However, pipelined processors have two fundamental challenges. First, in the microarchitecture of a sequential processor, different subsystems use internal registers and signals during various execution phases! This creates a problem in pipelined implementations when different instructions require the same internal registers simultaneously for their execution.
In pipelined implementations, this issue is solved by introducing additional, separate pipeline registers between the phases. These registers store the input and output of each phase, effectively carrying intermediate results forward with the instruction. This ensures that intermediate results from one instruction do not interfere with those of others, allowing multiple instructions to execute synchronously in different subsystems. However, this slightly increases the duration of each phase, but instructions are still completed at a faster rate.
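The role of the pipeline registers can be sketched as state shifting from one stage to the next on every clock edge (a toy model; real pipeline registers carry decoded instruction fields, intermediate values, and control signals):

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Write Back"]

def clock_tick(stage_contents: list, next_instr):
    # On each clock edge, every stage hands its instruction (and that
    # instruction's intermediate results) to the following stage, and a
    # new instruction enters Fetch.
    return [next_instr] + stage_contents[:-1]

regs = [None] * len(STAGES)
for instr in ["instr_1", "instr_2", "instr_3"]:
    regs = clock_tick(regs, instr)

# After three clock ticks: instr_3 in Fetch, instr_2 in Decode, instr_1 in Execute.
print(list(zip(STAGES, regs)))
```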
The second issue is that computer programs are typically written under the assumption that execution proceeds sequentially from one instruction to the next. That is, the output of one instruction often serves as the input to the next. This introduces dependencies between instructions.
Looking at the diagrams above, we can identify cases where a preceding instruction has not yet reached its Write Back phase when its results are already needed in the Decode phase of the next instruction! This issue is addressed by creating feedback paths between phases so that intermediate results are available to subsequent instructions before they are written to registers/memory.
Dependency Problems¶
Let’s analyze dependency issues between instructions with examples and their solutions:
- Dependency between instruction operands, also known as a data hazard. For example, this occurs when the output of one instruction is the input of another.
- Dependency between instructions themselves, also known as a control hazard. For example, conditional branch instructions rely on status flags set by a preceding instruction.
Data Hazard¶
Consider the following example code, which poses no issues in a sequential processor:
```
irmovq $10,%rdx   # rdx=10
irmovq $3,%rax    # rax=3
addq %rdx,%rax    # rax=rax+rdx
halt
```
When executed on a pipelined processor, a problem arises. The first two instructions do not reach the Write Back phase (where their results are written to the destination registers) before the third instruction requires their values in its Decode phase.
The diagram below shows that the register values are needed at clock cycle 4, but they only become available at cycles 5 and 6.
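The timing conflict can be made concrete with a small sketch (assuming a five-stage pipeline, one stage per cycle, and that instruction i enters Fetch at cycle i):

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Write Back"]

def stage_cycle(instr: int, stage: str) -> int:
    # Instruction `instr` (1-based) enters Fetch at cycle `instr`
    # and advances one stage per clock cycle.
    return instr + STAGES.index(stage)

# addq (instruction 3) reads its registers in Decode at cycle 4 ...
print(stage_cycle(3, "Decode"))      # 4
# ... but the irmovq results are written back only at cycles 5 and 6.
print(stage_cycle(1, "Write Back"))  # 5
print(stage_cycle(2, "Write Back"))  # 6
```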
To resolve data hazards, several techniques are available.
Stalling¶
Execution can be delayed (stalling) by inserting `nop` instructions until the required inputs are available. The `nop` instruction is useful here because it does not alter the contents of the processor's registers.

The diagram shows that by adding three `nop` instructions, the Write Back phases of the first two instructions complete before their values are needed in the Decode phase of the third instruction.

Another method is to freeze the execution of instructions in their current phase until it is safe to proceed. This can be done by freezing the PC register and inserting bubbles that preserve the values in the pipeline registers. Unlike a `nop`, a bubble is not necessarily an instruction, although bubbles are often implemented as `nop` instructions.

Here, the inputs were required at cycle 4, so the PC is frozen and bubbles are inserted starting from cycle 5. As a result, all subsequent instructions are delayed until the required inputs become available at cycle 7.
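Under the same toy model (five stages, one stage per cycle, instruction i entering Fetch at cycle i), the number of bubbles needed can be computed: the consumer's Decode must be delayed until the cycle after the producer's Write Back. This is a sketch of the timing arithmetic, not of the actual stall control logic:

```python
def stall_cycles(producer: int, consumer: int) -> int:
    # Stage cycles: Fetch = instr, Decode = instr + 1, ..., Write Back = instr + 4.
    decode = consumer + 1      # cycle in which the consumer reads its registers
    write_back = producer + 4  # cycle in which the producer writes its result
    # Decode must wait until the cycle after the producer's Write Back.
    return max(0, write_back + 1 - decode)

# addq (instruction 3) depends on the second irmovq (instruction 2):
print(stall_cycles(producer=2, consumer=3))  # 3 bubbles; Decode re-runs at cycle 7
```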
Forwarding¶
A drawback of the above methods is that inserting idle instructions or "idling" the processor reduces its efficiency, wasting clock cycles.
Forwarding (or bypassing) eliminates this problem by allowing the control logic to route intermediate results from internal registers directly to the inputs of the current instruction. In other words, if an input is not yet available, the system checks if the result is already computed/stored somewhere in the pipeline.
This technique can only be applied to signals available during the same clock cycle, which slightly increases the clock cycle duration.
In the diagram, the intermediate results of the first and second instructions (from the pipeline's `valM` and `valE` registers at cycle 4) are routed to the Decode-phase input signals of the `addq` instruction. Since the input values are needed only in the `addq` instruction's Execute phase, they arrive in time.

Memory Address Hazard¶
When an instruction fetches an input through a memory access rather than from a register, a load/use hazard can occur. The value read from memory becomes available only after the Memory phase, but it is already needed by the next instruction.
In this case, the `addq` instruction cannot have both operands available at clock cycle 7. The output of the fourth instruction (`irmovq`) is already available after its Execute phase, but the fifth instruction's output becomes available only after its Memory phase. Forwarding alone cannot help here, because reading from memory into a register requires the Memory phase, so both inputs are available only at clock cycle 8.

The solution is to combine stalling and forwarding: a bubble is added, and the `addq` instruction remains in its Decode phase until both inputs are available for forwarding.

Control Hazards¶
A control hazard refers to situations where the result of one instruction determines where program execution continues (a control dependency). In other words: what is the memory address of the next instruction?
Subroutine Hazard¶
Let's consider a potential hazard caused by the `ret` instruction using the following code example:

```
main:    call funktio
         irmovq $10,%rdx
         halt
funktio: irmovq $3, %rcx
         ret
```

The program's execution on the pipeline is shown below. Here, the subroutine is called and executed, but the return address is determined only during the `ret` instruction's Write Back phase, that is, once it has been fetched from the stack (Memory phase) and stored in the PC register (Write Back phase).

The solution here, too, is to insert bubbles until forwarding can be used to feed the return address into the Fetch phase.
Conditional Branch¶
Conditional branching in pipelined processors can be implemented speculatively in two ways: assume the branch is always taken, or assume it is never taken.
The problem with speculation is that depending on the condition's result, it might fetch wrong instructions if the result differs from the prediction.
Here’s an example of a conditional branch in y86:
```
0000: xorq %rax,%rax
0002: jne target        # Default: assume the branch is always taken!
000b: irmovq $1,%rax
0015: halt
0016: target:
0016: irmovq $2,%rdx
0020: irmovq $3,%rbx
002a: ret
```
In the example below, the instructions following the branch (shown in red) are fetched speculatively based on the assumption that the branch always occurs. However, the correct branch address is only determined during the instruction's Execute phase. This is because the status flags are checked during the Execute phase.
Example: In the diagram, the `jne 0x0016` branch instruction causes two `irmovq` instructions to be fetched into the pipeline in advance from the branch target address. However, in this processor it is only known whether the speculation was correct once the branch is resolved in the Execute phase, when the condition flags are updated.

The solution is to add bubbles to the pipeline until the address is determined.
As we will see later, modern processors prefetch instructions into the pipeline. If the prediction is incorrect, the solution is to remove incorrect instructions from the pipeline and insert bubbles.
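The squash-and-bubble idea can be sketched as follows (a toy model, assuming the branch outcome is known at the end of the Execute phase and that wrongly fetched instructions are simply replaced by bubbles):

```python
def resolve_branch(speculated: list, taken: bool, predicted_taken: bool) -> list:
    # `speculated` holds the instructions fetched after the branch along
    # the predicted path. If the prediction was wrong, they are squashed,
    # i.e. replaced with bubbles; otherwise they proceed unchanged.
    if taken == predicted_taken:
        return speculated
    return ["bubble"] * len(speculated)

# Two instructions fetched assuming the branch is taken, but it was not:
fetched = ["irmovq $2,%rdx", "irmovq $3,%rbx"]
print(resolve_branch(fetched, taken=False, predicted_taken=True))
# ['bubble', 'bubble']
```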
Speculative Instruction Reordering¶
Sometimes it is possible for the processor (or the compiler, or the programmer...) to reorder program execution dynamically so that, instead of a bubble, upcoming instructions are executed: instructions that have been checked and found to have no dependencies on the stalled instructions.
As you might expect, this requires quite advanced control logic.
y86 Pipeline Implementation¶
Let’s now examine the y86-64 processor’s pipeline implementation and how it addresses the aforementioned challenges.
It is evident that the implementation (=microarchitecture) has become significantly more complex. First, pipeline registers (blue color) have been added between each phase. Additionally, various new registers and control signals have been incorporated into different phases to enable feedback loops between phases. Now, in the Decode phase, there is a new control logic block (circled area in the diagram) that selects inputs for the current instruction, either from its arguments or from intermediate results of other instructions.
But how do we know which intermediate result to select at any given time? In the y86 pipeline processor, a priority is set for various intermediate results, and inputs are chosen based on the instruction closest to the current phase. For example, when intermediate results are available from both the Execute and Memory phases of previous instructions, the intermediate results from the Execute phase are chosen because they are closer to the current phase (Decode).
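The priority rule can be sketched as a simple selection function (the stage names and the dictionary interface are illustrative, not the actual y86 signal names):

```python
# Forwarding sources ordered by proximity to the Decode phase: the closer
# the producing stage, the younger the result, and the higher its priority.
PRIORITY = ["Execute", "Memory", "Write Back"]

def select_forwarded(available: dict) -> int:
    # `available` maps a stage name to the intermediate result that stage
    # holds for the register being read in Decode.
    for stage in PRIORITY:
        if stage in available:
            return available[stage]  # the nearest stage wins
    raise LookupError("no forwardable value; read the register file instead")

# Results for the same register exist in both the Execute and Memory phases:
print(select_forwarded({"Memory": 10, "Execute": 42}))  # 42 (Execute is closer)
```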
Additionally, notice the new control logic attached to the Fetch phase (Select PC) and that the PC Update phase has disappeared. In the pipeline implementation, it is moved to the Fetch phase so that the address of the next instruction is fetched as late as possible. This block implements the prediction of the memory address of the next instruction (branch prediction) to improve performance. More on this later...
Now the pipeline register of the Fetch phase (Pred_PC) contains the predicted memory address of the next instruction based on the following rules:
1. If the instruction is not a conditional branch or a jump, the next instruction's address is the current instruction's `valP`.
2. If the instruction is a conditional branch, the condition is assumed to be always true, and the register contains the branch target address.
3. If the condition turns out to be false, the next address is fetched from the intermediate results of the previous instruction, `valA` or `valM`, from the Memory or Write Back phase.

Pipeline Performance¶
The pipeline processor’s performance can be expressed using two computational parameters:
- Execution Time/Latency (in ps): This represents the time it takes to execute an instruction/subsystem.
- Throughput: This refers to the number of instructions executed per second, measured in IPS (Instructions Per Second).
These metrics enable us to design various pipeline solutions that maximize throughput. In microarchitecture, the designer can divide instruction execution into as many phases (theoretically) as needed and insert as many pipeline registers as required. In fact, modern processor implementations now feature up to 18 phases! However, with this increased complexity, the write delay for output or pipeline registers must also be factored into the instruction execution time.
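This trade-off can be written down as a small model (assuming the combinational logic divides evenly across the stages and each pipeline register adds a fixed write delay; the delay values used here match the example that follows):

```python
def pipeline_metrics(logic_ps: float, reg_ps: float, stages: int):
    # Clock cycle = slowest phase (logic split across stages) + register write.
    cycle = logic_ps / stages + reg_ps
    latency = stages * cycle           # total time for one instruction, in ps
    gips = 1000.0 / cycle              # 1 / (cycle * 1e-12 s) = 1000/cycle GIPS
    return cycle, latency, gips

print(pipeline_metrics(300, 20, 1))  # (320.0, 320.0, 3.125)     sequential
print(pipeline_metrics(300, 20, 3))  # (120.0, 360.0, 8.333...)  3-stage pipeline
```

Note how deeper pipelining raises throughput while the latency of a single instruction grows, because the register write delay is paid once per stage.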
Below is an example comparison of the performance difference between a sequential and a pipelined processor.
1. Sequential Processor
- Latency: Instruction execution time (300ps) + write to output register (20ps) = 320ps
- Throughput: 1 / (320*10^-12) = 3.125 GIPS (Giga Instructions Per Second)
2. Pipelined Processor, where instruction execution is divided into three phases.
- Latency: Subsystem execution time (100ps per phase) + write to pipeline register (20ps) = 120ps per phase
- Total instruction execution time = 3 x 120ps = 360ps
- Throughput: 1 / (120*10^-12) = 8.333 GIPS
Now, 8.333 GIPS / 3.125 GIPS ≈ 2.67, which shows that the pipelined processor is significantly more efficient for program execution: instructions are completed at a faster rate, even though the total execution time of an individual instruction is slightly longer (360 ps / 320 ps = 1.125).

Processor Implementations¶
Let’s look at real-world examples of sequential and pipelined implementations in different generations of Intel x86 processor architectures.
8088 processor (16-bit architecture with an 8-bit external data bus): It features an internal prefetch queue holding up to four bytes of upcoming instructions, which are executed sequentially. This technique reduces slow memory accesses.
80286 processor (16-bit architecture): Its internal instruction queue automatically fetches 2–3 instructions at a time. Instructions are executed sequentially. If a fetched instruction is a jump, the queue is flushed and refilled from the jump target address.
80386 processor (32-bit architecture): This implementation is based on a two-phase pipeline, meaning the next instruction is fetched while the current one is being executed. Jump instructions can flush the fetched instruction.
80486 processor (32-bit architecture): An instruction execution cycle takes four clock cycles: fetch from memory, decode, fetch operand from memory, and execute. The 486 pipeline, however, consists of five stages: fetch, decode, address generation, execute, and write-back. The 486 also introduced an integrated math processor for floating-point calculations, whereas earlier models used a separate chip on the same board.
Pentium processors (initially 32-bit architecture): The pipeline has five stages, like the 486. However, Pentium processors use superscalar techniques (more on this later...), allowing multiple independent execution queues and ALUs. Pentium processors have two ALUs for integer calculations: one (V-line) for simple instructions and another (U-line) for all instructions. Additionally, Pentium processors have a separate math coprocessor for floating-point calculations, which has its own pipeline. Thus, Pentium effectively has three distinct ALU pipelines!
Additional bibliography¶
Please refer to the course book Bryant & O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd edition. Chapter 4.
Conclusion¶
Pipeline implementations significantly improve processor performance, but the trade-off is longer execution times for individual instructions and a more demanding microarchitecture design.
In real processors, (M)IPS is not a reliable performance metric because it reflects the best-case scenario, and in actual program execution, IPS varies. We will explore processor performance further in future material.