This section covers all information needed to implement Task 2: Pipelining. First we describe our standard processor design. Then, we discuss how we changed our design to include pipelining. Finally, we cover what hazards our processor is susceptible to, and how we adjusted our pipeline to cover for these.
Our processor implementation reflects the standard five stages:
- Instruction Fetch
- Decode
- Execution
- Memory
- Write Back
Stage 1: Instruction Fetch
The instruction fetch stage consists of reading in the current instruction from the instruction memory, and updating the program counter (PC) so that it is ready for the next instruction.
The Ifetch block reads in 32-bit instructions from a file called program.mif. If the input bit reset is '1', the PC is reset to zero. If the input bits Zero and Branch are both set to '1', then a branch has occurred and the PC is set to the input bus Add_result[7..0]. Otherwise, the Ifetch block outputs its data when the input bit clock changes to '1'.
The Ifetch block outputs the instruction at the current PC on the output bus Instruction[31..0]. Since each instruction takes four bytes of memory, the address of the next instruction is the current PC + 4. This next PC is calculated and made available on the output bus PC_plus_4_out[9..0]. If a jump were to occur, the address would be on the output bus PC_out[9..0]; however, this functionality is not used by our processor.
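The PC update rule described above can be summarized in a short behavioral sketch. This is Python rather than the hardware description used for the actual blocks, and the function name next_pc is hypothetical. It also assumes that Add_result[7..0] holds a word address, which would explain why it is two bits narrower than the 10-bit PC; if the real design treats it as a byte address, the shift should be dropped.

```python
PC_MASK = 0x3FF  # PC_out and PC_plus_4_out are 10 bits wide


def next_pc(pc, reset, branch, zero, add_result):
    """Return the PC for the next instruction (illustrative only)."""
    if reset:
        # reset = '1' forces the PC back to zero
        return 0
    if branch and zero:
        # Assumption: Add_result[7..0] is a word address, so it is
        # shifted left by 2 to form a 10-bit byte address.
        return ((add_result & 0xFF) << 2) & PC_MASK
    # Default case: each instruction is four bytes long
    return (pc + 4) & PC_MASK
```

The mask keeps the result within the 10-bit width of the PC buses, so the counter wraps around instead of growing without bound.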
Stage 2: Decode
The decode phase first sets up the control unit according to the instruction. Then, it accesses the appropriate registers and forwards on their data.
The control block sets up control flags for the processor based on the operation code (op code) of the instruction. The op code is the first six bits of an instruction (Instruction[31..26]), and is taken on the input bus Opcode[5..0]. From the op code, the following control bits are set:
- RegDst is '1' if the op code signals an R-type instruction.
- ALUSrc is '1' if the op code signals a load or a store instruction.
- MemtoReg is '1' if the op code signals a load instruction.
- RegWrite is '1' if the op code signals an R-type instruction or a load instruction.
- MemRead is '1' if the op code signals a load instruction.
- MemWrite is '1' if the op code signals a store instruction.
- Branch is '1' if the op code signals a branch instruction.
The control block also sets up the operation code for the execution stage on the output bus ALUOp[1..0]. The first bit, ALUOp[1], is set to '1' if the op code signals an R-type instruction, and the second bit, ALUOp[0], is set to '1' if the op code signals a branch instruction.
Though the control block also reads input bits clock and reset, these signals are currently never used in the block.
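The control rules above amount to a small truth table, which can be sketched in Python as follows. The op code values used here are the standard MIPS encodings for R-type, lw, sw, and beq; the function name control and the packing of ALUOp with the R-type bit as the high bit are assumptions made for illustration, not taken from the provided block.

```python
# Standard MIPS op codes (Instruction[31..26])
R_TYPE, LW, SW, BEQ = 0b000000, 0b100011, 0b101011, 0b000100


def control(opcode):
    """Map a 6-bit op code to the control signals described above."""
    r_type = opcode == R_TYPE
    load = opcode == LW
    store = opcode == SW
    branch = opcode == BEQ
    return {
        "RegDst": int(r_type),
        "ALUSrc": int(load or store),
        "MemtoReg": int(load),
        "RegWrite": int(r_type or load),
        "MemRead": int(load),
        "MemWrite": int(store),
        "Branch": int(branch),
        # Assumption: high bit flags R-type, low bit flags branch
        "ALUOp": (int(r_type) << 1) | int(branch),
    }
```

Any op code outside these four leaves every signal at zero in this sketch, so an unrecognized instruction neither writes a register nor touches memory.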
The Idecode block begins decoding the instruction into its separate fields. It is specifically responsible for reading and writing to the registers. These tasks are actually split between the Decode stage and the Write Back stage, so we will only discuss the Decode stage for now.
In the Decode stage, the Idecode block is used to read the data already stored in the registers and pass it on to the next stage. The Idecode block first takes the entire instruction in on the input bus Instruction[31..0]. It then breaks down the instruction into the appropriate register addresses: the first read register address ($rs) is Instruction[25..21], the second read register address ($rt) is Instruction[20..16], the R-type destination register address ($rd) is Instruction[15..11], the I-type destination register address ($rt) is Instruction[20..16], and the immediate value is Instruction[15..0]. Then, the Idecode block reads the register memory at the first and second read register addresses, and sends their data on the output buses read_data_1[31..0] and read_data_2[31..0] respectively. It also sign-extends the immediate value to 32 bits, and sends it out on the output bus Sign_extend[31..0].
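The field extraction and sign extension described above can be illustrated with a few shifts and masks. This is a Python sketch of the bit slicing only; the function name decode_fields and the dictionary keys are hypothetical, and the register file read itself is left out.

```python
def decode_fields(instruction):
    """Split a 32-bit instruction into the fields used by the Decode stage."""
    rs = (instruction >> 21) & 0x1F   # Instruction[25..21]
    rt = (instruction >> 16) & 0x1F   # Instruction[20..16]
    rd = (instruction >> 11) & 0x1F   # Instruction[15..11]
    imm = instruction & 0xFFFF        # Instruction[15..0]
    # Sign extension: copy bit 15 into the upper 16 bits
    sign_extend = imm | 0xFFFF0000 if imm & 0x8000 else imm
    return {"rs": rs, "rt": rt, "rd": rd, "sign_extend": sign_extend}


# Example: the encoding of "lw $8, -4($29)" is 0x8FA8FFFC
fields = decode_fields(0x8FA8FFFC)
```

All five register-address bits and the immediate are extracted for every instruction; the control signals decide later which of them are actually used.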
Stage 3: Execution
The execution stage consists of an arithmetic logic unit (ALU) which manipulates data according to signals passed to it from the control unit.
The Execute block begins by taking data from the Decode stage. The first read register ($rs) is read on the input bus Read_data_1[31..0] and the second read register ($rt) is read on the input bus Read_data_2[31..0]. The immediate value is read on the input bus Sign_extend[31..0]. Next, the Execute block takes in control signals. The instruction op_code is read on the input bus Function_opcode[5..0]. From the control block, the ALU op code is read on the input bus ALUOp[1..0], and the load/store signal is read on the input bit ALUSrc. From the Ifetch block, the address of the next instruction is read in on the input bus PC_plus_4[9..0].
Though the Execute block also reads input bits clock and reset, these signals are, again, currently never used.
The Execute block then generates its own ALU control bits based on the instruction op code. Based on these control bits, the ALU executes either an AND operation, an OR operation, an addition, or a subtraction. The ALU has the potential to execute three other operations, as well. If the control bits do not match any of these operations, the output of the ALU is all zeroes. This result is sent out on the bus ALU_Result(31..0). Once the operation is performed, the Execute block checks whether the result is equal to zero, and, if it is, the output bit zero is set to '1'. Finally, the address of a branch is calculated by adding the immediate value to the address of the next instruction. This result is sent out on the bus Add_result(7..0).
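As a rough behavioral sketch of the ALU described above, the following Python function selects the second operand with ALUSrc, performs one of the four operations, and derives the zero flag. The 3-bit control encodings are the common textbook MIPS values and are an assumption here, as is the function name execute; the branch address adder is not modeled.

```python
MASK32 = 0xFFFFFFFF

# Assumed ALU control encodings (textbook MIPS convention)
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB = 0b000, 0b001, 0b010, 0b110


def execute(alu_ctl, read_data_1, read_data_2, sign_extend, alu_src):
    """Return (ALU result, zero flag) for one instruction."""
    a = read_data_1 & MASK32
    # ALUSrc picks the immediate for loads and stores
    b = (sign_extend if alu_src else read_data_2) & MASK32
    if alu_ctl == ALU_AND:
        result = a & b
    elif alu_ctl == ALU_OR:
        result = a | b
    elif alu_ctl == ALU_ADD:
        result = a + b
    elif alu_ctl == ALU_SUB:
        result = a - b
    else:
        result = 0  # unmatched control bits yield all zeroes
    result &= MASK32  # wrap to 32 bits, as the hardware bus would
    return result, int(result == 0)
```

A branch-on-equal uses the subtraction path: when both operands match, the result is zero and the zero flag is raised.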
Stage 4: Memory
The memory stage is used to access the data memory, either to read or write data.
The dmemory block begins by reading in the address of the targeted memory on the input bus address(7..0). It takes the data that may be written into this memory location on the input bus write_data(31..0). The control signals that indicate reading and writing are sent in on the input bits Memread and Memwrite, respectively. Finally, the dmemory block also reads the usual input bits clock and reset.
The dmemory block is actually very simple. If a write operation is signaled, it writes write_data(31..0) to the address address(7..0) in a file called dmemory.mif. Similarly, if a read operation is signaled, it reads the data at address address(7..0), again from dmemory.mif, and sends it out on the bus read_data(31..0). The block is designed so that it writes when the clock signal is low and reads when the clock signal is high.
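The read and write behavior above can be modeled with a small Python class. This is only a sketch: the class name DataMemory and its step method are hypothetical, the memory is a plain list of 256 words instead of the contents of dmemory.mif, and the clock is modeled by its level, following the description above.

```python
class DataMemory:
    """Behavioral model of a 256-word data memory (illustrative only)."""

    def __init__(self, words=256):
        self.cells = [0] * words

    def step(self, clock, address, write_data, mem_read, mem_write):
        address &= 0xFF  # address(7..0) is 8 bits wide
        if clock == 0 and mem_write:
            # Writes happen while the clock is low
            self.cells[address] = write_data & 0xFFFFFFFF
        if clock == 1 and mem_read:
            # Reads happen while the clock is high
            return self.cells[address]
        return None  # nothing is driven on read_data otherwise


mem = DataMemory()
mem.step(0, 4, 123, mem_read=0, mem_write=1)
value = mem.step(1, 4, 0, mem_read=1, mem_write=0)
```

Splitting the two operations across clock levels lets a store and a following load to the same address complete in consecutive half cycles in this model.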
Stage 5: Write Back
The write back stage is used to update data in the registers.
The writebackmux is used to select which data may be written into a register. This could be data from memory, on the data1(31..0) input bus, or a calculation result from the ALU, on the data0(31..0) input bus. The choice is made by the signal sel.
From here, we return to the Idecode block. In this stage, the Idecode block first takes data from the writebackmux on the input bus read_data(31..0). The results from the Execute stage are also directly read in on the input bus ALU_result(31..0). Then, the Idecode block checks the input bit MemtoReg. If MemtoReg is set to '0', then the data in ALU_result(31..0) is used. If the MemtoReg is set to '1', then the data in read_data(31..0) is used. Then, the Idecode block waits until the clock signal changes to be '1'. Upon this change, the input bit RegWrite and the input bus write_dest(4..0) are checked. RegWrite is a control signal that indicates that we actually want to overwrite a register. write_dest(4..0) is the address of the register we want to change. If RegWrite is set to '1', and if the write_dest(4..0) is a valid register, then the register is overwritten with the data indicated earlier. Finally, if the reset signal is set to '1', all registers are signaled to reset themselves.
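The selection and register update described above can be sketched in Python as follows. The function name write_back is hypothetical, and the sketch assumes that a "valid register" means any register other than register 0, which is hardwired to zero in MIPS; the provided block may apply a different check.

```python
def write_back(registers, mem_to_reg, reg_write, write_dest,
               alu_result, read_data, reset=0):
    """Update the register file in place on a rising clock edge."""
    if reset:
        # reset = '1' clears every register
        for i in range(len(registers)):
            registers[i] = 0
        return
    # MemtoReg chooses between memory data and the ALU result
    data = read_data if mem_to_reg else alu_result
    # Assumption: register 0 is not writable
    if reg_write and 0 < write_dest < len(registers):
        registers[write_dest] = data & 0xFFFFFFFF


regs = [0] * 32
write_back(regs, mem_to_reg=1, reg_write=1, write_dest=8,
           alu_result=111, read_data=222)
```

After the call in the example, register 8 holds the value that came from memory, because MemtoReg was set.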
The code for these blocks was written and provided by the teaching assistant for the course, Junjie Qian.
Initially, the processor only takes one instruction at a time, and must wait for it to go through all five stages before beginning the next instruction. However, we can significantly improve throughput by pipelining our instructions. Pipelining allows us to begin the next instruction immediately after the current instruction moves on to the next stage.
To pipeline our processor, we constructed pipeline registers. These registers save the status of the control flags between each stage. This way, the control block can be freed up to be used by the next instruction. As an instruction moves from stage to stage, only the flags that are still needed in future stages are kept.
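The idea that each pipeline register keeps only the flags still needed downstream can be shown with a small Python sketch. The grouping of flags by stage follows the usual textbook split and is an assumption, as are the names latch, EX_FLAGS, MEM_FLAGS, and WB_FLAGS; the actual registers also carry data values, which are omitted here.

```python
# Assumed grouping of control flags by the stage that consumes them
EX_FLAGS = ("RegDst", "ALUSrc", "ALUOp")
MEM_FLAGS = ("Branch", "MemRead", "MemWrite")
WB_FLAGS = ("RegWrite", "MemtoReg")


def latch(flags, keep):
    """Copy only the listed flags into the next pipeline register."""
    return {name: flags[name] for name in keep}


# Control output for a load instruction
control = {"RegDst": 0, "ALUSrc": 1, "ALUOp": 0, "Branch": 0,
           "MemRead": 1, "MemWrite": 0, "RegWrite": 1, "MemtoReg": 1}

id_ex = latch(control, EX_FLAGS + MEM_FLAGS + WB_FLAGS)  # Decode -> Execute
ex_mem = latch(id_ex, MEM_FLAGS + WB_FLAGS)              # Execute -> Memory
mem_wb = latch(ex_mem, WB_FLAGS)                         # Memory -> Write Back
```

Each register is narrower than the one before it, so by the time the load reaches Write Back only RegWrite and MemtoReg are still being carried.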
<come back with more detail/pictures>
Furthermore, we also adjusted our processor to support data forwarding. Typically, a processor runs just as the assembly code dictates: all computations must be stored in a register or in memory, then it must be loaded up to be used. However, it is often the case that a computation from the ALU is immediately used by the following instruction. Therefore, instead of waiting for the results to be stored at the end of stage 5, we can wire the ALU results directly back to the ALU as a possible input option, controlled by an additional flag.
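A minimal Python sketch of the forwarding decision is given below. It covers only the case described above, where the ALU result of the previous instruction feeds an ALU input of the current one. The function names and the comparison against register 0 follow the usual textbook forwarding condition and are assumptions, not a description of our actual forwarding unit.

```python
def forward_alu_result(prev_reg_write, prev_dest, src_reg):
    """Return True when the previous ALU result should bypass the registers."""
    # Forward only if the previous instruction writes a register,
    # that register is not register 0, and it is the one being read now.
    return bool(prev_reg_write and prev_dest != 0 and prev_dest == src_reg)


def alu_operand(register_value, forwarded_alu_result, forward):
    """Mux in front of an ALU input, controlled by the forwarding flag."""
    return forwarded_alu_result if forward else register_value


# add $3, $1, $2 followed by sub $5, $3, $4: $3 must be forwarded
flag = forward_alu_result(prev_reg_write=1, prev_dest=3, src_reg=3)
operand = alu_operand(register_value=0, forwarded_alu_result=42, forward=flag)
```

The same check would be applied separately to each of the two ALU inputs, since either source register may match the previous destination.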
<check for other forwarding>
There are two types of hazards that were introduced when we implemented the pipelined processor: data hazards and control hazards.
Data Hazards
Even with forwarding and bypassing in the pipeline, there is still a scenario in which the processor must be stalled so that the register values are ready for the next instruction. This occurs when the target register of a load instruction is used by the next sequential instruction. When the hazard unit component detects this scenario, it raises the stall flag. The design of our stall inserts a bubble in place of the current instruction and stalls the PC.
Unfortunately, we ran into an issue that caused the stall to not work properly. The stall flag was raised correctly, but the instruction was not updated immediately, and the current instruction proceeded to be executed. To mitigate this issue, we used the assembler to insert a nop instruction whenever this data hazard occurs.
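The load-use condition that the hazard unit looks for can be written as a single comparison. The sketch below is in Python and uses the textbook form of the check; the function and parameter names are hypothetical and do not correspond to signal names in our design.

```python
def load_use_stall(ex_mem_read, ex_dest, next_rs, next_rt):
    """Return True when the instruction behind a load must be stalled.

    ex_mem_read: MemRead flag of the instruction now in Execute
    ex_dest:     register the load will write
    next_rs/rt:  source registers of the instruction now in Decode
    """
    return bool(ex_mem_read and ex_dest in (next_rs, next_rt))


# lw $1, 0($2) followed by add $3, $1, $4 needs one bubble
stall = load_use_stall(ex_mem_read=1, ex_dest=1, next_rs=1, next_rt=4)
```

When the flag is raised, the PC keeps its value for one cycle and the control flags entering Execute are zeroed, which is what turns the slot into a bubble.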
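The assembler workaround can be sketched as a pass over the program that adds a nop after any load whose result is needed by the very next instruction. This Python sketch uses a hypothetical instruction representation, a tuple of (mnemonic, destination register, source registers), and is not the code of our assembler.

```python
NOP = ("nop", None, ())


def insert_load_use_nops(program):
    """Return a copy of the program with nops after hazardous loads."""
    out = []
    for i, instr in enumerate(program):
        out.append(instr)
        op, dest, _sources = instr
        if op == "lw" and i + 1 < len(program):
            next_sources = program[i + 1][2]
            if dest in next_sources:
                # The next instruction reads the loaded register
                out.append(NOP)
    return out


program = [
    ("lw", 1, (2,)),      # lw  $1, 0($2)
    ("add", 3, (1, 4)),   # add $3, $1, $4  (uses $1 right away)
    ("sub", 5, (3, 1)),   # sub $5, $3, $1
]
padded = insert_load_use_nops(program)
```

Loads whose result is not used immediately are left alone, so the pass only costs a cycle where the hardware stall would have been needed.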
Our design accounts for control hazards that are introduced when branching in a pipelined processor. We used an always-not-taken branch approach, which we describe further in the Task 3 section.
<Verification for this task can be viewed in Final Report. We will also include it here when we are able. Thank you for your patience.>