homework 3 memory unit 1 software architecture

High School
You don't have any recent items yet.
You don't have any courses yet.
You don't have any books yet.
You don't have any Studylists yet.
Information

Cs433 fa21 hw3 solutions

Computer system organization (cs 433), university of illinois at urbana-champaign, students also viewed.

Cs433 fa21 hw2 solutions
Cs433 fa20 hw1 solutions
Cs246-lecture 2 - Cs246-lecture 2
Cs433 fa20 hw7 solution
ADV393 lectures - Advertising in society and culture
MACS INFO 326 Exam Note - Exam Notes

Preview text

CS 433: Computer Architecture – Fall 2021 Homework 3 Total Points: Undergraduates (69 points), Graduates (87 points) Undergraduate students should solve only the first 4 problems. Graduate students should solve all problems. Due Date: October 7, 2021 at 10:00 pm CT (See course information slides for more details)

Directions: ● All students must write and sign the following statement at the end of their homework submission. "I have read the honor code for this class in the course information handout and have done this homework in conformance with that code. I understand fully the penalty for violating the honor code policies for this class." No credit will be given for a submission that does not contain this signed statement.

● On top of the first page of your homework solution, please write your name and

NETID, your partner’s name and NETID, and whether you are an undergrad or grad student.

● Name your homework solution file as firstname_lastname_hw3

● Please show all work that you used to arrive at your answer. Answers without justification will not receive credit. Errors in numerical calculations will not be penalized. Cascading errors will usually be penalized only once.

● See course information slides for more details.

Problem 1 [39 points]

This problem concerns Tomasulo’s algorithm. Consider the following architecture specification:

Functional Unit Type Cycles in Ex Number of Functional Units Integer 1 1 FP Adder 5 1 FP Multiplier 8 1 FP Divider 15 1

(1) Assume that you have unlimited reservation stations. (2) Memory accesses use the integer functional unit to perform effective address calculation during the EX stage. For stores, memory is accessed during the EX stage (Tomasulo’s algorithm without speculation) or commit stage (Tomasulo’s algorithm with speculation). All loads access memory during the EX stage. Loads and Stores stay in EX for 1 cycle. (3) Functional units are not pipelined. (4) If an instruction moves to its WB stage in cycle x , then an instruction that is waiting on the same functional unit (due to a structural hazard) can start executing in cycle x . (5) An instruction waiting for data on the CDB can move to its EX stage in the cycle after the CDB broadcast. (6) Only one instruction can write to the CDB in one clock cycle. Branches and stores do not need the CDB. (7) Whenever there is a conflict for a functional unit or the CDB, assume that the oldest (by program order) of the conflicting instructions gets access, while others are stalled. (8) Assume that the result from the integer functional unit is also broadcast on the CDB and forwarded to dependent instructions through the CDB (just like any floating point instruction). (9) Assume that the BNEQZ occupies the integer functional unit for its computation and spends one cycle in EX.

Part D [18 points]

For this part, assume hardware speculation and dual-issue added to the Tomasulo pipeline you used for part A. That is, assume that an instruction can issue even before the branch has completed (or started) its execution (as with perfect branch and target prediction). However, assume that an instruction after a branch cannot issue in the same cycle as the branch; the earliest it can issue is in the cycle immediately after the branch (to give time to access the branch history table and/or buffer). Any other pair of instructions can issue in the same cycle. Assume that a store calculates its target address in EX and performs its memory access during the Commit stage. Recall that stores do not write back.

Additionally, assume that your reorder buffer has 12 entries (at the beginning of execution the ROB is empty). Furthermore, two instructions can commit each cycle .

Fill in the cycle numbers in each pipeline stage for each instruction in the first two iterations of the loop represented below, assuming the branch is always taken. If a stall occurs, describe the reason for the stall. The reason should include the type of hazard; the register, functional unit, etc. that caused the dependence; and the instruction on which the given instruction is dependent (use the IS stage cycle number to identify the instruction). Additionally, note any stalls due to commit ordering. The entries for the first two instructions of the first iteration are filled in for you. CM stands for the commit stage.

Instruction IS EX WB CM Reason for Stalls

Iteration 1

LP: L F0, 0(R1) 1 2 3 4 ADD F0, F0, F6 1 438 9 10 RAW on F0 (from 1) DIV F2, F2, F0 2 20334 35 36 RAW on F0 (from 1) FP DIV unit occupied (from 3) L F0, 8(R1) 2 3 4 36 In-order commit DIV F4, F0, F8 3 5319 20 37 RAW on F0 (from 2) In-order commit S F4, 16(R1) 3 4 4 37 In-order commit DADDI R1, R1, #-24 4 5 6 38 In-order commit BNEZ R1, LP 4 7 4 38 RAW on R1 (from 4) In-order commit

Iteration 2

LP: L F0, 0(R1) 5 8 10 39 RAW on R1 (from 4) INT unit occupied (from 4) CDB occupied (from 1) In-order commit ADD F0, F0, F6 5 11315 16 39 RAW on F0 (from 5) In-order commit DIV F2, F2, F0 6 50364 65 66 RAW on F2 (from 2) FP DIV unit occupied (from 7) L F0, 8(R1) 6 10 11 66 INT unit occupied (from 5) CDB occupied (from 5) In-order commit DIV F4, F0, F8 7 35349 50 67 FP DIV unit occupied (from 2) In-order commit S F4, 16(R1) 11 12 4 67 ROB full (from 1) In-order commit DADDI R1, R1, #-24 37 38 39 68 ROB full (from 2 and 2) In-order commit BNEZ R1, LP 37 40 4 68 RAW on R1 (from 37) In-order commit

Grading: ¼ point for each entry. 0 points for fully correct answer. Cascading errors will not be penalized additionally as long as the relevant dependencies are still observed.

NOTE (fa21 only): For the L F0, 8(R1) in iteration 2 (IS 6), count <9= as correct for EX as well4this was our original answer, but is incorrect (prior L stuck in EX until cycle 10 due to CDB conflict)

Problem 2: Dynamic Branch Prediction [14 points]

Consider the following MIPS code. The register R0 is always 0.

DADDI R1, R0, R

L1: DADDI R2, R0, R

L2: daddi r2, r2,, dsubi r3, r2,.

BNEQZ R3, L2 -- Branch 1

DADDI R1, R1,

Dsubi r4, r1,.

BNEQZ R4, L1 -- Branch 2

Each table below refers to only one branch. For instance, branch 1 will be executed 12 times.

Those 12 times should be recorded in the table for branch 1. Similarly, branch 2 is executed 4

Part A [4 points]

Assume that 1-bit branch predictors are used. When the processor starts to execute the above

code, both predictors contain value N (Not taken). What is the number of correct predictions?

Use the following tables to record the prediction and action of each branch. Several entries are

filled in for you.

Step Branch 1

Actual Branch 1 Action

Step Branch 2

Actual Branch 2 Action

Solution: Total of 6 correct predictions, 4 for Branch 1 and 2 for Branch 2.

Grading: -1/2 for not stating the correct number of predictions. -1/2 for each mistake in the

table, cascading errors will not be penalized. Minimum possible score is 0.

Part C [6 points]

Now assume that 2 level global correlating predictors of the form (2, 1) are used. (Note that

global here means that the history used captures the history of all previous branches. It does not

mean that there is only one set of prediction bits for all branches.) When the processor starts to

execute the above code, the outcome of the previous two branches is not taken (N). Also assume

that the initial state of predictors of all branches is not taken (N). What is the number of correct

predictions? Use the following table to record your steps. Record the "New State" of predictors

in the form W/X/Y/Z where,

W - state corresponds to the case where the last branch and the branch before the last are

both TAKEN.

X - state corresponds to the case where the last branch is TAKEN and the branch before the

last is NOT TAKEN.

Y - state corresponds to the case where the last branch is NOT TAKEN and the branch

before the last is TAKEN.

Z - state corresponds to the case where the last branch and the branch before the last are

both NOT TAKEN.

Note: The state of the predictor at position W/X/Y/Z can be N for predict-not-taken or T for

predict-taken.

1 N T N/N/N/T

2 n t n/t/n/t, 3 n n n/t/n/t, 4 t t n/t/n/t, 5 n t t/t/n/t, 6 t n n/t/n/t, 7 t t n/t/n/t, 8 n t t/t/n/t, 9 t n n/t/n/t, 10 t t n/t/n/t, 11 n t t/t/n/t, 12 t n n/t/n/t, 1 n t n/n/t/n, 2 t t n/n/t/n, 3 t t n/n/t/n, 4 t n n/n/n/n.

Solution: 6 total correct predictions, 4 for Branch 1 and 2 for Branch 2.

Part B [2 points]

Now consider a branch-target buffer design that distinguishes conditional and unconditional branches, storing the target address for a conditional branch and the target instruction for an unconditional branch. What is the penalty or benefit in clock cycles when an unconditional branch is found in the buffer? Explain.

Solution: Storing the target instruction of an unconditional branch effectively removes one instruction. If there is a BTB hit in instruction fetch and the target instruction is available, then that instruction is fed into decode in place of the branch instruction. The penalty is 31 cycle. In other words, it is a performance gain of 1 cycle.

Grading: 2 points for correct penalty/gain with correct reason.

Problem 4 [8 points]

This problem concerns the implications of the reorder buffer size on performance. Consider a

processor implementing Tomasulo’s algorithm with reservation stations and the reorder buffer

scheme described in detail in the lecture notes. Assume infinite processor resources unless stated

otherwise; e., infinite execution units and infinite reservation stations. Assume a perfect branch

predictor and assume there are no data dependences in the instruction stream we are considering.

Assume the maximum instruction fetch rate is 12 instructions per cycle. (The other stages in the

pipeline have no constraints; e., the processor can decode an unbounded number of instructions

per cycle.)

Part A [2 points]

Suppose all instructions take one cycle to execute and the processor has an infinite reorder buffer. What is the average instructions-per-cycle rate or IPC for this processor?

Solution: The average IPC would be 12, since it is limited only by the fetch rate. There is no reason for any stall since there are no data dependencies or branch mispredicts and we have infinite resources.

Grading: 2 points for saying IPC is 12 and why it is so.

Consider the system in part (a) except that now every 48thinstruction is a load that misses in the cache and incurs a miss latency of 500 cycles. What is the average instructions-per-cycle or IPC for this processor?

Solution: The average IPC would again be 12. The ROB would mask out the latencies associated with missing load instructions, and would allow us to keep fetching and issuing 12 instructions each cycle. The misses would introduce a lag of up to 500 cycles between the fetch and commit, but the average throughput would still be 12 instructions each cycle.

Part C [4 points]

Consider the system in part (b) except that now the reorder buffer size is 48 entries. What is the average IPC for this processor? If the IPC is less than 12, then what is the smallest reorder buffer size for which the IPC will be 12 again (assume the reorder buffer size can only be a multiple of 12).

Solution: We can no longer get an IPC of 12, since the limited ROB size will cause stalls. Suppose that at cycle 1, we issue a load that misses. We would keep fetching and issuing instructions until the ROB becomes full, so we would issue 47 more instructions and then stall, until cycle 500, when the instruction at the head of the ROB completes and is ready to commit (along with the other 47 instructions). At that point, we would issue the next missing load, and this cycle would repeat. Thus, every 500 cycles, we are able to commit 48 instructions, and the IPC is 48/500.

To obtain an average IPC of 12, we need to be able to overlap the execution of 12 instructions per cycle on average during the 500 cycles that the load is stalled. These instructions cannot commit until the load commits. This requires an ROB size of 12*500=6000 instructions.

Grading: 2 points for realizing that since commit is in order, a long load miss at the head of the ROB implies that fetch and retire will stall once the ROB fills up, until the load miss completes. 1 point for getting that 48 instructions complete in 500 cycles for an IPC of 48/500. 1 point for getting that for an average IPC of 12, we need to have 12 instructions per cycle times 500 cycles = 6000 instructions in the ROB.

NOTE: If students used500 + εfor some small number cycles for however many cycles theε load was in IS, EX, WB, CM, etc. excluding the 500-cycle miss latency, count as correct4as long as this was sufficiently explained.

Part C [2 points]

Now consider a branch with the following behavior:

T, T, T, T, T, N , N , N , N , N , N , N , N , T, T, T, T, T, T, T, T, T, T, N , N , N , N , N , T, T, T, T, T, T

Would you suggest using the predictor of part (a) for this branch? If yes, why? If not, which

predictor should be used instead?

Solution: Using a 2-bit predictor would lead to 2 mispredicts each time the branch behavior

changes. It would be better to use a 1-bit predictor, it would lead to 4 mispredicts, whereas a

2-bit predictor would lead to 8 mispredicts.

Part D [2 points]

Now suppose there are two static branches in a program that are invoked alternately (Branch 1

first, then

Branch 2) with the following behavior:

Branch 1: T, N, T, T, T, N, N, N, T, N, T, N, T, T, N, T, T, N

Branch 2: N, T, N, N, N, T, T, T, N, T, N, T, N, N, T, N, N, T

What type of branch predictor would you recommend for branch 2 and why?

Solution: A (1, 1) globally correlating branch predictor would be appropriate for branch 2.

Notice that branch 2 is always branching in the opposite direction from branch 1. In other words,

branch 2’s direction is tightly correlated to branch 1’s direction. Just one bit of history and a 1-bit

counter is sufficient to see this correlation. Thus, it takes very little time for branch 2’s predictor

to be <warmed up= into a state where a single bit of history (i. the most recent direction of

branch 1) can be used to select for the correct prediction4the opposite direction of branch 1.

Branch 2’s predictor incurs only a single misprediction during this <warm up= period and

remains stable the rest of the time.

Grading: For each case, 2 points for the correct predictor and justification.

Problem 6 - GRADUATE STUDENTS PROBLEM [10 points]

In class, we studied the use of a reorder buffer to maintain precise interrupts. With this

mechanism, an instruction does not update the register file with its newly computed value until it

is committed. Instead, the new value is stored in the reorder buffer. This requires later

instructions to possibly read source operand values from the reorder buffer.

An alternative is to use a history buffer. This mechanism updates the register file as soon as the

instruction computes a new value, but it stores the previous value of the register in the history

buffer. On an interrupt, appropriate old values are restored.

Consider using the above history buffer idea to maintain precise interrupts with the standard

Tomasulo algorithm design (with reservation stations) as covered in class (no reorder buffer).

Explain the modifications to the Tomasulo pipeline for this purpose.

Hint: As with the reorder buffer, we need to split the Write stage into Complete and Commit.

Separate your answer into the following parts. You need only give the conceptual changes from

the basic Tomasulo algorithm (e., at the level discussed in the lecture notes for the reorder

Part A [2 points] Explain how the fields of the history buffer would be different from the

reorder buffer.

Solution: History buffer is similar to the reorder buffer, but it will have the old value instead of

the new value of the destination register.

Grading: 2 points for this change. Incorrect changes will result in negative ½ point for each

Part B [1 point] Describe the changes to the Issue stage.

Solution: Allocate history buffer entry (HB). So in the issue stage, as before, we read the old

value of the destination if available or point to the reservation station (RS) that will get the last

value (as indicated by the result status register).

Grading: 1 point for this change. Incorrect changes will result in negative ½ point for each

Multiple Choice

Course : Computer System Organization (CS 433)

University : university of illinois at urbana-champaign.

homework 3 memory unit 1 software architecture

Discover more from: Computer System Organization CS 433 University of Illinois at Urbana-Champaign 5 Documents Go to course
More from: Computer System Organization CS 433 University of Illinois at Urbana-Champaign 5 Documents Go to course