# Execution time of vector programs with pipeline chaining

Hello,

as discussed in the Q&A session today, I wanted to ask about a formula to calculate the execution time of vector programs with pipeline chaining.

In the exercises there was such a question and I tried to calculate the longest chain like on slide 72 but I guess I did something wrong.

Maybe you could show that with an example.

+1 vote

Thank you for this question! Let's consider the following program

```    I[1,1]: LV V6,0(R1)
I[2,1]: MULVS V8,V6,R4
I[2,2]: LV V10,0(R2)
I[4,1]: SV V10,0(R2)
```

Due to the data dependencies I[1,1]->I[2,1] (RAW on V6), I[3,1]->I[2,2] (RAW on V10) and I[4,1]->I[3,1] (RAW on V10), we get four convoys, which are given by the first index of each instruction.

First of all, recall that on slide 25, we agreed upon a pipeline architecture that has execution pipelines of the following lengths:

```    load/store      12
division        20
multiplication   7
others           6 ```

If we do *not* have pipeline chaining, an instruction that has a conflict to a previous instruction must not be executed before its operands are *completely* available. Hence, (consider slide 38), when we send I[1,1] to an execution unit at time t=0, the first entry of its result vector leaves the execution pipeline at time 12 (since the load pipeline has 12 stages), and its final entry is leaving the pipeline at time l+11 where l is the length of the vectors. This is the earliest point of time where we can send I[2,1] to the execution pipeline so that the first entry of its result vector leaves the pipeline at time l+18 and the last entry leaves it at time 2l+17. Instruction I[2,2] does not have a dependency to I[2,1] and can therefore run in parallel to I[2,1], so that we can start it already at time t+12.

start first entry last entry
I[1,1]012l+11
I[2,1]l+11l+182*l+17
I[2,2]l+12l+24 2*l+23
I[3,1]2*l+232*l+29  3*l+28
I[4,1]3*l+283*l+40 4*l+39

Now, what is changed if we have pipeleine chaining? Pipeline chaining means that we start the execution of an instruction already when the first entry is available and don't have to wait for the final entry. For the above example this means the following updates to slide 38:

start first entry last entry
I[1,1]012l+11
I[2,1]1219l+18
I[2,2]1325l+24
I[3,1]2531l+30
I[4,1]3143l+42

As can be seen, the meaning of convoys, and thus their number does not matter much if we have pipeline chaining. The runtime using pipeline chaining is just determined by the vector length l (of course), and the (maximal) sum of the pipeline lengths that are chained up (note that the pipelines of I[2,1] and I[2,2] are not chained, they run in parallel).

To generalize all this by formulas, assume that we have convoys C[1],...,C[m] with instructions I[i,1],...,I[i,n[i]] in convoy C[i] and that the pipeline of instruction I[i,j] has q[i,j] many stages. Assuming that the execution of convoy C[i] starts at time t[i], then we get the following with or without pipeline chaining:

•     I[i,j] starts at time t[i]+j-1
•     the first entry of the result of I[i,j] leaves its pipeline at time t[i]+j-1+q[i,j]
•     the last entry of the result of I[i,j] leaves its pipeline at time t[i]+j-1+q[i,j]+l-1

Without pipeline chaining, we have t[i+1] = t[i]+l-2+max{j+q[i,j] | j=1..n[i]}, i.e., convoy C[i+1] can start when the final entry of the last instruction of convoy C[i] is available. Solving the recursion yields t[k] = Q[k] - 2*(k-1) with Q[k] := sum{i=1..k-1} max{j+q[i,j] | j=1..n[I]}. Finally, that means that the runtime is t[m+1] which is in O(m*l+Q[m+1]).

With pipeline chaining, we have t[i+1] = t[I]-1+max{j+q[i,j] | j=1..n[I]} and solving this recursion yields t[k] = Q[k] - (k-1) with Q[k] := sum{i=1..k-1} max{j+q[i,j] | j=1..n[I]}. Finally, that means that the runtime is t[m+1]+l-1 which is in O(l-m+Q[m+1]). Essentially, pipeline chaining increases the runtime compared with the same processor without chaining by a factor of m, i.e., the number of convoys.

Note that these formulas only apply if there is no structural conflict. For instance, if we have only one functional unit for loading and storing and have to perform two independent load instructions, then we cannot schedule the second one right after the first one. Instead, the second one has to wait until the last address of the first component vector entered the pipeline. This structural conflict adds a vector length to a pipeline chain.

For instance, for vector length l=4, the example in the discussion below yields the following figure

The runtime is then 4+12+7+6+12+4-1 = 44 and for l=64, we would have 64+12+7+6+12+64-1 = 164.

by (162k points)
edited by
Hello,

I know I've asked that question some time ago, but I have just reviewed your answer and afterwards the 5th exercise sheet exercise 4c) which asks for the execution time of the following program with pipeline chaining:

LV     V1, 0(R1)
LV     V2, 0(R2)
MULV   V3, V1, V2
SV     V4, 0(R4)

with l = 64, one L/S pipeline of death 12, one MULT pipeline of death 7 and one ADD pipeline of death 6.

At that time I calculated the execution time by calculating the longest chain. Since we only have each FU once, the longest chain should be 12 + 12 + 7 + 6 + 12 = 49. Then, according to the slides, I finally calculated the execution time like this: 1 + 64 + 49 - 2 = 112
After reviewing your answer, I kind of feel confirmed and think my answer was correct. Nevertheless Julius came to a different solution in the exercise lesson which confused me a bit since then. His answer was: 64 + 64 + 12 + 7 + 6 + 12 - 1 = 164.

So, who is actually right now?

I have added some text at the end of my previous answer since I cannot add images in the discussion here. To keep it short, Julius was right! The point is that we still have to consider structural conflicts also when we use pipeline chaining and these still add up additional latencies. The formulas considered assume a sequence of statements without structural conflicts but maybe with data dependences since we assumed to schedule an instruction as soon as the result of the first result needed for its operands is produced by the instructions in the previous convoy.
Oh, now I understand everything. Thank you!

+1 vote