Here you can ask questions and find or give answers to organizational, academic and other questions about studying computer science.

925 questions

1k answers

1.4k comments

441 users

0 votes
Hello,

I have some questions regarding Modulo Scheduling and how the schedule to the program on slide 78 looks if we change some things.

What influence does the VLIW word width have? Is the VLIW width the maximum amount of "columns" the schedule may have? And am I right that if the VLIW width would be 6 it could also be that because of dependencies one can not create VLIWs wider than 4 (for example)?

I also do not fully understand the meaning of data dependencies for the schedule. Regarding slide 78: am I right that we could execute several I2 from different iterations in parallel because there is no arrow from I2 to I2? But in this example thats not possible because there's a dependency between I1 and I2 and also I1 and I1? I.e. if I1 would have no dependency to itself we could change the schedule such that the prolog and epilog would be empty?

What would the schedule look like if we change I1 to: a[i+2] = a[i] + 1? Would I1 only appear in every second VLIW?

I hope my question are clear enough because I am a little bit confused right now.

Thanks in advance for any help!

Best regards
Marvin
in * TF "Emb. Sys. and Rob." by (1.2k points)

1 Answer

+1 vote
 
Best answer

We have to distinguish here between code generation and a compiler optimization technique. Modulo scheduling as well as the other techniques mentioned in that chapter are at first compiler optimization techniques that aim at increasing the amount of ILP of the program. To this end, the loop instructions are rescheduled by modulo scheduling such that the optimized loop will have more ILP. Once more ILP is available, the processors can make use of it. For a VLIW processor, this means that it can now better utilize its instruction tuples, i.e., has to fill in less NOP operations in the available slots. Hence, the width of the kernel does not necessarily correspond with the width of the instruction words of the VLIW processor (to answer the first question).

About the second question: Yes, you can execute even all I2 instructions in parallel, provided that all I1 instructions would have already been executed. However, that is not wise, since the I1 instructions all depend on each other and therefore you would first have to execute them all sequentially before you could execute the remaining instructions. Those could then be executed in parallel, i.e., first all I2 instructions, then all I3 instructions etc. That is however the way a vector computer will handle this loop which you will learn in a later chapter.

If the dependency between I1 and I2 would not exist, modulo scheduling will still generate a schedule as shown, since it simply will schedule a number of loop iterations with a delay of pi in parallel to generate a kernel. It does not reorder the instructions of the loop body (at first). That reordering is another technique that should be applied independent of modulo scheduling.

If we would change I1 from a[i+1] = a[i]+1 to a[i+2] = a[i]+1, we would get the following program:

    for(i=1;i<=n;i++){
        I1: a[i+2] = a[i]+1; 
        I2: b[i] = a[i+1]/2; 
        I3: c[i] = b[i]+3; 
        I4: d[i] = c[i]-1;
    }

This is now similar to the example on page 90, and it changes that the dependence between I1 to itself to (2,1) and the dependence from I2 to I1 to (1,1). This reduces the initiation interval, so we can execute the following

        I1[0]   I1[1] 
        I2[0]   I2[1]   I1[2]   I1[3]
        I3[0]   I3[1]   I2[2]   I2[3]   I1[4]   I1[5]
        I4[0]   I4[1]   I3[2]   I3[3]   I2[4]   I2[5]   I1[6]   I1[7]
        ---------------------------------------------------------------
        I4[i]   I4[i+1] I3[i+2] I3[i+3] I2[i+4] I2[i+5] I1[i+6] I1[i+7]
        ---------------------------------------------------------------
                        I4[n-5] I4[n-4] I3[n-3] I3[n-2] I2[n-1] I2[n]
                                        I4[n-3] I4[n-2] I3[n-1] I3[n]
                                                        I4[n-1] I4[n]

where the kernel must iterate for i=1..n-7.
by (143k points)
selected by
Imprint | Privacy Policy
...