We have to distinguish here between code generation and a compiler optimization technique. Modulo scheduling, like the other techniques mentioned in that chapter, is first and foremost a compiler optimization that aims at increasing the amount of ILP in the program. To this end, modulo scheduling reschedules the loop instructions such that the optimized loop exposes more ILP. Once more ILP is available, the processor can exploit it. For a VLIW processor, this means that it can better utilize its instruction tuples, i.e., it has to fill fewer of the available slots with NOP operations. Hence, the width of the kernel does not necessarily correspond to the width of the instruction words of the VLIW processor (to answer the first question).

About the second question: yes, you could even execute all I2 instructions in parallel, provided that all I1 instructions had already been executed. However, that is not wise: the I1 instructions all depend on each other, so you would first have to execute all of them sequentially before you could start on the remaining instructions. Those could then be executed in parallel, i.e., first all I2 instructions, then all I3 instructions, and so on. That is, however, exactly how a vector computer would handle this loop, which you will learn about in a later chapter.

If the dependence between I1 and I2 did not exist, modulo scheduling would still generate a schedule as shown, since it simply schedules a number of loop iterations, each offset by the initiation interval, in parallel to form a kernel. It does not (initially) reorder the instructions of the loop body; such reordering is a separate technique that can be applied independently of modulo scheduling.

If we changed I1 from a[i+1] = a[i]+1 to a[i+2] = a[i]+1, we would get the following program:

for (i = 1; i <= n; i++) {
    I1: a[i+2] = a[i]+1;
    I2: b[i] = a[i+1]/2;
    I3: c[i] = b[i]+3;
    I4: d[i] = c[i]-1;
}

This is now similar to the example on page 90: it changes the dependence from I1 to itself to (2,1) and the dependence from I2 to I1 to (1,1). This reduces the initiation interval, so we can execute the following:

I1[1] I1[2]
I2[1] I2[2] I1[3] I1[4]
I3[1] I3[2] I2[3] I2[4] I1[5] I1[6]
---------------------------------------------------------------
I4[i] I4[i+1] I3[i+2] I3[i+3] I2[i+4] I2[i+5] I1[i+6] I1[i+7]
---------------------------------------------------------------
I4[n-5] I4[n-4] I3[n-3] I3[n-2] I2[n-1] I2[n]
I4[n-3] I4[n-2] I3[n-1] I3[n]
I4[n-1] I4[n]
where the kernel iterates for i = 1, 3, ..., n-7; each kernel instance completes two loop iterations, so i advances by 2 (this assumes n is even, otherwise one extra partial stage is needed).