+1 vote

I wanted to know if my understanding is correct for the below scenarios:

1) 

for i=0..N {

    for j=0..M{

        for k=0..P{

               do_something();

}}}

Taking below into account

Loop3 =  the most inner loop, iterating over k

Loop2 = the middle loop, iterating over j

Loop1 = the most outer loop, iterating over i

Case 1:

There's no pause, so everything is done in parallel with N*M*P processors, O(n^3), and in O(1) time. (3 clocks?)

Case 2:

for i=0..N {

    for j=0..M{

        for k=0..P{

               do_something();

}}

pause;

}

this affects loop1 and makes it sequential, therefore M*P, O(n^2), processors are needed and the time is increased to O(n), (N clocks)

Case 3:

for i=0..N {

    for j=0..M{

        for k=0..P{

               do_something();

}

pause;

}

}

This affects the middle loop, so now N*P processors are needed, O(N^2), and time is O(N), M clocks.

Case 4 (My main question where I am not doing it correctly)

for i=0..N {

    for j=0..M{

        for k=0..P{

               do_something();

pause;

}

}

}

This affects the inner loop, Loop3, and makes it sequential. So now two outer loops are done in parallel so N*M processors are needed, O(N^2), and since 0->P is sequential but for all N*M loops it is done in parallel this will take P clocks, O(N).

I know this had to be O(N^3) processors with O(1) time though. But can't really see it here.

in * TF "Emb. Sys. and Rob." by (470 points)

1 Answer

+2 votes
 
Best answer

Let's consider your question in a more general setting: For a given statement S, you want to compute the minimal number of processors P(S), the work W(S), and the depth D(S). In contrast to the work W(S), P(S) and D(S) depend on scheduling the atomic actions. The possible schedules must thereby respect the data dependencies and possible pause statements of a program that enforce a sequential execution.

In the following, things will become clearer if we distinguish between sequential and parallel loops. A sequential loop seqfor(i=0..n-1) S[i] is an abbreviation for S[0];...;S[n-1], we have the following:

  • P(seqfor(i=0..n-1) S[i]) := max{P(S[i])| i=0,...,n-1}
  • W(seqfor(i=0..n-1) S[i]) := W(S[0]) + ... + W(S[n-1])
  • D(seqfor(i=0..n-1) S[i]) := D(S[0]) + ... + D(S[n-1])

A parallel loop parfor(i=0..n-1) S[i] is equivalent to the parallel execution S[0] || ... || S[n-1], i.e., all S[i] start at the same time and run in parallel. Thus, we have

  • P(parfor(i=0..n-1) S[i]) := P(S[0]) + ... + P(S[n-1])
  • W(parfor(i=0..n-1) S[i]) := W(S[0]) + ... + W(S[n-1])
  • D(parfor(i=0..n-1) S[i]) := max{D(S[i])| i=0,...,n-1}

The above definitions are justified in that we use the required number of processors to start all S[i] in parallel, the work is obvious, and the depth is then the time required by the S[i] that terminates last. 

Whether there is a pause statement in S[i] does not make a difference for the above formulas, it matters more whether there are data dependencies among the S[i] that will make it necessary to use a sequential loop. If there are no data dependencies, we can replace a sequential loop with a parallel one. 

Now, let's consider your loop nestings with the assumption that P(S[i,j,k])<=p, W(S[i,j,k])<=w, D(S[i,j,k])<=d holds. 

First, a completely parallel loop nesting L1:

    parfor(i=0..m-1) {
        parfor(j=0..n-1) {
            parfor(k=0..r-1) {
                S[i,j,k];
            }
        }
    }
  • P(L1) := sum_{i,j,k} P(S[i,j,k]) <= m*n*r*p
  • W(L1) := sum_{i,j,k} W(S[i,j,k]) <= m*n*r*w
  • D(L1) := max_{i,j,k} D(S[i,j,k]) <= d

Second, a sequential loop with two parallel loops L2:

    seqfor(i=0..m-1) {
        parfor(j=0..n-1) {
            parfor(k=0..r-1) {
                S[i,j,k];
            }
        }
    }
  • P(L2) := max_{i} sum_{j,k} P(S[i,j,k]) <= n*r*p
  • W(L2) := sum_{i} sum_{j,k} W(S[i,j,k]) <= m*n*r*w+m
  • D(L2) := sum_{i} max_{j,k} D(S[i,j,k]) <= m*d

Third, a sequential loop with two parallel loops L3:

    seqfor(i=0..m-1) {
        seqfor(j=0..n-1) {
            parfor(k=0..r-1) {
                S[i,j,k];
            }
        }
    }
  • P(L3) := max_{i,j} sum_{k} P(S[i,j,k]) <= r*p
  • W(L3) := sum_{i,j} sum_{k} W(S[i,j,k]) <= m*n*r*w
  • D(L3) := sum_{i,j} max_{k} D(S[i,j,k]) <= m*n*d

Fourth, consider a sequential loop nesting L4:

    seqfor(i=0..m-1) {
        seqfor(j=0..n-1) {
            seqfor(k=0..r-1) {
                S[i,j,k];
            }
        }
    }
  • P(L4) := max_{i,j,k} P(S[i,j,k]) <= p
  • W(L4) := sum_{i,j,k} W(S[i,j,k]) <= m*n*r*w
  • D(L4) := sum_{i,j,k} D(S[i,j,k]) <= m*n*r*d

Fifth, consider a sequential loop nesting L5:

    parfor(i=0..m-1) {
        seqfor(j=0..n-1) {
            seqfor(k=0..r-1) {
                S[i,j,k];
            }
        }
    }
  • P(L5) := sum_{i} max_{j,k} P(S[i,j,k]) <= m*p
  • W(L5) := sum_{i} sum_{j,k} W(S[i,j,k]) <= m*n*r*w
  • D(L5) := max_{i} sum_{j,k} D(S[i,j,k]) <= n*r*d

Sixth, consider a sequential loop nesting L6:

    parfor(i=0..m-1) {
        parfor(j=0..n-1) {
            seqfor(k=0..r-1) {
                S[i,j,k];
            }
        }
    }
  • P(L6) := sum_{i,j} max_{k} P(S[i,j,k]) <= m*n*p
  • W(L6) := sum_{i,j} sum_{k} W(S[i,j,k]) <= m*n*r*w
  • D(L6) := max_{i,j} sum_{k} D(S[i,j,k]) <= r*d

In the parallel computing course, synchronous programs were often used to describe PRAM algorithms. These programs distinguish between micro and macro steps in that micro steps are all atomic statements except for the pause statement that declares the end of one macro step and the beginning of the next one.

The depth is then not just the number of macro steps (which means synchronous time steps or in other words reaction steps). Instead, it also has to consider the longest dependency chain of the micro step actions of each of these macro steps and has to sum these up. Now this brings in some further difficulties in the analysis, since a sequential loop of micro-steps without data dependencies will still be executed in zero time, and is therefore the same as a parallel loop. This changes if there is pause statement in the loop body or if there are data dependencies. 

So, if a for-loop's body contains a pause statement, you should read it as a sequential loop, otherwise, it is the same as a parallel loop. In both cases, you also have to consider the data dependencies. For example, 

    for(i=0..n-1)
        x[i] = i;

can be executed on n processors in one step, while 

    for(i=1..n-1)
        x[i] = x[i-1];

has sequential dependencies and requires n-1 steps where at most one processor can be kept busy. Adding a pause statement to the loop body, makes both the same: a really sequential execution of O(n) atomic actions where at most one processor is used. 

by (96.2k points)
selected by
Thank you for the detailed answer.

In below example:

module BoolMatrixMult([D1][D2]bool ?a,[D2][D3]bool ?b,[D1][D3]bool !c) {
  for(i=0..D1-1) {
    for(j=0..D2-1) {
      for(k=0..D3-1) {
    pause;
       if(a[i][k] and b[k][j]) c[i][j] = true;
      }
    }
 
  }
}


Which is the equivalent of the sixth case:

  parfor(i=0..m-1) {
        parfor(j=0..n-1) {
            seqfor(k=0..r-1) {
                S[i,j,k];
            }
        }
    }
P(L6) := sum_{i,j} max_{k} P(S[i,j,k]) <= m*n*p
W(L6) := sum_{i,j} sum_{k} W(S[i,j,k]) <= m*n*r*w
D(L6) := max_{i,j} sum_{k} D(S[i,j,k]) <= r*d

I don't recall why the accepted answer by the online tool was:
Depth: O(N^3)
Processor count: O(1)
The tool is right with the answer: The innermost loop contains a pause, thus all loops contain a pause, and therefore they are to be read as sequential loops. It is therefore of type L4 which was computed in the same way. Maybe the following could be useful: Use the following input for the teaching tool http://es.cs.uni-kl.de/tools/teaching/Quartz.html and compile it to an EFSM. You will see states that have to be scheduled one after the other (these are macro steps), while the actions within a state can be scheduled according to their data dependencies (in this case all can be done in parallel):

macro D1 = 2;
macro D2 = 2;
macro D3 = 2;
module BoolMatrixMult() {
    [D1][D2][D3]nat x;
    for(i=0..D1-1) {
        for(j=0..D2-1) {
            for(k=0..D3-1) {
                x[i][j][k] = i+j+k;
                pause;
            }
        }
    }
}

Then, move the pause outside and check the differences in the generated output.
Imprint | Privacy Policy
...