Let's consider your question in a more general setting: For a given statement S, you want to compute the minimal number of processors P(S), the work W(S), and the depth D(S). In contrast to the work W(S), P(S) and D(S) depend on scheduling the atomic actions. The possible schedules must thereby respect the data dependencies and possible pause statements of a program that enforce a sequential execution.
In the following, things will become clearer if we distinguish between sequential and parallel loops. A sequential loop seqfor(i=0..n-1) S[i] is an abbreviation for S[0];...;S[n-1], we have the following:
- P(seqfor(i=0..n-1) S[i]) := max{P(S[i])| i=0,...,n-1}
- W(seqfor(i=0..n-1) S[i]) := W(S[0]) + ... + W(S[n-1])
- D(seqfor(i=0..n-1) S[i]) := D(S[0]) + ... + D(S[n-1])
A parallel loop parfor(i=0..n-1) S[i] is equivalent to the parallel execution S[0] || ... || S[n-1], i.e., all S[i] start at the same time and run in parallel. Thus, we have
- P(parfor(i=0..n-1) S[i]) := P(S[0]) + ... + P(S[n-1])
- W(parfor(i=0..n-1) S[i]) := W(S[0]) + ... + W(S[n-1])
- D(parfor(i=0..n-1) S[i]) := max{D(S[i])| i=0,...,n-1}
The above definitions are justified in that we use the required number of processors to start all S[i] in parallel, the work is obvious, and the depth is then the time required by the S[i] that terminates last.
Whether there is a pause statement in S[i] does not make a difference for the above formulas, it matters more whether there are data dependencies among the S[i] that will make it necessary to use a sequential loop. If there are no data dependencies, we can replace a sequential loop with a parallel one.
Now, let's consider your loop nestings with the assumption that P(S[i,j,k])<=p, W(S[i,j,k])<=w, D(S[i,j,k])<=d holds.
First, a completely parallel loop nesting L1:
parfor(i=0..m-1) {
parfor(j=0..n-1) {
parfor(k=0..r-1) {
S[i,j,k];
}
}
}
- P(L1) := sum_{i,j,k} P(S[i,j,k]) <= m*n*r*p
- W(L1) := sum_{i,j,k} W(S[i,j,k]) <= m*n*r*w
- D(L1) := max_{i,j,k} D(S[i,j,k]) <= d
Second, a sequential loop with two parallel loops L2:
seqfor(i=0..m-1) {
parfor(j=0..n-1) {
parfor(k=0..r-1) {
S[i,j,k];
}
}
}
- P(L2) := max_{i} sum_{j,k} P(S[i,j,k]) <= n*r*p
- W(L2) := sum_{i} sum_{j,k} W(S[i,j,k]) <= m*n*r*w+m
- D(L2) := sum_{i} max_{j,k} D(S[i,j,k]) <= m*d
Third, a sequential loop with two parallel loops L3:
seqfor(i=0..m-1) {
seqfor(j=0..n-1) {
parfor(k=0..r-1) {
S[i,j,k];
}
}
}
- P(L3) := max_{i,j} sum_{k} P(S[i,j,k]) <= r*p
- W(L3) := sum_{i,j} sum_{k} W(S[i,j,k]) <= m*n*r*w
- D(L3) := sum_{i,j} max_{k} D(S[i,j,k]) <= m*n*d
Fourth, consider a sequential loop nesting L4:
seqfor(i=0..m-1) {
seqfor(j=0..n-1) {
seqfor(k=0..r-1) {
S[i,j,k];
}
}
}
- P(L4) := max_{i,j,k} P(S[i,j,k]) <= p
- W(L4) := sum_{i,j,k} W(S[i,j,k]) <= m*n*r*w
- D(L4) := sum_{i,j,k} D(S[i,j,k]) <= m*n*r*d
Fifth, consider a sequential loop nesting L5:
parfor(i=0..m-1) {
seqfor(j=0..n-1) {
seqfor(k=0..r-1) {
S[i,j,k];
}
}
}
- P(L5) := sum_{i} max_{j,k} P(S[i,j,k]) <= m*p
- W(L5) := sum_{i} sum_{j,k} W(S[i,j,k]) <= m*n*r*w
- D(L5) := max_{i} sum_{j,k} D(S[i,j,k]) <= n*r*d
Sixth, consider a sequential loop nesting L6:
parfor(i=0..m-1) {
parfor(j=0..n-1) {
seqfor(k=0..r-1) {
S[i,j,k];
}
}
}
- P(L6) := sum_{i,j} max_{k} P(S[i,j,k]) <= m*n*p
- W(L6) := sum_{i,j} sum_{k} W(S[i,j,k]) <= m*n*r*w
- D(L6) := max_{i,j} sum_{k} D(S[i,j,k]) <= r*d
In the parallel computing course, synchronous programs were often used to describe PRAM algorithms. These programs distinguish between micro and macro steps in that micro steps are all atomic statements except for the pause statement that declares the end of one macro step and the beginning of the next one.
The depth is then not just the number of macro steps (which means synchronous time steps or in other words reaction steps). Instead, it also has to consider the longest dependency chain of the micro step actions of each of these macro steps and has to sum these up. Now this brings in some further difficulties in the analysis, since a sequential loop of micro-steps without data dependencies will still be executed in zero time, and is therefore the same as a parallel loop. This changes if there is pause statement in the loop body or if there are data dependencies.
So, if a for-loop's body contains a pause statement, you should read it as a sequential loop, otherwise, it is the same as a parallel loop. In both cases, you also have to consider the data dependencies. For example,
for(i=0..n-1)
x[i] = i;
can be executed on n processors in one step, while
for(i=1..n-1)
x[i] = x[i-1];
has sequential dependencies and requires n-1 steps where at most one processor can be kept busy. Adding a pause statement to the loop body, makes both the same: a really sequential execution of O(n) atomic actions where at most one processor is used.