Yes, you are right, we follow in that exam problem the accelerated cascading of the algorithm FirstOnePar which is very similar to the algorithm considered in this exam problem. In the lecture notes, it is assumed that the length of the array is a square number N = M^2, i.e., we partition the array into M parts of size M where M := sqrt(N). Now, this exam problem asks in essence whether that was a good choice.

So, we consider a more general approach where the array of size N = p*q is split into p groups of size q. In the first step, we compute an array y of size p such that its values y[i] are true iff the group i has an element x[i*q..(i+1)*q-1] which is between m and n. That can be done by N=p*q processors in one step, thus requiring work of N=p*q many actions.

In the next step, we apply the considered algorithm to the generated array y which has size p. Our algorithm performs work O(p^2) in time O(1) with O(p^2) many processors to do this. So, this step requires only O(p^2) many processors and work and not O(p^2+q^2) as you understood.

Having found the least i where y[i] holds, we apply the algorithm once again to the part x[i*q..(i+1)*q-1] which is done with work O(q^2) in time O(1) with O(q^2) many processors.

In total, we have work O(p^2+q^2) in O(1) time with O(p^2+q^2) many processors. Now, we want to find out which values for p and q would be best to minimize p^2+q^2. Obviously, if both would be zero we would have no work to do, so to be more precise, we have to minimize p^2+q^2 with the constraint that p*q=N holds for some constant N.

That means, q = N/p, and thus, we have to minimize f(p) = p^2 + (N/p)^2 which has a minimum at p=sqrt(N) so that p=q=sqrt(N) follows which yields optimal work O(p^2+q^2) = O(sqrt(N)^2+sqrt(N)^2) = O(N+N) = O(N).