This is an excellent question!
However, on second thought, this does not make much sense to me, since the size of the functional units is no longer the main concern. If we can implement addition in 2 cycles, we would use that implementation in PU1 as well, where addition would then also take 2 cycles, even though other operations may take longer. With PUs implemented like this, it would not matter to which PU we send an addition (at least with respect to the completion time of the two additions). However, PU1 would then be blocked for other instructions, so we might regret using PU1 for the additions if such instructions follow.
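To make the "regret" concrete, here is a toy calculation under made-up assumptions: ADD takes 2 cycles on either PU, MULT takes 5 cycles and only runs on PU1, all operations are available for dispatch at cycle 0, and each PU executes its assigned operations back-to-back. The latencies and PU names are purely illustrative, not from any real processor.

```python
def finish_time(assignment, latency={"ADD": 2, "MULT": 5}):
    """Return the cycle at which the last operation finishes.

    assignment: list of (op, pu) pairs; each PU runs its ops
    back-to-back. Latencies are made-up example values.
    """
    busy_until = {}  # pu -> cycle at which it becomes free
    for op, pu in assignment:
        start = busy_until.get(pu, 0)
        busy_until[pu] = start + latency[op]
    return max(busy_until.values())

# Spreading the two ADDs across both PUs blocks PU1 for the MULT:
finish_time([("ADD", "PU0"), ("ADD", "PU1"), ("MULT", "PU1")])  # -> 7
# Serializing the ADDs on PU0 lets the MULT start right away:
finish_time([("ADD", "PU0"), ("ADD", "PU0"), ("MULT", "PU1")])  # -> 5
```

The two ADDs themselves finish no later in the first case; the loss only shows up through the MULT that follows, which is exactly why a dispatcher that cannot see future instructions cannot know which choice it will regret.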
All in all, I am not aware of any dispatching optimization that makes use of such considerations. Dispatch units also need to be kept as simple as possible, so I think they just send operations to the next free unit they find. Many processors also use distributed reservation stations, where each PU has its own input queue of operands. Dispatching then happens even earlier and can simply consider how many operations are queued at each unit.
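The queue-length heuristic above can be sketched in a few lines. This is only a toy model of distributed reservation stations (the class and names are hypothetical, not any real microarchitecture): each PU owns a private input queue, and the dispatcher picks, among the units that support the operation, the one with the shortest queue — without looking at per-operation latencies.

```python
from collections import deque

class PU:
    """Toy processing unit with a private input queue
    (a stand-in for a distributed reservation station)."""
    def __init__(self, name, supported_ops):
        self.name = name
        self.supported_ops = supported_ops
        self.queue = deque()

def dispatch(op, pus):
    """Send op to the eligible PU with the shortest input queue.

    Mirrors the simple dispatching described above: only queue
    lengths are considered, never future instructions.
    """
    candidates = [pu for pu in pus if op in pu.supported_ops]
    target = min(candidates, key=lambda pu: len(pu.queue))
    target.queue.append(op)
    return target

pus = [PU("PU0", {"ADD", "SUB", "CMP"}), PU("PU1", {"MULT", "DIV"})]
dispatch("ADD", pus)   # queued at PU0
dispatch("MULT", pus)  # queued at PU1
```

Note that with disjoint operation sets like these, `candidates` always has exactly one entry, which is the point made below: without overlapping functionality, the choice problem never appears.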
Also, I am not aware of any PUs with "overlapping functionality", i.e. one PU that can do everything another can, plus additional operations. Instead, we usually find ADD/SUB/COMPARE on some units and MULT/DIV on others, so the problem does not arise in the first place.
Anyway, it is an interesting question that could be explored further to optimize dynamic scheduling! However, since we cannot foresee the upcoming instructions, we may always come to regret the scheduling decisions made so far, so I doubt that there is an optimal solution for online scheduling (which is what we would have to do here).