Big Chemical Encyclopedia


Load Imbalance

The load imbalance will grow as a function of J. The earlier discussion of scaling assumed that the amount of work that a processor must perform is... [Pg.24]

In Fig. 4 we compare the timings for three different models: the simple one-K-per-processor model, the wrapped algorithm, and a model in which two states are assigned per processor sequentially. Note that up to J = 50 the one-K-per-processor job uses the smallest amount of wall-clock time. It is clear, however, that this method does not make efficient use of computer resources. The wrapped model, by contrast, scales very well and outperforms the sequential two-K-per-processor model at every J > 0, a clear illustration of the degradation of performance due to load imbalance. [Pg.27]
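The gap between the wrapped and the sequential two-per-processor assignments can be illustrated with a toy cost model. The linearly growing per-state costs below are an assumption for illustration only, not data from the study; they mimic the common situation where later states are more expensive.

```python
# Sketch comparing a "wrapped" (folded) assignment of states to processors
# with a sequential blocked assignment. Assumption (not from the text):
# per-state cost grows linearly with the state index, cost[k] = k + 1.

def sequential_assign(costs, p):
    """Assign contiguous blocks of len(costs)//p states per processor."""
    per = len(costs) // p
    return [sum(costs[per * i:per * (i + 1)]) for i in range(p)]

def wrapped_assign(costs, p):
    """Fold the state list back and forth across the processors so that
    cheap states are paired with expensive ones."""
    loads = [0.0] * p
    for k, c in enumerate(costs):
        block, pos = divmod(k, p)
        i = pos if block % 2 == 0 else p - 1 - pos  # reverse every other pass
        loads[i] += c
    return loads

costs = [k + 1 for k in range(8)]            # hypothetical costs 1..8
print(max(sequential_assign(costs, 4)))      # 15: last processor gets 7+8
print(max(wrapped_assign(costs, 4)))         # 9: every processor gets 1+8=2+7=...
```

With a linear cost profile the wrapped scheme balances the load exactly, while the sequential scheme leaves the last processor with the two most expensive states.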

An application that could be parallelized perfectly, with no load imbalance or added overhead, would simply have speed-up(P) = P. Unfortunately, this situa-... [Pg.219]

Figure 7 Speed-up curves that are more representative of actual applications on the Intel Touchstone Delta supercomputer. The ideal curve is also known as the linear speed-up curve. Each case is defined as case(T_serial, T_parallel, T_overhead). Except for the last case, each one grows as case(O(N), O(N²), O(N)), similar to a traditional Hartree-Fock application. The last case represents no serial execution time and no overhead for parallel computation; therefore, the deviation from ideal performance is due only to the load imbalance in the application.
This simple modified Amdahl's law illustrates the incentives for optimal load balancing. The case(0, 3000, 0) corresponds to a hypothetical situation in which there is no serial execution time and no overhead for communication. The deviation from linear speed-up in this case (about 10%) is due only to load imbalance. [Pg.224]
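A minimal sketch of a modified Amdahl's law in the case(T_serial, T_parallel, T_overhead) form used above. The linear growth of the overhead term with p is an illustrative assumption, not taken from the text:

```python
# Modified Amdahl's law sketch: speedup on p processes for an application
# with serial time t_serial, evenly divisible parallel time t_parallel,
# and a per-process overhead (assumed here to grow linearly in p).

def speedup(p, t_serial, t_parallel, t_overhead_per_proc=0.0):
    t1 = t_serial + t_parallel                       # one-process time
    tp = t_serial + t_parallel / p + t_overhead_per_proc * p
    return t1 / tp

# case(0, 3000, 0): no serial time, no overhead -> ideal linear speedup.
print(speedup(64, 0, 3000))        # 64.0
# Any measured shortfall from this ideal (the ~10% mentioned above)
# would then be attributable to load imbalance alone.
```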

Once a program has been implemented, measurement-based evaluation is useful to refine its implementation by highlighting sources of inefficiency like load imbalance or high communication costs. Performance visualization, based on recorded program behavior, is a particularly powerful method for detecting unexpected behavior within a parallel program. [Pg.236]

A Transputer-based systolic-loop implementation for the simulation of rigid polyatomic molecules, described by Craven and Pawley, revealed that, in contrast to many problems on MIMD computers, the main loss of efficiency arose not from the time spent in communication between processors but rather from certain extra calculation and addressing work deemed inevitable in the parallel decomposition of the problem, and from a slight load imbalance. This study provided an excellent practical example of the ability of the Transputer to communicate and calculate in parallel with minimal degradation of the computation rate, and provided very encouraging timing comparisons with the Cray X-MP. [Pg.263]

Speedup curves illustrating commonly encountered performance patterns: (a) ideal; (b) superlinear speedup with performance degradation due to, e.g., communication overhead or load imbalance; (c) logarithmic communication overhead; (d) linear communication overhead; (e) incompletely parallelized program (serial fraction of 0.025). See text for details. [Pg.79]

The presence of nonuniform computational tasks whose sizes are not known in advance is often the cause of load imbalance, and quantitative modeling of load imbalance can therefore be difficult to do. However, simulations that involve distribution of nonuniform tasks can sometimes provide empirical data that can be used to model load imbalance. In chapter 7 we will illustrate the use of empirical data to model load imbalance in the computation of the two-electron integrals, which is a required step in many quantum chemical methods. Parallel programs involving uniform computational tasks will experience load imbalance whenever the number of tasks is not a multiple of the number of processes. This kind of load imbalance is easier to include in a performance model because the amount of work assigned to the process... [Pg.82]
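For the uniform-task case just described, the imbalance bound can be sketched directly: the busiest process receives ceil(n/p) tasks, so the parallel efficiency is n / (p * ceil(n/p)).

```python
import math

# Efficiency bound for uniform tasks distributed over p processes:
# the busiest process gets ceil(n/p) tasks and all others wait for it.

def uniform_task_efficiency(n_tasks, p):
    return n_tasks / (p * math.ceil(n_tasks / p))

print(uniform_task_efficiency(100, 10))  # 1.0: 100 is a multiple of 10
print(uniform_task_efficiency(101, 10))  # ~0.918: one leftover task idles 9 processes
```

This is the sense in which uniform-task imbalance is "easier to include in a performance model": it depends only on n and p, not on measured task times.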

We have used expressions involving the latency, α, and inverse bandwidth, β, to model the communication time. An alternative model, the Hockney model, is sometimes used for the communication time in a parallel algorithm. The Hockney model expresses the time required to send a message between two processes in terms of the parameters r∞ and n1/2, which represent the asymptotic bandwidth and the message length for which half of the asymptotic bandwidth is attained, respectively. Metrics other than the speedup and efficiency are used in parallel computing. One such metric is the Karp-Flatt metric, also referred to as the experimentally determined serial fraction. This metric is intended to be used in addition to the speedup and efficiency, and it is easily computed. The Karp-Flatt metric can provide information on parallel performance characteristics that cannot be obtained from the speedup and efficiency, for instance, whether degrading parallel performance is caused by incomplete parallelization or by other factors such as load imbalance and communication overhead. [Pg.90]
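The Karp-Flatt metric is indeed easy to compute from a measured speedup ψ on p processes: e = (1/ψ - 1/p) / (1 - 1/p). A small sketch:

```python
# Karp-Flatt experimentally determined serial fraction:
#   e = (1/psi - 1/p) / (1 - 1/p),
# where psi is the measured speedup on p processes. If e stays roughly
# constant as p grows, the loss is dominated by unparallelized (serial)
# code; if e grows with p, overheads such as load imbalance or
# communication are the more likely cause.

def karp_flatt(psi, p):
    return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p)

print(karp_flatt(3.0, 4))   # ~0.111
print(karp_flatt(6.0, 16))  # ~0.111 -> consistent with a fixed serial fraction
```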

Is load imbalance likely to be an issue? Is it accounted for in the model? ... [Pg.114]

To analyze the performance of algorithms (a) and (b), we will first obtain expressions for the parallel efficiencies. For these algorithms, which are fully parallelized and involve no communication, load imbalance is the only factor that may contribute significantly to lowering the parallel efficiency, and to predict the parallel performance, we must be able to estimate this load imbalance. Additionally, to make quantitative predictions for the efficiency, it is necessary to collect some statistics for the computational times required for evaluation of the integrals in a shell quartet. [Pg.121]

We will now use these distributions to express the total execution time, that is, the maximum execution time for any one process, for algorithms (a) and (b). This time can be approximated by a sum of the average execution time and a load imbalance term proportional to the standard deviation as... [Pg.121]

We have here expressed the load imbalance as the standard deviation σ times a factor k(p), which is a function of the number of processes, and we will determine the functional form for k(p) in section 7.2.2.1. From Eqs. 7.5 and... [Pg.121]
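The expression these lines lead into is cut off in this excerpt. A form consistent with the surrounding description (average execution time plus k(p) times the standard deviation), with symbols assumed for illustration rather than copied from the original, would be:

```latex
t_{\max} \;\approx\; \bar{t} \;+\; k(p)\,\sigma
```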

We will use this functional form for k(p) in the performance models for algorithms (a) and (b) below. The term √(1.9 ln p + 5.3) increases very slowly with p and is nearly constant over a wide range of process counts. Note that the statistical approach employed in modeling the load imbalance entails using a k(p) function that does not tend to 0 as p approaches 1, and the model breaks down for very small process counts. However, if Eqs. 7.5 and 7.6 were employed simply to predict the single-process timings, a value of 0 should be used for k(1). [Pg.123]
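The slow growth of this factor is easy to check numerically. The formula below, k(p) = sqrt(1.9 ln p + 5.3), is read through the OCR damage in this excerpt and should be checked against the equations in the original chapter:

```python
import math

# k(p) factor multiplying the standard deviation in the load-imbalance
# term; assumed form sqrt(1.9 ln p + 5.3) as read from the text.

def k(p):
    return math.sqrt(1.9 * math.log(p) + 5.3)

for p in (16, 256, 4096):
    print(p, round(k(p), 2))
# Increasing p by a factor of 256 grows k(p) by well under 50%,
# illustrating why the term is "nearly constant" in practice.
```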

Let us consider a performance model for algorithm (c). We first note that, for a manager-worker model in which one process is dedicated to distributing tasks to the others, the maximum efficiency that can be attained with p processes is bounded by [(p — l)/p] x 100%. Other factors that may contribute to lowering the efficiency are load imbalance on the worker processes and communication overhead. [Pg.128]
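The (p - 1)/p efficiency ceiling for a dedicated-manager scheme can be made concrete with a one-line sketch:

```python
# Upper bound on efficiency when one of p processes only distributes
# tasks: at most p - 1 processes compute, so efficiency <= (p - 1)/p.

def manager_worker_max_efficiency(p):
    return (p - 1) / p

print(manager_worker_max_efficiency(2))    # 0.5: half the machine idles
print(manager_worker_max_efficiency(128))  # ~0.992: the ceiling nearly vanishes
```

This is why the dedicated-manager cost matters at small process counts but becomes negligible as p grows.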

The load imbalance resulting from a dynamic distribution of tasks is very difficult to model because the times required for the individual computational tasks are not known in advance. Provided that the number of tasks is much larger than the number of processes, however, it is reasonable to assume that the dynamic task distribution will enable an essentially even distribution of the load. For this to remain true as the number of processes increases, the number of tasks must increase proportionally to p. Although this is the same growth rate as obtained for a static work distribution, the actual number of tasks needed for high efficiency at a given process count is much smaller for the dynamic distribution, and the assumption of perfect load balance is therefore adequate for our purposes. [Pg.128]

Processes request tasks (atom quartets) by calling the function get_quartet, which has been implemented in both a dynamic and a static version. The dynamic work distribution uses a manager-worker model with a manager process dedicated to distributing tasks to the other processes, whereas the static version employs a round-robin distribution of tasks. When the number of processes is small, the static scheme achieves the best parallel performance because the dynamic scheme, when run on p processes, uses only p - 1 processes for computation. As the number of processes increases, however, the parallel performance of the dynamic task distribution surpasses that of the static scheme, whose efficiency is reduced by load imbalance. With the entire Fock and density matrix available to every process, no communication is required during the computation of the Fock matrix other than the fetching of tasks in the dynamic scheme. After all ABCD tasks have been processed, a global summation is required to add the contributions to the Fock matrix from all processes and send the result to every process. [Pg.135]
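As a rough in-process sketch of the two get_quartet schemes (not the book's actual implementation; the names, task representation, and greedy mock of the manager are illustrative, and a real version would sit on top of MPI):

```python
# Mock of the two task-distribution schemes behind get_quartet.

def make_static_get_quartet(tasks, rank, nprocs):
    """Static round-robin: process `rank` takes tasks rank, rank+nprocs, ...
    Returns a function that yields the next local task, or None when done."""
    it = iter(tasks[rank::nprocs])
    return lambda: next(it, None)

def simulate_dynamic(tasks, nprocs, cost):
    """Greedy stand-in for manager-worker scheduling: each task goes to
    the currently least-loaded of the nprocs - 1 workers (the manager
    itself does no computation). Returns the maximum worker load."""
    loads = [0.0] * (nprocs - 1)
    for t in tasks:
        i = loads.index(min(loads))
        loads[i] += cost(t)
    return max(loads)

tasks = list(range(12))
get_quartet = make_static_get_quartet(tasks, rank=1, nprocs=4)
print(get_quartet())                                  # 1, then 5, 9, then None
print(simulate_dynamic(tasks, 4, cost=lambda t: 1.0))  # 4.0: 12 tasks on 3 workers
```

With uniform task costs the static scheme uses all 4 processes while the dynamic one computes on only 3, matching the small-p behavior described above; with highly nonuniform costs the dynamic scheme's on-demand assignment wins instead.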

Let us develop a simple performance model for the replicated data algorithm. We will ignore load imbalance in the model and assume that the computation time can be expressed simply as... [Pg.136]
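The computation-time expression itself is cut off in this excerpt. A form consistent with a replicated-data algorithm that ignores load imbalance and divides the quartet work evenly over p processes (symbols assumed for illustration, not copied from the original) would be:

```latex
t_{\text{comp}} \;\approx\; \frac{n_{\text{quartet}}\,\bar{t}_{\text{quartet}}}{p}
```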





© 2024 chempedia.info