Speedup and Efficiency

Performance Measures for Parallel Programs, 5.2.1 Speedup and Efficiency [Pg.74]

One of the most widely used performance measures for parallel programs is the speedup, S(p), defined as the ratio of the execution time on a single process to the execution time on p processes, S(p) = T(1)/T(p). [Pg.74]

The efficiency, E(p) = S(p)/p, measures how well the computer is utilized by a parallel application. It is usually expressed as a percentage, with ideal efficiency corresponding to E(p) = 100%. [Pg.76]
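Both measures can be computed directly from measured wall-clock times. A minimal sketch (the timing numbers below are hypothetical, not taken from the tables in this text):

```python
# Speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p, computed from
# measured execution times in seconds. The timings here are made up
# purely for illustration.
def speedup(t1, tp):
    """Ratio of single-process time to p-process time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Fraction of ideal speedup actually achieved on p processes."""
    return speedup(t1, tp) / p

t1 = 100.0                                # single-process time (hypothetical)
timings = {4: 26.0, 8: 14.0, 16: 8.0}     # parallel times (hypothetical)
for p, tp in timings.items():
    print(f"p={p:2d}  S={speedup(t1, tp):5.2f}  E={efficiency(t1, tp, p):6.1%}")
```

A perfectly scaling run, in which the time halves whenever p doubles, would print E = 100% on every line.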

In practice, ideal speedups are difficult to achieve, especially when the number of processes is large. One factor that reduces the speedup is the existence in an algorithm of inherently sequential parts of code that cannot benefit from a parallel implementation. An upper bound on the speedup was formulated by Amdahl, who expressed the maximum attainable speedup for a parallel algorithm in terms of the serial fraction, f, of the algorithm: S(p) <= 1/(f + (1 - f)/p). [Pg.77]

In this relation, known as Amdahl's law, f represents the fraction of the single-process execution time that is consumed by the parts of the algorithm that have not been parallelized; that is, f = Ts/T(1), where Ts is the time spent in the unparallelized parts. [Pg.77]
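The consequences of Amdahl's bound, S(p) <= 1/(f + (1 - f)/p), are easy to explore numerically; a small sketch:

```python
# Amdahl's law: with serial fraction f, the speedup on p processes is
# bounded by 1 / (f + (1 - f)/p). As p grows without limit, the bound
# approaches 1/f, so even a small serial fraction caps the speedup.
def amdahl_bound(f, p):
    return 1.0 / (f + (1.0 - f) / p)

for p in (4, 16, 64, 1024):
    print(f"f=0.05, p={p:4d}: S <= {amdahl_bound(0.05, p):6.2f}")
# A 5% serial fraction limits the speedup to below 1/0.05 = 20,
# no matter how many processes are used.
```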


Table 1 describes the timing results (in seconds) for a system of 4000 atoms on 4, 8, and 16 nodes; the average CPU seconds for 10 time steps per processor is calculated. For the force-stripped-row and force-row-interleaving algorithms, the CPU time is halved each time the number of processors is doubled. This indicates perfect speedup and efficiency, as described in Table 2. Tables 3, 4, and 5 describe the timing results, speedups, and efficiencies for larger systems. In particular, Table 4 shows the scaling of the CPU time with increasing system size. These results are very close to the predicted theoretical results.
Table 2. Speedup and efficiency results for a system of 4000 atoms on 4, 8 and 16 processors...
Predicted and measured speedups and efficiencies for the simple parallel matrix-vector multiplication outlined in Figure 5.4. Dashed curves represent predictions by the performance model, solid curves show measured values, and the dot-dashed line is the ideal speedup curve. The... [Pg.85]

We have used expressions involving the latency, a, and inverse bandwidth, b, to model the communication time. An alternative model, the Hockney model, is sometimes used for the communication time in a parallel algorithm. The Hockney model expresses the time required to send a message between two processes in terms of the parameters r_inf and n_1/2, which represent the asymptotic bandwidth and the message length for which half of the asymptotic bandwidth is attained, respectively. Metrics other than the speedup and efficiency are used in parallel computing. One such metric is the Karp-Flatt metric, also referred to as the experimentally determined serial fraction. This metric is intended to be used in addition to the speedup and efficiency, and it is easily computed. The Karp-Flatt metric can provide information on parallel performance characteristics that cannot be obtained from the speedup and efficiency, for instance, whether degrading parallel performance is caused by incomplete parallelization or by other factors such as load imbalance and communication overhead. [Pg.90]
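The Karp-Flatt metric is computed from the measured speedup S on p processes as f_e = (1/S - 1/p)/(1 - 1/p). A minimal sketch (the helper name and the sample speedups are mine, for illustration only):

```python
# Karp-Flatt experimentally determined serial fraction:
# f_e = (1/S - 1/p) / (1 - 1/p), from measured speedup S on p > 1 processes.
def karp_flatt(S, p):
    return (1.0 / S - 1.0 / p) / (1.0 - 1.0 / p)

# If f_e stays roughly constant as p grows, incomplete parallelization (the
# serial fraction itself) limits scaling; if f_e grows with p, overheads
# such as communication or load imbalance are degrading performance.
for p, S in [(2, 1.9), (4, 3.5), (8, 6.0), (16, 9.0)]:
    print(f"p={p:2d}  S={S:4.1f}  f_e={karp_flatt(S, p):.3f}")
```

A useful consistency check: if S obeys Amdahl's law exactly with serial fraction f, the metric recovers f.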

The complexity analysis shows that the load is evenly balanced among processors, and therefore we should expect a speedup close to P and an efficiency close to 100%. There are, however, a few extra terms in the expression for the time complexity (first-order terms in N) that exist because of the need to compute the next available row in the force matrix. These row allocations can be computed ahead of time, and this overhead can be minimized; this is done in the next algorithm. Note that the communication complexity is the worst case for all interconnection topologies, since simple broadcast and gather on distributed-memory parallel systems are assumed. [Pg.488]

The time complexity of this algorithm shows that the force computation does not involve any extra overhead; therefore, in theory, the speedup should equal P and the efficiency 100%. [Pg.489]

In this case, there is a fixed upper limit to the achievable speedup, and the efficiency decreases as the number of processors increases, regardless of the problem size. Thus, much work must be invested to achieve efficient parallelization of linear-scaling methods: these methods involve a large number of steps that scale linearly in N, and each step must be parallelized to achieve high efficiency when running on large numbers of processors. [Pg.12]

We will use this algorithm instead of the all-to-all broadcast algorithm provided in the employed implementation of MPI because the latter algorithm displayed very irregular performance. A performance model for a matrix-vector multiplication that uses the all-to-all broadcast is discussed in section 6.4.1. The total execution time, the speedup, and the efficiency can then be expressed as functions of p and n. [Pg.84]
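The text's exact expressions for the execution time are not reproduced above. As an illustration only, a generic model of this kind for a row-distributed n x n matrix-vector product can be sketched with a latency/inverse-bandwidth communication term; every constant and the assumed allgather pattern below are my own placeholders, not the book's values:

```python
import math

# Illustrative performance model T(p, n) = computation + communication for a
# row-distributed n x n matrix-vector multiply followed by an allgather of
# the n/p-length result pieces. All parameters are assumed, not from the text.
ALPHA = 1e-6   # latency per message, seconds (assumed)
BETA  = 1e-9   # inverse bandwidth, seconds per byte (assumed)
GAMMA = 1e-10  # time per floating-point operation, seconds (assumed)

def t_parallel(p, n):
    t_comp = 2.0 * n * n / p * GAMMA                 # ~2n^2/p flops per process
    # recursive-doubling allgather: log2(p) messages, (p-1)/p * n doubles moved
    t_comm = math.log2(p) * ALPHA + (p - 1) / p * n * 8 * BETA
    return t_comp + t_comm

def model_speedup(p, n):
    return t_parallel(1, n) / t_parallel(p, n)

def model_efficiency(p, n):
    return model_speedup(p, n) / p
```

Plotting model_efficiency against p for several n reproduces the qualitative behavior described in the surrounding text: efficiency falls as p grows and recovers as n grows.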

Fig. 10. Comparison of speedup (a) and efficiency (b) of the parallel algorithm for different grid sizes.
The lawn-mowing analogy is also interesting in that, up to a certain point, there will be a linear speedup as mowers are added. This speedup occurs because the mowers do not interfere with each other's work and have efficient mechanisms for coordinating their efforts. The lawn-mowing problem exhibits very good data parallelism. [Pg.94]

N.2 Computational speedup for the direct and reciprocal sums

Computational speedups can be obtained for both the direct and reciprocal contributions. In the direct-space sum, the issue is the efficient evaluation of the erfc function. One method, proposed by Sagui et al. [64], relies on the McMurchie-Davidson recursion [57] to calculate the required erfc and higher derivatives for the multipoles. The same approach has been used by the authors for GEM [15] and has been shown to be applicable not only to the Coulomb operator but also to other types of operators, such as overlap [15, 62]. [Pg.166]

Figure 17-2. Speedup S and parallel efficiency P as functions of the number of processors.
The rationale for using hybrid simulation here is that a classic diffusion-adsorption type of model, Eq. (2), can efficiently handle large distances between steps by a finite-difference coarse discretization in space. As often happens in hybrid simulations, an explicit, forward discretization in time was employed. On the other hand, KMC can properly handle thermal fluctuations at the steps, i.e., provide suitable boundary conditions to the continuum model. Initial simulations were done in (1 + 1) dimensions [a pseudo-2D KMC and a 1D version of Eq. (2)] and subsequently extended to (2 + 1) dimensions [a pseudo-3D KMC and a 2D version of Eq. (2)] (Schulze, 2004; Schulze et al., 2003). Again, the term pseudo is used as above to imply the SOS approximation. A speedup of up to a factor of 5 was reported in comparison with KMC (Schulze, 2004), which, while important, is not as dramatic, at least for the conditions studied. As pointed out by Schulze, one would expect improved speedup as the separation between steps increases while the KMC region remains relatively fixed in size. At the same time, implementation is definitely complex because it involves swapping a microscopic KMC cell with continuum-model cells as the steps move on the surface of a growing film. [Pg.22]

