Big Chemical Encyclopedia


Bandwidth and Latency

In Chapter 1, we discussed the performance gap between microprocessors and DRAM-based main memory. The so-called memory wall affects microprocessor performance in two respects: bandwidth and latency. Latency is the waiting time after a memory request is sent to the memory controller, while bandwidth is the maximum throughput a memory system can provide. [Pg.56]

Flexibility of configuration. The organization of cluster systems is determined by the topology of their interconnection networks, which can be fixed at installation time and easily modified. Depending on the requirements of the user applications, various system configurations can be implemented to optimize for data-flow bandwidth and latency. [Pg.2]

SCI was perhaps the first SAN to achieve IEEE standardization and has very good bandwidth and latency characteristics. Existing implementations provide between 3.2- and 8-Gbps peak bandwidth with best latencies below 4 μs. The SCI standard includes a protocol for support of distributed shared memory operation. However, most clusters employing SCI use PCI-compatible network control cards (e.g., Dolphin) that cannot support cross-node cache coherence. Nonetheless, even in distributed memory clusters, it provides an effective network infrastructure. [Pg.8]

The bandwidth and latencies defined are at the MPI level, i.e., those seen by a code like LAMMPS at actual run time. For further details see the LAMMPS website. [Pg.299]

Algorithms and cost analyses for a number of collective communication operations have been discussed in some detail by Grama et al. A comprehensive performance comparison of implementations of MPI on different network interconnects (InfiniBand, Myrinet, and Quadrics), including both micro-level benchmarks (determination of latency and bandwidth) and application-level benchmarks, has been carried out by Liu et al. A discussion of the optimization of collective communication in MPICH, including performance analyses of many collective operations, has been given by Thakur et al. ... [Pg.56]

We have used expressions involving the latency, α, and inverse bandwidth, β, to model the communication time. An alternative model, the Hockney model, is sometimes used for the communication time in a parallel algorithm. The Hockney model expresses the time required to send a message between two processes in terms of the parameters r∞ and n1/2, which represent the asymptotic bandwidth and the message length for which half of the asymptotic bandwidth is attained, respectively. Metrics other than the speedup and efficiency are used in parallel computing. One such metric is the Karp-Flatt metric, also referred to as the experimentally determined serial fraction. This metric is intended to be used in addition to the speedup and efficiency, and it is easily computed. The Karp-Flatt metric can provide information on parallel performance characteristics that cannot be obtained from the speedup and efficiency, for instance, whether degrading parallel performance is caused by incomplete parallelization or by other factors such as load imbalance and communication overhead. ... [Pg.90]
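As a minimal sketch of these ideas, the two communication models and the Karp-Flatt metric can be written as small functions (the formulas follow the definitions above; any numeric values used with them would be illustrative, not measured):

```python
def t_alpha_beta(n, alpha, beta):
    """Alpha-beta model: latency plus n words times the inverse bandwidth."""
    return alpha + n * beta

def t_hockney(n, r_inf, n_half):
    """Hockney model: r_inf is the asymptotic bandwidth (words/s) and
    n_half the message length at which half of r_inf is attained."""
    return (n + n_half) / r_inf

def karp_flatt(speedup, p):
    """Experimentally determined serial fraction, computed from the
    measured speedup on p processes."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)
```

The two communication models are interchangeable under the identification α = n1/2/r∞ and β = 1/r∞, so a fit to measured send times in one parameterization can be converted to the other. A Karp-Flatt value that grows with p suggests load imbalance or communication overhead rather than a fixed unparallelized fraction.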

MPP. This multiple instruction stream, multiple data stream (MIMD) class of parallel computer integrates many (from a few to several thousand) CPUs (central processing units) with independent instruction streams and flow control, coordinating through a high-bandwidth, low-latency internal communication network. Memory blocks associated with each CPU may be independent of the others ... [Pg.3]

Performance. In general, a service performs more poorly than an in-process method call due to network latency and bandwidth constraints. To make things worse, a service may call other services to fulfill its responsibilities—a chain of services. Performance has to be a design consideration throughout the development cycle. [Pg.42]

Tolerance of latency and low bandwidth for references to remote memory locations... [Pg.224]

We assume that the total physical memory required by the problem and algorithm will fit into the total available memory. For situations when this is not true (often called out-of-core problems), it is the sustainable bandwidth to secondary media, like disk and tape, that becomes the bottleneck. A very interesting paper on what can be done using an algorithm that has both modest I/O bandwidth requirements and a substantial latency tolerance can be found in [48]. [Pg.243]

It may sound like a paradox, but currently, in optimizing for a parallel computer, most of the effort should be spent on making sure that the individual node performance is as good as possible. This is a consequence of the power of the individual node compared to the network latency and bandwidth. In short, current parallel machines are of large-grain type. The parallel algorithm used should make every effort to communicate as seldom as possible. For best performance this often means that a particular calculated value, needed on several nodes, can actually be recalculated more quickly on each node than communicated to the nodes where it is needed. This is the parallel form of the classic optimization trade-off between memory and CPU cycles. [Pg.247]
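The recompute-versus-communicate trade-off above can be made quantitative with the same latency and inverse-bandwidth parameters used elsewhere in this section. A minimal sketch (the function name and the parameter values in the comment are illustrative assumptions, not from any particular machine):

```python
def should_recompute(t_compute, n_words, alpha, beta):
    """True when recomputing a value locally on each node is faster than
    receiving it in an n_words-long message over the network
    (message time modeled as alpha + n_words * beta)."""
    return t_compute < alpha + n_words * beta

# With a latency alpha of a few microseconds, any local recomputation
# cheaper than that wins regardless of how short the message is.
```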

As a first approximation, after the first word of data enters the network, the subsequent words are assumed to immediately follow as a steadily flowing stream of data. Thus, the time required to send data is the sum of the time needed for the first word of data to begin arriving at the destination (the latency, α) and the additional time that elapses until the last word arrives (the number of words times the inverse bandwidth, β) ... [Pg.24]
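A short numeric illustration of this model (the α and β values below are hypothetical, chosen only to be of the magnitude typical for cluster interconnects):

```python
alpha = 5e-6   # latency: 5 microseconds (hypothetical value)
beta = 1e-9    # inverse bandwidth: 1 ns/word, i.e. 1e9 words/s (hypothetical)

def send_time(n_words):
    """Time for an n-word message: the first word arrives after alpha,
    and the remaining stream takes n_words * beta."""
    return alpha + n_words * beta

# Small messages are latency-dominated, large ones bandwidth-dominated:
# send_time(1) is essentially alpha, while for 10**7 words the
# streaming term n_words * beta dwarfs the latency.
```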

The network performance characteristics for a parallel computer may greatly influence the performance that can be obtained with a parallel application. The latency and bandwidth are among the most important performance characteristics because their values determine the communication overhead for a parallel program. Let us consider how to determine these parameters and how to use them in performance modeling. To model the communication time required for a parallel program, one first needs a model for the time required to send a message between two processes. For most purposes, this time can... [Pg.71]

Another network performance characteristic, related to the latency and bandwidth, is the effective bandwidth, which is defined as the message length divided by the total send time... [Pg.72]
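In the α/β notation used above, the effective bandwidth for an n-word message is n/(α + nβ); it approaches the asymptotic bandwidth 1/β only for long messages. A minimal sketch with hypothetical parameter values:

```python
def effective_bandwidth(n_words, alpha, beta):
    """Message length divided by the total send time, in words per second."""
    return n_words / (alpha + n_words * beta)

alpha, beta = 5e-6, 1e-9   # hypothetical interconnect parameters
# For short messages the latency dominates and the effective bandwidth is
# far below 1/beta; very long messages approach the asymptotic 1e9 words/s.
```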

Latency α, inverse bandwidth β, and bandwidth for Gigabit Ethernet (GigE) and InfiniBand (using IPoIB) interconnects on a Linux cluster. Data were determined using the program shown in Figure 5.2, and the reported bandwidths are unidirectional. [Pg.74]

The execution time for a parallel algorithm is a function of the number of processes, p, and the problem size, n. Additionally, the execution time depends parametrically on several machine-specific parameters that characterize the communication network and the computation speed: the latency and the inverse bandwidth, α and β, respectively (both defined in section 5.1) ... [Pg.80]



© 2024 chempedia.info