
Matrix-vector multiplication, parallel

The well-known domain decomposition (DD) methods, see e.g. Chan and Mathew (1994), use a data partition induced by a decomposition of the computational domain. This decomposition can be used for two purposes: first, for the parallel implementation of vector updates, inner products, and matrix-vector multiplication, i.e., for a parallel implementation of the CG method; second, for the construction of efficient preconditioners. [Pg.399]

Outline of a simple parallel algorithm to perform the matrix-vector multiplication Ab = c. A is an n × n matrix distributed by rows, b and c are replicated vectors, p is the number of processes, and this_proc is the process ID. A static work distribution scheme is used, and a single global communication operation is required. Each process computes the elements c_i of the c vector corresponding to the locally stored rows of A, and an all-to-all broadcast operation puts a copy of the entire c vector on all processes. [Pg.83]
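
A minimal, self-contained sketch of this row-distributed scheme in C with MPI follows. The dimension n, the fill values, and the divisibility assumption n % p == 0 are illustrative choices of mine, not the book's actual code:

```c
/* Row-distributed parallel matvec Ab = c: each process stores n/p rows
 * of A, computes the corresponding elements of c, and an allgather
 * (all-to-all broadcast) replicates the full c vector everywhere. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int p, this_proc;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &this_proc);

    int n = 512;                  /* assumed global dimension, n % p == 0 */
    int nloc = n / p;             /* rows stored locally */

    double *A    = malloc((size_t)nloc * n * sizeof *A); /* local rows  */
    double *b    = malloc((size_t)n * sizeof *b);        /* replicated  */
    double *c    = malloc((size_t)n * sizeof *c);        /* replicated  */
    double *cloc = malloc((size_t)nloc * sizeof *cloc);  /* local part  */

    /* Illustrative fill; every process holds the full b vector. */
    for (int i = 0; i < nloc; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] = 1.0;
    for (int j = 0; j < n; j++)
        b[j] = 1.0;

    /* Each process computes the elements of c for its local rows. */
    for (int i = 0; i < nloc; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * b[j];
        cloc[i] = sum;
    }

    /* Single global communication: replicate c on all processes. */
    MPI_Allgather(cloc, nloc, MPI_DOUBLE, c, nloc, MPI_DOUBLE,
                  MPI_COMM_WORLD);

    free(A); free(b); free(c); free(cloc);
    MPI_Finalize();
    return 0;
}
```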

Let us develop a performance model for the parallel matrix-vector multiplication. We first note that if the dimensions of A are n × n, the maximum number of processes that can be utilized in this parallel algorithm equals n. The total number of floating point operations required is n² (where we have counted a combined multiply and add as a single operation), and provided that the work is distributed evenly, which is a good approximation if n ≫ p, the computation time per process is... [Pg.83]
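
The trailing expression is elided in the excerpt; under the stated flop count, the standard form of such a model (given here as an assumption, with t_flop the time per combined multiply-add and t_comm(n, p) the cost of the all-to-all broadcast) is

$$t_{\mathrm{comp}} \approx \frac{n^2}{p}\,t_{\mathrm{flop}}, \qquad S(p) = \frac{n^2\,t_{\mathrm{flop}}}{\frac{n^2}{p}\,t_{\mathrm{flop}} + t_{\mathrm{comm}}(n,p)}, \qquad E(p) = \frac{S(p)}{p}.$$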

Predicted and measured speedups and efficiencies for the simple parallel matrix-vector multiplication outlined in Figure 5.4. Dashed curves represent predictions by the performance model, solid curves show measured values, and the dot-dashed line is the ideal speedup curve. The... [Pg.85]

We will illustrate parallel matrix-vector multiplication algorithms using collective communication in section 6.4, and detailed examples and performance analyses of quantum chemistry algorithms employing collective communication operations can be found in sections 8.3, 9.3, and 10.3. [Pg.105]

To illustrate some of the parallel program design principles discussed in this chapter, let us consider the design of a parallel algorithm to perform a dense matrix-vector multiplication Ab = c, where A is an n × n matrix and b and c are n-dimensional vectors. To be able to take full advantage of the... [Pg.107]

Data residing on process P_i in a parallel matrix-vector multiplication Ab = c, where A is a row-distributed n × n matrix, and b and c are replicated vectors of length n. The process count is p, and P_i holds the rows of A numbered in/p through (i + 1)n/p − 1 and computes the corresponding elements of c. A final all-to-all broadcast puts the entire c vector on all processes. [Pg.108]

From Eq. 6.10 it follows that the dimension n must grow at the same rate as p to maintain a constant efficiency as the number of processes increases. If n increases at the same rate as p, however, the memory requirement per process (n²/p + 2n) will increase with the number of processes. Thus, a k-fold increase in p, with a concomitant increase in n to keep the efficiency constant, will lead to a k-fold increase in the memory required per process, creating a potential memory bottleneck. Measured performance data for a parallel matrix-vector multiplication algorithm using a row-distributed matrix are presented in section 5.3.2. [Pg.109]
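
A one-line check of the memory argument, writing n = αp for some constant α chosen to hold the efficiency fixed (α is an assumed proportionality constant, not a quantity from the text):

$$M(p) = \frac{n^2}{p} + 2n \;\Big|_{\,n = \alpha p} = \alpha^2 p + 2\alpha p = \mathcal{O}(p),$$

so a k-fold increase in p indeed multiplies the per-process memory by k.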

Parallel matrix-vector multiplication Ab = c on p processes using fully distributed data. The processes are shown on a grid with process P_{i,j} in row i and column j (0 ≤ i < √p, 0 ≤ j < √p). [Pg.111]

Outline of the algorithm for parallel matrix-vector multiplication Ab = c with block-distributed matrix A discussed in the text. The number of processes is designated p, and this_proc is the process ID. The employed data distribution is shown in Figure 6.11; b_j and c_i represent blocks of length n/√p of the b and c vectors, respectively, and A_{i,j} is the block of A stored by... [Pg.112]
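
The referenced figure is not reproduced here, but the following is a minimal MPI sketch of one common realization of such a block scheme. The specifics are my assumptions, not necessarily the book's exact variant: p = q² processes on a q × q grid, n divisible by q, the block b_j owned by the diagonal process P(j,j), and c_i collected back on P(i,i):

```c
/* Block-distributed matvec Ab = c: P(i,j) stores the n/q x n/q block
 * A(i,j); b_j is broadcast down column j, local partial products are
 * formed, and the partial results are summed across row i. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int p, me;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    int q = (int)(sqrt((double)p) + 0.5);  /* grid dimension, p = q*q */
    int row = me / q, col = me % q;

    int n  = 512;                          /* assumed, n % q == 0 */
    int nb = n / q;                        /* block length */

    /* Communicators spanning this process's grid row and column. */
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    double *Ablk  = malloc((size_t)nb * nb * sizeof *Ablk);
    double *bblk  = malloc((size_t)nb * sizeof *bblk);
    double *cpart = malloc((size_t)nb * sizeof *cpart);
    double *cblk  = malloc((size_t)nb * sizeof *cblk);

    for (int k = 0; k < nb * nb; k++) Ablk[k] = 1.0;
    if (row == col)                        /* assume P(j,j) owns b_j */
        for (int k = 0; k < nb; k++) bblk[k] = 1.0;

    /* Broadcast b_j from the diagonal process down column j. */
    MPI_Bcast(bblk, nb, MPI_DOUBLE, col, col_comm);

    /* Local partial product A(i,j) * b_j. */
    for (int i = 0; i < nb; i++) {
        double s = 0.0;
        for (int j = 0; j < nb; j++)
            s += Ablk[i * nb + j] * bblk[j];
        cpart[i] = s;
    }

    /* Sum partial results across row i; c_i lands on P(i,i),
       mirroring the assumed b distribution. */
    MPI_Reduce(cpart, cblk, nb, MPI_DOUBLE, MPI_SUM, row, row_comm);

    free(Ablk); free(bblk); free(cpart); free(cblk);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```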

The use of concurrence or parallelism in chemistry applications is not new. In the 1980s chemistry applications evolved to take advantage of multiple vector registers on vector supercomputers and attached processors by restructuring the software to utilize matrix-vector and matrix-matrix operations. In fact, once the software had been adapted to vector supercomputers, many applications ran faster on serial machines because of improved use of these machines' memory hierarchies. The use of asynchronous disk operations (overlapped computation and disk reads or writes) and a few loosely coupled computers and workstations are other concurrency optimizations that were used before the development of current MPP technology. The challenge of the... [Pg.210]

The development of vector and parallel computers has greatly influenced methods for solving linear systems, for such computers greatly speed up many matrix and vector computations. For instance, the addition of two n-dimensional vectors or of two n × n matrices, or multiplication of such a vector or of such a matrix by a constant, requires n or n² arithmetic operations, but all of them can be performed in one parallel step if n or n² processors are available. Such additional power dramatically increased the previous ability to solve large linear systems in a reasonable amount of time. This development also required revision of the previous classification of known algorithms in order to choose algorithms most suitable for new computers. For instance, Jordan's version of Gaussian elimina-... [Pg.196]

The parallelization of PEDICI mainly concerned the direct CI [10] part of the program, which generally is by far the most time consuming. In the iterative procedure first proposed by Davidson [11], or the many existing variants thereof, a central step consists of the efficient multiplication of the electronic Hamiltonian matrix onto a given trial vector ... [Pg.269]
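
The formula itself is elided in the excerpt; in standard direct CI notation this central step is the so-called sigma-vector construction, which (stated here as an assumption about the missing equation) reads

$$\boldsymbol{\sigma} = \mathbf{H}\,\mathbf{c}, \qquad \sigma_I = \sum_J \langle I \,|\, \hat{H} \,|\, J \rangle \, c_J,$$

i.e., exactly a (sparse) matrix-vector multiplication over the determinant or configuration basis.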

Generally, a set of coupled diffusion equations arises for multiple-component diffusion when N ≥ 3. The least complicated case is for ternary (N = 3) systems, which have two independent concentrations (or fluxes) and a 2 × 2 matrix of interdiffusivities. A matrix and vector notation simplifies the general case. Below, the equations are developed for the ternary case along with a parallel development using compact notation for the more extended general case. Many characteristic features of general multicomponent diffusion can be illustrated through specific solutions of the ternary case. [Pg.134]
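
For concreteness, the ternary matrix-vector form is sketched below in the standard Fickian notation (assuming, as is conventional, that the third component is taken as the dependent one):

$$\begin{pmatrix} \vec{J}_1 \\ \vec{J}_2 \end{pmatrix} = -\begin{pmatrix} \tilde{D}_{11} & \tilde{D}_{12} \\ \tilde{D}_{21} & \tilde{D}_{22} \end{pmatrix} \begin{pmatrix} \nabla c_1 \\ \nabla c_2 \end{pmatrix}.$$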

However, Harrison and Zarrabian suggest that for parallel-vector machines, it is better to revert to a matrix multiplication such as that used by Knowles and Handy [109]. This algorithm is reproduced in Fig. 11. These loops are run... [Pg.204]

Given these characteristics, it is evident that large-scale semiempirical SCF-MO calculations are ideally suited for vectorization and shared-memory parallelization: the dominant matrix multiplications can be performed very efficiently by BLAS library routines, and the remaining minor tasks of integral evaluation and Fock matrix construction can also be handled well on parallel vector processors with shared memory (see Ref. [43] for further details). The situation is less advantageous for massively parallel (MP) systems with distributed memory. In recent years, several groups have reported on the fine-grained parallelization of their semiempirical SCF-MO codes on MP hardware [76-79], but satisfactory overall speedups are normally obtained only for relatively small numbers of nodes (see Ref. [43] for further details). [Pg.571]
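
As an illustration of that point, here is a hedged sketch of delegating such a multiplication to BLAS through the CBLAS interface; the routine name and signature are standard CBLAS, while the wrapper function and its Fock-like framing are illustrative assumptions of mine:

```c
/* Offload the dominant matrix multiplication C = A*B to BLAS. */
#include <cblas.h>

void fock_like_multiply(int n, const double *A, const double *B, double *C)
{
    /* C := 1.0 * A * B + 0.0 * C, all matrices n x n, row-major. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}
```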

The expansion coefficients are eigenvectors of the interaction matrix. Sparse matrix methods are used since, as the size of the expansion increases, more and more matrix elements are zero. An implementation of the Davidson method [14] is used for large cases. Since it is based on the multiplication of the interaction matrix by a vector, the method can readily be parallelized [15]. [Pg.119]
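
A minimal sketch of why that matrix-vector product parallelizes so readily: each row of the sparse interaction matrix can be processed independently. The CSR storage layout and the OpenMP row loop below are illustrative assumptions on my part, not the cited implementation [15]:

```c
/* y = A*x for an n-row sparse matrix in CSR format: row_ptr[n+1]
 * delimits each row's nonzeros, stored in col_idx[] and val[].
 * Rows are independent, so the outer loop parallelizes trivially;
 * dynamic scheduling balances rows of unequal sparsity. */
#include <omp.h>

void csr_matvec(int n, const int *row_ptr, const int *col_idx,
                const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}
```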

