Implementing Asynchronous Linear Solvers Using NonUniform Distributions
ACM Subject Categories
Computing methodologies~Linear algebra algorithms

Computing methodologies~Massively parallel algorithms
Applied computing~Mathematics and statistics
Keywords
 Asynchronous Iteration
 Linear Solvers
 Randomized Linear Algebra
Abstract
Asynchronous iterative methods may improve the timetosolution of their synchronous counterparts on highly parallel computational platforms. This paper considers asynchronous iterative linear system solvers that employ nonuniform randomization and develops a new implementation for such methods. Experiments with a twodimensional finitedifference discrete Laplacian problem are presented. The new finer grain implementation is compared with an existing blockbased one and shown to be superior in terms of the convergence speed and accuracy. In general, using nonuniform distributions in selecting components to update may lead to faster convergence. In particular, the new implementation converges up to 10% faster when it uses a nonuniform distribution.
Introduction
Asynchronous iterative methods describe a class of parallel iterative algorithms where each computing element is allowed to perform its task without waiting for updates from any of the other processes. These methods are often applied to the parallel solution of fixedpoint problems and have been used in a wide variety of applications including: the faulttolerant solution of linear systems (Anzt, Dongarra, & QuintanaOrtฤฑฬ, 2019), the preconditioning of linear solvers (Chow & Patel, 2015), and optimization (Recht et al., 2011), among many others. These solvers tend not to converge to high precision as quickly as their Krylov subspace counterparts; however, they can converge very quickly to a low level of accuracy (Avron, Druinsky, & Gupta, 2015). This loss of accuracy may cause the use of asynchronous linear solvers to be suboptimal for some applications, but the fact that they are able to reach an approximate solution quickly opens up several other application areas. Possible use cases include preconditioning to a Krylov method, solving systems that may not need a high level of accuracy (e.g., big data and machine learning), or smoothing a multigrid method.
Here we study asynchronous iterative methods for solving linear systems of the form ๐ด๐ฅ = ๐, such as asynchronous Jacobi. One way to attempt to improve the performance of asynchronous linear solvers is to have each processor select randomly the (block of) components it updates next, as opposed to fixing an update order a priori. This approach has been studied previously by Avron, Druinsky, and Gupta (2015) for the case where the random selection is done uniformly. Our work continues to investigate the potential performance increase of dynamically weighting the random selection of the next component to update. In the synchronous case, weighting the selection using the norm of the row of ๐ด associated with the selected component has been done previously (Strohmer & Vershynin, 2009; Leventhal & Lewis, 2010; Griebel & Oswald 2012). However, the idea employed here is to periodically sort and rank the residuals associated with each component and make the random selection using a nonuniform distribution that is more likely to select components with a larger contribution to the residual. This is motivated by the success of weighted stationary solvers, such as the Southwell iteration, which typically converge in fewer iterations than traditional Jacobi or GaussSeidel relaxation schemes (see e.g., Southwell (1946) and WolfsonPou and Chow (2017)).
In a previous work, we have already studied the use of nonuniform distributions for selecting components to update (Coleman, Jensen, & Sosonkina, 2019). The present work extends that work by making the following new contributions:
 Propose a new rowbased randomized asynchronous linear solver with a significantly different approach to the selection of components to update;
 Develop an alternative component ordering criterion that uses component differences instead of residuals;
 Observe experimentally that new rowbased solver exhibits convergence in fewer component relaxations than serial GaussSeidel;
 Compare the performance of the block and rowbased solvers and demonstrates that the proposed new rowbased solver improves upon the blockbased one.
The structure of the rest of the paper is as follows. Section 2 provides information on related studies. Section 3 gives an overview of asynchronous iterative methods. Section 4 provides the design of randomized asynchronous iterative solvers that use nonuniform distributions. Section 5 presents the experimental results of the two implementations considered in this work and their comparisons. Section 6 concludes and proposes some future works.
Related Work
The Department of Energy has commissioned two very detailed reports about the progression towards exascale level computing; one from a general computing standpoint conducted by Ashby et al. (2010), and a report aimed specifically at applied mathematics for exascale computing by Dongarra et al. (2014); both of which emphasize the importance of developing scalable algorithms moving forward towards exascale platforms. Development of scalable applications on a large scale starts with modifying algorithms that form the basis for those applications, and the stationary iterative methods examined here (e.g., Jacobi, GaussSeidel, block variants) form an important aspect of many preconditioning techniques for Krylov subspace methods, as well as commonly acting as smoother in multigrid methods.
Several recent studies focus on improving scalability by attempting to remove the synchronization delay: a finegrained algorithm for computing incomplete LU factors for the purposes of preconditioning of linear solvers was created by Chow and Patel (2015), an optimization technique based upon an asynchronous approach to stochastic gradient descent was created by Recht et al. (2011), and the efficacy of asynchronous multigrid smoothers was explored for Computational Fluid Dynamics (CFD) applications in (Kashi, Vangara, & Nadarajah, 2018).
The use of randomization in linear algebra has found use in a variety of areas including transforming linear systems using Random Butterfly Transformations to eliminate (with probability 1) the need for pivoting. This has been used to aid in the performance of direct solvers for dense matrices by Parker (1995), and later adopted for sparse matrices by Baboulin, Li, and Rouet (2015). Other examples include the random component selection in stochastic gradient descent methods, including an early study in Srivastava and Nedic (2011) that incorporates asynchronous computation. More pertinent to the topic studied here, randomized linear relaxation based solvers have been studied in the past by Strikwerda (2002) who extend the original asynchronous model presented by Chazan and Miranker (1969) to allow component choice and (theoretical) delay to be based upon probability distributions.
The present work follows a greedy approach, similar in spirit to the Southwell iteration. WolfsonPou and Chow (2017) extend a Southwelloriented approach to the case of parallel asynchronous solvers, whereby an equation is relaxed if it has the largest residual among all coupled equations.
Overview of Asynchronous Iterative Methods
In asynchronous computation, each part of the problem is updated such that no information from other parts is needed while each individual computation is performed. This allows each processor to act independently. The model that is shown here to provide a basis for asynchronous computation comes mainly from Frommer and Szyld (2000). To start, consider a fixed point iteration with the function, ๐บ: ๐ท โ ๐ท. Given a finite number of processors ๐_{1}, ๐_{2}, โฆ, ๐_{๐} each assigned to a block ๐ต of components ๐ต_{1}, ๐ต_{2}, โฆ, ๐ต_{๐}, the computational model can be stated as shown in Algorithm 1.
If each processor (๐_{๐}) waits for the other processors to finish each update, then the model describes a parallel synchronous form of computation. If no order is established for the processors, then the computation is asynchronous.
At the end of an update by processor ๐_{๐}, the components associated with the block ๐ต_{๐๐} will be updated. This results in a vector, $x=\left({x}_{1}^{{s}_{1}\left(k\right)},{x}_{2}^{{s}_{2}\left(k\right)},\mathrm{\u0e42\x80\u0e06},{x}_{m}^{{s}_{m}\left(k\right)}\right)$, where ๐ _{๐}(๐) indicates how many times component ๐ has been updated, and ๐ is a global iteration counter that is updated every time that any processing element makes an update. A set of indices ๐ผ^{๐} contains the components that were updated on the ๐^{th} iteration. Given these definitions, the three following conditions provide a framework for asynchronous computation:
Definition 1. If the following three conditions hold

${s}_{l}\left(k\right)\u0e42\x89\u0e04k\u0e42\x88\x921$, i.e., only components that have finished computing are used in the current approximation.

${lim}_{k\u0e42\x86\x92\mathrm{\u0e42\x88\x9e}}{s}_{l}\left(k\right)=\mathrm{\u0e42\x88\x9e}$, i.e., the newest updates for each component are used.

$\leftk\u0e42\x88\x88\mathrm{\u0e42\x84\x95}:l\u0e42\x88\x88{I}^{k}\right=\mathrm{\u0e42\x88\x9e}$, i.e., all components will continue to be updated.
Then given an initial ${x}^{0}\u0e42\x88\x88D$, the iterative update process defined by,
This basic computational model provided by the combination of Algorithm 1 and Definition 1 allows for many different results on finegrained iterative methods. In particular, our earlier work (Coleman, Jensen, & Sosonkina, 2019) introduced a blockbased randomized asynchronous linear solver that uses nonuniform distributions for dynamically prioritizing components to update.
Relaxation methods have been the focus of many studies related to asynchronous iterations starting with Chazan and Miranker (1969). They are typically used to solve linear systems of the form ๐ด๐ฅ = ๐ and can be written as fixed point iterations that can be expressed as
This iteration can give successive updates to the ๐ฅ_{๐} component in the solution vector. In synchronous computing environments, each update to an element of the solution vector, ๐ฅ_{๐}, is computed sequentially using the same data for the other components of the solution vector (i.e., the values for ๐ฅ_{๐} in Equation (2)). Conversely, in an asynchronous computing environment, each update to an element of the solution vector occurs when the computing element responsible for updating that component is ready to write the update to memory and the other components used are simply the latest ones available to the computing element. Expressing Equation (2) in a block form similar to Equation (1) gives an iteration matrix of ๐ถ = โ๐ท^{โ1}(๐ฟ + ๐) where ๐ท is the diagonal portion of ๐ด, and ๐ฟ and ๐ are the strictly lower and upper triangular portions of ๐ด respectively. Convergence of asynchronous fixed point methods of the form presented in Equation (1) is determined by the spectral radius of the iteration matrix, ๐ถ.
Theorem 1. For a fixed point iteration of the form given in Equation (1) that adheres to the asynchronous computational model provided by Algorithm 1 and Definition 1, if the spectral radius of ๐ถ, ฯ(๐ถ) , is less than one, then the iterative method will converge to the fixed point solution.
If ๐ฅ^{*} is the fixed point of the iteration defined by the matrix ๐ถ, then convergence is given by ensuring that the error at a given iteration, โ ๐ฅ^{(๐)} โ ๐ฅ^{*} โ, is sufficiently small. In practice, this is accomplished by verifying that the residual, ๐^{(๐)} = ๐ โ ๐ด๐ฅ^{(๐)}, is beneath a given threshold. Asymptotic results such as this, i.e., that guarantee eventual convergence but offer no guarantee as to the rate of that convergence, exist for many variants of the iteration described above (see Frommer and Szyld (2000) for a summary).
Randomized Linear Solvers
The use of randomization in asynchronous linear solvers allows for the possibility of statements concerning the rate of convergence to be made. A randomized GaussSeidel method was introduced by Leventhal and Lewis (2010) building off of the randomized Kaczmarz algorithm proposed by Strohmer and Vershynin (2009), whereby the decrease in the expected value of the error at each step is bounded. The analysis was generalized by Griebel and Oswald (2012) who also added a new parameter that allows for both over and under relaxation. Both of these studies weight the random selection of row ๐ by the size of the element ๐_{๐๐} โ ๐ด. In the case that ๐ด has unit diagonal this simplifies to a uniform distribution. More recently, Avron, Druinsky, and Gupta (2015) build upon the analysis by Leventhal and Lewis (2010) and Griebel and Oswald (2012) and explicitly analyze the case of asynchronous computation with a uniform distribution.
All of the methods select the vector component to update from a random distribution instead of either sequentially looping through the available components or by tying the updates for a single component to a particular processor (see Equation (2)). In a traditional parallelization of either a synchronous or asynchronous linear solver, processor ๐ is responsible for updating component ๐; the asynchronous variant allows processor ๐ to continue to compute relaxations for the component assigned to it regardless of the state of the other processors. The use of randomization in the selection of which component to update allows for the possibility of any processor updating any component. In a randomized asynchronous linear solver, when a processor finishes computing an update to a component, it writes the update to the shared memory and then randomly draws the next component to update from the list of all available components. In the randomized asynchronous linear solvers proposed by others to date, this random selection is always done using either uniform random number generation, or with a probability proportional to a row norm of the matrix ๐ด. Leventhal and Lewis (2010) cite Fourier analysis as an application area that can benefit from this type of weighting; however, there is no reason not to expect improved behavior for an arbitrary problem. The authors have proposed inย Coleman, Jensen, and Sosonkina (2019) to use the nonuniform distributions in the asynchronous Jacobi iterative method. In this work, efficient implementations of such an iterative method are investigated.
Southwell Algorithm
The Southwell algorithm (Southwell, 1946) works similarly to Jacobi by relaxing a single equation at a time, but chooses the equation with the largest local contribution to the residual. For a given row ๐, this local contribution is defined to be
at iteration ๐. This difference allows the Southwell algorithm to often converge in fewer iterations than Jacobi, but raises the expense of computing an update since the local residuals need to be stored and ranked at each iteration. After a given iteration, the Southwell algorithm chooses the component that contributes the most to the global residual; thus, the algorithm ranks the residuals from largest to smallest. Using the insight from the Southwell algorithm, the idea behind the randomized linear solvers developed here is for each processor to select the next component for updating randomly, using a distribution that more heavily weights selection of components that contribute more to the global residual. Pseudocode for a randomized variant is provided in Algorithm 2. The key difference of the present work is that here nonuniform distributions in Line 3 of Algorithm 2 are investigated.
In an effort to simulate the effect of the Southwell algorithm using randomized asynchronous solvers, the local residuals associated with each equation (or block of equations) are ranked and sorted, and the selection of the next equation (i.e., component) to update is performed using a nonuniform distribution that forces the random selection to pick components with larger local residuals more frequently. The goal behind the proposed modification is that relaxing the components with a more significant contribution to the global residual may reduce the total number of iterations required. Motivation for this comes from a myriad of different studies, see for instance Nutini et al. (2015) that shows that for some cases (Gauss)Southwell selection can converge faster than uniform random selection for coordinate descent. In general, the improvement in convergence will have to be shown to be significant enough to offset the extra computational and communication cost associated with storing and ranking all of the local residuals. To help offset the increased computational expense, the periodicity with which the sorting and ranking procedures are done is experimented with since it contributes directly to the overall efficiency of the algorithm.
Asynchronous Solver Design with NonUniform Distributions
The focus here is initially on the potential performance of different randomized asynchronous linear solvers through a series of tests in MATLAB^{ยฎ} (Section 4.2), followed by the descriptions of two sharedmemory algorithms, a blockbased (Section 4.3) and a novel rowbased (Section 4.4).
Problem Description
This work examines the asynchronous Jacobi relaxation algorithm for solving finitedifference discretizations of Partial Differential Equations (PDEs) on a regular grid. In science and engineering, PDEs mathematically model systems in which continuous variables, such as temperature or pressure, change with respect to two or more independent variables, such as time, length, or angle (Smith, 1985). The specific problem under study here is Laplace equation in two dimensions:
where the twodimensional finitedifference discretization uses Dirichlet boundary conditions. This PDE, which is a fundamental equation for modeling equilibrium and steady state problems, is also used in more complex problems based on PDEs. Equation (3) may be discretized such that a finite difference operator computes difference quotients over a discretized domain. For example, the twodimensional discrete Laplace operator
approximates the twodimensional continuous Laplacian using a fivepoint stencil (Lindeberg, 1990). From this, a discretized version of the Jacobi algorithm
may be applied to solve a twodimensional sparse linear system of equations (Strikwerda, 2004). Indices ๐, ๐, and ๐ define discrete grid nodes in two dimensions and the iteration number, respectively, for updating the discretized solution vector ๐ฃ.
In the particular instance of this 2D Laplacian problem, as solved with the Jacobi method here, the grid of 800 ร 800 is used to obtain experimental results, the Dirichlet boundary conditions are 100, 0, 75, and 50 for the top, bottom, left, and right boundaries, respectively; the solution vector ๐ฃ is initilalized to 0 in each nonboundary grid point, and the righthand side vector ๐ is equal to the initial ๐ฃ.
ProofofConcept
Preliminary experiments are performed using MATLAB^{ยฎ} to demonstrate the improvement in convergence with Southwell and with nonuniform component selection, compared with Jacobi and with uniform component selection, for the problem tested in this work. As an example of potential convergence rates, Figure 1 shows the progression of the residuals over the first 10,000 iterations when solving the two and threedimensional finitedifference discretizations of the Laplacian over a 10 ร 10 and 10 ร 10 ร 10 grids, respectively. Here, the four solution methods used are the traditional synchronous Jacobi algorithm, a traditional Southwell algorithm, and two randomized linear solvers: one choosing the component to update using a uniform random distribution, and another using an exponential random number distribution with the parameter ฮป = 2. Note that the convergence of the randomized linear solver using the uniform distribution is slightly inferior to traditional solvers and to the one with exponential distribution. The latter performs on par with the Southwell, both in the 2D (Figure 1a) and 3D (Figure 1b) cases.
Blockbased Algorithm
The following blockbased algorithm design has been introduced inย Coleman, Jensen, and Sosonkina (2019) and is provided here as the reference for a wider performance analysis and comparison with the novel, rowbased, algorithm. In the taskbased asynchronous solver, a thread chooses a block of grid rows to update by sampling from a distribution. The number it draws corresponds to an index in a list of blocks, ranked in order of descending component residuals. For example, if a thread draws the number zero from the distribution, it will update the blockrow of components with the largest residual, assuming that block is not being updated by another thread. In the case that a thread selects a block that is already being worked on by another thread, the selecting thread searches sequentially either up or down in the rankings until it finds an available block.
Initially, block residual rankings are assigned via a natural ascending ordering (seeย Figure 2). A single thread, denoted the residual ranking thread, is tasked with computing the component residuals, sorting the residual rankings, and updating the global ranking list that all the threads use to select blocks for updating. Note that using a single thread leads to a more accurate global ranking list and does not result in a bottleneck for a moderate number of threads. For largescale distributed implementations, a different ranking procedure has to be developed.
In this work, the residual ranking thread performs ranking and listupdating after every five iterations of the linear system solver. Essentially, Algorithm 2 may be modified to include ranking periodicity ฯ as shown inย Algorithm 3. This ranking period needs to be chosen judiciously, depending on several factors, such as the number ๐ of relaxations performed, the number of threads used, and the number $\hat{n}$ of blockrows to rank. Here, ฯ = 5 was found experimentally to mitigate the ranking overhead for the obtained number of iterations to convergence, while the number of relaxations was varied. A more detailed investigation of the ranking periodicity is warranted and left as future work.
Rowbased Algorithm
Algorithm 4 illustrates a novel rowbased method. Similarly toย Algorithm 3, the master thread periodically, every ฯ relaxations, ranks and sorts the rows (line 20). However, there are several important distinctions between the two algorithms, due to which Algorithm 4 exhibits better performance. In lineย 10, a thread uses a probability distribution function ๐ to select a single target row to relax instead of a block of rows shown inย Algorithm 3, and then transitions from the current (start) row rฬ to the target row ๐_{๐ฃ} by relaxing all the rows between rฬ and ๐_{๐ฃ} in their natural ordering, instead of jumping to the target row to relax next as done in the blockbased implementation (Algorithm 3). Furthermore, while making this transition, a thread may move inward the domain or toward its top or bottom boundary rows, depending on the direction of the shortest distance ๐_{๐ฃ} from the current start row to the target (see Equation (4)).
where ๐ is the total number of rows
in the subdomain, and the direction of progression to the target is
toward and across the boundary if the first term inย Equation (4) is taken as ๐_{๐ฃ}; otherwise, the
boundary is not crossed. The former is also chosen when the terms
are equal. Then, in lineย 13, the nextr
function assigns
the next row number to consider by decrementing or incrementing the
row number rฬ for the boundary or
nonboundary progression direction, respectively; and performing
circular shift of the row numbers if they reach the boundary. Note
that fewer than ๐_{๐ฃ} rows
may be relaxed if certain rows in the path towards the target row
are not free, i.e., they are already being relaxed by
another thread at the time of their consideration, as specified by
the conditional statement in lineย 14. A shared array of size ๐ maintains row availability, in
which a threads "locks" the row number while it relaxes that row and
releases the lock upon finishing the operations in lines 15โ20.
The use of the shortest distance is motivated by an attempt to adhere to the ranking order of rows while also relaxing in the neighborhood of the target row; thereby, making the transition to the target smoother. Additionally, in a distributedmemory environment, the ability to more frequently relax boundary rows may facilitate a better data movement among subdomains possibly leading to a faster convergence. Another distinction between the blockbased implementation and the rowbased one is that the rowbased performs the ranking of rows using rowsum differences instead of residuals. In particular (see lineย 17), after every row rฬ relaxation, a thread performs a summation ฯ_{rฬ} of the absolute values of all the components in rฬ and updates the rowsum difference ฯ_{rฬ}
where ${\mathrm{\u0e2f\x83}}_{\stackrel{~}{r}}^{\u0e42\x88\x92}$ is the sum taken after the previous relaxation of rฬ. This difference ฯ_{rฬ} is assumed to be decreasing between the two adjacent relaxations and arbitrarily small when the algorithm has converged. A strong linear relationship has been observed between the row difference method rank and the row residual rank during the entire convergence process. Table 1 presents a small sample of representative correlation coefficients ๐ at regular intervals throughout a sample calculation, which quantify the magnitude and direction of this relationship. Of the hundreds of thousands of computed correlation coefficients, the minimum and mean coefficients are 0.77 and 0.96, respectively, with a standard deviation of 0.02. Using this difference instead of residuals decreases the computational overhead of ranking the rows. In particular, finding the row difference requires about 7 times fewer floating point operations per iteration than when using the row residual for this problem. Note, that, while it is shown that the differenceranking method works for this sample problem, it has not been tested with other types of problems.
Row Ranking Iteration  0  20๐^{3}  40๐^{3}  60๐^{3}  80๐^{3}  100๐^{3}  120๐^{3}  1400๐^{3} 
๐  0.99  0.99  0.96  0.95  0.97  0.96  0.97  0.98 
Implementation Results
The blockbased and rowbased algorithms are implemented and tested on two sharedmemory computing platforms. For both platforms and both implementations, results show that the calculation time decreases using nonuniform distributions, compared with a uniform distribution. Additionally, the rowbased implementation shows a decrease in iterations, compared with GaussSeidel.
Experimental Design
The experiments using OpenMP^{ยฎ} are conducted on two computing platforms at Old Dominion University^{1}. The Rulfo system has an Intel^{ยฎ} Xeon Phi^{โข} Knight's Landing 7210 model processor with 64 cores running at 1.30 GHz and 112 GB of DDR4 physical memory used as DRAM in these experiments. One thread per core is utilized, with one core reserved for interfacing with the operating system, resulting in 63 computational threads for the experiments inย Section 5.2. On the Wahab system, a single node of the cluster is utilized, containing two Intel^{ยฎ} Xeon Gold 6148 CPUs each with 20 physical cores and 376 GB of DDR4 memory. The code uses standard C++ routines for sorting residuals and generating random numbers, with the default parameters and the builtin distributions. Experimental parameters are presented in Table 2.
Hardw 
Thrds 
Grid 
Block 
Tol 
Norm 
Expo 


Blockbased  Rulfo  63  800 ร 800  5  1๐^{3}  (16โ54,8)  0.01โ0.8 
Blockbased  Wahab  40  800 ร 800  5  1๐^{3}  (16โ40,8)  0.01โ0.8 
Rowbased  Wahab  40  800 ร 800  1  1๐^{3}  (80โ400,40)  0.002โ0.16 
Blockbased Implementation on Rulfo
For block selection, three different distributions are tested. The uniform distribution is used as a control; a thread may select any block with equal probability. The normal distribution is used to examine the effects of targeting different segments of blocks in the rankings, i.e., blocks with lower ranks and higher residuals versus blocks with higher ranks and lower residuals. This is achieved by varying the mean parameter ฮผ while keeping the standard deviation ฯ fixed in the normal distribution. The exponential distribution, with the mode ฮป close to zero, will tend to sample lowerranked blocks.
For both normal and exponential distributions, the algorithm convergence may be observed inย Figure 3 and Figure 4, respectively. In the figures throughoutย Section 5, the term Recording Iteration points out that the data is recorded by a thread every 1,000 iterations. For the normal distribution, it may be observed inย Figure 3a that the convergence rate depends strongly on ฮผ: Its smaller values (up to ฮผ = 46) lead to rapid convergence whereas, at ฮผ = 46, the convergence sharply deteriorates. This may be also observed when considering the timetosolution inย Figure 3b. Due to very slow convergence, at large ฮผ values, the normal distribution becomes extremely noncompetitive with the uniform distribution, which timing is shown as red dashed line inย Figure 3b. Figure 4a shows that the parameter ฮป for the exponential distributions does not have as much an impact on performance as the parameter ฮผ does so for the normal distribution runs. As ฮป moves farther away from zero, however, it hinders convergence and the exponential distribution results in slower timings than those obtained with the uniform distribution as seen inย Figure 4b. Once the best parameter choices are found for normal and exponential distributions, their performances compare favorably to the uniform distribution (Figure 5), and up to 10% and 13% fewer iterations are observed, respectively.
Figure 6 provides a more detailed explanation for performance differences based on the selection of ฮผ. In particular, Figure 6a and Figure 6b depict that the ordered component residual values for ฮผ equal to 16 and 44 are nearly indistinguishable. However, when ฮผ increases to 48 (Figure 6c) and then again to 52 (Figure 6d) residuals of the lowestranked blocks decrease slowly while the residuals of all other blocks decrease more quickly. Note that all the blockbased implementation (BBI) experiments use 8 for ฯ, which is appropriate for all the chosen ฮผ ranges of 16 to 54 on Rulfo and 16 to 40 on Wahab.
Figure 7a and Figure 7b show that, for the minimum and maximum values of ฮป, respectively, the component residual decrease is balanced among the component ranks as iterations progress.
Rowbased Implementation on Wahab
The results of the BBI show that nonuniform probability distribution functions may be used to efficiently select components to update, leading to convergence for the sample problem used in this work. However, relaxing blocks of rows asynchronously tends to cluster errors on block boundaries, and thereby hindering convergence. A rowbased implementation (RBI) has been introduced to mitigate this problem. Here, the RBI solves the same sample problem in shared memory as BBI (see Section 5.2). Recall that RBI does not consider blocks of rows to be relaxed by a single thread. Instead, a thread selects only a single row to relax at a time.
For row selection, as with the BBI, the same three distributions are tested. Again, the uniform distribution is used as a baseline for comparison with the normal and exponential distributions. Similar to the BBI experiments, the normal and exponential distributions are geared to consider different ranges of row numbers by, respectively, keeping the standard deviation ฯ parameter fixed and the parameter ฮป close to zero. Figure 8a shows the diminishing row differences as the system converges, and the disparity between the rows with the least and greatest differences decreases. In Figure 8b, initially the lowestindex rows have the greatest differences since these are the boundary rows, and in effect, the greatest discontinuity initially is between the top boundary and the first row of grid points (seeย Section 4.1). Conversely, the least discontinuity initially is between the bottom boundary and the last row of grid points. These respective discontinuities are reflected in the row component differences of consecutive iterations, i.e., the top row initially changes quickly, while the bottom row changes slowly. However, as the calculation progresses, the change in the first rows decreases. For most of the calculation, the middle rows experience the most change.
For the normal distribution, Figure 9 shows the effects of choosing appropriate and excessively large values of the normal distribution mean parameter ฮผ, values of 80 and 400, respectively. Note that the normal distribution standard deviation parameter ฯ is kept at 40, which is appropriate for the range of ฮผ values considered for RBI here. Compared with Figure 9a, Figure 9b shows increased iterations, greater rowdifference disparity between bottom and topranked rows, and increasing row differences for ranks 300โ400 during the first 1,000 iterations.
Similarly toย Figure 8b,ย Figure 10 shows the rank changes during the convergence processes albeit here for the normal distribution for the same parameters as inย Figure 9. Inย Figure 10a with ฮผ = 80, the middle rows are targeted so frequently that the ranks of the rows with the greatest differences are pushed outward, toward the first and last ranks, much more than what is observed for the uniform and normal with ฮผ = 400 distributions (cf.ย Figure 8b and Figure 10b, respectively). For ฮผ = 400, because the lowerdifference rows are targeted more often, the group of highranked rows (shown as the middle yellow band) does not shift ranks to the extent seen with the uniform distribution inย Figure 8b, and hence, is updated fewer times, which leads to inferior convergence. This pattern is expected to continue for ฮผ > 400.
For the exponential distribution, Figure 11 and Figure 12 show the progression of row differences and rankings, respectively. Here, both small and large values of ฮป provide similar results and are equally effective, on par with good values of ฮผ when using the normal distribution. The convergence history is presented inย Figure 13 for the three distributions and their respective parameter choices considered for RBI. As expected, for the exponential distribution and the normal distribution with smaller ฮผ of 80, the residual decreases more quickly than with the uniform distribution, whereas with the normal distribution parameter ฮผ = 400, the residual decreases the most slowly (see Figure 13).
Performance Comparison of Block and Rowbased Implementations
Here, block and rowbased implementations are mutually compared on the same platform, Wahab, as to their number of relaxations and time to converge for a range of nonuniform distribution parameters ฮผ and ฮป, as shown inย Table 2.
Note that the distribution parameters in the rowbased implementation differ from those used by the blockbased one, which reflects the sorted array sizes and different convergence behavior of the implementations. In particular, for the given test problem, the BBI has 160 entities (blocks) to sort, while there are 800 entities (rows) to sort in the case of RBI. The difference in convergence behavior is especially evident when comparing results from the two implementations when both use normal distributions to select components. In Figure 14a, for BBI, there is a distinct difference in results for ฮผ = 44 and ฮผ = 46. For RBI, Figure 14b shows a smoother transition between good and poor normal distribution parameters. Note that good and poor, respectively, are termed so because they yield the calculation times faster and slower than those for the uniform distribution test cases. In particular, the poor distribution parameters are those starting with the first ฮผ that yields a significant jump in the calculation time; and this percentage increase for RBI is taken to be comparable with the one in BBI. By comparing the results for the ฮผ values inย Figure 14a and Figure 14b, it is seen that the RBI tolerates a much higher relative value for ฮผ than the BBI does so before significantly degrading the performance. For example, while ฮผ = 46 is already a poor choice for the BBI, ฮผ = 230 (which is equal to 46 ร 5 rows in a block) is still well within the range of good parameters for the RBI.
In addition, Figure 14a and Figure 14b compare the block and rowbased implementation iterations with the iterations of serial GaussSeidel (shown as red horizontal lines), respectively, to converge for the sample problem. The BBI cannot converge in fewer than serial GaussSeidel component relaxations even with the best distribution parameters. The RBI, however, converges in about 10% fewer component relaxations than serial GaussSeidel, using nonuniform distributions with appropriate parameters. This happens consistently, although it has been shown theoretically that more component relaxations may be required when threads update components asynchronously (Avron et al., 2015). A better convergence in the RBI compared with that in BBI may be attributed to the (finegrained) ranking of rows rather than blocks and to relaxing all the rows on the path from the current and the selected target one. Such a relaxation process leads to a smoother transition between rows and possibly to relaxations of more rows by a thread at a time than those contained in a block of the BBI. Although the rowbased implementation ranks and sorts more entries than the BBI does so, the former has faster timetosolution (seeย Figure 18) and is not hindered at large scalesโwhere distributed implementations are a mustโbecause ranking and sorting will be performed within each node independently.
Complementing the convergence comparisons of BBI and RBI fromย Figure 14, Figure 15 demonstrates (as vertical lines in each bar) a greater variability in how often each block in BBI may be relaxed compared with each row relaxation in RBI. This metric bears significance for the nonuniform distributions since they may "neglect" certain components to relax often enough to hinder convergence, as has been shown earlier inย Section 5, and thereby making a proof of convergence more difficult.
Figure 16 and Figure
17 compare BBI and RBI as to which parts of the problem grid
are relaxed more times when good or poor ฮผ is used, respectively. For the
former, Figure 16 shows not only that both
implementations emphasize the relaxation of the middle rows, away
from the fixed top and bottom boundaries, but also that the RBI
places greater emphasis on the rows near the top and bottom
boundaries, and less emphasis on the middle rows, compared to the
BBI. In particular, about 15% of component selections result in a
boundarycrossing event in the row based implementation, which
provides for relaxing all the rows more uniformly. With poor
distribution parameters, Figure 17
shows a different behavior of the RBI from the one inย Figure 16. Now, the RBI relaxes boundary rows
more frequently than it does so for the innermost rows. In
particular, some of the inner rows are now relaxed about as many
times as for good ฮผ but the boundary
rows are relaxed more frequently leading to
an overall higher number of iterations to converge. Generally, the
RBI permits more frequent relaxation of boundary rows, compared with
the BBI. Note that the frequency of boundaryrow relaxation stays
low for BBI given either value of ฮผ
(cf.ย Figure 16 and Figure
17). Such a beneficial behavior of RBI
is expressed in line 13 of Algorithm 4, in
which the nextr
function directs a thread to or from
a boundary row according to the shortest distance (line 11) as
determined by Equation (4).
Figure 18 shows that RBI decreases calculation time (Figure 18b) compared with BBI (Figure 18a) for all the distributions on the Wahab cluster. Furthermore, a 10% convergencetime reduction is observed for the rowbased implementation using normal and exponential distributions with good parameter choices, as compared to a uniform distribution. Figure 18b shows a gradual increase in calculation time for increasing values of ฮผ beyond 200, similar to the gradual increase in numbers of relaxations seen in Figure 14b and Figure 15b. For BBI on the Wahab platform (Figure 18a), the results show a jump in calculation time when the normal distribution is used, which is also observed on Rulfo (cf.ย Figure 3b) albeit at a larger ฮผ value of 46. On Wahab, the BBI threshold ฮผ is 40, which suggests that, for the normal distribution sharedmemory implementation, good parameter selection is platformdependent, as expected. In particular, having more threads results in smaller size blocks, which may mitigate poor ฮผ selection in the BBI.
In addition to the performance benefit seen with the rowbased implementation, Figure 19 and Figure 20 illustrate that the RBI produces a solution with the residual values more uniformly dispersed among all components. For each implementation, the plots display the two runs with the smallest and largest maximum component residuals, out of a set of ten runs that use the exponential distribution with ฮป = 0.05 for BBI and ฮป = 0.01 for RBI. The BBI gives a mean maximum component residual of 4.3๐^{11}, with a standard deviation of 2.2๐^{11}, while the rowbased implementation gives a mean of 1.0๐^{11} and a standard deviation of 1.6๐^{12}. Note that the largest maximum component residual produced by the RBI, as seen in Figure 20b, is about half the size of the smallest component residual produced by the BBI, as seen in Figure 19a. Observe also that the variations between runs are less for RBI than they are for BBI.
Summary and Future Work
This paper develops and tests a novel implementation of a randomized asynchronous iterative solver that uses nonuniform distributions. Complementing a traditional approach of blockrow updates, this implementation blends aspects of different solvers and relies on a finer granularity (rowbased) of grid component updates. As a result, the rowbased implementation (RBI) improves on the blockbased one in multiple aspects: solution quality, the number of iterations required for convergence, and the calculation time. The RBI also supports a wider range of parameters that yield fast convergence for the normal distribution.
For the two asynchronous randomized solver implementations, blockbased and novel rowbased, this paper demonstrates a benefit of using a nonuniform distribution in prioritizing component updates. Both BBI and RBI with nonuniform distributions converge 10% faster than their counterparts with the uniform distribution do so. The rowbased implementation may also converge with 10% fewer iterations than serial GaussSeidel, which is not observed for the blockbased implementation.
A further investigation into the ranking periodicity and technique for sorting the residuals is warranted in the scope of studying the overall efficiency of future randomized asynchronous linear solver variants. Continuing to optimize the implementations will improve their ability to be used either in a standalone capacity or as part of another solution scheme, such as preconditioners for Krylov subspace methods or as smoothers in multigrid methods. Additionally, testing on a more diverse problem set may reveal further benefits to the solver by dynamically focusing on the components that are furthest from convergence.
Acknowledgements
This work was supported in part by the U.S. Department of Energy (DOE) Office of Science, Office of Basic Energy Sciences, Computational Chemical Sciences (CCS) Research Program under work proposal number AL18380057 and the Exascale Computing Project (ECP) through the Ames Laboratory, operated by Iowa State University under contract No. DEAC0007CH11358, by the U.S. Department of Defense High Performance Computing Modernization Program, through a HASI grant and through the ILIR/IAR program at the Naval Surface Warfare Center, Dahlgren Division and by the National Science Foundation under grant CNS1828593.
Footnotes
 The code is available from the corresponding author by request.^{[back]}
Bibliography
 Anzt, H., Dongarra, J., & QuintanaOrtฤฑฬ, E. S. (2019). FineGrained BitFlip Protection for Relaxation Methods. Journal of Computational Science, 36, 100583.
 Ashby, S., Beckman, P., Chen, J., Colella, P., Collins, B., Crawford, D., Dongarra, J., Kothe, D., Lusk, R., Messina, P., Mezzacappa, T., Moin, P., Norman, M., Rosner, R., Sarkar, V., Siegel, A., Streitz, F., White, A., & Wright, M. (2010). The Opportunities and Challenges of Exascale Computing. U.S. Department of Energy.
 Avron, H., Druinsky, A., & Gupta, A. (2015). Revisiting Asynchronous Linear Solvers: Provable Convergence Rate Through Randomization. Journal of the ACM, 62(6), 51.
 Baboulin, M., Li, X. S., & Rouet, FH. (2015). Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods. In Daydรฉ, M., Marques, O., & Nakajima, K. (eds) High Performance Computing for Computational Science (pp. 135โ144). Lecture Notes in Computer Science, vol 8969. Cham: Springer.
 Chazan, D. & Miranker, W. (1969). Chaotic relaxation. Linear Algebra and Its Applications, 2(2), 199โ222.
 Chow, E. & Patel, A. (2015). FineGrained Parallel Incomplete LU Factorization. SIAM Journal on Scientific Computing, 37(2), C169โC193.
 Coleman, E., Jensen, E., & Sosonkina, M. (2019). Enhancing Asynchronous Linear Solvers through Randomization. In Proceedings of the High Performance Computing Symposium. San Diego, CA: Society for Computer Simulation International.
 Dongarra, J., Hittinger, J., Bell, J., Chacon, L., Falgout, R., Heroux, M., Hovland, P., Ng, E., Webster, C., & Wild, S. (2014). Applied Mathematics Research for Exascale Computing. Livermore, CA: Lawrence Livermore National Laboratory.
 Frommer, A. & Szyld, D. B. (2000). On Asynchronous Iterations. Journal of Computational and Applied Mathematics, 123(1โ2), 201โ216.
 Griebel, M. & Oswald, P. (2012). Greedy and Randomized Versions of the Multiplicative Schwarz Method. Linear Algebra and its Applications, 437(7), 1596โ1610.
 Kashi, A., Vangara, S., & Nadarajah, S. (2018). Asynchronous FineGrain Parallel Smoothers for Computational Fluid Dynamics. In 2018 Fluid Dynamics Conference.
 Leventhal, D. & Lewis, A. S. (2010). Randomized Methods for Linear Constraints: Convergence Rates and Conditioning. Mathematics of Operations Research, 35(3), 641โ654.
 Lindeberg, T. (1990). ScaleSpace for Discrete Signals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(3): 234โ254.
 Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., & Koepke, H. (2015). Coordinate Descent Converges Faster with the GaussSouthwell Rule Than Random Selection. In Proceedings of the 32nd International Conference on Machine Learning, 1632โ1641.
 Parker, D. S. (1995). Random Butterfly Transformations with Applications in Computational Linear Algebra (Report No. CSD950023). Los Angeles, CA: University of California.
 Recht, B., Re, C., Wright, S. & Niu, F. (2011). Hogwild: A LockFree Approach to Parallelizing Stochastic Gradient Descent. In ShaweTaylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., & Weinberger, K. Q. (eds) Advances in Neural Information Processing Systems 24 (pp. 693โ701). Red Hook, NY: Curran Associates.
 Smith, G. D. (1985). Numerical Solution of Partial Differential Equations: Finite Difference Methods. Oxford: Oxford University Press.
 Southwell, R. V. (1946). Relaxation Methods in Theoretical Physics: A Continuation of the Treatise, Relaxation Methods in Engineering Science. Oxford: The Claderon Press.
 Srivastava, K. & Nedic, A. (2011). Distributed Asynchronous Constrained Stochastic Optimization. IEEE Journal of Selected Topics in Signal Processing, 5(4), 772790.
 Strikwerda, J. C. (2002). A Probabilistic Analysis of Asynchronous Iteration. Linear Algebra and Its Applications, 349(13), 125โ154.
 Strikwerda, J. C. (2004). Finite Difference Schemes and Partial Differential Equations. Philadelphia, PA: Society for Industrial and Applied Mathematics.
 Strohmer, T. & Vershynin, R. (2009). A Randomized Kaczmarz Algorithm with Exponential Convergence. Journal of Fourier Analysis and Applications, 15(2), 262.
 WolfsonPou, J. & Chow, E. (2017). Distributed Southwell: An Iterative Method with Low Communication Costs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY: Association for Computing Machinery.
Copyright Information
Copyright ยฉ 2020 Erik Jensen, Evan C. Coleman, Masha Sosonkina. This article is licensed under a Creative Commons Attribution 4.0 International License.