WO2015004421A1 - A method of performing a matrix operation in a distributed processing system - Google Patents

A method of performing a matrix operation in a distributed processing system

Info

Publication number
WO2015004421A1
Authority
WO
WIPO (PCT)
Application number
PCT/GB2014/051986
Other languages
French (fr)
Inventor
Ashraf Aboulnaga
Jingen XIANG
Original Assignee
Qatar Foundation
Hoarton, Lloyd
Application filed by Qatar Foundation and Hoarton, Lloyd
Publication of WO2015004421A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions

Definitions

  • The largest matrix, M4, is used to further test the scalability limits, and the smallest matrix, M5, is used to evaluate the aspects of the preferred embodiment optimizations.
  • ScaLAPACK is a popular library of high-performance linear algebra routines for distributed-memory message-passing computers and clusters. ScaLAPACK is an extension of LAPACK, and it has been shown to have good scalability and performance.
  • The ratio of the running time of ScaLAPACK to the running time of the preferred embodiment on medium EC2 nodes for matrices M1 to M3 shows that for small matrices there is a slight performance penalty for using the preferred embodiment.
  • The preferred embodiment approaches or outperforms ScaLAPACK for larger matrices and a larger number of nodes (800).
  • The preferred embodiment has better scalability than ScaLAPACK.
  • ScaLAPACK took 8 hours on the large instances and more than 48 hours on the medium instances to invert matrix M4, both of which are significantly longer than the 5 hours on large instances and 15 hours on medium instances using the preferred embodiment.
  • The present invention in the preferred embodiment has better scalability and performance than ScaLAPACK at high scale.


Abstract

A method of performing a matrix operation in a distributed processing system having a plurality of processing nodes including a master processing node, the method comprising recursively partitioning an original matrix into matrix blocks; identifying a first set of matrix blocks and a second set of matrix blocks; processing the first set of matrix blocks on the master node to produce a first set result; determining a second set result for the second set of matrix blocks using the plurality of processing nodes by processing the second set of matrix blocks with the first set result; assembling the first set result and the second set result to form intermediate matrix blocks; performing the matrix operation on the intermediate matrix blocks using the plurality of processing nodes to provide an intermediate result; and determining the result of the matrix operation on the original matrix by processing the intermediate result.

Description

A METHOD OF PERFORMING A MATRIX OPERATION IN A DISTRIBUTED PROCESSING SYSTEM

Description of Invention

Introduction
Matrix operations are a fundamental building block of many computational tasks in fields as diverse as scientific computing, machine learning, and data mining. Matrix inversion is an important matrix operation, but it is difficult to implement in today's popular parallel dataflow systems, such as MapReduce. We present a method and system to optimise the operations performed on a large matrix, with reduced network bandwidth usage, and optimised disk and memory I/O.
Each element in the inverse of a matrix depends on multiple elements in the input matrix, so the computation is not easily partitionable. We disclose a method and system which integrates a parallel computing architecture and related frameworks, such as MapReduce, to implement a scalable and efficient technique for matrix operations.
Embodiments of the present invention focus on matrix inversion and rely on computing the LU (Lower-Upper) decomposition of the input matrix and using that decomposition to compute the required matrix inverse. The partitioning of the matrix takes advantage of efficient storage of data in files, on a storage medium that may be accessed simultaneously. The integration of the parallel computing architecture and specific framework with the proposed technique results in an efficient method of inverting a matrix. For example, in a preferred embodiment we present a technique for computing the LU decomposition and the matrix inverse using a pipeline of MapReduce jobs. We also present optimizations in aspects of the embodiment of this technique in the context of Hadoop. We show in an experimental setting that embodiments of the present invention are scalable, enabling us to invert a $10^5 \times 10^5$ matrix in 5 hours on Amazon EC2. We also show that our technique outperforms other matrix inversion methods which are not so closely integrated with the hardware architecture and framework.
Background

Parallel dataflow systems like MapReduce and Pregel have become very popular as platforms for scalable, analytical, data intensive computing.
Several methods have been developed that support matrix inversion, such as LINPACK, LAPACK, and ScaLAPACK. The LINPACK package is written in Fortran and designed for supercomputers. The LAPACK package is developed from LINPACK and is designed to run efficiently on shared-memory vector supercomputers. ScaLAPACK is a software package that aims to provide a high-performance linear algebra library for parallel distributed-memory machines, rather than the shared-memory machines targeted by LAPACK. This package provides some routines for matrix inversion. However, this package does not provide any fault tolerance.
Parallel methods for inverting some special matrices also appear in the literature. Lau, Kumar, and Venkatesh propose methods for parallel inversion of sparse symmetric positive matrices on SIMD and MIMD parallel computing platforms. These algorithms perform better than general methods that do not take into account any special properties of the input matrix. For symmetric positive definite matrices (not necessarily sparse), Bientinesi, Gunter, and Geijn present a parallel matrix inversion algorithm based on the Cholesky factorization for symmetric matrices. Their implementation is based on the Formal Linear Algebra Methodology Environment (FLAME). The implementation shows good performance and scalability, but it does not work for general matrices and is not suitable for large clusters.
Agullo et al. have shown that the LU decomposition in double precision arithmetic can reach a throughput of 500 Gflops. That work uses powerful CPUs, GPUs, and large memory. Although this method can solve the LU decomposition problem very efficiently, it is a centralized method that is not suitable for MapReduce. Moreover, it needs special hardware (GPUs) and large memory.
Zheng et al. present an implementation of LU decomposition on a multi-core digital signal processor that does pre-fetching and pre-shuffling in MapReduce. The method is very simple and only runs row operations in reduce tasks, using the LU decomposition method on a single node. That is, one reduce task computes one row as in lines 10-12 in method 1 below, so that the method needs n MapReduce tasks to decompose an n x n matrix, which represents very poor scalability.
Matrix inversion using LU decomposition has been investigated recently by Dongarra et al., where a tile data layout is used to compute the LU decomposition and the upper triangular matrix inversion. In that paper, a run time environment called QUARK is used to dynamically schedule numerical kernels on the available processing units in order to reduce the synchronization required between different CPU cores. The method is suitable for multi-core architectures with shared memory and it achieves better performance than other numerical libraries, such as LAPACK and ScaLAPACK. However, this method is not suitable for a distributed environment, since it relies on large shared memory. Hence, its scalability is limited. Parallel dataflow systems like MapReduce and Pregel have become very popular as platforms for scalable, analytical, data intensive computing. These systems offer the scalability and fault tolerance that are required to run in a cloud computing environment, plus simple programming models and easy-to-use programmer interfaces. Rich software ecosystems and large user communities have developed around these systems, and they are being used for many large-scale computational tasks in diverse scientific, business, and web applications. Due to the success and popularity of MapReduce and similar parallel dataflow systems, they are being used not only for the data intensive tasks for which they were originally designed, but also for computationally intensive tasks that have traditionally been assigned to high-performance computing systems. Therefore, there is significant benefit in developing optimized implementations of computationally intensive tasks using parallel dataflow systems.
Matrix operations are a fundamental building block of many computational tasks in diverse fields including physics, bioinformatics, simulation, finite element analysis, machine learning, data mining, data analytics, imaging techniques such as nMRI, signal processing, audio analysis, and many others. In most of these fields, there is a need to scale to large matrices to obtain higher fidelity and better results (e.g., running a simulation on a finer grid, or training a machine learning model with more data). Matrix inversion is difficult to implement in parallel dataflow systems, and is non-trivial because each element in the inverse of a matrix depends on multiple elements in the input matrix, so the computation is not easily partitionable as required by the parallel dataflow systems. We address this problem and design a method and system that leverage the hardware architecture and framework using a novel partitionable matrix inversion technique. We motivate the importance of matrix inversion by presenting some of its applications in different fields. A key application of matrix inversion is solving systems of linear equations. To solve the equation $Ax = b$, where $A$ is an $n \times n$ matrix, and $x$ and $b$ are both vectors with $n$ elements, one can multiply both sides of the equation on the left by the matrix inverse $A^{-1}$ to get $x = A^{-1}b$. Matrix inversion is also related to finding eigenvalues and eigenvectors, which is a central problem in many physics and engineering applications. The eigenvalues and eigenvectors of an $n \times n$ matrix can be computed using the inverse iteration method. This method assumes that there is an approximate eigenvalue $\mu$ and an approximate eigenvector $v_0$ for matrix $A$. The method uses an iteration step to compute an increasingly accurate eigenvector, $v_{k+1} = (A - \mu I_n)^{-1} v_k \,/\, \|(A - \mu I_n)^{-1} v_k\|$, where $I_n$ is an $n \times n$ identity matrix. At any step in the iteration, the current eigenvalue can be computed as $\lambda = v^T A v / v^T v$, for real-valued $A$ and $v$. The efficiency of this iterative method relies on the ability to efficiently invert matrix $A - \mu I_n$.
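For illustration, the following minimal Python sketch applies inverse iteration as just described (NumPy is assumed here purely for the example; it is not part of the described embodiments):

```python
import numpy as np

def inverse_iteration(A, mu, v0, steps=20):
    """Refine an approximate eigenpair (mu, v0) of A by inverse iteration.

    Each step solves (A - mu*I) w = v instead of forming the inverse
    explicitly, which is numerically equivalent to multiplying v by
    (A - mu*I)^{-1} as in the formula above.
    """
    n = A.shape[0]
    M = A - mu * np.eye(n)
    v = v0 / np.linalg.norm(v0)
    for _ in range(steps):
        w = np.linalg.solve(M, v)   # w = (A - mu*I)^{-1} v
        v = w / np.linalg.norm(w)   # normalize to obtain v_{k+1}
    lam = v @ A @ v / (v @ v)       # Rayleigh quotient: lambda = v^T A v / v^T v
    return lam, v

# Usage on a small symmetric matrix:
A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = inverse_iteration(A, mu=3.5, v0=np.array([1.0, 1.0]))
```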
Matrix inversion is also widely used in computed tomography (CT). In CT, the relationship between the original image of the material ($S$) and the image ($T$) detected by the detector can be written as $T = MS$, where $M$ is the projection matrix. In order to reconstruct the original image, we can simply invert the projection matrix and calculate the product of $M^{-1}$ and $T$. As the accuracy of the detector increases, the number of image pixels increases and hence the order of the projection matrix ($M$) also increases, motivating the need for scalable matrix inversion techniques. Image reconstruction using matrix inversion can also be found in other fields such as astrophysics.
In bioinformatics, matrix inversion is used to solve the problem of protein structure prediction. A scalable matrix inversion technique would enable novel insights into the evolutionary dynamics of sequence variation and protein folding. These are but a few of the numerous applications that rely on matrix inversion. It is clear that a scalable and efficient matrix inversion method and system would be highly useful in many applications. MapReduce and the above technique for computing the inverse of the input matrix from its LU decomposition may be used, together with optimizations to improve numerical accuracy and performance in the context of Hadoop using multiple nodes and clusters of nodes. In summary, we partition the computation required for matrix inversion into independent subtasks, leveraging the hardware architecture and framework. Embodiments of the present invention can be used as a basis for implementing matrix inversion in other cluster computing systems such as Spark or DryadLINQ.
Matrix Inversion Preliminaries
We discuss the background to matrix inversion and LU decomposition to establish the context of the embodiments of the present invention.
For any matrix $A$, let $[A]_{ij}$ denote its element in the $i$-th row and the $j$-th column, and let $[A]_{[x_1..x_2][y_1..y_2]}$ denote the block defined by the beginning row $x_1$ (inclusive) and the ending row $x_2$ (exclusive), and by the beginning column $y_1$ (inclusive) and the ending column $y_2$ (exclusive).
A square matrix $A$ is a matrix with the same number of rows and columns. The number of rows and columns $n$ is called the order of the matrix. The inverse of $A$ is another square matrix, denoted by $A^{-1}$, such that $AA^{-1} = A^{-1}A = I_n$, where $I_n$ is the identity matrix of order $n$. A square matrix $A$ is invertible if and only if $A$ is non-singular, that is, $A$ is of full rank $n$. The rank of a matrix is the number of rows (or columns) in the largest collection of linearly independent rows (or columns) of a matrix. Therefore a square matrix is invertible if and only if there are no linearly dependent rows (or columns) in this matrix.
The inverse of a matrix can be computed using many methods, such as Gauss-Jordan elimination, LU decomposition (also called LU factorization), Singular Value Decomposition (SVD), and QR decomposition. Most of these methods are not easily parallelizable in a parallel dataflow system setting. We briefly discuss these methods to provide suitable background to the problem faced in matrix inversion.
Gauss-Jordan elimination is a classical and well-known method to calculate the inverse of a matrix. This method has two different variants: row elimination and column elimination. They are quite similar, so we only discuss the method using row elimination. The method first concatenates the matrix $A$ and the identity matrix $I_n$ into a new matrix $[A \mid I_n]$. Then, using elementary row operations, which include row switching, row multiplication, and row addition, the method transforms the left side to the identity matrix, which leaves the right side as the inverse of matrix $A$. Specifically,

$$[A \mid I_n] \xrightarrow{\text{row operations}} [U \mid B] \xrightarrow{\text{row operations}} [I_n \mid A^{-1}]$$

where $U$ is an upper triangular matrix (nonzero elements only on the diagonal and above). The Gauss-Jordan elimination method first converts the matrix $A$ into an upper triangular matrix in $n$ steps using linear operations on the rows of the matrix as follows. In the first step, the first row is multiplied by a constant such that the first element in this row equals 1, and the first row times a constant is subtracted from the $i$-th ($1 < i \le n$) row of $[A \mid I_n]$ such that the first element in the $i$-th ($1 < i \le n$) row is 0. In the $k$-th step, the $k$-th row is multiplied by a constant such that the $k$-th element in this row equals 1, and the $k$-th row times a constant is subtracted from the $i$-th ($k < i \le n$) row of $[A \mid I_n]$ such that the $k$-th element in the $i$-th ($k < i \le n$) row is 0. If the $k$-th element of the $k$-th row is already 0 or close to 0 before the subtraction, we first swap the $k$-th row with any row below $k$ where the $k$-th element is not 0. This is called pivoting, and it improves numerical accuracy. After $n-1$ steps, the left part of matrix $[A \mid I_n]$ is converted into an upper triangular matrix.
Next, the method converts the upper triangular matrix into an identity matrix using row operations similar to the ones described in the previous paragraph. This requires $n$ steps, and it also converts the right part into the inverse of matrix $A$. The Gauss-Jordan method uses $n^3$ multiplication operations and $n^3$ addition operations to invert an $n \times n$ matrix, which is as efficient as, if not better than, any other method.
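As a concrete single-node illustration of the procedure just described, the following Python sketch inverts a matrix by Gauss-Jordan elimination with partial pivoting (a minimal example for exposition only, not the distributed method of the embodiments):

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert A by row-reducing the augmented matrix [A | I] to [I | A^{-1}]."""
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])   # the augmented matrix [A | I_n]
    for k in range(n):
        # Pivoting: bring up the row (at or below k) with the largest k-th element.
        p = k + int(np.argmax(np.abs(M[k:, k])))
        if M[p, k] == 0.0:
            raise ValueError("matrix is singular")
        M[[k, p]] = M[[p, k]]
        M[k] /= M[k, k]                  # scale so the pivot equals 1
        for i in range(n):
            if i != k:
                M[i] -= M[i, k] * M[k]   # zero out the k-th element of row i
    return M[:, n:]                      # the right half now holds A^{-1}

A = np.array([[4.0, 7.0], [2.0, 6.0]])
print(gauss_jordan_inverse(A))           # approximately [[0.6, -0.7], [-0.2, 0.4]]
```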
However, due to the large number of steps that depend on each other in a sequential fashion, this method is difficult to parallelize in a parallel dataflow system such as MapReduce since it would require a large number of MapReduce jobs that are executed sequentially. For example, Quintana et al. propose a parallel (not MapReduce) matrix inversion algorithm based on Gauss-Jordan elimination, but this method requires n iterations so a MapReduce implementation would require a pipeline of n MapReduce jobs.
LU decomposition, also called LU factorization, first decomposes the original matrix into a product of two matrices $A = LU$, where $L$ is a lower triangular matrix that has nonzero elements only on the diagonal and below, and $U$ is an upper triangular matrix that has nonzero elements only on the diagonal and above. Since the inverse of a triangular matrix is easy to compute using back substitution, we can compute the inverse of $A$ as $U^{-1}L^{-1}$. A recursive block method to parallelize LU decomposition can be used. Instead of decomposing matrix $A$ into the product of two matrices, the SVD decomposition method decomposes matrix $A$ into the product of three matrices, $A = UWV^T$ ($V^T$ is the transpose of matrix $V$). $W$ is a diagonal matrix with only positive or zero elements. $U$ and $V$ are both orthogonal matrices (i.e., $UU^T = VV^T = I_n$), such that the inverse of $A$ can be given by $A^{-1} = VW^{-1}U^T$, where the inverse of the diagonal matrix $W$ is easily obtained by $[W^{-1}]_{ii} = 1/[W]_{ii}$ in running time $O(n)$.
The SVD method needs frequent row exchanges, which means that the computation cannot be partitioned into independent subtasks. The QR decomposition first decomposes the original matrix $A$ into a product of an orthogonal matrix $Q$ and an upper triangular matrix $R$, i.e., $A = QR$ and $A^{-1} = R^{-1}Q^T$. One way to compute the QR decomposition is using the Gram-Schmidt process. This effectively decomposes the matrix but requires computing a sequence of $n$ vectors where each vector relies on all previous vectors (i.e., $n$ steps are required).
We can parallelize matrix inversion using LU decomposition while keeping the number of operations close to optimal. LU decomposition can also be used with pivoting to improve numerical accuracy.
According to one aspect of the present invention there is provided a method of performing a matrix operation in a distributed processing system having a plurality of processing nodes including a master processing node, the method comprising recursively partitioning an original matrix into matrix blocks; identifying a first set of matrix blocks and a second set of matrix blocks; processing the first set of matrix blocks on the master node to produce a first set result; determining a second set result for the second set of matrix blocks using the plurality of processing nodes by processing the second set of matrix blocks with the first set result; assembling the first set result and the second set result to form intermediate matrix blocks; performing the matrix operation on the intermediate matrix blocks using the plurality of processing nodes to provide an intermediate result; and determining the result of the matrix operation on the original matrix by processing the intermediate result. Preferably the method further comprises determining a mapping configuration using the number of nodes, wherein the mapping configuration comprises a mapper used to control parallel processing at the nodes.
Advantageously the method further comprises recursively partitioning the matrix blocks into sub-matrix blocks using the mapper to map the matrix blocks to the plurality of nodes.
Conveniently the matrix operation is an inversion operation on the matrix and the result of the matrix operation is an inverted matrix.
Preferably, the method further comprises determining a matrix size limit for the master node; and recursively partitioning the original matrix into matrix blocks, using the mapper until the largest matrix block size is below the matrix size limit.
Advantageously the second set result is further determined using the inter- dependency of the second set of matrix blocks.
Conveniently the matrix blocks are Upper and Lower triangular matrices.
Preferably the Upper triangular matrix is stored in memory as a transposed upper triangular matrix.
Advantageously the mapping configuration is based on a distributed filing system. Conveniently the distributed filing system is based on Hadoop and uses MapReduce.
Preferably the mapping configuration reads an equal number of consecutive rows in the matrix.
Advantageously the number of matrix blocks equals the number of nodes.
Preferably the system comprises a distributed processing system having a plurality of processing nodes including a master processing node, the system comprising a partitioner module operable to recursively partition an original matrix into matrix blocks; the master processing node being operable to identify a first set of matrix blocks and a second set of matrix blocks and to process the first set of matrix blocks to produce a first set result; the plurality of processing nodes being operable to determine a second set result for the second set of matrix blocks by processing the second set of matrix blocks with the first set result; the master processing node being operable to assemble the first set result and the second set result to form intermediate matrix blocks; the plurality of processing nodes being operable to perform the matrix operation on the intermediate matrix blocks to provide an intermediate result; and the system determining the result of the matrix operation on the original matrix by processing the intermediate result.
Conveniently the distributed processing system is a parallel processing system.
Drawings
So that the present invention may be more readily understood, embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 shows an example of how the input matrix is partitioned;
Figure 2 shows a preferred embodiment of matrix data partitioning and distributing for processing to nodes;
Figure 3 shows matrix partitioning with data file numbers for LU decomposition;
Figure 4 shows the mapping configuration used by HDFS; directory "Root" is the work directory, and file "a.txt" in "Root" is the input matrix;
Figure 5 shows the MapReduce job to compute $\tilde{L}_2$, $U_2$, and $A_4 - \tilde{L}_2 U_2$;
Figure 6 shows the scalability of our algorithm, in comparison with ideal scalability (purple line), which is defined as $T(n) = T(1)/n$, where $T(n)$ is the running time on $n$ medium EC2 instances;
Figure 7 shows the running time of the optimized algorithm compared to the algorithm without the separate intermediate files optimization (blue), and without the block wrap optimization (red);
Figure 8 shows the ratio of the running time of ScaLAPACK to the running time of our algorithm; and

Figure 9 shows the MapReduce pipeline for matrix inversion.

Detailed Description
Embodiments of the present invention optimize I/O for big matrices and reuse the data in memory as many times as possible to reduce the need for reading data from disk. This has a direct impact on the number of read/write cycles to storage and on data transfer between nodes. In a preferred embodiment we use MapReduce and efficient partitioning of the matrix to achieve LU decomposition and the inversion of a matrix. The preferred embodiment provides fault tolerance through the use of MapReduce. We further extend the efficient processing of the partitioned matrix in one aspect of the preferred embodiment by utilising efficient storage strategies based on the characteristics of the partitioned matrix. We show that the preferred embodiment has better scalability than ScaLAPACK.
A preferred embodiment that uses MapReduce and Hadoop has several advantages. First, the preferred embodiment benefits from the scalability and fault tolerance of MapReduce. Second, the preferred embodiment is part of a rich and popular ecosystem of Hadoop that includes many systems and libraries such as HDFS, Pig, Mahout, Zookeeper, Cassandra, etc. Thus, our matrix inversion method can easily be adapted into different embodiments with other parts of this ecosystem as needed. Finally, it is important to note that matrix inversion is often not a standalone activity, but rather one component of a rich data analysis workflow.
In a further preferred embodiment such a workflow is implemented in Hadoop: the input matrix to be inverted would be generated by a MapReduce job and stored in HDFS, and the inverse matrix produced as output also needs to be stored in HDFS to be consumed by another MapReduce job. Therefore, while it may be possible to implement matrix inversion using other parallelization platforms such as MPI, a MapReduce matrix inversion technique that can be used as a pluggable component in complex Hadoop data analysis workflows is highly desirable. It may be possible in an aspect of the preferred embodiment to switch between MapReduce and MPI to perform scalable matrix inversion in these workflows. The preferred embodiment to invert a matrix uses the LU decomposition of the input matrix. LU decomposition enables us to partition the computation in a way that is suitable for parallel dataflow systems, such as MapReduce, and to leverage efficient use of storage and processing to invert the matrix. The mapping may be arranged such that the network traffic and I/O from disks are reduced. In particular, the LU decomposition of the input matrix uses subtasks that do not communicate among each other except through intermediate results, which are arranged to exactly match the way mappers and reducers communicate in MapReduce. In addition, the LU decomposition technique is carefully designed to partition the computation into subtasks that require the same amount of computation on nodes, which ensures that the load among the nodes using mappers and reducers is balanced and no map or reduce task can hold up the computation. In a further aspect we assign data file storage based on the subtasks and sub-matrices to improve efficiency using the hardware architecture and framework. The preferred embodiment LU decomposition method may be implemented as a pipeline of MapReduce jobs, where the number of jobs in the pipeline and the data movement between the jobs can be precisely determined before the start of the computation.
LU decomposition is used for the matrix inversion problem. We can compute the lower triangular matrix $L$ and the upper triangular matrix $U$ in both single node and parallel node settings. We may use a configuration mapper to distribute and determine the I/O strategy, nodes, and file allocation, although the distribution of matrix data, I/O strategy, nodes, and file allocation may also be implemented individually. LU decomposition on a single node is known. We study the single node method before presenting an embodiment of the present invention. Single node LU decomposition has two variants: with and without pivoting. Since pivoting can significantly improve numerical accuracy, we only discuss the variant with pivoting (the row having the maximum $j$-th element among rows $j$ to $n$ is selected in the $j$-th loop).

Method 1: LU decomposition on a single node
1: function LUDecomposition($A$)
2: for $i = 1$ to $n$ do
3: $j = \{ j \mid |[A]_{ji}| = \max(|[A]_{ii}|, |[A]_{i+1,i}|, \ldots, |[A]_{ni}|) \}$
4: Add $j$ to $P$
5: Swap the $i$-th row with the $j$-th row if $i \ne j$
6: for $j = i + 1$ to $n$ do
7: $[A]_{ji} = [A]_{ji} / [A]_{ii}$
8: end for
9: for $j = i + 1$ to $n$ do
10: for $k = i + 1$ to $n$ do
11: $[A]_{jk} = [A]_{jk} - [A]_{ji} [A]_{ik}$
12: end for
13: end for
14: end for
15: return $(A, P)$ /* i.e., return $(L, U, P)$ */
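The following Python sketch is a direct single-node transcription of Method 1 (NumPy is used only for array storage; this is an illustrative reading of the pseudocode, not the patented MapReduce implementation). As in the text, $L$ and $U$ are stored in place of $A$ and the pivots are recorded separately:

```python
import numpy as np

def lu_decompose(A):
    """In-place LU decomposition with partial pivoting (Method 1).

    On return, the strict lower triangle holds L (its unit diagonal is
    implicit) and the upper triangle holds U; piv records the pivot rows.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = []
    for i in range(n):
        j = i + int(np.argmax(np.abs(A[i:, i])))   # pivot row (line 3)
        piv.append(j)
        if j != i:
            A[[i, j]] = A[[j, i]]                  # swap rows (line 5)
        A[i+1:, i] /= A[i, i]                      # column of L (lines 6-8)
        # rank-1 update of the trailing submatrix (lines 9-13)
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return A, piv
```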
Let $a_{ij} = [A]_{ij}$, $l_{ij} = [L]_{ij}$, and $u_{ij} = [U]_{ij}$. The LU decomposition $A = LU$ can be presented as follows (the blank elements in the $L$ and $U$ matrices are zeros):

$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} l_{11} & & & \\ l_{21} & l_{22} & & \\ \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ & u_{22} & \cdots & u_{2n} \\ & & \ddots & \vdots \\ & & & u_{nn} \end{pmatrix} \qquad (2)$$
This matrix multiplication can be viewed as a system of linear equations. Since the difference between the number of unknown arguments ($l_{ij}$ and $u_{ij}$) and the number of equations is $n$, there are $n$ free arguments that can be set to any value. Generally, these $n$ free arguments are chosen to be $l_{ii}$ or $u_{ii}$ ($i = 1, \ldots, n$) and they are all set to be 1.0. We set all $l_{ii}$ to 1.0. The other remaining unknown arguments can be derived using the following equations:
$$u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj} \quad (i \le j), \qquad l_{ij} = \frac{1}{u_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj} \right) \quad (i > j) \qquad (3)$$
In order to improve numerical accuracy, the rows of the original matrix $A$ are permuted, and we decompose the pivoted (i.e., permuted) matrix $PA$ instead of the original one. That is, we find $L$ and $U$ such that $PA = LU$. The permutation matrix $P$ is a square binary matrix, in each row and column of which there is exactly one entry that is 1, and 0's elsewhere. Multiplying this matrix into $A$ will permute the rows or columns of $A$. It should be noted that pivoting in LU decomposition does not affect the final inverse of matrix $A$ because we can apply the permutation matrix $P$ to the product of the inverses of $L$ and $U$. That is, we can compute $U^{-1}L^{-1}P$ to obtain the inverse of the original matrix $A$, since $U^{-1}L^{-1}PA = I_n$. Computing $U^{-1}L^{-1}P$ is equivalent to permuting the columns in $U^{-1}L^{-1}$ according to $P$.
In method 1, the result lower triangular matrix and upper triangular matrix are stored in place of the input matrix $A$. That is, at the end of method 1, the lower triangle of $A$ will be replaced with the lower triangle of $L$ (excluding the diagonal elements of $L$, which are all set to 1.0 and do not need to be stored), and the upper triangle of $A$ will be replaced with the upper triangle of $U$. Since there is only one nonzero element in each row or column of the permutation matrix $P$, the permutation of rows can be stored in an array $S$, where $[S]_i$ indicates the permuted row number for the $i$-th row of $A$. After decomposing $A$ into $L$ and $U$, we need to invert $L$ and $U$ separately. The inverse of a lower triangular matrix is given by

$$[L^{-1}]_{ij} = \begin{cases} \dfrac{1}{l_{ii}} & i = j \\[6pt] -\dfrac{1}{l_{ii}} \displaystyle\sum_{k=j}^{i-1} l_{ik} [L^{-1}]_{kj} & i > j \\[6pt] 0 & i < j \end{cases} \qquad (4)$$
The inverse of the upper triangular matrix can be computed similarly. In the preferred embodiment, in one optimised aspect, we invert the upper triangular matrix $U$ by computing the inverse of $U^T$, which is a lower triangular matrix.
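A minimal single-node Python sketch of Equation 4 follows (illustrative only; in the embodiments the independent columns would be computed in parallel). Inverting $U$ reuses the same routine on $U^T$, as just described:

```python
import numpy as np

def lower_triangular_inverse(L):
    """Invert a lower triangular matrix column by column using Equation 4.

    Each column j of the inverse depends only on L and on earlier entries
    of the same column, never on other columns, so the columns can be
    computed independently.
    """
    n = L.shape[0]
    X = np.zeros_like(L, dtype=float)
    for j in range(n):                   # columns are independent
        X[j, j] = 1.0 / L[j, j]
        for i in range(j + 1, n):
            X[i, j] = -np.dot(L[i, j:i], X[j:i, j]) / L[i, i]
    return X

def upper_triangular_inverse(U):
    # Invert U by inverting the lower triangular U^T, then transposing back.
    return lower_triangular_inverse(U.T).T
```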
The master computer node (200) controls the components of the system to distribute and map work to nodes using mapping jobs (208).
We split the input matrix (201) into blocks (202) to integrate the computation of the LU decomposition in a parallel processing system, where the system nodes and file system are correlated to the blocks, which are also known as matrix blocks. The block method splits the input matrix (201): the lower triangular matrix $L$ and the upper triangular matrix $U$ are both preferably split into three submatrices, while the original matrix $A$ is split into four submatrices. The number of blocks or sub-matrices is preferably determined using the number of available nodes (202 may be determined using 209). These smaller matrices satisfy the following equations:

$$L_1 U_1 = P_1 A_1, \qquad L_1 U_2 = P_1 A_2, \qquad \tilde{L}_2 U_1 = A_3, \qquad L_3 U_3 = P_2 (A_4 - \tilde{L}_2 U_2) \qquad (5)$$

where both $P_1$ and $P_2$ are permutations of rows. The entire LU decomposition can be represented as

$$PA = \begin{pmatrix} P_1 & 0 \\ 0 & P_2 \end{pmatrix} \begin{pmatrix} A_1 & A_2 \\ A_3 & A_4 \end{pmatrix} = \begin{pmatrix} L_1 & 0 \\ L_2 & L_3 \end{pmatrix} \begin{pmatrix} U_1 & U_2 \\ 0 & U_3 \end{pmatrix} = LU$$

where $P$ is also a permutation of rows, obtained by augmenting $P_1$ and $P_2$, and $L_2 = P_2 \tilde{L}_2$.
The preferred embodiment partitions the LU decomposition into two smaller LU decompositions and two linear equations that can easily be computed in parallel. The logical order of computing the $L$ and $U$ blocks on the left-hand side of Equation 5 is as follows: First, $L_1$ and $U_1$ (204) are computed from $A_1$ (203). Then, $\tilde{L}_2$ and $U_2$ (206) are computed from $L_1$, $U_1$, $A_2$, and $A_3$ (205 and 204). Third, $L_3$ and $U_3$ (207) are computed from $A_4 - \tilde{L}_2 U_2$ (206 and 205). Finally, $L_2$ (207) is computed from $\tilde{L}_2$ and $P_2$.
The computation of $L_1$ and $U_1$ (204) is considered first. If submatrix $A_1$ is small enough, e.g., in the preferred embodiment of order $10^3$ or less (this is dependent on the specification of the node and its capability), $A_1$ can be decomposed into $L_1$ and $U_1$ on a single node very efficiently (about 1 second on a typical modern computer) (203). In the preferred embodiment using MapReduce, we preferably decompose such small matrices in the MapReduce master node using method 1. If submatrix $A_1$ is not small enough to be computed on the master node, we preferably recursively partition it into smaller submatrices until the final submatrix is small enough to decompose on a single node. Note that while this is conceptually a recursive computation, the number of partitioning steps (i.e., the depth of recursion) may be pre-determined at the start of the matrix inversion process, so in the preferred embodiment we can determine a mapping configuration, which may be a predefined pipeline (Figure 9) of MapReduce jobs. In this pipeline, the input matrix is read only once and the partitioned matrix is written only once. This is achieved using a strategy based on the mapping configuration and leveraging the system architecture.
After obtaining $L_1$ and $U_1$ (204), the elements of $\tilde{L}_2$ and $U_2$ (206) can be computed using the following two equations (for simplicity, we present the equations without pivoting, since pivoting does not increase the computational complexity, although a skilled person would appreciate the use of pivoting in an aspect of the preferred embodiment):

$$[U_2]_{ij} = [A_2]_{ij} - \sum_{k=1}^{i-1} [L_1]_{ik} [U_2]_{kj}, \qquad [\tilde{L}_2]_{ij} = \frac{1}{[U_1]_{jj}} \left( [A_3]_{ij} - \sum_{k=1}^{j-1} [\tilde{L}_2]_{ik} [U_1]_{kj} \right) \qquad (6)$$
From these equations, it is clear that the elements in one row of $\tilde{L}_2$ are independent of the elements in other rows. Similarly, the elements in one column of $U_2$ are independent of the elements in other columns. We use these characteristics of the matrices to contribute in determining the mapping configuration. Each row of $\tilde{L}_2$ and each column of $U_2$ may be computed independently, so we can parallelize (208 and 209) the computation of $\tilde{L}_2$ and $U_2$. In the preferred embodiment, using MapReduce, we can use one map function (multiple copies of which are executed in parallel in multiple map tasks) (208) to compute $\tilde{L}_2$ and $U_2$ in parallel (209). We use the reduce function of the same MapReduce job to compute $A_4 - \tilde{L}_2 U_2$ in parallel (return path 209) in the reduce tasks. In one aspect of the preferred embodiment we use a strategy that relates the characteristics of the matrices with the number of data files (208) transmitted in the parallel processing system (209). It is feasible that the parallel processing system is connected using a network. This optimisation feature reduces network traffic and disk access.
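The row/column independence can be seen in the following single-process Python sketch, where each column of $U_2$ and each row of $\tilde{L}_2$ is obtained by its own triangular solve (in the embodiment these independent solves are spread across map tasks; the sketch, which assumes SciPy, only illustrates the data dependencies of Equation 6):

```python
import numpy as np
from scipy.linalg import solve_triangular

def compute_L2t_and_U2(L1, U1, A2, A3):
    """Solve L1 @ U2 = A2 and L2t @ U1 = A3 (Equation 6, no pivoting)."""
    # Each column of U2 depends only on L1 and the matching column of A2.
    U2 = np.column_stack([
        solve_triangular(L1, A2[:, j], lower=True, unit_diagonal=True)
        for j in range(A2.shape[1])
    ])
    # Each row of L2t depends only on U1 and the matching row of A3:
    # row @ U1 = A3[i, :]  is equivalent to  U1.T @ row.T = A3[i, :].T.
    L2t = np.vstack([
        solve_triangular(U1.T, A3[i, :], lower=True)
        for i in range(A3.shape[0])
    ])
    return L2t, U2
```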
After obtaining $A_4 - \tilde{L}_2 U_2$, we decompose it into $L_3$ and $U_3$. If $A_4 - \tilde{L}_2 U_2$ is small enough, $L_3$ and $U_3$ are computed on the MapReduce master node using method 1. Otherwise, $A_4 - \tilde{L}_2 U_2$ is further partitioned and the computation proceeds recursively. As with $A_1$, the number of partitioning steps can be precomputed and the recursive computation can be implemented by a predefined pipeline of MapReduce jobs.
One difference between $A_1$ and $A_4 - \tilde{L}_2 U_2$ is that $A_1$ can be read from the input matrix and completely partitioned into as many pieces as necessary before the LU decomposition starts, while $A_4 - \tilde{L}_2 U_2$ can be computed and partitioned only after $\tilde{L}_2$ and $U_2$ are computed. Note that while $A_1$ and $A_4 - \tilde{L}_2 U_2$ may need additional partitioning, $A_2$ and $A_3$ never need additional partitioning due to the easily parallelizable nature of computing $\tilde{L}_2$ and $U_2$ (206 and 207) using Equation 6.
The pseudocode of block LU decomposition is shown in method 2. In this method, at the end of the block decomposition, the permutation matrix $P$ is obtained by augmenting $P_1$ and $P_2$. The lower triangular matrix $L$ is obtained by augmenting $L_1$, $\tilde{L}_2$ permuted by $P_2$, and $L_3$ (210). The upper triangular matrix $U$ is obtained by augmenting $U_1$, $U_2$, and $U_3$ (210) and (100).
Method 2: Block LU decomposition
1: function BlockLUDecom($A$)
2: if $A$ is small enough then
3: $(L, U, P)$ = LUDecomposition($A$)
4: else
5: Partition $A$ into $A_1$, $A_2$, $A_3$, $A_4$
6: $(L_1, U_1, P_1)$ = BlockLUDecom($A_1$)
7: Compute $U_2$ from $A_2$, $L_1$ and $P_1$
8: Compute $\tilde{L}_2$ from $A_3$ and $U_1$
9: Compute $B = A_4 - \tilde{L}_2 U_2$
10: $(L_3, U_3, P_2)$ = BlockLUDecom($B$)
11: $P$ = Combination of $P_1$ and $P_2$
12: $L$ = Combination of $L_1$, $\tilde{L}_2$, $L_3$, and $P_2$
13: $U$ = Combination of $U_1$, $U_2$, and $U_3$
14: end if
15: return $(L, U, P)$
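For illustration, a compact single-process Python transcription of Method 2 is given below. Pivoting is omitted for brevity (an assumption that departs from the embodiment, which pivots throughout), and `nb` plays the role of the bound value $n_b$:

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_nopivot(A):
    """Unblocked LU without pivoting (a stand-in for Method 1's base case)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n):
        A[i+1:, i] /= A[i, i]            # assumes no zero pivot is encountered
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A)

def block_lu(A, nb=64):
    """Recursive block LU of Method 2 (no pivoting): returns L, U with A = L @ U."""
    n = A.shape[0]
    if n <= nb:                          # small enough for a single node (line 2)
        return lu_nopivot(A)
    h = n // 2
    A1, A2, A3, A4 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    L1, U1 = block_lu(A1, nb)                                       # line 6
    U2 = solve_triangular(L1, A2, lower=True, unit_diagonal=True)   # L1 U2 = A2
    L2 = solve_triangular(U1.T, A3.T, lower=True).T                 # L2 U1 = A3
    L3, U3 = block_lu(A4 - L2 @ U2, nb)                             # lines 9-10
    L = np.block([[L1, np.zeros((h, n - h))], [L2, L3]])            # line 12
    U = np.block([[U1, U2], [np.zeros((n - h, h)), U3]])            # line 13
    return L, U
```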
The preferred embodiment's MapReduce implementation of block LU decomposition consists of a pipeline of MapReduce jobs. MapReduce implementations of other matrix inversion methods such as Gauss-Jordan elimination or QR decomposition would also consist of pipelines of MapReduce jobs. However, a key advantage of block LU decomposition in the preferred embodiment is that the number of iteration steps (i.e., the number of MapReduce jobs in the pipeline) can be reduced by LU decomposing small matrices on one computer, or node, preferably the master node.
If $n_b$ is the maximum order of a matrix that can be LU decomposed on a single computer in a few seconds, then block LU decomposition requires around $n/n_b$ iterations (modulo rounding if $n$ is not a power of 2 and is not divisible by $n_b$). The present invention in the preferred embodiment reduces the number of jobs from $n$ to around $n/n_b$ by correlating the sub-matrices (or blocks) with the number of nodes. Preferably the matrix data files also correlate to the matrix characteristics, where a matrix may include a sub-matrix or matrix block.
The computation in block LU decomposition proceeds one block at a time, and we have around $n/n_b$ blocks (202).
We consider the advantages of the present invention in the preferred embodiment against alternative methods in the experiments. In the experiments we use $n_b = 3200$. For this $n_b$, inverting a matrix with $n = 10^5$ requires 32 iterations using block LU decomposition, as opposed to $10^5$ iterations using, say, QR decomposition. After obtaining the lower triangular matrix $L$ and the upper triangular matrix $U$, we compute the inverses (211) of these two matrices using Equation 4. A column of the matrix inverse is independent of other columns of the inverse, and the columns can be computed independently in parallel. We use the mapping configuration and mappers to determine the inverses. After computing the inverses of $L$ and $U$, the inverse of the original matrix can be obtained by multiplying $U^{-1}$ by $L^{-1}$, which can also be done in parallel (208 and 209), and then permuting the resulting matrix according to array $S$ (recall that $S$ is a compact representation of $P$). That is, $[U^{-1}L^{-1}]_{ij} = \sum_{k=1}^{n} [U^{-1}]_{ik} [L^{-1}]_{kj}$.
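Putting these pieces together, a serial Python sketch of the overall inversion (a single-node stand-in for the MapReduce pipeline, reusing `lu_decompose` and the triangular-inverse sketches given earlier):

```python
import numpy as np

def invert_via_lu(A):
    """Compute A^{-1} = U^{-1} L^{-1} P from the in-place LU factors."""
    n = A.shape[0]
    LU, piv = lu_decompose(A)            # PA = LU, factors stored in place
    L = np.tril(LU, -1) + np.eye(n)      # the unit diagonal of L is implicit
    U = np.triu(LU)
    Linv = lower_triangular_inverse(L)
    Uinv = upper_triangular_inverse(U)
    # Rebuild P by replaying the recorded row swaps on the identity matrix
    # (the text instead stores this compactly as the array S).
    P = np.eye(n)
    for i, j in enumerate(piv):
        P[[i, j]] = P[[j, i]]
    return Uinv @ Linv @ P               # right-multiplying by P permutes columns

A = np.array([[4.0, 3.0], [6.0, 3.0]])
print(invert_via_lu(A) @ A)              # approximately the identity matrix
```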
We discuss the preferred embodiment by using examples. The preferred embodiment in the example uses MapReduce and Hadoop: (1) We use the master compute node, or master node, to create a mapping configuration in HDFS; the mapping configuration comprises control files which are used as input files for the mappers of all MapReduce jobs. (2) We launch a MapReduce job to recursively partition the input matrix $A$ (202).
(3) We launch a series of MapReduce jobs to compute $\tilde{L}_2$, $U_2$ (206 and 208), and $B = A_4 - \tilde{L}_2 U_2$ (return of 208) for the different partitions of $A$ as in method 2. Matrices $L_1$, $U_1$, $L_3$, and $U_3$ are computed in the master nodes of these MapReduce jobs if they are small enough (line 3 in method 2). Otherwise, they are preferably computed by other MapReduce jobs assigned by the master node (line 15 in method 2). (4) We launch a final MapReduce job to produce the final output by computing $U^{-1}$, $L^{-1}$, and $A^{-1} = U^{-1}L^{-1}P$ (212).
The number of MapReduce jobs required to compute the LU decomposition (Step 3) depends on the order $n$ of matrix $A$ and on the bound value $n_b$. Recall that $n_b$ is the order of the largest matrix that can be LU decomposed on a single node, preferably the MapReduce master node. The number of MapReduce jobs is given by $2^{\lceil \log_2 (n/n_b) \rceil}$. Thus, we have a pipeline of MapReduce jobs (200). For typical values of $n$ and $n_b$, the number of MapReduce jobs in this pipeline is small enough that coordination and bookkeeping overheads are minimal. In the experiments, the maximum number of MapReduce jobs in any pipeline was 33.
The bound value is set so that the time to LU decompose a matrix of order $n_b$ on the master node is approximately equal to the constant time required to launch a MapReduce job. If the running time to decompose a matrix of order $n_b$ on the master node is significantly less than the launch time of a MapReduce job, there will be more MapReduce jobs than necessary, and we can increase $n_b$ to reduce the number of MapReduce jobs and also reduce the total running time of LU decomposition. If the running time on the master node is significantly larger than the launch time of a job, the LU decomposition on the master node becomes the bottleneck, and we improve performance by reducing $n_b$ so that we partition the matrix into smaller submatrices. In our experimental environment, we set $n_b$ to 3200.
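A small Python helper makes the job-count formula concrete (the job count for the LU pipeline is taken from the text; adding 1 for the final inversion job matches the maximum of 33 jobs reported for $n = 10^5$ and $n_b = 3200$):

```python
import math

def lu_pipeline_jobs(n, nb):
    """Number of MapReduce jobs in the LU decomposition pipeline."""
    return 2 ** math.ceil(math.log2(n / nb))

n, nb = 10**5, 3200
print(lu_pipeline_jobs(n, nb))       # 32 jobs for the LU decomposition
print(lu_pipeline_jobs(n, nb) + 1)   # 33 including the final inversion job
```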
The preferred embodiment uses the Java data type double (64 bits) for the input matrix and calculations. This data type provides sufficient numerical accuracy, although the present invention and embodiments are not limited to this data type.
The master compute node controls the entire workflow (200), preferably using a mapping configuration, although individual control of workflow components is also possible. The master compute node creates $m_0$ (208) files in HDFS, where $m_0$ is the number of compute nodes (209). These files are used as input files for all launched MapReduce jobs and they are stored in "Root/MapInput/" (400). The purpose of these files is to provide an identifier for each mapper, so that the mapper can determine its role based on this identifier. Each file contains one integer: the first file A.0 contains 0, the second file A.1 contains 1, and the last file A.$m_0{-}1$ contains $m_0 - 1$. The mappers use these files to control the computation, and they produce the output required for inverting the input matrix by writing directly to HDFS. To leverage the hardware architecture and determine the mapping configuration, or control components, we discuss how the input matrix is partitioned for LU decomposition in the preferred embodiment and how the different partitions flow through the MapReduce pipeline. We also discuss how the input, intermediate, and output data files are stored in HDFS to improve I/O efficiency. In (400) we indicate which process produces each result or intermediate data file. For example, "Map 0" means that the result is produced by the mappers in the first MapReduce job, and "Master 0" means that the result is produced by the master node of that job. Therefore, these labels also represent the computation process of the LU decomposition.
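The control-file scheme can be sketched as follows (a hypothetical Python illustration using local paths; the actual embodiment writes these files through the HDFS API):

```python
import os

def create_map_input_files(root, m0):
    """Write the m0 control files A.0 ... A.(m0-1), one integer each.

    Each mapper later reads exactly one of these files; the integer it
    finds is its identifier and determines which rows or blocks it handles.
    """
    map_input = os.path.join(root, "MapInput")
    os.makedirs(map_input, exist_ok=True)
    for i in range(m0):
        with open(os.path.join(map_input, f"A.{i}"), "w") as f:
            f.write(str(i))

create_map_input_files("Root", m0=8)
```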
In the preferred embodiment a MapReduce job is used to partition matrix $A$. This is a map-only job where the mappers do all the work and the reduce function does nothing. The partitioning job recursively partitions $A$ into as many submatrices as needed, according to the depth of recursion implied by $n$ and $n_b$. The mappers of this job read their input files from "Root/MapInput/".
The integer read by each mapper tells that mapper which rows of the input matrix in the HDFS file "Root/a.txt" to read and partition. In order to improve I/O performance, each map function reads an equal number of consecutive rows from this file to increase I/O sequentiality. Preferably worker $j$ (the mapper assigned input file "Root/MapInput/A.$j$") reads rows $r_1 = \frac{jn}{m_0}$ to $r_2 = r_1 + \frac{n}{m_0}$ (exclusively). Preferably each block is written to "Root/A1" and "Root/A2" if the block is in the first half of $A$. Otherwise it is written to "Root/A3" and "Root/A4".
In order to improve I/O efficiency while reading submatrices from disk in subsequent MapReduce jobs, each submatrix, whether $A_1$, $A_2$, $A_3$, or $A_4$, is split into multiple parts, each of which is stored in a separate file. This ensures that there will never be multiple mappers that simultaneously read the same file. For example, in one aspect of the preferred embodiment, $A_3$ is stored in $\frac{m_0}{2}$ files because we use only half the compute nodes to compute $\tilde{L}_2$ using $A_3$, while the other half are used to compute $U_2$ using $A_2$. Therefore, $m = \frac{m_0}{2} - 1$ (400). This approach also ensures that no two mappers write data into the same file, thereby eliminating the need for synchronization between mappers and improving I/O efficiency. Mappers and reducers in subsequent jobs also write their data into independent files, so synchronization on file writes is never required and I/O efficiency is maintained. The separate files written by worker nodes are, for example, "L2/L.1" (400).
Matrix $A$ (300) is partitioned by four mappers; the square blocks surrounded by solid lines are submatrices, while the rectangular blocks divided by dashed lines are separate files storing these submatrices. In (400) we denote the file names of the partitioned submatrices in braces. The value $f_2$ is the maximum factor of $m_0$ that does not exceed $\sqrt{m_0}$. The depth $d$ of the directory structure equals the depth of data partitioning, given by

$$d = \left\lceil \log_2 \frac{n}{n_b} \right\rceil$$

and in this example the depth is 2 (400). The pseudocode of the data partitioning method for LU decomposition is given in method 3. This listing shows one map function partitioning the block of data in rows $r_1$ to $r_2$.
Method 3: Data partitioning for LU decomposition
1: function Partition($A$, $r_1$, $r_2$, $n$, $n_b$, $m$, $f_1$, $f_2$, "path")
2: /* $A$ is the original matrix. $r_1$ is the index of the beginning row to be saved by this function and $r_2$ is the index of the ending row. $n$ is the order of the matrix. $n_b$ is the bound value for data partitioning. $m$ is the number of map workers partitioning the current submatrix (e.g., the submatrix in the directory Root/A1 shown in Figure 4). $f_1$ and $f_2$ indicate that $A_4$ is partitioned into $f_1 \times f_2$ blocks according to the optimization for block wrap. */
3: if $n \le n_b$ then
4: /* Save $A_1$ */
5: Save $[A]_{[r_1..r_2][0..n]}$ to "path/A1/A.$j$"
6: else
7: if $r_2 \le \frac{n}{2}$ then
8: Partition($A$, $r_1$, $r_2$, $\frac{n}{2}$, $n_b$, $m$, $f_1$, $f_2$, "path/A1")
9: /* Save $A_2$ */
10: for $i = 0$ to $m - 1$ do
11: Save the $i$-th part of $[A]_{[r_1..r_2][\frac{n}{2}..n]}$ to "path/A2/A.$j$.$i$"
12: end for
13: else
14: /* Save $A_3$ */
15: for $i = 0$ to $m - 1$ do
16: Save the $i$-th part of $[A]_{[r_1..r_2][0..\frac{n}{2}]}$ to "path/A3/A.$j$.$i$"
17: end for
18: /* Save $A_4$ */
19: for $j_2 = 0$ to $f_2 - 1$ do
20: compute the output file index $l$ from $j$ and $j_2$
21: for $i = 0$ to $f_1 - 1$ do
22: Save the corresponding block of $[A]_{[r_1..r_2][\frac{n}{2}..n]}$ to "path/A4/A.$l$.$i$"
23: end for
24: end for
25: end if
26: end if

Partitioning submatrix $B = A_4 - \tilde{L}_2 U_2$ is handled differently from the input matrix.
This submatrix is produced by the $m_0$ reducers of a MapReduce job and stored in $m_0$ files. These files are not read simultaneously by many mappers or reducers, so instead of materializing the data partitions of $B$ after this submatrix is produced, we only record the indices of the beginning and ending row, and the beginning and ending column, of each partition in this submatrix. We also record the names of the files storing this data. Using this approach, the files in "Root/OUT/A1", "Root/OUT/A2", "Root/OUT/A3", and "Root/OUT/A4" (400) are very small (in general, less than 1 KB). The running time to partition $A_4 - \tilde{L}_2 U_2$ with such a method is quite short (less than 1 second). In the preferred embodiment we preferably partition this matrix in the master node.
After partitioning $A$, we use method 2 to compute the LU decomposition. Since $A$ has been partitioned, line 5 is ignored, and $A_1$, $A_2$, $A_3$, and $A_4$ are read from HDFS files. We launch one MapReduce job for lines 7-9 of this method. One MapReduce job is sufficient regardless of the size of the input block. The map function of this job computes $\tilde{L}_2$ and $U_2$, while the reduce function computes $A_4 - \tilde{L}_2 U_2$ (400). In the preferred embodiment we use half of the mappers to compute $\tilde{L}_2$ and the other half to compute $U_2$, since computing $\tilde{L}_2$ has the same computational complexity as computing $U_2$. Each mapper reads an input file from the directory "Root/MapInput/". If the value in this file is $\le m = \frac{m_0}{2} - 1$, the mapper computes part of $\tilde{L}_2$. Otherwise it computes part of $U_2$ (500).
(Table 1, not reproduced here: the time complexity of our LU decomposition method in comparison with the ScaLAPACK method, for an $n \times n$ matrix on $m_0$ compute nodes, where $m_0 = f_1 \times f_2$.)

If worker $j$ is computing part of $\tilde{L}_2$, this worker is assigned the file "A.$j$". The worker reads $U_1$ from the HDFS directory "Root/A1" and $A_{3,j}$ from the files "Root/A3/A.$j$.0", "Root/A3/A.$j$.1", ..., "Root/A3/A.$j$.$m$", and computes one part of $\tilde{L}_2$, which is written to "Root/L2/L.$j$".
Each mapper in this MapReduce job emits one (key, value) pair of the form $(j, j)$, where $j$ is the value read by the mapper from its input file. These (key, value) pairs are used to control which part of $A_4 - \tilde{L}_2 U_2$ each reducer should compute. In the preferred embodiment we use block wrap for matrix multiplication, so worker $j$ computes the $j$-th block of $A_4 - \tilde{L}_2 U_2$ (500). It should be noted that after obtaining $\tilde{L}_2$ and $P_2$, $L_2$ can be easily obtained by permuting $\tilde{L}_2$ based on the permutation matrix $P_2$. $L_2$ is constructed only as it is read from HDFS.
The time complexity of the LU decomposition in the preferred embodiment is shown in Table 1 above, in comparison with the ScaLAPACK method. Since the data in the preferred embodiment using MapReduce is stored in a distributed environment, the amount of data transferred is one of the main bottlenecks. The amount of data read from HDFS is the same as the amount of data transferred between compute nodes. In the MPI implementation of ScaLAPACK, large amounts of data are transferred over the network between the master and workers; it is evident from the comparison in the table that the present invention as implemented in the preferred embodiment reduces the data transferred over the network.
In the preferred embodiment one MapReduce job is used to compute the inverses of the triangular matrices $L$ and $U$ and the product of their inverses. In the map phase, the inverses of the triangular matrices are computed using Equation 4. Half of the mappers compute the inverse of $L$ and the other half compute the inverse of $U$. In order to balance load, preferably across the available nodes, the $i$-th node is used to compute the $(k \times \frac{m_0}{2} + i)$-th column of $L^{-1}$ if $i$ is less than $\frac{m_0}{2}$. If $i \ge \frac{m_0}{2}$, the node computes the $(k \times \frac{m_0}{2} + i - \frac{m_0}{2})$-th row of $U^{-1}$. Here $k$ is an integer with $0 \le k < \frac{2n}{m_0}$. Thus, each mapper computes an equal number of non-contiguous columns of $L^{-1}$ (or rows of $U^{-1}$), which is designed to ensure balanced load. For example, if there are 4 mappers computing $L^{-1}$, Mapper 0 computes columns 0, 4, 8, 12, ..., Mapper 1 computes columns 1, 5, 9, 13, ..., and so on.
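The round-robin assignment above can be stated compactly; the sketch below (a hypothetical helper mirroring the 4-mapper example) lists the columns of $L^{-1}$ handled by one mapper:

```python
def assigned_columns(mapper_id, num_mappers, n):
    """Columns of the triangular inverse handled by one mapper.

    Round-robin assignment: mapper i gets columns i, i + num_mappers, ...,
    so cheap and expensive columns are shared evenly across mappers.
    """
    return list(range(mapper_id, n, num_mappers))

print(assigned_columns(0, 4, 16))   # [0, 4, 8, 12]
print(assigned_columns(1, 4, 16))   # [1, 5, 9, 13]
```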
(Table 2, not reproduced here: the time complexity of our triangular matrix inversion and final matrix inversion, in comparison with the ScaLAPACK method, for an $n \times n$ matrix on $m_0$ compute nodes, where $m_0 = f_1 \times f_2$.)
In the reduce phase, the product of these two inverses, $U^{-1}L^{-1}$, is computed. Each reducer reads a number of columns of $L^{-1}$ and a number of rows of $U^{-1}$, and multiplies these two parts. In order to reduce read I/O we use block wrap for matrix multiplication. In order to balance load, instead of partitioning the final matrix into $f_1 \times f_2$ blocks, each of which contains consecutive rows and consecutive columns, the matrix is partitioned into grid blocks, each of which contains discrete rows and discrete columns. Worker $j$ computes the product of row $k_1 f_1 + j_1$ of $U^{-1}$ and column $k_2 f_2 + j_2$ of $L^{-1}$, where $j_1 = \lfloor j / f_2 \rfloor$ and $j_2 = j \bmod f_2$. Here $k_1$ is any of the non-negative integers that satisfy $k_1 f_1 + j_1 < m_0$, and $k_2$ is any of the non-negative integers that satisfy $k_2 f_2 + j_2 < m_0$. In the preferred embodiment implementation of LU decomposition, all outputs, $L^{-1}$, $U^{-1}$, and $A^{-1} = U^{-1}L^{-1}P$, are written to HDFS. The map function only emits integers $(j, j)$ to control the reduce tasks, and does not emit outputs. Table 2 shows the time complexity and data transfer of the preferred embodiment for the matrix inversion method, and the corresponding values in ScaLAPACK.
In aspects of the preferred embodiment we describe optimizations that aim to further reduce read and write I/O: storing intermediate data in separate files, and block wrap. A third optimization, transposed storage, aims to improve memory access locality. Aspects of the preferred embodiment of the present invention reduce at least network traffic between nodes and internal data transfers.
To reduce the amount of read and write I/O in the preferred embodiment for different MapReduce jobs, we do not combine the results, such as L1, L2, and L3, at any stage. The results are located in many different files. Method 2 writes all outputs into HDFS as separate files and skips lines 11-13. The total number of files for the final lower triangular or upper triangular matrix is N(d) = 2^d + (m0/2)(2^d - 1), where m0 is the number of compute nodes, and d is the recursive depth, which is constrained by the matrix order n, i.e., d = ⌈log2(n/nb)⌉. For example, given a square matrix A with n = 2^15, nb = 2^11 = 2048, and m0 = 64, the recursive depth d is 4 and the final lower triangular matrix L is stored in N(d) = 16 + 32 x 15 = 496 files. In the preferred embodiment these files are read into memory recursively.
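The file-count formula is easy to check in code; this minimal sketch simply reproduces the worked example above (n = 2^15, nb = 2^11, m0 = 64), with the recursion depth computed as d = ⌈log2(n/nb)⌉.

// Check of N(d) = 2^d + (m0/2)(2^d - 1) for the separate-files optimization.
public class FileCount {
    // Smallest d with nb * 2^d >= n, i.e. d = ceil(log2(n / nb)).
    static int depth(long n, long nb) {
        int d = 0;
        while ((nb << d) < n) d++;
        return d;
    }

    static long fileCount(int d, int m0) {
        return (1L << d) + (long) m0 / 2 * ((1L << d) - 1);
    }

    public static void main(String[] args) {
        int d = depth(1L << 15, 1L << 11);
        System.out.println(d);                // 4
        System.out.println(fileCount(d, 64)); // 496
    }
}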
This optimization significantly improves performance because combining intermediate files can only happen on one compute node, preferably the master node. Other compute nodes normally have to wait until the combination is completed, so keeping the intermediate files separate significantly reduces the running time.

The method in the preferred embodiment requires multiplying two matrices at different stages, for example L2 and U2, or U^-1 and L^-1. To reduce the amount of data read we use the block method for matrix multiplication. In general, in order to compute L2 U2, each compute node can read a number of rows, e.g., the i-th to j-th rows, of L2 and the entire matrix U2. The compute node then computes the i-th to j-th rows of L2 U2. If the number of compute nodes is m0, the amount of data read by each node is (1 + 1/m0) n^2 and the total data read is (m0 + 1) n^2. To optimise the matrix multiplication we reduce the amount of data read by implementing the block wrap method. In this method, L2 is divided into f1 blocks, each of which contains n/f1 consecutive rows, while U2 is divided into f2 blocks, each of which contains n/f2 consecutive columns. Using this partitioning, every block of L2 will need to be multiplied by every block of U2, and the final matrix is partitioned into f1 x f2 blocks. Each of these blocks is computed by one compute node. That is, each compute node reads n/f1 rows of L2 and n/f2 columns of U2 (one block from each matrix) and computes the product of these two blocks. f1 and f2 are chosen so that m0 = f1 x f2. The data read by each compute node is (1/f1 + 1/f2) n^2, and the total data read is (f1 + f2) n^2, which is significantly less than (m0 + 1) n^2. In order to obtain the minimum data read, we compute f1 and f2 from m0 such that |f1 - f2| is as small as possible. That is, we choose f2 ≤ f1 such that there is no other factor of m0 between f1 and f2 (a short sketch of this factor choice is given after the equation below). For example, given 64 nodes, in the naive algorithm each node reads data of size (65/64) n^2, and the total data read for all 64 nodes is 65 n^2. Using the block wrap method and f1 = f2 = 8, each node reads data of size (1/4) n^2, and the total data read for all nodes is 16 n^2, an improvement over the basic method. Matrices L2 and U2 are linearized in row-major order both in memory and in HDFS. The product of L2 and U2 is computed as follows:
(L2 U2)ij = Σ (k = 0 to n-1) L2[i x n + k] x U2[k x n + j]
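As flagged above, the choice of f1 and f2 can be sketched as follows; the helper is illustrative and only restates the factoring rule and the data-read counts already derived.

// Sketch: factor m0 = f1 * f2 with f2 <= f1 and |f1 - f2| minimal, then
// compare the total data read by the naive method, (m0 + 1) n^2, with the
// block wrap method, (f1 + f2) n^2.
public class BlockWrapFactors {
    // Returns {f1, f2} with f1 >= f2, f1 * f2 == m0 and f1 - f2 minimal.
    static int[] chooseFactors(int m0) {
        for (int f2 = (int) Math.sqrt(m0); f2 >= 1; f2--) {
            if (m0 % f2 == 0) return new int[]{m0 / f2, f2};
        }
        return new int[]{m0, 1};
    }

    public static void main(String[] args) {
        int m0 = 64;
        int[] f = chooseFactors(m0);
        System.out.println(f[0] + " x " + f[1]);                          // 8 x 8
        System.out.println("naive total read: " + (m0 + 1) + " n^2");     // 65 n^2
        System.out.println("block wrap read: " + (f[0] + f[1]) + " n^2"); // 16 n^2
    }
}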
However, when the order n of the matrices is large, each read of an element from U2 will access a separate memory page, potentially generating a TLB miss and a cache miss. If a page can hold k data items, this access pattern can generate up to n^3 + (1/k) n^2 misses for data read and matrix multiplication. In the preferred embodiment the upper triangular matrix is therefore always stored in a transposed fashion, i.e., we store U^T instead of U. The product of L2 and U2^T is computed as follows:
(L2 U2)ij = Σ (k = 0 to n-1) L2[i x n + k] x U2^T[j x n + k]
Storing the transposed upper triangular matrix reduces the number of misses to approximately (n^3 + n^2)/k, which is significantly less than in the unoptimized method and may substantially improve performance.
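A minimal sketch of the transposed multiplication, assuming both operands are row-major double arrays as described above:

// With U2 stored transposed (uT[j*n + k] == U2[k][j]), the inner loop scans
// both operands with stride 1, which is cache- and TLB-friendly.
public class TransposedMultiply {
    static double[] multiplyTransposed(double[] l2, double[] uT, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    // Sequential access to both arrays; with an untransposed
                    // U2 the second access would have stride n.
                    sum += l2[i * n + k] * uT[j * n + k];
                }
                c[i * n + j] = sum;
            }
        }
        return c;
    }
}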
To demonstrate the advantages of using the present invention in the preferred embodiment, we present an experimental evaluation and compare against the state-of-the-art alternative, namely ScaLAPACK.
The experimental evaluation is conducted on Hadoop 1.1.1. All experiments were performed on medium instances of Amazon's Elastic Compute Cloud (EC2), except for the largest matrix M4, for which large instances were used. Each medium instance has 3.7 GB of memory and 1 virtual core with 2 EC2 compute units, where each EC2 compute unit has performance similar to a 2007-era 1.0-1.2 GHz AMD Opteron or Xeon processor.

Matrix   Order    Elements (Billion)   Text (GB)   Binary (GB)   Number of Jobs
M1       20480    0.42                 8           3.2           9
M2       32768    1.07                 20          8             17
M3       40960    1.68                 40          16            17
M4       102400   10.49                200         80            33
M5       16384    0.26                 5           2             9
Table 3: Five matrices used for the experiments. We use five matrices in our experiments. All of the test matrices were randomly generated using the Random class in Java. The performance depends on the order of the input matrix and not on the data values in this matrix, so performance is not expected to change if we use real-world matrices instead of synthetic ones.
Details about the matrices used are shown in Table 3, which shows the order of each matrix, the number of elements (data type double), the size of the matrix in text format, and the size in binary format. Recall that the bound value nb used in our experiments is 3200. Table 3 shows, for this value of nb , the total number of MapReduce jobs in the preferred embodiment required for inverting each matrix. The matrices are stored in HDFS with the default replication factor of 3.
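By way of illustration only, a test matrix of a given order can be generated with the Random class and written in the binary format (8 bytes per element); the file name and row-major layout here are assumptions, not the actual generator used in the experiments.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Random;

// Sketch: generate an n x n matrix of random doubles in row-major order.
public class GenerateMatrix {
    public static void main(String[] args) throws IOException {
        int n = 1024; // e.g. 20480 for M1
        Random random = new Random(42L); // fixed seed for repeatability
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("matrix.bin")))) {
            for (long i = 0; i < (long) n * n; i++) {
                out.writeDouble(random.nextDouble());
            }
        }
    }
}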
One ideal scalability line (i.e., running time proportional to 1 over the number of nodes) has been over-plotted (600) in order to demonstrate the scalability of the preferred embodiment.
The preferred embodiment shows strong scalability, with a minor deviation from ideal scalability when the number of nodes is high. This deviation is due to the constant launch time of MapReduce jobs, since we use multiple MapReduce jobs. However, we also note that the number of MapReduce jobs is proportional to the matrix order n, while the running time is proportional to n^3. Therefore we can expect that the larger the matrix, the better the scalability (600).
The present invention can be augmented and used in conjunction with techniques for reducing the overhead of launching MapReduce jobs, such as having pools of worker processes that are reused by map and reduce tasks. These techniques can be of benefit and do not require any changes to the present invention, for example as implemented in the MapReduce pipeline. Specifically, our analysis of how finely to decompose the computation holds even under faster job launching. In order to verify the correctness of the present invention in the preferred embodiment, and to check whether the data type double is precise enough, we compute I_n - M M^-1 for matrices M1, M2, M3, and M5. We find that every element in the computed matrices is less than 10^-5, which validates that the data type double is sufficiently precise in the preferred embodiment.
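The correctness check described above amounts to measuring the largest entry of I_n - M M^-1; a minimal sketch, assuming row-major double arrays, is:

// Returns max |(I - M * Minv)_{ij}|; the text reports values below 10^-5.
public class InversionCheck {
    static double maxResidual(double[] m, double[] mInv, int n) {
        double max = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += m[i * n + k] * mInv[k * n + j];
                }
                double residual = Math.abs((i == j ? 1.0 : 0.0) - sum);
                if (residual > max) max = residual;
            }
        }
        return max;
    }
}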
The largest matrix M4 is used to further test the scalability limits, and the smallest matrix M5 is used to evaluate the optimizations of the preferred embodiment. In one aspect we consider storing intermediate data in separate files, using the preferred embodiment to illustrate the optimisation. Without this optimization, we combine all separate files of L and U in each MapReduce job in the iterative MapReduce processes. The combination happens on the master node and the combined file is written by that node into HDFS. Since the combination is a serial process done on one node, it takes a constant time to combine the files, independent of the number of compute nodes. Therefore the benefit of keeping separate intermediate files increases as the number of compute nodes increases, since the running time gets smaller while the overhead of combination remains constant.
To validate this expectation, we conduct an experiment with matrix M5 in which we compare the time taken with this optimization to the time taken by a variant of the preferred embodiment that combines the L and U files at each step. We measure the running time without this optimization on 4-64 compute nodes and compare this to the running time with the optimization. The unoptimized version (700) is close to 30% slower in some cases, demonstrating the importance of this optimization.
The block wrap method can significantly reduce read I/O, thereby improving performance. We use matrix M5 to demonstrate the effect of using block wrap on performance. We measure the running time without this optimization on 4-64 compute nodes and compare this to the running time with the optimization. The larger the number of compute nodes, the larger the improvement in performance (700).
Transpose storing greatly improves the performance of our algorithm, by a factor of 2-3.
We now consider the ability of the preferred embodiment to invert a very large matrix, namely M4, which is a matrix of order 102400. We measure the running time on 128 Amazon EC2 large instances, each of which has two medium CPU cores, for a total of 256 medium CPU cores. A medium CPU core has performance similar to two 2007-era 1.0-1.2 GHz AMD Opteron or Xeon processors. The large matrix M4 is about 80 GB in size in binary representation. The preferred embodiment implementation on the EC2 large instances writes more than 500 GB of data and reads more than 20 TB of data in the 33 MapReduce jobs required to invert the matrix.
We also used 64 medium EC2 instances to invert this matrix. It took about 15 hours in this case. Analyzing the scalability of the medium instances compared to the large instances, we see that the medium instances show better scalability, based on a simple calculation. Assume for simplicity that each medium instance core has similar compute performance to a large instance core. When we used 128 large EC2 instances we were using 256 cores, whereas when we used 64 medium instances, we were using 64 cores. Thus, we have four times as many cores when using large instances. Therefore, if our algorithm had ideal scalability, the running time on large instances should be 15/4 = 3.8 hours (four times the cores should result in one quarter the running time). However, the running time we observed on large instances (5 hours) is longer than this time that assumes ideal scalability. There are two possible reasons related to EC2 for the running time being longer than expected. The first reason is that we found the performance variance between different large EC2 instances to be high, even though the instances are supposed to have similar performance. The second reason is that the data read speed on some large instances is less than the speed on the medium instances. We found that the speed of copying files between large instances is around 30-60 MB/s, while the speed of copying files between medium instances is around 60 MB/s.

The experiment demonstrates the advantages of using the present invention in the preferred embodiment, which is able to scale to such large scales in terms of both input size and number of nodes, and shows that this scalability holds in different runs on different cluster sizes. The preferred embodiment using MapReduce is able to scale even in the presence of failures. We now present a comparison with the state-of-the-art method using ScaLAPACK. ScaLAPACK is a popular library of high-performance linear algebra routines for distributed-memory message-passing computers and clusters. ScaLAPACK is an extension of LAPACK, and it has been shown to have good scalability and performance.
In this experiment, the package libscalapack-mpi-dev in Ubuntu is used. The version of MPI used is MPICH. The driver routines PDGETRF and PDGETRI in ScaLAPACK are used to compute the LU decomposition and the triangular matrix inverse, respectively. In order to reduce the data transfer between compute nodes in ScaLAPACK, we use an optimization similar to our block wrap optimization. In particular, we set the process grid to f1 x f2, where m0 = f1 x f2 is the number of compute nodes, f1 ≤ f2, and there is no factor of m0 between f1 and f2, which means that the matrix is partitioned into f1 x f2 blocks.
The matrix is first partitioned into blocks of dimension 128 x 128, since we found that this size provides the best performance in our experiments. Next, these blocks are assigned to the process grid. In order to improve load balancing, the blocks are assigned as follows: the block in row f1 x m1 + i and column f2 x m2 + j is assigned to the (f2 x i + j)-th compute node, where m1, m2, i, and j are integers constrained by the following inequalities: f1 x m1 + i < n/128, f2 x m2 + j < n/128, i < f1, and j < f2, where n is the order of the matrix. In our ScaLAPACK implementation, all intermediate data is stored in memory, such that the matrix is read only once and written only once.
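The block-cyclic assignment just described can be sketched as follows; the linearized rank f2 x i + j is a reconstruction of garbled text and should be treated as an assumption.

// Sketch of a 2D block-cyclic assignment over an f1 x f2 process grid:
// the 128 x 128 block (blockRow, blockCol) goes to process
// (blockRow mod f1, blockCol mod f2), with linear rank f2*(blockRow mod f1)
// + (blockCol mod f2).
public class BlockCyclicAssignment {
    static int nodeForBlock(int blockRow, int blockCol, int f1, int f2) {
        int i = blockRow % f1; // grid row
        int j = blockCol % f2; // grid column
        return f2 * i + j;
    }

    public static void main(String[] args) {
        // 64 nodes as an 8 x 8 process grid:
        System.out.println(nodeForBlock(0, 0, 8, 8)); // 0
        System.out.println(nodeForBlock(9, 3, 8, 8)); // 11
    }
}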
The ratio of the running time of ScaLAPACK to the running time of the preferred embodiment on medium EC2 nodes for matrices M1 to M3 shows that, for small matrices, there is a slight performance penalty for using the preferred embodiment. The preferred embodiment approaches or outperforms ScaLAPACK for larger matrices and a larger number of nodes (800), and has better scalability than ScaLAPACK. To better demonstrate the scalability compared to ScaLAPACK, we run another experiment in which we use ScaLAPACK on 128 large EC2 instances (256 CPU cores) and 64 medium EC2 instances (64 CPU cores) to compute the inverse of the largest matrix, M4. These are the same cluster sizes used in the example with the preferred embodiment for this matrix. ScaLAPACK took 8 hours on the large instances and more than 48 hours on the medium instances to invert matrix M4, both of which are significantly longer than the 5 hours on large instances and 15 hours on medium instances using the preferred embodiment. The present invention in the preferred embodiment thus has better scalability and performance than ScaLAPACK at high scale.
The reason the scalability and performance are better than ScaLAPACK at high scale is that ScaLAPACK transfers large amounts of data over the network (Tables 1 and 2), whereas the present invention in the preferred embodiment reduces the amount of network traffic. At low scale, the network can accommodate the required bandwidth, but it becomes a bottleneck at high scale.
When used in this specification and claims, the terms "comprises" and "comprising" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components.
The features disclosed in the foregoing description, or the following claims, or the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for attaining the disclosed result, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.

Claims

1. A method of performing a matrix operation in a distributed processing system having a plurality of processing nodes including a master processing node, the method comprising:
recursively partitioning an original matrix into matrix blocks;
identifying a first set of matrix blocks and a second set of matrix blocks; processing the first set of matrix blocks on the master node to produce a first set result;
determining a second set result for the second set of matrix blocks using the plurality of processing nodes by processing the second set of matrix blocks with the first set result;
assembling the first set result and the second set result to form intermediate matrix blocks;
performing the matrix operation on the intermediate matrix blocks using the plurality of processing nodes to provide an intermediate result; and
determining the result of the matrix operation on the original matrix by processing the intermediate result.
2. The method of claim 1, further comprising:
determining a mapping configuration using the number of nodes, wherein the mapping configuration comprises a mapper used to control parallel processing at the nodes.
3. The method of claim 2, further comprising:
recursively partitioning the matrix blocks into sub-matrix blocks using the mapper to map the matrix blocks to the plurality of nodes.
4. The method of any preceding claim wherein: the matrix operation is an inversion operation on the matrix and the result of the matrix operation is an inverted matrix.
5. The method of any preceding claim further comprising: determining a matrix size limit for the master node; and recursively partitioning the original matrix into matrix blocks, using the mapper until the largest matrix block size is below the matrix size limit.
6. The method of any preceding claim wherein the second set result is further determined using the inter-dependency of the second set of matrix blocks.
7. The method of claims 2 to 6 wherein the matrix blocks are Upper and Lower triangular matrices.
8. The method of claim 7 wherein the Upper triangular matrix is stored in memory as a transposed upper triangular matrix.
9. The method of any one of claims 2 to 7 wherein the mapping configuration is based on a distributed filing system.
10. The method of claim 9 wherein the distributed filing system is based on Hadoop and uses MapReduce.
11. The method of any preceding claim wherein the mapping configuration reads an equal number of consecutive rows in the matrix.
12. The method of any one of the preceding claims wherein the number of matrix blocks equals the number of nodes.
13. A system to perform a matrix operation, wherein the system comprises a distributed processing system having a plurality of processing nodes including a master processing node, the system comprising: a partitioner module operable to recursively partition an original matrix into matrix blocks;
the master processing node being operable to identify a first set of matrix blocks and a second set of matrix blocks and to process the first set of matrix blocks to produce a first set result;
the plurality of processing nodes being operable to determine a second set result for the second set of matrix blocks by processing the second set of matrix blocks with the first set result;
the master processing node being operable to assemble the first set result and the second set result to form intermediate matrix blocks;
the plurality of processing nodes being operable to perform the matrix operation on the intermediate matrix blocks to provide an intermediate result; and
the system determining the result of the matrix operation on the original matrix by processing the intermediate result.
14. The system of claim 13 wherein the distributed processing system is a parallel processing system.
PCT/GB2014/051986 2013-07-08 2014-06-30 A method of performing a matrix operation in a distributed processing system WO2015004421A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB201312218A GB201312218D0 (en) 2013-07-08 2013-07-08 A method and system for processing data
GB1312218.9 2013-07-08

Publications (1)

Publication Number Publication Date
WO2015004421A1 true WO2015004421A1 (en) 2015-01-15

Family

ID=49033484

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2014/051986 WO2015004421A1 (en) 2013-07-08 2014-06-30 A method of performing a matrix operation in a distributed processing system

Country Status (2)

Country Link
GB (1) GB201312218D0 (en)
WO (1) WO2015004421A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468735A (en) * 2015-11-23 2016-04-06 武汉虹旭信息技术有限责任公司 Stream preprocessing system and method based on mass information of mobile internet
US9626736B2 (en) 2015-06-18 2017-04-18 International Business Machines Corporation Memory-aware matrix factorization
IT201800004558A1 (en) * 2018-04-16 2019-10-16 Fault resilient solver for linear systems
CN112799726A (en) * 2021-01-26 2021-05-14 上海寒武纪信息科技有限公司 Data processing device, method and related product
US11188328B2 (en) 2019-12-12 2021-11-30 International Business Machines Corporation Compute array of a processor with mixed-precision numerical linear algebra support

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN XU ET AL: "Study on parallel processing method of matrix multiplication--A method to calculate the N-th power of massive matrix", COMPUTER APPLICATION AND SYSTEM MODELING (ICCASM), 2010 INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 22 October 2010 (2010-10-22), pages V6 - 182, XP031788849, ISBN: 978-1-4244-7235-2 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626736B2 (en) 2015-06-18 2017-04-18 International Business Machines Corporation Memory-aware matrix factorization
CN105468735A (en) * 2015-11-23 2016-04-06 武汉虹旭信息技术有限责任公司 Stream preprocessing system and method based on mass information of mobile internet
IT201800004558A1 (en) * 2018-04-16 2019-10-16 Fault resilient solver for linear systems
US11188328B2 (en) 2019-12-12 2021-11-30 International Business Machines Corporation Compute array of a processor with mixed-precision numerical linear algebra support
US11755320B2 (en) 2019-12-12 2023-09-12 International Business Machines Corporation Compute array of a processor with mixed-precision numerical linear algebra support
CN112799726A (en) * 2021-01-26 2021-05-14 上海寒武纪信息科技有限公司 Data processing device, method and related product
CN112799726B (en) * 2021-01-26 2024-01-30 上海寒武纪信息科技有限公司 Data processing device, method and related product

Also Published As

Publication number Publication date
GB201312218D0 (en) 2013-08-21

Similar Documents

Publication Publication Date Title
Xiang et al. Scalable matrix inversion using mapreduce
Brown et al. Implementing molecular dynamics on hybrid high performance computers–Particle–particle particle-mesh
Gleich et al. An inner-outer iteration for computing PageRank
Dang et al. CUDA-enabled Sparse Matrix–Vector Multiplication on GPUs using atomic operations
Nishida Experience in developing an open source scalable software infrastructure in Japan
Anzt et al. Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product.
Ma et al. Optimizing sparse tensor times matrix on GPUs
WO2015004421A1 (en) A method of performing a matrix operation in a distributed processing system
Röhrig-Zöllner et al. Increasing the performance of the Jacobi--Davidson method by blocking
Joubert et al. Parallel accelerated vector similarity calculations for genomics applications
Gao et al. A multi-GPU parallel optimization model for the preconditioned conjugate gradient algorithm
Jézéquel et al. Estimation of numerical reproducibility on CPU and GPU
Fialko Parallel direct solver for solving systems of linear equations resulting from finite element method on multi-core desktops and workstations
Anzt et al. Energy efficiency and performance frontiers for sparse computations on GPU supercomputers
Lass et al. A submatrix-based method for approximate matrix function evaluation in the quantum chemistry code CP2K
Beliakov et al. A parallel algorithm for calculation of determinants and minors using arbitrary precision arithmetic
Baker et al. Preparing algebraic multigrid for exascale
Kim et al. Introduction to parallel programming and pMatlab v2. 0
Cawkwell et al. Computation of the density matrix in electronic structure theory in parallel on multiple graphics processing units
Davina et al. MPI-CUDA parallel linear solvers for block-tridiagonal matrices in the context of SLEPc’s eigensolvers
Luo et al. A fine-grained block ILU scheme on regular structures for GPGPUs
Anzt et al. Accelerating the conjugate gradient algorithm with GPUs in CFD simulations
Aliaga et al. Solving dense generalized eigenproblems on multi-threaded architectures
Proctor et al. GPU-optimized hybrid neighbor/cell list algorithm for coarse-grained MD simulations of protein and RNA folding and assembly
Hu et al. Design of a simulation model for high performance LINPACK in hybrid CPU-GPU systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14739516

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08.06.2016)

122 Ep: pct application non-entry in european phase

Ref document number: 14739516

Country of ref document: EP

Kind code of ref document: A1