CN114139108B

CN114139108B - Matrix LU decomposition vectorization calculation method of vector DSP core

Info

Publication number: CN114139108B
Application number: CN202111491342.4A
Authority: CN
Inventors: 夏宁; 李世平
Original assignee: Jiangsu Huachuang Micro System Co ltd; CETC 14 Research Institute
Current assignee: Jiangsu Huachuang Micro System Co ltd; CETC 14 Research Institute
Filing date: 2021-12-08
Publication date: 2024-07-09
Anticipated expiration: 2041-12-08

Abstract

The invention discloses a matrix LU decomposition vectorization calculation method of a vector DSP core, which comprises the following steps: s1, matrix zero filling; s2, transposing the matrix B to obtain a transposed matrix C; s3, performing row elimination on the rows Dr of the upper triangular matrix D in the transposed matrix C; s4, calculating an update matrix panel according to a formula R=R-Dr'; s5, judging whether r is equal to N-1, if not, making r=r+1, turning to step S3, and if yes, turning to step S6; s6, the vector DSP core uses a vector instruction to carry out matrix transposition to obtain an LU decomposition result of the matrix B; and S7, copying the LU decomposition result of the matrix B to the storage position of the original matrix in the DDR memory by the vector DSP core. The advantages are that: according to the calculation method, discontinuous storage access in matrix LU decomposition is converted into continuous storage access through vectorized matrix transposition operation, so that the advantage of vector loading data can be fully exerted.

Description

Matrix LU decomposition vectorization calculation method of vector DSP core

Technical Field

The invention relates to a matrix LU decomposition vectorization calculation method of a vector DSP core, belonging to the technical field of digital signal processing-DSP.

Background

LU decomposition is a classical algorithm with intensive operations, and has been in the core position for a long time due to its wide application range and important application value. Well-known LINPACK test programs test the performance of supercomputers by solving a system of linear equations, while matrix LU decomposition (LU Factorization, LU) is one of the most common methods of solving a dense system of linear equations. And LINPACK, decomposing the matrix into an upper triangular matrix and a lower triangular matrix by the matrix partitioning LU decomposition, and solving a final equation set by solving the triangular equation set twice, wherein the calculated amount of the matrix LU decomposition occupies more than 95% of the whole calculated amount. Therefore, the performance of LU decomposition directly determines the performance of solving the equation set, which is very important for the high performance computing field.

Disclosure of Invention

The invention provides a matrix LU decomposition vectorization calculation method of a vector DSP core, which adopts the following technical scheme:

a matrix LU decomposition vectorization calculation method of a vector DSP core comprises the following steps: the method comprises the following steps:

s1, completing matrix zero padding by a vector DSP core according to the bit width of a chip vector register;

S2, the vector DSP core transposes the matrix B by using a vector instruction to obtain a transposed matrix C;

S3, checking the row Dr of the upper triangular matrix D in the transposed matrix C by the vector DSP, performing row elimination according to a formula Dr=Dr/Dr [0], updating the row Dr of the matrix D, and initializing r=0;

s4, the vector DSP core calculates an update matrix panel according to a formula R=R-Dr';

s5, judging whether r is equal to N-1, if not, making r=r+1, turning to step S3, and if yes, turning to step S6;

s6, the vector DSP core uses a vector instruction to carry out matrix transposition to obtain an LU decomposition result of the matrix B;

and S7, copying the LU decomposition result of the matrix B to the storage position of the original matrix in the DDR memory by the vector DSP core.

In the computing method of the invention, the vector DSP core refers to a digital processor chip with a vector register.

Further preferably, the specific method of step S1 is as follows:

Setting a matrix to be solved as a matrix A, wherein the scale of the matrix A is M.M orders, and the first address is recorded as A00; the matrix after zero filling is marked as a matrix B, the scale is N.N-order, and the first address is marked as B00; the matrix elements of each matrix are u bytes, and the bit widths of a plurality of vector registers in the vector DSP core are w bytes;

S1.1, judging whether the line length of a matrix A is an integer multiple of the bit width w of a vector register, if so, the matrix A does not need zero padding, the matrix A and the matrix B are marked as the same matrix, N=M, B00=A00, and turning to the step S2; otherwise, go to step S1.2;

S1.2, determining the scale N of the matrix B, wherein the method comprises the following steps: condition one, N > M, condition two, (N x u)% w=0; the minimum N value is obtained as the required N value while the first condition and the second condition are met, so that a matrix B is obtained; initializing a matrix B as an all-zero matrix;

s1.3, respectively calculating the first addresses of the ith row elements of the matrix A and the matrix B:

ai [0] =a00+i×m×u; formula (1)

Bi [0] =b00+i×n×u; formula (2)

Wherein Ai [0] is the pointer variable of the i-th row element head address of the matrix A, bi [0] is the pointer variable of the i-th row element head address of the matrix B; i value range [0- (M-1) ], initializing i=0;

S1.4, copying continuous M elements at the position where the i-th row head address in the matrix A is Ai [0] to the position where the i-th row head address in the matrix B is Bi [0 ];

S1.5, judging whether i is equal to M-1, if not, making i=i+1, and turning to the step S1.3; otherwise, go to step S2.

Further preferably, the specific method of step S2 is as follows:

setting the transposed matrix as C, and marking the initial address of the matrix C as C00;

S2.1, matrix transposition is carried out according to the unit of a submatrix, and since w bytes of data can be loaded by using a vector Load instruction, the size of the submatrix is set as (w/u x w/u) order;

s2.2, respectively calculating the head addresses of all the submatrices in the matrix B and the matrix C:

Bkj [0] =b00+knu+j w formula (3)

Cjk [0] =c00+j×n×u+k×w formula (4)

Wherein Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, ckj [0] is a pointer variable of a first address of a sub-matrix in the matrix C, k is row coordinates of the sub-matrix of the matrix B or column coordinates of the sub-matrix of the matrix C, a value range [0- (N x u/w-1) ] of k is a column coordinates of the sub-matrix of the matrix B or row coordinates of the sub-matrix of the matrix C, and a value range [0- (N x u/w-1) ] of j is a value range; initializing k=0, j=0;

S2.3, loading each row of data of the submatrix from the position with the head address Bkj [0] in the matrix B by using a vector Load instruction in sequence, and respectively storing the data into a vector register;

s2.4, using a vector shuffling instruction to finish transpose calculation of a submatrix in the matrix B, and using a vector Store instruction to Store a calculation result into the submatrix with a head address Cjk [0] in the matrix C;

S2.5, judging whether j is equal to N x u/w-1, if not, making j=j+1, turning to step S2.2, and if yes, turning to step S2.6;

S2.6, judging whether k is equal to N x u/w-1, if not, making k=k+1, turning to step S2.2, and if yes, turning to step S3.

Further preferably, the specific method of step S3 is as follows:

Dividing a matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data is divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the direction from top to bottom, and marking the line head element of each line as D0[0], D1[0] and D2[0] … D N-1[0]; dr is the r line of the upper triangular matrix D, dr [0] is the line head element of the r line, wherein, the value range of r [0- (N-1) ];

S3.1, using a scalar Load instruction to read pointer variable Dr [0] of a head element of a matrix D, and storing the pointer variable Dr [0] into a general register S0;

S3.2, calculating the value of 1/Dr [0] by using a scalar floating point division instruction, storing the result into a general register S1, and broadcasting the data of the general register S1 to a vector register Z0 by using a broadcasting instruction;

s3.3, sequentially loading other element data of the Dr line by using a vector Load instruction, loading w/u elements each time, and storing the w/u elements into a vector register Z1;

s3.4, multiplying the vector data Z1 with the vector data Z0 by using a vector multiplication instruction, and storing the result into a vector register Z2;

S3.5, storing the vector data result in the vector register Z2 into the original position of the Dr row data by using a vector Store instruction;

s3.6, repeating the steps S3.3 to S3.5 until the line elimination calculation of the line Dr of the upper triangular matrix D is completed;

further preferably, the specific method of step S4 is as follows:

the matrix at the right lower part of the row head element Dr [0] is marked as R, the g row of the matrix R is marked as Rg, and the value range of g is [0- (N-R-2) ]; the part of the r-th column in the matrix C, which is positioned in the lower triangular matrix E, is marked as a column element er, the f-th element of the column element er is marked as erf, and the value range of f is [0- (N-r-2) ]; the portion of row Dr that does not include head of row element Dr [0] is denoted as Dr';

s4.1, using a scalar Load instruction to obtain an element erf of an er, storing the element erf into a general register S0, and initializing f=0;

S4.2, broadcasting the data of the general register S0 in the S4.1 into a vector register Z0 by using a broadcasting instruction;

S4.3, sequentially loading the element data of Dr' by using a vector Load instruction, loading w/u elements each time, and storing the result into a vector register Z1;

S4.4, sequentially loading the element data of Rg by using a vector Load instruction, loading w/u elements each time, and storing the result into a vector register Z2;

S4.5, executing Z2-Z1 by using a vector multiply-subtract instruction, and storing a calculation result into a vector register Z3;

s4.6, storing the result of the vector register Z3 into the home position of Rg row data;

s4.7, repeating the steps S4.3 to S4.6 until the update calculation of the row Rg of the matrix R is completed;

S4.8, judging whether g is equal to N-r-2, if not, making g=g+1, and turning to the step S4.1; if yes, finishing updating calculation of the rows Rg in the matrix R, and turning to the step S5;

Further preferably, the specific method of step S6 is as follows:

s6.1, matrix transposition is carried out according to the unit of the submatrices, and the scale of the submatrices is set to be (w/u) order;

s6.2, respectively calculating the head addresses of all the submatrices in the matrix C and the matrix B:

Cjk [0] =c00+j×n×u+k×w formula (5)

Bkj [0] =b00+knu+j w formula (6)

Wherein Cjk [0] is a pointer variable of a first address of a sub-matrix in the matrix C, bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, k is row coordinates of a sub-matrix of the matrix B or column coordinates of a sub-matrix of the matrix C, a value range [0- (N x u/w-1) ] of k is a column coordinates of a sub-matrix of the matrix B or row coordinates of a sub-matrix of the matrix C, and a value range [0- (N x u/w-1) ] of j is a value range; initializing k=0, j=0;

S6.3, loading each row of data of the submatrix from the position with the head address of Cjk [0] in the matrix C by using a vector Load instruction in sequence, and respectively storing the data into a vector register;

S6.4, using a vector shuffling instruction to finish transpose calculation of a submatrix in the matrix C, and using a vector Store instruction to Store a calculation result into a submatrix with a head address Bkj [0] in the matrix B;

S6.5, judging whether j is equal to N x u/w-1, if not, making j=j+1, turning to step S6.2, and if so, turning to step S6.6;

S6.6, judging whether k is equal to N x u/w-1, if not, making k=k+1, turning to step S6.2, and if yes, turning to step S7.

Further preferably, the specific method of step S7 is as follows:

s7.1, respectively calculating the first addresses of the ith row elements of the matrix B and the matrix A:

bi [0] =b00+i×n×u; formula (2)

Ai [0] =a00+i×m×u; formula (1)

s7.2, copying the continuous M elements of the matrix B where the i-th row head address is Bi 0 to the matrix A where the i-th row head address is Ai 0;

s7.3, judging whether i is equal to M-1, if not, making i=i+1, and turning to the step S7.1; otherwise, the calculation is ended.

Compared with the prior art, the invention has the beneficial effects that:

The method converts discontinuous storage access in matrix LU decomposition into continuous storage access through vectorized matrix transposition operation, is favorable for fully playing the advantages of vector loading data, further can complete subsequent row elimination and matrix panel updating operation by using vectorization method, and greatly improves the parallelism of data processing, thereby effectively reducing the time consumption of matrix LU decomposition.

Drawings

FIG. 1 is a flow chart of matrix LU decomposition based on vector DSP cores in an embodiment of the invention.

Fig. 2 is a schematic diagram of the present invention for matrix zero padding in an embodiment.

Fig. 3 is a schematic diagram of the invention in an embodiment for matrix transposition.

Fig. 4 is a schematic diagram of the invention in an embodiment of the line elimination process.

Fig. 5 is a schematic diagram of the invention for matrix panel update in an embodiment.

Detailed Description

The technical scheme of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

In order to make the contents of the present invention more comprehensible, the present invention is further described with reference to fig. 1 to 5 and the detailed description below.

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The matrix LU decomposition vectorization calculation method of the vector DSP core comprises the following steps:

S4, the vector DSP core calculates an update matrix panel according to a formula R=R-Dr'. E;

Example 1

In this embodiment, the vector DSP core preferably employs a chip No. 2 chip; chip No. 2 is known to those skilled in the art. Therefore, the chip No. 2 of the chip contains a 256-bit vector processing unit with strong computing power, which is used for supporting the parallel computing of intensive computing tasks. The conventional matrix LU decomposition method is implemented on chip No. 2 of the chip, and can suffer from insufficient use of hardware operation units, memory access conflicts and the like, the system structure features of the vector access mode and the concurrent processing of the vector processing unit which are not suitable for the chip of the HuaRui No. 2, it is difficult to exert the vector calculation advantage of the chip of the HuaRui No. 2, therefore, the embodiment provides a method for performing vector analysis and calculation on the matrix LU of the chip 2 with the Ci 2 by fully utilizing the capability of powerful parallel calculation and high bandwidth vector data access and storage of the chip vector processing unit with the Ci 2, and designs an LU analysis algorithm with high precision and high real-time performance.

As shown in fig. 1, in embodiment 1, a matrix LU decomposition vectorization calculation method based on chip No. 2 of the chinese Rui includes the following steps:

s1, a DSP core number 2 of HuaRui completes matrix zero padding according to the bit width of a chip vector register;

S2, the DSP core number 2 uses vector instructions to transpose the matrix B to obtain a transposed matrix C;

S3, checking the row Dr of the upper triangular matrix D in the transposed matrix C by the chip DSP of the HuaRui 2, performing row elimination according to a formula Dr=Dr/Dr [0], updating the row Dr of the matrix D, and initializing r=0;

S4, calculating an update matrix panel by a chip DSP core of the HuaRui 2 according to a formula R=R-Dr';

S6, a chip DSP core of the HuaRui 2 carries out matrix transposition by using a vector instruction to obtain an LU decomposition result of a matrix B;

and S7, copying the LU decomposition result of the matrix B to the storage position of the original matrix in the DDR memory by using the DSP core of the chip of the HuaRui 2.

The matrix LU decomposition vectorization calculation method based on the chip No. 2 of the embodiment 1 fully utilizes the capability of powerful parallel calculation and high bandwidth vector data access of the chip No. 2 vector processing unit, greatly improves the matrix LU decomposition vectorization calculation method of the chip No. 2 of the chip, and designs the LU decomposition algorithm with high precision and high real-time.

As shown in fig. 2, in embodiment 1, the DSP core No. S1-chip 2 completes matrix zero padding according to the bit width of the chip vector register, and the specific method is as follows:

Setting a matrix to be solved as a matrix A, wherein the scale of the matrix A is M.M orders, and the first address is recorded as A00; the matrix after zero filling is marked as a matrix B, the scale is N.N-order, and the first address is marked as B00; the matrix elements of each matrix are u bytes, and the bit width of a vector register in a chip of Circu 2 is w bytes;

ai [0] =a00+i×m×u; formula (1)

Bi [0] =b00+i×n×u; formula (2)

The data length of the rows and columns of matrix B obtained by matrix zero padding in embodiment 1 is an integer multiple of the bit width of the vector register, which is beneficial to increasing vectorization calculation and reducing tail processing.

As shown in fig. 3, in embodiment 1, the DSP cores S2 and 2 use vector instructions to transpose the matrix B for computation, so that the matrix B is adapted to the architecture of the chip No.2, and vector computation of LU decomposition is performed; the specific method of step S2 is as follows:

Bkj [0] =b00+knu+j w formula (3)

Cjk [0] =c00+j×n×u+k×w formula (4)

The general matrix LU decomposition method is to perform the element elimination processing according to the columns, but the column elements of the matrix are not continuous in the memory, and cannot be loaded and calculated by using the vector instruction, so that the column elements of the matrix are continuous in the memory by the matrix transposition adopted in the embodiment 1, and the vector instruction can be used to perform the subsequent row element elimination operation to replace the column element elimination operation in the general method.

As shown in fig. 4, in embodiment 1, the DSP chip S3 and the chip 2 check the row Dr of the upper triangular matrix D in the transposed matrix C, perform row elimination according to the formula dr=dr/Dr [0], update the row Dr of the matrix D, and initialize r=0; the specific method of step S3 is as follows:

Dividing a matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data is divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the direction from top to bottom, and marking the line head element of each line as D0[0], D1[0] and D2[0] … D N-1[0]; dr is the r line of the upper triangular matrix D, dr [0] is the line head element of the r line, wherein the value range of r [0- (N-1) ];

s3.1, reading a head element Dr [0] of the matrix D by using a scalar Load instruction, and storing the head element Dr [0] into a general register S0;

s3.3, using a vector Load instruction to sequentially Load other element data of a D _r line, and storing w/u elements into a vector register Z1 each time;

s3.5, using a vector Store instruction to Store the vector data result in the vector register Z2 into the original position of the Dr row data;

S3.6, repeating the steps S3.3 to S3.5 until the line elimination calculation of the line Dr of the upper triangular matrix D is completed.

As shown in fig. 5, in embodiment 1, the DSP cores of the chips S4 and wari 2 perform the calculation of the update matrix panel according to the formula r=r-Dr'; the specific method of step S4 is as follows:

The matrix at the right lower part of the row head element Dr [0] is marked as R, the g row of the matrix R is marked as Rg, and the value range of g is [0- (N-R-2) ]; the part of the r-th column in the matrix C, which is positioned in the lower triangular matrix E, is marked as a column element er, the f-th element of the column element er is marked as erf, and the value range of f [0- (N-r-2) ]; the portion of row Dr that does not include head of row element Dr [0] is denoted as Dr';

S4.3, sequentially loading element data of D _r' by using a vector Load instruction, loading w/u elements each time, and storing a result into a vector register Z1;

in embodiment 1, the DSP cores of chips S6 and celi 2 use vector instructions to perform matrix transposition to obtain LU decomposition results of matrix B; the specific method of step S6 is as follows:

Cjk [0] =c00+j×n×u+k×w formula (5)

Bkj [0] =b00+knu+j w formula (6)

In embodiment 1, the DSP core of chip No. S7 and chip No. 2 copies the LU decomposition result of matrix B to the original storage location of DDR memory; the specific method of step S7 is as follows:

bi [0] =b00+i×n×u; formula (2)

Ai [0] =a00+i×m×u; formula (1)

The invention is not related in part to the same as or can be practiced with the prior art.

As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A matrix LU decomposition vectorization calculation method of a vector DSP core comprises the following steps:

The specific method comprises the following steps:

Setting a matrix to be solved as a matrix A, wherein the scale of the matrix A is M.M orders, and the first address is recorded as A00; the matrix after zero filling is marked as a matrix B, the scale is N.N-order, and the first address is marked as B00; the matrix elements of each matrix are u bytes, and the bit width of a vector register in a vector DSP core is w bytes;

Ai [0] =a00+i×m×u; formula (1)

Bi [0] =b00+i×n×u; formula (2)

Wherein Ai [0] is the pointer variable of the first address of the i-th row element of the matrix A, bi [0] is the pointer variable of the first address of the i-th row element of the matrix B; i value range [0- (M-1) ], initializing i=0;

s1.5, judging whether i is equal to M-1, if not, making i=i+1, and turning to the step S1.3; otherwise, go to step S2;

The specific method comprises the following steps:

S2.1, matrix transposition is carried out according to the unit of a submatrix, and since w bytes of data are loaded by using a vector Load instruction, the size of the submatrix is set as (w/u) order;

bkj [0] =b00+knu+j w formula (3)

Cjk [0] =c00+j×n×u+k×w formula (4)

s2.6, judging whether k is equal to N x u/w-1, if not, making k=k+1, turning to step S2.2, and if yes, turning to step S3;

2. The matrix LU decomposition vectorization calculation method of vector DSP core according to claim 1, the specific method of step S3 is as follows:

Dividing a matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data is divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the direction from top to bottom, and marking the line head element of each line as D0[0], D1[0] and D2[0] … DN-1 [0]; dr is the r line of the upper triangular matrix D, dr [0] is the line head element of the r line, wherein, the value range of r [0- (N-1) ];

S3.3, using a vector Load instruction to sequentially Load other element data of the Dr line except the head element, and storing w/u elements into a vector register Z1 each time;

3. The matrix LU decomposition vectorization calculation method of vector DSP core according to claim 2, the specific method of step S4 is as follows:

The matrix at the right lower part of the row head element Dr [0] is marked as R, the g row of the matrix R is marked as Rg, and the value range of g is [0- (N-R-2) ]; the part of the r-th column of the matrix C, which is positioned in the lower triangular matrix E, is marked as er, and the f-th element of the er is marked as erf; the value range of f is [0- (N-r-2) ]; the portion of row Dr that does not include head of row element Dr [0] is denoted as Dr';

S4.8, judging whether g is equal to N-r-2, if not, making g=g+1, and turning to the step S4.1; if so, the update calculation of the row Rg in the matrix R is completed, and the process goes to step S5.

4. The matrix LU decomposition vectorization calculation method of vector DSP core according to claim 3, the specific method of step S6 is as follows:

cjk [0] =c00+j×n×u+k×w formula (5)

Bkj [0] =b00+knu+j w formula (6)

5. The matrix LU decomposition vectorization calculation method of vector DSP core according to claim 4, the specific method of step S7 is as follows:

Bi [0] =b00+i×n×u; formula (2)

Ai [0] =a00+i×m×u; formula (1)