CN114139108A

CN114139108A - Matrix LU decomposition vectorization calculation method of vector DSP core

Info

Publication number: CN114139108A
Application number: CN202111491342.4A
Authority: CN
Inventors: 夏宁; 李世平
Original assignee: Jiangsu Huachuang Micro System Co ltd; CETC 14 Research Institute
Current assignee: Jiangsu Huachuang Micro System Co ltd; CETC 14 Research Institute
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2022-03-04
Anticipated expiration: 2041-12-08

Abstract

The invention discloses a matrix LU decomposition vectorization calculation method of a vector DSP core, which comprises the following steps: s1, zero padding of the matrix; s2, transposing the matrix B to obtain a transposed matrix C; s3, performing row elimination on rows Dr of an upper triangular matrix D in the transposed matrix C; s4, calculating an update matrix panel according to the formula R = R-Dr'; s5, judging whether r is equal to N-1, if not, making r = r +1, and turning to step S3, if yes, turning to step S6; s6, the vector DSP core uses the vector instruction to perform matrix transposition to obtain the LU decomposition result of the matrix B; and S7, copying the LU decomposition result of the obtained matrix B to the storage position of the original matrix in the DDR memory by the vector DSP core. The advantages are that: the calculation method of the invention converts discontinuous storage access in matrix LU decomposition into continuous storage access through vectorized matrix transposition operation, and is beneficial to fully exerting the advantage of vector loading data.

Description

Matrix LU decomposition vectorization calculation method of vector DSP core

Technical Field

The invention relates to a matrix LU decomposition vectorization calculation method of a vector DSP core, and belongs to the technical field of digital signal processing-DSP.

Background

LU decomposition is a classical algorithm of intensive operation, and is always at the core position due to wide application range and important application value. The well-known LINPACK test program tests supercomputer performance by solving a system of linear equations, whereas matrix LU decomposition (LU Factorization) is one of the most common methods for solving a dense system of linear equations. The LINPACK test program decomposes the matrix into an upper triangular matrix and a lower triangular matrix through matrix partitioning LU decomposition, and then realizes final equation set solution through twice triangular equation set solution, wherein the calculated amount of matrix LU decomposition occupies more than 95% of the whole calculated amount. Therefore, the performance of LU decomposition directly determines the performance of equation system solution, and is very important for the high-performance computing field.

Disclosure of Invention

The invention provides a matrix LU decomposition vectorization calculation method of a vector DSP core, which adopts the technical scheme that:

a matrix LU decomposition vectorization calculation method of a vector DSP core comprises the following steps: the method comprises the following steps:

s1, completing matrix zero padding by the vector DSP core according to the bit width of the chip vector register;

s2, the vector DSP core transposes the matrix B by using a vector instruction to obtain a transposed matrix C;

s3, checking, by the vector DSP, a row Dr of the upper triangular matrix D in the transposed matrix C, performing row elimination according to a formula Dr/Dr [0], updating the row Dr of the matrix D, and initializing r to 0;

s4, calculating an update matrix panel by the vector DSP core according to a formula R-Dr';

s5, determining whether r is equal to N-1, if not, making r equal to r +1, and going to step S3, if yes, going to step S6;

s6, the vector DSP core uses the vector instruction to perform matrix transposition to obtain the LU decomposition result of the matrix B;

and S7, copying the LU decomposition result of the obtained matrix B to the storage position of the original matrix in the DDR memory by the vector DSP core.

In the calculation method of the invention, the vector DSP core is a digital processor chip with a vector register.

Further preferably, the method of step S1 is as follows:

setting a matrix to be solved as a matrix A, wherein the scale of the matrix is M x M, and the initial address is A00; marking the matrix after zero padding as a matrix B, wherein the scale is NxNth, and the initial address is marked as B00; matrix elements of each matrix are u bytes, and bit widths of a plurality of vector registers in a vector DSP core are w bytes;

s1.1, determining whether the row length of the matrix a is an integer multiple of the bit width w of the vector register, if so, the matrix a does not need to be zero-padded, the matrix a and the matrix B are marked as the same matrix, N is equal to M, and B00 is equal to a00, and go to step S2; otherwise, go to step S1.2;

s1.2, determining the scale N of the matrix B, wherein the method comprises the following steps: condition one, N > M, condition two, (N × u)% w ═ 0; obtaining the minimum N value as the required N value while meeting the conditions I and II to obtain a matrix B; initializing the matrix B into an all-zero matrix;

s1.3, calculating the first addresses of the ith row elements of the matrix A and the matrix B respectively:

ai [0] ═ a00+ i M u; formula (1)

Bi [0] ═ B00+ i × N × u; formula (2)

Wherein, Ai [0] is the pointer variable of the first address of the ith row element of the matrix A, and Bi [0] is the pointer variable of the first address of the ith row element of the matrix B; the value range of i is [0- (M-1) ], and i is initialized to be 0;

s1.4, copying continuous M elements at the position where the head address of the ith row in the matrix A is Ai [0] to the position where the head address of the ith row in the matrix B is Bi [0 ];

s1.5, determining whether i is equal to M-1, if not, making i ═ i +1, and then go to step S1.3; otherwise, go to step S2.

Further preferably, the method of step S2 is as follows:

setting the matrix after the conversion as C, and recording the initial address of the matrix C as C00;

s2.1, matrix transposition is carried out according to the submatrix as a unit, and the scale of the submatrix is set to be (w/u) order because a vector Load instruction can Load w byte data;

s2.2, respectively calculating the first addresses of all the submatrices in the matrix B and the matrix C:

bkj [0] ═ B00+ k × N × u + j × w formula (3)

Cjk [0] ═ C00+ j N u + k w formula (4)

Wherein, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, Ckj [0] is a pointer variable of a first address of a sub-matrix in the matrix C, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k to 0, j to 0;

s2.3, sequentially using a vector Load instruction to Load each row of data of the sub-matrix from the position with the first address Bkj [0] in the matrix B, and respectively storing the data into a vector register;

s2.4, completing transposition calculation of a sub-matrix in the matrix B by using a vector shuffle instruction, and storing a calculation result into the sub-matrix with the initial address of Cjk [0] in the matrix C by using a vector Store instruction;

s2.5, determining whether j is equal to N × u/w-1, if not, making j equal to j +1, and going to step S2.2, if so, going to step S2.6;

s2.6, determine whether k is equal to N × u/w-1, if not, let k be k +1, go to step S2.2, if so, go to step S3.

Further preferably, the method of step S3 is as follows:

dividing the matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data are divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the directions from top to bottom, respectively, marking the head element of each row as D0[0], D1[0] and D2[0] … D N-1[0 ]; dr is the r-th row of the upper triangular matrix D, and Dr [0] is the row head element of the r-th row, wherein the value range of r [0- (N-1) ];

s3.1, reading a pointer variable Dr [0] of a row head element of the matrix D by using a scalar Load instruction, and storing the pointer variable Dr [0] into a general register S0;

s3.2, calculating the value of 1/Dr [0] by using a scalar floating-point division instruction, storing the result into a general register S1, and broadcasting the data of the general register S1 to a vector register Z0 by using a broadcasting instruction;

s3.3, sequentially loading other element data of the Dr line by using a vector Load instruction, loading w/u elements each time, and storing the w/u elements into a vector register Z1;

s3.4, multiplying the vector data Z1 and the vector data Z0 by using a vector multiplication instruction, and storing the result into a vector register Z2;

s3.5, storing the vector data result in the vector register Z2 into the original position of Dr row data by using a vector Store instruction;

s3.6, repeating the step S3.3 to the step S3.5 until the row elimination element calculation of the row Dr of the upper triangular matrix D is completed;

further preferably, the method of step S4 is as follows:

recording a matrix at the lower right of a row head element Dr [0] as R, recording a g-th row of the matrix R as Rg, wherein the value range of g is [0- (N-R-2) ]; the part of the r-th column in the matrix C, which is positioned in the lower triangular matrix E, is recorded as a column element er, the f-th element of the column element er is recorded as erf, and the value range of f is [0- (N-r-2) ]; the portion of row Dr that does not include the row head element Dr [0] is denoted as Dr';

s4.1, obtaining an element erf of the er by using a scalar Load instruction, storing the element erf into a general register S0, and initializing f to be 0;

s4.2, broadcasting the data of the general register S0 in the S4.1 to the vector register Z0 by using a broadcast instruction;

s4.3, sequentially loading element data of Dr' by using a vector Load instruction, wherein w/u elements are loaded each time, and storing the result into a vector register Z1;

s4.4, sequentially loading the element data of the Rg by using a vector Load instruction, wherein w/u elements are loaded each time, and storing the result into a vector register Z2;

s4.5, executing Z2-Z1X Z0 by using a vector multiplication and subtraction instruction, and storing a calculation result into a vector register Z3;

s4.6, storing the result of the vector register Z3 into the original position of Rg row data;

s4.7, repeating the steps S4.3 to S4.6 until the updating calculation of the Rg of the row of the matrix R is completed;

s4.8, determining whether g is equal to N-r-2, if not, making g equal to g +1, and going to step S4.1; if so, ending the updating calculation of the row Rg in the matrix R, and turning to the step S5;

further preferably, the method of step S6 is as follows:

s6.1, performing matrix transposition by taking the submatrix as a unit, and setting the scale of the submatrix to be (w/u × w/u) order;

s6.2, respectively calculating the first addresses of all the submatrices in the matrix C and the matrix B:

cjk [0] ═ C00+ j N u + k w formula (5)

Bkj [0] ═ B00+ k × N × u + j × w formula (6)

Wherein, Cjk [0] is a pointer variable of a first address of a sub-matrix in the matrix C, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k to 0, j to 0;

s6.3, sequentially using a vector Load instruction to Load each row of data of the sub-matrix from the position with the initial address of Cjk [0] in the matrix C, and respectively storing the data into a vector register;

s6.4, completing transposition calculation of the submatrix in the matrix C by using a vector shuffle instruction, and storing a calculation result into the submatrix with the initial address Bkj [0] in the matrix B by using a vector Store instruction;

s6.5, determining whether j is equal to N × u/w-1, if not, making j equal to j +1, and going to step S6.2, if so, going to step S6.6;

s6.6, determine whether k is equal to N × u/w-1, if not, let k be k +1, go to step S6.2, if so, go to step S7.

Further preferably, the method of step S7 is as follows:

s7.1, calculating the first addresses of the ith row elements of the matrix B and the matrix A respectively:

bi [0] ═ B00+ i × N × u; formula (2)

Ai [0] ═ a00+ i M u; formula (1)

s7.2, copying continuous M elements at the position where the head address of the ith row in the matrix B is Bi [0] to the position where the head address of the ith row in the matrix A is Ai [0 ];

s7.3, determining whether i is equal to M-1, if not, making i equal to i +1, and going to step S7.1; otherwise, the calculation is ended.

Compared with the prior art, the invention has the beneficial effects that:

the method is a vectorization calculation method of matrix LU decomposition designed based on the hardware characteristics of a vector DSP core, non-continuous storage access in the matrix LU decomposition is converted into continuous storage access through vectorization matrix transposition operation, the advantages of vector loading data are fully exerted, further, subsequent row elimination and matrix panel updating operations can be completed by using the vectorization method, the parallelism degree of data processing is greatly improved, and therefore time consumption of the matrix LU decomposition is effectively reduced.

Drawings

FIG. 1 is a flow chart illustrating matrix LU decomposition based on vector DSP cores in an embodiment of the present invention.

FIG. 2 is a schematic diagram illustrating the zero padding of the matrix according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating matrix transposition performed in an embodiment of the present invention.

FIG. 4 is a diagram illustrating the elimination of the erasure in an exemplary embodiment of the invention.

FIG. 5 is a diagram illustrating the updating of the matrix panel according to an embodiment of the present invention.

Detailed Description

The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

In order to make the content of the present invention more comprehensible, the following description is further described with reference to fig. 1 to 5 and the detailed description.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a matrix LU decomposition vectorization calculation method of a vector DSP core, which comprises the following steps:

Example 1

In this embodiment, the vector DSP core preferably employs a core 2 chip; the chip of HuaRui 2 is known to those skilled in the art. Therefore, the HuaRui 2 chip includes a computationally intensive 256-bit vector processing unit for supporting parallel computations of intensive arithmetic tasks. When the traditional matrix LU decomposition method is implemented on a HuaRui 2 chip, the problems of insufficient use of hardware arithmetic units, memory access conflict and the like can be faced, the vector access mode of the HuaRui 2 chip and the system structural characteristics of concurrent processing of the vector processing unit are not suitable, and the vector calculation advantage of the HuaRui 2 chip is difficult to exert, so that the embodiment provides a method which can fully utilize the powerful parallel calculation and high-bandwidth vector data access capability of the vector processing unit of the HuaRui 2 chip, greatly improves the matrix LU decomposition vectorization calculation of the HuaRui 2 chip, and designs an LU decomposition algorithm with high precision and high real-time property.

As shown in fig. 1, in embodiment 1, a matrix LU decomposition vectorization calculation method based on a core 2 of a core includes the following steps:

s1, finishing matrix zero padding by the core of the HuaRui 2 DSP according to the bit width of the chip vector register;

s2, transposing the matrix B by using a vector instruction by using a HuaRui 2 DSP core to obtain a transposed matrix C;

s3, checking up row Dr of upper triangular matrix D in transposed matrix C by chip DSP of chinese Rui 2 according to formula Dr/Dr [0], updating row Dr of matrix D, and initializing r to 0;

s4, calculating an update matrix panel by a chip DSP core of the HuaRui 2 according to a formula R-R 'Dr';

s6, carrying out matrix transposition by using a DSP core of a HuaRui 2 chip through a vector instruction to obtain an LU decomposition result of a matrix B;

s7, the chip DSP core of HuaRui 2 copies the LU decomposition result of the obtained matrix B to the storage position of the original matrix in the DDR memory.

The matrix LU decomposition vectorization calculation method based on the core 2 of the embodiment 1 fully utilizes the strong parallel calculation and high bandwidth vector data access capability of the core 2 chip vector processing unit, greatly improves the LU decomposition vectorization calculation method of the core 2 chip matrix LU decomposition, and designs the LU decomposition algorithm with high precision and high real-time performance.

As shown in fig. 2, in embodiment 1, the S1 core 2-wise core completes the zero padding of the matrix according to the bit width of the chip vector register, and the specific method is as follows:

setting a matrix to be solved as a matrix A, wherein the scale of the matrix is M x M, and the initial address is A00; marking the matrix after zero padding as a matrix B, wherein the scale is NxNth, and the initial address is marked as B00; matrix elements of each matrix are u bytes, and the bit width of a vector register in a core of the Chinese Rui 2 is w bytes;

ai [0] ═ a00+ i M u; formula (1)

Bi [0] ═ B00+ i × N × u; formula (2)

In embodiment 1, the length of the row and column data of the matrix B obtained by using the matrix zero padding is an integral multiple of the bit width of the vector register, which is beneficial to increasing vectorization calculation and reducing tail processing.

As shown in fig. 3, in embodiment 1, the S2 and the core of the chinese 2 DSP transposes the matrix B by using vector instructions, so that the matrix B adapts to the architecture of the chinese 2 chip and performs the vectorization calculation of LU decomposition; the specific method of step S2 is as follows:

bkj [0] ═ B00+ k × N × u + j × w formula (3)

Cjk [0] ═ C00+ j N u + k w formula (4)

The general matrix LU decomposition method performs the elimination by columns, but the column elements of the matrix are not continuous in the memory, and cannot be loaded and calculated by using the vector instruction.

As shown in fig. 4, in embodiment 1, the S3 chip DSP core checks the row Dr of the upper triangular matrix D in the transpose matrix C, and performs row elimination according to the formula Dr/Dr [0], updates the row Dr of the matrix D, and initializes r to 0; the specific method of step S3 is as follows:

dividing the matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data are divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the directions from top to bottom, respectively, marking the head element of each row as D0[0], D1[0] and D2[0] … D N-1[0 ]; dr is the r-th row of the upper triangular matrix D, Dr [0] is the row head element of the r-th row, wherein the value range of r [0- (N-1) ];

s3.1, reading a row head element Dr [0] of the matrix D by using a scalar Load instruction, and storing the row head element Dr [0] into a general register S0;

s3.3, sequentially loading D by using vector Load instruction_rThe data of other elements of the line are loaded with w/u elements each time and are stored into a vector register Z1;

and S3.6, repeating the steps S3.3 to S3.5 until the row elimination element calculation of the row Dr of the upper triangular matrix D is completed.

As shown in fig. 5, in embodiment 1, the S4 and the core chip DSP core 2 of the chinese Rui performs calculation of the update matrix panel according to the formula R — Dr'. er; the specific method of step S4 is as follows:

s4.3, sequentially loading D by using vector Load instructions_r' the element data is loaded w/u elements at a time, and the result is stored in a vector register Z1;

in embodiment 1, S6 and the core DSP of the core 2 chip perform matrix transposition using vector instructions to obtain the LU decomposition result of matrix B; the specific method of step S6 is as follows:

cjk [0] ═ C00+ j N u + k w formula (5)

Bkj [0] ═ B00+ k × N × u + j × w formula (6)

In embodiment 1, S7, core chip DSP core 2 of core copies LU decomposition result of matrix B to original storage location of DDR memory; the specific method of step S7 is as follows:

bi [0] ═ B00+ i × N × u; formula (2)

Ai [0] ═ a00+ i M u; formula (1)

The parts not involved in the present invention are the same as or can be implemented using the prior art.

As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A matrix LU decomposition vectorization calculation method of a vector DSP core comprises the following steps:

s3, checking the row Dr of the upper triangular matrix D in the transposed matrix C by the vector DSP core, performing row elimination according to a formula Dr = Dr/Dr [0], updating the row Dr of the matrix D, and initializing r = 0;

s4, calculating an update matrix panel by the vector DSP core according to a formula R = R-Dr'. er;

s5, judging whether r is equal to N-1, if not, making r = r +1, and turning to step S3, if yes, turning to step S6;

2. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 1, wherein the specific method of step S1 is as follows:

setting a matrix to be solved as a matrix A, wherein the scale of the matrix is M x M, and the initial address is A00; marking the matrix after zero padding as a matrix B, wherein the scale is NxNth, and the initial address is marked as B00; matrix elements of each matrix are u bytes, and the bit width of a vector register in a vector DSP core is w bytes;

s1.1, judging whether the matrix row length of a matrix A is an integral multiple of the bit width w of a vector register, if so, marking the matrix A and the matrix B as the same matrix, wherein N = M, B00= A00, and turning to the step S2; otherwise, go to step S1.2;

s1.2, determining the scale N of the matrix B, wherein the method comprises the following steps: condition one, N > M, condition two, (N × u)% w = 0; obtaining the minimum N value as the required N value while meeting the conditions I and II to obtain a matrix B; initializing the matrix B into an all-zero matrix;

ai [0] = a00+ i M u; formula (1)

Bi [0] = B00+ i N u; formula (2)

Wherein, Ai [0] is the pointer variable of the first address of the ith row element of the matrix A, and Bi [0] is the pointer variable of the first address of the ith row element of the matrix B; i is in a value range [0- (M-1) ], and i =0 is initialized;

s1.5, judging whether i is equal to M-1 or not, if not, enabling i = i +1, and turning to the step S1.3; otherwise, go to step S2.

3. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 1 or 2, wherein the specific method of step S2 is as follows:

bkj [0] = B00+ k N u + j w formula (3)

Cjk [0] = C00+ j N u + k w equation (4)

Wherein, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, Ckj [0] is a pointer variable of a first address of a sub-matrix in the matrix C, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k =0, j = 0;

s2.5, judging whether j is equal to N × u/w-1, if not, enabling j = j +1, and turning to the step S2.2, and if so, turning to the step S2.6;

s2.6, determine whether k is equal to N × u/w-1, if not, let k = k +1, go to step S2.2, if so, go to step S3.

4. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 3, wherein the specific method of step S3 is as follows:

dividing the matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data are divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the directions from top to bottom, respectively, marking the head element of each row as D0[0], D1[0] and D2[0] … DN-1 [0 ]; dr is the r-th row of the upper triangular matrix D, and Dr [0] is the row head element of the r-th row, wherein the value range of r [0- (N-1) ];

s3.3, sequentially loading data of other elements except row head elements in the Dr row by using a vector Load instruction, loading w/u elements each time, and storing the w/u elements into a vector register Z1;

5. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 4, wherein the specific method of step S4 is as follows:

recording a matrix at the lower right of a row head element Dr [0] as R, recording a g-th row of the matrix R as Rg, wherein the value range of g is [0- (N-R-2) ]; the part of the r-th column in the matrix C, which is positioned in the lower triangular matrix E, is recorded as er, and the f-th element of the er is recorded as erf; the value range of f is [0- (N-r-2) ]; the portion of row Dr that does not include the row head element Dr [0] is denoted as Dr';

s4.1, obtaining an element erf of the er by using a scalar Load instruction, storing the element erf into a general register S0, and initializing f = 0;

s4.8, judging whether g is equal to N-r-2, if not, enabling g = g +1, and turning to the step S4.1; if so, the update calculation of the row Rg in the matrix R is finished, and the process goes to step S5.

6. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 5, wherein the specific method of step S6 is as follows:

cjk [0] = C00+ j N u + k w equation (5)

Bkj [0] = B00+ k N u + j w equation (6)

Wherein, Cjk [0] is a pointer variable of a first address of a sub-matrix in the matrix C, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k =0, j = 0;

s6.5, determining whether j is equal to N × u/w-1, if not, making j = j +1, and going to step S6.2, if so, going to step S6.6;

s6.6, determine whether k is equal to N × u/w-1, if not, let k = k +1, go to step S6.2, if so, go to step S7.

7. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 6, wherein the specific method of step S7 is as follows:

bi [0] = B00+ i N u; formula (2)

Ai [0] = a00+ i M u; formula (1)

s7.3, judging whether i is equal to M-1 or not, if not, enabling i = i +1, and turning to the step S7.1; otherwise, the calculation is ended.