CN114139108A - Matrix LU decomposition vectorization calculation method of vector DSP core - Google Patents

Matrix LU decomposition vectorization calculation method of vector DSP core Download PDF

Info

Publication number
CN114139108A
CN114139108A CN202111491342.4A CN202111491342A CN114139108A CN 114139108 A CN114139108 A CN 114139108A CN 202111491342 A CN202111491342 A CN 202111491342A CN 114139108 A CN114139108 A CN 114139108A
Authority
CN
China
Prior art keywords
matrix
vector
row
sub
dsp core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111491342.4A
Other languages
Chinese (zh)
Other versions
CN114139108B (en
Inventor
夏宁
李世平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Huachuang Micro System Co ltd
CETC 14 Research Institute
Original Assignee
Jiangsu Huachuang Micro System Co ltd
CETC 14 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Huachuang Micro System Co ltd, CETC 14 Research Institute filed Critical Jiangsu Huachuang Micro System Co ltd
Priority to CN202111491342.4A priority Critical patent/CN114139108B/en
Priority claimed from CN202111491342.4A external-priority patent/CN114139108B/en
Publication of CN114139108A publication Critical patent/CN114139108A/en
Application granted granted Critical
Publication of CN114139108B publication Critical patent/CN114139108B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • G06F15/8076Details on data register access

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a matrix LU decomposition vectorization calculation method of a vector DSP core, which comprises the following steps: s1, zero padding of the matrix; s2, transposing the matrix B to obtain a transposed matrix C; s3, performing row elimination on rows Dr of an upper triangular matrix D in the transposed matrix C; s4, calculating an update matrix panel according to the formula R = R-Dr'; s5, judging whether r is equal to N-1, if not, making r = r +1, and turning to step S3, if yes, turning to step S6; s6, the vector DSP core uses the vector instruction to perform matrix transposition to obtain the LU decomposition result of the matrix B; and S7, copying the LU decomposition result of the obtained matrix B to the storage position of the original matrix in the DDR memory by the vector DSP core. The advantages are that: the calculation method of the invention converts discontinuous storage access in matrix LU decomposition into continuous storage access through vectorized matrix transposition operation, and is beneficial to fully exerting the advantage of vector loading data.

Description

Matrix LU decomposition vectorization calculation method of vector DSP core
Technical Field
The invention relates to a matrix LU decomposition vectorization calculation method of a vector DSP core, and belongs to the technical field of digital signal processing-DSP.
Background
LU decomposition is a classical algorithm of intensive operation, and is always at the core position due to wide application range and important application value. The well-known LINPACK test program tests supercomputer performance by solving a system of linear equations, whereas matrix LU decomposition (LU Factorization) is one of the most common methods for solving a dense system of linear equations. The LINPACK test program decomposes the matrix into an upper triangular matrix and a lower triangular matrix through matrix partitioning LU decomposition, and then realizes final equation set solution through twice triangular equation set solution, wherein the calculated amount of matrix LU decomposition occupies more than 95% of the whole calculated amount. Therefore, the performance of LU decomposition directly determines the performance of equation system solution, and is very important for the high-performance computing field.
Disclosure of Invention
The invention provides a matrix LU decomposition vectorization calculation method of a vector DSP core, which adopts the technical scheme that:
a matrix LU decomposition vectorization calculation method of a vector DSP core comprises the following steps: the method comprises the following steps:
s1, completing matrix zero padding by the vector DSP core according to the bit width of the chip vector register;
s2, the vector DSP core transposes the matrix B by using a vector instruction to obtain a transposed matrix C;
s3, checking, by the vector DSP, a row Dr of the upper triangular matrix D in the transposed matrix C, performing row elimination according to a formula Dr/Dr [0], updating the row Dr of the matrix D, and initializing r to 0;
s4, calculating an update matrix panel by the vector DSP core according to a formula R-Dr';
s5, determining whether r is equal to N-1, if not, making r equal to r +1, and going to step S3, if yes, going to step S6;
s6, the vector DSP core uses the vector instruction to perform matrix transposition to obtain the LU decomposition result of the matrix B;
and S7, copying the LU decomposition result of the obtained matrix B to the storage position of the original matrix in the DDR memory by the vector DSP core.
In the calculation method of the invention, the vector DSP core is a digital processor chip with a vector register.
Further preferably, the method of step S1 is as follows:
setting a matrix to be solved as a matrix A, wherein the scale of the matrix is M x M, and the initial address is A00; marking the matrix after zero padding as a matrix B, wherein the scale is NxNth, and the initial address is marked as B00; matrix elements of each matrix are u bytes, and bit widths of a plurality of vector registers in a vector DSP core are w bytes;
s1.1, determining whether the row length of the matrix a is an integer multiple of the bit width w of the vector register, if so, the matrix a does not need to be zero-padded, the matrix a and the matrix B are marked as the same matrix, N is equal to M, and B00 is equal to a00, and go to step S2; otherwise, go to step S1.2;
s1.2, determining the scale N of the matrix B, wherein the method comprises the following steps: condition one, N > M, condition two, (N × u)% w ═ 0; obtaining the minimum N value as the required N value while meeting the conditions I and II to obtain a matrix B; initializing the matrix B into an all-zero matrix;
s1.3, calculating the first addresses of the ith row elements of the matrix A and the matrix B respectively:
ai [0] ═ a00+ i M u; formula (1)
Bi [0] ═ B00+ i × N × u; formula (2)
Wherein, Ai [0] is the pointer variable of the first address of the ith row element of the matrix A, and Bi [0] is the pointer variable of the first address of the ith row element of the matrix B; the value range of i is [0- (M-1) ], and i is initialized to be 0;
s1.4, copying continuous M elements at the position where the head address of the ith row in the matrix A is Ai [0] to the position where the head address of the ith row in the matrix B is Bi [0 ];
s1.5, determining whether i is equal to M-1, if not, making i ═ i +1, and then go to step S1.3; otherwise, go to step S2.
Further preferably, the method of step S2 is as follows:
setting the matrix after the conversion as C, and recording the initial address of the matrix C as C00;
s2.1, matrix transposition is carried out according to the submatrix as a unit, and the scale of the submatrix is set to be (w/u) order because a vector Load instruction can Load w byte data;
s2.2, respectively calculating the first addresses of all the submatrices in the matrix B and the matrix C:
bkj [0] ═ B00+ k × N × u + j × w formula (3)
Cjk [0] ═ C00+ j N u + k w formula (4)
Wherein, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, Ckj [0] is a pointer variable of a first address of a sub-matrix in the matrix C, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k to 0, j to 0;
s2.3, sequentially using a vector Load instruction to Load each row of data of the sub-matrix from the position with the first address Bkj [0] in the matrix B, and respectively storing the data into a vector register;
s2.4, completing transposition calculation of a sub-matrix in the matrix B by using a vector shuffle instruction, and storing a calculation result into the sub-matrix with the initial address of Cjk [0] in the matrix C by using a vector Store instruction;
s2.5, determining whether j is equal to N × u/w-1, if not, making j equal to j +1, and going to step S2.2, if so, going to step S2.6;
s2.6, determine whether k is equal to N × u/w-1, if not, let k be k +1, go to step S2.2, if so, go to step S3.
Further preferably, the method of step S3 is as follows:
dividing the matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data are divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the directions from top to bottom, respectively, marking the head element of each row as D0[0], D1[0] and D2[0] … D N-1[0 ]; dr is the r-th row of the upper triangular matrix D, and Dr [0] is the row head element of the r-th row, wherein the value range of r [0- (N-1) ];
s3.1, reading a pointer variable Dr [0] of a row head element of the matrix D by using a scalar Load instruction, and storing the pointer variable Dr [0] into a general register S0;
s3.2, calculating the value of 1/Dr [0] by using a scalar floating-point division instruction, storing the result into a general register S1, and broadcasting the data of the general register S1 to a vector register Z0 by using a broadcasting instruction;
s3.3, sequentially loading other element data of the Dr line by using a vector Load instruction, loading w/u elements each time, and storing the w/u elements into a vector register Z1;
s3.4, multiplying the vector data Z1 and the vector data Z0 by using a vector multiplication instruction, and storing the result into a vector register Z2;
s3.5, storing the vector data result in the vector register Z2 into the original position of Dr row data by using a vector Store instruction;
s3.6, repeating the step S3.3 to the step S3.5 until the row elimination element calculation of the row Dr of the upper triangular matrix D is completed;
further preferably, the method of step S4 is as follows:
recording a matrix at the lower right of a row head element Dr [0] as R, recording a g-th row of the matrix R as Rg, wherein the value range of g is [0- (N-R-2) ]; the part of the r-th column in the matrix C, which is positioned in the lower triangular matrix E, is recorded as a column element er, the f-th element of the column element er is recorded as erf, and the value range of f is [0- (N-r-2) ]; the portion of row Dr that does not include the row head element Dr [0] is denoted as Dr';
s4.1, obtaining an element erf of the er by using a scalar Load instruction, storing the element erf into a general register S0, and initializing f to be 0;
s4.2, broadcasting the data of the general register S0 in the S4.1 to the vector register Z0 by using a broadcast instruction;
s4.3, sequentially loading element data of Dr' by using a vector Load instruction, wherein w/u elements are loaded each time, and storing the result into a vector register Z1;
s4.4, sequentially loading the element data of the Rg by using a vector Load instruction, wherein w/u elements are loaded each time, and storing the result into a vector register Z2;
s4.5, executing Z2-Z1X Z0 by using a vector multiplication and subtraction instruction, and storing a calculation result into a vector register Z3;
s4.6, storing the result of the vector register Z3 into the original position of Rg row data;
s4.7, repeating the steps S4.3 to S4.6 until the updating calculation of the Rg of the row of the matrix R is completed;
s4.8, determining whether g is equal to N-r-2, if not, making g equal to g +1, and going to step S4.1; if so, ending the updating calculation of the row Rg in the matrix R, and turning to the step S5;
further preferably, the method of step S6 is as follows:
s6.1, performing matrix transposition by taking the submatrix as a unit, and setting the scale of the submatrix to be (w/u × w/u) order;
s6.2, respectively calculating the first addresses of all the submatrices in the matrix C and the matrix B:
cjk [0] ═ C00+ j N u + k w formula (5)
Bkj [0] ═ B00+ k × N × u + j × w formula (6)
Wherein, Cjk [0] is a pointer variable of a first address of a sub-matrix in the matrix C, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k to 0, j to 0;
s6.3, sequentially using a vector Load instruction to Load each row of data of the sub-matrix from the position with the initial address of Cjk [0] in the matrix C, and respectively storing the data into a vector register;
s6.4, completing transposition calculation of the submatrix in the matrix C by using a vector shuffle instruction, and storing a calculation result into the submatrix with the initial address Bkj [0] in the matrix B by using a vector Store instruction;
s6.5, determining whether j is equal to N × u/w-1, if not, making j equal to j +1, and going to step S6.2, if so, going to step S6.6;
s6.6, determine whether k is equal to N × u/w-1, if not, let k be k +1, go to step S6.2, if so, go to step S7.
Further preferably, the method of step S7 is as follows:
s7.1, calculating the first addresses of the ith row elements of the matrix B and the matrix A respectively:
bi [0] ═ B00+ i × N × u; formula (2)
Ai [0] ═ a00+ i M u; formula (1)
Wherein, Ai [0] is the pointer variable of the first address of the ith row element of the matrix A, and Bi [0] is the pointer variable of the first address of the ith row element of the matrix B; the value range of i is [0- (M-1) ], and i is initialized to be 0;
s7.2, copying continuous M elements at the position where the head address of the ith row in the matrix B is Bi [0] to the position where the head address of the ith row in the matrix A is Ai [0 ];
s7.3, determining whether i is equal to M-1, if not, making i equal to i +1, and going to step S7.1; otherwise, the calculation is ended.
Compared with the prior art, the invention has the beneficial effects that:
the method is a vectorization calculation method of matrix LU decomposition designed based on the hardware characteristics of a vector DSP core, non-continuous storage access in the matrix LU decomposition is converted into continuous storage access through vectorization matrix transposition operation, the advantages of vector loading data are fully exerted, further, subsequent row elimination and matrix panel updating operations can be completed by using the vectorization method, the parallelism degree of data processing is greatly improved, and therefore time consumption of the matrix LU decomposition is effectively reduced.
Drawings
FIG. 1 is a flow chart illustrating matrix LU decomposition based on vector DSP cores in an embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating the zero padding of the matrix according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating matrix transposition performed in an embodiment of the present invention.
FIG. 4 is a diagram illustrating the elimination of the erasure in an exemplary embodiment of the invention.
FIG. 5 is a diagram illustrating the updating of the matrix panel according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.
In order to make the content of the present invention more comprehensible, the following description is further described with reference to fig. 1 to 5 and the detailed description.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a matrix LU decomposition vectorization calculation method of a vector DSP core, which comprises the following steps:
s1, completing matrix zero padding by the vector DSP core according to the bit width of the chip vector register;
s2, the vector DSP core transposes the matrix B by using a vector instruction to obtain a transposed matrix C;
s3, checking, by the vector DSP, a row Dr of the upper triangular matrix D in the transposed matrix C, performing row elimination according to a formula Dr/Dr [0], updating the row Dr of the matrix D, and initializing r to 0;
s4, calculating an update matrix panel by the vector DSP core according to a formula R-Dr';
s5, determining whether r is equal to N-1, if not, making r equal to r +1, and going to step S3, if yes, going to step S6;
s6, the vector DSP core uses the vector instruction to perform matrix transposition to obtain the LU decomposition result of the matrix B;
and S7, copying the LU decomposition result of the obtained matrix B to the storage position of the original matrix in the DDR memory by the vector DSP core.
Example 1
In this embodiment, the vector DSP core preferably employs a core 2 chip; the chip of HuaRui 2 is known to those skilled in the art. Therefore, the HuaRui 2 chip includes a computationally intensive 256-bit vector processing unit for supporting parallel computations of intensive arithmetic tasks. When the traditional matrix LU decomposition method is implemented on a HuaRui 2 chip, the problems of insufficient use of hardware arithmetic units, memory access conflict and the like can be faced, the vector access mode of the HuaRui 2 chip and the system structural characteristics of concurrent processing of the vector processing unit are not suitable, and the vector calculation advantage of the HuaRui 2 chip is difficult to exert, so that the embodiment provides a method which can fully utilize the powerful parallel calculation and high-bandwidth vector data access capability of the vector processing unit of the HuaRui 2 chip, greatly improves the matrix LU decomposition vectorization calculation of the HuaRui 2 chip, and designs an LU decomposition algorithm with high precision and high real-time property.
As shown in fig. 1, in embodiment 1, a matrix LU decomposition vectorization calculation method based on a core 2 of a core includes the following steps:
s1, finishing matrix zero padding by the core of the HuaRui 2 DSP according to the bit width of the chip vector register;
s2, transposing the matrix B by using a vector instruction by using a HuaRui 2 DSP core to obtain a transposed matrix C;
s3, checking up row Dr of upper triangular matrix D in transposed matrix C by chip DSP of chinese Rui 2 according to formula Dr/Dr [0], updating row Dr of matrix D, and initializing r to 0;
s4, calculating an update matrix panel by a chip DSP core of the HuaRui 2 according to a formula R-R 'Dr';
s5, determining whether r is equal to N-1, if not, making r equal to r +1, and going to step S3, if yes, going to step S6;
s6, carrying out matrix transposition by using a DSP core of a HuaRui 2 chip through a vector instruction to obtain an LU decomposition result of a matrix B;
s7, the chip DSP core of HuaRui 2 copies the LU decomposition result of the obtained matrix B to the storage position of the original matrix in the DDR memory.
The matrix LU decomposition vectorization calculation method based on the core 2 of the embodiment 1 fully utilizes the strong parallel calculation and high bandwidth vector data access capability of the core 2 chip vector processing unit, greatly improves the LU decomposition vectorization calculation method of the core 2 chip matrix LU decomposition, and designs the LU decomposition algorithm with high precision and high real-time performance.
As shown in fig. 2, in embodiment 1, the S1 core 2-wise core completes the zero padding of the matrix according to the bit width of the chip vector register, and the specific method is as follows:
setting a matrix to be solved as a matrix A, wherein the scale of the matrix is M x M, and the initial address is A00; marking the matrix after zero padding as a matrix B, wherein the scale is NxNth, and the initial address is marked as B00; matrix elements of each matrix are u bytes, and the bit width of a vector register in a core of the Chinese Rui 2 is w bytes;
s1.1, determining whether the row length of the matrix a is an integer multiple of the bit width w of the vector register, if so, the matrix a does not need to be zero-padded, the matrix a and the matrix B are marked as the same matrix, N is equal to M, and B00 is equal to a00, and go to step S2; otherwise, go to step S1.2;
s1.2, determining the scale N of the matrix B, wherein the method comprises the following steps: condition one, N > M, condition two, (N × u)% w ═ 0; obtaining the minimum N value as the required N value while meeting the conditions I and II to obtain a matrix B; initializing the matrix B into an all-zero matrix;
s1.3, calculating the first addresses of the ith row elements of the matrix A and the matrix B respectively:
ai [0] ═ a00+ i M u; formula (1)
Bi [0] ═ B00+ i × N × u; formula (2)
Wherein, Ai [0] is the pointer variable of the first address of the ith row element of the matrix A, and Bi [0] is the pointer variable of the first address of the ith row element of the matrix B; the value range of i is [0- (M-1) ], and i is initialized to be 0;
s1.4, copying continuous M elements at the position where the head address of the ith row in the matrix A is Ai [0] to the position where the head address of the ith row in the matrix B is Bi [0 ];
s1.5, determining whether i is equal to M-1, if not, making i ═ i +1, and then go to step S1.3; otherwise, go to step S2.
In embodiment 1, the length of the row and column data of the matrix B obtained by using the matrix zero padding is an integral multiple of the bit width of the vector register, which is beneficial to increasing vectorization calculation and reducing tail processing.
As shown in fig. 3, in embodiment 1, the S2 and the core of the chinese 2 DSP transposes the matrix B by using vector instructions, so that the matrix B adapts to the architecture of the chinese 2 chip and performs the vectorization calculation of LU decomposition; the specific method of step S2 is as follows:
setting the matrix after the conversion as C, and recording the initial address of the matrix C as C00;
s2.1, matrix transposition is carried out according to the submatrix as a unit, and the scale of the submatrix is set to be (w/u) order because a vector Load instruction can Load w byte data;
s2.2, respectively calculating the first addresses of all the submatrices in the matrix B and the matrix C:
bkj [0] ═ B00+ k × N × u + j × w formula (3)
Cjk [0] ═ C00+ j N u + k w formula (4)
Wherein, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, Ckj [0] is a pointer variable of a first address of a sub-matrix in the matrix C, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k to 0, j to 0;
s2.3, sequentially using a vector Load instruction to Load each row of data of the sub-matrix from the position with the first address Bkj [0] in the matrix B, and respectively storing the data into a vector register;
s2.4, completing transposition calculation of a sub-matrix in the matrix B by using a vector shuffle instruction, and storing a calculation result into the sub-matrix with the initial address of Cjk [0] in the matrix C by using a vector Store instruction;
s2.5, determining whether j is equal to N × u/w-1, if not, making j equal to j +1, and going to step S2.2, if so, going to step S2.6;
s2.6, determine whether k is equal to N × u/w-1, if not, let k be k +1, go to step S2.2, if so, go to step S3.
The general matrix LU decomposition method performs the elimination by columns, but the column elements of the matrix are not continuous in the memory, and cannot be loaded and calculated by using the vector instruction.
As shown in fig. 4, in embodiment 1, the S3 chip DSP core checks the row Dr of the upper triangular matrix D in the transpose matrix C, and performs row elimination according to the formula Dr/Dr [0], updates the row Dr of the matrix D, and initializes r to 0; the specific method of step S3 is as follows:
dividing the matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data are divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the directions from top to bottom, respectively, marking the head element of each row as D0[0], D1[0] and D2[0] … D N-1[0 ]; dr is the r-th row of the upper triangular matrix D, Dr [0] is the row head element of the r-th row, wherein the value range of r [0- (N-1) ];
s3.1, reading a row head element Dr [0] of the matrix D by using a scalar Load instruction, and storing the row head element Dr [0] into a general register S0;
s3.2, calculating the value of 1/Dr [0] by using a scalar floating-point division instruction, storing the result into a general register S1, and broadcasting the data of the general register S1 to a vector register Z0 by using a broadcasting instruction;
s3.3, sequentially loading D by using vector Load instructionrThe data of other elements of the line are loaded with w/u elements each time and are stored into a vector register Z1;
s3.4, multiplying the vector data Z1 and the vector data Z0 by using a vector multiplication instruction, and storing the result into a vector register Z2;
s3.5, storing the vector data result in the vector register Z2 into the original position of Dr row data by using a vector Store instruction;
and S3.6, repeating the steps S3.3 to S3.5 until the row elimination element calculation of the row Dr of the upper triangular matrix D is completed.
As shown in fig. 5, in embodiment 1, the S4 and the core chip DSP core 2 of the chinese Rui performs calculation of the update matrix panel according to the formula R — Dr'. er; the specific method of step S4 is as follows:
recording a matrix at the lower right of a row head element Dr [0] as R, recording a g-th row of the matrix R as Rg, wherein the value range of g is [0- (N-R-2) ]; the part of the r-th column in the matrix C, which is positioned in the lower triangular matrix E, is recorded as a column element er, the f-th element of the column element er is recorded as erf, and the value range of f is [0- (N-r-2) ]; the portion of row Dr that does not include the row head element Dr [0] is denoted as Dr';
s4.1, obtaining an element erf of the er by using a scalar Load instruction, storing the element erf into a general register S0, and initializing f to be 0;
s4.2, broadcasting the data of the general register S0 in the S4.1 to the vector register Z0 by using a broadcast instruction;
s4.3, sequentially loading D by using vector Load instructionsr' the element data is loaded w/u elements at a time, and the result is stored in a vector register Z1;
s4.4, sequentially loading the element data of the Rg by using a vector Load instruction, wherein w/u elements are loaded each time, and storing the result into a vector register Z2;
s4.5, executing Z2-Z1X Z0 by using a vector multiplication and subtraction instruction, and storing a calculation result into a vector register Z3;
s4.6, storing the result of the vector register Z3 into the original position of Rg row data;
s4.7, repeating the steps S4.3 to S4.6 until the updating calculation of the Rg of the row of the matrix R is completed;
s4.8, determining whether g is equal to N-r-2, if not, making g equal to g +1, and going to step S4.1; if so, ending the updating calculation of the row Rg in the matrix R, and turning to the step S5;
in embodiment 1, S6 and the core DSP of the core 2 chip perform matrix transposition using vector instructions to obtain the LU decomposition result of matrix B; the specific method of step S6 is as follows:
s6.1, performing matrix transposition by taking the submatrix as a unit, and setting the scale of the submatrix to be (w/u × w/u) order;
s6.2, respectively calculating the first addresses of all the submatrices in the matrix C and the matrix B:
cjk [0] ═ C00+ j N u + k w formula (5)
Bkj [0] ═ B00+ k × N × u + j × w formula (6)
Wherein, Cjk [0] is a pointer variable of a first address of a sub-matrix in the matrix C, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k to 0, j to 0;
s6.3, sequentially using a vector Load instruction to Load each row of data of the sub-matrix from the position with the initial address of Cjk [0] in the matrix C, and respectively storing the data into a vector register;
s6.4, completing transposition calculation of the submatrix in the matrix C by using a vector shuffle instruction, and storing a calculation result into the submatrix with the initial address Bkj [0] in the matrix B by using a vector Store instruction;
s6.5, determining whether j is equal to N × u/w-1, if not, making j equal to j +1, and going to step S6.2, if so, going to step S6.6;
s6.6, determine whether k is equal to N × u/w-1, if not, let k be k +1, go to step S6.2, if so, go to step S7.
In embodiment 1, S7, core chip DSP core 2 of core copies LU decomposition result of matrix B to original storage location of DDR memory; the specific method of step S7 is as follows:
s7.1, calculating the first addresses of the ith row elements of the matrix B and the matrix A respectively:
bi [0] ═ B00+ i × N × u; formula (2)
Ai [0] ═ a00+ i M u; formula (1)
Wherein, Ai [0] is the pointer variable of the first address of the ith row element of the matrix A, and Bi [0] is the pointer variable of the first address of the ith row element of the matrix B; the value range of i is [0- (M-1) ], and i is initialized to be 0;
s7.2, copying continuous M elements at the position where the head address of the ith row in the matrix B is Bi [0] to the position where the head address of the ith row in the matrix A is Ai [0 ];
s7.3, determining whether i is equal to M-1, if not, making i equal to i +1, and going to step S7.1; otherwise, the calculation is ended.
The parts not involved in the present invention are the same as or can be implemented using the prior art.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A matrix LU decomposition vectorization calculation method of a vector DSP core comprises the following steps:
s1, completing matrix zero padding by the vector DSP core according to the bit width of the chip vector register;
s2, the vector DSP core transposes the matrix B by using a vector instruction to obtain a transposed matrix C;
s3, checking the row Dr of the upper triangular matrix D in the transposed matrix C by the vector DSP core, performing row elimination according to a formula Dr = Dr/Dr [0], updating the row Dr of the matrix D, and initializing r = 0;
s4, calculating an update matrix panel by the vector DSP core according to a formula R = R-Dr'. er;
s5, judging whether r is equal to N-1, if not, making r = r +1, and turning to step S3, if yes, turning to step S6;
s6, the vector DSP core uses the vector instruction to perform matrix transposition to obtain the LU decomposition result of the matrix B;
and S7, copying the LU decomposition result of the obtained matrix B to the storage position of the original matrix in the DDR memory by the vector DSP core.
2. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 1, wherein the specific method of step S1 is as follows:
setting a matrix to be solved as a matrix A, wherein the scale of the matrix is M x M, and the initial address is A00; marking the matrix after zero padding as a matrix B, wherein the scale is NxNth, and the initial address is marked as B00; matrix elements of each matrix are u bytes, and the bit width of a vector register in a vector DSP core is w bytes;
s1.1, judging whether the matrix row length of a matrix A is an integral multiple of the bit width w of a vector register, if so, marking the matrix A and the matrix B as the same matrix, wherein N = M, B00= A00, and turning to the step S2; otherwise, go to step S1.2;
s1.2, determining the scale N of the matrix B, wherein the method comprises the following steps: condition one, N > M, condition two, (N × u)% w = 0; obtaining the minimum N value as the required N value while meeting the conditions I and II to obtain a matrix B; initializing the matrix B into an all-zero matrix;
s1.3, calculating the first addresses of the ith row elements of the matrix A and the matrix B respectively:
ai [0] = a00+ i M u; formula (1)
Bi [0] = B00+ i N u; formula (2)
Wherein, Ai [0] is the pointer variable of the first address of the ith row element of the matrix A, and Bi [0] is the pointer variable of the first address of the ith row element of the matrix B; i is in a value range [0- (M-1) ], and i =0 is initialized;
s1.4, copying continuous M elements at the position where the head address of the ith row in the matrix A is Ai [0] to the position where the head address of the ith row in the matrix B is Bi [0 ];
s1.5, judging whether i is equal to M-1 or not, if not, enabling i = i +1, and turning to the step S1.3; otherwise, go to step S2.
3. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 1 or 2, wherein the specific method of step S2 is as follows:
setting the matrix after the conversion as C, and recording the initial address of the matrix C as C00;
s2.1, matrix transposition is carried out according to the submatrix as a unit, and the scale of the submatrix is set to be (w/u) order because a vector Load instruction can Load w byte data;
s2.2, respectively calculating the first addresses of all the submatrices in the matrix B and the matrix C:
bkj [0] = B00+ k N u + j w formula (3)
Cjk [0] = C00+ j N u + k w equation (4)
Wherein, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, Ckj [0] is a pointer variable of a first address of a sub-matrix in the matrix C, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k =0, j = 0;
s2.3, sequentially using a vector Load instruction to Load each row of data of the sub-matrix from the position with the first address Bkj [0] in the matrix B, and respectively storing the data into a vector register;
s2.4, completing transposition calculation of a sub-matrix in the matrix B by using a vector shuffle instruction, and storing a calculation result into the sub-matrix with the initial address of Cjk [0] in the matrix C by using a vector Store instruction;
s2.5, judging whether j is equal to N × u/w-1, if not, enabling j = j +1, and turning to the step S2.2, and if so, turning to the step S2.6;
s2.6, determine whether k is equal to N × u/w-1, if not, let k = k +1, go to step S2.2, if so, go to step S3.
4. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 3, wherein the specific method of step S3 is as follows:
dividing the matrix C in the S2 into an upper triangular area and a lower triangular area along the diagonal direction from top left to bottom right, wherein diagonal data are divided into the upper triangular area, the upper triangular area is marked as an upper triangular matrix D, and the lower triangular area is marked as a lower triangular matrix E; marking the upper triangular matrix D as D0, D1 and D2 … DN-1 according to the directions from top to bottom, respectively, marking the head element of each row as D0[0], D1[0] and D2[0] … DN-1 [0 ]; dr is the r-th row of the upper triangular matrix D, and Dr [0] is the row head element of the r-th row, wherein the value range of r [0- (N-1) ];
s3.1, reading a row head element Dr [0] of the matrix D by using a scalar Load instruction, and storing the row head element Dr [0] into a general register S0;
s3.2, calculating the value of 1/Dr [0] by using a scalar floating-point division instruction, storing the result into a general register S1, and broadcasting the data of the general register S1 to a vector register Z0 by using a broadcasting instruction;
s3.3, sequentially loading data of other elements except row head elements in the Dr row by using a vector Load instruction, loading w/u elements each time, and storing the w/u elements into a vector register Z1;
s3.4, multiplying the vector data Z1 and the vector data Z0 by using a vector multiplication instruction, and storing the result into a vector register Z2;
s3.5, storing the vector data result in the vector register Z2 into the original position of Dr row data by using a vector Store instruction;
and S3.6, repeating the steps S3.3 to S3.5 until the row elimination element calculation of the row Dr of the upper triangular matrix D is completed.
5. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 4, wherein the specific method of step S4 is as follows:
recording a matrix at the lower right of a row head element Dr [0] as R, recording a g-th row of the matrix R as Rg, wherein the value range of g is [0- (N-R-2) ]; the part of the r-th column in the matrix C, which is positioned in the lower triangular matrix E, is recorded as er, and the f-th element of the er is recorded as erf; the value range of f is [0- (N-r-2) ]; the portion of row Dr that does not include the row head element Dr [0] is denoted as Dr';
s4.1, obtaining an element erf of the er by using a scalar Load instruction, storing the element erf into a general register S0, and initializing f = 0;
s4.2, broadcasting the data of the general register S0 in the S4.1 to the vector register Z0 by using a broadcast instruction;
s4.3, sequentially loading element data of Dr' by using a vector Load instruction, wherein w/u elements are loaded each time, and storing the result into a vector register Z1;
s4.4, sequentially loading the element data of the Rg by using a vector Load instruction, wherein w/u elements are loaded each time, and storing the result into a vector register Z2;
s4.5, executing Z2-Z1X Z0 by using a vector multiplication and subtraction instruction, and storing a calculation result into a vector register Z3;
s4.6, storing the result of the vector register Z3 into the original position of Rg row data;
s4.7, repeating the steps S4.3 to S4.6 until the updating calculation of the Rg of the row of the matrix R is completed;
s4.8, judging whether g is equal to N-r-2, if not, enabling g = g +1, and turning to the step S4.1; if so, the update calculation of the row Rg in the matrix R is finished, and the process goes to step S5.
6. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 5, wherein the specific method of step S6 is as follows:
s6.1, performing matrix transposition by taking the submatrix as a unit, and setting the scale of the submatrix to be (w/u × w/u) order;
s6.2, respectively calculating the first addresses of all the submatrices in the matrix C and the matrix B:
cjk [0] = C00+ j N u + k w equation (5)
Bkj [0] = B00+ k N u + j w equation (6)
Wherein, Cjk [0] is a pointer variable of a first address of a sub-matrix in the matrix C, Bkj [0] is a pointer variable of a first address of a sub-matrix in the matrix B, k is a row coordinate of the sub-matrix in the matrix B or a column coordinate of the sub-matrix in the matrix C, a value range [0- (N u/w-1) ] of k, j is a column coordinate of the sub-matrix in the matrix B or a row coordinate of the sub-matrix in the matrix C, and a value range [0- (N u/w-1) ] of j; initializing k =0, j = 0;
s6.3, sequentially using a vector Load instruction to Load each row of data of the sub-matrix from the position with the initial address of Cjk [0] in the matrix C, and respectively storing the data into a vector register;
s6.4, completing transposition calculation of the submatrix in the matrix C by using a vector shuffle instruction, and storing a calculation result into the submatrix with the initial address Bkj [0] in the matrix B by using a vector Store instruction;
s6.5, determining whether j is equal to N × u/w-1, if not, making j = j +1, and going to step S6.2, if so, going to step S6.6;
s6.6, determine whether k is equal to N × u/w-1, if not, let k = k +1, go to step S6.2, if so, go to step S7.
7. The matrix LU decomposition vectorization calculation method of a vector DSP core according to claim 6, wherein the specific method of step S7 is as follows:
s7.1, calculating the first addresses of the ith row elements of the matrix B and the matrix A respectively:
bi [0] = B00+ i N u; formula (2)
Ai [0] = a00+ i M u; formula (1)
Wherein, Ai [0] is the pointer variable of the first address of the ith row element of the matrix A, and Bi [0] is the pointer variable of the first address of the ith row element of the matrix B; i is in a value range [0- (M-1) ], and i =0 is initialized;
s7.2, copying continuous M elements at the position where the head address of the ith row in the matrix B is Bi [0] to the position where the head address of the ith row in the matrix A is Ai [0 ];
s7.3, judging whether i is equal to M-1 or not, if not, enabling i = i +1, and turning to the step S7.1; otherwise, the calculation is ended.
CN202111491342.4A 2021-12-08 Matrix LU decomposition vectorization calculation method of vector DSP core Active CN114139108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111491342.4A CN114139108B (en) 2021-12-08 Matrix LU decomposition vectorization calculation method of vector DSP core

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111491342.4A CN114139108B (en) 2021-12-08 Matrix LU decomposition vectorization calculation method of vector DSP core

Publications (2)

Publication Number Publication Date
CN114139108A true CN114139108A (en) 2022-03-04
CN114139108B CN114139108B (en) 2024-07-09

Family

ID=

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093021A (en) * 2024-04-26 2024-05-28 北京壁仞科技开发有限公司 Method, computing core, apparatus, medium, and program product for performing transpose computation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6606725B1 (en) * 2000-04-25 2003-08-12 Mitsubishi Electric Research Laboratories, Inc. MAP decoding for turbo codes by parallel matrix processing
CN103959233A (en) * 2011-09-15 2014-07-30 埃克森美孚上游研究公司 Optimized matrix and vector operations in instruction limited algorithms that perform eos calculations
CN105373497A (en) * 2015-10-29 2016-03-02 中国人民解放军国防科学技术大学 Digital signal processor (DSP) chip based matrix transposition device
CN109145255A (en) * 2018-06-11 2019-01-04 山东省计算中心(国家超级计算济南中心) A kind of heterogeneous Computing method that sparse matrix LU scanning line updates

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6606725B1 (en) * 2000-04-25 2003-08-12 Mitsubishi Electric Research Laboratories, Inc. MAP decoding for turbo codes by parallel matrix processing
CN103959233A (en) * 2011-09-15 2014-07-30 埃克森美孚上游研究公司 Optimized matrix and vector operations in instruction limited algorithms that perform eos calculations
CN105373497A (en) * 2015-10-29 2016-03-02 中国人民解放军国防科学技术大学 Digital signal processor (DSP) chip based matrix transposition device
CN109145255A (en) * 2018-06-11 2019-01-04 山东省计算中心(国家超级计算济南中心) A kind of heterogeneous Computing method that sparse matrix LU scanning line updates

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姜波;蓝伯雄;: "可变维核心矩阵LU分解方法", 中国管理科学, no. 06, 15 December 2008 (2008-12-15), pages 67 - 74 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093021A (en) * 2024-04-26 2024-05-28 北京壁仞科技开发有限公司 Method, computing core, apparatus, medium, and program product for performing transpose computation

Similar Documents

Publication Publication Date Title
CN111213125B (en) Efficient direct convolution using SIMD instructions
CN110415157B (en) Matrix multiplication calculation method and device
Bell et al. Efficient sparse matrix-vector multiplication on CUDA
US7337205B2 (en) Matrix multiplication in a vector processing system
US8321492B1 (en) System, method, and computer program product for converting a reduction algorithm to a segmented reduction algorithm
KR20170110690A (en) A vector processor configured to operate on variable length vectors using one or more complex arithmetic instructions;
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
US20210117755A1 (en) Power-efficient hybrid traversal apparatus and method for convolutional neural network accelerator architecture
CN109726441B (en) Body and surface mixed GPU parallel computing electromagnetism DGTD method
US6804771B1 (en) Processor with register file accessible by row column to achieve data array transposition
Ploskas et al. GPU accelerated pivoting rules for the simplex algorithm
Watanabe et al. SIMD vectorization for the Lennard-Jones potential with AVX2 and AVX-512 instructions
CN116710912A (en) Matrix multiplier and control method thereof
US4916649A (en) Method and apparatus for transforming a bit-reversed order vector into a natural order vector
JP2023070746A (en) Information processing program, information processing apparatus, and information processing method
CN114139108A (en) Matrix LU decomposition vectorization calculation method of vector DSP core
US8826252B2 (en) Using vector atomic memory operation to handle data of different lengths
CN114139108B (en) Matrix LU decomposition vectorization calculation method of vector DSP core
CN114330669B (en) Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system
Wang et al. A fast tridiagonal solver for Intel MIC architecture
CN111125950A (en) CFD parallel processing method for nuclear reactor thermal hydraulic simulation software
Ploskas et al. A computational comparison of scaling techniques for linear optimization problems on a graphical processing unit
CN113890508A (en) Hardware implementation method and hardware system for batch processing FIR algorithm
CN114281755A (en) Vector processor-oriented semi-precision vectorization convolution method and system
US9582473B1 (en) Instruction set to enable efficient implementation of fixed point fast fourier transform (FFT) algorithms

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant