CN112445752B

CN112445752B - Matrix inversion device based on Cholesky decomposition

Info

Publication number: CN112445752B
Application number: CN201910804096.XA
Authority: CN
Inventors: 张应松; 矫渊培
Original assignee: Shanghai Huawei Technologies Co Ltd
Current assignee: Shanghai Huawei Technologies Co Ltd
Priority date: 2019-08-28
Filing date: 2019-08-28
Publication date: 2024-01-05
Anticipated expiration: 2039-08-28
Also published as: WO2021036313A1; CN112445752A

Abstract

The invention discloses a matrix inversion device based on Cholesky decomposition. The device comprises a data writing control unit, a first data shifting unit, a control unit, an operation unit, a second data shifting unit, a storage unit and an output unit. The operation unit comprises 8 single-precision complex multiply-add units (CMACs), each having a four-stage pipeline operation structure. The operation unit is connected with the control unit; the first data shifting unit, the control unit, the second data shifting unit and the output unit are each connected with the storage unit; the control unit is connected with the second data shifting unit; and the data writing control unit is connected with the first data shifting unit. By using 8 CMAC computing units, the technical scheme addresses the problems that current vector processors have few internal computing resources and use them at a low rate, thereby reducing the processing delay of Cholesky-decomposition-based inversion and improving network performance.

Description

Matrix inversion device based on Cholesky decomposition

Technical Field

The invention relates to the field of digital signal processing, and in particular to a matrix inversion device based on Cholesky decomposition.

Background

Inversion based on Cholesky decomposition is a common method for inverting positive definite matrices. The principle is as follows: for an n-order symmetric positive definite matrix $A$, there exists a lower triangular matrix $L$ such that $A = L L^{T}$; the inverse of $A$ is then $A^{-1} = (L L^{T})^{-1} = (L^{T})^{-1} L^{-1} = (L^{-1})^{T} L^{-1}$. It is common practice in the industry to use vector processors to implement Cholesky-decomposition-based inversion.
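As an illustration of the identity above, the following small worked example uses a real 2×2 matrix chosen here purely for convenience (it does not come from the original text). For a real matrix the conjugate transpose reduces to the ordinary transpose.

$$A = \begin{bmatrix} 4 & 2 \\ 2 & 5 \end{bmatrix}, \qquad L = \begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}, \qquad L L^{T} = \begin{bmatrix} 4 & 2 \\ 2 & 5 \end{bmatrix} = A$$

$$L^{-1} = \begin{bmatrix} \tfrac{1}{2} & 0 \\ -\tfrac{1}{4} & \tfrac{1}{2} \end{bmatrix}, \qquad A^{-1} = (L^{-1})^{T} L^{-1} = \begin{bmatrix} \tfrac{1}{2} & -\tfrac{1}{4} \\ 0 & \tfrac{1}{2} \end{bmatrix} \begin{bmatrix} \tfrac{1}{2} & 0 \\ -\tfrac{1}{4} & \tfrac{1}{2} \end{bmatrix} = \begin{bmatrix} \tfrac{5}{16} & -\tfrac{1}{8} \\ -\tfrac{1}{8} & \tfrac{1}{4} \end{bmatrix}$$

The result agrees with the direct formula $A^{-1} = \frac{1}{\det A} \begin{bmatrix} 5 & -2 \\ -2 & 4 \end{bmatrix}$ with $\det A = 16$.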

Cholesky-decomposition-based matrix inversion involves a large number of iterative and interleaved operations, and the amount of internal computation is very large. A vector processor relies mainly on its internal vector processing unit to perform the inversion. However, a current vector processor includes only 16 half-precision complex multiply-add units (CMACs), equivalent to 4 single-precision CMACs; that is, ideally only 4 single-precision complex operations can be performed at a time, so even if the utilization of computing resources reached 100%, the processing capability would still be weak. Moreover, when performing Cholesky-decomposition-based inversion, the vector processor first performs the Cholesky decomposition and performs the inversion only after all decomposition results have been obtained. Because of data dependencies, the decomposition requires fewer and fewer CMACs as the iterations proceed, whereas the inversion requires more and more. In either phase, therefore, the number of CMACs that can be kept busy changes from iteration to iteration, so the average CMAC utilization is low.
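The shrinking and growing workloads described above can be made concrete with a toy count. The sketch below is an illustrative model only, not a description of any real vector processor: it assumes that iteration x of the decomposition produces the entries of column x of L (diagonal included), that iteration x of the inversion produces the x entries of row x of the triangular inverse, and that each iteration occupies whole passes over a fixed number of lanes. The function names are hypothetical.

```python
def work_per_iteration(n):
    """Return (decomposition, inversion) output counts for iterations 1..n.

    Assumed model: decomposition iteration x yields n - x + 1 entries
    (column x of L, diagonal included); inversion iteration x yields x
    entries (row x of the triangular inverse).
    """
    decomposition = [n - x + 1 for x in range(1, n + 1)]
    inversion = [x for x in range(1, n + 1)]
    return decomposition, inversion


def average_utilization(counts, lanes):
    """Useful lane-slots divided by offered lane-slots.

    Each iteration is assumed to take ceil(count / lanes) passes, and
    every pass offers `lanes` slots whether or not they are all used.
    """
    used = sum(counts)
    offered = sum(-(-c // lanes) * lanes for c in counts)
    return used / offered


dec, inv = work_per_iteration(8)
print(dec)  # [8, 7, 6, 5, 4, 3, 2, 1]
print(inv)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(average_utilization(dec, lanes=4))  # 0.75
print(average_utilization(inv, lanes=4))  # 0.75
```

Under this simplified model, neither phase on its own keeps 4 lanes fully occupied for an 8-order matrix, which is the kind of loss the paragraph above refers to.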

In summary, a current vector processor has few internal computing resources and uses them at a low rate, so the processing delay of Cholesky-decomposition-based inversion is long. The inversion therefore easily becomes a bottleneck of the link and degrades network performance.

Disclosure of Invention

The embodiment of the invention provides a matrix inversion device based on Cholesky decomposition, which can reduce the processing delay of Cholesky-decomposition-based inversion and improve network performance.

The application provides a matrix inversion device based on Cholesky decomposition, including a data writing control unit, a first data shifting unit, a control unit, an operation unit, a second data shifting unit, a storage unit and an output unit. The operation unit includes 8 single-precision complex multiply-add units (CMACs), and each CMAC has a four-stage pipeline operation structure. The operation unit is connected with the control unit; the first data shifting unit, the control unit, the second data shifting unit and the output unit are each connected with the storage unit; the control unit and the second data shifting unit are connected with each other; and the data writing control unit is connected with the first data shifting unit.

In the matrix inversion device based on Cholesky decomposition, the data writing control unit is used for completing write control of a matrix, where the matrix is an N-order positive definite matrix and N is an integer greater than 1 and less than or equal to 32.

In the matrix inversion device based on Cholesky decomposition, the first data shifting unit is used for shifting the diagonal data of the matrix to the first position of each column so as to obtain first shift data.

In the matrix inversion device based on Cholesky decomposition, the control unit is used for communication and control among the storage unit, the second data shifting unit and the operation unit.

In the matrix inversion device based on Cholesky decomposition, the operation unit is used for performing N parallel iterative operations on the matrix according to control information from the control unit, so as to obtain an operation result for each of the N parallel iterative operations. The operation result of the x-th parallel iterative operation is obtained from the x-th column component data of the N-order positive definite matrix and the operation results of the previous (x-1) iterative operations. It comprises the column component data of the x-th column of the lower triangular matrix obtained by Cholesky decomposition of the matrix and the row component data of the x-th row of the inverse matrix of the matrix, where the column component data of the x-th column does not include the diagonal data of the lower triangular matrix, and x is an integer greater than 0 and less than or equal to N.

In the matrix inversion device based on Cholesky decomposition, the second data shifting unit is used for shifting the operation result of each parallel iterative operation so as to obtain second shift data of that operation result, and the second shift data is used as input to the next parallel iterative operation.

In the matrix inversion device based on Cholesky decomposition, the storage unit is used for storing the first shift data and the second shift data of the operation result of each iterative operation. The storage unit is a local cache that can hold at most a 32×32×64-bit matrix.

In the matrix inversion device based on Cholesky decomposition, the output unit is used for outputting the inverse matrix according to the second shift data of the operation result of each iterative operation stored in the storage unit.

For a positive definite matrix with N = 8, the matrix inversion device can support interleaved operation of 4 matrices simultaneously. For a positive definite matrix with N = 4, it can support interleaved operation of 8 matrices simultaneously. For a positive definite matrix with N = 16, it can support interleaved operation of 2 matrices simultaneously. For a positive definite matrix with N = 32, no interleaving of matrices is required. When N is not an integer multiple of 8, N may be rounded up to the nearest larger value, and the interleaving mode is then the same as that of a positive definite matrix whose order is an integer multiple of 8.

The embodiment of the application provides a matrix inversion device based on Cholesky decomposition, which can address the problems that a current vector processor has few internal computing resources and uses them at a low rate, thereby reducing the processing delay of Cholesky-decomposition-based inversion and improving network performance.

Drawings

FIG. 1 is a schematic diagram of an embodiment of a matrix inversion device based on Cholesky decomposition according to an embodiment of the present application;

FIG. 2 (a) is a schematic diagram of a matrix format received by a data write control unit according to an embodiment of the present application;

FIG. 2 (b) is a schematic diagram of a matrix format output by the data write control unit according to the embodiment of the present application;

FIG. 3 is a schematic diagram of the first shift data obtained after the first data shifting unit shifts a matrix according to an embodiment of the present application;

FIG. 4 (a) is a schematic diagram of the data changes during the first parallel iterative operation in the operation unit for the 4 interleaved 8-order matrices provided in an embodiment of the present application;

FIG. 4 (b) is a schematic diagram of the data changes during the second parallel iterative operation in the operation unit for the 4 interleaved 8-order matrices provided in an embodiment of the present application;

FIG. 4 (c) is a schematic diagram of the data changes during the third parallel iterative operation in the operation unit for the 4 interleaved 8-order matrices provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of the storage format of the operation result of each parallel iterative operation for the 4 interleaved 8-order matrices according to an embodiment of the present application.

Detailed Description

The embodiment of the application provides a matrix inversion device based on Cholesky decomposition, which can reduce the processing delay of Cholesky-decomposition-based inversion and improve network performance.

To help those skilled in the art better understand the solution of the present application, the embodiments are described in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments that a person of ordinary skill in the art can obtain from the embodiments herein without inventive effort shall fall within the scope of the present application.

The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Solving a matrix equation often requires the inverse of a matrix. For example, to solve $AX = B$, $A^{-1}$ is obtained first, and $X$ is then computed as $X = A^{-1} B$.

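The Cholesky decomposition algorithm is a very common matrix decomposition method. Its basic principle is as follows: for an n-order symmetric positive definite matrix $A$, there exists a lower triangular matrix $L$ such that $A = L L^{T}$, where the entries on the diagonal of $L$ are all positive real numbers and $L^{T}$ is the conjugate transpose of the lower triangular matrix $L$:

$$L = \begin{bmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{bmatrix}$$

The basic formula of the Cholesky decomposition algorithm is:

$$l_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk} l_{jk}^{*}}, \qquad l_{ij} = \frac{1}{l_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk}^{*} \right), \quad i = j+1, \ldots, n$$

where $j = 1, 2, \ldots, n$ and $^{*}$ denotes complex conjugation. The initial values of $l_{jj}$ and $l_{ij}$ are:

$$l_{11} = \sqrt{a_{11}}, \qquad l_{i1} = \frac{a_{i1}}{l_{11}}$$

Finally, the inverse matrix of $A$ is obtained according to $A^{-1} = (L L^{T})^{-1} = (L^{T})^{-1} L^{-1} = (L^{-1})^{T} L^{-1}$.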

A vector processor implements matrix inversion based on Cholesky decomposition in two main steps. First, the decomposition is performed, i.e., the lower triangular matrix $L$ is obtained according to $A = L L^{T}$. Because of data dependencies, the vector processor must process matrix $A$ column by column: the first column is calculated first, the second column is calculated after the first is complete, and each column depends on the results of all previous columns. After the Cholesky decomposition has produced the lower triangular matrix of $A$, the inversion is performed, i.e., $A^{-1}$ is calculated according to $A^{-1} = (L^{-1})^{T} L^{-1}$. Again because of data dependencies, the inversion based on the decomposition result must proceed row by row, with each row calculated only after the previous row is complete.
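The two-step flow described above (decompose column by column, then invert row by row) can be sketched in software as follows. This is a minimal reference sketch in plain Python, not the vector processor's actual code; the function names `cholesky`, `invert_lower` and `inverse_via_cholesky` are hypothetical, and the input is assumed to be a Hermitian positive definite matrix given as a list of rows.

```python
import math


def cholesky(a):
    """Step 1: column-by-column factorisation, a = L * L^H."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]
    for j in range(n):
        # The diagonal entry depends on all earlier columns of row j.
        s = a[j][j].real - sum(abs(l[j][k]) ** 2 for k in range(j))
        l[j][j] = complex(math.sqrt(s))
        for i in range(j + 1, n):
            acc = a[i][j] - sum(l[i][k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = acc / l[j][j]
    return l


def invert_lower(l):
    """Step 2a: row-by-row inverse of the lower triangular factor."""
    n = len(l)
    d = [[0j] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 1 / l[i][i]
        for j in range(i):
            # Row i of the inverse depends on rows j..i-1 computed earlier.
            acc = sum(l[i][k] * d[k][j] for k in range(j, i))
            d[i][j] = -acc * d[i][i]
    return d


def inverse_via_cholesky(a):
    """Step 2b: A^-1 = (L^-1)^H * L^-1."""
    d = invert_lower(cholesky(a))
    n = len(a)
    return [[sum(d[k][i].conjugate() * d[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


print(inverse_via_cholesky([[4, 2], [2, 5]]))
```

Note that `invert_lower` cannot start until `cholesky` has returned, which mirrors the strict phase ordering the paragraph describes.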

Therefore, in the decomposition-inversion process of a current vector processor, the inversion is performed only after all decomposition results have been obtained, which leads to low CMAC utilization; in addition, the vector processor has few CMAC computing resources.

FIG. 1 is a schematic diagram of an embodiment of a matrix inversion device 10 based on Cholesky decomposition according to an embodiment of the present application.

Referring to FIG. 1, the matrix inversion device 10 based on Cholesky decomposition provided in the embodiment of the application includes a data writing control unit 101, a first data shifting unit 102, a control unit 103, an operation unit 104, a second data shifting unit 105, a storage unit 106, and an output unit 107. The operation unit 104 includes 8 single-precision complex multiply-add units (CMACs), each with a four-stage pipeline operation structure. The operation unit 104 is connected with the control unit 103; the first data shifting unit 102, the control unit 103, the second data shifting unit 105, and the output unit 107 are each connected with the storage unit 106; the control unit 103 is connected with the second data shifting unit 105; and the data writing control unit 101 is connected with the first data shifting unit 102.

The matrix inversion device 10 based on Cholesky decomposition provided in the embodiment of the present application includes 8 single-precision CMACs, each with a four-stage pipeline operation structure, and can directly support decomposition and inversion of an N-order positive definite matrix with N less than or equal to 32. When N is greater than 32, the matrix may be partitioned by software into blocks of order 32 or less for calculation, which is not limited in the embodiment of the present application.

Since the matrix inversion device 10 provided in the embodiment of the present application includes 8 single-precision CMACs, each with a four-stage pipeline operation structure, for a positive definite matrix with N = 8 the matrix inversion device 10 can support interleaved operation of 4 matrices simultaneously. For a positive definite matrix with N = 4, it can support interleaved operation of 8 matrices simultaneously. For a positive definite matrix with N = 16, it can support interleaved operation of 2 matrices simultaneously. For a positive definite matrix with N = 32, no interleaving of matrices is required. Each of these cases can achieve 100% utilization of the CMACs. When N is not an integer multiple of 8, N may be rounded up to the nearest larger value, and the interleaving mode is then the same as that of a positive definite matrix whose order is an integer multiple of 8.
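The interleaving counts listed above are consistent with dividing 8 CMACs × 4 pipeline stages = 32 slots by the matrix order. The sketch below encodes that reading. It is an interpretation, not something the text states: both the slot-sharing formula and the rule of rounding an unlisted order up to the next listed order (4, 8, 16, 32) are assumptions, and the names are hypothetical.

```python
CMAC_COUNT = 8                     # from the text
PIPELINE_STAGES = 4                # from the text
SUPPORTED_ORDERS = (4, 8, 16, 32)  # the orders the text lists explicitly


def padded_order(n):
    """Assumption: an unlisted order is rounded up to the next listed one."""
    if not 1 < n <= 32:
        raise ValueError("order must satisfy 1 < N <= 32")
    return next(s for s in SUPPORTED_ORDERS if s >= n)


def interleaved_matrices(n):
    """Assumption: CMAC_COUNT * PIPELINE_STAGES slots are shared equally
    among the interleaved matrices of (padded) order n."""
    return (CMAC_COUNT * PIPELINE_STAGES) // padded_order(n)


for order in (4, 8, 16, 32):
    print(order, interleaved_matrices(order))
# 4 8
# 8 4
# 16 2
# 32 1
```

For the four listed orders the function reproduces the counts given in the text; its behaviour for other orders depends entirely on the rounding assumption.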

Specifically, in the embodiment of the present application, the functions of the respective functional units included in the matrix inversion apparatus 10 are as follows:

The data writing control unit 101 is configured to complete write control of a matrix. It should be noted that, in the embodiment of the present application, the matrix is an N-order positive definite matrix, and N is an integer greater than 1 and less than or equal to 32.

The first data shifting unit 102 is configured to shift the diagonal data of the matrix to the first position of each column to obtain first shift data.

The control unit 103 is configured to schedule and control the whole calculation task.

The operation unit 104 is configured to perform N parallel iterative operations on the matrix according to the control signal of the control unit 103, so as to obtain an operation result for each of the N parallel iterative operations. The operation result of the x-th parallel iterative operation is obtained from the x-th column component data of the N-order positive definite matrix and the operation results of the previous (x-1) iterative operations. It includes the x-th column component data of the lower triangular matrix obtained by Cholesky decomposition of the matrix and the x-th row component data of the inverse matrix of the matrix, where x is an integer greater than 0 and less than or equal to N.
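One way to read the fused iteration described above is sketched below. It is an interpretation and not the device's actual datapath: it assumes that the "row component data of the inverse matrix" produced in iteration x is row x of the inverse of the triangular factor (denoted d here, so that d_xx = 1 / l_xx), and that the full inverse is assembled afterwards as (L^-1)^H · L^-1. The function names are hypothetical, and the input is assumed to be Hermitian positive definite.

```python
import math


def fused_cholesky_inverse(a):
    """Per iteration x: column x of L and row x of d = L^-1 (assumed reading)."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]  # lower triangular factor
    d = [[0j] * n for _ in range(n)]  # inverse of l, built row by row
    for x in range(n):
        # Diagonal first: l_xx, then its reciprocal d_xx.
        s = a[x][x].real - sum(abs(l[x][k]) ** 2 for k in range(x))
        l[x][x] = complex(math.sqrt(s))
        d[x][x] = 1 / l[x][x]
        # Column x of L below the diagonal, scaled by multiplying with d_xx.
        for i in range(x + 1, n):
            acc = a[i][x] - sum(l[i][k] * l[x][k].conjugate() for k in range(x))
            l[i][x] = acc * d[x][x]
        # Row x of d; it needs only row x of L and rows 0..x-1 of d.
        for j in range(x):
            acc = sum(l[x][k] * d[k][j] for k in range(j, x))
            d[x][j] = -acc * d[x][x]
    return l, d


def assemble_inverse(d):
    """A^-1 = d^H * d, computed once all rows of d are available."""
    n = len(d)
    return [[sum(d[k][i].conjugate() * d[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


l, d = fused_cholesky_inverse([[4, 2], [2, 5]])
print(assemble_inverse(d))
```

In this reading every iteration produces exactly N values (one reciprocal diagonal, N - x entries of L below the diagonal, and x - 1 off-diagonal entries of the row of d), so the per-iteration workload stays constant instead of shrinking or growing.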

The second data shifting unit 105 is configured to shift the operation result of each parallel iterative operation, so as to obtain second shift data of that operation result, and to provide the second shift data as input to the next parallel iterative operation.

The storage unit 106 is configured to store the first shift data and the second shift data of the operation result of each iterative operation.

It should be noted that, in the embodiment of the present application, the storage unit 106 gives priority to requests from the operation unit 104 and responds to external input only when the operation unit 104 has no request. The storage unit 106 may also maintain a count of the matrices received and apply back pressure to the preceding stage when its internal buffer is full. The storage unit is a local cache that can hold at most a 32×32×64-bit matrix. Optionally, the bandwidth of the storage unit 106 is 64 × 8 = 512 bits, which may be implemented as 4 banks, each 128 bits wide and 128 entries deep.
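The capacity and bandwidth figures above can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from the text; reading the 64-bit element as one single-precision complex value (two 32-bit parts) and the factor 8 as the number of CMACs is an interpretation, and the constant names are hypothetical.

```python
ELEMENT_BITS = 64  # assumed: one single-precision complex value, 2 x 32 bits
MAX_ORDER = 32     # largest directly supported matrix order
CMAC_COUNT = 8     # assumed meaning of the factor 8 in "64 x 8"

capacity_bits = MAX_ORDER * MAX_ORDER * ELEMENT_BITS
bandwidth_bits = ELEMENT_BITS * CMAC_COUNT

BANK_WIDTH_BITS, BANK_DEPTH, BANKS = 128, 128, 4
banked_capacity_bits = BANK_WIDTH_BITS * BANK_DEPTH * BANKS
banked_bandwidth_bits = BANK_WIDTH_BITS * BANKS

print(capacity_bits, banked_capacity_bits)    # 65536 65536
print(bandwidth_bits, banked_bandwidth_bits)  # 512 512
print(capacity_bits // 8 // 1024, "KiB")      # 8 KiB
```

The 4-bank organisation therefore matches both the 32×32×64-bit capacity and the 512-bit access width.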

The output unit 107 is configured to output the inverse matrix of the matrix according to the second shift data of the operation result of each iterative operation stored in the storage unit 106.

Optionally, the functions of each unit module in the matrix inversion device 10 are described below in detail using the case of N = 8 with 4 interleaved matrices as an example.

In this embodiment of the present application, when N = 8 and 4 matrices are interleaved, the 4 matrices may be different 8-order positive definite matrices. The data writing control unit 101 first completes write control of the externally input matrices, all of which are received row by row. As shown in FIG. 2 (a), matrices 1 to 4 are the 4 matrices received row by row by the data writing control unit 101. Specifically, write control of an externally input matrix may mean taking the conjugate of the matrix transferred by rows, which realizes the row-to-column transposition of the matrix. The matrix format output by the data writing control unit 101 to the first data shifting unit 102 is shown in FIG. 2 (b).
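The conjugation step works because the input is Hermitian: element (j, i) equals the conjugate of element (i, j), so conjugating row i element by element yields column i. The short check below illustrates this with a 2×2 matrix chosen here for illustration; the function name is hypothetical and the code is not the unit's actual implementation.

```python
def rows_to_columns_by_conjugation(rows):
    """For a Hermitian matrix, the conjugate of row i equals column i."""
    return [[value.conjugate() for value in row] for row in rows]


# Hermitian example: a[1][0] is the conjugate of a[0][1].
a = [[2 + 0j, 1 - 1j],
     [1 + 1j, 3 + 0j]]

columns = rows_to_columns_by_conjugation(a)
explicit_columns = [[a[i][j] for i in range(2)] for j in range(2)]
print(columns == explicit_columns)  # True
```

No data needs to be moved to obtain the column layout; only the sign of each imaginary part changes.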

In this embodiment of the present application, according to the formula of the Cholesky decomposition, the diagonal data dii of the inverse matrix must first be obtained from the diagonal data of the matrix before the decomposition and inversion of the rest of each column can continue, so the operation for each column must begin with the diagonal data. To avoid the addressing complexity of locating the diagonal data during subsequent operations, the first data shifting unit 102 shifts the diagonal data of each column of each matrix to the first position, yielding the first shift data, which is stored in the storage unit 106 as input for subsequent calculation.

In this embodiment, after receiving the matrix transmitted by the data writing control unit 101, the first data shifting unit 102 shifts the diagonal data of the matrix to the first bit of each column, thereby simplifying the addressing complexity of the subsequent calculation. The shifted data is first shift data, which is stored in the storage unit 106. For example, fig. 3 shows first shift data corresponding to each of the matrices obtained by the matrices 1 to 4 after being shifted by the first data shift unit 102.
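A possible software analogue of the first data shift is sketched below. The text states only that the diagonal element is moved to the first position of each column; treating the shift as a cyclic rotation of column j by j positions is an assumption made here for illustration, and the function name is hypothetical.

```python
def shift_diagonal_first(columns):
    """Rotate column j so that its diagonal element (index j) comes first.

    `columns[j]` holds column j from top to bottom. The cyclic rotation is
    an assumed realisation of the shift described in the text.
    """
    return [col[j:] + col[:j] for j, col in enumerate(columns)]


columns = [["a00", "a10", "a20"],
           ["a01", "a11", "a21"],
           ["a02", "a12", "a22"]]
print(shift_diagonal_first(columns))
# [['a00', 'a10', 'a20'], ['a11', 'a21', 'a01'], ['a22', 'a02', 'a12']]
```

After the shift, every column can be read starting from address offset 0 to obtain its diagonal element, which is the addressing simplification the text refers to.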

In the embodiment of the present application, the control unit 103 completes communication among the operation unit 104, the second data shift unit 105, and the storage unit 106, performs address calculation of data in the storage unit 106, iterative control of each iterative operation in the operation unit 104, data shift control of the second data shift unit 105, and control of a plurality of matrix interleaving calculations. The overall computing task may have less sequential logic and more combinational logic.

For example, in the embodiment of the present application, when N = 8 and 4 matrices are interleaved, each matrix undergoes 8 parallel iterative operations. FIG. 4 (a) to FIG. 4 (c) show how the data change during the first three of these parallel iterative operations in the operation unit 104 for the 4 interleaved 8-order matrices.

As shown in fig. 4 (a), in the first parallel iterative operation process, the control unit 103 controls the input of the first column component data in the first shift data of each of the matrices 1-4 in the operation unit 104, and obtains the first parallel iterative operation result corresponding to each matrix through the operation of 8 CMACs, where the first parallel iterative operation result includes the first column vector data L10-L70 of the lower triangular matrix corresponding to each matrix and the first row vector data d00 of the inverse matrix.

After obtaining the first parallel iterative operation result, the control unit 103 controls the second data shift unit 105 to perform data shift on the first parallel iterative operation result to obtain second shift data corresponding to each matrix as input of the next iterative operation, as second shift data of the first parallel iterative result shown in fig. 4 (b).

FIG. 4 (b) shows the second parallel iterative operation. The control unit 103 controls the input, into the operation unit 104, of the second column component data in the first shift data of each of the matrices 1 to 4 together with the second shift data corresponding to the operation result of the first parallel iterative operation. The second parallel iterative operation result corresponding to each matrix is obtained through the operation of the 8 CMACs, and it includes the second column vector data L21 to L71 of the lower triangular matrix corresponding to each matrix and the second row vector data d10 and d11 of the inverse matrix.

Correspondingly, after obtaining the second parallel iterative operation result, the control unit 103 controls the second data shift unit 105 to perform data shift on the second parallel iterative operation result to obtain second shift data corresponding to each matrix, as input of the next iterative operation, as second shift data of the second parallel iterative result shown in fig. 4 (c).

FIG. 4 (c) shows the third parallel iterative operation. The control unit 103 controls the input, into the operation unit 104, of the third column component data in the first shift data of each of the matrices 1 to 4 together with the second shift data corresponding to the operation results of the first and second parallel iterative operations. The third parallel iterative operation result corresponding to each matrix is obtained through the operation of the 8 CMACs, and it includes the third column vector data L32 to L72 of the lower triangular matrix corresponding to each matrix and the third row vector data d20, d21, and d22 of the inverse matrix.

Correspondingly, after obtaining the third parallel iterative operation result, the control unit 103 controls the second data shift unit 105 to perform data shift on the third parallel iterative operation result to obtain second shift data corresponding to each matrix, as input of the next iterative operation, as second shift data of the third parallel iterative operation result shown in fig. 4 (c).

By analogy, until the eighth parallel iterative operation, the control unit 103 controls the input of the eighth column component data in the first shift data of each matrix of the matrices 1-4 and the second shift data corresponding to the operation result of the previous 7 parallel iterative operations in the operation unit 104, and obtains the operation result of the eighth parallel iterative operation corresponding to each matrix through the operation of 8 CMACs, where the operation result of the eighth parallel iterative operation includes the eighth row vector data d70-d77 of the inverse matrix corresponding to each matrix.
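The pattern in the iterations described above can be generated programmatically. The helper below only reproduces the element labels used in the text (L for the lower triangular matrix, d for the inverse, zero-based indices); the function name is hypothetical.

```python
def iteration_outputs(n, x):
    """Labels produced by the x-th parallel iteration (x counted from 1)."""
    c = x - 1  # zero-based column / row index used in the labels
    column = [f"L{r}{c}" for r in range(x, n)]  # below-diagonal part of column c
    row = [f"d{c}{j}" for j in range(x)]        # row c up to the diagonal
    return column, row


for x in (1, 2, 3, 8):
    column, row = iteration_outputs(8, x)
    print(x, column, row, len(column) + len(row))
```

For N = 8 each iteration yields exactly 8 values (7 + 1, 6 + 2, 5 + 3, and finally 0 + 8), which matches the 8 CMACs of the operation unit.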

After the end of the eight parallel iterative operations, the storage format of the operation result of each parallel iterative operation stored in the storage unit 106 is shown in fig. 5. Finally, the output unit 107 reads the operation result of each parallel iterative operation in the storage unit 106, and outputs the operation result.

The functions of the unit modules in the matrix inversion device 10 have been described above using the case of N = 8 with 4 interleaved matrices as an example. It should be understood that, for positive definite matrices with N not equal to 8, the matrix inversion device 10 may perform matrix decomposition and inversion on the same principle. For example, operation with N = 4 and 8 interleaved matrices, operation with N = 16 and 2 interleaved matrices, operation with N = 32 and no interleaving, and operation with N not an integer multiple of 8, in which N is rounded up and the interleaving mode is the same as that of a positive definite matrix whose order is an integer multiple of 8, all fall within the scope of protection of the present application.

The matrix inversion device provided by the embodiment of the application can address the problems that a current vector processor has few internal computing resources and uses them at a low rate, so that the processing delay of Cholesky-decomposition-based inversion can be reduced and network performance improved.

It will be appreciated that the various numbers and letter designations used in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments. The sequence number of each process does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic.

The matrix inversion device based on Cholesky decomposition provided in the embodiment of the present application has been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above examples is intended only to help in understanding the method and core idea of the present invention. Since those skilled in the art may, in accordance with the ideas of the present invention, vary the specific embodiments and the scope of application, the content of this description should not be construed as limiting the present invention.

Claims

1. A matrix inversion device based on Cholesky decomposition, comprising a data writing control unit, a first data shifting unit, a control unit, an operation unit, a second data shifting unit, a storage unit and an output unit, characterized in that the operation unit comprises 8 single-precision complex multiply-add units (CMACs), each CMAC having a four-stage pipeline operation structure; the operation unit is connected with the control unit; the first data shifting unit, the control unit, the second data shifting unit and the output unit are each connected with the storage unit; the control unit is connected with the second data shifting unit; and the data writing control unit is connected with the first data shifting unit;

the data writing control unit is used for completing writing control of a matrix, wherein the matrix is an N-order positive definite matrix, and N is an integer which is more than 1 and less than or equal to 32;

the operation unit is configured to perform N parallel iterative operations on the matrix according to the control information of the control unit, so as to obtain an operation result of each parallel iterative operation in the N parallel iterative operations, where the operation result of the x-th parallel iterative operation is obtained according to the x-th column component data of the N-order positive definite matrix and the operation results of the previous (x-1) iterative operations, the operation result of the x-th parallel iterative operation includes the x-th column component data of the lower triangular matrix of the matrix based on Cholesky decomposition and the x-th row component data of the inverse matrix of the matrix, the x-th column component data does not include diagonal data of the lower triangular matrix, and x is an integer greater than 0 and less than or equal to N.

2. The apparatus of claim 1, wherein the first data shifting unit is configured to shift diagonal data of the matrix to the first position of each column to obtain first shift data;

the storage unit is used for storing the first shift data.

3. The apparatus according to claim 1 or 2, wherein the control unit is configured to communicate and control among the storage unit, the second data shift unit, and the arithmetic unit.

4. The apparatus according to claim 1, wherein the second data shift unit is configured to shift data of the operation result of each parallel iterative operation to obtain second shift data of the operation result of each iterative operation, where the second shift data is used for input of a next parallel iterative operation.

5. The apparatus of claim 4, wherein the storage unit is configured to store second shift data of an operation result of the each iterative operation.

6. The apparatus according to claim 5, wherein the output unit is configured to output the inverse matrix based on second shift data of the operation result of each iterative operation stored in the storage unit.