WO2021036313A1

WO2021036313A1 - Cholesky decomposition-based matrix inversion apparatus

Info

Publication number: WO2021036313A1
Application number: PCT/CN2020/086987
Authority: WO
Inventors: 张应松; 矫渊培
Original assignee: 华为技术有限公司
Priority date: 2019-08-28
Filing date: 2020-04-26
Publication date: 2021-03-04
Also published as: CN112445752B; CN112445752A

Abstract

The present application discloses a Cholesky decomposition-based matrix inversion apparatus comprising a data write control unit, a first data shift unit, a control unit, a calculation unit, a second data shift unit, a memory unit and an output unit. The calculation unit comprises eight single-precision complex multiplier-accumulators (CMACs), each having a four-stage pipeline calculation structure. The calculation unit is connected to the control unit; the first data shift unit, the control unit, the second data shift unit and the output unit are each connected to the memory unit; the control unit is connected to the second data shift unit; and the data write control unit is connected to the first data shift unit. By using an eight-CMAC calculation unit, the technical solutions of the present application address the problems that current vector processors have relatively few internal computing resources and relatively low utilization of those resources, reduce the processing delay of Cholesky decomposition-based inversion, and improve network performance.

Description

A matrix inversion device based on Cholesky decomposition

This application claims priority to Chinese patent application No. 201910804096.X, filed with the Chinese Patent Office on August 28, 2019 and entitled "A matrix inversion device based on Cholesky decomposition", the entire content of which is incorporated in this application by reference.

Technical field

This application relates to the field of digital signal processing, in particular to a matrix inversion device based on Cholesky decomposition.

Background technique

Inversion based on Cholesky decomposition is a commonly used method for inverting a positive definite matrix. The principle is as follows: for an n-th order symmetric positive definite matrix A, there exists a lower triangular matrix L such that A = L*L^T; the inverse of A is then A^-1 = (L*L^T)^-1 = (L^T)^-1 * L^-1 = (L^-1)^T * L^-1. The common practice in the industry is to use a vector processor to perform Cholesky-based decomposition and inversion.

Matrix inversion based on Cholesky decomposition involves a large number of iterations and interleaved operations, and the amount of internal computation is very large. A vector processor relies mainly on its internal vector processing unit to perform the decomposition and inversion. However, current vector processors contain only 16 half-precision complex multiply-accumulate units (CMACs), equivalent to only 4 single-precision CMACs; that is, ideally, only 4 single-precision complex operations can be performed at a time. Even if the utilization of computing resources reached 100%, the processing capability would still be weak. In addition, when a vector processor performs inversion based on Cholesky decomposition, it first performs the decomposition and starts the inversion only after all decomposition results have been obtained. Because of data dependencies in the computation, the decomposition requires fewer and fewer CMACs as the iteration progresses, whereas the inversion requires more and more. In both phases, CMAC utilization therefore varies over the iterations, so the average CMAC utilization is low.

In summary, current vector processors have few internal computing resources and low utilization of those resources, so Cholesky-based decomposition and inversion incurs a long processing delay, easily becomes the bottleneck of the link, and degrades network performance.

Summary of the invention

The embodiment of the present application provides a matrix inversion device based on Cholesky decomposition, which can reduce the processing delay of Cholesky-based decomposition and inversion, and improve network performance.

This application provides a matrix inversion device based on Cholesky decomposition, which includes a data writing control unit, a first data shift unit, a control unit, an arithmetic unit, a second data shift unit, a storage unit, and an output unit. The arithmetic unit includes 8 single-precision complex multiply-add units (CMACs), each with a four-stage pipeline operation structure. The arithmetic unit is connected to the control unit; the first data shift unit, the control unit, the second data shift unit and the output unit are each connected to the storage unit; the control unit and the second data shift unit are connected to each other; and the data writing control unit is connected to the first data shift unit.

In the matrix inversion device based on Cholesky decomposition, the data writing control unit is used to complete the writing control of the matrix. The matrix is a positive definite matrix of order N, and N is an integer greater than 1 and less than or equal to 32.

In the matrix inversion device based on Cholesky decomposition, the first data shift unit is used to shift the diagonal data of the matrix to the first position of each column to obtain the first shift data.

In the matrix inversion device based on Cholesky decomposition, the control unit is used for communication and control between the storage unit, the second data shift unit and the arithmetic unit.

In the matrix inversion device based on Cholesky decomposition, the arithmetic unit is used to perform N parallel iterative operations on the matrix according to the control information of the control unit, to obtain the operation result of each of the N parallel iterative operations. The operation result of the x-th parallel iterative operation is obtained from the component data of the x-th column of the N-th order positive definite matrix and the operation results of the previous (x-1) iterative operations. It contains the column component data of the x-th column of the lower triangular matrix obtained by Cholesky decomposition of the matrix, and the row component data of the x-th row of the inverse matrix of the matrix. The column component data of the x-th column does not include the diagonal data of the lower triangular matrix, and x is an integer greater than 0 and less than or equal to N.

In the matrix inversion device based on Cholesky decomposition, the second data shift unit is used to perform data shift on the operation result of each parallel iterative operation to obtain the second shift data of the operation result of each iterative operation. The second shift data is used as the input of the next parallel iterative operation.

In the matrix inversion device based on Cholesky decomposition, the storage unit is used to store the first shift data and the second shift data of the operation result of each iteration operation. The storage unit is a local cache that can hold a matrix of at most 32x32x64 bits.

In the matrix inversion device based on Cholesky decomposition, the output unit is used to output the inverse matrix according to the second shift data of the operation result of each iteration operation stored in the storage unit.

For a positive definite matrix with N=8, the matrix inversion device can support the interleaved operation of 4 matrices at the same time. For a positive definite matrix with N=4, it can support the interleaved operation of 8 matrices at the same time. For a positive definite matrix with N=16, it can support the interleaved operation of 2 matrices at the same time. For a positive definite matrix with N=32, no matrix interleaving is required. When N is not an integer multiple of 8, a nearest-match approach can be used, with the same interleaving scheme as for a positive definite matrix whose N is an integer multiple of 8.

The embodiment of the present application provides a matrix inversion device based on Cholesky decomposition that addresses the few internal computing resources and low resource utilization of current vector processors, thereby reducing the processing delay of Cholesky-based decomposition and inversion and improving network performance.

Description of the drawings

FIG. 1 is a schematic diagram of an embodiment of a matrix inversion device based on Cholesky decomposition provided by an embodiment of the application;

FIG. 2(a) is a schematic diagram of a matrix format received by a data writing control unit provided by an embodiment of the application;

FIG. 2(b) is a schematic diagram of the matrix format output by the data writing control unit provided by an embodiment of the application;

FIG. 3 is a schematic diagram of the first shift data obtained after the matrix is shifted by the first data shift unit according to an embodiment of the application;

FIG. 4(a) is a schematic diagram of data changes in the arithmetic unit during the first parallel iterative operation of 4 groups of interleaved 8th-order matrices provided by an embodiment of the application;

FIG. 4(b) is a schematic diagram of data changes in the arithmetic unit during the second parallel iterative operation of the 4 groups of interleaved 8th-order matrices provided by the embodiment of the application;

FIG. 4(c) is a schematic diagram of data changes in the arithmetic unit during the third parallel iterative operation of the 4 groups of interleaved 8th-order matrices provided by the embodiment of the application;

FIG. 5 is a schematic diagram of the storage format of the operation result of each parallel iterative operation of 4 groups of interleaved 8th-order matrices provided by an embodiment of the application.

Detailed description

In order to enable those skilled in the art to better understand the solutions of the application, the technical solutions in the embodiments of the application are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The terms "first" and "second" in the specification and claims of the application and in the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged under appropriate circumstances, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.

Solving matrix equations requires the inverse of a matrix. For example, to solve AX = B for X, A^-1 is found first, and X is then obtained according to X = A^-1 * B.

The Cholesky decomposition algorithm is a very common matrix decomposition method. Its basic principle is: for an n-th order symmetric positive definite matrix A, there exists a lower triangular matrix L such that A = L*L^T, where the diagonal entries of L are all positive real numbers and L^T denotes the conjugate transpose of the lower triangular matrix L.

The basic formula of Cholesky decomposition algorithm is:

where j = 1, 2, ..., n; the initial values of l_jj and l_ij are:
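The equation images of the original publication are not reproduced in this text. For reference, the textbook column-by-column Cholesky recurrences, written with the symbols used above, are given below. This is the standard form and is only assumed, not confirmed, to match the omitted formulas; the overline denotes complex conjugation.

```latex
l_{jj} = \sqrt{\,a_{jj} - \sum_{k=1}^{j-1} \lvert l_{jk} \rvert^{2}\,},
\qquad
l_{ij} = \frac{1}{l_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik}\,\overline{l_{jk}} \right),
\quad i = j+1, \dots, n

% For j = 1 the sums are empty, which gives the starting values:
l_{11} = \sqrt{a_{11}}, \qquad l_{i1} = \frac{a_{i1}}{l_{11}}, \quad i = 2, \dots, n
```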

Finally, the inverse matrix of A is obtained according to A^-1 = (L*L^T)^-1 = (L^T)^-1 * L^-1 = (L^-1)^T * L^-1.
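As a software illustration of the principle just stated, the following minimal sketch factors A, inverts the triangular factor by forward substitution, and forms (L^-1)^T * L^-1, with ^T read as the conjugate transpose. It is a plain reference computation, not the hardware scheme of this application, and the function names are hypothetical.

```python
import math

def cholesky_lower(a):
    """Return lower-triangular L with a = L * L^H (a Hermitian positive definite)."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: real and positive for a positive definite input.
        s = a[j][j].real - sum(abs(l[j][k]) ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = complex(math.sqrt(s))
        # Entries below the diagonal depend on all previous columns.
        for i in range(j + 1, n):
            acc = a[i][j] - sum(l[i][k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = acc / l[j][j]
    return l

def invert_lower(l):
    """Return D = L^-1 by forward substitution, one row at a time."""
    n = len(l)
    d = [[0j] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 1 / l[i][i]
        for j in range(i):
            acc = sum(l[i][k] * d[k][j] for k in range(j, i))
            d[i][j] = -acc * d[i][i]
    return d

def cholesky_inverse(a):
    """Return A^-1 = (L^-1)^H * L^-1."""
    n = len(a)
    d = invert_lower(cholesky_lower(a))
    # D is lower triangular, so terms with k < max(i, j) vanish.
    return [[sum(d[k][i].conjugate() * d[k][j] for k in range(max(i, j), n))
             for j in range(n)] for i in range(n)]
```

For example, calling `cholesky_inverse([[4+0j, 1+1j], [1-1j, 3+0j]])` returns a 2x2 list whose product with the input is the identity up to rounding.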

A vector processor implements matrix inversion based on Cholesky decomposition mainly in two steps. First, the decomposition is performed, that is, the lower triangular matrix L is obtained according to A = L*L^T. Because of data dependencies, the vector processor must perform the Cholesky decomposition of matrix A column by column: the first column is calculated first, the second column is calculated after the first is complete, and the calculation of each column depends on the results of all previous columns. After the Cholesky decomposition has produced the lower triangular matrix of A, the inversion is performed, that is, A^-1 is calculated according to A^-1 = (L^-1)^T * L^-1. Because of data dependencies, the vector processor must also perform the inversion row by row based on the decomposition results, calculating the next row only after the current row is complete.

Thus, in current vector processors the inversion is performed only after all decomposition results have been obtained, CMAC utilization is low, and the CMAC computing resources of the vector processor are few. The embodiment of the present application provides a matrix inversion device based on Cholesky decomposition that can reduce the processing delay of Cholesky-based decomposition and inversion and improve network performance. Please refer to FIG. 1.
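The effect of this two-step schedule on resource usage can be illustrated with a deliberately crude model. The sketch below counts only how many output entries each step produces and assumes one lane per output; it ignores the varying number of multiply-accumulates behind each entry, so the percentages are indicative only. The function names and the lane model are assumptions made for this illustration.

```python
def outputs_per_step_two_phase(n):
    """Entries produced per step when decomposition finishes before inversion starts."""
    decomposition = [n - j for j in range(n)]  # column j: diagonal plus the n-j-1 entries below it
    inversion = [i + 1 for i in range(n)]      # row i of the triangular inverse: i+1 entries
    return decomposition + inversion

def average_lane_occupancy(outputs, lanes):
    """Fraction of lane slots that hold an output, under the one-lane-per-output assumption."""
    slots = sum(((count + lanes - 1) // lanes) * lanes for count in outputs)
    return sum(outputs) / slots

steps = outputs_per_step_two_phase(8)
print(steps)                             # shrinking during decomposition, growing during inversion
print(average_lane_occupancy(steps, 4))  # 4 lanes: the "4 single-precision CMACs" of the background
```

Under this model the per-step demand falls from 8 to 1 and then rises from 1 to 8, which is the uneven utilization described above.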

FIG. 1 is a schematic diagram of an embodiment of a matrix inversion device 10 based on Cholesky decomposition provided by an embodiment of the application.

As shown in FIG. 1, the matrix inversion device 10 based on Cholesky decomposition provided by the embodiment of the present application includes a data writing control unit 101, a first data shift unit 102, a control unit 103, an arithmetic unit 104, a second data shift unit 105, a storage unit 106, and an output unit 107. The arithmetic unit 104 includes 8 single-precision complex multiply-add units (CMACs), each CMAC having a four-stage pipeline operation structure. The arithmetic unit 104 is connected to the control unit 103; the first data shift unit 102, the control unit 103, the second data shift unit 105, and the output unit 107 are each connected to the storage unit 106; the control unit 103 and the second data shift unit 105 are connected to each other; and the data writing control unit 101 is connected to the first data shift unit 102.

It should be noted that the matrix inversion device 10 based on Cholesky decomposition provided by the embodiment of the present application includes 8 single-precision CMACs, each with a four-stage pipeline operation structure, and can directly support N less than or equal to 32. When N is greater than 32, the matrix can be decomposed in software into dimensions of 32 or below for calculation, which is not limited in the embodiment of the application.

Since the matrix inversion device 10 provided by the embodiment of the present application includes 8 single-precision CMACs, each with a four-stage pipeline operation structure, for a positive definite matrix with N=8 the matrix inversion device 10 can support the interleaved operation of 4 matrices at the same time. For a positive definite matrix with N=4, it can support the interleaved operation of 8 matrices at the same time. For a positive definite matrix with N=16, it can support the interleaved operation of 2 matrices at the same time. For a positive definite matrix with N=32, no matrix interleaving is required. All of these cases can achieve 100% CMAC utilization. It should be noted that when N is not an integer multiple of 8, a nearest-match approach can be used, with the same interleaving scheme as for a positive definite matrix whose N is an integer multiple of 8.
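The four cases listed above are all consistent with 32/N matrices being interleaved, where 32 equals 8 CMACs times 4 pipeline stages. The text does not state this relationship explicitly, so the sketch below treats it as an observation rather than a rule of the design; the constant and function names are hypothetical, and orders other than the four listed are deliberately rejected because the nearest-match handling is not specified.

```python
CMAC_COUNT = 8       # stated in the text
PIPELINE_STAGES = 4  # stated in the text

def interleaved_matrices(n):
    """Number of order-n matrices processed together, for the four orders the text lists."""
    if n not in (4, 8, 16, 32):
        raise ValueError("only N = 4, 8, 16 and 32 are spelled out in the text")
    slots = CMAC_COUNT * PIPELINE_STAGES  # 32; linking this product to N is an assumption
    return slots // n

for order in (4, 8, 16, 32):
    print(order, interleaved_matrices(order))
```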

Specifically, in the embodiment of the present application, the functions of each functional unit included in the matrix inversion device 10 are as follows:

The data writing control unit 101 is used to complete the writing control of the matrix. It should be noted that the matrix in the embodiment of the present application is a positive definite matrix of order N, and N is an integer greater than 1 and less than or equal to 32.

The first data shift unit 102 is configured to shift the diagonal data of the matrix to the first position of each column to obtain the first shift data.

The control unit 103 is used for scheduling and controlling the entire computing task.

The arithmetic unit 104 is configured to perform N parallel iterative operations on the matrix according to the control signal of the control unit, to obtain the operation result of each of the N parallel iterative operations. The operation result of the x-th parallel iterative operation is obtained from the component data of the x-th column of the N-th order positive definite matrix and the operation results of the previous (x-1) iterative operations. It contains the column component data of the x-th column of the lower triangular matrix obtained by Cholesky decomposition of the matrix, and the row component data of the x-th row of the inverse matrix of the matrix, where x is an integer greater than 0 and less than or equal to N.
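One way to read this iteration is sketched below in software: iteration x produces column x of L and, in the same pass, row x of D = L^-1, using only results of earlier iterations. The text does not state whether the "inverse matrix" rows produced per iteration are rows of L^-1 or of A^-1; this sketch assumes L^-1 and forms A^-1 = D^T * D (conjugate transpose) at the end. The function name is hypothetical and the code models the data dependencies only, not the CMAC pipeline.

```python
import math

def fused_cholesky_inverse(a):
    """Run n iterations; iteration x yields column x of L and row x of D = L^-1."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]
    d = [[0j] * n for _ in range(n)]
    for x in range(n):
        # Diagonal entries of L and D come first, from the diagonal of A.
        s = a[x][x].real - sum(abs(l[x][k]) ** 2 for k in range(x))
        l[x][x] = complex(math.sqrt(s))
        d[x][x] = 1 / l[x][x]
        # Column x of L below the diagonal: n - x - 1 entries.
        for i in range(x + 1, n):
            acc = a[i][x] - sum(l[i][k] * l[x][k].conjugate() for k in range(x))
            l[i][x] = acc * d[x][x]
        # Row x of D left of the diagonal: x entries, using rows 0..x-1 of D.
        for j in range(x):
            acc = sum(l[x][k] * d[k][j] for k in range(j, x))
            d[x][j] = -acc * d[x][x]
    # A^-1 = D^H * D, formed once all rows of D are available.
    inv = [[sum(d[k][i].conjugate() * d[k][j] for k in range(max(i, j), n))
            for j in range(n)] for i in range(n)]
    return l, d, inv
```

With 0-based x, each iteration yields (n - x - 1) entries of L plus (x + 1) entries of D, that is, n entries in total, which is one plausible reason the workload per iteration stays constant.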

The second data shift unit 105 is configured to perform data shift on the operation result of each parallel iterative operation to obtain the second shift data of the operation result of each iterative operation, which is used for the input of the next parallel iterative operation.

The storage unit 106 is configured to store the first shift data and the second shift data of the operation result of each iteration operation.

It should be noted that in this embodiment of the application, the storage unit 106 gives priority to requests from the arithmetic unit 104 and responds to external input only when the arithmetic unit 104 has no demand. The storage unit 106 can also apply back pressure to the preceding stage when the number of received matrices reaches its capacity. In the embodiment of the present application, the storage unit is a local cache that can hold a matrix of at most 32x32x64 bits. Optionally, the bandwidth of the storage unit 106 is 64*8 = 512 bits, which can be implemented as 128 bits * 128 depth * 4 banks.
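The figures quoted for the storage unit can be cross-checked with simple arithmetic. In the sketch below the variable names are hypothetical, and reading "64 bits" as one single-precision complex value (32-bit real plus 32-bit imaginary part) is an assumption; the numbers themselves are taken from the text.

```python
ELEMENT_BITS = 64  # assumed: one single-precision complex value per matrix entry
MAX_ORDER = 32     # largest directly supported N

capacity_bits = MAX_ORDER * MAX_ORDER * ELEMENT_BITS  # one 32x32x64-bit matrix
bandwidth_bits = ELEMENT_BITS * 8                     # 64*8, eight 64-bit values per access
bank_layout_bits = 128 * 128 * 4                      # 128 bits x 128 depth x 4 banks
bank_width_bits = 128 * 4                             # bits across all 4 banks in one access

print(capacity_bits, bank_layout_bits)   # both totals agree
print(bandwidth_bits, bank_width_bits)   # both widths agree
```

The bank arrangement therefore matches both the stated capacity and the stated bandwidth.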

The output unit 107 is configured to output the inverse matrix of the matrix according to the second shift data of the operation result of each iteration operation stored in the storage unit 106.

Optionally, the functions of each unit module of the matrix inversion device 10 are described in detail below, taking N=8 with 4 interleaved matrices as an example.

In the embodiment of the present application, when N=8 and the operation is performed with 4-way interleaving, the 4 matrices may be different 8th-order positive definite matrices. The data writing control unit 101 first completes the writing control of the externally input matrices, all of which are received row by row. As shown in FIG. 2(a), matrix 1 to matrix 4 are the 4 matrices received row by row by the data writing control unit 101. Specifically, the writing control may consist of taking the conjugate of a matrix transmitted by rows so as to realize its row-column transposition. The matrix format output by the data writing control unit 101 to the first data shift unit 102 is shown in FIG. 2(b).
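The reason conjugation can stand in for transposition is that the input is Hermitian: a_ij equals the conjugate of a_ji, so the element-wise conjugate of row i is exactly column i. The minimal sketch below illustrates this property only; the function name is hypothetical and nothing here is taken from the hardware of unit 101.

```python
def rows_to_columns_by_conjugation(rows):
    """Conjugate every element of a row-wise stream.

    For a Hermitian matrix a[i][j] == conj(a[j][i]), so conjugated row i
    equals column i and no element has to be moved.
    """
    return [[value.conjugate() for value in row] for row in rows]

a = [[4+0j, 1+1j, 2+0j],
     [1-1j, 3+0j, 0+1j],
     [2+0j, 0-1j, 5+0j]]
columns = rows_to_columns_by_conjugation(a)
print(columns[1])  # column 1 of a, read top to bottom
```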

In the embodiment of this application, according to the Cholesky decomposition formula, it is necessary to first obtain the diagonal data dii of the inverse matrix of the matrix according to the diagonal data of the matrix, and then continue the next decomposition and inversion operation of each column. Therefore, the initial calculation of each column must include diagonal data. In order to solve the problem of addressing complexity caused by searching for diagonal data in the subsequent operation, the first data shift unit 102 is used in the embodiment of the present application to shift the diagonal data of each column of each matrix to the first position. In this way, the first shift data is obtained and stored in the storage unit 106 as input for subsequent calculations.

In the embodiment of the present application, after the first data shift unit 102 receives the matrix transmitted by the data writing control unit 101, it shifts the diagonal data of the matrix to the first position of each column, thereby reducing the addressing complexity of subsequent calculations. The shifted data is the first shift data, which is stored in the storage unit 106. For example, FIG. 3 shows the first shift data obtained for each matrix after matrix 1 to matrix 4 are shifted by the first data shift unit 102.

In the embodiment of this application, the control unit 103 handles the communication among the arithmetic unit 104, the second data shift unit 105, and the storage unit 106. It performs the address calculation for data in the storage unit 106, the control of each iteration of the operation in the arithmetic unit 104, the data shift control of the second data shift unit 105, and the control of the interleaved calculation of multiple matrices. The entire computing task can be implemented with less sequential logic and more combinational logic.
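Since FIG. 3 is not reproduced here, the exact layout of the first shift data is not known. One plausible reading, shown in the sketch below, is a cyclic rotation of column j by j places so that its diagonal entry lands at index 0. Both the rotation and the function name are assumptions made for illustration.

```python
def shift_diagonal_first(columns):
    """Rotate column j up by j places so that its diagonal entry is at index 0."""
    return [col[j:] + col[:j] for j, col in enumerate(columns)]

columns = [["a00", "a10", "a20"],
           ["a01", "a11", "a21"],
           ["a02", "a12", "a22"]]
shifted = shift_diagonal_first(columns)
print(shifted)
```

After the shift, the first element of every column is a diagonal entry, so later stages can fetch it from a fixed position instead of computing its address.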

For example, in the embodiment of the present application, when N=8 and the operation is performed with 4-way interleaving, each matrix corresponds to 8 parallel iterative operations. FIG. 4(a) to FIG. 4(c) show the data changes in the arithmetic unit 104 during the parallel iterative operations of the 4 groups of interleaved 8th-order matrices.

As shown in FIG. 4(a), during the first parallel iterative operation, the control unit 103 controls the input of the first column of component data in the first shift data of each of matrix 1 to matrix 4 into the arithmetic unit 104. Through the operations of the 8 CMACs, the first parallel iterative operation result corresponding to each matrix is obtained. This result contains the first-column vector data L10-L70 of the lower triangular matrix corresponding to each matrix, and the first-row vector data d00 of the inverse matrix.

After the first parallel iterative operation result is obtained, the control unit 103 controls the second data shift unit 105 to perform data shift on it, obtaining the second shift data corresponding to each matrix as the input of the next iterative operation. The second shift data of the first parallel iteration result is shown in FIG. 4(b).

FIG. 4(b) shows that, during the second parallel iterative operation, the control unit 103 controls the input into the arithmetic unit 104 of the second column of component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data corresponding to the result of the first parallel iterative operation. Through the operations of the 8 CMACs, the second parallel iterative operation result corresponding to each matrix is obtained. This result contains the second-column vector data L21-L71 of the lower triangular matrix corresponding to each matrix, and the second-row vector data d10 and d11 of the inverse matrix.

Correspondingly, after the second parallel iterative operation result is obtained, the control unit 103 controls the second data shift unit 105 to perform data shift on it, obtaining the second shift data corresponding to each matrix as the input of the next iterative operation. The second shift data of the second parallel iteration result is shown in FIG. 4(c).

FIG. 4(c) shows that, during the third parallel iterative operation, the control unit 103 controls the input into the arithmetic unit 104 of the third column of component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data corresponding to the results of the first and second parallel iterative operations. Through the operations of the 8 CMACs, the third parallel iterative operation result corresponding to each matrix is obtained. This result contains the third-column vector data L32-L72 of the lower triangular matrix corresponding to each matrix, and the third-row vector data d20, d21, and d22 of the inverse matrix.

Correspondingly, after the third parallel iterative operation result is obtained, the control unit 103 controls the second data shift unit 105 to perform data shift on it, obtaining the second shift data corresponding to each matrix as the input of the next iterative operation, as shown in FIG. 4(c).

By analogy, in the eighth parallel iterative operation, the control unit 103 controls the input into the arithmetic unit 104 of the eighth column of component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data corresponding to the results of the previous 7 parallel iterative operations. Through the operations of the 8 CMACs, the eighth parallel iterative operation result corresponding to each matrix is obtained; it contains the eighth-row vector data d70-d77 of the inverse matrix corresponding to each matrix.
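The pattern of results across the eight iterations can be enumerated directly from the labels used above (L10-L70 and d00, then L21-L71 with d10 and d11, and so on). The short sketch below generates those labels; the function name is hypothetical, and the labels follow the 0-based indices of the text.

```python
def iteration_outputs(n, x):
    """Labels produced by the x-th iteration (x counted from 1), with 0-based indices."""
    col = x - 1
    l_column = [f"L{row}{col}" for row in range(col + 1, n)]  # below-diagonal entries of column x
    d_row = [f"d{col}{j}" for j in range(col + 1)]            # entries of row x, diagonal included
    return l_column + d_row

for x in range(1, 9):
    labels = iteration_outputs(8, x)
    print(x, len(labels), labels)
```

Every iteration yields exactly 8 values for N=8, one per CMAC, which is consistent with the 100% CMAC utilization stated for this case.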

After the eight parallel iterative operations are completed, the storage format of the operation result of each parallel iterative operation stored in the storage unit 106 is shown in FIG. 5. Finally, the output unit 107 reads the operation result of each parallel iterative operation in the storage unit 106 and outputs the operation result.

The foregoing takes N=8 with 4 interleaved matrices as an example to describe the functions of each unit module of the matrix inversion device 10 in the embodiment of the present application. It should be understood that, for a positive definite matrix whose N is not equal to 8, the matrix inversion device 10 can perform matrix decomposition and inversion on the same principle. For example, operations with N=4 and 8 interleaved matrices, with N=16 and 2 interleaved matrices, or with N=32 and no interleaving, as well as operations that, when N is not an integer multiple of 8, use a nearest-match approach with the same interleaving scheme as for a positive definite matrix whose N is an integer multiple of 8, all fall within the scope of protection of this application.

The matrix inversion device provided by the embodiments of the present application addresses the few internal computing resources and low resource utilization of current vector processors, thereby reducing the processing delay of Cholesky-based decomposition and inversion and improving network performance.

It can be understood that the various numbers or letter numbers involved in the embodiments of the present application are only for easy distinction for description, and are not used to limit the scope of the embodiments of the present application. The size of the sequence number of each process mentioned above does not mean the order of execution, and the execution order of each process should be determined by its function and internal logic.

The above is a detailed introduction to the matrix inversion device based on Cholesky decomposition provided in the embodiments of the present application. Specific examples are used herein to explain the principles and implementation of the present invention. The description of the above embodiments is only intended to help understand the method and core idea of the present invention. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as a limitation on the present invention.

Claims

1. A matrix inversion device based on Cholesky decomposition, comprising a data writing control unit, a first data shift unit, a control unit, an arithmetic unit, a second data shift unit, a storage unit and an output unit, characterized in that the arithmetic unit includes 8 single-precision complex multiply-add units (CMACs), each CMAC has a four-stage pipeline operation structure, the arithmetic unit is connected to the control unit, the first data shift unit, the control unit, the second data shift unit and the output unit are each connected to the storage unit, the control unit and the second data shift unit are connected to each other, and the data writing control unit is connected to the first data shift unit.
2. The device according to claim 1, wherein the data writing control unit is configured to complete the writing control of a matrix, the matrix is a positive definite matrix of order N, and N is an integer greater than 1 and less than or equal to 32.
3. The device according to claim 2, wherein the first data shift unit is configured to shift the diagonal data of the matrix to the first position of each column to obtain the first shift data.
4. The device according to claim 2 or 3, wherein the control unit is used for communication and control between the storage unit, the second data shift unit, and the arithmetic unit.
5. The device according to claim 4, wherein the arithmetic unit is configured to perform N parallel iterative operations on the matrix according to the control information of the control unit to obtain the operation result of each of the N parallel iterative operations, wherein the operation result of the x-th parallel iterative operation is obtained from the component data of the x-th column of the N-th order positive definite matrix and the operation results of the previous (x-1) iterative operations, the operation result of the x-th parallel iterative operation includes the column component data of the x-th column of the lower triangular matrix obtained by Cholesky decomposition of the matrix and the row component data of the x-th row of the inverse matrix of the matrix, the column component data of the x-th column does not include the diagonal data of the lower triangular matrix, and x is an integer greater than 0 and less than or equal to N.
6. The device according to claim 5, wherein the second data shift unit is configured to perform data shift on the operation result of each parallel iterative operation to obtain the second shift data of the operation result of each iterative operation, the second shift data being used as the input of the next parallel iterative operation.
7. The device according to claim 6, wherein the storage unit is configured to store the first shift data and the second shift data of the operation result of each iteration operation.
8. The device according to claim 7, wherein the output unit is configured to output the inverse matrix according to the second shift data of the operation result of each iteration operation stored in the storage unit.