WO2021036313A1 - Cholesky decomposition-based matrix inversion apparatus - Google Patents

Publication number
WO2021036313A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
data
matrix
shift
control unit
Prior art date
Application number
PCT/CN2020/086987
Other languages
French (fr)
Chinese (zh)
Inventor
张应松
矫渊培
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021036313A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7896 Modular architectures, e.g. assembled from a number of identical packages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • G06F5/015 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising having at least two separately controlled shifting levels, e.g. using shifting matrices

Definitions

  • This application relates to the field of digital signal processing, and in particular to a matrix inversion device based on Cholesky decomposition.
  • A vector processor relies mainly on its internal vector processing unit to perform Cholesky-based decomposition and inversion.
  • Current vector processors contain only 16 half-precision complex multiply-add units (CMACs), equivalent to only 4 single-precision CMACs. Ideally, they can therefore perform only 4 single-precision complex operations at a time, so the processing capability remains weak even if the utilization of computing resources reaches 100%.
  • When performing Cholesky-based inversion, the vector processor first performs the Cholesky decomposition and starts the inversion only after all decomposition results have been obtained. Because of data dependencies in the computation, the decomposition needs fewer and fewer CMACs as the iterations progress, whereas the inversion needs more and more. In both the decomposition and the inversion, CMAC utilization therefore rises or falls as the iterations progress, so the average CMAC utilization is low.
  • In summary, current vector processors have few internal computing resources and use them inefficiently, so Cholesky-based decomposition and inversion has a long processing delay, which can easily become the bottleneck of the link and degrade network performance.
  • The embodiments of the present application provide a matrix inversion device based on Cholesky decomposition, which can reduce the processing delay of Cholesky-based decomposition and inversion and improve network performance.
  • This application provides a matrix inversion device based on Cholesky decomposition, which includes a data writing control unit, a first data shift unit, a control unit, an arithmetic unit, a second data shift unit, a storage unit, and an output unit.
  • The arithmetic unit includes 8 single-precision complex multiply-add units (CMACs).
  • Each CMAC has a four-stage pipeline structure.
  • The arithmetic unit is connected to the control unit; the first data shift unit, the control unit, the second data shift unit, and the output unit are each connected to the storage unit; the control unit and the second data shift unit are connected to each other; and the data writing control unit is connected to the first data shift unit.
  • The data writing control unit is used to complete the writing control of the matrix.
  • The matrix is a positive definite matrix of order N, where N is an integer greater than 1 and less than or equal to 32.
  • The first data shift unit is used to shift the diagonal data of the matrix to the first position of each column to obtain the first shift data.
  • The control unit is used for communication and control among the storage unit, the second data shift unit, and the arithmetic unit.
  • The arithmetic unit is used to perform N parallel iterative operations on the matrix according to the control information of the control unit, obtaining the operation result of each of the N parallel iterative operations.
  • The operation result of the x-th parallel iterative operation is obtained from the component data of the x-th column of the N-th order positive definite matrix and the operation results of the previous (x-1) iterative operations.
  • The operation result of the x-th parallel iterative operation contains the column component data of the x-th column of the lower triangular matrix from the Cholesky decomposition of the matrix and the row component data of the x-th row of the inverse matrix of the matrix.
  • The column component data of the x-th column does not include the diagonal data of the lower triangular matrix; x is an integer greater than 0 and less than or equal to N.
  • The second data shift unit is used to shift the operation result of each parallel iterative operation to obtain the second shift data of that operation result; the second shift data serves as input to the next parallel iterative operation.
  • The storage unit is used to store the first shift data and the second shift data of the operation result of each iterative operation.
  • The storage unit is a local cache that can hold a matrix of at most 32x32x64 bits.
  • The output unit is used to output the inverse matrix according to the second shift data of the operation result of each iterative operation stored in the storage unit.
  • For a positive definite matrix with N=8, the matrix inversion device can support the interleaved operation of 4 matrices at the same time.
  • For a positive definite matrix with N=4, the matrix inversion device can support the interleaved operation of 8 matrices at the same time.
  • For a matrix with N=16, the matrix inversion device can support the interleaved operation of 2 matrices at the same time.
  • For a positive definite matrix with N=32, no matrix interleaving is required.
  • When N is not an integer multiple of 8, N can be rounded up, and the interleaving is then the same as for a positive definite matrix whose order is an integer multiple of 8.
  • The embodiments of the present application provide a matrix inversion device based on Cholesky decomposition that addresses the scarcity and low utilization of internal computing resources in current vector processors, thereby reducing the processing delay of Cholesky-based decomposition and inversion and improving network performance.
  • FIG. 1 is a schematic diagram of an embodiment of a matrix inversion device based on Cholesky decomposition provided by an embodiment of the application;
  • FIG. 2(a) is a schematic diagram of the matrix format received by the data writing control unit provided by an embodiment of the application;
  • FIG. 2(b) is a schematic diagram of the matrix format output by the data writing control unit provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of the first shift data obtained after the first data shift unit shifts the matrix according to an embodiment of the application;
  • FIG. 4(a) is a schematic diagram of the data changes during the first parallel iterative operation of 4 groups of interleaved 8th-order matrices in the arithmetic unit, provided by an embodiment of the application;
  • FIG. 4(b) is a schematic diagram of the data changes during the second parallel iterative operation of 4 groups of interleaved 8th-order matrices in the arithmetic unit, provided by an embodiment of the application;
  • FIG. 4(c) is a schematic diagram of the data changes during the third parallel iterative operation of 4 groups of interleaved 8th-order matrices in the arithmetic unit, provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of the storage format of the operation result of each parallel iterative operation of 4 groups of interleaved 8th-order matrices, provided by an embodiment of the application.
  • The embodiments of the present application provide a matrix inversion device based on Cholesky decomposition, which can reduce the processing delay of Cholesky-based decomposition and inversion and improve network performance.
  • In the existing approach, the inversion is performed only after all decomposition results have been obtained, so CMAC utilization is low, and the vector processor has few CMAC computing resources.
  • To address this, the embodiments of the present application provide a matrix inversion device based on Cholesky decomposition that can reduce the processing delay of Cholesky-based decomposition and inversion and improve network performance. Please refer to FIG. 1.
  • FIG. 1 is a schematic diagram of an embodiment of a matrix inversion device 10 based on Cholesky decomposition provided by an embodiment of the application.
  • The matrix inversion device 10 based on Cholesky decomposition provided by the embodiment of the present application includes a data writing control unit 101, a first data shift unit 102, a control unit 103, an arithmetic unit 104, a second data shift unit 105, a storage unit 106, and an output unit 107. The arithmetic unit 104 includes 8 single-precision complex multiply-add units (CMACs), each with a four-stage pipeline structure. The arithmetic unit 104 is connected to the control unit 103; the first data shift unit 102, the control unit 103, the second data shift unit 105, and the output unit 107 are each connected to the storage unit 106; the control unit 103 and the second data shift unit 105 are connected to each other; and the data writing control unit 101 is connected to the first data shift unit 102.
  • Because the matrix inversion device 10 includes 8 single-precision CMACs, each with a four-stage pipeline structure, it can directly support N less than or equal to 32.
  • When N is greater than 32, the matrix can be decomposed in software into dimensions of 32 or less for computation, which is not limited in the embodiments of the application.
  • The functional units included in the matrix inversion device 10 are as follows:
  • The data writing control unit 101 is used to complete the writing control of the matrix. Note that the matrix in the embodiments of the present application is a positive definite matrix of order N, where N is an integer greater than 1 and less than or equal to 32.
  • The first data shift unit 102 is configured to shift the diagonal data of the matrix to the first position of each column to obtain the first shift data.
  • The control unit 103 is used for scheduling and controlling the entire computing task.
  • The arithmetic unit 104 is configured to perform N parallel iterative operations on the matrix according to the control signal of the control unit, obtaining the operation result of each of the N parallel iterative operations. The operation result of the x-th parallel iterative operation is obtained from the component data of the x-th column of the N-th order positive definite matrix and the operation results of the previous (x-1) iterative operations.
  • The operation result of the x-th parallel iterative operation contains the column component data of the x-th column of the lower triangular matrix from the Cholesky decomposition of the matrix and the row component data of the x-th row of the inverse matrix of the matrix; x is an integer greater than 0 and less than or equal to N.
  • The second data shift unit 105 is configured to shift the operation result of each parallel iterative operation to obtain the second shift data of that operation result, which serves as input to the next parallel iterative operation.
  • The storage unit 106 is configured to store the first shift data and the second shift data of the operation result of each iterative operation.
  • The storage unit 106 serves the arithmetic unit 104 first and responds to external input only when the arithmetic unit 104 has no pending request; the storage unit 106 also applies back pressure to the preceding stage when the number of received matrices reaches its limit.
  • The storage unit is a local cache that can hold a matrix of at most 32x32x64 bits.
  • The output unit 107 is configured to output the inverse matrix of the matrix according to the second shift data of the operation result of each iterative operation stored in the storage unit 106.
  • The 4 matrices may be different 8th-order positive definite matrices.
  • The data writing control unit 101 first completes the writing control of the externally input matrices; the matrices received by the data writing control unit 101 are all transmitted row by row.
  • Matrix 1 to matrix 4 are the 4 matrices received row by row by the data writing control unit 101.
  • The writing control that the data writing control unit 101 applies to an externally input matrix may consist of conjugating the row-transmitted matrix to achieve its row-column transposition. The matrix format that the data writing control unit 101 outputs to the first data shift unit 102 is shown in FIG. 2(b).
  • In the embodiments of the present application, the first data shift unit 102 shifts the diagonal data of each column of each matrix to the first position. The resulting first shift data is stored in the storage unit 106 as input for subsequent calculations.
  • After the first data shift unit 102 receives a matrix from the data writing control unit 101, it shifts the diagonal data of the matrix to the first position of each column, which simplifies addressing in subsequent calculations.
  • The shifted data is the first shift data, which is stored in the storage unit 106.
  • FIG. 3 shows the first shift data of each matrix obtained after the first data shift unit 102 shifts matrix 1 to matrix 4.
  • The control unit 103 handles the communication among the arithmetic unit 104, the second data shift unit 105, and the storage unit 106. It performs the address calculation for data in the storage unit 106, the control of each iterative operation in the arithmetic unit 104, the data shift control of the second data shift unit 105, and the control of interleaved computation over multiple matrices.
  • The entire computing task can thus use less sequential logic and more combinational logic.
  • Each matrix corresponds to 8 parallel iterative operations.
  • FIG. 4(a)-FIG. 4(b) show the data changes during the 8 parallel iterative operations of 4 groups of interleaved 8th-order matrices in the arithmetic unit 104.
  • In the first parallel iterative operation, the control unit 103 feeds the first column of component data in the first shift data of each of matrix 1 to matrix 4 into the arithmetic unit 104, and the first parallel iterative operation result of each matrix is obtained through the operation of the 8 CMACs.
  • The first parallel iterative operation result contains the first column vector data L10-L70 of the lower triangular matrix of each matrix and the first row vector data d00 of the inverse matrix.
  • The control unit 103 then controls the second data shift unit 105 to shift the first parallel iterative operation result, obtaining the second shift data of each matrix as input to the next iterative operation; the second shift data of the first parallel iteration result is shown in FIG. 4(b).
  • FIG. 4(b) shows that, in the second parallel iterative operation, the control unit 103 feeds into the arithmetic unit 104 the second column of component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data of the first parallel iterative operation result; the second parallel iterative operation result of each matrix is obtained through the operation of the 8 CMACs.
  • The second parallel iterative operation result contains the second column vector data L21-L71 of the lower triangular matrix of each matrix and the second row vector data d10 and d11 of the inverse matrix.
  • The control unit 103 then controls the second data shift unit 105 to shift the second parallel iterative operation result, obtaining the second shift data of each matrix as input to the next iterative operation; the second shift data of the second parallel iteration result is shown in FIG. 4(c).
  • FIG. 4(c) shows that, in the third parallel iterative operation, the control unit 103 feeds into the arithmetic unit 104 the third column of component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data of the first and second parallel iterative operation results; the third parallel iterative operation result of each matrix is obtained through the operation of the 8 CMACs.
  • The third parallel iterative operation result contains the third column vector data L32-L72 of the lower triangular matrix of each matrix and the third row vector data d20, d21, and d22 of the inverse matrix.
  • The control unit 103 then controls the second data shift unit 105 to shift the third parallel iterative operation result, obtaining the second shift data of each matrix.
  • In the eighth parallel iterative operation, the control unit 103 feeds into the arithmetic unit 104 the eighth column of component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data of the previous 7 parallel iterative operation results; the eighth parallel iterative operation result of each matrix is obtained through the operation of the 8 CMACs and contains the eighth row vector data d70-d77 of the inverse matrix of each matrix.
  • The storage format of the operation result of each parallel iterative operation stored in the storage unit 106 is shown in FIG. 5.
  • The output unit 107 reads the operation result of each parallel iterative operation from the storage unit 106 and outputs it.
  • The matrix inversion device 10 in the embodiments of the present application can also perform matrix decomposition and inversion on the same principle.
  • The matrix inversion device provided by the embodiments of the present application addresses the scarcity and low utilization of internal computing resources in current vector processors, thereby reducing the processing delay of Cholesky-based decomposition and inversion and improving network performance.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

Disclosed in the present application is a Cholesky decomposition-based matrix inversion apparatus, comprising a data write control unit, a first data shift unit, a control unit, a calculation unit, a second data shift unit, a memory unit and an output unit, the calculation unit comprising eight single-precision complex multiplier-accumulators (CMAC), said CMACs having a four-stage pipeline calculation structure, the calculation unit being connected to the control unit, the first data shift unit, the control unit, the second data shift unit and the output unit each being connected to the memory unit, the control unit being connected to the second data shift unit, and the data write control unit being connected to the first data shift unit. Technical solutions of the present application, by means of using an eight-CMAC calculation unit, solve the problems of internal computing resources of current vector processors being relatively few and calculation resource utilization being relatively low, decrease processing delay for Cholesky decomposition-based inversion, and increase network functionality.

Description

A matrix inversion device based on Cholesky decomposition
This application claims priority to Chinese patent application No. 201910804096.X, filed with the Chinese Patent Office on August 28, 2019 and entitled "A matrix inversion device based on Cholesky decomposition", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of digital signal processing, and in particular to a matrix inversion device based on Cholesky decomposition.
Background
Inversion based on Cholesky decomposition is a commonly used method for inverting a positive definite matrix. The principle is as follows: for an n-th order symmetric positive definite matrix $A$ there exists a lower triangular matrix $L$ such that $A = LL^{T}$, so the inverse of $A$ is $A^{-1} = (LL^{T})^{-1} = (L^{T})^{-1}L^{-1} = (L^{-1})^{T}L^{-1}$. The common practice in the industry is to use a vector processor to implement Cholesky-based decomposition and inversion.
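The identity above can be checked numerically. The following is a minimal pure-Python sketch of the conventional two-stage approach (decompose first, then invert), restricted to real symmetric positive definite matrices for brevity; the function names and the 2x2 test matrix are illustrative assumptions and do not come from the application.

```python
import math

def cholesky(a):
    """Return lower-triangular L with a = L * L^T (a real, symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry of column j.
        l[j][j] = math.sqrt(a[j][j] - sum(l[j][k] ** 2 for k in range(j)))
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l

def invert_lower(l):
    """Return the inverse of a lower-triangular matrix, row by row."""
    n = len(l)
    inv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        inv[i][i] = 1.0 / l[i][i]
        for j in range(i):
            inv[i][j] = -sum(l[i][k] * inv[k][j] for k in range(j, i)) / l[i][i]
    return inv

def cholesky_inverse(a):
    """Compute A^-1 as (L^-1)^T * L^-1."""
    n = len(a)
    m = invert_lower(cholesky(a))
    return [[sum(m[k][i] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical example matrix; its exact inverse is [[0.375, -0.25], [-0.25, 0.5]].
A = [[4.0, 2.0], [2.0, 3.0]]
A_INV = cholesky_inverse(A)
```

For the complex matrices a CMAC-based device would handle, the transpose would presumably become a conjugate transpose; that extension is not shown here.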
Matrix inversion based on Cholesky decomposition involves a large number of iterative and interleaved operations, so its internal computational load is very high. A vector processor relies mainly on its internal vector processing unit to perform Cholesky-based decomposition and inversion. However, current vector processors contain only 16 half-precision complex multiply-add units (complex signal processor, CMAC), equivalent to only 4 single-precision CMACs. Ideally, they can therefore perform only 4 single-precision complex operations at a time, so the processing capability remains weak even if the utilization of computing resources reaches 100%. In addition, when performing Cholesky-based inversion, the vector processor first performs the Cholesky decomposition and starts the inversion only after all decomposition results have been obtained. Because of data dependencies in the computation, the decomposition needs fewer and fewer CMACs as the iterations progress, whereas the inversion needs more and more. In both the decomposition and the inversion, CMAC utilization therefore rises or falls as the iterations progress, so the average CMAC utilization is low.
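The opposing trends can be made concrete by counting how many values each iteration produces. The sketch below uses the number of output values as a rough proxy for CMAC demand, which is an assumption: the application gives no operation counts. The per-iteration counts themselves follow the 8th-order example in the embodiment (L10-L70 with d00, then L21-L71 with d10 and d11, and so on).

```python
def outputs_per_iteration(n):
    """List (iteration, decomposition_values, inversion_values) for an n-th order matrix.

    Iteration x (1-based) yields the n - x below-diagonal entries of column x of the
    lower triangular matrix, and the x entries of row x of the triangular inverse.
    """
    return [(x, n - x, x) for x in range(1, n + 1)]

TABLE = outputs_per_iteration(8)
for x, decomposition, inversion in TABLE:
    # The decomposition count shrinks, the inversion count grows, the sum stays at n.
    print(x, decomposition, inversion, decomposition + inversion)
```

Under this proxy, the two workloads always sum to n per iteration, which suggests why running them side by side could keep a fixed number of units evenly loaded.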
In summary, current vector processors have few internal computing resources and use them inefficiently, so Cholesky-based decomposition and inversion has a long processing delay, which can easily become the bottleneck of the link and degrade network performance.
Summary
The embodiments of the present application provide a matrix inversion device based on Cholesky decomposition, which can reduce the processing delay of Cholesky-based decomposition and inversion and improve network performance.
This application provides a matrix inversion device based on Cholesky decomposition, which includes a data writing control unit, a first data shift unit, a control unit, an arithmetic unit, a second data shift unit, a storage unit, and an output unit. The arithmetic unit includes 8 single-precision complex multiply-add units (CMACs), each with a four-stage pipeline structure. The arithmetic unit is connected to the control unit; the first data shift unit, the control unit, the second data shift unit, and the output unit are each connected to the storage unit; the control unit and the second data shift unit are connected to each other; and the data writing control unit is connected to the first data shift unit.
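The connections listed above can be written down as a small graph, which makes it easy to see that the storage unit is the hub. This is only a bookkeeping sketch: the identifier names are invented for illustration, and the "operations in flight" figure is a derived upper bound (units times pipeline stages), not a number stated in the application.

```python
# Undirected links between functional units, as enumerated in the paragraph above.
LINKS = [
    ("arithmetic_unit", "control_unit"),
    ("first_data_shift_unit", "storage_unit"),
    ("control_unit", "storage_unit"),
    ("second_data_shift_unit", "storage_unit"),
    ("output_unit", "storage_unit"),
    ("control_unit", "second_data_shift_unit"),
    ("data_write_control_unit", "first_data_shift_unit"),
]

def neighbours(unit):
    """Return the sorted list of units directly connected to `unit`."""
    linked = {b for a, b in LINKS if a == unit} | {a for a, b in LINKS if b == unit}
    return sorted(linked)

CMAC_COUNT = 8        # single-precision CMACs in the arithmetic unit
PIPELINE_STAGES = 4   # pipeline depth of each CMAC
# Derived upper bound on operations that can be in flight at once.
MAX_IN_FLIGHT = CMAC_COUNT * PIPELINE_STAGES

STORAGE_NEIGHBOURS = neighbours("storage_unit")
```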
In the matrix inversion device based on Cholesky decomposition, the data writing control unit is used to complete the writing control of the matrix. The matrix is a positive definite matrix of order N, where N is an integer greater than 1 and less than or equal to 32.
In the matrix inversion device based on Cholesky decomposition, the first data shift unit is used to shift the diagonal data of the matrix to the first position of each column to obtain the first shift data.
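One way to picture this shift is as a per-column rotation. The sketch below assumes a cyclic upward rotation of column j by j positions, so that the diagonal entry lands first and the below-diagonal entries follow it; the application text does not state what happens to the entries above the diagonal, so the wrap-around is purely an assumption, as are the function name and the labelled 3x3 example.

```python
def shift_diagonal_to_front(matrix):
    """Return the columns of `matrix`, each rotated so its diagonal entry comes first.

    Assumption: the shift is a cyclic rotation (entries above the diagonal wrap
    around to the end of the column).
    """
    n = len(matrix)
    columns = [[matrix[i][j] for i in range(n)] for j in range(n)]
    return [col[j:] + col[:j] for j, col in enumerate(columns)]

# Entries are labelled a<row><col> so the movement is easy to follow.
LABELLED = [["a00", "a01", "a02"],
            ["a10", "a11", "a12"],
            ["a20", "a21", "a22"]]
SHIFTED = shift_diagonal_to_front(LABELLED)
```

After the shift, every column starts with its diagonal entry, so a reader of the data can always find the diagonal at offset 0, which is consistent with the stated goal of simplifying addressing.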
In the matrix inversion device based on Cholesky decomposition, the control unit is used for communication and control among the storage unit, the second data shift unit, and the arithmetic unit.
In the matrix inversion device based on Cholesky decomposition, the arithmetic unit is used to perform N parallel iterative operations on the matrix according to the control information of the control unit, obtaining the operation result of each of the N parallel iterative operations. The operation result of the x-th parallel iterative operation is obtained from the component data of the x-th column of the N-th order positive definite matrix and the operation results of the previous (x-1) iterative operations. It contains the column component data of the x-th column of the lower triangular matrix from the Cholesky decomposition of the matrix and the row component data of the x-th row of the inverse matrix of the matrix. The column component data of the x-th column does not include the diagonal data of the lower triangular matrix; x is an integer greater than 0 and less than or equal to N.
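A software sketch can show how one iteration might yield both outputs at once. It rests on two assumptions not confirmed by the text: that the "row of the inverse matrix" produced in iteration x is row x of $L^{-1}$ (with the diagonal of $L$ carried as its reciprocal), and that the final inverse is formed afterwards as $(L^{-1})^{T}L^{-1}$. It is real-valued and sequential, so it illustrates the data dependencies only, not the hardware scheduling; all names are hypothetical.

```python
import math

def fused_cholesky_inverse(a):
    """Compute L, M = L^-1 and A^-1 = M^T * M, producing one column of L and one
    row of M per iteration (a must be real, symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    m = [[0.0] * n for _ in range(n)]
    for x in range(n):
        # Diagonal of L for this iteration, from column x of A and earlier columns of L.
        d = math.sqrt(a[x][x] - sum(l[x][k] ** 2 for k in range(x)))
        l[x][x] = d
        # Decomposition half: below-diagonal entries of column x of L.
        for i in range(x + 1, n):
            l[i][x] = (a[i][x] - sum(l[i][k] * l[x][k] for k in range(x))) / d
        # Inversion half: row x of M, using only rows of M from earlier iterations.
        m[x][x] = 1.0 / d
        for j in range(x):
            m[x][j] = -sum(l[x][k] * m[k][j] for k in range(j, x)) / d
    inverse = [[sum(m[k][i] * m[k][j] for k in range(n)) for j in range(n)]
               for i in range(n)]
    return l, m, inverse

# Hypothetical 3x3 example whose Cholesky factor is [[2,0,0],[1,2,0],[1,1,2]].
A3 = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L3, M3, INV3 = fused_cholesky_inverse(A3)
```

Row x of M needs only row x of L, whose entries were produced as parts of columns 0 to x-1 in earlier iterations, which is why the two halves can advance in lockstep.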
In the matrix inversion device based on Cholesky decomposition, the second data shift unit is used to shift the operation result of each parallel iterative operation to obtain the second shift data of that operation result; the second shift data serves as input to the next parallel iterative operation.
In the matrix inversion device based on Cholesky decomposition, the storage unit is used to store the first shift data and the second shift data of the operation result of each iterative operation. The storage unit is a local cache that can hold a matrix of at most 32x32x64 bits.
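The stated cache capacity works out as follows. The split of each 64-bit element into a 32-bit real part and a 32-bit imaginary part is an assumption based on the single-precision complex CMACs; the application itself only gives the 32x32x64-bit figure.

```python
MAX_ORDER = 32          # largest supported matrix order N
BITS_PER_ELEMENT = 64   # assumed: 32-bit real part + 32-bit imaginary part

TOTAL_BITS = MAX_ORDER * MAX_ORDER * BITS_PER_ELEMENT
TOTAL_BYTES = TOTAL_BITS // 8

print(TOTAL_BITS, "bits =", TOTAL_BYTES, "bytes =", TOTAL_BYTES // 1024, "KiB")
```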
In the matrix inversion device based on Cholesky decomposition, the output unit is used to output the inverse matrix according to the second shift data of the operation result of each iterative operation stored in the storage unit.
For a positive definite matrix with N=8, the matrix inversion device can support the interleaved operation of 4 matrices at the same time. For a positive definite matrix with N=4, it can support the interleaved operation of 8 matrices at the same time. For a matrix with N=16, it can support the interleaved operation of 2 matrices at the same time. For a positive definite matrix with N=32, no matrix interleaving is required. When N is not an integer multiple of 8, N can be rounded up, and the interleaving is then the same as for a positive definite matrix whose order is an integer multiple of 8.
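The four stated cases fit a simple rule: the number of interleaved matrices times the order is 32, which equals 8 CMACs times 4 pipeline stages. The helper below encodes that rule as a sketch. The formula `32 // padded` is an assumption fitted to the stated cases, and the treatment of other orders (rounding up to the next multiple of 8) is one reading of the rounding sentence; neither is spelled out in the application.

```python
SLOTS = 8 * 4  # 8 CMACs x 4 pipeline stages

def interleaved_matrices(n):
    """Return how many N-th order matrices could be interleaved (hypothetical rule)."""
    if not 1 < n <= 32:
        raise ValueError("N must be greater than 1 and at most 32")
    if n == 4:
        return 8  # stated explicitly in the text
    padded = -(-n // 8) * 8  # assumed: round N up to the next multiple of 8
    return SLOTS // padded

for order in (4, 8, 16, 32):
    print(order, interleaved_matrices(order))
```

For example, under this reading an order-12 matrix would be padded to 16 and interleaved two at a time.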
The embodiments of this application provide a matrix inversion device based on Cholesky decomposition that addresses the scarcity and low utilization of computing resources inside current vector processors, thereby reducing the processing delay of Cholesky-based decomposition and inversion and improving network performance.
Description of the drawings
FIG. 1 is a schematic diagram of an embodiment of the matrix inversion device based on Cholesky decomposition provided by an embodiment of this application;
FIG. 2(a) is a schematic diagram of the matrix format received by the data writing control unit provided by an embodiment of this application;
FIG. 2(b) is a schematic diagram of the matrix format output by the data writing control unit provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the first shift data obtained after the first data shift unit shifts the matrix, provided by an embodiment of this application;
FIG. 4(a) is a schematic diagram of the data changes during the first parallel iterative operation of 4 interleaved 8-order matrices in the arithmetic unit, provided by an embodiment of this application;
FIG. 4(b) is a schematic diagram of the data changes during the second parallel iterative operation of 4 interleaved 8-order matrices in the arithmetic unit, provided by an embodiment of this application;
FIG. 4(c) is a schematic diagram of the data changes during the third parallel iterative operation of 4 interleaved 8-order matrices in the arithmetic unit, provided by an embodiment of this application;
FIG. 5 is a schematic diagram of the storage format of the operation result of each parallel iterative operation of 4 interleaved 8-order matrices, provided by an embodiment of this application.
Detailed description
The embodiments of this application provide a matrix inversion device based on Cholesky decomposition that can reduce the processing delay of Cholesky-based decomposition and inversion and improve network performance.
To help those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative work shall fall within the scope of protection of this application.
The terms "first", "second", and the like in the specification, the claims, and the above drawings of this application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than the one illustrated or described. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units that are explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
When solving a matrix equation, the inverse of a matrix is needed. For example, to solve $AX = B$ for $X$, $A^{-1}$ is computed first, and $X$ is then obtained from $X = A^{-1}B$.
The Cholesky decomposition algorithm is a very common matrix decomposition method. Its basic principle is as follows: for an n-order symmetric positive definite matrix $A$, there exists a lower triangular matrix $L$ such that $A = L L^{T}$, where the diagonal entries of $L$ are all positive real numbers and $L^{T}$ denotes the conjugate transpose of the lower triangular matrix $L$:
Figure PCTCN2020086987-appb-000001
The basic formula of the Cholesky decomposition algorithm is:
Figure PCTCN2020086987-appb-000002
where $j = 1, 2, \dots, n$, and the initial values of $l_{jj}$ and $l_{ij}$ are:
Figure PCTCN2020086987-appb-000003
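The equation images referenced above did not survive extraction. For orientation only, the textbook form of the Cholesky recurrences for a Hermitian positive definite matrix is given below, using the symbols $a_{ij}$ and $l_{ij}$ from the surrounding text. This is the standard formulation and is an assumption about what the figures show, not a transcription of them.

```latex
% Standard Cholesky recurrences (assumed form; not copied from the patent figures)
\begin{aligned}
l_{jj} &= \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk}\,\overline{l_{jk}}}, \\
l_{ij} &= \frac{1}{l_{jj}} \Bigl( a_{ij} - \sum_{k=1}^{j-1} l_{ik}\,\overline{l_{jk}} \Bigr),
\qquad i = j+1, \dots, n, \\
% First column (j = 1), where both sums are empty:
l_{11} &= \sqrt{a_{11}}, \qquad l_{i1} = \frac{a_{i1}}{l_{11}}, \qquad i = 2, \dots, n.
\end{aligned}
```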
Finally, the inverse matrix of $A$ is obtained from $A^{-1} = (L L^{T})^{-1} = (L^{T})^{-1} L^{-1} = (L^{-1})^{T} L^{-1}$.
A vector processor implements matrix inversion based on Cholesky decomposition in two main steps. First, the decomposition is performed, that is, the lower triangular matrix $L$ is obtained from $A = L L^{T}$. Because of data dependencies, the vector processor must compute the Cholesky decomposition of $A$ column by column: the first column is computed first, the second column is computed only after the first is finished, and the computation of each column depends on the results of all previous columns. After the decomposition is complete and the lower triangular matrix of $A$ has been obtained, the inversion is performed, that is, $A^{-1}$ is computed from $A^{-1} = (L^{-1})^{T} L^{-1}$. Because of data dependencies, the inversion based on the decomposition result must also proceed row by row, with each row computed only after the previous row is finished.
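The two-step flow described above (decompose column by column, then invert row by row) can be sketched in plain Python. This is a minimal illustration and not the vector processor's implementation. The function names are hypothetical, and the input is assumed to be a Hermitian positive definite matrix given as a list of rows.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L with A = L * L^H, computed column by column."""
    n = len(a)
    l = [[0j] * n for _ in range(n)]
    for j in range(n):
        # The diagonal entry of column j needs row j of all earlier columns.
        s = a[j][j].real - sum(abs(l[j][k]) ** 2 for k in range(j))
        l[j][j] = complex(math.sqrt(s))
        for i in range(j + 1, n):
            acc = a[i][j] - sum(l[i][k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = acc / l[j][j]
    return l


def invert_lower(l):
    """Return M = L^{-1}, computed row by row by forward substitution."""
    n = len(l)
    m = [[0j] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1 / l[i][i]
        for j in range(i):
            # Row i of M needs every earlier row of M.
            acc = sum(l[i][k] * m[k][j] for k in range(j, i))
            m[i][j] = -acc / l[i][i]
    return m


def inverse_from_cholesky(a):
    """Return A^{-1} = (L^{-1})^H * L^{-1}."""
    n = len(a)
    m = invert_lower(cholesky_lower(a))
    return [[sum(m[k][i].conjugate() * m[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]
```

Note that the second step cannot start until the first has finished, which is the source of the low CMAC utilization the text goes on to describe.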
Therefore, in view of the situation in which a current vector processor performing decomposition and inversion starts the inversion only after all decomposition results have been obtained, so that CMAC utilization is low and the vector processor has few CMAC computing resources, the embodiments of this application provide a matrix inversion device based on Cholesky decomposition that can reduce the processing delay of Cholesky-based decomposition and inversion and improve network performance. See FIG. 1.
FIG. 1 is a schematic diagram of an embodiment of the matrix inversion device 10 based on Cholesky decomposition provided by an embodiment of this application.
Referring to FIG. 1, the matrix inversion device 10 based on Cholesky decomposition provided by an embodiment of this application includes a data writing control unit 101, a first data shift unit 102, a control unit 103, an arithmetic unit 104, a second data shift unit 105, a storage unit 106, and an output unit 107. The device is characterized in that the arithmetic unit 104 includes 8 single-precision complex multiply-accumulate units (CMACs), each CMAC having a four-stage pipeline structure. The arithmetic unit 104 is connected to the control unit 103. The first data shift unit 102, the control unit 103, the second data shift unit 105, and the output unit 107 are each connected to the storage unit 106. The control unit 103 and the second data shift unit 105 are connected to each other, and the data writing control unit 101 is connected to the first data shift unit 102.
It should be noted that the matrix inversion device 10 based on Cholesky decomposition provided by an embodiment of this application includes 8 single-precision CMACs, each with a four-stage pipeline structure, and can directly support decomposition and inversion of N-order positive definite matrices with N less than or equal to 32. When N is greater than 32, the matrix can be split by software into blocks of dimension 32 or less for computation, which is not limited in the embodiments of this application.
Because the matrix inversion device 10 provided by an embodiment of this application includes 8 single-precision CMACs, each with a four-stage pipeline structure, the device can interleave the computation of 4 matrices at the same time for positive definite matrices with N=8. For positive definite matrices with N=4, it can interleave 8 matrices at the same time. For matrices with N=16, it can interleave 2 matrices at the same time. For a positive definite matrix with N=32, no interleaving is needed. In all of these cases the CMACs can be fully (100%) utilized. It should be noted that when N is not an integer multiple of 8, N can be rounded up, and the interleaving is the same as for a positive definite matrix whose order is an integer multiple of 8.
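The interleaving counts listed above are consistent with treating the 8 CMACs and their 4 pipeline stages as 8 x 4 = 32 slots shared among the matrices. The sketch below illustrates that reading only. The names `padded_order` and `interleave_count` are hypothetical, and rounding N up to 4, 8, 16, or 32 is an assumption about what the text means by rounding up.

```python
CMAC_COUNT = 8       # single-precision complex multiply-accumulate units (from the text)
PIPELINE_STAGES = 4  # pipeline depth of each CMAC (from the text)
SLOTS = CMAC_COUNT * PIPELINE_STAGES  # 32; viewing these as shared slots is an assumption


def padded_order(n):
    """Round the matrix order up to an assumed supported size (4, 8, 16, or 32)."""
    for size in (4, 8, 16, 32):
        if n <= size:
            return size
    raise ValueError("orders above 32 must be split in software")


def interleave_count(n):
    """Number of order-n matrices that can be interleaved at the same time."""
    return SLOTS // padded_order(n)


for order in (4, 8, 16, 32):
    print(order, interleave_count(order))
```

Under this reading, an order such as N=12 would be handled like N=16, giving 2 interleaved matrices.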
Specifically, in the embodiments of this application, the functional units included in the matrix inversion device 10 operate as follows:
The data writing control unit 101 is configured to perform write control of the matrix. It should be noted that the matrix in the embodiments of this application is an N-order positive definite matrix, where N is an integer greater than 1 and less than or equal to 32.
The first data shift unit 102 is configured to shift the diagonal data of the matrix to the first position of each column, to obtain the first shift data.
The control unit 103 is configured to schedule and control the entire computing task.
The arithmetic unit 104 is configured to perform N parallel iterative operations on the matrix according to a control signal from the control unit, to obtain the operation result of each of the N parallel iterative operations. The operation result of the x-th parallel iterative operation is obtained from the x-th column component data of the N-order positive definite matrix and the operation results of the previous (x-1) iterative operations. The operation result of the x-th parallel iterative operation includes the column component data of the x-th column of the lower triangular matrix obtained by Cholesky decomposition of the matrix, and the row component data of the x-th row of the inverse matrix of the matrix, where x is an integer greater than 0 and less than or equal to N.
The second data shift unit 105 is configured to perform a data shift on the operation result of each parallel iterative operation, to obtain the second shift data of the operation result of each iterative operation, which is used as an input to the next parallel iterative operation.
The storage unit 106 is configured to store the first shift data and the second shift data of the operation result of each iterative operation.
It should be noted that, in the embodiments of this application, the storage unit 106 gives priority to requests from the arithmetic unit 104 and responds to external input only when the arithmetic unit 104 has no request. The storage unit 106 can also limit the number of matrices it accepts, applying back pressure to the preceding stage when its internal buffer is full. The storage unit is a local buffer that can hold a matrix of at most 32x32x64 bits. Optionally, the bandwidth of the storage unit 106 is 64*8 = 512 bits, which can be implemented as 4 banks, each 128 bits wide and 128 entries deep.
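The storage figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that 64 bits is the size of one single-precision complex element (32-bit real part plus 32-bit imaginary part) and that the factor 8 corresponds to the 8 CMACs. Neither reading is stated explicitly in the text, and the variable names are illustrative.

```python
ELEMENT_BITS = 64  # assumed: one single-precision complex value (32-bit real + 32-bit imaginary)
MAX_ORDER = 32     # largest directly supported matrix order (from the text)
LANES = 8          # assumed to match the 8 CMACs

# Capacity needed for one 32x32 matrix of 64-bit elements.
capacity_bits = MAX_ORDER * MAX_ORDER * ELEMENT_BITS

# The optional organisation from the text: 128 bits wide, 128 entries deep, 4 banks.
bank_width_bits, bank_depth, bank_count = 128, 128, 4
banked_bits = bank_width_bits * bank_depth * bank_count

# Bandwidth: eight 64-bit elements per access, i.e. all four banks read together.
bandwidth_bits = ELEMENT_BITS * LANES

print(capacity_bits, banked_bits, bandwidth_bits)
```

Both the capacity (65536 bits) and the 512-bit access width match the 4-bank organisation, so the two descriptions in the text agree with each other.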
The output unit 107 is configured to output the inverse matrix of the matrix according to the second shift data of the operation result of each iterative operation stored in the storage unit 106.
Optionally, the functions of the unit modules in the matrix inversion device 10 are described in detail below, taking N=8 with 4 interleaved matrices as an example.
In the embodiments of this application, when N=8 and the computation is performed with 4 interleaved matrices, the 4 matrices may be different 8-order positive definite matrices. The data writing control unit 101 first performs write control of the externally input matrices, all of which it receives row by row. As shown in FIG. 2(a), matrix 1 to matrix 4 are the 4 matrices received row by row by the data writing control unit 101. Specifically, the write control performed by the data writing control unit 101 on an externally input matrix may consist of conjugating the matrix received row by row, which achieves the row-column transposition of the matrix. The matrix format output by the data writing control unit 101 to the first data shift unit 102 is shown in FIG. 2(b).
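Conjugation can stand in for transposition here because, for a Hermitian matrix, the conjugate equals the transpose, so conjugating a row received from the input yields the corresponding column. The short sketch below illustrates this under the assumption that the input matrices are Hermitian (the complex counterpart of the symmetric positive definite matrices mentioned earlier). The helper names and the 3x3 example are hypothetical.

```python
def conjugate_rows(rows):
    """Conjugate every element of a matrix that arrives row by row."""
    return [[complex(v).conjugate() for v in row] for row in rows]


def column(matrix, j):
    """Extract column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


# A small Hermitian example: a[i][j] is the conjugate of a[j][i].
a = [[4 + 0j, 1 + 2j, 0 - 1j],
     [1 - 2j, 5 + 0j, 2 + 1j],
     [0 + 1j, 2 - 1j, 6 + 0j]]

received = conjugate_rows(a)

# Row j of the conjugated matrix is exactly column j of the original.
rows_match_columns = all(received[j] == column(a, j) for j in range(len(a)))
print(rows_match_columns)
```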
In the embodiments of this application, according to the Cholesky decomposition formulas, the diagonal data $d_{ii}$ of the inverse matrix must first be obtained from the diagonal data of the matrix before the remaining decomposition and inversion of each column can proceed, so the first operation on each column always involves the diagonal data. To avoid the addressing complexity of locating the diagonal data during subsequent operations, the first data shift unit 102 shifts the diagonal data of each column of each matrix to the first position. The resulting first shift data is stored in the storage unit 106 as input for subsequent computation.
In the embodiments of this application, after receiving the matrix transmitted by the data writing control unit 101, the first data shift unit 102 shifts the diagonal data of the matrix to the first position of each column, which simplifies addressing in subsequent computation. The shifted data is the first shift data, which is stored in the storage unit 106. For example, FIG. 3 shows the first shift data of each of matrix 1 to matrix 4 after shifting by the first data shift unit 102.
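One way to picture the first data shift is as a rotation of each column so that its diagonal element lands in the first position. The sketch below assumes a cyclic rotation of column j by j positions. The exact layout in FIG. 3 is not recoverable from the text, so the rotation rule and the function name are hypothetical.

```python
def shift_diagonal_first(matrix):
    """Return the columns of `matrix`, each rotated so its diagonal entry comes first.

    Assumption: a cyclic rotation is used; the text only states that the
    diagonal data is moved to the first position of each column.
    """
    n = len(matrix)
    shifted_columns = []
    for j in range(n):
        col = [matrix[i][j] for i in range(n)]
        shifted_columns.append(col[j:] + col[:j])
    return shifted_columns


example = [[11, 12, 13],
           [21, 22, 23],
           [31, 32, 33]]
first_shift_data = shift_diagonal_first(example)
print(first_shift_data)
```

With this layout the diagonal element of any column is always at offset 0, so later iterations do not need a per-column address calculation to find it.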
In the embodiments of this application, the control unit 103 handles the communication among the arithmetic unit 104, the second data shift unit 105, and the storage unit 106. It performs address calculation for the data in the storage unit 106, iteration control of each iterative operation in the arithmetic unit 104, data shift control of the second data shift unit 105, and control of the interleaved computation of multiple matrices. The overall computing task may involve relatively little sequential logic and relatively more combinational logic.
For example, in the embodiments of this application, when N=8 and the computation is performed with 4 interleaved matrices, each matrix undergoes 8 parallel iterative operations. FIG. 4(a) and FIG. 4(b) show how the data of the 4 interleaved 8-order matrices change in the arithmetic unit 104 during the 8 parallel iterative operations.
As shown in FIG. 4(a), during the first parallel iterative operation, the control unit 103 controls the input to the arithmetic unit 104 of the first-column component data in the first shift data of each of matrix 1 to matrix 4. The 8 CMACs compute the first parallel iterative operation result of each matrix, which includes the first-column vector data L10-L70 of the lower triangular matrix of that matrix and the first-row vector data d00 of the inverse matrix.
After the first parallel iterative operation result is obtained, the control unit 103 controls the second data shift unit 105 to shift the data of that result, to obtain the second shift data of each matrix as an input to the next iterative operation, shown in FIG. 4(b) as the second shift data of the first parallel iteration result.
FIG. 4(b) shows that, during the second parallel iterative operation, the control unit 103 controls the input to the arithmetic unit 104 of the second-column component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data of the first parallel iterative operation result. The 8 CMACs compute the second parallel iterative operation result of each matrix, which includes the second-column vector data L21-L71 of the lower triangular matrix of that matrix and the second-row vector data d10 and d11 of the inverse matrix.
Correspondingly, after the second parallel iterative operation result is obtained, the control unit 103 controls the second data shift unit 105 to shift the data of that result, to obtain the second shift data of each matrix as an input to the next iterative operation, shown in FIG. 4(c) as the second shift data of the second parallel iteration result.
FIG. 4(c) shows that, during the third parallel iterative operation, the control unit 103 controls the input to the arithmetic unit 104 of the third-column component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data of the first and second parallel iterative operation results. The 8 CMACs compute the third parallel iterative operation result of each matrix, which includes the third-column vector data L32-L72 of the lower triangular matrix of that matrix and the third-row vector data d20, d21, and d22 of the inverse matrix.
Correspondingly, after the third parallel iterative operation result is obtained, the control unit 103 controls the second data shift unit 105 to shift the data of that result, to obtain the second shift data of each matrix as an input to the next iterative operation, shown in FIG. 4(c) as the second shift data of the third parallel iteration result.
The remaining iterations proceed in the same way. In the eighth parallel iterative operation, the control unit 103 controls the input to the arithmetic unit 104 of the eighth-column component data in the first shift data of each of matrix 1 to matrix 4, together with the second shift data of the previous 7 parallel iterative operation results. The 8 CMACs compute the eighth parallel iterative operation result of each matrix, which includes the eighth-row vector data d70-d77 of the inverse matrix of that matrix.
After the eight parallel iterative operations are complete, the operation results of each parallel iterative operation are stored in the storage unit 106 in the format shown in FIG. 5. Finally, the output unit 107 reads the operation result of each parallel iterative operation from the storage unit 106 and outputs it.
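The iteration pattern walked through above, in which iteration x yields column x of the lower triangular matrix (without its diagonal) together with the x-th row of d values, can be sketched for a single matrix as follows. This is an interpretation and not the device's datapath: it assumes the d values are the entries of $L^{-1}$, with $d_{xx} = 1/l_{xx}$ stored in place of the diagonal of $L$, and that the final inverse is formed as $(L^{-1})^{T} L^{-1}$. Interleaving, pipelining, and the data shifts are omitted, and the function name is hypothetical.

```python
import math


def fused_cholesky_inverse(a):
    """Run n fused iterations on a Hermitian positive definite matrix `a`.

    Returns (l, d, inv): sub-diagonal entries of L, rows of L^{-1},
    and the assembled inverse of `a`.
    """
    n = len(a)
    l = [[0j] * n for _ in range(n)]  # sub-diagonal entries of L only
    d = [[0j] * n for _ in range(n)]  # d[x][x] = 1 / l_xx, d[x][j] = (L^{-1})[x][j]
    for x in range(n):
        # Diagonal step: uses row x of the columns produced by earlier iterations.
        s = complex(a[x][x]).real - sum(abs(l[x][k]) ** 2 for k in range(x))
        d[x][x] = complex(1 / math.sqrt(s))
        # Column x of L below the diagonal (a multiply by d_xx replaces a divide).
        for i in range(x + 1, n):
            acc = a[i][x] - sum(l[i][k] * l[x][k].conjugate() for k in range(x))
            l[i][x] = acc * d[x][x]
        # Row x of L^{-1}, which needs only results of iterations 0..x.
        for j in range(x):
            acc = sum(l[x][k] * d[k][j] for k in range(j, x))
            d[x][j] = -acc * d[x][x]
    # Assemble A^{-1} = (L^{-1})^H * L^{-1} from the stored rows.
    inv = [[sum(d[k][i].conjugate() * d[k][j] for k in range(max(i, j), n))
            for j in range(n)] for i in range(n)]
    return l, d, inv
```

Because each row of d depends only on results available by the end of the same iteration, the inversion work can be overlapped with the decomposition instead of waiting for all of L.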
The above uses N=8 with 4 interleaved matrices as an example to describe the functions of the unit modules in the matrix inversion device 10 in detail. It should be understood that, for positive definite matrices with N not equal to 8, the matrix inversion device 10 can perform decomposition and inversion on the same principle. For example, operation with N=4 and 8 interleaved matrices, with N=16 and 2 interleaved matrices, with N=32 and no interleaving, and with N not an integer multiple of 8, where N is rounded up and the interleaving is the same as for a positive definite matrix whose order is an integer multiple of 8, all fall within the scope of protection of this application.
The matrix inversion device provided by the embodiments of this application addresses the scarcity and low utilization of computing resources inside current vector processors, thereby reducing the processing delay of Cholesky-based decomposition and inversion and improving network performance.
It can be understood that the various numerals and letters used in the embodiments of this application are only for convenience of description and are not intended to limit the scope of the embodiments. The sequence numbers of the above processes do not imply an order of execution; the order in which the processes are executed should be determined by their functions and internal logic.
The matrix inversion device based on Cholesky decomposition provided by the embodiments of this application has been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (8)

  1. A matrix inversion device based on Cholesky decomposition, comprising a data writing control unit, a first data shift unit, a control unit, an arithmetic unit, a second data shift unit, a storage unit, and an output unit, characterized in that the arithmetic unit comprises 8 single-precision complex multiply-accumulate units (CMACs), each CMAC having a four-stage pipeline structure; the arithmetic unit is connected to the control unit; the first data shift unit, the control unit, the second data shift unit, and the output unit are each connected to the storage unit; the control unit and the second data shift unit are connected to each other; and the data writing control unit is connected to the first data shift unit.
  2. The device according to claim 1, characterized in that the data writing control unit is configured to perform write control of a matrix, the matrix being an N-order positive definite matrix, where N is an integer greater than 1 and less than or equal to 32.
  3. The device according to claim 2, characterized in that the first data shift unit is configured to shift the diagonal data of the matrix to the first position of each column, to obtain first shift data.
  4. The device according to claim 2 or 3, characterized in that the control unit is configured for communication and control among the storage unit, the second data shift unit, and the arithmetic unit.
  5. The device according to claim 4, characterized in that the arithmetic unit is configured to perform N parallel iterative operations on the matrix according to control information from the control unit, to obtain the operation result of each of the N parallel iterative operations, wherein the operation result of the x-th parallel iterative operation is obtained from the x-th column component data of the N-order positive definite matrix and the operation results of the previous (x-1) iterative operations; the operation result of the x-th parallel iterative operation includes the column component data of the x-th column of the lower triangular matrix obtained by Cholesky decomposition of the matrix and the row component data of the x-th row of the inverse matrix of the matrix; the column component data of the x-th column does not include the diagonal data of the lower triangular matrix; and x is an integer greater than 0 and less than or equal to N.
  6. The device according to claim 5, characterized in that the second data shift unit is configured to perform a data shift on the operation result of each parallel iterative operation, to obtain second shift data of the operation result of each iterative operation, the second shift data being used as an input to the next parallel iterative operation.
  7. The device according to claim 6, characterized in that the storage unit is configured to store the first shift data and the second shift data of the operation result of each iterative operation.
  8. The device according to claim 7, characterized in that the output unit is configured to output the inverse matrix according to the second shift data of the operation result of each iterative operation stored in the storage unit.
PCT/CN2020/086987 2019-08-28 2020-04-26 Cholesky decomposition-based matrix inversion apparatus WO2021036313A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910804096.XA CN112445752B (en) 2019-08-28 2019-08-28 Matrix inversion device based on Cholesky decomposition
CN201910804096.X 2019-08-28

Publications (1)

Publication Number Publication Date
WO2021036313A1 (en) 2021-03-04

Family

ID=74685533

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/086987 WO2021036313A1 (en) 2019-08-28 2020-04-26 Cholesky decomposition-based matrix inversion apparatus

Country Status (2)

Country Link
CN (1) CN112445752B (en)
WO (1) WO2021036313A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1783060A (en) * 2004-11-26 2006-06-07 北京天碁科技有限公司 Cholesky decomposition algorithm device
US8396914B1 (en) * 2009-09-11 2013-03-12 Altera Corporation Matrix decomposition in an integrated circuit device
US8775496B1 (en) * 2011-07-29 2014-07-08 Xilinx, Inc. Circuits and methods for calculating a cholesky decomposition of a matrix
CN105701068A (en) * 2016-02-19 2016-06-22 南京大学 Cholesky matrix inversion system based on time division multiplexing technology
CN108733627A (en) * 2018-04-30 2018-11-02 南京大学 A kind of FPGA implementation method that positive definite matrix Cholesky is decomposed
CN109635241A (en) * 2018-12-17 2019-04-16 西南电子技术研究所(中国电子科技集团公司第十研究所) Method for solving the inverse matrix of a symmetric or Hermitian symmetric positive definite matrix

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101825998B (en) * 2010-01-22 2012-09-05 龙芯中科技术有限公司 Processing method for vector complex multiplication operation and corresponding device
CN103927290A (en) * 2014-04-18 2014-07-16 南京大学 Inverse operation method for lower triangle complex matrix with any order
CN105426345A (en) * 2015-12-25 2016-03-23 南京大学 Matrix inverse operation method

Also Published As

Publication number Publication date
CN112445752B (en) 2024-01-05
CN112445752A (en) 2021-03-05

Similar Documents

Publication Publication Date Title
US10140251B2 (en) Processor and method for executing matrix multiplication operation on processor
CN108133270B (en) Convolutional neural network acceleration method and device
CN108897716B (en) Data processing device and method for reducing calculation amount through memory read-write operation
JP6256348B2 (en) Fast Fourier transform circuit, fast Fourier transform processing method, and fast Fourier transform processing program
CN112929300B (en) Data processing device, method, base station and storage medium
Bachtiar et al. Convolutional neural network and maxpooling architecture on Zynq SoC FPGA
US10031846B1 (en) Transposition of two-dimensional arrays using single-buffering
WO2021036313A1 (en) Cholesky decomposition-based matrix inversion apparatus
CN111221501B (en) Number theory conversion circuit for large number multiplication
US9268744B2 (en) Parallel bit reversal devices and methods
CN109669666B (en) Multiply-accumulate processor
CN109460535B (en) Finite field matrix inversion device and inversion method based on cloud
CN108228138B (en) Method for rapid modular multiplication of special domain in SIDH
CN111356151B (en) Data processing method and device and computer readable storage medium
WO2022252876A1 (en) A hardware architecture for memory organization for fully homomorphic encryption
EP2735963B1 (en) Galois field inversion device
Ma et al. Accelerating SVD computation on FPGAs for DSP systems
US9582473B1 (en) Instruction set to enable efficient implementation of fixed point fast fourier transform (FFT) algorithms
US10366741B2 (en) Bit processing
CN111507178B (en) Data processing optimization method and device, storage medium and computer equipment
US10102892B1 (en) RAM-based shift register with embedded addressing
CN105608054A (en) FFT/IFFT device and method based on LTE system
KR100668674B1 (en) Apparatus and method for fast fourier transform
TWI281619B (en) Data processing structure and method for fast Fourier transformation/inverse fast Fourier transformation
KR100667188B1 (en) Apparatus and method for fast fourier transform

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20858390

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20858390

Country of ref document: EP

Kind code of ref document: A1