CN105846873A

CN105846873A - Triangular systolic array structure QR decomposition device based on advanced iteration and decomposition method thereof

Info

Publication number: CN105846873A
Application number: CN201610173392.0A
Authority: CN
Inventors: 邢座程; 刘苍; 原略超; 唐川; 张洋; 王庆林; 王�锋; 汤先拓; 危乐; 吕朝; 董永旺
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2016-03-24
Filing date: 2016-03-24
Publication date: 2016-08-10
Anticipated expiration: 2036-03-24
Also published as: CN105846873B

Abstract

A triangular systolic array QR decomposition device based on advanced iteration, and a decomposition method using it, for the QR decomposition of an n×n matrix A. The device comprises diagonal processing modules, iterative processing modules and triangular processing modules. The first diagonal processing module receives the first column vector a_1 of A from outside; its results q_1 and r_11 are outputs of the QR decomposition module, q_1 is also passed to the triangular processing modules of the next step, and the r_jj^2 signal it generates (r_11^2 in this first step) is sent to all iterative processing modules of the first step. The (j-1)-th iterative processing module takes as input the j-th column vector a_j of A received from outside, the first column vector a_1 of A, and the r_jj^2 output by the first diagonal processing module, and produces the j-th column vector a_j^(1) of the next iteration matrix A^(1). The remaining steps proceed in the same way, and the final output r_{n-1,n} of the QR decomposition module is produced by a triangular processing module. The decomposition method is carried out on this device. The invention is simple in principle, decomposes quickly and is efficient.

Description

Triangular systolic array QR decomposition device and decomposition method based on advanced iteration

Technical field

The invention relates mainly to baseband signal processing in wireless communication systems, and in particular to a triangular systolic array QR decomposition device and decomposition method based on advanced iteration.

Background art

Orthogonal frequency division multiplexing (OFDM) and multiple-input multiple-output (MIMO) technologies have attracted wide attention because of their high spectral efficiency and high transmission rates. In recent years, a series of advances in precoding research has made it possible for multi-user wireless communication systems based on MIMO-OFDM to serve several users simultaneously. However, the computational complexity of the baseband signal processing algorithms in such systems is greatly increased, which poses unprecedented challenges for the design of baseband signal processors.

In the baseband signal processing chain of a MIMO-OFDM wireless communication system, the precoding algorithm and the MIMO detection algorithm are two of the more complex baseband algorithms, and both have received wide attention from researchers in recent years. The dirty paper coding algorithm proposed by Costa in his classic 1983 paper "Writing on dirty paper" is regarded as the best-performing nonlinear precoding algorithm, but its computational complexity is so high that it is almost impossible to execute in real time on a hardware circuit. In 2005, Wei Yu et al., in their paper "Trellis and Convolutional Precoding for Transmitter-Based Interference Presubtraction", applied the THP (Tomlinson-Harashima Precoding) algorithm to nonlinear precoding and achieved good interference cancellation. Although its performance is lower than that of dirty paper coding, its computational complexity is greatly reduced, which makes a hardware implementation of nonlinear precoding feasible. The most computationally expensive part of the THP algorithm is the QR decomposition of the channel matrix H, so an efficient and fast QR decomposition unit helps to improve the overall performance of THP precoding.

The maximum-likelihood algorithm has the highest detection accuracy of all MIMO detection algorithms, but its computational complexity is very high. M. Shabany et al., in "A 0.13μm CMOS 655Mb/s 4×4 64-QAM K-Best MIMO detector", therefore used the sphere detection (SD) algorithm, an approximation of the maximum-likelihood algorithm, for MIMO detection and obtained very good detection results. QR decomposition is one of the bottlenecks of the SD algorithm and limits its execution speed.

Because QR decomposition is widely used in multi-user baseband signal processors based on MIMO-OFDM, and is in many cases the bottleneck that limits processing speed, many baseband signal processor designs optimize QR decomposition as an important computing unit. QR decomposition factors an n×n matrix A into an n×n unitary matrix Q and an n×n upper triangular matrix R. Current QR decomposition algorithms fall into three main classes, based respectively on the Householder transformation, Givens rotations, and the modified Gram-Schmidt (MGS) algorithm. QR decomposition based on the Householder transformation is difficult to implement in hardware and is therefore rarely used. QR decomposition based on Givens rotations greatly reduces the hardware resources required, but its execution time is long and does not meet the real-time requirements of communication systems. QR decomposition based on the MGS algorithm occupies few hardware resources and has a short execution time, and so meets the practical needs of communication systems.
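As a point of reference for the MGS-based approach, the following is a minimal software sketch of modified Gram-Schmidt QR decomposition. It assumes a real-valued, full-rank square matrix stored as a list of rows; the function name `mgs_qr` and the choice to return Q as a list of column vectors are illustrative and do not come from the patent text.

```python
import math


def mgs_qr(A):
    """Modified Gram-Schmidt QR of a real, full-rank square matrix (list of rows).

    Returns (Q, R) where Q[j] is the j-th column of Q and R is a list of rows.
    """
    n = len(A)
    # cols[j] is the j-th column vector of A; it is updated in place below.
    cols = [[A[i][j] for i in range(n)] for j in range(n)]
    Q = [None] * n
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry and normalised column.
        R[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        Q[j] = [x / R[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            # Off-diagonal entry, then remove the q_j component from column k.
            R[j][k] = sum(q * a for q, a in zip(Q[j], cols[k]))
            cols[k] = [a - R[j][k] * q for a, q in zip(cols[k], Q[j])]
    return Q, R
```

In this sequential form the update of column k waits for q_j, which in turn waits for the square root and the divisions. That dependency is what the advanced-iteration structure described later is meant to shorten.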

R.-H. Chang et al., in the article "Iterative QR decomposition architecture using the modified Gram-Schmidt algorithm for MIMO systems", proposed a triangular systolic array QR decomposition hardware circuit based on the MGS algorithm. To complete the QR decomposition of a square matrix of order n (n a positive integer, n ≥ 2), that circuit needs 2n-1 time units. As a concrete case, when the circuit of R.-H. Chang et al. is used to decompose a 4×4 matrix A, the MGS-based iterative structure completes the QR decomposition in seven steps, each taking one time unit, for a total of seven time units. Thus, although the triangular systolic array QR decomposition method of R.-H. Chang et al. greatly reduces computation time, baseband signal processing in practical communication systems calls for an even faster QR decomposition structure. Moreover, the existing literature covers only the QR decomposition of 4×4 matrices; no hardware circuit for the QR decomposition of a general n×n matrix has been published.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the present invention provides a triangular systolic array QR decomposition device and decomposition method based on advanced iteration that is simple in principle, easy to implement, fast and efficient.

To solve the above technical problem, the present invention adopts the following technical solution:

A triangular systolic array QR decomposition device based on advanced iteration, used to perform QR decomposition of an n×n matrix A, comprises diagonal processing modules, iterative processing modules and triangular processing modules. There are n diagonal processing modules and (n-1)+(n-2)+...+1 = n×(n-1)/2 iterative processing modules. When n is even, n/2+(n-2)+(n-4)+(n-6)+...+2 = n²/4 triangular processing modules are used; when n is odd, (n-1)+(n-3)+(n-5)+...+2 = (n+1)(n-1)/4 triangular processing modules are used. The first diagonal processing module receives the first column vector a_1 of the matrix A from outside. Its results q_1 and r_11 are outputs of the whole QR decomposition module, q_1 is also output to the triangular processing modules of the next step, and the r_jj^2 signal produced during the computation (r_11^2 in this first step) is output to all iterative processing modules of the first step. The (j-1)-th iterative processing module takes as input the j-th column vector a_j of A received from outside (where j is at least 2 and at most n-1), the first column vector a_1 of A, and the r_jj^2 output by the first diagonal processing module, and computes the j-th column vector a_j^(1) of the next iteration matrix A^(1). Of these, a_2^(1) is the input of the second diagonal processing module, and the remaining column vectors of A^(1) are inputs both of the iterative processing modules of the second step and of the triangular processing modules of the third step. The remaining steps proceed in the same way, and the final output signal r_{n-1,n} of the QR decomposition module is produced by a triangular processing module.

As a further improvement of the decomposition device of the present invention: the i-th diagonal processing module operates on the a_i^(i-1) signal output by step i-1 to obtain the outputs r_ii and q_i of the QR decomposition module, and also computes r_ii^2. The vector q_i is an input of the triangular processing modules of the next step, and r_ii^2 is an input of all iterative processing modules of step i. Each iterative processing module receives the column vector signals a_i^(i-1) and a_i1^(i-1) from step i-1 and the r_ii^2 signal from the diagonal processing module, and produces the i1-th column vector of the next iteration matrix A^(i). Of these, a_{i+1}^(i) is the input of the (i+1)-th diagonal processing module, and the remaining column vectors of A^(i) are inputs both of the iterative modules of the next step and of the triangular processing modules of step i+2. Each triangular processing module receives the input signal q_{i-1} from step i-1 and, at the same time, the signals a_i2^(i-2) and a_{i2+1}^(i-2) from step i-2, and produces the output signals r_{i-1,i2} and r_{i-1,i2+1} of the QR decomposition module;

The n-th diagonal processing module processes the a_n^(n-1) signal output by step n-1 to obtain the output signals r_nn and q_n of the QR decomposition module. The k4-th triangular processing module receives the signal q_{n-1} from step n-1 and the signal a_n^(n-2) from step n-2, and produces the output signal r_{n-1,n} of the QR decomposition module.

As a further improvement of the decomposition device of the present invention: the diagonal processing module comprises multipliers, an adder, a square-root unit and dividers. Multiplier e (1 ≤ e ≤ n) receives the e-th element of the input vector a_j from outside, multiplies it by itself and outputs the product to the adder. The adder receives the signals from multipliers 1 to n and sums them; the sum is output to the square-root unit and is also the module output signal r_jj^2. The square-root unit takes the square root of the signal received from the adder; the result is output to dividers 1 to n as their divisor and is also the module output signal r_jj. Divider e1 (1 ≤ e1 ≤ n) takes the e1-th element of the input vector a_j, received from outside, as its dividend and the signal received from the square-root unit as its divisor; its result is the e1-th element of the module output vector q_j2.

As a further improvement of the decomposition device of the present invention: the iterative processing module comprises first shared hardware, which contains a multiplexer and multipliers 1 to n. The multiplexer selects which input is supplied to multipliers 1 to n as one of their operands: it takes the external a_j3^(p) vector and the output signal of the divider as inputs, makes the selection, and outputs the result to multipliers 1 to n. When the enable signal is '0', multiplier e2 (1 ≤ e2 ≤ n) takes the signal received from the multiplexer as one operand and the e2-th element of the externally received a_j^(p) vector as the other, multiplies them, and outputs the product to the adder module. The adder module sums the signals received from multipliers 1 to n and outputs the sum to the divider module. The divider takes the signal received from the adder module as its dividend and the externally received signal r_jj^2 as its divisor, and outputs the quotient to the input of multiplexer 1. When the enable signal is '1', multiplier e2 outputs its result to subtractor e3 (1 ≤ e3 ≤ n). Subtractor e3 takes the signal received from multiplier e2 as the subtrahend and the e3-th element of the externally received a_j3^(p) signal as the minuend; the difference is the e3-th element of the module output vector a_j3^(p+1).

As a further improvement of the decomposition device of the present invention: the triangular processing module comprises second shared hardware. The inputs of multiplexer 1 are the n elements of the a_j3 vector and the n elements of the a_{j3+1} vector. When the multiplexer enable signal is '0', multiplexer 1 passes the elements of the a_j3 vector to multipliers 1 to n; when the enable signal is '1', multiplexer 1 passes the elements of the a_{j3+1} vector to multipliers 1 to n. Multiplier e4 takes the data received from the multiplexer as one operand and the e4-th element of the externally received q_j2 vector as the other, multiplies them, and outputs the product to the adder. The adder accumulates the signals received from the multipliers. When the multiplexer enable signal is '0', the accumulated output is the triangular processing module output signal r_{j2,j3}; when the enable signal is '1', it is the output signal r_{j2,j3+1}.

A QR decomposition method based on the above decomposition device comprises the following steps:

Step S1: the n column vectors a_1, ..., a_n of the matrix A are the input signals of the QR decomposition module. a_1 is the input of the first diagonal processing module, whose outputs are r_11 and q_1. The iterative processing modules compute the next iteration matrix: their inputs are a_1 and a_j, where j is a positive integer with 1 < j < n+1, and their outputs are the columns a_j^(1) of the next iteration matrix;

Steps S2 to S(n-1) (step Sj, where 2 ≤ j < n): the signals a_j^(j-2), ..., a_n^(j-2) that were input to step j-1, together with the signals q_{j-1}, a_j^(j-1), ..., a_n^(j-1) output by step j-1, are the input signals of step j. Of these, a_j^(j-1) is the input of the diagonal processing module and is used to compute r_jj and q_j. The k3-th triangular processing module takes q_{j-1}, a_j3^(j-2) and a_{j3+1}^(j-2) as its input signals and computes r_{j-1,j3} and r_{j-1,j3+1}; when n-j is odd, j3 is a positive integer with j ≤ j3 ≤ n-1, and when n-j is even, j3 is a positive integer with j ≤ j3 ≤ n. As in the first step, the iterative processing modules compute the next iteration matrix; their inputs are a_j^(j-1), ..., a_n^(j-1) and their outputs are a_{j+1}^(j), ..., a_n^(j);

Step Sn: the input a_n^(n-2) of step n-1 and the outputs q_{n-1} and a_n^(n-1) of step n-1 are taken as inputs. a_n^(n-1) is the input of block1 (the diagonal processing module), whose outputs are r_{n,n} and q_n; q_{n-1} and a_n^(n-2) are the inputs of block3 (the triangular processing module), whose output is r_{n-1,n}.
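Steps S1 to Sn can be traced in software with the sketch below. It is only a functional model under stated assumptions: real-valued, full-rank input, one loop pass per time unit, and no modelling of the actual parallel hardware or of the two-way time sharing inside the modules. The names `lookahead_qr`, `prev` and `cur` are illustrative. The point it shows is that the column update uses the unnormalised column and r_ss^2, so it does not wait for q_s, while row s-1 of R is filled in one step later from q_{s-1} and the older iterate.

```python
import math


def lookahead_qr(A):
    """Functional model of steps S1..Sn for a real, full-rank matrix (list of rows).

    Returns (Q, R) where Q[j] is the j-th column of Q and R is a list of rows.
    """
    n = len(A)
    prev = None                                            # columns of the iterate one step older
    cur = [[A[i][j] for i in range(n)] for j in range(n)]  # columns of the current iterate
    Q = [None] * n
    R = [[0.0] * n for _ in range(n)]
    for s in range(n):  # 0-based index; corresponds to step S(s+1) in the text
        # Diagonal module: r_ss^2, r_ss and q_s from column s of the current iterate.
        r_sq = sum(x * x for x in cur[s])
        R[s][s] = math.sqrt(r_sq)
        Q[s] = [x / R[s][s] for x in cur[s]]
        # Triangular modules: row s-1 of R from q_{s-1} and the older iterate.
        if s > 0:
            for k in range(s, n):
                R[s - 1][k] = sum(q * a for q, a in zip(Q[s - 1], prev[k]))
        # Iterative modules: next iterate from the unnormalised column and r_ss^2.
        nxt = [col[:] for col in cur]
        for k in range(s + 1, n):
            coeff = sum(x * y for x, y in zip(cur[s], cur[k])) / r_sq
            nxt[k] = [y - coeff * x for x, y in zip(cur[s], cur[k])]
        prev, cur = cur, nxt
    return Q, R
```

The loop runs exactly n times, matching the n time units claimed for the method, and the last pass produces r_nn, q_n and r_{n-1,n} as in step Sn.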

Compared with the prior art, the advantages of the present invention are as follows. The triangular systolic array QR decomposition device and decomposition method based on advanced iteration is simple in principle, easy to implement, and significantly speeds up QR decomposition. For the QR decomposition of an n×n matrix, the proposed structure needs only n time units, whereas the triangular systolic array structure proposed by R.-H. Chang et al. needs 2n-1 time units. For the 4×4 matrix A mentioned above, for example, QR decomposition with the present invention takes only 4 time units instead of 7, a saving of 3 time units.
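The latency comparison is simple arithmetic, and a short sketch makes the saving explicit for any n. The function name `time_units` is an illustrative choice; the two formulas are the ones stated in the text.

```python
def time_units(n):
    """Return (prior-art time units, proposed time units, saving) for an n x n matrix."""
    prior = 2 * n - 1   # triangular systolic array of R.-H. Chang et al.
    proposed = n        # structure based on advanced iteration
    return prior, proposed, prior - proposed


for n in (2, 4, 8):
    print(n, time_units(n))
```

The saving is n-1 time units, so the proposed structure approaches half the prior-art latency as n grows.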

Description of the drawings

Fig. 1 is a schematic diagram of the topology of the decomposition device of the present invention.

Fig. 2 is a schematic diagram of the structure of the diagonal processing module in a specific application example of the present invention.

Fig. 3 is a schematic diagram of the structure of the iterative processing module in a specific application example of the present invention.

Fig. 4 is a schematic diagram of the structure of the triangular processing module in a specific application example of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the triangular systolic array QR decomposition device based on advanced iteration of the present invention is used to perform QR decomposition of an n×n matrix A. It comprises diagonal processing modules, iterative processing modules and triangular processing modules: n diagonal processing modules, (n-1)+(n-2)+...+1 = n×(n-1)/2 iterative processing modules, and, when n is even, n/2+(n-2)+(n-4)+(n-6)+...+2 = n²/4 triangular processing modules or, when n is odd, (n-1)+(n-3)+(n-5)+...+2 = (n+1)(n-1)/4 triangular processing modules.
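The module counts above can be checked with a few lines of code. In this sketch `module_counts` evaluates the closed forms given in the text, while `triangular_by_pairing` encodes one assumed reading of where the triangular count comes from: row m of R has n-m off-diagonal entries and each triangular module produces two of them by time sharing. Both function names are illustrative.

```python
def module_counts(n):
    """Closed-form module counts stated in the text for an n x n matrix.

    Returns (diagonal, iterative, triangular).
    """
    diagonal = n
    iterative = n * (n - 1) // 2
    if n % 2 == 0:
        triangular = n * n // 4
    else:
        triangular = (n + 1) * (n - 1) // 4
    return diagonal, iterative, triangular


def triangular_by_pairing(n):
    """Assumed reading: row m of R needs ceil((n - m) / 2) two-output modules."""
    return sum((n - m + 1) // 2 for m in range(1, n))


print(module_counts(4), triangular_by_pairing(4))
```

For the 4×4 case discussed in the background section this gives 4 diagonal, 6 iterative and 4 triangular modules.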

The first diagonal processing module receives the first column vector a_1 of the matrix A from outside. Its results q_1 and r_11 are outputs of the whole QR decomposition module, q_1 is also output to the triangular processing modules of the next step, and the r_jj^2 signal produced during the computation (r_11^2 in this first step) is output to all iterative processing modules of the first step. The (j-1)-th iterative processing module (j at least 2 and at most n-1) takes as input the j-th column vector a_j of A received from outside, the first column vector a_1 of A, and the r_jj^2 output by the first diagonal processing module, and computes the j-th column vector a_j^(1) of the next iteration matrix A^(1). Of these, a_2^(1) is the input of the second diagonal processing module, and the remaining column vectors of A^(1) are inputs both of the iterative processing modules of the second step and of the triangular processing modules of the third step. By the time the iterative processing modules need the r_jj^2 signal, the first diagonal processing module has already computed it, so the diagonal and iterative processing modules can execute in parallel.

The i-th diagonal processing module operates on the a_i^(i-1) signal output by step i-1 to obtain the outputs r_ii and q_i of the QR decomposition module, and also computes r_ii^2. The vector q_i is an input of the triangular processing modules of the next step, and r_ii^2 is an input of all iterative processing modules of step i. Each iterative processing module receives the column vector signals a_i^(i-1) and a_i1^(i-1) from step i-1 and the r_ii^2 signal from the diagonal processing module, and produces the i1-th column vector of the next iteration matrix A^(i). Of these, a_{i+1}^(i) is the input of the (i+1)-th diagonal processing module, and the remaining column vectors of A^(i) are inputs both of the iterative modules of the next step and of the triangular processing modules of step i+2. Each triangular processing module receives the input signal q_{i-1} from step i-1 and, at the same time, the signals a_i2^(i-2) and a_{i2+1}^(i-2) from step i-2, and produces the output signals r_{i-1,i2} and r_{i-1,i2+1} of the QR decomposition module.

As shown in Fig. 2, in a specific application example the diagonal processing module comprises multipliers, an adder, a square-root unit and dividers. Multiplier e (1 ≤ e ≤ n) receives the e-th element of the input vector a_j from outside, multiplies it by itself and outputs the product to the adder. The adder receives the signals from multipliers 1 to n and sums them; the sum is output to the square-root unit and is also the module output signal r_jj^2. The square-root unit takes the square root of the signal received from the adder; the result is output to dividers 1 to n as their divisor and is also the module output signal r_jj. Divider e1 (1 ≤ e1 ≤ n) takes the e1-th element of the input vector a_j, received from outside, as its dividend and the signal received from the square-root unit as its divisor; its result is the e1-th element of the module output vector q_j2.

The diagonal processing module described above computes the j2-th column vector q_j2 of the matrix Q, the diagonal element r_jj of the matrix R, and the square r_jj^2 of that diagonal element, which is used as an input of the iterative processing modules. Because the diagonal processing module outputs r_jj^2 at the same moment that the iterative processing modules need it, the two kinds of module can execute in parallel, which speeds up the QR decomposition.
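The data flow of the diagonal processing module can be sketched in a few lines. This is a behavioural model only, assuming a real-valued, nonzero input vector; the function name `diagonal_module` is illustrative. It returns r_jj^2 first because that value is available before the square root and the divisions are performed.

```python
import math


def diagonal_module(a):
    """Behavioural model of the diagonal processing module for a real vector a.

    Returns (r_sq, r, q): the squared norm, the norm, and the normalised vector.
    """
    squares = [x * x for x in a]   # multipliers 1..n: self-multiplication
    r_sq = sum(squares)            # adder: r_jj^2, forwarded to the iterative modules
    r = math.sqrt(r_sq)            # square-root unit: r_jj
    q = [x / r for x in a]         # dividers 1..n: elements of the q vector
    return r_sq, r, q


r_sq, r, q = diagonal_module([3.0, 4.0])
print(r_sq, r, q)
```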

As shown in Fig. 3, in a specific application example the iterative processing module comprises first shared hardware, which contains a multiplexer and multipliers 1 to n. The multiplexer selects which input is supplied to multipliers 1 to n as one of their operands: it takes the external a_j3^(p) vector and the output signal of the divider as inputs, makes the selection, and outputs the result to multipliers 1 to n. When the enable signal is '0', multiplier e2 (1 ≤ e2 ≤ n) takes the signal received from the multiplexer as one operand and the e2-th element of the externally received a_j^(p) vector as the other, multiplies them, and outputs the product to the adder module. The adder module sums the signals received from multipliers 1 to n and outputs the sum to the divider module. The divider takes the signal received from the adder module as its dividend and the externally received signal r_jj^2 as its divisor, and outputs the quotient to the input of multiplexer 1. When the enable signal is '1', multiplier e2 outputs its result to subtractor e3 (1 ≤ e3 ≤ n). Subtractor e3 takes the signal received from multiplier e2 as the subtrahend and the e3-th element of the externally received a_j3^(p) signal as the minuend; the difference is the e3-th element of the module output vector a_j3^(p+1).

The iterative processing module described above computes the j3-th column of the next iteration matrix. The two places in the figure that use the first shared hardware module are independent of each other, so hardware resources can be saved by time-sharing that hardware.

As shown in Figure 4, in a specific application example, the triangular processing module includes second shared hardware. The inputs of multiplexer 1 are the n elements of vector a_j3 and the n elements of vector a_{j3+1}. When the multiplexer enable signal is '0', multiplexer 1 passes the elements of a_j3 to multipliers 1 to n; when the enable signal is '1', multiplexer 1 passes the elements of a_{j3+1} to multipliers 1 to n. Multiplier e4 takes the data received from the multiplexer as one multiplicand and the e4-th element of the externally received vector q_j2 as the other, multiplies them, and outputs the product to the adder, which accumulates the signals received from the multipliers. When the multiplexer enable signal is '0', the accumulator output is the triangular processing module's output signal r_{j2,j3}; when the enable signal is '1', the accumulator output is the output signal r_{j2,j3+1}.
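
A behavioural sketch of the time-shared triangular processing module is given below. It is illustrative only: the name `triangular_module` and the list-based vectors are assumptions, and the enable signal is modelled as a loop variable that selects which column reaches the shared multipliers.

```python
def triangular_module(q_j2, a_j3, a_j3_plus_1):
    """Behavioural model of one triangular processing module (hypothetical helper).

    Returns (r_{j2,j3}, r_{j2,j3+1}), computed one after the other on the
    same multipliers and adder.
    """
    outputs = []
    for enable, column in ((0, a_j3), (1, a_j3_plus_1)):
        # The multiplexer selects the column according to the enable signal;
        # multiplier e4 multiplies its e4-th element by the e4-th element of q_j2.
        products = [column[e4] * q_j2[e4] for e4 in range(len(q_j2))]
        # The adder accumulates the n products into one element of R.
        outputs.append(sum(products))
    return tuple(outputs)


q = [0.6, 0.8]
r_first, r_second = triangular_module(q, [1.0, 2.0], [5.0, 0.0])
```

Each output is the inner product of q_j2 with one column, so one set of n multipliers serves two elements of R.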

The triangular processing module described above computes the elements of matrix R at coordinates [j2, j3] and [j2, j3+1]. Comparing Figure 4 with Figures 2 and 3 shows that computing the element at [j2, j3] takes less than 50% of the execution time of the second basic module and of the first basic module. The present invention therefore time-multiplexes the hardware that computes the element at [j2, j3], which saves hardware resources.

The present invention further provides a decomposition method based on the above decomposition device. Performing QR decomposition on an n×n matrix A with the circuit of the above device takes n steps in total, as follows:

Step S1: The n column vectors a₁, ……, a_n of matrix A are the input signals of the QR decomposition module. a₁ is the input of the first diagonal processing module, whose outputs are r₁₁ and q₁. The iterative processing modules compute the next iteration matrix: their inputs are a₁ and a_j (1 < j < n+1, j a positive integer), and their outputs are the columns a_j¹ of the next iteration matrix. The values of the output signals in the first step are given by formula (1);

$\begin{matrix} r_{11} = \sqrt{(a_{11})^{2} + (a_{21})^{2} + \cdots + (a_{n1})^{2}} \\ q_{11} = \frac{a_{11}}{r_{11}}, \; \ldots, \; q_{n1} = \frac{a_{n1}}{r_{11}} \\ a_{1}^{1} = 0 \\ a_{2}^{1} = a_{2} - r_{12} q_{1} = a_{2} - q_{1}^{T} a_{2} q_{1} = a_{2} - \frac{a_{1}^{T} a_{2} a_{1}}{r_{11}^{2}} \\ \cdots \\ a_{n}^{1} = a_{n} - \frac{a_{1}^{T} a_{n} a_{1}}{r_{11}^{2}} \end{matrix} \qquad (1)$

Step S1 shows the biggest difference between the present invention and the traditional QR decomposition method: the present invention computes the next iteration matrix one step ahead. Traditional QR decomposition computes the next iteration matrix in the second step because that computation uses the outputs of the first step. By improving on the traditional method, the present invention computes the next iteration matrix from the inputs of the first step, which greatly increases the speed of QR decomposition.
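
The equivalence that makes the one-step-ahead computation possible can be checked numerically. The sketch below compares the traditional update, which needs the first-step outputs q₁ and r₁₂, with the update that uses only the first-step input a₁ and r₁₁²; the function names `update_traditional` and `update_lookahead` and the sample vectors are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def update_traditional(a1, ak):
    """a_k - r_1k * q_1, which needs the outputs q_1 and r_1k of step 1."""
    r11 = math.sqrt(dot(a1, a1))
    q1 = [x / r11 for x in a1]
    r1k = dot(q1, ak)
    return [x - r1k * q for x, q in zip(ak, q1)]


def update_lookahead(a1, ak):
    """a_k - (a_1^T a_k) a_1 / r_11^2, which needs only the input a_1."""
    r11_sq = dot(a1, a1)  # no square root and no normalised q_1 required
    coeff = dot(a1, ak) / r11_sq
    return [x - coeff * y for x, y in zip(ak, a1)]


a1 = [1.0, 2.0, 2.0]
a2 = [4.0, 0.0, 1.0]
old = update_traditional(a1, a2)
new = update_lookahead(a1, a2)
```

Both functions return the same vector up to rounding, so the update does not have to wait for the square root and the divisions that produce q₁.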

Steps S2 to Sj: j is greater than or equal to 2 and less than n. The signals a_j^{j-2}, ……, a_n^{j-2} input to step j-1 and the signals q_{j-1}, a_j^{j-1}, ……, a_n^{j-1} output by step j-1 serve as the input signals of the second step. a_j^{j-1} is the input of the diagonal processing module and is used to compute r_jj and q_j. The input signals of the k3-th diagonal processing module are q_{j-1}, a_j3^{j-2} and a_{j3+1}^{j-2} (when n-j is odd, j3 is a positive integer with j ≤ j3 ≤ n-1; when n-j is even, j3 is a positive integer with j ≤ j3 ≤ n), which are used to compute r_{j-1,j3} and r_{j-1,j3+1}. As in the first step, the iterative processing modules compute the next iteration matrix: their inputs are a_j^{j-1}, ……, a_n^{j-1} and their outputs are a_{j+1}^j, ……, a_n^j. The outputs in step j are given by formula (2);

$\begin{matrix} r_{j,j} = \sqrt{(a_{1,j}^{j-1})^{2} + (a_{2,j}^{j-1})^{2} + \cdots + (a_{n,j}^{j-1})^{2}} \\ q_{1,j} = \frac{a_{1,j}^{j-1}}{r_{j,j}}, \; \ldots, \; q_{n,j} = \frac{a_{n,j}^{j-1}}{r_{j,j}} \\ r_{j-1,j3} = q_{j-1}^{T} a_{j3}^{j-2} = q_{1,j-1} a_{1,j3}^{j-2} + \cdots + q_{n,j-1} a_{n,j3}^{j-2}, \quad j \leq j3 \leq n; \; j3 \in \mathbb{N} \\ a_{i4}^{j} = a_{i4}^{j-1} - \frac{(a_{j}^{j-1})^{T} a_{i4}^{j-1} a_{j}^{j-1}}{r_{jj}^{2}}, \quad j \leq i4 \leq n; \; i4 \in \mathbb{N} \end{matrix} \qquad (2)$

Step Sn: The input a_n^{n-2} of step n-1 and the outputs q_{n-1} and a_n^{n-1} of step n-1 are taken as inputs. a_n^{n-1} is the input of block1, whose outputs are r_{n,n} and q_n; q_{n-1} and a_n^{n-2} are the inputs of block3, whose output is r_{n-1,n}. The outputs in step n are given by formula (3);

$\begin{matrix} r_{n,n} = \sqrt{(a_{1,n}^{n-1})^{2} + (a_{2,n}^{n-1})^{2} + \cdots + (a_{n,n}^{n-1})^{2}} \\ q_{1,n} = \frac{a_{1,n}^{n-1}}{r_{n,n}}, \; \ldots, \; q_{n,n} = \frac{a_{n,n}^{n-1}}{r_{n,n}} \\ r_{n-1,n} = q_{n-1}^{T} a_{n}^{n-2} = q_{1,n-1} a_{1,n}^{n-2} + \cdots + q_{n,n-1} a_{n,n}^{n-2} \end{matrix} \qquad (3)$
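
Steps S1 to Sn can be put together in a short software model. The sketch below is a sequential approximation under stated assumptions, not the hardware itself: the function name `lookahead_qr`, the column-list representation of A and the sample matrix are hypothetical, the matrix is assumed to have full rank so that no r_jj is zero, and work that the three kinds of modules would do in parallel within one step is written as consecutive statements inside one loop iteration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def lookahead_qr(columns):
    """QR decomposition in n steps; `columns` holds the n column vectors of A.

    Returns (Q, R): Q as a list of column vectors, R as an n x n nested list.
    """
    n = len(columns)
    R = [[0.0] * n for _ in range(n)]
    Q = []
    prev = None                       # columns that entered the previous step
    cur = [list(c) for c in columns]  # columns that enter the current step
    for j in range(n):                # 0-based index of the current step
        a_j = cur[j]
        # Diagonal processing module: r_jj^2, r_jj and q_j.
        r_sq = dot(a_j, a_j)
        R[j][j] = math.sqrt(r_sq)
        Q.append([x / R[j][j] for x in a_j])
        # Triangular processing modules: row j-1 of R, from the previous
        # step's q and the columns that entered the previous step.
        if j > 0:
            for k in range(j, n):
                R[j - 1][k] = dot(Q[j - 1], prev[k])
        # Iterative processing modules: next iteration matrix from this
        # step's inputs and r_jj^2 only (q_j is not needed).
        nxt = [list(c) for c in cur]
        for k in range(j + 1, n):
            coeff = dot(a_j, cur[k]) / r_sq
            nxt[k] = [x - coeff * y for x, y in zip(cur[k], a_j)]
        prev, cur = cur, nxt
    return Q, R


A_cols = [[1.0, 2.0, 2.0], [4.0, 0.0, 1.0], [0.0, 1.0, 5.0]]
Q, R = lookahead_qr(A_cols)
```

The loop body runs exactly n times, and every element of R is produced either by the diagonal module of its own step or by a triangular module one step later, which matches the n-step schedule described above.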

It follows that QR decomposition of an n×n matrix takes only n time units with the structure proposed by the present invention, whereas the triangular systolic array structure proposed by R.-H. Chang et al. requires 2n-1 time units. For the aforementioned 4×4 matrix A, for example, QR decomposition with the present invention takes only 4 time units, 3 fewer than the 7 otherwise required. The triangular systolic array structure QR decomposition based on advanced iteration proposed by the present invention can therefore significantly speed up QR decomposition.
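
As a worked check of these counts, taking the time unit as used in the text and assuming (an assumption of this note, not a statement of the source) that one time unit has the same duration in both structures, the saving and the speedup ratio are:

$\begin{matrix} (2n - 1) - n = n - 1 \\ \frac{2n - 1}{n} = 2 - \frac{1}{n} \end{matrix}$

For n = 4 this gives 7 - 4 = 3 time units saved and a ratio of 7/4 = 1.75; under the same assumption the ratio approaches 2 as n grows.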

The above are only preferred embodiments of the present invention. The protection scope of the present invention is not limited to the above embodiments; all technical solutions that follow the idea of the present invention fall within its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A triangular systolic array structure QR decomposition device based on advanced iteration, used for QR decomposition of an n×n matrix A, characterized in that it comprises diagonal processing modules, iterative processing modules and triangular processing modules; wherein there are n diagonal processing modules and (n-1)+(n-2)+...+1 = n×(n-1)/2 iterative processing modules; when n is even, n/2+(n-2)+(n-4)+(n-6)+...+2 = n²/4 triangular processing modules are used; when n is odd, (n-1)+(n-3)+(n-5)+...+2 = (n+1)(n-1)/4 triangular processing modules are used; the first diagonal processing module receives the first column vector a₁ of matrix A from outside, its calculation results q₁ and r₁₁ serve as outputs of the whole QR decomposition module, q₁ is output to the triangular processing module of the next step, and the r_jj² signal generated during the calculation is output to all iterative processing modules in the first step; the (j-1)-th iterative processing module takes as inputs the j-th column vector a_j of matrix A received from outside, where j is greater than or equal to 2 and less than or equal to n-1, the first column vector a₁ of matrix A, and the r_jj² output by the first diagonal processing module, and calculates the j-th column vector a_j¹ of the next iteration matrix A¹, wherein a₁¹ serves as the input of the second diagonal processing module, and the remaining column vectors of A¹ serve as inputs of the second-step iterative processing modules and at the same time as inputs of the third-step triangular processing modules; and so on, until the output signal r_{n-1,n} of the QR decomposition module is finally obtained after processing by the triangular processing module.

2. The triangular systolic array structure QR decomposition device based on advanced iteration according to claim 1, characterized in that the i-th diagonal processing module processes the a_i^{i-1} signal output by step i-1 to obtain the outputs r_ii and q_i of the QR decomposition module, and also calculates r_ii², wherein the vector q_i serves as the input of the triangular processing module of the next step and r_ii² serves as the input of all iterative processing modules in step i; the iterative processing module receives the column vector signals a_i^{i-1} and a_i1^{i-1} from step i-1 and the r_ii² signal from the diagonal processing module as inputs, and after processing obtains the i1-th column vector of the next iteration matrix Aⁱ, wherein a_{i+1}ⁱ serves as the input of the (i+1)-th diagonal processing module, and the remaining column vectors of Aⁱ serve as inputs of the next-step iteration modules and at the same time as inputs of the triangular processing modules of step i+2; the triangular processing module receives the input signal q_{i-1} from step i-1 and at the same time the signals a_i2^{i-2} and a_{i2+1}^{i-2} from step i-2, and after processing obtains the output signals r_{i-1,i2} and r_{i-1,i2+1} of the QR decomposition module;

the n-th diagonal processing module processes the a_n^{n-1} signal output by step n-1 to obtain the output signals r_nn and q_n of the QR decomposition module; the k4-th triangular processing module receives the signal q_{n-1} from step n-1 and the signal a_n^{n-2} from step n-2, and after processing obtains the output signal r_{n-1,n} of the QR decomposition module.

3. The triangular systolic array structure QR decomposition device based on advanced iteration according to claim 1 or 2, characterized in that the diagonal processing module comprises multipliers, an adder, a square-root operator and dividers; multiplier e receives the e-th element of the input vector a_j from outside, where e is greater than or equal to 1 and less than or equal to n, squares it, and outputs the result to the adder; the adder receives the signals from multipliers 1 to n, accumulates them, and outputs the sum to the square-root operator and at the same time as the output signal r_jj² of the whole module; the square-root operator, after receiving the signal from the adder, takes the square root and outputs the result to dividers 1 to n as their divisor and at the same time as the output signal r_jj of the whole module; divider e1 takes the e1-th element of the input vector a_j received from outside as the dividend and the signal received from the square-root operator as the divisor, where e1 is greater than or equal to 1 and less than or equal to n, and the result is the e1-th element of the output vector q_j2 of the whole module.

4. The triangular systolic array structure QR decomposition device based on advanced iteration according to claim 1 or 2, characterized in that the iterative processing module comprises first shared hardware, which contains a multiplexer and multipliers 1 to n; the multiplexer selects different inputs as a multiplicand for multipliers 1 to n: it receives the external vector a_j3^p and the output signal of the divider, selects between them, and outputs the result to multipliers 1 to n; when the enable signal is '0', multiplier e2 takes the signal received from the multiplexer as one multiplicand, where e2 is greater than or equal to 1 and less than or equal to n, and the e2-th element of the externally received vector a_j^p as the other multiplicand, multiplies them, and outputs the result to the adder module; the adder module receives the input signals from multipliers 1 to n, accumulates them, and outputs the sum to the divider module; the divider takes the signal received from the adder module as the dividend and the externally received signal r_jj² as the divisor, performs the division, and outputs the quotient to the input of multiplexer 1; when the enable signal is '1', multiplier e2 outputs its result to subtractor e3, where e3 is greater than or equal to 1 and less than or equal to n; subtractor e3 takes the signal received from multiplier e2 as the subtrahend and the e3-th element of the externally received signal a_j3^p as the minuend, and the difference is the e3-th element of the output vector a_j3^{p+1} of the whole module.

5. The triangular systolic array structure QR decomposition device based on advanced iteration according to claim 4, characterized in that the triangular processing module comprises second shared hardware; the inputs of multiplexer 1 are the n elements of vector a_j3 and the n elements of vector a_{j3+1}; when the multiplexer enable signal is '0', multiplexer 1 passes the elements of vector a_j3 to multipliers 1 to n; when the multiplexer enable signal is '1', multiplexer 1 passes the elements of vector a_{j3+1} to multipliers 1 to n; multiplier e4 takes the data received from the multiplexer as one multiplicand and the e4-th element of the externally received vector q_j2 as the other multiplicand, multiplies them, and outputs the product to the adder; the adder accumulates the signals received from the multipliers; when the multiplexer enable signal is '0', the accumulator output signal is the output signal r_{j2,j3} of the triangular processing module; when the multiplexer enable signal is '1', the accumulator output signal is the output signal r_{j2,j3+1} of the triangular processing module.

6. A QR decomposition method based on the decomposition device according to any one of claims 1 to 5, characterized in that the steps are:

Step S1: the n column vectors a₁, ……, a_n of matrix A are the input signals of the QR decomposition module; a₁ is the input of the first diagonal processing module, whose outputs are r₁₁ and q₁; the iterative processing modules compute the next iteration matrix, their inputs being a₁ and a_j, where 1 < j < n+1 and j is a positive integer, and their outputs being the columns a_j¹ of the next iteration matrix;

Steps S2 to Sj: j is greater than or equal to 2 and less than n; the signals a_j^{j-2}, ……, a_n^{j-2} input to step j-1 and the signals q_{j-1}, a_j^{j-1}, ……, a_n^{j-1} output by step j-1 serve as the input signals of the second step, wherein a_j^{j-1} is the input of the diagonal processing module and is used to calculate r_jj and q_j; the input signals of the k3-th diagonal processing module are q_{j-1}, a_j3^{j-2} and a_{j3+1}^{j-2}; when n-j is odd, j3 is a positive integer greater than or equal to j and less than or equal to n-1; when n-j is even, j3 is a positive integer greater than or equal to j and less than or equal to n; these signals are used to calculate r_{j-1,j3} and r_{j-1,j3+1}; as in the first step, the iterative processing modules compute the next iteration matrix, their inputs being a_j^{j-1}, ……, a_n^{j-1} and their outputs being a_{j+1}^j, ……, a_n^j;

Step Sn: the input a_n^{n-2} of step n-1 and the outputs q_{n-1} and a_n^{n-1} of step n-1 are taken as inputs, wherein a_n^{n-1} is the input of block1, whose outputs are r_{n,n} and q_n, and q_{n-1} and a_n^{n-2} are the inputs of block3, whose output is r_{n-1,n}.