CN111858465A - Large-scale matrix QR decomposition parallel computing structure - Google Patents

Large-scale matrix QR decomposition parallel computing structure

Info

Publication number
CN111858465A
CN111858465A (application CN202010609939.3A)
Authority
CN
China
Prior art keywords
matrix
parallel
core
decomposition
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010609939.3A
Other languages
Chinese (zh)
Other versions
CN111858465B (en)
Inventor
吴明钦
刘红伟
潘灵
贾明权
郝黎宏
林勤
张昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202010609939.3A
Publication of CN111858465A
Application granted
Publication of CN111858465B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/78 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data, for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a parallel computing structure for large-scale matrix QR decomposition, relating to the field of digital signal processing. Its aim is to provide a three-level parallel computing structure with clear parallel logic, high throughput, and low delay, realized by the following technical scheme: in a processor cluster system built from multi-core processor chips, the top-level architecture divides the matrix to be decomposed into several data slices and distributes them over the communication network interconnecting the multi-core processor nodes to the nodes of each level; the levels compute stage by stage in the order of a complete binary tree, with nodes of the same level computing in parallel. The middle-level architecture partitions the matrix into blocks and operates layer by layer along the diagonal sub-blocks. The bottom-level architecture uses the processor instruction set for multi-data parallel vector computation to complete the single-core QR decomposition and multiplication operations. The multi-core processor cluster adopts this layer-by-layer decomposition structure to realize parallel QR decomposition of a large-scale matrix.

Description

Large-scale matrix QR decomposition parallel computing structure
Technical Field
The invention relates to large-scale array antenna signal processing in the field of digital signal processing, and in particular to a QR decomposition method for high-performance parallel computation on large-scale multi-core processor clusters in the field of numerical computing.
Background
In the field of digital signal processing, algorithms such as large-scale array antenna signal processing and large-scale multiple-input multiple-output (MIMO) processing often involve covariance matrix inversion, channel matrix estimation, channel equalization, and similar problems, and QR decomposition is widely applied to all of them. MIMO has become one of the most critical technologies in many wireless communication standards, such as IEEE 802.11n and 3GPP-LTE. Designing an efficient QR decomposition operation unit therefore reduces the complexity of a MIMO system while delivering good computational performance. Covariance matrix inversion is one of the most commonly used spatial interference suppression algorithms; however, when the interference power is too large, the sampled covariance matrix becomes singular, the matrix inversion theorem fails, and interference suppression weights cannot be generated effectively. Because QR decomposition can effectively improve the conditioning of the matrix and its numerical stability, using a QR decomposition algorithm to invert the sampled covariance matrix avoids the failure to obtain effective interference suppression weights when the interference power is too strong. The computational performance of QR decomposition thus has a direct impact on the signal processing performance of an interference suppression system. As a principal tool of digital signal processing, the QR decomposition algorithm plays an important role in the field of high-performance computing and is an important index for measuring system performance. Matrix computation is one of the core problems in high-performance computing, matrix factorization is an important way to increase the parallelism of matrix computation, and QR decomposition is an important form of matrix factorization. By definition, a QR decomposition A = QR factors a matrix into an orthogonal matrix Q and a nonsingular upper triangular matrix R (that is, all elements of R below the diagonal are 0); the decomposition is unique when the diagonal elements of R are required to be positive.
There are many ways to compute a QR decomposition in practice, such as Givens rotations, the Householder transformation, and Gram-Schmidt orthogonalization, each with its own advantages and disadvantages; QR decomposition of a matrix is typically realized with one of these three methods. The Householder QR algorithm obtains a Householder transformation matrix through a reflection operation and updates the elements below the diagonal to 0 through matrix multiplication, so the process requires a large amount of matrix multiplication, which increases the complexity of the algorithm. The Givens rotation QR method instead updates the matrix through Givens rotation matrices, which can be applied row by row; although it involves division and square-root operations, its complexity is lower than that of the Householder method. Because the computational load of QR decomposition is large, the decomposition consumes a great deal of computing time and becomes a bottleneck for improving the performance of many practical applications. For example, in cognitive radio, QR decomposition is the most time-consuming computing module in the singular value decomposition (SVD); measured data show that QR decomposition accounts for over 70% of the total SVD processing time.
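For orientation only (this sketch is not part of the patent; the function name householder_qr and the test sizes are ours), a minimal NumPy implementation of Householder QR shows the matrix-multiplication-heavy trailing update the paragraph refers to:

```python
import numpy as np

def householder_qr(A):
    """Minimal dense Householder QR: returns Q (orthogonal) and R (upper
    triangular). Each step reflects column k below the diagonal to zero;
    the trailing-matrix updates are the matrix-multiplication cost the
    text attributes to the Householder method."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])  # stable reflector choice
        norm_v = np.linalg.norm(v)
        if norm_v == 0:
            continue
        v /= norm_v
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])  # trailing-matrix update
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)    # accumulate Q
    return Q, np.triu(R)

A = np.random.randn(8, 4)
Q, R = householder_qr(A)
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(8))
```

Each of the n reflections touches the whole trailing submatrix, giving the O(mn^2) multiply-add volume that motivates the blocked and parallel organizations described below.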
Large-scale matrix QR decomposition is widely applied in signal processing, image processing, computational structural mechanics, and other fields. Because the large-scale QR decomposition algorithm involves an enormous amount of computation and a very complex algorithm structure that does not lend itself to parallel decomposition, the traditional approach realizes large-scale QR decomposition on high-performance supercomputing platforms based on the x86 architecture. However, the task allocation, data synchronization, and related problems of large-scale QR decomposition on such distributed computing platforms lead to long communication times that cannot meet millisecond-level real-time processing requirements. In recent years, many small- and medium-scale matrix QR decompositions have been implemented on FPGAs, usually with a systolic-array hardware structure offering good parallelism and high real-time performance; but as the matrix scale increases sharply, FPGA-based QR decomposition is limited by chip area and power consumption, cannot meet the high-throughput requirement of large-scale QR decomposition, and has a long development cycle.
The key to the wide application of large-scale matrix QR decomposition is to reduce processing delay while increasing throughput. With the great increase in front-end sensor scale, the continuous rise of sampling rates, and the continuous growth of channel matrix dimensions, traditional QR decomposition methods can no longer satisfy the throughput and real-time processing requirements of large-scale QR decomposition. Existing parallel computing research on QR decomposition sits at two extremes. First, in the field of scientific computing, popular distributed parallel computing architectures realize ultra-large-scale matrix decomposition on Hadoop and similar distributed platforms; although this achieves large-scale throughput, it cannot meet real-time processing requirements and is unsuitable for embedded equipment with strict demands on low power consumption, flexibility, and high reliability. Second, for specific hardware structures, typified by FPGA implementations of QR decomposition, special-purpose matrix decomposition parallel processors are designed in pursuit of extremely fast processing.
For most engineering applications, a highly scalable system hardware architecture must be built flexibly and quickly from existing mature chips. Many kinds of processor chips have evolved from single-core to multi-core, and each core typically offers single-instruction multiple-data (SIMD) parallel computing and vector computing capability. At the current international state of the art, hundreds of lightweight cores can be integrated on a single chip; taking TI's TMS320C6678 DSP chip as an example, it integrates 8 high-performance DSP cores, and single-chip processing capability will improve further both through higher single-core performance and through larger core counts. Multi-core technology is therefore mature and increasingly a focus of CPU/DSP development. Technically, multi-core parallelism within a processor is an important method of realizing software parallelism, and multiple cores are an important route to higher processor performance; the multi-core structure is very suitable for parallel computing tasks. In current commercial multi-core DSPs, increasing the core count and enlarging on-chip memory are the main means of improving computing capability. The multi-core cluster system, today's mainstream large-scale parallel computing system, connects multiple multi-core processor chips through a high-speed interconnection network to form a cluster with extremely strong parallel computing capability. However, conventional QR decomposition algorithms cannot fully utilize the parallel processing capability of a multi-core cluster system and can hardly bring its performance advantage into play.
Disclosure of Invention
The invention aims to provide, for large-scale multi-core processor cluster applications that involve large-scale QR decomposition and simultaneously require real-time processing, a three-level parallel computing structure with clear parallel logic, strong scalability and portability, high throughput, low delay, and high generality. It fully exploits the parallel processing advantages of a multi-core processor cluster to realize parallelism among multi-processor nodes, parallelism among the cores of a single processor, and single-core multi-data parallelism, thereby solving the problem that traditional QR decomposition methods cannot effectively use multi-core processor cluster resources for large-scale parallel computing.
The technical scheme adopted by the invention is as follows: a large-scale matrix QR decomposition parallel computing structure comprising three levels of parallelism, namely parallelism among multi-processor nodes, multi-core parallelism within a single processor, and single-core multi-data parallelism, which together realize large-scale parallel QR decomposition of a matrix. The first-level parallel structure, with its binary-tree structural characteristic, belongs to the top-level architecture of the three-level parallel structure; the second-level parallel structure belongs to the middle-level architecture; and the third-level parallel structure belongs to the bottom-level architecture. In a processor cluster system with large-scale parallel computing capability built from multi-core processor chips, the top-level architecture exploits the characteristics of the multi-core processor cluster to divide the matrix to be decomposed into several data slices, which are distributed over the communication network interconnecting the multi-core processor nodes to the parallel nodes of the first level. Each first-level node completes its corresponding QR decomposition task in parallel and then sends its triangular matrix R over the network to a second-level node; each second-level node completes R-matrix merging and QR decomposition and then sends its R matrix over the network to a third-level node; and so on. The nodes of each level execute stage by stage in the order of a complete binary tree, with nodes of the same level executing in parallel, thereby completing the parallel computation flow graph among the multi-core processor chip nodes. The middle-level architecture partitions the matrix into sub-blocks according to the matrix scale input to the processor node and the number of processor cores, each sub-block being a square matrix of uniform size; the whole operation proceeds layer by layer along the diagonal sub-blocks, with the QR decomposition and matrix data update of each single layer executed in parallel by the multiple cores within a processor chip. The bottom-level architecture performs multi-data parallel vector computation using the SIMD-capable processor instruction set, completing the single-core QR decomposition and multiplication operations. The multi-core processor cluster thus adopts a layer-by-layer decomposition method and realizes parallel QR decomposition of a large-scale matrix through three parallel levels: parallelism among multi-processor chip nodes, multi-core parallelism within a single processor chip, and single-core multi-data parallelism.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
The parallel logic is clear, and scalability and portability are strong. The parallel computing capability of the multi-core processor cluster is fully utilized: the three-level parallel architecture realizes the QR decomposition parallel computing structure on a processor cluster system with large-scale parallel computing capability built from multi-core processor chips, the nodes of each level execute stage by stage in the order of a complete binary tree, nodes of the same level execute in parallel, and deployment on a multi-processor cluster is easy. The three-level parallel architecture is clear in structure, easy to realize, and can satisfy system throughput and real-time processing requirements at the same time. Experimental results show that, compared with prior-art software implementations on general-purpose processors, the parallel QR decomposition achieves a speed-up of more than 20 times in core computing performance and exposes higher data and pipeline parallelism in the matrix triangularization process.
In the invention, the communication network is formed by interconnecting switching chips among the multi-core processor nodes, so the cluster scale is easy to expand, while the shared memory among the cores within a single processor node markedly improves the communication efficiency and task synchronization of parallel tasks among the cores. The whole cluster architecture facilitates task allocation and scheduling for the three-level parallel architecture and has high flexibility, scalability, and portability.
The method is based on a multi-core processor cluster and adopts a layer-by-layer decomposition structure to realize multi-node parallelism, single-node multi-core parallelism, and single-core multi-data parallelism. The multi-node parallelism with its binary-tree structural characteristic fully exploits the cluster's parallel processing advantage and satisfies the throughput requirement of large-scale matrix QR decomposition: for the same processing time, an exponential growth in the scale of the matrix to be processed is obtained at the cost of only linear growth in cluster size. The multi-core parallelism within a processor fully utilizes the multi-core resources, realizes the QR decomposition and data matrix update of a single-layer sub-block in parallel, greatly improves the real-time computing capability and QR decomposition speed on a single node, and scales processing performance with the number of cores. The single-core multi-data parallelism exploits the SIMD and vector computing capability of the processor instruction set, and the improved GS (Gram-Schmidt) algorithm increases single-core QR decomposition performance by a factor of 4. The three-level parallel architecture is thus very well suited to a multi-core processor cluster: it satisfies the throughput requirement of large-scale matrix QR decomposition while greatly reducing processing delay, giving real-time processing capability.
The invention comprehensively considers the communication, storage, and computing resources of the multi-core processor together with the instruction-level SIMD, matrix, and vector computing capabilities, and realizes parallel large-scale QR decomposition through three parallel levels: parallelism among multi-processor nodes, multi-core parallelism within a single chip, and single-core multi-data parallelism. Multi-core communication and task synchronization inside a processor are realized through shared memory; communication among chips uses a network switching chip, so the cluster scale can be expanded at will; task-level and data-level parallelism in the matrix decomposition process are handled simultaneously; and the processor's SIMD and vector computation instruction sets are fully used for parallel operation, realizing flexible scheduling of the parallel QR decomposition tasks. This multi-level parallel computing structure attains a performance improvement of more than 20 times in numerical computations involving weight updating in large-scale array signal processing, with low communication overhead, marked computational acceleration, and clearly improved computing parallelism and cluster communication flexibility. It is very suitable for large-scale parallel computing clusters built around multi-core DSPs.
The method is greatly superior to existing QR decomposition methods, has outstanding engineering application value, and is very suitable for computation on large-scale multi-core processor clusters.
Drawings
FIG. 1 is a block diagram of the top-level parallel structure in the large-scale matrix QR decomposition three-level parallel structure of the present invention;
FIG. 2 is a block diagram of the layer-by-layer progressive relationship of the middle-level parallel structure in the three-level parallel structure;
FIG. 3 is a block diagram of single-layer multi-core parallel computing in the middle-level parallel architecture of FIG. 2;
FIG. 4 is a block diagram of the cascade progression of the bottom-level parallel structure in the large-scale QR decomposition three-level parallel structure;
FIG. 5 is a block diagram of the inter-node connections of the multi-core processor cluster and the multi-core connections within a processor.
Detailed Description
See FIGS. 1-3. In a preferred embodiment described below, a large-scale matrix QR decomposition parallel computing structure comprises three levels of parallelism: processor-node parallelism, processor-core parallelism, and single-core instruction-level parallelism. The first-level parallel structure, with its binary-tree structural characteristic, belongs to the top-level architecture of the three-level parallel structure; the second-level parallel structure belongs to the middle-level architecture; and the third-level parallel structure belongs to the bottom-level architecture. In a processor cluster system with large-scale parallel computing capability built from multi-core processor chips, the top-level architecture divides the matrix to be decomposed into data slices and distributes them, over the communication network built around a switching chip that interconnects the multi-core processor nodes, to the parallel nodes of the first level. Each first-level node completes its corresponding QR decomposition task in parallel and outputs a triangular matrix R to a second-level node; each second-level node completes R-matrix merging and QR decomposition and sends its R matrix to a third-level node; the nodes of each level execute stage by stage in the order of a complete binary tree, with nodes of the same level executing in parallel, thereby completing the parallel computation flow graph among the processor chip nodes. The middle-level architecture partitions the input matrix into blocks according to its scale; each block is a square sub-matrix of uniform size, QR decomposition and matrix data update operations take a sub-block or sub-block combination as their object, the whole operation proceeds layer by layer along the diagonal sub-blocks, and the QR decomposition and update of each single layer are executed in parallel by the multiple cores within a processor chip. The bottom-level architecture performs vector computation with the single-instruction multiple-data (SIMD) capable processor instruction set to complete the single-core QR decomposition and multiplication operations. The large-scale matrix QR decomposition thus adopts a layer-by-layer decomposition method and is realized through three parallel levels: parallelism among multi-core processor chip nodes, multi-core parallelism within a chip, and single-core multi-data parallelism.
In an optional embodiment, the parallel architecture realizing large-scale parallel QR decomposition of a matrix adopts a three-level architecture of top level, middle level, and bottom level.
In an alternative embodiment, the top-level parallel architecture shown in FIG. 1 is a typical binary-tree architecture and mainly accomplishes parallelism among the processor chip nodes. The large-scale parallel computing platform has at least 8 multi-core processor chips and receives an input matrix A to be decomposed of at least 16N rows and N columns. The sub-matrix partitioning of A can be cut flexibly according to the maximum parallel capacity the platform provides: A is split by rows into several sub-matrix blocks A_i of 2N rows and N columns each, and with 8 multi-core processor chips it can be split into at most the blocks A1-A8. The top-level architecture of the matrix A to be decomposed cascades according to the binary-tree 8-4-2-1 structure, and this typical binary-tree architecture completes the parallelism among the processor chip nodes. To greatly improve the throughput capacity of the platform, with at least 17 processor nodes, up to 8 multi-core processor nodes per stage can execute simultaneously in parallel, and the 8-4-2-1 architecture is used to build a pipeline of at least 4 stages.
The pipeline structure with the 8-4-2-1 architecture sequentially executes the following steps:
First-stage pipeline: proceeding in the direction of the pipeline cascade, 8 processor nodes work simultaneously in parallel; node 1 performs the QR decomposition of A1, node 2 that of A2, and in general node i that of A_i. Because all sub-matrices A_i have the same size, every node has the same execution time, which gives good parallel efficiency.
Second-stage pipeline: the N x N upper triangular matrices R_{1,i} output by the first stage are merged pairwise, each merged matrix having size 2N x N, and 4 processor nodes perform QR decomposition on the new matrices simultaneously in parallel.
Third-stage pipeline: the N x N upper triangular matrices R_{2,i} output by the second stage are merged pairwise, each merged matrix having size 2N x N, and 2 processor nodes perform QR decomposition on the new matrices simultaneously in parallel.
Fourth-stage pipeline: the N x N upper triangular matrices R_{3,i} output by the third stage are merged pairwise into a single 2N x N matrix; 1 processor node performs QR decomposition on it, and the output result is the upper triangular R matrix of the overall QR decomposition. A serial sketch of this reduction follows.
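The following serial NumPy sketch (not part of the patent; tree_qr_R is our name, and np.linalg.qr stands in for each node's QR kernel) mirrors the 8-4-2-1 reduction of R factors:

```python
import numpy as np

def tree_qr_R(A, leaves=8):
    """TSQR-style binary-tree reduction mirroring the 8-4-2-1 pipeline:
    QR-factor each 2N x N slice, then repeatedly merge pairs of N x N
    R factors and re-factor until a single R remains."""
    # Stage 1: each "processor node" factors its own 2N x N slice.
    Rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(A, leaves)]
    # Stages 2..4: pairwise merge into 2N x N stacks and re-factor.
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

N = 5
A = np.random.randn(16 * N, N)            # 16N rows, N columns
R_tree = tree_qr_R(A)
R_ref = np.linalg.qr(A, mode='r')
# R factors of a full-rank matrix agree up to the sign of each row.
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```

Each iteration of the while loop corresponds to one pipeline stage executing its merges in parallel; only the N x N R factors travel over the network, which keeps the inter-node communication volume low.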
See FIG. 2. The middle-level parallel architecture completes QR decomposition in parallel using the multiple cores of a single-node processor, and the single-node QR decomposition time directly determines the processing delay of the top-level parallel architecture. The middle-level architecture decomposes the 2N-row, N-column sub-matrix blocks, executing in a multi-layer progressive manner with each layer executed in parallel. The middle level partitions the 2N x N matrix into blocks according to the core count of the single-node processor, each block A_{i,j} being a square matrix. (Matrix transposition is defined mathematically as follows: if A is an m x n matrix, i.e. m rows and n columns, whose element in row i and column j is a_{i,j} with i, j >= 1, then the n x m matrix B = A^T satisfies b_{j,i} = a_{i,j}.) Assuming the multi-core processor chip has 8 cores, each block A_{i,j} is a square matrix of 2N/8 rows and 2N/8 columns. The decomposition proceeds layer by layer in the progressive direction as follows:
First layer: the middle-level parallel architecture performs QR decomposition on the block matrix A_{1,1}, eliminates to zero the same-column blocks A_{i,1} (i > 1), and updates the same-row blocks A_{1,j} (j > 1). This comprises 4 operations: the general QR operation with transposition, GEQRT; the post-QR multiplication same-row update, ORMQR; the tall-skinny matrix QR operation with transposition, TSQRT; and the post-QR tall-skinny multiplication same-row update, TSMQR. GEQRT executes a general QR operation, updates A_{1,1} with the upper triangular matrix R of the QR result, and outputs the transpose Q^T of the orthogonal matrix Q. ORMQR uses the preceding GEQRT result Q^T to execute A_{1,j} = Q^T A_{1,j}, updating the matrices A_{1,j}. TSQRT uses the updated A_{1,1} and A_{i,1} (i > 1) to form the combination

$$\begin{bmatrix} A_{1,1} \\ A_{i,1} \end{bmatrix},$$

performs QR decomposition on this combined tall-skinny matrix, outputs a new upper triangular matrix R_{1,1} that updates A_{1,1}, and outputs the result Q^T. TSMQR mainly uses the preceding TSQRT output result Q^T to execute

$$\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix} = Q^T \begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix},$$

updating A_{1,j} and A_{i,j}; T denotes transposition.
The computation of the first layer is completed through a multi-stage cascade, with the operations of each stage executed in parallel across the cores; the kernel sketches below illustrate the four operations functionally.
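The four operations can be prototyped in NumPy as follows (a functional sketch under the patent's kernel names but with our own signatures; Q^T is formed and returned explicitly here, whereas a real DSP kernel would keep it in compact Householder form):

```python
import numpy as np

def geqrt(A_kk):
    """General QR of a square diagonal tile: returns Q^T and the R tile."""
    Q, R = np.linalg.qr(A_kk)
    return Q.T, R

def ormqr(Qt, A_kj):
    """Same-row update right of the diagonal: A_kj <- Q^T A_kj."""
    return Qt @ A_kj

def tsqrt(R_kk, A_ik):
    """Tall-skinny QR of the stack [R_kk; A_ik]: returns the full Q^T and
    the new R_kk; the tile below the diagonal is thereby eliminated to 0."""
    b = R_kk.shape[0]
    Q, R = np.linalg.qr(np.vstack([R_kk, A_ik]), mode='complete')
    return Q.T, R[:b, :]

def tsmqr(Qt, A_kj, A_ij):
    """Paired same-row update: [A_kj; A_ij] <- Q^T [A_kj; A_ij]."""
    b = A_kj.shape[0]
    C = Qt @ np.vstack([A_kj, A_ij])
    return C[:b, :], C[b:, :]
```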
See FIG. 3. Single-layer multi-core parallel computation in the middle-level parallel architecture proceeds as follows. First-stage computation: core 0 and core 4 complete the general QR operation with transposition (GEQRT) in parallel. Second-stage computation: 8 cores in parallel; core 0 and core 4 complete the tall-skinny QR operation with transposition (TSQRT), and the other cores complete the post-QR multiplication same-row update (ORMQR). Third-stage computation: 8 cores in parallel; core 0 and core 4 complete TSQRT, and the remaining cores complete the post-QR tall-skinny same-row update (TSMQR). Fourth-stage computation: 8 cores in parallel; core 0 and core 4 complete TSQRT, and the remaining cores complete TSMQR. Fifth-stage computation: 7 cores in parallel; core 0 completes TSQRT on the merged tall-skinny matrix of cores 0 and 4, and the remaining cores complete TSMQR. Sixth-stage computation: 3 cores in parallel complete TSMQR.
Multi-core parallel second layer: perform QR decomposition on the block matrix A_{2,2} and update A_{2,2}; eliminate to zero the same-column blocks A_{i,2} (i > 2) and update the same-row blocks A_{2,j} (j > 2). The operation steps are similar to those of the first layer, except that the number of cores participating in the parallel computation differs.
Multi-core parallel third layer: perform QR decomposition on the block matrix A_{3,3} and update A_{3,3}; eliminate to zero the same-column blocks A_{i,3} (i > 3) and update the same-row blocks A_{3,j} (j > 3). The operation steps are similar to those of the first layer, except that the number of cores participating in the parallel computation differs.
Multi-core parallel fourth layer: since only 1 column of blocks remains in this layer, it is only necessary to perform QR decomposition on A_{4,4}, update A_{4,4}, and eliminate to zero the same-column blocks A_{i,4} (i > 4). The operations involved are the general QR operation with transposition (GEQRT) and the tall-skinny QR operation with transposition (TSQRT). A serial driver for the whole layer-by-layer sweep is sketched below.
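Combining the kernel sketches above, a serial driver for the layer-by-layer sweep might read as follows (tile_qr_R is our name; in the middle level, the inner loops over i and j are what the multi-core schedule of FIG. 3 distributes across the cores):

```python
import numpy as np  # geqrt, ormqr, tsqrt, tsmqr as sketched above

def tile_qr_R(A, b):
    """Layer-by-layer tile QR over b x b tiles; returns the upper
    triangular factor R of A."""
    m, n = A.shape
    p, q = m // b, n // b
    T = [[A[i*b:(i+1)*b, j*b:(j+1)*b].astype(float).copy()
          for j in range(q)] for i in range(p)]
    for k in range(q):                    # one "layer" per diagonal tile
        Qt, T[k][k] = geqrt(T[k][k])      # GEQRT on the diagonal tile
        for j in range(k + 1, q):         # ORMQR across the tile row
            T[k][j] = ormqr(Qt, T[k][j])
        for i in range(k + 1, p):         # eliminate tiles below the diagonal
            Qt, T[k][k] = tsqrt(T[k][k], T[i][k])
            T[i][k] = np.zeros((b, b))
            for j in range(k + 1, q):     # TSMQR same-row updates
                T[k][j], T[i][j] = tsmqr(Qt, T[k][j], T[i][j])
    return np.vstack([np.hstack(row) for row in T])[:n, :]

A = np.random.randn(16, 8)
R = tile_qr_R(A, b=4)
assert np.allclose(np.abs(np.triu(R)), np.abs(np.linalg.qr(A, mode='r')))
```

For a fixed elimination step, the TSMQR updates over the columns j are independent of one another and can run on different cores, which is exactly the parallelism the single-layer multi-core schedule of FIG. 3 exploits.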
See FIG. 4. The bottom-level parallel architecture mainly completes the parallel computation within one processor core. The instruction sets of currently popular multi-core processors support SIMD operations as well as vector and matrix operations. To make full use of the parallel performance of the instruction set and to balance the time overhead of each core across the different operations, an improved Gram-Schmidt (GS) orthogonalization method is adopted for the general QR operation with transposition (GEQRT) and the tall-skinny QR operation with transposition (TSQRT). The steps of the improved GS method are as follows:
The first step: the bottom-level parallel architecture performs parallel processing with dedicated single-instruction multiple-data (SIMD) and vector computation instructions. To solve the first diagonal element of the R matrix, the vector dot product

$$R_1 = \mathbf{a}_1^T \mathbf{a}_1 = \sum_i a_{i1}^2$$

is computed with a dedicated instruction; the diagonal element $r_{11} = \sqrt{R_1}$ is computed with a square-root instruction, and the reciprocal $g_1 = 1/r_{11}$ with a dedicated reciprocal instruction. In particular, the additional instruction sets provided by the multi-core processor accelerate these computations; different processors have their own specially supported instruction sets.
The second step: the bottom-level parallel architecture uses dedicated SIMD and vector computation instructions to perform the vector calculation of the other elements of the first row of the R matrix, $C_{1j} = \mathbf{a}_1^T \mathbf{a}_j$, and solves the first row of the R matrix as $R_{1j} = C_{1j} \cdot g_1$ (j > 1).
The third step: the bottom-level parallel architecture uses dedicated SIMD and vector computation instructions to update the matrix A: with the coefficients $h_{1j} = C_{1j}/R_{11}$ (j > 1), the vector calculation $\mathbf{a}_j = \mathbf{a}_j - h_{1j}\,\mathbf{a}_1$ is performed on the column vectors of A. At this point all elements of the first row of the R matrix have been solved and the matrix A has been updated. Repeating these three steps completes the evaluation of the whole R matrix. Apart from a small number of square-root and division operations, the improved GS algorithm consists mostly of vector multiply-add, so parallel processing with SIMD and vector computation instructions can greatly accelerate the whole computation process.
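A NumPy transcription of the three steps (our sketch; vectorized operations stand in for the SIMD and vector instructions, and we read the R11 in the coefficient h1j = C1j/R11 as the squared norm R_1, which makes the update the exact Gram-Schmidt projection):

```python
import numpy as np

def gs_qr_R(A):
    """Row-by-row improved-GS evaluation of R: one sqrt and one reciprocal
    per diagonal element, everything else dot products and multiply-adds."""
    A = A.astype(float).copy()
    m, n = A.shape
    R = np.zeros((n, n))
    for k in range(n):
        # Step 1: squared norm via a dot product, then sqrt and reciprocal.
        Rk = A[:, k] @ A[:, k]
        R[k, k] = np.sqrt(Rk)
        g = 1.0 / R[k, k]
        # Step 2: remaining entries of row k of R, one dot product per column.
        C = A[:, k] @ A[:, k + 1:]
        R[k, k + 1:] = C * g
        # Step 3: multiply-add update of the trailing columns of A.
        A[:, k + 1:] -= np.outer(A[:, k], C / Rk)
    return R

A = np.random.randn(12, 6)
R_ref = np.linalg.qr(A, mode='r')
R_ref = R_ref * np.sign(np.diag(R_ref))[:, None]  # normalize diagonal signs
assert np.allclose(gs_qr_R(A), R_ref)
```

Steps 2 and 3 are pure dot products and multiply-adds over whole columns, so they map directly onto SIMD and vector instructions; only the one square root and one reciprocal per diagonal element remain scalar, which is consistent with the single-core gain claimed above for the improved GS method.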
See FIG. 5. In the multi-core processor cluster, the system architecture is formed by interconnecting multiple multi-core processors through a switching network cross-linked with double-data-rate (DDR) memories, and this interconnection architecture determines the generality and scalability of the cluster. A multi-core processor typically has multiple independent processing cores; for example, the TMS320C6678 digital signal processing (DSP) chip has 8 relatively independent cores. The 8 processor chips are interconnected through a switching chip (the switching network can be a high-speed communication network such as RapidIO), forming a multi-core processor cluster with 64 processing cores. The 8 processor nodes share data through the DDR of each node; for example, multi-core processor chip 0 sends data from its DDR through the switching network to the DDR of multi-core processor chip 1. The 8 cores within a multi-core processor chip carry out fast data interaction and task synchronization through the on-chip shared memory, and can also communicate with the processing cores of other nodes through the off-chip DDR. Interconnecting the processor nodes through switching chips to form the communication network facilitates expanding the cluster scale; the cores within a processor improve the data communication efficiency and task synchronization of inter-core parallel tasks through the on-chip shared memory, while different processors communicate through DDR, which facilitates data distribution among the nodes. The interconnection scheme of the whole cluster architecture facilitates task allocation and scheduling for the three-level parallel architecture and has high flexibility and good scalability.
The foregoing is directed to the preferred embodiment of the present invention and it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (10)

1. A large-scale matrix QR decomposition parallel computing structure, comprising three levels of parallelism, namely parallelism among multi-processor nodes, multi-core parallelism within a single processor, and single-core multi-data parallelism, wherein the first-level parallel structure with the binary-tree structural characteristic belongs to the top-level architecture of the three-level parallel structure, the second-level parallel structure belongs to the middle-level architecture, and the third-level parallel structure belongs to the bottom-level architecture, characterized in that: in a processor cluster system with large-scale parallel computing capability built from multi-core processor chips, the top-level architecture exploits the characteristics of the multi-core processor cluster to divide the matrix to be decomposed into several data slices, which are distributed to the parallel nodes of the first level through the communication network interconnecting the multi-core processor nodes; each first-level node completes its corresponding QR decomposition task in parallel and then sends a triangular matrix R through the network to a second-level node; each second-level node completes R-matrix merging and QR decomposition and then sends its R matrix through the network to a third-level node; and so on, the nodes of each level executing stage by stage in the order of a complete binary tree, with same-level nodes executing in parallel, thereby completing the parallel computation flow graph among the multi-core processor chip nodes; the middle-level architecture partitions the matrix into blocks according to the matrix scale input to the processor node and the number of processor cores, each block being a square matrix of uniform size, the whole operation proceeding layer by layer along the diagonal sub-blocks, with single-layer sub-block QR decomposition and matrix data update operations executed in parallel by the multiple cores within a processor chip; the bottom-level architecture performs multi-data parallel vector computation using the processor instruction set with single-instruction multiple-data (SIMD) capability, completing the single-core QR decomposition or multiplication operations; the multi-core processor cluster adopts a layer-by-layer decomposition method and realizes parallel QR decomposition of a large-scale matrix through the three parallel levels of parallelism among multi-processor chip nodes, multi-core parallelism within a single processor chip, and single-core multi-data parallelism.
2. The large-scale matrix QR decomposition parallel computing structure of claim 1, characterized in that: the large-scale parallel computing platform has at least 8 multi-core processor chips and receives an input matrix A to be decomposed of at least 16N rows and N columns; the sub-matrix partitioning of the input matrix A can be cut flexibly according to the maximum parallel capacity the platform provides, the input matrix A being split by rows into several sub-matrix blocks A_i of 2N rows and N columns each, so that the 8 multi-core processor chips can be assigned at most the blocks A1-A8; the top-level architecture of the matrix A to be decomposed cascades according to the binary-tree 8-4-2-1 structure, the top-level parallel architecture being a typical binary-tree architecture with the processor chip nodes in parallel; with at least 17 processor nodes, up to 8 multi-core processor nodes per stage can execute simultaneously in parallel, and a pipeline of at least four stages is built using the 8-4-2-1 architecture.
3. The large-scale matrix QR decomposition parallel computing structure of claim 2, characterized in that: in the stage-by-stage direction, the first-stage pipeline processes 8 processor nodes simultaneously in parallel, node 1 performing the QR decomposition of A1, node 2 that of A2, and node i that of A_i, each node having the same execution time because all A_i have the same size; the second-stage pipeline merges pairwise the N x N upper triangular matrices R_{1,i} output by the first stage, each merged matrix having size 2N x N; the third-stage pipeline merges pairwise the N x N upper triangular matrices R_{2,i} output by the second stage, each merged matrix having size 2N x N; the fourth-stage pipeline merges pairwise the N x N upper triangular matrices R_{3,i} output by the third stage, the merged matrix having size 2N x N.
4. The large-scale matrix QR decomposition parallel computing structure of claim 3, characterized in that: each single node of the second-stage pipeline performs data merging and QR decomposition; each single node of the third-stage pipeline performs data merging and QR decomposition; and 1 processor node of the fourth-stage pipeline performs QR decomposition on the merged matrix, the output result being the upper triangular R matrix of the overall QR decomposition.
5. The large-scale matrix QR decomposition parallel computing structure of claim 1, characterized in that: the middle-level parallel architecture performs QR decomposition on the 2N-row, N-column sub-matrix blocks, executing in a multi-layer progressive manner in the layer-progressive direction with each layer executed in parallel, QR decomposition being completed in parallel by the multiple cores of a single-node processor.
6. The large-scale matrix QR decomposition parallel computing structure of claim 1, characterized in that: in the first layer, the middle-level parallel architecture performs QR decomposition on the block matrix A_{1,1}, eliminates to zero the same-column blocks A_{i,1} (i > 1), and updates the data of the same-row blocks A_{1,j} (j > 1), the data update comprising 4 operations, namely the general QR operation with transposition GEQRT, the post-QR multiplication same-row update ORMQR, the tall-skinny matrix QR operation with transposition TSQRT, and the post-QR tall-skinny multiplication same-row update TSMQR.
7. The large-scale matrix QR decomposition parallel computing structure of claim 6, characterized in that: the general QR operation with transposition GEQRT executes a general QR operation, updates A_{1,1} with the upper triangular matrix R of the QR result, and outputs the transpose Q^T of the orthogonal matrix Q; the post-QR multiplication same-row update ORMQR uses the preceding GEQRT result Q^T to execute A_{1,j} = Q^T A_{1,j}, updating the matrices A_{1,j}; the tall-skinny matrix QR operation with transposition TSQRT uses the updated A_{1,1} and A_{i,1} (i > 1) to form the combination

$$\begin{bmatrix} A_{1,1} \\ A_{i,1} \end{bmatrix},$$

performs QR decomposition on the combined tall-skinny matrix, outputs a new upper triangular matrix R_{1,1} that updates A_{1,1}, and outputs the result Q^T; the post-QR tall-skinny multiplication same-row update TSMQR mainly uses the preceding TSQRT output result Q^T to execute

$$\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix} = Q^T \begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix},$$

updating A_{1,j} and A_{i,j}; T denotes transposition.
8. The large-scale matrix QR decomposition parallel computing structure of claim 1, characterized in that: in the single-layer multi-core parallelism of the middle-level parallel architecture, the first-stage computation has core 0 and core 4 completing the GEQRT operation in parallel; the second-stage computation has 8 cores in parallel, core 0 and core 4 completing the TSQRT operation and the other cores completing the ORMQR operation; the third-stage computation has 8 cores in parallel, core 0 and core 4 completing the TSQRT operation and the remaining cores completing the TSMQR operation; the fourth-stage computation has 8 cores in parallel, core 0 and core 4 completing the TSQRT operation and the remaining cores completing the TSMQR operation; the fifth-stage computation has 7 cores in parallel, core 0 completing the TSQRT operation on the merged tall-skinny matrix of cores 0 and 4 and the remaining cores completing the TSMQR operation; and in the sixth-stage multi-core parallelism, 3 cores in parallel complete the TSMQR operation.
9. The large-scale matrix QR decomposition parallel computing structure of claim 8, characterized in that: the multi-core parallel second layer performs QR decomposition on the block matrix A_{2,2}, updates A_{2,2}, eliminates to zero the same-column blocks A_{i,2} (i > 2), and updates the same-row blocks A_{2,j} (j > 2); the multi-core parallel third layer performs QR decomposition on the block matrix A_{3,3}, updates A_{3,3}, eliminates to zero the same-column blocks A_{i,3} (i > 3), and updates the same-row blocks A_{3,j} (j > 3); and in the multi-core parallel fourth layer, since only 1 column of blocks remains, it is only necessary to perform QR decomposition on A_{4,4}, update A_{4,4}, and eliminate to zero the same-column blocks A_{i,4} (i > 4).
10. The large-scale matrix QR decomposition parallel computing structure of claim 9, characterized in that: the bottom-level parallel architecture performs parallel processing with single-instruction multiple-data (SIMD) instructions and vector computation instructions; to solve the first diagonal element of the R matrix, the vector dot product $R_1 = \mathbf{a}_1^T \mathbf{a}_1$ is computed, the diagonal element $r_{11} = \sqrt{R_1}$ is computed with a square-root instruction, and the reciprocal $g_1 = 1/r_{11}$ is computed with a dedicated reciprocal instruction; the vector calculation of the other elements of the first row of the R matrix, $C_{1j} = \mathbf{a}_1^T \mathbf{a}_j$, is performed, and the first row of the R matrix is solved as $R_{1j} = C_{1j} \cdot g_1$ (j > 1); the matrix A is updated with the coefficients $h_{1j} = C_{1j}/R_{11}$ (j > 1), the vector calculation $\mathbf{a}_j = \mathbf{a}_j - h_{1j}\,\mathbf{a}_1$ being performed on the column vectors of the matrix A using the intermediate results; the solution of all elements of the first row of the R matrix is thereby completed and the matrix A updated, and these steps are repeated to complete the evaluation of the whole R matrix.
CN202010609939.3A 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system Active CN111858465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010609939.3A CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010609939.3A CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Publications (2)

Publication Number Publication Date
CN111858465A 2020-10-30
CN111858465B CN111858465B (en) 2023-06-06

Family

ID=72989894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010609939.3A Active CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Country Status (1)

Country Link
CN (1) CN111858465B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488506A (en) * 2020-11-30 2021-03-12 哈尔滨工程大学 Extensible distributed architecture and self-organizing method of intelligent unmanned system cluster
CN112506677A (en) * 2020-12-09 2021-03-16 上海交通大学 TensorFlow distributed matrix calculation implementation method and system
CN112631986A (en) * 2020-12-28 2021-04-09 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device
CN113254078A (en) * 2021-06-23 2021-08-13 北京睿芯高通量科技有限公司 Data stream processing method for efficiently executing matrix addition on GPDPU simulator
CN115952391A (en) * 2022-12-12 2023-04-11 海光信息技术股份有限公司 Data processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193831A (en) * 2010-03-12 2011-09-21 复旦大学 Method for establishing hierarchical mapping/reduction parallel programming model
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CA2801382A1 (en) * 2010-06-29 2012-01-05 Exxonmobil Upstream Research Company Method and system for parallel simulation models
CN107210984A (en) * 2015-01-30 2017-09-26 华为技术有限公司 Method and apparatus for carrying out the parallel operation based on QRD in many execution unit processing systems
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193831A (en) * 2010-03-12 2011-09-21 复旦大学 Method for establishing hierarchical mapping/reduction parallel programming model
CA2801382A1 (en) * 2010-06-29 2012-01-05 Exxonmobil Upstream Research Company Method and system for parallel simulation models
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CN107210984A (en) * 2015-01-30 2017-09-26 华为技术有限公司 Method and apparatus for carrying out the parallel operation based on QRD in many execution unit processing systems
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吴荣腾: "多核与多GPU系统下的一种矩阵三角分解并行算法" *
武勇;王俊;张培川;曹运合;: "CUDA架构下外辐射源雷达杂波抑制并行算法" *
穆帅;王晨曦;邓仰东;: "基于GPU的多层次并行QR分解算法研究" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488506A (en) * 2020-11-30 2021-03-12 哈尔滨工程大学 Extensible distributed architecture and self-organizing method of intelligent unmanned system cluster
CN112506677A (en) * 2020-12-09 2021-03-16 上海交通大学 TensorFlow distributed matrix calculation implementation method and system
CN112631986A (en) * 2020-12-28 2021-04-09 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device
CN112631986B (en) * 2020-12-28 2024-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device
CN113254078A (en) * 2021-06-23 2021-08-13 北京睿芯高通量科技有限公司 Data stream processing method for efficiently executing matrix addition on GPDPU simulator
CN113254078B (en) * 2021-06-23 2024-04-12 北京中科通量科技有限公司 Data stream processing method for efficiently executing matrix addition on GPDPU simulator
CN115952391A (en) * 2022-12-12 2023-04-11 海光信息技术股份有限公司 Data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111858465B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111858465B (en) Large-scale matrix QR decomposition parallel computing system
Ryu et al. Bitblade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
Wang et al. FPDeep: Scalable acceleration of CNN training on deeply-pipelined FPGA clusters
Urquhart et al. Systolic matrix and vector multiplication methods for signal processing
CN109635241B (en) Method for solving symmetric or hermitian symmetric positive definite matrix inverse matrix
CN111199275B (en) System on chip for neural network
CN110361691B (en) Implementation method of coherent source DOA estimation FPGA based on non-uniform array
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN107341133A (en) The dispatching method of Reconfigurable Computation structure based on Arbitrary Dimensions LU Decomposition
CN114781632A (en) Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
Asgari et al. Meissa: Multiplying matrices efficiently in a scalable systolic architecture
US20180373677A1 (en) Apparatus and Methods of Providing Efficient Data Parallelization for Multi-Dimensional FFTs
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
CN116710912A (en) Matrix multiplier and control method thereof
US11886347B2 (en) Large-scale data processing computer architecture
CN113055060B (en) Coarse-grained reconfigurable architecture system for large-scale MIMO signal detection
Wu et al. Accelerator design for vector quantized convolutional neural network
Jiang et al. Prarch: Pattern-based reconfigurable architecture for deep neural network acceleration
Gallivan et al. High-performance architectures for adaptive filtering based on the Gram-Schmidt algorithm
US20230244484A1 (en) Bit-parallel vector composability for neural acceleration
CN113705773B (en) Dynamically reconfigurable PE unit and PE array for graph neural network reasoning
Chen et al. Edge FPGA-based Onsite Neural Network Training
CN112596912B (en) Acceleration operation method and device for convolution calculation of binary or ternary neural network
Miao A Review on Important Issues in GCN Accelerator Design

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant