CN111858465B - Large-scale matrix QR decomposition parallel computing system - Google Patents


Info

Publication number
CN111858465B
CN111858465B (application CN202010609939.3A)
Authority
CN
China
Prior art keywords
matrix
parallel
decomposition
core
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010609939.3A
Other languages
Chinese (zh)
Other versions
CN111858465A (en)
Inventor
吴明钦
刘红伟
潘灵
贾明权
郝黎宏
林勤
张昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc filed Critical Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202010609939.3A priority Critical patent/CN111858465B/en
Publication of CN111858465A publication Critical patent/CN111858465A/en
Application granted granted Critical
Publication of CN111858465B publication Critical patent/CN111858465B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007: Single instruction multiple data [SIMD] multiprocessors
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76: Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/78: Arrangements for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a large-scale matrix QR decomposition parallel computing structure in the field of digital signal processing, and aims to provide a three-level parallel computing structure with clear parallel logic, high throughput and low delay. It is realized by the following technical scheme: a processor cluster system and a QR decomposition parallel computing structure are built from multi-core processor chips. The top-layer architecture divides the matrix to be decomposed into a number of data segments and distributes them to the nodes of each stage over the communication network interconnecting the multi-core processor nodes; the nodes compute stage by stage following a complete binary-tree structure, with nodes of the same stage computing in parallel. The middle-layer architecture partitions the matrix into blocks and operates layer by layer along the diagonal sub-arrays. The bottom-layer architecture uses the processor instruction set for multi-data parallel vector computation to complete the single-core QR decomposition and multiplication operations. The multi-core processor cluster thus adopts a layer-by-layer decomposition structure to realize QR parallel decomposition of a large-scale matrix.

Description

Large-scale matrix QR decomposition parallel computing system
Technical Field
The invention relates to large-scale array antenna signal processing in the field of digital signal processing, and in particular to a QR decomposition method for the cluster parallel structure of a large-scale multi-core processor, for high-performance parallel computing in the field of numerical computation.
Background
In the field of digital signal processing, algorithms such as large-scale array antenna signal processing and large-scale Multiple Input Multiple Output (MIMO) processing often involve covariance matrix inversion, channel matrix estimation, channel equalization and similar problems, and QR decomposition is widely used in all of these. MIMO is a cornerstone of wireless network evolution; in fact, MIMO technology has become one of the most critical elements of many wireless communication standards, such as IEEE 802.11n and 3GPP-LTE. An efficient QR decomposition unit reduces the implementation complexity of a MIMO system and thereby yields good computational performance. Covariance matrix inversion is one of the most commonly used spatial interference suppression algorithms. However, when the interference power is too large, the sampling covariance matrix becomes singular, the matrix inversion lemma fails, and effective interference suppression weights cannot be generated. Because QR decomposition effectively improves the condition number of the matrix and its numerical stability, inverting the sampling covariance matrix via QR decomposition avoids the situation where overly strong interference prevents effective suppression weights from being obtained. The computational performance of QR decomposition therefore has a direct impact on the signal processing performance of an interference suppression system. QR decomposition is a principal tool of digital signal processing, plays an important role in high-performance computing, and is an important index of system performance. Matrix operations are one of the core problems of high-performance computing, matrix decomposition is an important route to increasing the parallelism of matrix operations, and QR decomposition is an important form of matrix decomposition. By definition A = QR, where Q is an orthogonal matrix and R is a non-singular upper triangular matrix (i.e., all elements of R below the diagonal are 0); the decomposition is unique when the diagonal elements of R are required to be positive.
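For illustration (this sketch is ours, not part of the patent), the definition above can be checked numerically with NumPy, including the positive-diagonal convention that makes the factorization unique:

```python
import numpy as np

# Minimal check of the QR definition: A = QR with Q orthogonal, R upper triangular.
A = np.random.randn(6, 4)
Q, R = np.linalg.qr(A)                      # reduced QR: Q is 6x4, R is 4x4

assert np.allclose(Q.T @ Q, np.eye(4))      # Q has orthonormal columns
assert np.allclose(R, np.triu(R))           # R is upper triangular
assert np.allclose(Q @ R, A)                # A is reconstructed exactly

# Uniqueness convention: flip signs so that diag(R) > 0 (A = QR is preserved,
# since the sign matrix D satisfies D @ D = I).
s = np.sign(np.diag(R))
s[s == 0] = 1.0
Q, R = Q * s, s[:, None] * R
assert np.all(np.diag(R) > 0) and np.allclose(Q @ R, A)
```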
There are many ways to compute a QR decomposition in practice, such as Givens rotations, Householder transformations and Gram-Schmidt orthogonalization, each with its own advantages and disadvantages; QR decomposition is typically implemented with one of these three methods. The Householder QR algorithm obtains a reflection matrix and zeroes the elements below the diagonal through matrix multiplication, so the process needs a large amount of matrix-multiply computation, which increases algorithmic complexity. The Givens-rotation QR method updates the matrix through rotation matrices; it can split the work into row updates and, although it involves division and square-root operations, its complexity is lower than the Householder approach. Because the amount of computation in QR decomposition is huge, the decomposition consumes a large amount of time and becomes the bottleneck for improving the performance of many practical applications. For example, in cognitive radio, QR decomposition is the most time-consuming module of Singular Value Decomposition (SVD): published data show that QR decomposition accounts for more than 70% of the overall SVD operation time.
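As a minimal sketch of the row-update structure of Givens-rotation QR mentioned above (our illustration, not the patent's kernel; the helper names `givens` and `qr_givens` are hypothetical):

```python
import numpy as np

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

def qr_givens(A):
    """Unoptimized QR via Givens rotations: each rotation is a 2-row update."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):            # zero R[i, j] against R[i-1, j]
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
    return Q, R

A = np.random.randn(5, 3)
Q, R = qr_givens(A)
assert np.allclose(Q @ R, A) and np.allclose(np.tril(R, -1), 0)
```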
Large-scale matrix QR decomposition is widely used in signal processing, image processing, computational structural mechanics and related fields. Because the operation count of large-scale QR decomposition is huge and the algorithm structure is complex and not naturally parallel, the traditional approach is to implement it on a high-performance supercomputing platform based on the X86 architecture. Task allocation and data synchronization for large-scale QR decomposition on such a distributed computing platform lead to long communication times and cannot meet ms-level real-time processing requirements. In recent years, many small- and medium-scale matrix QR decompositions have been implemented in FPGAs, usually with a systolic-array hardware structure; the parallelism and real-time performance are good, but as matrix sizes grow rapidly, FPGA-based QR decomposition is limited by chip area and power consumption, cannot meet the high throughput requirement of large-scale matrix QR decomposition, and has a long development cycle.
The key to wide application of large-scale matrix QR decomposition is to reduce processing delay while improving throughput. With the large growth in front-end sensor scale, the continuous increase in sampling rate and the continuous growth of the channel matrix, conventional QR decomposition methods can no longer meet the throughput and real-time processing requirements of large-scale QR decomposition. Current parallel-computing research on QR decomposition sits at two extremes. First, in scientific computing, popular distributed parallel architectures are used to decompose ultra-large matrices on platforms such as Hadoop; this achieves large-scale throughput but cannot meet real-time requirements and is not applicable to embedded equipment with strict demands on low power, flexibility and high reliability. Second, special matrix-decomposition parallel processors are designed for a specific hardware structure, typified by FPGA implementations of QR decomposition, in pursuit of extremely fast processing; this approach has a long development cycle, limited chip-scale performance, difficult expansion and limited throughput, and is not applicable to large-scale QR decomposition.
For most engineering applications, a system hardware architecture with strong expansibility needs to be built flexibly and rapidly from existing mature chips. Processor chips have evolved from single-core to multi-core, and each core typically has Single Instruction Multiple Data (SIMD) parallel computing capability and vector computing capability. At the current international state of the art, hundreds of lightweight cores can be integrated in a single chip; taking TI's TMS320C6678 DSP as an example, it integrates 8 high-performance DSP cores, and single-chip processing capacity will be further improved both by raising single-core performance and by increasing the core count. Multi-core technology is therefore becoming widely available and a hot spot of CPU/DSP development. Technically, processor multi-core parallelism is an important method of realizing software parallelism, and multiple cores are an important route to improving processor performance; the multi-core structure is very suitable for parallel computing tasks. Increasing the number of cores and enlarging the on-chip memory are the main means by which current commercial multi-core DSPs improve their computing power. The multi-core cluster is the mainstream massively parallel computing system at present: many multi-core processor chips connected by a high-speed interconnection network form a cluster with extremely strong parallel computing capability. However, the conventional QR decomposition algorithm cannot fully utilize the parallel processing capability of a multi-core cluster system, and it is difficult to realize its performance advantage.
Disclosure of Invention
The aim of the invention is a three-level parallel computing structure with clear parallel logic, high expansibility and portability, high throughput, low delay and high universality, which fully utilizes the parallel processing advantages of a multi-core processor cluster to realize parallelism among multi-processor nodes, among the cores of a single processor, and among multiple data within a single core, for applications that involve large-scale QR decomposition on a large multi-core processor cluster and simultaneously require real-time processing. This solves the problem that the traditional QR decomposition method cannot effectively use multi-core processor cluster resources for large-scale parallel computation.
The invention adopts the following technical scheme. A large-scale matrix QR decomposition parallel computing system comprises a three-level parallel structure realizing large-scale matrix QR parallel decomposition: parallelism among multi-processor nodes, multi-core parallelism within a single processor, and multi-data parallelism within a single core. The first-level parallel structure, with a binary-tree topology, forms the top layer of the three-level structure; the second-level parallel structure forms the middle layer; and the third-level parallel structure forms the bottom layer. In a processor cluster system with large-scale parallel computing capability, built from multi-core processor chips together with the QR decomposition parallel computing structure, the top-level architecture uses the cluster characteristics of the multi-core processors to divide the matrix to be decomposed into a number of data segments and distributes them to the parallel nodes of the first stage through the communication network interconnecting the multi-core processor nodes. Each first-stage node completes its QR decomposition task in parallel and then sends the upper triangular matrix R to a second-stage node over the network; the second-stage node completes R-matrix merging and QR decomposition and sends its R matrix on to a third-stage node, and so on. The nodes at each stage execute step by step following the complete binary-tree structure, and nodes of the same stage execute in parallel, completing the parallel computation flow among the multi-processor chip nodes. The middle-layer architecture partitions the matrix into sub-array blocks according to the matrix size input to the processor node and the number of processor cores; each sub-array is a square block of uniform size, the whole operation proceeds layer by layer along the diagonal sub-arrays, and single-layer sub-array QR decomposition and matrix data update operations are executed in parallel by the multiple cores in the processor chip. The bottom-layer architecture uses the SIMD-capable processor instruction set for multi-data parallel vector computation to complete the single-core QR decomposition and multiplication operations. The multi-core processor cluster adopts this layer-by-layer decomposition method to realize QR parallel decomposition of a large-scale matrix from the three-level parallel structure: inter-node parallelism across multi-processor chips, multi-core parallelism within a single processor chip, and multi-data parallelism within a single core.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
The parallel logic is clear, and expansibility and portability are strong. The invention fully utilizes the parallel computing capability of a multi-core processor cluster, adopting a three-layer parallel architecture to realize the QR decomposition parallel computing structure on a processor cluster system built from multi-core processor chips. The nodes at each stage execute step by step following the complete binary-tree structure, and peer nodes execute in parallel, so multi-processor cluster deployment is easy to realize. The three-layer parallel architecture is clearly structured and easy to implement, and meets system throughput and real-time processing requirements simultaneously. Experimental results show that, compared with a prior-art software implementation on a general-purpose processor, the parallel QR decomposition obtains a speed-up of more than 20 times in core computing performance and offers higher data and pipeline parallelism for the matrix triangularization process.
According to the invention, a communication network is formed by interconnection switch chips among the multi-core processor nodes, so the cluster scale is easy to expand, and the shared memory among the cores within a single processor node significantly improves the data communication efficiency and task synchronization of parallel tasks among the cores. The overall cluster architecture favors task allocation and scheduling for the three-layer parallel architecture, and has high flexibility, expansibility and portability.
The invention is based on a multi-core processor cluster and adopts a layer-by-layer decomposition structure to realize multi-node parallelism, single-node multi-core parallelism and single-core multi-data parallelism. The multi-node parallelism, with its binary-tree structure, fully utilizes the parallel processing advantage of the processor cluster and meets the throughput requirement of large-scale matrix QR decomposition; for the same processing time, an exponential increase in the size of the matrix to be processed can be obtained at the cost of a linear increase in cluster scale. The multi-core parallelism fully utilizes the processor's core resources to realize single-layer sub-array QR decomposition and data-matrix update operations in parallel, greatly raising the real-time computing capability and decomposition speed of QR at a single node, with processing performance scaling with the number of cores. The multi-data parallelism within a single core exploits the SIMD and vector computing capability of the processor instruction set, and the improved GS algorithm raises single-core QR decomposition performance by a factor of 4. The three-layer parallel architecture is very suitable for multi-core processor clusters: it meets the throughput requirement of large-scale matrix QR decomposition while greatly reducing processing delay, giving real-time processing capability.
The invention comprehensively considers the communication, storage and computing resources of the multi-core processor together with its instruction-level SIMD, matrix and vector computing capabilities, and realizes large-scale matrix QR decomposition in parallel from the three-level structure of multi-processor-node parallelism, single-chip multi-core parallelism and single-core multi-data parallelism. Shared internal memory realizes multi-core communication and task synchronization inside the processor, while network switch chips allow the cluster scale to be expanded arbitrarily for inter-chip communication; task-level and data-level parallelism in the matrix decomposition process can be handled simultaneously, the processor's SIMD and vector-computation instruction sets are fully exploited for parallel operation, and QR decomposition parallel tasks are scheduled flexibly. The multi-level parallel computing structure has low communication cost and obvious computing acceleration, remarkably improves computing parallelism and cluster communication flexibility, and can achieve a performance improvement of more than 20 times in the numerical computation of weight updating in large-scale array signal processing. The method is very well suited to a massively parallel computing cluster built around multi-core DSPs.
The method is greatly superior to existing QR decomposition methods, has outstanding engineering application value, and is very suitable for large-scale multi-core processor cluster computing.
Drawings
FIG. 1 is a block diagram of the top-level parallel structure in the three-level parallel structure of the large-scale matrix QR decomposition of the present invention;
FIG. 2 is a block diagram of the layer-by-layer progression of the middle-layer parallel structure in the large-scale matrix QR decomposition three-level parallel structure;
FIG. 3 is a block diagram of single-layer multi-core parallel computing in the middle-layer parallel architecture of FIG. 2;
FIG. 4 is a block diagram of the layer-by-layer progression of the bottom-layer parallel structure of the large-scale QR decomposition three-level parallel structure;
FIG. 5 is a block diagram of the connections between multi-core processor cluster nodes and between the cores of a processor.
Detailed Description
See FIGS. 1-3. In the preferred embodiment described below, a large-scale matrix QR decomposition parallel computing system comprises a three-level parallel structure realizing large-scale matrix QR parallel decomposition: a processor-node parallel structure, a processor-core parallel structure and single-core instruction-level parallelism. The first-level parallel structure, with a binary-tree topology, forms the top layer of the three-level structure; the second-level parallel structure forms the middle layer; and the third-level parallel structure forms the bottom layer. In a processor cluster system with large-scale parallel computing capability, built from multi-core processor chips together with the QR decomposition parallel computing structure, the top-level architecture uses the multi-core processor cluster to divide the matrix to be decomposed into data segments and distributes them to the parallel nodes of the first stage through a communication network built around a switch chip connecting the multi-core processor nodes. Each first-stage node completes its QR decomposition task in parallel and outputs the upper triangular matrix R to a second-stage node; the second-stage node completes R-matrix merging and QR decomposition and sends its R matrix to the third-stage node. The nodes execute step by step following the complete binary-tree structure, and nodes of the same stage execute in parallel, completing the parallel computation flow among the processor chip nodes. The middle-layer architecture partitions the matrix according to the matrix size input to the processor node; each block is a square sub-array of uniform size, QR decomposition and matrix data update operations take a sub-array or sub-array combination as their object, the whole operation proceeds layer by layer along the diagonal sub-arrays, and single-layer sub-array QR decomposition and update operations are executed in parallel by the multiple cores in the processor chip. The bottom-layer architecture uses the single-instruction-multiple-data (SIMD) capable processor instruction set for vector computation to complete single-core QR decomposition and multiplication operations. The large-scale matrix QR decomposition adopts this layer-by-layer method and realizes QR parallel decomposition from the three-level parallel structure of inter-node parallelism across multi-core processor chips, multi-core parallelism within a chip, and multi-data parallelism within a single core.
In this optional embodiment, the parallel architecture for large-scale matrix QR parallel decomposition adopts a three-layer structure of top layer, middle layer and bottom layer.
In an alternative embodiment, the top-level parallel architecture shown in FIG. 1 is a typical binary-tree architecture, mainly accomplishing parallelism among processor chip nodes. The massively parallel computing platform has at least 8 multi-core processor chips. The input matrix A to be decomposed has a size of at least 16N rows by N columns; its sub-array segmentation can be cut flexibly according to the maximum parallel capability provided by the platform, dividing it by rows into a number of sub-array blocks A_i of 2N rows by N columns, so that at most the 8 multi-core processor chips take the matrix blocks A1-A8. The top-level architecture for the matrix A to be decomposed is cascaded according to the binary-tree 8-4-2-1 structure, and the top-level parallel architecture, a typical binary-tree architecture, completes the parallelism among the processor chip nodes. To greatly increase platform throughput, at least 17 processor nodes can execute in parallel, with up to 8 multi-core processor nodes per stage executing simultaneously, and the 8-4-2-1 architecture builds a pipeline of at least 4 stages.
The pipeline of the 8-4-2-1 architecture performs the following steps in order (a sketch of this tree reduction follows the list):
Progressing in the pipeline cascade direction, the first-stage pipeline processes 8 processor nodes in parallel: node 1 executes the QR decomposition of A1, node 2 executes the QR decomposition of A2, and node i executes the QR decomposition of A_i. Because the matrix sizes are the same, every node A_i has the same execution time, giving good parallel performance.
Second-stage pipeline: the upper triangular matrices R_{1,i} of size N x N output by the first stage are merged in pairs; each merged new matrix has size 2N x N, and 4 processor nodes perform QR decomposition on the new matrices simultaneously in parallel.
Third-stage pipeline: the upper triangular matrices R_{2,i} of size N x N output by the second-stage pipeline are merged in pairs; each merged new matrix has size 2N x N, and 2 processor nodes perform QR decomposition on the new matrices simultaneously in parallel.
Fourth-stage pipeline: the upper triangular matrices R_{3,i} of size N x N output by the third stage are merged in pairs; the merged new matrix has size 2N x N, 1 processor node performs QR decomposition on it, and the output result is the upper triangular R matrix of the QR decomposition.
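A minimal sketch of this 8-4-2-1 tree reduction is given below (ours, not the patent's code). Each local QR call stands in for one multi-core processor node, and only the N x N R factors travel between pipeline stages, as described above:

```python
import numpy as np

def tsqr_tree(A, leaves=8):
    """8-4-2-1 binary-tree reduction: local QR per leaf node, then merge
    the R factors pairwise until a single upper triangular R remains."""
    blocks = np.vsplit(A, leaves)                        # A1..A8, each 2N x N
    Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks]   # stage 1: 8 nodes
    while len(Rs) > 1:                                   # stages 2-4: 4, 2, 1 nodes
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')    # merged matrix is 2N x N
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

N = 4
A = np.random.randn(16 * N, N)                 # at least 16N rows by N columns
R_tree = tsqr_tree(A)
R_direct = np.linalg.qr(A, mode='r')
# The two R factors agree up to the sign of each row.
assert np.allclose(np.abs(R_tree), np.abs(R_direct))
```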
See FIG. 2. The middle-layer parallel architecture completes QR decomposition using the multiple cores of a single-node processor, and the time of a single-processor-node QR decomposition directly determines the processing delay of the top-level parallel architecture. The middle-layer architecture decomposes the 2N-row by N-column sub-matrix block in a multi-layer progressive manner, each layer executing in parallel. The middle layer partitions the 2N x N matrix into blocks according to the number of cores of the single-node processor, each block matrix A_{i,j} being a square tile. (The matrix transpose is defined mathematically as follows: let A be an m x n matrix, i.e., m rows and n columns, whose element in row i and column j is a_{i,j}; then the n x m matrix B = A^T satisfies b_{j,i} = a_{i,j}.) Assuming a multi-core processor chip has 8 cores, each block matrix A_{i,j} is a square tile of 2N/8 rows by 2N/8 columns. The steps proceed in order as follows:
The middle-layer parallel architecture performs QR decomposition on the first-layer block square matrix A_{1,1}, performs zero elimination on the same-column block matrices A_{i,1} (i > 1), and performs data updates on the same-row block matrices A_{1,j} (j > 1). This comprises 4 operations: GEQRT, the general QR operation with transpose; ORMQR, the post-QR same-row multiplication update; TSQRT, the tall-skinny matrix QR operation with transpose; and TSMQR, the post-QR same-row tall-skinny multiplication update. GEQRT executes the general QR operation, updates A_{1,1} with the upper triangular matrix R of the QR result, and outputs the transpose Q^T of the orthogonal matrix Q. ORMQR uses the preceding GEQRT result Q^T to execute $A_{1,j} = Q^T A_{1,j}$, updating the matrix A_{1,j}. TSQRT combines the updated A_{1,1} with A_{i,1} (i > 1) into the stacked tall-skinny matrix $\begin{bmatrix} A_{1,1} \\ A_{i,1} \end{bmatrix}$, performs QR decomposition on this combined matrix, outputs a new upper triangular matrix R_{1,1} that updates A_{1,1}, and outputs the result Q^T. TSMQR mainly uses the preceding TSQRT output Q^T to execute $\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix} = Q^T \begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix}$, updating the tile pair $\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix}$; T denotes the transpose.
The computation of the first layer is completed in a multi-stage cascade, and each stage is executed in multi-core parallel.
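A simplified, single-threaded sketch of this first elimination layer follows (ours, not the patent's code). The kernel names follow the description above, but their internals are plain dense NumPy QR calls; a real TSQRT/TSMQR kernel would exploit the triangular structure and compact Q storage:

```python
import numpy as np

def geqrt(T):
    """GEQRT: general QR of a diagonal tile; returns Q^T and the R that replaces it."""
    Q, R = np.linalg.qr(T, mode='complete')
    return Q.T, R

def tsqrt(R_top, T_bottom):
    """TSQRT: QR of the stacked tall-skinny tile [R_top; T_bottom]."""
    b = R_top.shape[0]
    Q, R = np.linalg.qr(np.vstack([R_top, T_bottom]), mode='complete')
    return Q.T, R[:b, :]

def layer1(T):
    """First elimination layer over a grid of b x b tiles T[i][j]:
    GEQRT on T[0][0], ORMQR across row 0, then TSQRT/TSMQR per tile row i."""
    rows, cols, b = len(T), len(T[0]), T[0][0].shape[0]
    Qt, T[0][0] = geqrt(T[0][0])
    for j in range(1, cols):                   # ORMQR: A_{1,j} = Q^T A_{1,j}
        T[0][j] = Qt @ T[0][j]
    for i in range(1, rows):
        Qt, T[0][0] = tsqrt(T[0][0], T[i][0])  # zero-eliminate tile A_{i,1}
        T[i][0] = np.zeros((b, b))
        for j in range(1, cols):               # TSMQR: update rows 1 and i together
            stacked = Qt @ np.vstack([T[0][j], T[i][j]])
            T[0][j], T[i][j] = stacked[:b], stacked[b:]
    return T

b, rows, cols = 4, 8, 4
M = np.random.randn(rows * b, cols * b)
tiles = [[M[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(cols)] for i in range(rows)]
tiles = layer1(tiles)
assert all(np.allclose(tiles[i][0], 0) for i in range(1, rows))
```

In a real deployment the inner loops over i and j are what the patent distributes across the cores of one chip, since tiles in different columns can be updated independently once the corresponding Q^T is available.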
See FIG. 3. Single-layer multi-core parallelism in the middle-layer parallel architecture proceeds as follows. First-stage computation: core 0 and core 4 complete the general QR with transpose (GEQRT) operations in parallel. Second-stage computation, 8 cores in parallel: core 0 and core 4 complete tall-skinny matrix QR with transpose (TSQRT) operations, while the other cores complete the post-QR same-row multiplication updates (ORMQR). Third-stage computation, 8 cores in parallel: core 0 and core 4 complete TSQRT operations, while the other cores complete the post-QR tall-skinny update operations (TSMQR). Fourth-stage computation, 8 cores in parallel: core 0 and core 4 complete TSQRT operations, while the other cores complete TSMQR operations. Fifth-stage computation, 7 cores in parallel: core 0 completes the TSQRT operation on the merged tall-skinny matrix of cores 0 and 4, while the other cores complete TSMQR operations. Sixth stage: 3 cores in parallel complete TSMQR operations.
Multi-core parallel second layer: perform QR decomposition on the block square matrix A_{2,2}, update A_{2,2}, perform zero elimination on the same-column block matrices A_{i,2} (i > 2), and update the same-row block matrices A_{2,j} (j > 2). The operation steps are similar to the multi-core parallel first layer, except that the number of cores participating in the parallel computation differs.
Multi-core parallel third layer: perform QR decomposition on the block square matrix A_{3,3}, update A_{3,3}, perform zero elimination on the same-column block matrices A_{i,3} (i > 3), and update the same-row block matrices A_{i,j} (j > 3). The operation steps are similar to the multi-core parallel first layer, except that the number of cores participating in the parallel computation differs.
Multi-core parallel fourth layer: since only 1 column of block matrices remains in this layer, only A_{4,4} undergoes QR decomposition; A_{4,4} is updated and the same-column block matrices A_{i,4} (i > 4) undergo zero elimination. The operation steps comprise the general QR operation with transpose, GEQRT, and the tall-skinny matrix QR operation with transpose, TSQRT.
See FIG. 4. The bottom-layer parallel architecture mainly completes the parallel computation within one processor core. The instruction sets of currently popular multi-core processors support SIMD operations and vector and matrix operations. To make full use of instruction-set parallelism and to balance the time cost of each core across the different operations, the general QR operation with transpose (GEQRT) and the tall-skinny matrix QR operation with transpose (TSQRT) adopt an improved Gram-Schmidt (GS) orthogonalization method. The improved GS method proceeds as follows:
First step: the bottom-layer parallel architecture uses dedicated single-instruction-multiple-data (SIMD) and vector-computation instructions for parallel processing. For the leading diagonal element r_{11} of the R matrix, the vector computation $R_1 = \mathbf{a}_1^T \mathbf{a}_1$ (the dot product of the first column of A with itself) is performed; a dedicated square-root instruction computes $r_{11} = \sqrt{R_1}$, and a dedicated reciprocal instruction takes $g_1 = 1/r_{11}$. Here "dedicated" refers to the additional instruction set that the multi-core processor provides to accelerate computation; different processors each have their own specially supported instruction sets.
Second step: the bottom-layer parallel architecture uses dedicated SIMD and vector-computation instructions in parallel to compute the remaining elements of the first row of the R matrix: the vector computation $C_{1j} = \mathbf{a}_1^T \mathbf{a}_j$ (j > 1) is performed, and the first-row elements $R_{1j} = C_{1j} \cdot g_1$ (j > 1) are calculated.
Third step: the bottom-layer parallel architecture uses dedicated SIMD and vector-computation instructions in parallel to update the matrix A: with the coefficients $h_{1j} = C_{1j}/R_1$ (j > 1), the vector computation $\mathbf{a}_j \leftarrow \mathbf{a}_j - h_{1j}\,\mathbf{a}_1$ is performed on the column vectors of A. At this point all elements of the first row of the R matrix have been solved and the matrix A has been updated. Repeating these three steps completes the evaluation of the whole R matrix. In the whole improved GS algorithm, square roots and divisions are only a small part, while the bulk is vector multiply-add, so the entire computation can be greatly accelerated with parallel processing methods such as SIMD and vector-computation instructions.
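The three steps transcribe to the NumPy sketch below (ours; whole-column array operations stand in for the DSP's SIMD and vector instructions, and `np.sqrt` and `1.0 / r` stand in for the dedicated square-root and reciprocal instructions). The loop repeats the steps for every row of R:

```python
import numpy as np

def improved_gs_qr(A):
    """Column-oriented Gram-Schmidt QR following the three steps above.
    Returns Q (overwriting a copy of A) and the upper triangular R."""
    A = A.astype(float).copy()
    m, n = A.shape
    R = np.zeros((n, n))
    for k in range(n):
        R1 = A[:, k] @ A[:, k]            # step 1: dot product (vector SIMD)
        r_kk = np.sqrt(R1)                #         square-root instruction
        g = 1.0 / r_kk                    #         reciprocal instruction
        R[k, k] = r_kk
        C = A[:, k] @ A[:, k + 1:]        # step 2: rest of row k of R
        R[k, k + 1:] = C * g
        h = C / R1                        # step 3: update trailing columns of A
        A[:, k + 1:] -= np.outer(A[:, k], h)
        A[:, k] *= g                      # column k of A becomes column k of Q
    return A, R

A0 = np.random.randn(8, 5)
Q, R = improved_gs_qr(A0)
assert np.allclose(Q @ R, A0) and np.allclose(Q.T @ Q, np.eye(5))
```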
See FIG. 5. The multi-core processor cluster is formed by a switching network interconnecting the multi-core processors, each cross-linked with double-data-rate (DDR) memory; the system interconnection architecture determines the universality and expandability of the cluster. A multi-core processor typically has multiple independent processing cores; for example, the digital signal processing (DSP) chip TMS320C6678 has 8 cores. The 8 processor chips are interconnected through a switch chip (the switching network can be a high-speed communication network such as RapidIO), forming a multi-core processor cluster with 64 processing cores. Data are shared among the 8 processor nodes through each node's DDR; for example, multi-core processor chip 0 sends the data in its DDR to the DDR of multi-core processor chip 1 through the switching network. The 8 cores in a multi-core processor chip perform rapid data interaction and task synchronization through on-chip shared memory, and can also communicate with the processing cores of other nodes through off-chip DDR. Interconnecting the processor nodes through a switch chip to form the communication network favors cluster scale expansion; the cores within a processor improve the data communication efficiency and task synchronization of inter-core parallel tasks through on-chip shared memory, while different processors communicate through DDR, which favors data distribution among nodes. The interconnection mode of the whole cluster architecture favors task allocation and scheduling for the three-layer parallel architecture, with high flexibility and good expansibility.
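As a rough, node-level-only illustration (our sketch, not the patent's software; operating-system processes and the Pool's inter-process communication stand in for the DSP nodes and the RapidIO/DDR fabric described above):

```python
import numpy as np
from multiprocessing import Pool

def node_qr(Ai):
    """Work of one processor node: local QR; only the R factor leaves the node."""
    return np.linalg.qr(Ai, mode='r')

if __name__ == '__main__':
    N = 4
    A = np.random.randn(16 * N, N)
    with Pool(8) as pool:                        # 8 processes ~ 8 processor chips
        Rs = pool.map(node_qr, np.vsplit(A, 8))  # scatter blocks, gather R factors
    print([R.shape for R in Rs])                 # eight (N, N) factors to merge
```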
While the foregoing is directed to the preferred embodiment of the present invention, it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.

Claims (10)

1. A large-scale matrix QR decomposition parallel computing system, comprising: a three-level parallel structure realizing large-scale matrix QR parallel decomposition on a massively parallel computing platform, namely multi-processor-node parallelism, single-processor multi-core parallelism and single-core multi-data parallelism, wherein a first-level parallel structure with a binary-tree topology forms the top layer of the three-level structure, the second-level parallel structure forms the middle layer, and the third-level parallel structure forms the bottom layer, characterized in that: in a processor cluster system with large-scale parallel computing capability, built from multi-core processor chips together with the QR decomposition parallel computing structure, the top-level architecture uses the cluster characteristics of the multi-core processors to divide the matrix to be decomposed into a number of data segments and distributes them to the parallel nodes of the first stage through the communication network interconnecting the multi-core processor nodes; each first-stage node completes its QR decomposition task in parallel and then sends the upper triangular matrix R to a second-stage node through the network; the second-stage node completes R-matrix merging and QR decomposition and sends the R matrix to a third-stage node through the network, and so on; the nodes at each stage execute step by step following the complete binary-tree structure, and nodes of the same stage execute in parallel, completing the parallel computation flow among the multi-processor chip nodes; the middle-layer architecture partitions the matrix according to the matrix size input to the processor node and the number of processor cores, each block being a square matrix of uniform size; the whole operation proceeds layer by layer along the diagonal sub-arrays, and single-layer sub-array QR decomposition and matrix data update operations are executed in parallel by the multiple cores in the processor chip; the bottom-layer architecture uses the processor instruction set with single-instruction-multiple-data (SIMD) operations for multi-data parallel vector computation, completing single-core QR decomposition or multiplication operations; the multi-core processor cluster adopts a layer-by-layer decomposition method, realizing QR parallel decomposition of a large-scale matrix from the three-level parallel structure of inter-node parallelism across multi-processor chips, multi-core parallelism within a single processor chip, and multi-data parallelism within a single core.
2. The large-scale matrix QR decomposition parallel computing system according to claim 1, wherein: the massively parallel computing platform has at least 8 multi-core processor chips; the input matrix A to be decomposed has a size of at least 16N rows by N columns, and its sub-array segmentation can be cut flexibly according to the maximum parallel capability provided by the platform, dividing it by rows into a number of sub-array blocks A_i of 2N rows by N columns, with at most the 8 multi-core processor chips taking the matrix blocks A1-A8; the top-level architecture for the matrix A to be decomposed is cascaded according to the binary-tree 8-4-2-1 structure, and the top-level parallel architecture, a typical binary-tree architecture, completes the parallelism among the processor chip nodes; at least 17 processor nodes can execute in parallel, with at most 8 multi-core processor nodes per stage, and the 8-4-2-1 architecture builds a pipeline of at least four stages.
3. The large-scale matrix QR decomposition parallel computing system according to claim 2, wherein: in the layer-progressive direction, the first-stage pipeline processes 8 processor nodes in parallel, node 1 executing the QR decomposition of A1, node 2 executing the QR decomposition of A2, and node i executing the QR decomposition of A_i, every node A_i having the same execution time; the second-stage pipeline merges in pairs the N x N upper triangular matrices R_{1,i} output by the first stage, each merged new matrix having size 2N x N; the third-stage pipeline merges in pairs the N x N upper triangular matrices R_{2,i} output by the second stage, each merged new matrix having size 2N x N; the fourth-stage pipeline merges in pairs the N x N upper triangular matrices R_{3,i} output by the third-stage pipeline, the merged new matrix having size 2N x N.
4. The large-scale matrix QR decomposition parallel computing system according to claim 3, wherein: data merging and QR decomposition are performed by single-node operation in the second-stage pipeline; data merging and QR decomposition are performed by single-node operation in the third-stage pipeline; and the 1 processor node of the fourth-stage pipeline performs QR decomposition on the new matrix, the output result being the upper triangular R matrix of the QR decomposition.
5. The large-scale matrix QR decomposition parallel computing system according to claim 1, wherein: the middle-layer parallel architecture performs QR decomposition on the 2N-row by N-column sub-matrix block in a multi-layer progressive manner along the layer-progressive direction, each layer executing in parallel, with the multiple cores of the single-node processor completing the QR decomposition in parallel.
6. The large-scale matrix QR decomposition parallel computing system according to claim 1, wherein: the middle-layer parallel architecture performs QR decomposition on the first-layer block square matrix A_{1,1}, performs zero elimination on the same-column block matrices A_{i,1} (i > 1), and performs data updates on the same-row block matrices A_{1,j} (j > 1), comprising 4 operations: the general QR operation with transpose, GEQRT; the post-QR same-row multiplication update, ORMQR; the tall-skinny matrix QR operation with transpose, TSQRT; and the post-QR same-row tall-skinny multiplication update, TSMQR.
7. The large-scale matrix QR decomposition parallel computing system according to claim 6, wherein: the general QR operation with transpose, GEQRT, executes the general QR operation, updates A_{1,1} with the upper triangular matrix R of the QR result, and outputs the transpose Q^T of the orthogonal matrix Q; the post-QR same-row multiplication update, ORMQR, uses the preceding GEQRT result Q^T to execute $A_{1,j} = Q^T A_{1,j}$, updating the matrix A_{1,j}; the tall-skinny matrix QR operation with transpose, TSQRT, combines the updated A_{1,1} with A_{i,1} (i > 1) into the stacked matrix $\begin{bmatrix} A_{1,1} \\ A_{i,1} \end{bmatrix}$, performs QR decomposition on this combined tall-skinny matrix, outputs a new upper triangular matrix R_{1,1} that updates A_{1,1}, and outputs the result Q^T; the post-QR tall-skinny multiplication same-row update, TSMQR, mainly uses the preceding TSQRT output Q^T to execute $\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix} = Q^T \begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix}$, updating $\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix}$; T denotes the transpose.
8. The large-scale matrix QR decomposition parallel computing system according to claim 1, wherein: in the single-layer multi-core parallelism of the middle-layer parallel architecture, the first-stage computation has core 0 and core 4 complete the GEQRT operations in parallel; the second-stage computation, with 8 cores in parallel, has core 0 and core 4 complete TSQRT operations while the other cores complete ORMQR operations; the third-stage computation, with 8 cores in parallel, has core 0 and core 4 complete TSQRT operations while the other cores complete TSMQR operations; the fourth-stage computation, with 8 cores in parallel, has core 0 and core 4 complete TSQRT operations while the other cores complete TSMQR operations; the fifth-stage computation, with 7 cores in parallel, has core 0 complete the TSQRT operation on the merged tall-skinny matrix of cores 0 and 4 while the other cores complete TSMQR operations; and in the sixth stage, 3 cores in parallel complete TSMQR operations.
9. The large-scale matrix QR decomposition parallel computing system according to claim 8, wherein: the multi-core parallel second layer performs QR decomposition on the block square matrix A_{2,2}, updates A_{2,2}, performs zero elimination on the same-column block matrices A_{i,2} (i > 2), and updates the same-row block matrices A_{2,j} (j > 2); the multi-core parallel third layer performs QR decomposition on the block square matrix A_{3,3}, updates A_{3,3}, performs zero elimination on the same-column block matrices A_{i,3} (i > 3), and updates the same-row block matrices A_{i,j} (j > 3); in the multi-core parallel fourth layer, since only 1 column of block matrices remains, only A_{4,4} undergoes QR decomposition, A_{4,4} is updated, and the same-column block matrices A_{i,4} (i > 4) undergo zero elimination.
10. The large-scale matrix QR decomposition parallel computing system according to claim 9, wherein: the bottom-layer parallel architecture uses single-instruction-multiple-data (SIMD) and vector-computation instructions for parallel processing; for the leading diagonal element r_{11} of the R matrix, the vector computation $R_1 = \mathbf{a}_1^T \mathbf{a}_1$ is performed, the square-root instruction computes the diagonal element $r_{11} = \sqrt{R_1}$, and a dedicated reciprocal instruction takes $g_1 = 1/r_{11}$; for the other elements of the first row of the R matrix, the vector computation $C_{1j} = \mathbf{a}_1^T \mathbf{a}_j$ is performed and the first-row elements $R_{1j} = C_{1j} \cdot g_1$ (j > 1) are calculated; the matrix A is updated with the coefficients $h_{1j} = C_{1j}/R_1$ (j > 1) by the vector computation $\mathbf{a}_j \leftarrow \mathbf{a}_j - h_{1j}\,\mathbf{a}_1$ on the column vectors of the matrix A, completing the solution of all elements of the first row of the R matrix and the update of the matrix A; these steps are repeated to complete the evaluation of the whole R matrix.
CN202010609939.3A 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system Active CN111858465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010609939.3A CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010609939.3A CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Publications (2)

Publication Number Publication Date
CN111858465A CN111858465A (en) 2020-10-30
CN111858465B true CN111858465B (en) 2023-06-06

Family

ID=72989894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010609939.3A Active CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Country Status (1)

Country Link
CN (1) CN111858465B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488506A (en) * 2020-11-30 2021-03-12 哈尔滨工程大学 Extensible distributed architecture and self-organizing method of intelligent unmanned system cluster
CN112506677B (en) * 2020-12-09 2022-09-23 上海交通大学 TensorFlow distributed matrix calculation implementation method and system
CN112631986B (en) * 2020-12-28 2024-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device
CN115952391A (en) * 2022-12-12 2023-04-11 海光信息技术股份有限公司 Data processing method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193831A (en) * 2010-03-12 2011-09-21 复旦大学 Method for establishing hierarchical mapping/reduction parallel programming model
CA2801382A1 (en) * 2010-06-29 2012-01-05 Exxonmobil Upstream Research Company Method and system for parallel simulation models
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CN107210984A (en) * 2015-01-30 2017-09-26 华为技术有限公司 Method and apparatus for carrying out the parallel operation based on QRD in many execution unit processing systems
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wu Rongteng. A parallel algorithm for matrix triangular decomposition on multi-core and multi-GPU systems. Journal of Minjiang University, 2016, 37(05): 65-71. *
Wu Yong, Wang Jun, Zhang Peichuan, Cao Yunhe. A parallel clutter-suppression algorithm for passive (external-illuminator) radar under the CUDA architecture. Journal of Xidian University, 2014, 42(01): 104-111. *
Mu Shuai, Wang Chenxi, Deng Yangdong. Research on a GPU-based multi-level parallel QR decomposition algorithm. Computer Simulation, 2013, 30(09): 234-238. *

Also Published As

Publication number Publication date
CN111858465A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111858465B (en) Large-scale matrix QR decomposition parallel computing system
Ryu et al. Bitblade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
Wang et al. FPDeep: Scalable acceleration of CNN training on deeply-pipelined FPGA clusters
CN108170639B (en) Tensor CP decomposition implementation method based on distributed environment
CN109635241B (en) Method for solving symmetric or hermitian symmetric positive definite matrix inverse matrix
Zhou et al. Accelerating large-scale single-source shortest path on FPGA
Liu et al. WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs
CN114781632A (en) Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
CN110851779A (en) Systolic array architecture for sparse matrix operations
Shi et al. Efficient sparse-dense matrix-matrix multiplication on GPUs using the customized sparse storage format
Asgari et al. Meissa: Multiplying matrices efficiently in a scalable systolic architecture
US20180373677A1 (en) Apparatus and Methods of Providing Efficient Data Parallelization for Multi-Dimensional FFTs
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
Zhang et al. Low-latency mini-batch gnn inference on cpu-fpga heterogeneous platform
Ren et al. Exploration of alternative GPU implementations of the pair-HMMs forward algorithm
CN116710912A (en) Matrix multiplier and control method thereof
Chen et al. The parallel algorithm implementation of matrix multiplication based on ESCA
Han et al. EGCN: An efficient GCN accelerator for minimizing off-chip memory access
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
Wang et al. An efficient architecture for floating-point eigenvalue decomposition
WO2021217502A1 (en) Computing architecture
CN113986816A (en) Reconfigurable computing chip
Takahashi et al. Performance of the block Jacobi method for the symmetric eigenvalue problem on a modern massively parallel computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant