CN111858465B - Large-scale matrix QR decomposition parallel computing system - Google Patents


Info

Publication number
CN111858465B
CN111858465B (application CN202010609939.3A)
Authority
CN
China
Prior art keywords
matrix
parallel
decomposition
core
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010609939.3A
Other languages
Chinese (zh)
Other versions
CN111858465A (en)
Inventor
吴明钦
刘红伟
潘灵
贾明权
郝黎宏
林勤
张昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc filed Critical Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202010609939.3A priority Critical patent/CN111858465B/en
Publication of CN111858465A publication Critical patent/CN111858465A/en
Application granted granted Critical
Publication of CN111858465B publication Critical patent/CN111858465B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007: Single instruction multiple data [SIMD] multiprocessors
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76: Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/78: Arrangements for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a large-scale matrix QR decomposition parallel computing structure in the field of digital signal processing, and aims to provide a three-level parallel computing structure with clear parallel logic, high throughput and low delay. It is realized by the following technical scheme: a processor cluster system and a QR decomposition parallel computing structure are built from multi-core processor chips. The top-layer architecture divides the matrix to be decomposed into a number of data segments and distributes them to the nodes of each stage over the communication network interconnecting the multi-core processor nodes; the nodes compute stage by stage following a complete binary-tree structure, with nodes of the same stage computing in parallel. The middle-layer architecture partitions the matrix into blocks and operates layer by layer along the diagonal sub-arrays. The bottom-layer architecture uses the processor instruction set for multi-data parallel vector computation to complete the single-core QR decomposition and multiplication operations. The multi-core processor cluster thus adopts a layer-by-layer decomposition structure to realize QR parallel decomposition of a large-scale matrix.

Description

Large-scale matrix QR decomposition parallel computing system
Technical Field
The invention relates to large-scale array antenna signal processing in the field of digital signal processing, and in particular to a QR decomposition method for the cluster parallel structure of a large-scale multi-core processor, for high-performance parallel computing in the field of numerical computation.
Background
In the field of digital signal processing, algorithms such as large-scale array antenna signal processing and large-scale Multiple Input Multiple Output (MIMO) processing often involve covariance matrix inversion, channel matrix estimation, channel equalization and similar problems, and QR decomposition is widely used in all of these. MIMO is a cornerstone of wireless network evolution; in fact, MIMO technology has become one of the most critical elements of many wireless communication standards, such as IEEE 802.11n and 3GPP-LTE. An efficient QR decomposition unit reduces the implementation complexity of a MIMO system and thereby yields good computational performance. Covariance matrix inversion is one of the most commonly used spatial interference suppression algorithms. However, when the interference power is too large, the sampling covariance matrix becomes singular, the matrix inversion lemma fails, and effective interference suppression weights cannot be generated. Because QR decomposition effectively improves the condition number of the matrix and its numerical stability, inverting the sampling covariance matrix via QR decomposition avoids the situation where overly strong interference prevents effective suppression weights from being obtained. The computational performance of QR decomposition therefore has a direct impact on the signal processing performance of an interference suppression system. QR decomposition is a principal tool of digital signal processing, plays an important role in high-performance computing, and is an important index of system performance. Matrix operations are one of the core problems of high-performance computing, matrix decomposition is an important route to increasing the parallelism of matrix operations, and QR decomposition is an important form of matrix decomposition. By definition A = QR, where Q is an orthogonal matrix and R is a non-singular upper triangular matrix (i.e., all elements of R below the diagonal are 0); the decomposition is unique when the diagonal elements of R are required to be positive.
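For illustration (this sketch is ours, not part of the patent), the definition above can be checked numerically with NumPy, including the positive-diagonal convention that makes the factorization unique:

```python
import numpy as np

# Minimal check of the QR definition: A = QR with Q orthogonal, R upper triangular.
A = np.random.randn(6, 4)
Q, R = np.linalg.qr(A)                      # reduced QR: Q is 6x4, R is 4x4

assert np.allclose(Q.T @ Q, np.eye(4))      # Q has orthonormal columns
assert np.allclose(R, np.triu(R))           # R is upper triangular
assert np.allclose(Q @ R, A)                # A is reconstructed exactly

# Uniqueness convention: flip signs so that diag(R) > 0 (A = QR is preserved,
# since the sign matrix D satisfies D @ D = I).
s = np.sign(np.diag(R))
s[s == 0] = 1.0
Q, R = Q * s, s[:, None] * R
assert np.all(np.diag(R) > 0) and np.allclose(Q @ R, A)
```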
There are many ways to compute a QR decomposition in practice, such as Givens rotations, Householder transformations and Gram-Schmidt orthogonalization, each with its own advantages and disadvantages; QR decomposition is typically implemented with one of these three methods. The Householder QR algorithm obtains a reflection matrix and zeroes the elements below the diagonal through matrix multiplication, so the process needs a large amount of matrix-multiply computation, which increases algorithmic complexity. The Givens-rotation QR method updates the matrix through rotation matrices; it can split the work into row updates and, although it involves division and square-root operations, its complexity is lower than the Householder approach. Because the amount of computation in QR decomposition is huge, the decomposition consumes a large amount of time and becomes the bottleneck for improving the performance of many practical applications. For example, in cognitive radio, QR decomposition is the most time-consuming module of Singular Value Decomposition (SVD): published data show that QR decomposition accounts for more than 70% of the overall SVD operation time.
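As a minimal sketch of the row-update structure of Givens-rotation QR mentioned above (our illustration, not the patent's kernel; the helper names `givens` and `qr_givens` are hypothetical):

```python
import numpy as np

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

def qr_givens(A):
    """Unoptimized QR via Givens rotations: each rotation is a 2-row update."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):            # zero R[i, j] against R[i-1, j]
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
    return Q, R

A = np.random.randn(5, 3)
Q, R = qr_givens(A)
assert np.allclose(Q @ R, A) and np.allclose(np.tril(R, -1), 0)
```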
Large-scale matrix QR decomposition is widely used in signal processing, image processing, computational structural mechanics and related fields. Because the operation count of large-scale QR decomposition is huge and the algorithm structure is complex and not naturally parallel, the traditional approach is to implement it on a high-performance supercomputing platform based on the X86 architecture. Task allocation and data synchronization for large-scale QR decomposition on such a distributed computing platform lead to long communication times and cannot meet ms-level real-time processing requirements. In recent years, many small- and medium-scale matrix QR decompositions have been implemented in FPGAs, usually with a systolic-array hardware structure; the parallelism and real-time performance are good, but as matrix sizes grow rapidly, FPGA-based QR decomposition is limited by chip area and power consumption, cannot meet the high throughput requirement of large-scale matrix QR decomposition, and has a long development cycle.
The key to wide application of large-scale matrix QR decomposition is to reduce processing delay while improving throughput. With the large growth in front-end sensor scale, the continuous increase in sampling rate and the continuous growth of the channel matrix, conventional QR decomposition methods can no longer meet the throughput and real-time processing requirements of large-scale QR decomposition. Current parallel-computing research on QR decomposition sits at two extremes. First, in scientific computing, popular distributed parallel architectures are used to decompose ultra-large matrices on platforms such as Hadoop; this achieves large-scale throughput but cannot meet real-time requirements and is not applicable to embedded equipment with strict demands on low power, flexibility and high reliability. Second, special matrix-decomposition parallel processors are designed for a specific hardware structure, typified by FPGA implementations of QR decomposition, in pursuit of extremely fast processing; this approach has a long development cycle, limited chip-scale performance, difficult expansion and limited throughput, and is not applicable to large-scale QR decomposition.
For most engineering applications, a system hardware architecture with strong expansibility needs to be built flexibly and rapidly from existing mature chips. Processor chips have evolved from single-core to multi-core, and each core typically has Single Instruction Multiple Data (SIMD) parallel computing capability and vector computing capability. At the current international state of the art, hundreds of lightweight cores can be integrated in a single chip; taking TI's TMS320C6678 DSP as an example, it integrates 8 high-performance DSP cores, and single-chip processing capacity will be further improved both by raising single-core performance and by increasing the core count. Multi-core technology is therefore becoming widely available and a hot spot of CPU/DSP development. Technically, processor multi-core parallelism is an important method of realizing software parallelism, and multiple cores are an important route to improving processor performance; the multi-core structure is very suitable for parallel computing tasks. Increasing the number of cores and enlarging the on-chip memory are the main means by which current commercial multi-core DSPs improve their computing power. The multi-core cluster is the mainstream massively parallel computing system at present: many multi-core processor chips connected by a high-speed interconnection network form a cluster with extremely strong parallel computing capability. However, the conventional QR decomposition algorithm cannot fully utilize the parallel processing capability of a multi-core cluster system, and it is difficult to realize its performance advantage.
Disclosure of Invention
The aim of the invention is a three-level parallel computing structure with clear parallel logic, high expansibility and portability, high throughput, low delay and high universality, which fully utilizes the parallel processing advantages of a multi-core processor cluster to realize parallelism among multi-processor nodes, among the cores of a single processor, and among multiple data within a single core, for applications that involve large-scale QR decomposition on a large multi-core processor cluster and simultaneously require real-time processing. This solves the problem that the traditional QR decomposition method cannot effectively use multi-core processor cluster resources for large-scale parallel computation.
The invention adopts the following technical scheme. A large-scale matrix QR decomposition parallel computing system comprises a three-level parallel structure realizing large-scale matrix QR parallel decomposition: parallelism among multi-processor nodes, multi-core parallelism within a single processor, and multi-data parallelism within a single core. The first-level parallel structure, with a binary-tree topology, forms the top layer of the three-level structure; the second-level parallel structure forms the middle layer; and the third-level parallel structure forms the bottom layer. In a processor cluster system with large-scale parallel computing capability, built from multi-core processor chips together with the QR decomposition parallel computing structure, the top-level architecture uses the cluster characteristics of the multi-core processors to divide the matrix to be decomposed into a number of data segments and distributes them to the parallel nodes of the first stage through the communication network interconnecting the multi-core processor nodes. Each first-stage node completes its QR decomposition task in parallel and then sends the upper triangular matrix R to a second-stage node over the network; the second-stage node completes R-matrix merging and QR decomposition and sends its R matrix on to a third-stage node, and so on. The nodes at each stage execute step by step following the complete binary-tree structure, and nodes of the same stage execute in parallel, completing the parallel computation flow among the multi-processor chip nodes. The middle-layer architecture partitions the matrix into sub-array blocks according to the matrix size input to the processor node and the number of processor cores; each sub-array is a square block of uniform size, the whole operation proceeds layer by layer along the diagonal sub-arrays, and single-layer sub-array QR decomposition and matrix data update operations are executed in parallel by the multiple cores in the processor chip. The bottom-layer architecture uses the SIMD-capable processor instruction set for multi-data parallel vector computation to complete the single-core QR decomposition and multiplication operations. The multi-core processor cluster adopts this layer-by-layer decomposition method to realize QR parallel decomposition of a large-scale matrix from the three-level parallel structure: inter-node parallelism across multi-processor chips, multi-core parallelism within a single processor chip, and multi-data parallelism within a single core.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
The parallel logic is clear, and expansibility and portability are strong. The invention fully utilizes the parallel computing capability of a multi-core processor cluster, adopting a three-layer parallel architecture to realize the QR decomposition parallel computing structure on a processor cluster system built from multi-core processor chips. The nodes at each stage execute step by step following the complete binary-tree structure, and peer nodes execute in parallel, so multi-processor cluster deployment is easy to realize. The three-layer parallel architecture is clearly structured and easy to implement, and meets system throughput and real-time processing requirements simultaneously. Experimental results show that, compared with a prior-art software implementation on a general-purpose processor, the parallel QR decomposition obtains a speed-up of more than 20 times in core computing performance and offers higher data and pipeline parallelism for the matrix triangularization process.
According to the invention, a communication network is formed by interconnection switch chips among the multi-core processor nodes, so the cluster scale is easy to expand, and the shared memory among the cores within a single processor node significantly improves the data communication efficiency and task synchronization of parallel tasks among the cores. The overall cluster architecture favors task allocation and scheduling for the three-layer parallel architecture, and has high flexibility, expansibility and portability.
The invention is based on a multi-core processor cluster and adopts a layer-by-layer decomposition structure to realize multi-node parallelism, single-node multi-core parallelism and single-core multi-data parallelism. The multi-node parallelism, with its binary-tree structure, fully utilizes the parallel processing advantage of the processor cluster and meets the throughput requirement of large-scale matrix QR decomposition; for the same processing time, an exponential increase in the size of the matrix to be processed can be obtained at the cost of a linear increase in cluster scale. The multi-core parallelism fully utilizes the processor's core resources to realize single-layer sub-array QR decomposition and data-matrix update operations in parallel, greatly raising the real-time computing capability and decomposition speed of QR at a single node, with processing performance scaling with the number of cores. The multi-data parallelism within a single core exploits the SIMD and vector computing capability of the processor instruction set, and the improved GS algorithm raises single-core QR decomposition performance by a factor of 4. The three-layer parallel architecture is very suitable for multi-core processor clusters: it meets the throughput requirement of large-scale matrix QR decomposition while greatly reducing processing delay, giving real-time processing capability.
The invention comprehensively considers the communication, storage and computing resources of the multi-core processor together with its instruction-level SIMD, matrix and vector computing capabilities, and realizes large-scale matrix QR decomposition in parallel from the three-level structure of multi-processor-node parallelism, single-chip multi-core parallelism and single-core multi-data parallelism. Shared internal memory realizes multi-core communication and task synchronization inside the processor, while network switch chips allow the cluster scale to be expanded arbitrarily for inter-chip communication; task-level and data-level parallelism in the matrix decomposition process can be handled simultaneously, the processor's SIMD and vector-computation instruction sets are fully exploited for parallel operation, and QR decomposition parallel tasks are scheduled flexibly. The multi-level parallel computing structure has low communication cost and obvious computing acceleration, remarkably improves computing parallelism and cluster communication flexibility, and can achieve a performance improvement of more than 20 times in the numerical computation of weight updating in large-scale array signal processing. The method is very well suited to a massively parallel computing cluster built around multi-core DSPs.
The method is greatly superior to existing QR decomposition methods, has outstanding engineering application value, and is very suitable for large-scale multi-core processor cluster computing.
Drawings
FIG. 1 is a block diagram of the top-level parallel structure in the three-level parallel structure of the large-scale matrix QR decomposition of the present invention;
FIG. 2 is a block diagram of the layer-by-layer progression of the middle-layer parallel structure in the large-scale matrix QR decomposition three-level parallel structure;
FIG. 3 is a block diagram of single-layer multi-core parallel computing in the middle-layer parallel architecture of FIG. 2;
FIG. 4 is a block diagram of the layer-by-layer progression of the bottom-layer parallel structure of the large-scale QR decomposition three-level parallel structure;
FIG. 5 is a block diagram of the connections between multi-core processor cluster nodes and between the cores of a processor.
Detailed Description
See FIGS. 1-3. In the preferred embodiment described below, a large-scale matrix QR decomposition parallel computing system comprises a three-level parallel structure realizing large-scale matrix QR parallel decomposition: a processor-node parallel structure, a processor-core parallel structure and single-core instruction-level parallelism. The first-level parallel structure, with a binary-tree topology, forms the top layer of the three-level structure; the second-level parallel structure forms the middle layer; and the third-level parallel structure forms the bottom layer. In a processor cluster system with large-scale parallel computing capability, built from multi-core processor chips together with the QR decomposition parallel computing structure, the top-level architecture uses the multi-core processor cluster to divide the matrix to be decomposed into data segments and distributes them to the parallel nodes of the first stage through a communication network built around a switch chip connecting the multi-core processor nodes. Each first-stage node completes its QR decomposition task in parallel and outputs the upper triangular matrix R to a second-stage node; the second-stage node completes R-matrix merging and QR decomposition and sends its R matrix to the third-stage node. The nodes execute step by step following the complete binary-tree structure, and nodes of the same stage execute in parallel, completing the parallel computation flow among the processor chip nodes. The middle-layer architecture partitions the matrix according to the matrix size input to the processor node; each block is a square sub-array of uniform size, QR decomposition and matrix data update operations take a sub-array or sub-array combination as their object, the whole operation proceeds layer by layer along the diagonal sub-arrays, and single-layer sub-array QR decomposition and update operations are executed in parallel by the multiple cores in the processor chip. The bottom-layer architecture uses the single-instruction-multiple-data (SIMD) capable processor instruction set for vector computation to complete single-core QR decomposition and multiplication operations. The large-scale matrix QR decomposition adopts this layer-by-layer method and realizes QR parallel decomposition from the three-level parallel structure of inter-node parallelism across multi-core processor chips, multi-core parallelism within a chip, and multi-data parallelism within a single core.
In this optional embodiment, the parallel architecture for large-scale matrix QR parallel decomposition adopts a three-layer structure of top layer, middle layer and bottom layer.
In an alternative embodiment, the top-level parallel architecture shown in FIG. 1 is a typical binary-tree architecture, mainly accomplishing parallelism among processor chip nodes. The massively parallel computing platform has at least 8 multi-core processor chips. The input matrix A to be decomposed has a size of at least 16N rows by N columns; its sub-array segmentation can be cut flexibly according to the maximum parallel capability provided by the platform, dividing it by rows into a number of sub-array blocks A_i of 2N rows by N columns, so that at most the 8 multi-core processor chips take the matrix blocks A1-A8. The top-level architecture for the matrix A to be decomposed is cascaded according to the binary-tree 8-4-2-1 structure, and the top-level parallel architecture, a typical binary-tree architecture, completes the parallelism among the processor chip nodes. To greatly increase platform throughput, at least 17 processor nodes can execute in parallel, with up to 8 multi-core processor nodes per stage executing simultaneously, and the 8-4-2-1 architecture builds a pipeline of at least 4 stages.
The pipeline of the 8-4-2-1 architecture performs the following steps in order (a sketch of this tree reduction follows the list):
Progressing in the pipeline cascade direction, the first-stage pipeline processes 8 processor nodes in parallel: node 1 executes the QR decomposition of A1, node 2 executes the QR decomposition of A2, and node i executes the QR decomposition of A_i. Because the matrix sizes are the same, every node A_i has the same execution time, giving good parallel performance.
Second-stage pipeline: the upper triangular matrices R_{1,i} of size N x N output by the first stage are merged in pairs; each merged new matrix has size 2N x N, and 4 processor nodes perform QR decomposition on the new matrices simultaneously in parallel.
Third-stage pipeline: the upper triangular matrices R_{2,i} of size N x N output by the second-stage pipeline are merged in pairs; each merged new matrix has size 2N x N, and 2 processor nodes perform QR decomposition on the new matrices simultaneously in parallel.
Fourth-stage pipeline: the upper triangular matrices R_{3,i} of size N x N output by the third stage are merged in pairs; the merged new matrix has size 2N x N, 1 processor node performs QR decomposition on it, and the output result is the upper triangular R matrix of the QR decomposition.
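A minimal sketch of this 8-4-2-1 tree reduction is given below (ours, not the patent's code). Each local QR call stands in for one multi-core processor node, and only the N x N R factors travel between pipeline stages, as described above:

```python
import numpy as np

def tsqr_tree(A, leaves=8):
    """8-4-2-1 binary-tree reduction: local QR per leaf node, then merge
    the R factors pairwise until a single upper triangular R remains."""
    blocks = np.vsplit(A, leaves)                        # A1..A8, each 2N x N
    Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks]   # stage 1: 8 nodes
    while len(Rs) > 1:                                   # stages 2-4: 4, 2, 1 nodes
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')    # merged matrix is 2N x N
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

N = 4
A = np.random.randn(16 * N, N)                 # at least 16N rows by N columns
R_tree = tsqr_tree(A)
R_direct = np.linalg.qr(A, mode='r')
# The two R factors agree up to the sign of each row.
assert np.allclose(np.abs(R_tree), np.abs(R_direct))
```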
See FIG. 2. The middle-layer parallel architecture completes QR decomposition using the multiple cores of a single-node processor, and the time of a single-processor-node QR decomposition directly determines the processing delay of the top-level parallel architecture. The middle-layer architecture decomposes the 2N-row by N-column sub-matrix block in a multi-layer progressive manner, each layer executing in parallel. The middle layer partitions the 2N x N matrix into blocks according to the number of cores of the single-node processor, each block matrix A_{i,j} being a square tile. (The matrix transpose is defined mathematically as follows: let A be an m x n matrix, i.e., m rows and n columns, whose element in row i and column j is a_{i,j}; then the n x m matrix B = A^T satisfies b_{j,i} = a_{i,j}.) Assuming a multi-core processor chip has 8 cores, each block matrix A_{i,j} is a square tile of 2N/8 rows by 2N/8 columns. The steps proceed in order as follows:
The middle-layer parallel architecture performs QR decomposition on the first-layer block square matrix A_{1,1}, performs zero elimination on the same-column block matrices A_{i,1} (i > 1), and performs data updates on the same-row block matrices A_{1,j} (j > 1). This comprises 4 operations: GEQRT, the general QR operation with transpose; ORMQR, the post-QR same-row multiplication update; TSQRT, the tall-skinny matrix QR operation with transpose; and TSMQR, the post-QR same-row tall-skinny multiplication update. GEQRT executes the general QR operation, updates A_{1,1} with the upper triangular matrix R of the QR result, and outputs the transpose Q^T of the orthogonal matrix Q. ORMQR uses the preceding GEQRT result Q^T to execute $A_{1,j} = Q^T A_{1,j}$, updating the matrix A_{1,j}. TSQRT combines the updated A_{1,1} with A_{i,1} (i > 1) into the stacked tall-skinny matrix $\begin{bmatrix} A_{1,1} \\ A_{i,1} \end{bmatrix}$, performs QR decomposition on this combined matrix, outputs a new upper triangular matrix R_{1,1} that updates A_{1,1}, and outputs the result Q^T. TSMQR mainly uses the preceding TSQRT output Q^T to execute $\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix} = Q^T \begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix}$, updating the tile pair $\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix}$; T denotes the transpose.
The computation of the first layer is completed in a multi-stage cascade, and each stage is executed in multi-core parallel.
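A simplified, single-threaded sketch of this first elimination layer follows (ours, not the patent's code). The kernel names follow the description above, but their internals are plain dense NumPy QR calls; a real TSQRT/TSMQR kernel would exploit the triangular structure and compact Q storage:

```python
import numpy as np

def geqrt(T):
    """GEQRT: general QR of a diagonal tile; returns Q^T and the R that replaces it."""
    Q, R = np.linalg.qr(T, mode='complete')
    return Q.T, R

def tsqrt(R_top, T_bottom):
    """TSQRT: QR of the stacked tall-skinny tile [R_top; T_bottom]."""
    b = R_top.shape[0]
    Q, R = np.linalg.qr(np.vstack([R_top, T_bottom]), mode='complete')
    return Q.T, R[:b, :]

def layer1(T):
    """First elimination layer over a grid of b x b tiles T[i][j]:
    GEQRT on T[0][0], ORMQR across row 0, then TSQRT/TSMQR per tile row i."""
    rows, cols, b = len(T), len(T[0]), T[0][0].shape[0]
    Qt, T[0][0] = geqrt(T[0][0])
    for j in range(1, cols):                   # ORMQR: A_{1,j} = Q^T A_{1,j}
        T[0][j] = Qt @ T[0][j]
    for i in range(1, rows):
        Qt, T[0][0] = tsqrt(T[0][0], T[i][0])  # zero-eliminate tile A_{i,1}
        T[i][0] = np.zeros((b, b))
        for j in range(1, cols):               # TSMQR: update rows 1 and i together
            stacked = Qt @ np.vstack([T[0][j], T[i][j]])
            T[0][j], T[i][j] = stacked[:b], stacked[b:]
    return T

b, rows, cols = 4, 8, 4
M = np.random.randn(rows * b, cols * b)
tiles = [[M[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(cols)] for i in range(rows)]
tiles = layer1(tiles)
assert all(np.allclose(tiles[i][0], 0) for i in range(1, rows))
```

In a real deployment the inner loops over i and j are what the patent distributes across the cores of one chip, since tiles in different columns can be updated independently once the corresponding Q^T is available.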
See FIG. 3. Single-layer multi-core parallelism in the middle-layer parallel architecture proceeds as follows. First-stage computation: core 0 and core 4 complete the general QR with transpose (GEQRT) operations in parallel. Second-stage computation, 8 cores in parallel: core 0 and core 4 complete tall-skinny matrix QR with transpose (TSQRT) operations, while the other cores complete the post-QR same-row multiplication updates (ORMQR). Third-stage computation, 8 cores in parallel: core 0 and core 4 complete TSQRT operations, while the other cores complete the post-QR tall-skinny update operations (TSMQR). Fourth-stage computation, 8 cores in parallel: core 0 and core 4 complete TSQRT operations, while the other cores complete TSMQR operations. Fifth-stage computation, 7 cores in parallel: core 0 completes the TSQRT operation on the merged tall-skinny matrix of cores 0 and 4, while the other cores complete TSMQR operations. Sixth stage: 3 cores in parallel complete TSMQR operations.
Multi-core parallel second layer: perform QR decomposition on the block square matrix A_{2,2}, update A_{2,2}, perform zero elimination on the same-column block matrices A_{i,2} (i > 2), and update the same-row block matrices A_{2,j} (j > 2). The operation steps are similar to the multi-core parallel first layer, except that the number of cores participating in the parallel computation differs.
Multi-core parallel third layer: perform QR decomposition on the block square matrix A_{3,3}, update A_{3,3}, perform zero elimination on the same-column block matrices A_{i,3} (i > 3), and update the same-row block matrices A_{i,j} (j > 3). The operation steps are similar to the multi-core parallel first layer, except that the number of cores participating in the parallel computation differs.
Multi-core parallel fourth layer: since only 1 column of block matrices remains in this layer, only A_{4,4} undergoes QR decomposition; A_{4,4} is updated and the same-column block matrices A_{i,4} (i > 4) undergo zero elimination. The operation steps comprise the general QR operation with transpose, GEQRT, and the tall-skinny matrix QR operation with transpose, TSQRT.
See FIG. 4. The bottom-layer parallel architecture mainly completes the parallel computation within one processor core. The instruction sets of currently popular multi-core processors support SIMD operations and vector and matrix operations. To make full use of instruction-set parallelism and to balance the time cost of each core across the different operations, the general QR operation with transpose (GEQRT) and the tall-skinny matrix QR operation with transpose (TSQRT) adopt an improved Gram-Schmidt (GS) orthogonalization method. The improved GS method proceeds as follows:
First step: the bottom-layer parallel architecture uses dedicated single-instruction-multiple-data (SIMD) and vector-computation instructions for parallel processing. For the leading diagonal element r_{11} of the R matrix, the vector computation $R_1 = \mathbf{a}_1^T \mathbf{a}_1$ (the dot product of the first column of A with itself) is performed; a dedicated square-root instruction computes $r_{11} = \sqrt{R_1}$, and a dedicated reciprocal instruction takes $g_1 = 1/r_{11}$. Here "dedicated" refers to the additional instruction set that the multi-core processor provides to accelerate computation; different processors each have their own specially supported instruction sets.
Second step: the bottom-layer parallel architecture uses dedicated SIMD and vector-computation instructions in parallel to compute the remaining elements of the first row of the R matrix: the vector computation $C_{1j} = \mathbf{a}_1^T \mathbf{a}_j$ (j > 1) is performed, and the first-row elements $R_{1j} = C_{1j} \cdot g_1$ (j > 1) are calculated.
Third step: the bottom-layer parallel architecture uses dedicated SIMD and vector-computation instructions in parallel to update the matrix A: with the coefficients $h_{1j} = C_{1j}/R_1$ (j > 1), the vector computation $\mathbf{a}_j \leftarrow \mathbf{a}_j - h_{1j}\,\mathbf{a}_1$ is performed on the column vectors of A. At this point all elements of the first row of the R matrix have been solved and the matrix A has been updated. Repeating these three steps completes the evaluation of the whole R matrix. In the whole improved GS algorithm, square roots and divisions are only a small part, while the bulk is vector multiply-add, so the entire computation can be greatly accelerated with parallel processing methods such as SIMD and vector-computation instructions.
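The three steps transcribe to the NumPy sketch below (ours; whole-column array operations stand in for the DSP's SIMD and vector instructions, and `np.sqrt` and `1.0 / r` stand in for the dedicated square-root and reciprocal instructions). The loop repeats the steps for every row of R:

```python
import numpy as np

def improved_gs_qr(A):
    """Column-oriented Gram-Schmidt QR following the three steps above.
    Returns Q (overwriting a copy of A) and the upper triangular R."""
    A = A.astype(float).copy()
    m, n = A.shape
    R = np.zeros((n, n))
    for k in range(n):
        R1 = A[:, k] @ A[:, k]            # step 1: dot product (vector SIMD)
        r_kk = np.sqrt(R1)                #         square-root instruction
        g = 1.0 / r_kk                    #         reciprocal instruction
        R[k, k] = r_kk
        C = A[:, k] @ A[:, k + 1:]        # step 2: rest of row k of R
        R[k, k + 1:] = C * g
        h = C / R1                        # step 3: update trailing columns of A
        A[:, k + 1:] -= np.outer(A[:, k], h)
        A[:, k] *= g                      # column k of A becomes column k of Q
    return A, R

A0 = np.random.randn(8, 5)
Q, R = improved_gs_qr(A0)
assert np.allclose(Q @ R, A0) and np.allclose(Q.T @ Q, np.eye(5))
```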
See FIG. 5. The multi-core processor cluster is formed by a switching network interconnecting the multi-core processors, each cross-linked with double-data-rate (DDR) memory; the system interconnection architecture determines the universality and expandability of the cluster. A multi-core processor typically has multiple independent processing cores; for example, the digital signal processing (DSP) chip TMS320C6678 has 8 cores. The 8 processor chips are interconnected through a switch chip (the switching network can be a high-speed communication network such as RapidIO), forming a multi-core processor cluster with 64 processing cores. Data are shared among the 8 processor nodes through each node's DDR; for example, multi-core processor chip 0 sends the data in its DDR to the DDR of multi-core processor chip 1 through the switching network. The 8 cores in a multi-core processor chip perform rapid data interaction and task synchronization through on-chip shared memory, and can also communicate with the processing cores of other nodes through off-chip DDR. Interconnecting the processor nodes through a switch chip to form the communication network favors cluster scale expansion; the cores within a processor improve the data communication efficiency and task synchronization of inter-core parallel tasks through on-chip shared memory, while different processors communicate through DDR, which favors data distribution among nodes. The interconnection mode of the whole cluster architecture favors task allocation and scheduling for the three-layer parallel architecture, with high flexibility and good expansibility.
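As a rough, node-level-only illustration (our sketch, not the patent's software; operating-system processes and the Pool's inter-process communication stand in for the DSP nodes and the RapidIO/DDR fabric described above):

```python
import numpy as np
from multiprocessing import Pool

def node_qr(Ai):
    """Work of one processor node: local QR; only the R factor leaves the node."""
    return np.linalg.qr(Ai, mode='r')

if __name__ == '__main__':
    N = 4
    A = np.random.randn(16 * N, N)
    with Pool(8) as pool:                        # 8 processes ~ 8 processor chips
        Rs = pool.map(node_qr, np.vsplit(A, 8))  # scatter blocks, gather R factors
    print([R.shape for R in Rs])                 # eight (N, N) factors to merge
```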
While the foregoing is directed to the preferred embodiment of the present invention, it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.

Claims (10)

1. A large-scale matrix QR decomposition parallel computing system, comprising: a three-level parallel structure realizing large-scale matrix QR parallel decomposition on a massively parallel computing platform, namely multi-processor-node parallelism, single-processor multi-core parallelism and single-core multi-data parallelism, wherein a first-level parallel structure with a binary-tree topology forms the top layer of the three-level structure, the second-level parallel structure forms the middle layer, and the third-level parallel structure forms the bottom layer, characterized in that: in a processor cluster system with large-scale parallel computing capability, built from multi-core processor chips together with the QR decomposition parallel computing structure, the top-level architecture uses the cluster characteristics of the multi-core processors to divide the matrix to be decomposed into a number of data segments and distributes them to the parallel nodes of the first stage through the communication network interconnecting the multi-core processor nodes; each first-stage node completes its QR decomposition task in parallel and then sends the upper triangular matrix R to a second-stage node through the network; the second-stage node completes R-matrix merging and QR decomposition and sends the R matrix to a third-stage node through the network, and so on; the nodes at each stage execute step by step following the complete binary-tree structure, and nodes of the same stage execute in parallel, completing the parallel computation flow among the multi-processor chip nodes; the middle-layer architecture partitions the matrix according to the matrix size input to the processor node and the number of processor cores, each block being a square matrix of uniform size; the whole operation proceeds layer by layer along the diagonal sub-arrays, and single-layer sub-array QR decomposition and matrix data update operations are executed in parallel by the multiple cores in the processor chip; the bottom-layer architecture uses the processor instruction set with single-instruction-multiple-data (SIMD) operations for multi-data parallel vector computation, completing single-core QR decomposition or multiplication operations; the multi-core processor cluster adopts a layer-by-layer decomposition method, realizing QR parallel decomposition of a large-scale matrix from the three-level parallel structure of inter-node parallelism across multi-processor chips, multi-core parallelism within a single processor chip, and multi-data parallelism within a single core.
2. The large-scale matrix QR decomposition parallel computing system according to claim 1, wherein: the massively parallel computing platform has at least 8 multi-core processor chips; the input matrix A to be decomposed has a size of at least 16N rows by N columns, and its sub-array segmentation can be cut flexibly according to the maximum parallel capability provided by the platform, dividing it by rows into a number of sub-array blocks A_i of 2N rows by N columns, with at most the 8 multi-core processor chips taking the matrix blocks A1-A8; the top-level architecture for the matrix A to be decomposed is cascaded according to the binary-tree 8-4-2-1 structure, and the top-level parallel architecture, a typical binary-tree architecture, completes the parallelism among the processor chip nodes; at least 17 processor nodes can execute in parallel, with at most 8 multi-core processor nodes per stage, and the 8-4-2-1 architecture builds a pipeline of at least four stages.
3. The large-scale matrix QR decomposition parallel computing system according to claim 2, wherein: in the layer-progressive direction, the first-stage pipeline processes 8 processor nodes in parallel, node 1 executing the QR decomposition of A1, node 2 executing the QR decomposition of A2, and node i executing the QR decomposition of A_i, every node A_i having the same execution time; the second-stage pipeline merges in pairs the N x N upper triangular matrices R_{1,i} output by the first stage, each merged new matrix having size 2N x N; the third-stage pipeline merges in pairs the N x N upper triangular matrices R_{2,i} output by the second stage, each merged new matrix having size 2N x N; the fourth-stage pipeline merges in pairs the N x N upper triangular matrices R_{3,i} output by the third-stage pipeline, the merged new matrix having size 2N x N.
4. The large-scale matrix QR decomposition parallel computing system according to claim 3, wherein: data merging and QR decomposition are performed by single-node operation in the second-stage pipeline; data merging and QR decomposition are performed by single-node operation in the third-stage pipeline; and the 1 processor node of the fourth-stage pipeline performs QR decomposition on the new matrix, the output result being the upper triangular R matrix of the QR decomposition.
5. The large-scale matrix QR decomposition parallel computing system according to claim 1, wherein: the middle-layer parallel architecture performs QR decomposition on the 2N-row by N-column sub-matrix block in a multi-layer progressive manner along the layer-progressive direction, each layer executing in parallel, with the multiple cores of the single-node processor completing the QR decomposition in parallel.
6. The large-scale matrix QR decomposition parallel computing system according to claim 1, wherein: the middle-layer parallel architecture performs QR decomposition on the first-layer block square matrix A_{1,1}, performs zero elimination on the same-column block matrices A_{i,1} (i > 1), and performs data updates on the same-row block matrices A_{1,j} (j > 1), comprising 4 operations: the general QR operation with transpose, GEQRT; the post-QR same-row multiplication update, ORMQR; the tall-skinny matrix QR operation with transpose, TSQRT; and the post-QR same-row tall-skinny multiplication update, TSMQR.
7. The large-scale matrix QR decomposition parallel computing system according to claim 6, wherein: the general QR operation with transpose, GEQRT, executes the general QR operation, updates A_{1,1} with the upper triangular matrix R of the QR result, and outputs the transpose Q^T of the orthogonal matrix Q; the post-QR same-row multiplication update, ORMQR, uses the preceding GEQRT result Q^T to execute $A_{1,j} = Q^T A_{1,j}$, updating the matrix A_{1,j}; the tall-skinny matrix QR operation with transpose, TSQRT, combines the updated A_{1,1} with A_{i,1} (i > 1) into the stacked matrix $\begin{bmatrix} A_{1,1} \\ A_{i,1} \end{bmatrix}$, performs QR decomposition on this combined tall-skinny matrix, outputs a new upper triangular matrix R_{1,1} that updates A_{1,1}, and outputs the result Q^T; the post-QR tall-skinny multiplication same-row update, TSMQR, mainly uses the preceding TSQRT output Q^T to execute $\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix} = Q^T \begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix}$, updating $\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix}$; T denotes the transpose.
8. The large-scale matrix QR decomposition parallel computing system according to claim 1, wherein: in the single-layer multi-core parallelism of the middle-layer parallel architecture, the first-stage computation has core 0 and core 4 complete the GEQRT operations in parallel; the second-stage computation, with 8 cores in parallel, has core 0 and core 4 complete TSQRT operations while the other cores complete ORMQR operations; the third-stage computation, with 8 cores in parallel, has core 0 and core 4 complete TSQRT operations while the other cores complete TSMQR operations; the fourth-stage computation, with 8 cores in parallel, has core 0 and core 4 complete TSQRT operations while the other cores complete TSMQR operations; the fifth-stage computation, with 7 cores in parallel, has core 0 complete the TSQRT operation on the merged tall-skinny matrix of cores 0 and 4 while the other cores complete TSMQR operations; and in the sixth stage, 3 cores in parallel complete TSMQR operations.
9. The large-scale matrix QR decomposition parallel computing system according to claim 8, wherein: the multi-core parallel second layer performs QR decomposition on the block square matrix A_{2,2}, updates A_{2,2}, performs zero elimination on the same-column block matrices A_{i,2} (i > 2), and updates the same-row block matrices A_{2,j} (j > 2); the multi-core parallel third layer performs QR decomposition on the block square matrix A_{3,3}, updates A_{3,3}, performs zero elimination on the same-column block matrices A_{i,3} (i > 3), and updates the same-row block matrices A_{i,j} (j > 3); in the multi-core parallel fourth layer, since only 1 column of block matrices remains, only A_{4,4} undergoes QR decomposition, A_{4,4} is updated, and the same-column block matrices A_{i,4} (i > 4) undergo zero elimination.
10. The large-scale matrix QR decomposition parallel computing system according to claim 9, wherein: the bottom-layer parallel architecture uses single-instruction-multiple-data (SIMD) and vector-computation instructions for parallel processing; for the leading diagonal element r_{11} of the R matrix, the vector computation $R_1 = \mathbf{a}_1^T \mathbf{a}_1$ is performed, the square-root instruction computes the diagonal element $r_{11} = \sqrt{R_1}$, and a dedicated reciprocal instruction takes $g_1 = 1/r_{11}$; for the other elements of the first row of the R matrix, the vector computation $C_{1j} = \mathbf{a}_1^T \mathbf{a}_j$ is performed and the first-row elements $R_{1j} = C_{1j} \cdot g_1$ (j > 1) are calculated; the matrix A is updated with the coefficients $h_{1j} = C_{1j}/R_1$ (j > 1) by the vector computation $\mathbf{a}_j \leftarrow \mathbf{a}_j - h_{1j}\,\mathbf{a}_1$ on the column vectors of the matrix A, completing the solution of all elements of the first row of the R matrix and the update of the matrix A; these steps are repeated to complete the evaluation of the whole R matrix.
CN202010609939.3A 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system Active CN111858465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010609939.3A CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010609939.3A CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Publications (2)

Publication Number Publication Date
CN111858465A CN111858465A (en) 2020-10-30
CN111858465B true CN111858465B (en) 2023-06-06

Family

ID=72989894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010609939.3A Active CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Country Status (1)

Country Link
CN (1) CN111858465B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488506A (en) * 2020-11-30 2021-03-12 哈尔滨工程大学 Extensible distributed architecture and self-organizing method of intelligent unmanned system cluster
CN112506677B (en) * 2020-12-09 2022-09-23 上海交通大学 TensorFlow distributed matrix calculation implementation method and system
CN112631986B (en) * 2020-12-28 2024-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device
CN115952391A (en) * 2022-12-12 2023-04-11 海光信息技术股份有限公司 Data processing method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193831A (en) * 2010-03-12 2011-09-21 复旦大学 Method for establishing hierarchical mapping/reduction parallel programming model
CA2801382A1 (en) * 2010-06-29 2012-01-05 Exxonmobil Upstream Research Company Method and system for parallel simulation models
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CN107210984A (en) * 2015-01-30 2017-09-26 华为技术有限公司 Method and apparatus for carrying out the parallel operation based on QRD in many execution unit processing systems
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wu Rongteng. A parallel algorithm for matrix triangular decomposition on multi-core and multi-GPU systems. Journal of Minjiang University, 2016, 37(05): 65-71. *
Wu Yong, Wang Jun, Zhang Peichuan, Cao Yunhe. A parallel clutter-suppression algorithm for passive (external-illuminator) radar under the CUDA architecture. Journal of Xidian University, 2014, 42(01): 104-111. *
Mu Shuai, Wang Chenxi, Deng Yangdong. Research on a GPU-based multi-level parallel QR decomposition algorithm. Computer Simulation, 2013, 30(09): 234-238. *

Also Published As

Publication number Publication date
CN111858465A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111858465B (en) Large-scale matrix QR decomposition parallel computing system
Ryu et al. Bitblade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
Wang et al. FPDeep: Scalable acceleration of CNN training on deeply-pipelined FPGA clusters
CN108170639B (en) Tensor CP decomposition implementation method based on distributed environment
CN109635241B (en) Method for solving symmetric or hermitian symmetric positive definite matrix inverse matrix
Zhou et al. Accelerating large-scale single-source shortest path on FPGA
Liu et al. WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs
CN114781632A (en) Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
CN110851779A (en) Systolic array architecture for sparse matrix operations
Shi et al. Efficient sparse-dense matrix-matrix multiplication on GPUs using the customized sparse storage format
Asgari et al. Meissa: Multiplying matrices efficiently in a scalable systolic architecture
US20180373677A1 (en) Apparatus and Methods of Providing Efficient Data Parallelization for Multi-Dimensional FFTs
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
Zhang et al. Low-latency mini-batch gnn inference on cpu-fpga heterogeneous platform
Ren et al. Exploration of alternative GPU implementations of the pair-HMMs forward algorithm
CN116710912A (en) Matrix multiplier and control method thereof
Chen et al. The parallel algorithm implementation of matrix multiplication based on ESCA
Han et al. EGCN: An efficient GCN accelerator for minimizing off-chip memory access
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
Wang et al. An efficient architecture for floating-point eigenvalue decomposition
WO2021217502A1 (en) Computing architecture
CN113986816A (en) Reconfigurable computing chip
Takahashi et al. Performance of the block Jacobi method for the symmetric eigenvalue problem on a modern massively parallel computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant