CN111858465A - Large-scale matrix QR decomposition parallel computing structure - Google Patents

Large-scale matrix QR decomposition parallel computing structure

Info

Publication number
CN111858465A
CN111858465A (application CN202010609939.3A)
Authority
CN
China
Prior art keywords
matrix
parallel
core
decomposition
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010609939.3A
Other languages
Chinese (zh)
Other versions
CN111858465B (en)
Inventor
吴明钦
刘红伟
潘灵
贾明权
郝黎宏
林勤
张昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202010609939.3A
Publication of CN111858465A
Application granted
Publication of CN111858465B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/78 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data, for changing the order of data flow, e.g. matrix transposition or LIFO buffers; Overflow or underflow handling therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a parallel computing structure for large-scale matrix QR decomposition, relating to the field of digital signal processing. Its aim is to provide a three-level parallel computing structure with clear parallel logic, high throughput, and low delay, realized by the following technical scheme: in a processor cluster system built from multi-core processor chips, the top-level architecture divides the matrix to be decomposed into several data slices and distributes them over the communication network interconnecting the multi-core processor nodes to the nodes of each level; the levels compute stage by stage in the order of a complete binary tree, with nodes of the same level computing in parallel. The middle-level architecture partitions the matrix into blocks and operates layer by layer along the diagonal sub-blocks. The bottom-level architecture uses the processor instruction set for multi-data parallel vector computation to complete the single-core QR decomposition and multiplication operations. The multi-core processor cluster adopts this layer-by-layer decomposition structure to realize parallel QR decomposition of a large-scale matrix.

Description

Large-scale matrix QR decomposition parallel computing structure
Technical Field
The invention relates to large-scale array antenna signal processing in the field of digital signal processing, and in particular to a QR decomposition method for high-performance parallel computation on large-scale multi-core processor clusters in the field of numerical computing.
Background
In the field of digital signal processing, algorithms such as large-scale array antenna signal processing and large-scale multiple-input multiple-output (MIMO) processing often involve covariance matrix inversion, channel matrix estimation, channel equalization, and similar problems, and QR decomposition is widely applied to all of them. MIMO has become one of the most critical technologies in many wireless communication standards, such as IEEE 802.11n and 3GPP-LTE. Designing an efficient QR decomposition operation unit therefore reduces the complexity of a MIMO system while delivering good computational performance. Covariance matrix inversion is one of the most commonly used spatial interference suppression algorithms; however, when the interference power is too large, the sampled covariance matrix becomes singular, the matrix inversion theorem fails, and interference suppression weights cannot be generated effectively. Because QR decomposition can effectively improve the conditioning of the matrix and its numerical stability, using a QR decomposition algorithm to invert the sampled covariance matrix avoids the failure to obtain effective interference suppression weights when the interference power is too strong. The computational performance of QR decomposition thus has a direct impact on the signal processing performance of an interference suppression system. As a principal tool of digital signal processing, the QR decomposition algorithm plays an important role in the field of high-performance computing and is an important index for measuring system performance. Matrix computation is one of the core problems in high-performance computing, matrix factorization is an important way to increase the parallelism of matrix computation, and QR decomposition is an important form of matrix factorization. By definition, a QR decomposition A = QR factors a matrix into an orthogonal matrix Q and a nonsingular upper triangular matrix R (that is, all elements of R below the diagonal are 0); the decomposition is unique when the diagonal elements of R are required to be positive.
There are many ways to compute a QR decomposition in practice, such as Givens rotations, the Householder transformation, and Gram-Schmidt orthogonalization, each with its own advantages and disadvantages; QR decomposition of a matrix is typically realized with one of these three methods. The Householder QR algorithm obtains a Householder transformation matrix through a reflection operation and updates the elements below the diagonal to 0 through matrix multiplication, so the process requires a large amount of matrix multiplication, which increases the complexity of the algorithm. The Givens rotation QR method instead updates the matrix through Givens rotation matrices, which can be applied row by row; although it involves division and square-root operations, its complexity is lower than that of the Householder method. Because the computational load of QR decomposition is large, the decomposition consumes a great deal of computing time and becomes a bottleneck for improving the performance of many practical applications. For example, in cognitive radio, QR decomposition is the most time-consuming computing module in the singular value decomposition (SVD); measured data show that QR decomposition accounts for over 70% of the total SVD processing time.
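For orientation only (this sketch is not part of the patent; the function name householder_qr and the test sizes are ours), a minimal NumPy implementation of Householder QR shows the matrix-multiplication-heavy trailing update the paragraph refers to:

```python
import numpy as np

def householder_qr(A):
    """Minimal dense Householder QR: returns Q (orthogonal) and R (upper
    triangular). Each step reflects column k below the diagonal to zero;
    the trailing-matrix updates are the matrix-multiplication cost the
    text attributes to the Householder method."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])  # stable reflector choice
        norm_v = np.linalg.norm(v)
        if norm_v == 0:
            continue
        v /= norm_v
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])  # trailing-matrix update
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)    # accumulate Q
    return Q, np.triu(R)

A = np.random.randn(8, 4)
Q, R = householder_qr(A)
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(8))
```

Each of the n reflections touches the whole trailing submatrix, giving the O(mn^2) multiply-add volume that motivates the blocked and parallel organizations described below.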
Large-scale matrix QR decomposition is widely applied in signal processing, image processing, computational structural mechanics, and other fields. Because the large-scale QR decomposition algorithm involves an enormous amount of computation and a very complex algorithm structure that does not lend itself to parallel decomposition, the traditional approach realizes large-scale QR decomposition on high-performance supercomputing platforms based on the x86 architecture. However, the task allocation, data synchronization, and related problems of large-scale QR decomposition on such distributed computing platforms lead to long communication times that cannot meet millisecond-level real-time processing requirements. In recent years, many small- and medium-scale matrix QR decompositions have been implemented on FPGAs, usually with a systolic-array hardware structure offering good parallelism and high real-time performance; but as the matrix scale increases sharply, FPGA-based QR decomposition is limited by chip area and power consumption, cannot meet the high-throughput requirement of large-scale QR decomposition, and has a long development cycle.
The key to the wide application of large-scale matrix QR decomposition is to reduce processing delay while increasing throughput. With the great increase in front-end sensor scale, the continuous rise of sampling rates, and the continuous growth of channel matrix dimensions, traditional QR decomposition methods can no longer satisfy the throughput and real-time processing requirements of large-scale QR decomposition. Existing parallel computing research on QR decomposition sits at two extremes. First, in the field of scientific computing, popular distributed parallel computing architectures realize ultra-large-scale matrix decomposition on Hadoop and similar distributed platforms; although this achieves large-scale throughput, it cannot meet real-time processing requirements and is unsuitable for embedded equipment with strict demands on low power consumption, flexibility, and high reliability. Second, for specific hardware structures, typified by FPGA implementations of QR decomposition, special-purpose matrix decomposition parallel processors are designed in pursuit of extremely fast processing.
For most engineering applications, a highly scalable system hardware architecture must be built flexibly and quickly from existing mature chips. Many kinds of processor chips have evolved from single-core to multi-core, and each core typically offers single-instruction multiple-data (SIMD) parallel computing and vector computing capability. At the current international state of the art, hundreds of lightweight cores can be integrated on a single chip; taking TI's TMS320C6678 DSP chip as an example, it integrates 8 high-performance DSP cores, and single-chip processing capability will improve further both through higher single-core performance and through larger core counts. Multi-core technology is therefore mature and increasingly a focus of CPU/DSP development. Technically, multi-core parallelism within a processor is an important method of realizing software parallelism, and multiple cores are an important route to higher processor performance; the multi-core structure is very suitable for parallel computing tasks. In current commercial multi-core DSPs, increasing the core count and enlarging on-chip memory are the main means of improving computing capability. The multi-core cluster system, today's mainstream large-scale parallel computing system, connects multiple multi-core processor chips through a high-speed interconnection network to form a cluster with extremely strong parallel computing capability. However, conventional QR decomposition algorithms cannot fully utilize the parallel processing capability of a multi-core cluster system and can hardly bring its performance advantage into play.
Disclosure of Invention
The invention aims to provide, for large-scale multi-core processor cluster applications that involve large-scale QR decomposition and simultaneously require real-time processing, a three-level parallel computing structure with clear parallel logic, strong scalability and portability, high throughput, low delay, and high generality. It fully exploits the parallel processing advantages of a multi-core processor cluster to realize parallelism among multi-processor nodes, parallelism among the cores of a single processor, and single-core multi-data parallelism, thereby solving the problem that traditional QR decomposition methods cannot effectively use multi-core processor cluster resources for large-scale parallel computing.
The technical scheme adopted by the invention is as follows: a large-scale matrix QR decomposition parallel computing structure comprising three levels of parallelism, namely parallelism among multi-processor nodes, multi-core parallelism within a single processor, and single-core multi-data parallelism, which together realize large-scale parallel QR decomposition of a matrix. The first-level parallel structure, with its binary-tree structural characteristic, belongs to the top-level architecture of the three-level parallel structure; the second-level parallel structure belongs to the middle-level architecture; and the third-level parallel structure belongs to the bottom-level architecture. In a processor cluster system with large-scale parallel computing capability built from multi-core processor chips, the top-level architecture exploits the characteristics of the multi-core processor cluster to divide the matrix to be decomposed into several data slices, which are distributed over the communication network interconnecting the multi-core processor nodes to the parallel nodes of the first level. Each first-level node completes its corresponding QR decomposition task in parallel and then sends its triangular matrix R over the network to a second-level node; each second-level node completes R-matrix merging and QR decomposition and then sends its R matrix over the network to a third-level node; and so on. The nodes of each level execute stage by stage in the order of a complete binary tree, with nodes of the same level executing in parallel, thereby completing the parallel computation flow graph among the multi-core processor chip nodes. The middle-level architecture partitions the matrix into sub-blocks according to the matrix scale input to the processor node and the number of processor cores, each sub-block being a square matrix of uniform size; the whole operation proceeds layer by layer along the diagonal sub-blocks, with the QR decomposition and matrix data update of each single layer executed in parallel by the multiple cores within a processor chip. The bottom-level architecture performs multi-data parallel vector computation using the SIMD-capable processor instruction set, completing the single-core QR decomposition and multiplication operations. The multi-core processor cluster thus adopts a layer-by-layer decomposition method and realizes parallel QR decomposition of a large-scale matrix through three parallel levels: parallelism among multi-processor chip nodes, multi-core parallelism within a single processor chip, and single-core multi-data parallelism.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
The parallel logic is clear, and scalability and portability are strong. The parallel computing capability of the multi-core processor cluster is fully utilized: the three-level parallel architecture realizes the QR decomposition parallel computing structure on a processor cluster system with large-scale parallel computing capability built from multi-core processor chips, the nodes of each level execute stage by stage in the order of a complete binary tree, nodes of the same level execute in parallel, and deployment on a multi-processor cluster is easy. The three-level parallel architecture is clear in structure, easy to realize, and can satisfy system throughput and real-time processing requirements at the same time. Experimental results show that, compared with prior-art software implementations on general-purpose processors, the parallel QR decomposition achieves a speed-up of more than 20 times in core computing performance and exposes higher data and pipeline parallelism in the matrix triangularization process.
In the invention, the communication network is formed by interconnecting switching chips among the multi-core processor nodes, so the cluster scale is easy to expand, while the shared memory among the cores within a single processor node markedly improves the communication efficiency and task synchronization of parallel tasks among the cores. The whole cluster architecture facilitates task allocation and scheduling for the three-level parallel architecture and has high flexibility, scalability, and portability.
The method is based on a multi-core processor cluster and adopts a layer-by-layer decomposition structure to realize multi-node parallelism, single-node multi-core parallelism, and single-core multi-data parallelism. The multi-node parallelism with its binary-tree structural characteristic fully exploits the cluster's parallel processing advantage and satisfies the throughput requirement of large-scale matrix QR decomposition: for the same processing time, an exponential growth in the scale of the matrix to be processed is obtained at the cost of only linear growth in cluster size. The multi-core parallelism within a processor fully utilizes the multi-core resources, realizes the QR decomposition and data matrix update of a single-layer sub-block in parallel, greatly improves the real-time computing capability and QR decomposition speed on a single node, and scales processing performance with the number of cores. The single-core multi-data parallelism exploits the SIMD and vector computing capability of the processor instruction set, and the improved GS (Gram-Schmidt) algorithm increases single-core QR decomposition performance by a factor of 4. The three-level parallel architecture is thus very well suited to a multi-core processor cluster: it satisfies the throughput requirement of large-scale matrix QR decomposition while greatly reducing processing delay, giving real-time processing capability.
The invention comprehensively considers the communication, storage, and computing resources of the multi-core processor together with the instruction-level SIMD, matrix, and vector computing capabilities, and realizes parallel large-scale QR decomposition through three parallel levels: parallelism among multi-processor nodes, multi-core parallelism within a single chip, and single-core multi-data parallelism. Multi-core communication and task synchronization inside a processor are realized through shared memory; communication among chips uses a network switching chip, so the cluster scale can be expanded at will; task-level and data-level parallelism in the matrix decomposition process are handled simultaneously; and the processor's SIMD and vector computation instruction sets are fully used for parallel operation, realizing flexible scheduling of the parallel QR decomposition tasks. This multi-level parallel computing structure attains a performance improvement of more than 20 times in numerical computations involving weight updating in large-scale array signal processing, with low communication overhead, marked computational acceleration, and clearly improved computing parallelism and cluster communication flexibility. It is very suitable for large-scale parallel computing clusters built around multi-core DSPs.
The method is greatly superior to existing QR decomposition methods, has outstanding engineering application value, and is very suitable for computation on large-scale multi-core processor clusters.
Drawings
FIG. 1 is a block diagram of the top-level parallel structure in the large-scale matrix QR decomposition three-level parallel structure of the present invention;
FIG. 2 is a block diagram of the layer-by-layer progressive relationship of the middle-level parallel structure in the three-level parallel structure;
FIG. 3 is a block diagram of single-layer multi-core parallel computing in the middle-level parallel architecture of FIG. 2;
FIG. 4 is a block diagram of the cascade progression of the bottom-level parallel structure in the large-scale QR decomposition three-level parallel structure;
FIG. 5 is a block diagram of the inter-node connections of the multi-core processor cluster and the multi-core connections within a processor.
Detailed Description
See FIGS. 1-3. In a preferred embodiment described below, a large-scale matrix QR decomposition parallel computing structure comprises three levels of parallelism: processor-node parallelism, processor-core parallelism, and single-core instruction-level parallelism. The first-level parallel structure, with its binary-tree structural characteristic, belongs to the top-level architecture of the three-level parallel structure; the second-level parallel structure belongs to the middle-level architecture; and the third-level parallel structure belongs to the bottom-level architecture. In a processor cluster system with large-scale parallel computing capability built from multi-core processor chips, the top-level architecture divides the matrix to be decomposed into data slices and distributes them, over the communication network built around a switching chip that interconnects the multi-core processor nodes, to the parallel nodes of the first level. Each first-level node completes its corresponding QR decomposition task in parallel and outputs a triangular matrix R to a second-level node; each second-level node completes R-matrix merging and QR decomposition and sends its R matrix to a third-level node; the nodes of each level execute stage by stage in the order of a complete binary tree, with nodes of the same level executing in parallel, thereby completing the parallel computation flow graph among the processor chip nodes. The middle-level architecture partitions the input matrix into blocks according to its scale; each block is a square sub-matrix of uniform size, QR decomposition and matrix data update operations take a sub-block or sub-block combination as their object, the whole operation proceeds layer by layer along the diagonal sub-blocks, and the QR decomposition and update of each single layer are executed in parallel by the multiple cores within a processor chip. The bottom-level architecture performs vector computation with the single-instruction multiple-data (SIMD) capable processor instruction set to complete the single-core QR decomposition and multiplication operations. The large-scale matrix QR decomposition thus adopts a layer-by-layer decomposition method and is realized through three parallel levels: parallelism among multi-core processor chip nodes, multi-core parallelism within a chip, and single-core multi-data parallelism.
In an optional embodiment, the parallel architecture realizing large-scale parallel QR decomposition of a matrix adopts a three-level architecture of top level, middle level, and bottom level.
In an alternative embodiment, the top-level parallel architecture shown in FIG. 1 is a typical binary-tree architecture and mainly accomplishes parallelism among the processor chip nodes. The large-scale parallel computing platform has at least 8 multi-core processor chips and receives an input matrix A to be decomposed of at least 16N rows and N columns. The sub-matrix partitioning of A can be cut flexibly according to the maximum parallel capacity the platform provides: A is split by rows into several sub-matrix blocks A_i of 2N rows and N columns each, and with 8 multi-core processor chips it can be split into at most the blocks A1-A8. The top-level architecture of the matrix A to be decomposed cascades according to the binary-tree 8-4-2-1 structure, and this typical binary-tree architecture completes the parallelism among the processor chip nodes. To greatly improve the throughput capacity of the platform, with at least 17 processor nodes, up to 8 multi-core processor nodes per stage can execute simultaneously in parallel, and the 8-4-2-1 architecture is used to build a pipeline of at least 4 stages.
The pipeline structure with the 8-4-2-1 architecture sequentially executes the following steps:
First-stage pipeline: proceeding in the direction of the pipeline cascade, 8 processor nodes work simultaneously in parallel; node 1 performs the QR decomposition of A1, node 2 that of A2, and in general node i that of A_i. Because all sub-matrices A_i have the same size, every node has the same execution time, which gives good parallel efficiency.
Second-stage pipeline: the N x N upper triangular matrices R_{1,i} output by the first stage are merged pairwise, each merged matrix having size 2N x N, and 4 processor nodes perform QR decomposition on the new matrices simultaneously in parallel.
Third-stage pipeline: the N x N upper triangular matrices R_{2,i} output by the second stage are merged pairwise, each merged matrix having size 2N x N, and 2 processor nodes perform QR decomposition on the new matrices simultaneously in parallel.
Fourth-stage pipeline: the N x N upper triangular matrices R_{3,i} output by the third stage are merged pairwise into a single 2N x N matrix; 1 processor node performs QR decomposition on it, and the output result is the upper triangular R matrix of the overall QR decomposition. A serial sketch of this reduction follows.
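The following serial NumPy sketch (not part of the patent; tree_qr_R is our name, and np.linalg.qr stands in for each node's QR kernel) mirrors the 8-4-2-1 reduction of R factors:

```python
import numpy as np

def tree_qr_R(A, leaves=8):
    """TSQR-style binary-tree reduction mirroring the 8-4-2-1 pipeline:
    QR-factor each 2N x N slice, then repeatedly merge pairs of N x N
    R factors and re-factor until a single R remains."""
    # Stage 1: each "processor node" factors its own 2N x N slice.
    Rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(A, leaves)]
    # Stages 2..4: pairwise merge into 2N x N stacks and re-factor.
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

N = 5
A = np.random.randn(16 * N, N)            # 16N rows, N columns
R_tree = tree_qr_R(A)
R_ref = np.linalg.qr(A, mode='r')
# R factors of a full-rank matrix agree up to the sign of each row.
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```

Each iteration of the while loop corresponds to one pipeline stage executing its merges in parallel; only the N x N R factors travel over the network, which keeps the inter-node communication volume low.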
See FIG. 2. The middle-level parallel architecture completes QR decomposition in parallel using the multiple cores of a single-node processor, and the single-node QR decomposition time directly determines the processing delay of the top-level parallel architecture. The middle-level architecture decomposes the 2N-row, N-column sub-matrix blocks, executing in a multi-layer progressive manner with each layer executed in parallel. The middle level partitions the 2N x N matrix into blocks according to the core count of the single-node processor, each block A_{i,j} being a square matrix. (Matrix transposition is defined mathematically as follows: if A is an m x n matrix, i.e. m rows and n columns, whose element in row i and column j is a_{i,j} with i, j >= 1, then the n x m matrix B = A^T satisfies b_{j,i} = a_{i,j}.) Assuming the multi-core processor chip has 8 cores, each block A_{i,j} is a square matrix of 2N/8 rows and 2N/8 columns. The decomposition proceeds layer by layer in the progressive direction as follows:
First layer: the middle-level parallel architecture performs QR decomposition on the block matrix A_{1,1}, eliminates to zero the same-column blocks A_{i,1} (i > 1), and updates the same-row blocks A_{1,j} (j > 1). This comprises 4 operations: the general QR operation with transposition, GEQRT; the post-QR multiplication same-row update, ORMQR; the tall-skinny matrix QR operation with transposition, TSQRT; and the post-QR tall-skinny multiplication same-row update, TSMQR. GEQRT executes a general QR operation, updates A_{1,1} with the upper triangular matrix R of the QR result, and outputs the transpose Q^T of the orthogonal matrix Q. ORMQR uses the preceding GEQRT result Q^T to execute A_{1,j} = Q^T A_{1,j}, updating the matrices A_{1,j}. TSQRT uses the updated A_{1,1} and A_{i,1} (i > 1) to form the combination

$$\begin{bmatrix} A_{1,1} \\ A_{i,1} \end{bmatrix},$$

performs QR decomposition on this combined tall-skinny matrix, outputs a new upper triangular matrix R_{1,1} that updates A_{1,1}, and outputs the result Q^T. TSMQR mainly uses the preceding TSQRT output result Q^T to execute

$$\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix} = Q^T \begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix},$$

updating A_{1,j} and A_{i,j}; T denotes transposition.
The computation of the first layer is completed through a multi-stage cascade, with the operations of each stage executed in parallel across the cores; the kernel sketches below illustrate the four operations functionally.
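The four operations can be prototyped in NumPy as follows (a functional sketch under the patent's kernel names but with our own signatures; Q^T is formed and returned explicitly here, whereas a real DSP kernel would keep it in compact Householder form):

```python
import numpy as np

def geqrt(A_kk):
    """General QR of a square diagonal tile: returns Q^T and the R tile."""
    Q, R = np.linalg.qr(A_kk)
    return Q.T, R

def ormqr(Qt, A_kj):
    """Same-row update right of the diagonal: A_kj <- Q^T A_kj."""
    return Qt @ A_kj

def tsqrt(R_kk, A_ik):
    """Tall-skinny QR of the stack [R_kk; A_ik]: returns the full Q^T and
    the new R_kk; the tile below the diagonal is thereby eliminated to 0."""
    b = R_kk.shape[0]
    Q, R = np.linalg.qr(np.vstack([R_kk, A_ik]), mode='complete')
    return Q.T, R[:b, :]

def tsmqr(Qt, A_kj, A_ij):
    """Paired same-row update: [A_kj; A_ij] <- Q^T [A_kj; A_ij]."""
    b = A_kj.shape[0]
    C = Qt @ np.vstack([A_kj, A_ij])
    return C[:b, :], C[b:, :]
```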
See FIG. 3. Single-layer multi-core parallel computation in the middle-level parallel architecture proceeds as follows. First-stage computation: core 0 and core 4 complete the general QR operation with transposition (GEQRT) in parallel. Second-stage computation: 8 cores in parallel; core 0 and core 4 complete the tall-skinny QR operation with transposition (TSQRT), and the other cores complete the post-QR multiplication same-row update (ORMQR). Third-stage computation: 8 cores in parallel; core 0 and core 4 complete TSQRT, and the remaining cores complete the post-QR tall-skinny same-row update (TSMQR). Fourth-stage computation: 8 cores in parallel; core 0 and core 4 complete TSQRT, and the remaining cores complete TSMQR. Fifth-stage computation: 7 cores in parallel; core 0 completes TSQRT on the merged tall-skinny matrix of cores 0 and 4, and the remaining cores complete TSMQR. Sixth-stage computation: 3 cores in parallel complete TSMQR.
Multi-core parallel second layer: perform QR decomposition on the block matrix A_{2,2} and update A_{2,2}; eliminate to zero the same-column blocks A_{i,2} (i > 2) and update the same-row blocks A_{2,j} (j > 2). The operation steps are similar to those of the first layer, except that the number of cores participating in the parallel computation differs.
Multi-core parallel third layer: perform QR decomposition on the block matrix A_{3,3} and update A_{3,3}; eliminate to zero the same-column blocks A_{i,3} (i > 3) and update the same-row blocks A_{3,j} (j > 3). The operation steps are similar to those of the first layer, except that the number of cores participating in the parallel computation differs.
Multi-core parallel fourth layer: since only 1 column of blocks remains in this layer, it is only necessary to perform QR decomposition on A_{4,4}, update A_{4,4}, and eliminate to zero the same-column blocks A_{i,4} (i > 4). The operations involved are the general QR operation with transposition (GEQRT) and the tall-skinny QR operation with transposition (TSQRT). A serial driver for the whole layer-by-layer sweep is sketched below.
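Combining the kernel sketches above, a serial driver for the layer-by-layer sweep might read as follows (tile_qr_R is our name; in the middle level, the inner loops over i and j are what the multi-core schedule of FIG. 3 distributes across the cores):

```python
import numpy as np  # geqrt, ormqr, tsqrt, tsmqr as sketched above

def tile_qr_R(A, b):
    """Layer-by-layer tile QR over b x b tiles; returns the upper
    triangular factor R of A."""
    m, n = A.shape
    p, q = m // b, n // b
    T = [[A[i*b:(i+1)*b, j*b:(j+1)*b].astype(float).copy()
          for j in range(q)] for i in range(p)]
    for k in range(q):                    # one "layer" per diagonal tile
        Qt, T[k][k] = geqrt(T[k][k])      # GEQRT on the diagonal tile
        for j in range(k + 1, q):         # ORMQR across the tile row
            T[k][j] = ormqr(Qt, T[k][j])
        for i in range(k + 1, p):         # eliminate tiles below the diagonal
            Qt, T[k][k] = tsqrt(T[k][k], T[i][k])
            T[i][k] = np.zeros((b, b))
            for j in range(k + 1, q):     # TSMQR same-row updates
                T[k][j], T[i][j] = tsmqr(Qt, T[k][j], T[i][j])
    return np.vstack([np.hstack(row) for row in T])[:n, :]

A = np.random.randn(16, 8)
R = tile_qr_R(A, b=4)
assert np.allclose(np.abs(np.triu(R)), np.abs(np.linalg.qr(A, mode='r')))
```

For a fixed elimination step, the TSMQR updates over the columns j are independent of one another and can run on different cores, which is exactly the parallelism the single-layer multi-core schedule of FIG. 3 exploits.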
See FIG. 4. The bottom-level parallel architecture mainly completes the parallel computation within one processor core. The instruction sets of currently popular multi-core processors support SIMD operations as well as vector and matrix operations. To make full use of the parallel performance of the instruction set and to balance the time overhead of each core across the different operations, an improved Gram-Schmidt (GS) orthogonalization method is adopted for the general QR operation with transposition (GEQRT) and the tall-skinny QR operation with transposition (TSQRT). The steps of the improved GS method are as follows:
The first step: the bottom-level parallel architecture performs parallel processing with dedicated single-instruction multiple-data (SIMD) and vector computation instructions. To solve the first diagonal element of the R matrix, the vector dot product

$$R_1 = \mathbf{a}_1^T \mathbf{a}_1 = \sum_i a_{i1}^2$$

is computed with a dedicated instruction; the diagonal element $r_{11} = \sqrt{R_1}$ is computed with a square-root instruction, and the reciprocal $g_1 = 1/r_{11}$ with a dedicated reciprocal instruction. In particular, the additional instruction sets provided by the multi-core processor accelerate these computations; different processors have their own specially supported instruction sets.
The second step: the bottom-level parallel architecture uses dedicated SIMD and vector computation instructions to perform the vector calculation of the other elements of the first row of the R matrix, $C_{1j} = \mathbf{a}_1^T \mathbf{a}_j$, and solves the first row of the R matrix as $R_{1j} = C_{1j} \cdot g_1$ (j > 1).
The third step: the bottom-level parallel architecture uses dedicated SIMD and vector computation instructions to update the matrix A: with the coefficients $h_{1j} = C_{1j}/R_{11}$ (j > 1), the vector calculation $\mathbf{a}_j = \mathbf{a}_j - h_{1j}\,\mathbf{a}_1$ is performed on the column vectors of A. At this point all elements of the first row of the R matrix have been solved and the matrix A has been updated. Repeating these three steps completes the evaluation of the whole R matrix. Apart from a small number of square-root and division operations, the improved GS algorithm consists mostly of vector multiply-add, so parallel processing with SIMD and vector computation instructions can greatly accelerate the whole computation process.
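A NumPy transcription of the three steps (our sketch; vectorized operations stand in for the SIMD and vector instructions, and we read the R11 in the coefficient h1j = C1j/R11 as the squared norm R_1, which makes the update the exact Gram-Schmidt projection):

```python
import numpy as np

def gs_qr_R(A):
    """Row-by-row improved-GS evaluation of R: one sqrt and one reciprocal
    per diagonal element, everything else dot products and multiply-adds."""
    A = A.astype(float).copy()
    m, n = A.shape
    R = np.zeros((n, n))
    for k in range(n):
        # Step 1: squared norm via a dot product, then sqrt and reciprocal.
        Rk = A[:, k] @ A[:, k]
        R[k, k] = np.sqrt(Rk)
        g = 1.0 / R[k, k]
        # Step 2: remaining entries of row k of R, one dot product per column.
        C = A[:, k] @ A[:, k + 1:]
        R[k, k + 1:] = C * g
        # Step 3: multiply-add update of the trailing columns of A.
        A[:, k + 1:] -= np.outer(A[:, k], C / Rk)
    return R

A = np.random.randn(12, 6)
R_ref = np.linalg.qr(A, mode='r')
R_ref = R_ref * np.sign(np.diag(R_ref))[:, None]  # normalize diagonal signs
assert np.allclose(gs_qr_R(A), R_ref)
```

Steps 2 and 3 are pure dot products and multiply-adds over whole columns, so they map directly onto SIMD and vector instructions; only the one square root and one reciprocal per diagonal element remain scalar, which is consistent with the single-core gain claimed above for the improved GS method.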
See FIG. 5. In the multi-core processor cluster, the system architecture is formed by interconnecting multiple multi-core processors through a switching network cross-linked with double-data-rate (DDR) memories, and this interconnection architecture determines the generality and scalability of the cluster. A multi-core processor typically has multiple independent processing cores; for example, the TMS320C6678 digital signal processing (DSP) chip has 8 relatively independent cores. The 8 processor chips are interconnected through a switching chip (the switching network can be a high-speed communication network such as RapidIO), forming a multi-core processor cluster with 64 processing cores. The 8 processor nodes share data through the DDR of each node; for example, multi-core processor chip 0 sends data from its DDR through the switching network to the DDR of multi-core processor chip 1. The 8 cores within a multi-core processor chip carry out fast data interaction and task synchronization through the on-chip shared memory, and can also communicate with the processing cores of other nodes through the off-chip DDR. Interconnecting the processor nodes through switching chips to form the communication network facilitates expanding the cluster scale; the cores within a processor improve the data communication efficiency and task synchronization of inter-core parallel tasks through the on-chip shared memory, while different processors communicate through DDR, which facilitates data distribution among the nodes. The interconnection scheme of the whole cluster architecture facilitates task allocation and scheduling for the three-level parallel architecture and has high flexibility and good scalability.
The foregoing is directed to the preferred embodiment of the present invention and it is noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims (10)

1. A large-scale matrix QR decomposition parallel computing structure, comprising three levels of parallelism, namely parallelism among multi-processor nodes, multi-core parallelism within a single processor, and single-core multi-data parallelism, wherein the first-level parallel structure with the binary-tree structural characteristic belongs to the top-level architecture of the three-level parallel structure, the second-level parallel structure belongs to the middle-level architecture, and the third-level parallel structure belongs to the bottom-level architecture, characterized in that: in a processor cluster system with large-scale parallel computing capability built from multi-core processor chips, the top-level architecture exploits the characteristics of the multi-core processor cluster to divide the matrix to be decomposed into several data slices, which are distributed to the parallel nodes of the first level through the communication network interconnecting the multi-core processor nodes; each first-level node completes its corresponding QR decomposition task in parallel and then sends a triangular matrix R through the network to a second-level node; each second-level node completes R-matrix merging and QR decomposition and then sends its R matrix through the network to a third-level node; and so on, the nodes of each level executing stage by stage in the order of a complete binary tree, with same-level nodes executing in parallel, thereby completing the parallel computation flow graph among the multi-core processor chip nodes; the middle-level architecture partitions the matrix into blocks according to the matrix scale input to the processor node and the number of processor cores, each block being a square matrix of uniform size, the whole operation proceeding layer by layer along the diagonal sub-blocks, with single-layer sub-block QR decomposition and matrix data update operations executed in parallel by the multiple cores within a processor chip; the bottom-level architecture performs multi-data parallel vector computation using the processor instruction set with single-instruction multiple-data (SIMD) capability, completing the single-core QR decomposition or multiplication operations; the multi-core processor cluster adopts a layer-by-layer decomposition method and realizes parallel QR decomposition of a large-scale matrix through the three parallel levels of parallelism among multi-processor chip nodes, multi-core parallelism within a single processor chip, and single-core multi-data parallelism.
2. The large-scale matrix QR decomposition parallel computing structure of claim 1, characterized in that: the large-scale parallel computing platform has at least 8 multi-core processor chips and receives an input matrix A to be decomposed of at least 16N rows and N columns; the sub-matrix partitioning of the input matrix A can be cut flexibly according to the maximum parallel capacity the platform provides, the input matrix A being split by rows into several sub-matrix blocks A_i of 2N rows and N columns each, so that the 8 multi-core processor chips can be assigned at most the blocks A1-A8; the top-level architecture of the matrix A to be decomposed cascades according to the binary-tree 8-4-2-1 structure, the top-level parallel architecture being a typical binary-tree architecture with the processor chip nodes in parallel; with at least 17 processor nodes, up to 8 multi-core processor nodes per stage can execute simultaneously in parallel, and a pipeline of at least four stages is built using the 8-4-2-1 architecture.
3. The large-scale matrix QR decomposition parallel computing structure of claim 2, characterized in that: in the stage-by-stage direction, the first-stage pipeline processes 8 processor nodes simultaneously in parallel, node 1 performing the QR decomposition of A1, node 2 that of A2, and node i that of A_i, each node having the same execution time because all A_i have the same size; the second-stage pipeline merges pairwise the N x N upper triangular matrices R_{1,i} output by the first stage, each merged matrix having size 2N x N; the third-stage pipeline merges pairwise the N x N upper triangular matrices R_{2,i} output by the second stage, each merged matrix having size 2N x N; the fourth-stage pipeline merges pairwise the N x N upper triangular matrices R_{3,i} output by the third stage, the merged matrix having size 2N x N.
4. The large-scale matrix QR decomposition parallel computing structure of claim 3, characterized in that: each single node of the second-stage pipeline performs data merging and QR decomposition; each single node of the third-stage pipeline performs data merging and QR decomposition; and 1 processor node of the fourth-stage pipeline performs QR decomposition on the merged matrix, the output result being the upper triangular R matrix of the overall QR decomposition.
5. The large-scale matrix QR decomposition parallel computing structure of claim 1, characterized in that: the middle-level parallel architecture performs QR decomposition on the 2N-row, N-column sub-matrix blocks, executing in a multi-layer progressive manner in the layer-progressive direction with each layer executed in parallel, QR decomposition being completed in parallel by the multiple cores of a single-node processor.
6. The large-scale matrix QR decomposition parallel computing structure of claim 1, characterized in that: in the first layer, the middle-level parallel architecture performs QR decomposition on the block matrix A_{1,1}, eliminates to zero the same-column blocks A_{i,1} (i > 1), and updates the data of the same-row blocks A_{1,j} (j > 1), the data update comprising 4 operations, namely the general QR operation with transposition GEQRT, the post-QR multiplication same-row update ORMQR, the tall-skinny matrix QR operation with transposition TSQRT, and the post-QR tall-skinny multiplication same-row update TSMQR.
7. The large-scale matrix QR decomposition parallel computing structure of claim 6, characterized in that: the general QR operation with transposition GEQRT executes a general QR operation, updates A_{1,1} with the upper triangular matrix R of the QR result, and outputs the transpose Q^T of the orthogonal matrix Q; the post-QR multiplication same-row update ORMQR uses the preceding GEQRT result Q^T to execute A_{1,j} = Q^T A_{1,j}, updating the matrices A_{1,j}; the tall-skinny matrix QR operation with transposition TSQRT uses the updated A_{1,1} and A_{i,1} (i > 1) to form the combination

$$\begin{bmatrix} A_{1,1} \\ A_{i,1} \end{bmatrix},$$

performs QR decomposition on the combined tall-skinny matrix, outputs a new upper triangular matrix R_{1,1} that updates A_{1,1}, and outputs the result Q^T; the post-QR tall-skinny multiplication same-row update TSMQR mainly uses the preceding TSQRT output result Q^T to execute

$$\begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix} = Q^T \begin{bmatrix} A_{1,j} \\ A_{i,j} \end{bmatrix},$$

updating A_{1,j} and A_{i,j}; T denotes transposition.
8. The large-scale matrix QR decomposition parallel computing structure of claim 1, characterized in that: in the single-layer multi-core parallelism of the middle-level parallel architecture, the first-stage computation has core 0 and core 4 completing the GEQRT operation in parallel; the second-stage computation has 8 cores in parallel, core 0 and core 4 completing the TSQRT operation and the other cores completing the ORMQR operation; the third-stage computation has 8 cores in parallel, core 0 and core 4 completing the TSQRT operation and the remaining cores completing the TSMQR operation; the fourth-stage computation has 8 cores in parallel, core 0 and core 4 completing the TSQRT operation and the remaining cores completing the TSMQR operation; the fifth-stage computation has 7 cores in parallel, core 0 completing the TSQRT operation on the merged tall-skinny matrix of cores 0 and 4 and the remaining cores completing the TSMQR operation; and in the sixth-stage multi-core parallelism, 3 cores in parallel complete the TSMQR operation.
9. The large-scale matrix QR decomposition parallel computing structure of claim 8, characterized in that: the multi-core parallel second layer performs QR decomposition on the block matrix A_{2,2}, updates A_{2,2}, eliminates to zero the same-column blocks A_{i,2} (i > 2), and updates the same-row blocks A_{2,j} (j > 2); the multi-core parallel third layer performs QR decomposition on the block matrix A_{3,3}, updates A_{3,3}, eliminates to zero the same-column blocks A_{i,3} (i > 3), and updates the same-row blocks A_{3,j} (j > 3); and in the multi-core parallel fourth layer, since only 1 column of blocks remains, it is only necessary to perform QR decomposition on A_{4,4}, update A_{4,4}, and eliminate to zero the same-column blocks A_{i,4} (i > 4).
10. The large-scale matrix QR decomposition parallel computing structure of claim 9, characterized in that: the bottom-level parallel architecture performs parallel processing with single-instruction multiple-data (SIMD) instructions and vector computation instructions; to solve the first diagonal element of the R matrix, the vector dot product $R_1 = \mathbf{a}_1^T \mathbf{a}_1$ is computed, the diagonal element $r_{11} = \sqrt{R_1}$ is computed with a square-root instruction, and the reciprocal $g_1 = 1/r_{11}$ is computed with a dedicated reciprocal instruction; the vector calculation of the other elements of the first row of the R matrix, $C_{1j} = \mathbf{a}_1^T \mathbf{a}_j$, is performed, and the first row of the R matrix is solved as $R_{1j} = C_{1j} \cdot g_1$ (j > 1); the matrix A is updated with the coefficients $h_{1j} = C_{1j}/R_{11}$ (j > 1), the vector calculation $\mathbf{a}_j = \mathbf{a}_j - h_{1j}\,\mathbf{a}_1$ being performed on the column vectors of the matrix A using the intermediate results; the solution of all elements of the first row of the R matrix is thereby completed and the matrix A updated, and these steps are repeated to complete the evaluation of the whole R matrix.
CN202010609939.3A 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system Active CN111858465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010609939.3A CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010609939.3A CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Publications (2)

Publication Number Publication Date
CN111858465A 2020-10-30
CN111858465B CN111858465B (en) 2023-06-06

Family

ID=72989894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010609939.3A Active CN111858465B (en) 2020-06-29 2020-06-29 Large-scale matrix QR decomposition parallel computing system

Country Status (1)

Country Link
CN (1) CN111858465B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488506A (en) * 2020-11-30 2021-03-12 哈尔滨工程大学 Extensible distributed architecture and self-organizing method of intelligent unmanned system cluster
CN112506677A (en) * 2020-12-09 2021-03-16 上海交通大学 TensorFlow distributed matrix calculation implementation method and system
CN112631986A (en) * 2020-12-28 2021-04-09 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device
CN113254078A (en) * 2021-06-23 2021-08-13 北京睿芯高通量科技有限公司 Data stream processing method for efficiently executing matrix addition on GPDPU simulator
CN115952391A (en) * 2022-12-12 2023-04-11 海光信息技术股份有限公司 Data processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193831A (en) * 2010-03-12 2011-09-21 复旦大学 Method for establishing hierarchical mapping/reduction parallel programming model
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CA2801382A1 (en) * 2010-06-29 2012-01-05 Exxonmobil Upstream Research Company Method and system for parallel simulation models
CN107210984A (en) * 2015-01-30 2017-09-26 华为技术有限公司 Method and apparatus for carrying out the parallel operation based on QRD in many execution unit processing systems
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193831A (en) * 2010-03-12 2011-09-21 复旦大学 Method for establishing hierarchical mapping/reduction parallel programming model
CA2801382A1 (en) * 2010-06-29 2012-01-05 Exxonmobil Upstream Research Company Method and system for parallel simulation models
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor
CN107210984A (en) * 2015-01-30 2017-09-26 华为技术有限公司 Method and apparatus for carrying out the parallel operation based on QRD in many execution unit processing systems
WO2019109771A1 (en) * 2017-12-05 2019-06-13 南京南瑞信息通信科技有限公司 Power artificial-intelligence visual-analysis system on basis of multi-core heterogeneous parallel computing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吴荣腾: "多核与多GPU系统下的一种矩阵三角分解并行算法" *
武勇;王俊;张培川;曹运合;: "CUDA架构下外辐射源雷达杂波抑制并行算法" *
穆帅;王晨曦;邓仰东;: "基于GPU的多层次并行QR分解算法研究" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488506A (en) * 2020-11-30 2021-03-12 哈尔滨工程大学 Extensible distributed architecture and self-organizing method of intelligent unmanned system cluster
CN112506677A (en) * 2020-12-09 2021-03-16 上海交通大学 TensorFlow distributed matrix calculation implementation method and system
CN112631986A (en) * 2020-12-28 2021-04-09 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device
CN112631986B (en) * 2020-12-28 2024-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Large-scale DSP parallel computing device
CN113254078A (en) * 2021-06-23 2021-08-13 北京睿芯高通量科技有限公司 Data stream processing method for efficiently executing matrix addition on GPDPU simulator
CN113254078B (en) * 2021-06-23 2024-04-12 北京中科通量科技有限公司 Data stream processing method for efficiently executing matrix addition on GPDPU simulator
CN115952391A (en) * 2022-12-12 2023-04-11 海光信息技术股份有限公司 Data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111858465B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111858465B (en) Large-scale matrix QR decomposition parallel computing system
Ryu et al. Bitblade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
Wang et al. FPDeep: Scalable acceleration of CNN training on deeply-pipelined FPGA clusters
Urquhart et al. Systolic matrix and vector multiplication methods for signal processing
CN109635241B (en) Method for solving symmetric or hermitian symmetric positive definite matrix inverse matrix
CN111199275B (en) System on chip for neural network
CN110361691B (en) Implementation method of coherent source DOA estimation FPGA based on non-uniform array
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN107341133A (en) The dispatching method of Reconfigurable Computation structure based on Arbitrary Dimensions LU Decomposition
CN114781632A (en) Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
Asgari et al. Meissa: Multiplying matrices efficiently in a scalable systolic architecture
US20180373677A1 (en) Apparatus and Methods of Providing Efficient Data Parallelization for Multi-Dimensional FFTs
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
CN116710912A (en) Matrix multiplier and control method thereof
US11886347B2 (en) Large-scale data processing computer architecture
CN113055060B (en) Coarse-grained reconfigurable architecture system for large-scale MIMO signal detection
Wu et al. Accelerator design for vector quantized convolutional neural network
Jiang et al. Prarch: Pattern-based reconfigurable architecture for deep neural network acceleration
Gallivan et al. High-performance architectures for adaptive filtering based on the Gram-Schmidt algorithm
US20230244484A1 (en) Bit-parallel vector composability for neural acceleration
CN113705773B (en) Dynamically reconfigurable PE unit and PE array for graph neural network reasoning
Chen et al. Edge FPGA-based Onsite Neural Network Training
CN112596912B (en) Acceleration operation method and device for convolution calculation of binary or ternary neural network
Miao A Review on Important Issues in GCN Accelerator Design

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant