CN114329327A - Sparse matrix parallel solving method and device based on upper and lower triangular decomposition - Google Patents

Sparse matrix parallel solving method and device based on upper and lower triangular decomposition

Info

Publication number
CN114329327A
CN114329327A (application CN202111532120.2A); granted as CN114329327B
Authority
CN
China
Prior art keywords
matrix
node
nodes
processors
task queue
Prior art date
Legal status
Granted
Application number
CN202111532120.2A
Other languages
Chinese (zh)
Other versions
CN114329327B (en)
Inventor
薛巍
刘侃
刘首文
Current Assignee
Tsinghua University
State Grid Hubei Electric Power Co Ltd
Original Assignee
Tsinghua University
State Grid Hubei Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Tsinghua University, State Grid Hubei Electric Power Co Ltd filed Critical Tsinghua University
Priority to CN202111532120.2A priority Critical patent/CN114329327B/en
Publication of CN114329327A publication Critical patent/CN114329327A/en
Application granted granted Critical
Publication of CN114329327B publication Critical patent/CN114329327B/en
Legal status: Active

Landscapes

  • Multi Processors (AREA)

Abstract

The invention provides a sparse matrix parallel solving method based on upper and lower triangular decomposition, wherein the method is applied to a parallel computing platform, the parallel computing platform comprises a plurality of processors, and the method comprises the following steps: receiving an input matrix and storing non-zero elements of the input matrix column by column; reordering the input matrix to obtain a reordered matrix, wherein in the reordered matrix, the filling of LU decomposition is less than a first quantity, and the absolute value of non-zero elements of diagonal positions is greater than a first threshold; constructing a matrix elimination tree based on the reordering matrix, and constructing a sub-tree and a task queue based on the matrix elimination tree; performing parallel calculation on a plurality of subtrees and a plurality of task queues based on a plurality of processors, and obtaining an output matrix based on a calculation result, wherein the output matrix is an LU decomposition matrix of an input matrix; and obtaining a solution result related to the input matrix based on the output matrix. By the method and the device, the calculation efficiency of the parallel solving process of the sparse matrix based on the upper and lower triangular decomposition is improved.

Description

Sparse matrix parallel solving method and device based on upper and lower triangular decomposition
Technical Field
The invention relates to the technical field of data processing, in particular to a sparse matrix parallel solving method and device based on upper and lower triangular decomposition.
Background
With the development of computer technology, the demand for large-scale scientific computing is gradually increasing. Many fields of scientific computing, such as structural mechanics computing, fluid mechanics computing, etc., require solving large-scale sparse linear equations.
The direct solution of linear equation systems is the method with the highest numerical stability that can also effectively utilize the floating-point computing capability of the processor. This method first solves the upper and lower triangular decomposition (LU Factorization, abbreviated as LU) of the coefficient matrix, and then performs the parallel solution of the sparse matrix based on this decomposition.
Under the scene that the power system solves the current of each branch circuit and the voltage of each node, the parallel solution of the sparse matrix needs to be carried out based on upper and lower triangular decomposition. At present, finding an efficient method for performing parallel solution of sparse matrices based on upper and lower triangular decomposition is an important issue to be urgently solved in the industry.
Disclosure of Invention
The invention provides a sparse matrix parallel solving method and device based on upper and lower triangular decomposition, which can quickly and effectively realize synchronous operation and data sharing among processors by reasonably utilizing communication among the processors, and improve the calculation efficiency of the parallel solving process of sparse matrix based on the upper and lower triangular decomposition.
The invention provides a sparse matrix parallel solving method based on upper and lower triangular decomposition, which is applied to a parallel computing platform, wherein the parallel computing platform comprises a plurality of processors, and the method comprises the following steps: receiving an input matrix and storing non-zero elements of the input matrix column by column; reordering the input matrix to obtain a reordered matrix, wherein, in the reordered matrix, the filling of LU decomposition is smaller than a first number, and the absolute value of the non-zero element of the diagonal position is larger than a first threshold; constructing a matrix elimination tree based on the reordering matrix, and constructing a sub-tree and a task queue based on the matrix elimination tree; performing parallel computation on the plurality of subtrees and the plurality of task queues based on the plurality of processors, and obtaining an output matrix based on the computation result, wherein the output matrix is an LU decomposition matrix related to the input matrix; and obtaining a solving result about the input matrix based on the output matrix.
According to the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the invention, the matrix elimination tree comprises a plurality of nodes, the nodes comprise root nodes and leaf nodes, and constructing subtrees and task queues based on the matrix elimination tree comprises the following steps: dividing the matrix elimination tree into a plurality of subtrees by taking the root node as a starting point and the leaf nodes as an end point, wherein the number of the subtrees is greater than the number of the processors; and determining independent nodes, and constructing the task queue based on the independent nodes, wherein the independent nodes are nodes of the matrix elimination tree that are not assigned to any subtree.
According to the sparse matrix parallel solving method based on upper and lower triangular decomposition provided by the invention, the parallel computation of the plurality of subtrees based on the plurality of processors comprises the following steps: assigning the plurality of subtrees to the plurality of processors based on a greedy algorithm; and performing parallel computations on the plurality of subtrees based on the plurality of processors.
According to the sparse matrix parallel solving method based on upper and lower triangular decomposition provided by the invention, the processor comprises a cache memory, the nodes of the subtree comprise a plurality of child nodes, and the processor calculates the subtree by adopting the following modes: determining a reading mode for reading the child node based on the capacity of the cache memory; calculating the read child nodes based on the processor to obtain node values of the nodes; obtaining a calculation result about the subtree based on the node value.
According to the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the invention, the method for determining the reading mode for reading the child node based on the capacity of the cache memory comprises the following steps: arranging the child nodes in a descending order according to the distance between the child nodes and the nodes of the subtree to obtain a child node queue; sequentially selecting a preset number of child nodes from the child node queue by taking the starting end of the child node queue as a starting point, and reading the preset number of child nodes through the cache memory, wherein the preset number is less than or equal to the capacity of the cache memory; reading, based on the processor, other child nodes in the child node queue except for the preset number of child nodes.
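The cache-aware read-mode choice described above can be sketched in a few lines. This is an illustrative Python model only, not the patent's implementation; the names `split_child_reads`, `dist`, and `cache_capacity` are assumptions introduced here (a real capacity would be the on-chip cache size measured in nodes).

```python
def split_child_reads(children, dist, cache_capacity):
    """Order child nodes by decreasing distance from the subtree's node,
    keep the first `cache_capacity` of them in the cache memory, and let
    the processor read the remaining ones directly."""
    queue = sorted(children, key=lambda c: -dist[c])
    return queue[:cache_capacity], queue[cache_capacity:]

cached, direct = split_child_reads(
    ["a", "b", "c"], {"a": 1, "b": 3, "c": 2}, cache_capacity=2)
# "b" (farthest) and "c" go through the cache; "a" is read directly.
```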
According to the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the invention, the parallel calculation of the plurality of task queues based on the plurality of processors comprises the following steps: determining a scheduling processor and a plurality of executing processors in the processors, wherein the executing processors are other processors except the scheduling processor; based on communication transmission, issuing the task queues to the execution processors through the scheduling processor; and performing parallel computation on a plurality of task queues based on a plurality of execution processors.
According to the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the invention, the execution processor calculates the task queue by adopting the following method, and the method comprises the following steps: determining a computation state of the task queue in real time based on the scheduling processor; responding to the request of the execution processor to acquire the task queue for calculation, and judging the calculation state of the task queue acquired by the request of the execution processor based on the scheduling processor; and calculating the task queue through the execution processor based on the calculation state of the task queue.
According to the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the invention, the calculation state of the task queue comprises the calculation state of a node for determining the calculation result of the task queue, and the calculation of the task queue is performed by the execution processor based on the calculation state of the task queue, and the method comprises the following steps: if the calculation state of the node for determining the calculation result of the task queue is not finished calculation, performing initial calculation on the task queue through the execution processor based on the calculated child nodes in the node; and if the calculation state of the node for determining the calculation result of the task queue is the calculation completion, calculating the task queue through the execution processor based on the node.
According to the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the invention, after the initial calculation is performed on the task queue through the execution processor based on the calculated child nodes in the nodes, the method further comprises the following steps: determining the calculation state of the child node which is not calculated in the nodes in real time based on the scheduling processor; and in response to the calculation state of the sub-node which is not calculated at the current moment being calculated, updating the result of the initial calculation through the execution processor based on the sub-node which is not calculated.
According to the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the invention, the method further comprises the following steps: and based on the plurality of processors, performing parallel computation on the same first subtree and/or the same first task queue, wherein the first subtree is a subtree connected with the root of the matrix elimination tree, and the first task queue is a task queue connected with the root of the matrix elimination tree.
The invention also provides a sparse matrix parallel solving device based on upper and lower triangular decomposition, which is applied to a parallel computing platform, wherein the parallel computing platform comprises a plurality of processors, and the device comprises: the receiving module is used for receiving an input matrix and storing non-zero elements of the input matrix column by column; a reordering module, configured to reorder the input matrix to obtain a reordered matrix, where in the reordered matrix, a filling of LU decomposition is smaller than a first number, and an absolute value of the non-zero element at a diagonal position is larger than a first threshold; the construction module is used for constructing a matrix elimination tree based on the reordering matrix and constructing a sub-tree and a task queue based on the matrix elimination tree; the processing module is used for carrying out parallel computation on the plurality of subtrees and the plurality of task queues based on the plurality of processors and obtaining an output matrix based on a computation result, wherein the output matrix is an LU decomposition matrix related to the input matrix; and the generating module is used for obtaining a solving result related to the input matrix based on the output matrix.
According to the sparse matrix parallel solving device based on the upper and lower triangular decomposition provided by the invention, the matrix elimination tree comprises a plurality of nodes, the nodes comprise root nodes and leaf nodes, and the construction module constructs a sub-tree and a task queue based on the matrix elimination tree by adopting the following modes: dividing the matrix elimination tree into a plurality of subtrees by taking the root node as a starting point and the leaf nodes as an end point, wherein the number of the subtrees is greater than that of the processors; and determining independent nodes, and constructing the task queue based on the independent nodes, wherein the independent nodes are nodes which are not drawn into the subtree in the matrix elimination tree.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the steps of the sparse matrix parallel solving method based on the upper and lower triangular decomposition when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the sparse matrix parallel solution method based on upper and lower triangular decomposition as described in any of the above.
The invention also provides a computer program product, which comprises a computer program, wherein the computer program is used for realizing the steps of the sparse matrix parallel solving method based on the upper and lower triangular decomposition when being executed by a processor.
The sparse matrix parallel solving method and device based on upper and lower triangular decomposition provided by the invention are used for constructing subtrees and task queues related to an input matrix, performing parallel calculation on the subtrees and the task queues through a plurality of processors, and obtaining an output matrix based on calculation results. In the invention, the parallel computation is carried out through the processors, the computation efficiency of the parallel solution of the sparse matrix can be improved, and in the process of carrying out the parallel computation on the subtrees and the task queues through the plurality of processors, the communication mechanism among the processors is utilized to transmit the dependent data of each operation, so that the access and storage overhead can be minimized, and the computation efficiency of the parallel solution process of the sparse matrix based on the upper triangular decomposition and the lower triangular decomposition is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is one of the flow diagrams of the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the present invention;
FIG. 2 is one of the schematic diagrams of LU decomposition of sparse matrix provided by the present invention;
FIG. 3 is a schematic diagram of a process for constructing subtrees and task queues based on a matrix elimination tree according to the present invention;
FIG. 4 is one of the schematic structural diagrams of a matrix elimination tree for LU decomposition of sparse matrices provided by the present invention;
FIG. 5 is a flow chart illustrating a parallel computation of a plurality of subtrees based on a plurality of processors according to the present invention;
FIG. 6 is a flow chart of the processor-based computation of the subtree according to the present invention;
FIG. 7 is a flow chart illustrating a method for determining a read mode for reading a child node according to the capacity of the cache memory;
FIG. 8 is a flow chart illustrating a parallel computation of a plurality of task queues based on a plurality of processors according to the present invention;
FIG. 9 is a flow chart illustrating a process for performing computations on a task queue based on a processor according to the present invention;
FIG. 10 is a schematic structural diagram of a sparse matrix parallel solving device based on upper and lower triangular decomposition according to the present invention;
fig. 11 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the development of computer technology, the demand for large-scale scientific computing is gradually increasing. Many fields of scientific computing, such as structural mechanics computing, fluid mechanics computing, etc., require solving large-scale sparse linear equations.
The direct solution of linear equation systems is the method with the highest numerical stability that can also effectively utilize the floating-point computing capability of the processor. This method first solves the upper and lower triangular decomposition (LU Factorization, abbreviated as LU) of the coefficient matrix, and then performs the parallel solution of the sparse matrix based on this decomposition.
Under the scene that the power system solves the current of each branch circuit and the voltage of each node, the parallel solution of the sparse matrix needs to be carried out based on upper and lower triangular decomposition. At present, finding an efficient method for performing parallel solution of sparse matrices based on upper and lower triangular decomposition is an important issue to be urgently solved in the industry.
In the present invention, for any sparse matrix, the LU decomposition result is as follows: compared with the original matrix, the sparse structures of the L matrix and the U matrix may contain some fill-in, and with a suitable reordering this fill-in causes the output matrix to exhibit locally dense characteristics. Therefore, in LU decomposition of the sparse matrix, a dense sub-block of the sparse matrix is generally regarded as a calculation task, and a solution method of dense linear algebra is used for the dense sub-blocks. When a row or a column is regarded as a task node to construct a matrix elimination tree, the dense sub-blocks correspond to super nodes formed by fusing a plurality of adjacent nodes, and therefore the dense sub-blocks are also called supernodes. In the invention, the matrix elimination tree refers to the elimination tree after node fusion, and a "node" on the matrix elimination tree refers to a supernode formed by a dense sub-block.
The development of high-performance computing technology makes it possible for multiple processors to operate cooperatively and to use parallel units efficiently for computation and memory access. Compared with the powerful computing power provided by multiple processors, the performance of many programs is limited by the speed of accessing system memory. To this end, the processor's own on-chip memory can store a small portion of the data and provide a fast access path. How to effectively use the space of the on-chip memory and reduce the number of main-memory accesses poses a challenge to program developers. High-speed data communication mechanisms often exist between the processors of high-performance computer systems. Taking the domestic new-generation Shenwei (Sunway) many-core processor as an example, the 64 slave cores of a single core group support Remote Memory Access (RMA): each slave core can directly access the on-chip memory of the other slave cores, with access latency and bandwidth far better than directly accessing main memory.
The sparse matrix parallel solving method based on the upper triangular decomposition and the lower triangular decomposition reasonably utilizes a communication mechanism between the processors to quickly and effectively realize synchronous operation and data sharing between the processors. In the process of parallel computing the sub-trees and the task queues based on a plurality of processors, the access and storage overhead is minimized by transmitting the dependent data of each operation by using a communication mechanism among the processors, and the computing efficiency of the parallel solving process of the sparse matrix based on the upper triangular decomposition and the lower triangular decomposition is further improved.
It should be noted that the sparse matrix parallel solving method provided by the invention can be applied to a scene that a power system solves the current of each branch and the voltage of each node. By parallel solving of a sparse matrix related to the power system scene based on upper and lower triangular decomposition, the current of each branch and the voltage of each node in the circuit can be obtained.
The present invention will be described with reference to the following embodiments for a process of parallel solution of sparse matrix based on upper and lower triangular decomposition.
Fig. 1 is one of the flow diagrams of the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the present invention.
In an exemplary embodiment of the invention, the sparse matrix parallel solution based on the upper and lower triangular decomposition may be applied to a parallel computing platform, wherein the parallel computing platform may include a plurality of processors. As shown in fig. 1, the sparse matrix parallel solving method based on the upper and lower triangular decomposition may include steps 110 to 150, which will be described below.
In step 110, an input matrix is received and non-zero elements of the input matrix are stored column by column.
In one embodiment, the input matrix may be understood as a matrix that needs to be solved in parallel based on an upper and lower triangular decomposition. The input matrix may be a sparse matrix. In one embodiment, in the power system solving scenario, the input matrix may be the impedance matrix of the circuit. By parallel solving of a sparse matrix related to the power system scene based on upper and lower triangular decomposition, the current of each branch and the voltage of each node in the circuit can be obtained.
In one embodiment, the input matrix may include an input scale. In one example, the input matrix may be an N × N sparse matrix A. The input matrix A stores its non-zero elements in the memory of the processor in column-by-column order. The case of storing non-zero elements row by row can be regarded as the transpose matrix A^T of the column-by-column storage order; based on the solving method provided by the invention, LU decomposition can also be performed on A^T, thereby achieving the purpose of solving.
In one embodiment, the input matrix A is represented in Compressed Sparse Column (CSC) format. In one example, the input matrix A may include: an array V_A containing all non-zero values, an integer array I_A containing the row index of every non-zero element, and an array P_A giving the position of the first non-zero element of each column. In V_A and I_A, the non-zero elements are arranged in column-by-column order. Here, P_A is a non-decreasing array of non-negative integers, and I_A contains non-negative integers with no repeated indices within the same column of the matrix; no index in I_A exceeds N.
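The CSC layout just described can be illustrated with a minimal Python sketch. It assumes the standard CSC convention (row indices per column, column start offsets); the function name `to_csc` is introduced here for illustration.

```python
def to_csc(dense):
    """Convert a dense matrix into the three CSC arrays described above.

    V_A: non-zero values, stored column by column.
    I_A: row index of each non-zero (no repeats within a column).
    P_A: start offset of each column in V_A/I_A; non-decreasing, length N+1.
    """
    n = len(dense)
    V_A, I_A, P_A = [], [], [0]
    for j in range(n):              # walk columns
        for i in range(n):          # walk rows within column j
            if dense[i][j] != 0:
                V_A.append(dense[i][j])
                I_A.append(i)
        P_A.append(len(V_A))        # column j+1 starts here
    return V_A, I_A, P_A

A = [[4, 0, 0],
     [3, 5, 0],
     [0, 2, 6]]
V_A, I_A, P_A = to_csc(A)
# Column 0 holds values 4, 3 at rows 0, 1; P_A is non-decreasing.
```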
In step 120, the input matrix is reordered to obtain a reordered matrix, wherein the LU decomposition padding in the reordered matrix is less than a first number and the absolute value of the non-zero elements of the diagonal positions is greater than a first threshold.
In one embodiment, the input matrix a may be reordered such that the LU decomposition is filled as little as possible and the absolute values of the diagonal elements are as large as possible. In one example, reordering the input matrix may result in a reordered matrix. Wherein, in the reordering matrix, the LU decomposition is filled in by less than a first amount, and the absolute value of the non-zero elements of the diagonal positions is greater than a first threshold. It is understood that the first number may be adjusted according to actual situations, and is not specifically limited in this embodiment. By defining the population of the LU decomposition to be smaller than the first number, the population of the LU decomposition can be made as small as possible. Accordingly, the first threshold may also be adjusted according to actual situations, and is not specifically limited in this embodiment. By defining that the absolute value of the non-zero elements of the diagonal positions is larger than the first threshold, the absolute value of the diagonal elements can be made as large as possible.
In yet another embodiment, the input matrix may first be rearranged by rows so that the element with the larger absolute value in each column is placed on the diagonal; then a symmetric row-and-column rearrangement is applied to the input matrix so that the fill-in of the LU decomposition is as small as possible. In the application process, sorting the columns in the traversal order of the matrix elimination tree makes it easier for the LU decomposition result to form dense supernodes, and the computation of the subtrees obtains better data locality.
In yet another embodiment, reordering the input matrix to obtain a reordered matrix may be performed by open-source software such as SuiteSparse-AMD, METIS, HSL-MC64, and the like. In one example, the row rearrangement may be performed by HSL-MC64 so that the absolute values of the diagonal elements are as large as possible; then a symmetric row-and-column rearrangement is performed by METIS so that the fill-in of the LU decomposition is as small as possible.
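To make the row-rearrangement idea concrete, the sketch below is a toy stand-in for the MC64-style step described above: it greedily places a large-magnitude entry of each column on the diagonal. Real implementations such as HSL-MC64 solve a weighted bipartite matching instead; the function name and greedy strategy here are illustrative assumptions, not the library's algorithm.

```python
def greedy_row_permutation(A):
    """Greedily pick, for each column, an unused row holding the largest
    remaining absolute value, so that large entries land on the diagonal.
    Returns the permuted matrix and perm, where perm[j] is the original
    row placed at row j."""
    n = len(A)
    used = set()
    perm = [0] * n
    for j in range(n):
        best = max((i for i in range(n) if i not in used),
                   key=lambda i: abs(A[i][j]))
        perm[j] = best
        used.add(best)
    return [A[perm[j]] for j in range(n)], perm

A = [[0.1, 9.0],
     [7.0, 0.2]]
B, perm = greedy_row_permutation(A)
# Rows are swapped so that |B[0][0]| and |B[1][1]| are the large entries.
```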
At step 130, a matrix-elimination tree is constructed based on the reordering matrix, and a sub-tree and a task queue are constructed based on the matrix-elimination tree.
In one embodiment, the matrix structure of the reordered matrix may be analyzed and a matrix elimination tree may be constructed based on the matrix structure of the reordered matrix. Further, based on the matrix elimination tree, a sparse structure of subtrees, task queues, and output matrices may be constructed. Each node in the matrix elimination tree can be regarded as a calculation task, and the task depends on the child nodes of the node, so that the calculation process of each node can be started from the leaf node to the root node. It may be noted that each node in the matrix elimination tree may be a supernode corresponding to a dense sub-block.
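As a reference for the construction step above, the textbook elimination-tree algorithm (Liu's ancestor path-compression scheme) is sketched below for a matrix with symmetric non-zero structure. This is a generic sketch of the structure the description builds on, not the patent's exact supernode-fused construction.

```python
def elimination_tree(upper):
    """Build the elimination tree. `upper[j]` lists the row indices i < j
    of the non-zeros in column j; parent[j] == -1 marks a root."""
    n = len(upper)
    parent = [-1] * n
    ancestor = [-1] * n
    for j in range(n):
        for i in upper[j]:
            r = i
            # climb the ancestor path from i, compressing it toward j
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j
                r = nxt
            if ancestor[r] == -1:
                ancestor[r] = j
                parent[r] = j
    return parent

# Small example: column 2 depends on row 0, column 3 on rows 1 and 2,
# so nodes 0 -> 2 -> 3 and 1 -> 3, with node 3 as the root.
parents = elimination_tree([[], [], [0], [1, 2]])
```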
In step 140, a plurality of subtrees and a plurality of task queues are computed in parallel based on the plurality of processors, and an output matrix is obtained based on the computation results. Wherein the output matrix is a LU decomposition matrix with respect to the input matrix.
In one embodiment, multiple processors may be initiated to perform computations on multiple subtrees and multiple task queues in parallel, and an output matrix is derived based on the computation. Wherein the output matrix is a LU decomposition matrix with respect to the input matrix. It is understood that in the scenario of solving the power system problem, the output matrix is an LU decomposition matrix regarding the impedance of the circuit, and further, based on the LU decomposition matrix regarding the impedance of the circuit, the branch currents and node voltages in the circuit can be solved. In one example, all processors may be started to execute, one processor per thread binding. For example, P processors may be started and P threads bound. In the application process, in the process of performing parallel computation on the subtree and the task queue through a plurality of processors, a communication mechanism among the processors can be utilized to transmit the dependent data of each operation in the subtree and the task queue so as to minimize the access and storage overhead, and further improve the computation efficiency of the parallel solution process of performing the sparse matrix based on the upper triangular decomposition and the lower triangular decomposition.
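The greedy subtree-to-processor assignment mentioned above can be sketched as a load-balancing heuristic: repeatedly hand the heaviest remaining subtree to the least-loaded processor. The function name and per-subtree cost estimates are assumptions for illustration; the patent does not specify this exact heuristic.

```python
import heapq

def schedule_subtrees(costs, n_workers):
    """Greedy assignment of subtrees to processors: heaviest subtree first,
    always onto the currently least-loaded worker. `costs[t]` is the
    estimated workload of subtree t; returns one task list per worker."""
    heap = [(0.0, w) for w in range(n_workers)]   # (load, worker)
    heapq.heapify(heap)
    plan = [[] for _ in range(n_workers)]
    for t in sorted(range(len(costs)), key=lambda t: -costs[t]):
        load, w = heapq.heappop(heap)
        plan[w].append(t)
        heapq.heappush(heap, (load + costs[t], w))
    return plan

plan = schedule_subtrees([5, 3, 3, 2, 1], 2)
# Each worker (e.g. a thread bound to one processor) would then factor its
# subtrees in parallel before moving on to the shared task queue.
```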
As can be seen from fig. 2, the 12 × 12 input matrix (left image) is divided into 5 supernodes after LU decomposition (right image), i.e. 5 tasks. The small squares in the input matrix represent the non-zero elements, and the non-zero elements of the input matrix (left image) differ from those of the LU decomposition (right image). It is understood that the right diagram in fig. 2 is the LU decomposition matrix obtained by decomposing after reordering. In this embodiment, the L matrix and the U matrix are represented by 10 arrays in total: the starting column S of each supernode; the supernode number C to which each column belongs; the non-zero values V_L of L; the start position P_VL of each supernode in V_L; the non-zero row indices I_L of each supernode in L; the start position P_IL of each supernode in I_L; the non-zero values V_U of U; the start position G of each segment in V_U; the row index I_U of each segment; and the starting segment position P_U of each column. For convenience of representation, the diagonal blocks are not separated into upper and lower triangular portions but are stored in V_L, while V_U stores only the off-diagonal block portion of U; each column of U contains a number of non-zero segments, one for each diagonal block.
At step 150, based on the output matrix, a solution is obtained for the input matrix.
In one embodiment, the solution result for the input matrix may be obtained based on the output matrix, i.e., the LU decomposition matrix of the input matrix. In an example, in the scenario of solving a power system problem, the output matrix is the LU decomposition matrix of the circuit impedance matrix; from it, the branch currents and node voltages corresponding to the impedance in the circuit can be solved.
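Obtaining the solution from the LU factors is a forward substitution with L followed by a backward substitution with U. A dense sketch for illustration (the patent operates on sparse supernodal storage instead):

```python
import numpy as np

def lu_solve(L, U, b):
    """Solve A x = b given A = L U: forward substitution L y = b,
    then backward substitution U x = y."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                      # forward: rows top to bottom
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):          # backward: rows bottom to top
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```

Because both triangular solves are cheap compared with the factorization, the LU factors can be reused for many right-hand sides b.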
The sparse matrix parallel solving method and device based on upper and lower triangular decomposition provided by the invention construct subtrees and task queues for the input matrix, perform parallel computation on the subtrees and task queues through multiple processors, and obtain the output matrix from the computation results. In the invention, parallel computation by multiple processors improves the efficiency of the parallel sparse-matrix solution; moreover, during the parallel computation of the subtrees and task queues, the inter-processor communication mechanism is used to transmit the dependency data of each operation, minimizing memory-access overhead and further improving the computation efficiency of the parallel solution process based on upper and lower triangular decomposition.
The present invention will be described with reference to the following embodiments for constructing subtrees and task queues based on a matrix elimination tree.
FIG. 3 is a schematic diagram of a process for constructing subtrees and task queues based on a matrix elimination tree according to the present invention.
In an exemplary embodiment of the invention, the matrix elimination tree may include a plurality of nodes, where each node is considered a computational task. The nodes include at least a root node and leaf nodes, and a node may have child nodes. It will be appreciated that the computational task corresponding to a node depends on the child nodes of that node.
As shown in FIG. 3, constructing the subtree and the task queue based on the matrix elimination tree may include steps 310 and 320, which are described separately below.
In step 310, the matrix elimination tree is divided into a plurality of sub-trees with the root node as a starting point and the leaf nodes as an end point, wherein the number of sub-trees is greater than the number of processors.
In one embodiment, starting from the root node of the matrix elimination tree and moving toward the leaf nodes, bifurcation points may be found to divide the tree into multiple subtrees, where the number of subtrees may be greater than the number of processors. In this way each processor is guaranteed a subtree to process, the processors are utilized reasonably, and the solving efficiency is improved.
It will be appreciated that each node in the matrix elimination tree may be considered a computational task that depends on its child nodes; thus the calculation proceeds from the leaf nodes up to the root node. In one embodiment, before the subtrees are divided, the weight of a subtree may be determined according to its amount of computation: the larger the computation, the larger the weight. In the initial state of the partition, the entire matrix elimination tree is considered one subtree. Starting from the root node of the matrix elimination tree, a bifurcation point is found and new subtrees are generated; then, starting from the root node of the currently heaviest subtree, bifurcation points are searched for repeatedly until s × P subtrees are generated (s is a constant greater than 1, and P is the number of processors).
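The splitting loop can be sketched by keeping the current subtrees in a max-heap keyed by weight; the split-off roots are exactly the nodes left for the global task queue (the helper names and the recursive weight computation are illustrative):

```python
import heapq

def partition(children, weight, root, num_procs, s=2):
    """Repeatedly split the heaviest subtree at its root until at least
    s * num_procs subtrees exist; split-off roots join the task queue."""
    def subtree_weight(v):
        return weight[v] + sum(subtree_weight(c) for c in children[v])

    heap = [(-subtree_weight(root), root)]   # max-heap via negated weights
    queue_nodes = []                         # nodes left outside any subtree
    while len(heap) < s * num_procs:
        w, v = heapq.heappop(heap)           # currently heaviest subtree
        if not children[v]:                  # a leaf cannot be split further
            heapq.heappush(heap, (w, v))
            break
        queue_nodes.append(v)                # v goes to the global task queue
        for c in children[v]:
            heapq.heappush(heap, (-subtree_weight(c), c))
    return [v for _, v in heap], queue_nodes
```

Note that `queue_nodes` is produced root-downward here; at execution time the global task queue is processed bottom-to-top, as the patent describes.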
In step 320, independent nodes are determined and a task queue is constructed based on the independent nodes. And the independent nodes are nodes which are not divided into subtrees in the matrix elimination tree.
In one embodiment, after the subtrees are divided, the parts not assigned to any subtree are added to a global task queue for dynamic scheduling; the queue order may be bottom-to-top. As can be seen from fig. 4, the matrix elimination tree there contains 18 calculation tasks, divided into 4 subtrees and 4 task queues: calculation tasks 1 to 3 form subtree 1; calculation tasks 4 to 7 form subtree 2; calculation tasks 9 to 11 form subtree 3; calculation tasks 13 to 16 form subtree 4. Calculation tasks 8, 12, 17 and 18 each constitute a task queue.
The present invention will be described with reference to the following embodiments for a process in which multiple processors perform parallel computations on multiple subtrees.
FIG. 5 is a flow chart illustrating a parallel computation of a plurality of subtrees based on a plurality of processors according to the present invention.
In an exemplary embodiment of the present invention, as shown in FIG. 5, performing parallel computation on a sub-tree based on multiple processors may include steps 510 and 520, which are described separately below.
In step 510, a plurality of subtrees are assigned to a plurality of processors based on a greedy algorithm.
In step 520, a plurality of subtrees are computed in parallel based on the plurality of processors.
In one embodiment, the description continues with the example of generating s × P subtrees described above. In the application process, a greedy algorithm can be used to allocate the s × P subtrees to P processors for parallel processing; that is, the heaviest of the remaining subtrees is repeatedly allocated to the processor core with the lightest current load. In this way each processor is guaranteed a subtree to process, the processors are utilized reasonably, and the solving efficiency is improved.
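The greedy allocation — heaviest remaining subtree to the currently least-loaded processor — can be sketched as:

```python
import heapq

def greedy_assign(subtree_weights, num_procs):
    """Assign each subtree (given by its weight) to a processor so that the
    heaviest remaining subtree always goes to the least-loaded processor."""
    loads = [(0, p) for p in range(num_procs)]      # (current load, processor)
    heapq.heapify(loads)
    assignment = {}
    for tree, w in sorted(enumerate(subtree_weights), key=lambda t: -t[1]):
        load, p = heapq.heappop(loads)              # lightest processor so far
        assignment[tree] = p
        heapq.heappush(loads, (load + w, p))
    return assignment
```

With s > 1 there are more subtrees than processors, so this longest-processing-time-first heuristic keeps the per-processor loads close to balanced.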
The present invention will be described with reference to the following embodiments for a process of computing a subtree based on a processor.
FIG. 6 is a flow chart illustrating calculation of a sub-tree based on a processor according to the present invention.
In an exemplary embodiment of the invention, the processor may include a cache memory, and the nodes of the subtree may include a plurality of child nodes. Computing the sub-tree based on the processor may include steps 610 through 630, each of which is described separately below.
In step 610, a read mode to read the child node is determined based on the capacity of the cache.
In step 620, the read child nodes are calculated based on the processor, and node values about the nodes are obtained.
In step 630, based on the node value, a calculation result on the subtree is obtained.
In one embodiment, the reading mode for reading the child node may be determined according to the capacity of the cache memory, and the node value of the node may be obtained based on the calculation of the read child node by the processor. Further, based on the node values, a calculation result about the subtree can be obtained, and then a solution result about the input matrix can be obtained. It should be noted that, the reading mode for reading the child node is determined according to the capacity of the cache, so that the child node can be ensured to be read in a reasonable and effective manner, the efficiency of data reading is improved, and the solution efficiency is further improved.
To further illustrate the present invention, a process for determining a read mode for reading a child node based on the capacity of the cache will be described.
FIG. 7 is a flow chart illustrating a method for determining a read mode for reading a child node according to the capacity of the cache memory.
In an exemplary embodiment of the present invention, as shown in fig. 7, determining a read mode for reading the child node based on the capacity of the cache memory may include steps 710 to 730, which will be described separately below.
In step 710, the child nodes are sorted in descending order according to the distance from the node of the sub-tree, and a child node queue is obtained.
In an embodiment, the child nodes may be arranged in descending order according to the distance from the node of the sub-tree, so as to obtain the child node queue. It will be appreciated that the nodes of the subtree are root nodes of the child nodes. In one example, children nodes closer to the node may be derived from children nodes further from the node.
In step 720, a predetermined number of child nodes are sequentially selected from the child node queue with the start of the child node queue as the starting point, and the predetermined number of child nodes are read through the cache memory. Wherein the predetermined number is less than or equal to the capacity of the cache memory.
In step 730, based on the processor, other child nodes in the child node queue except for the preset number of child nodes are read.
In one embodiment, each processor, when computing a node in a subtree, may record the amount of data of the nodes computed before it. Further, depending on the capacity of the cache memory, the most recently computed nodes may be read through the cache; it can be understood that these are the preset number of child nodes selected in order from the start of the child-node queue. When reading a far node, a memory-access mode that avoids evicting the existing data from the cache is selected as far as possible according to the processor's memory-access mechanism; the far nodes correspond to the child nodes in the queue other than the preset number. In this way the cache memory is fully utilized for data reading, memory-access overhead is reduced, and the utilization of computing resources is improved.
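Following steps 710 to 730 literally, the read plan simply splits the ordered child-node queue at the cache capacity (a sketch; capacity is counted in nodes here as a simplification of the real byte-level accounting):

```python
def plan_reads(child_queue, cache_capacity):
    """Split the child-node queue: the first `cache_capacity` nodes are read
    through the cache, the rest are read in a mode that avoids evicting them."""
    cached = child_queue[:cache_capacity]     # fits in the cache
    streamed = child_queue[cache_capacity:]   # read without polluting the cache
    return cached, streamed
```

The point of the split is that the cached portion will be re-read cheaply during the node update, while the streamed portion is touched once and must not displace it.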
The present invention will be described with reference to the following embodiments for a process of performing parallel computation on a plurality of task queues based on a plurality of processors.
FIG. 8 is a flow chart illustrating parallel computation of a plurality of task queues based on a plurality of processors according to the present invention.
In an exemplary embodiment of the present invention, as shown in fig. 8, performing parallel computation on a plurality of task queues based on a plurality of processors may include steps 810 to 830, which will be described separately below.
In step 810, a scheduling processor and a plurality of execution processors are determined in the processor, wherein the execution processors are other processors except the scheduling processor.
In step 820, a plurality of task queues are issued by the scheduling processor to the plurality of execution processors based on the communication transmission.
In step 830, a plurality of task queues are computed in parallel based on a plurality of execution processors.
In one embodiment, one processor may act as a scheduler and the other processor as an executor in performing task queue computations. In an example, a scheduling processor and a plurality of execution processors may be determined in the processor, wherein the execution processors are other processors than the scheduling processor. Further, the scheduling processor may receive a scheduling request of the execution processor through a communication transmission mechanism between the processors, allocate the task queue in the global task queue to the execution processor, and send a response through inter-processor communication until all the task queues in the global task queue are allocated. In an example, multiple task queues may be computed in parallel based on multiple execution processors. In this embodiment, through fast communication transmission between the processors, the task queue may be issued to a plurality of execution processors, so as to implement parallel processing on the task queue. Because the communication transmission between the processors has high efficiency, the efficiency of issuing the task queue can be improved, and the solving efficiency is further improved.
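The request/response interaction between the scheduling processor and the execution processors can be mimicked on a shared-memory machine with one scheduler thread and message queues standing in for the inter-processor communication mechanism (a sketch; the patent targets processor-level communication, not Python threads):

```python
import queue
import threading

def run_task_queues(tasks, num_executors):
    """One scheduler thread answers requests from executor threads and hands
    out tasks until the global task queue is empty."""
    requests = queue.Queue()
    results = queue.Queue()
    pending = list(tasks)                 # the global task queue

    def scheduler():
        stopped = 0
        while stopped < num_executors:
            reply = requests.get()        # a scheduling request arrives
            if pending:
                reply.put(pending.pop(0)) # respond with the next task
            else:
                reply.put(None)           # nothing left: executor may stop
                stopped += 1

    def executor():
        reply = queue.Queue()
        while True:
            requests.put(reply)           # ask the scheduler for work
            task = reply.get()
            if task is None:
                break
            results.put(task())           # compute the received task

    threads = [threading.Thread(target=scheduler)]
    threads += [threading.Thread(target=executor) for _ in range(num_executors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

Each executor pulls work as soon as it is free, so uneven task sizes are balanced dynamically rather than by a static assignment.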
The present invention will be described with reference to the following embodiments for a process of computing a task queue based on a processor.
FIG. 9 is a flow chart illustrating a process for performing computations on a task queue based on a processor according to the present invention.
In an exemplary embodiment of the present invention, as shown in fig. 9, the processor-based calculation of the task queue may include steps 910 to 930, which will be described separately below.
In step 910, a computational state of the task queue is determined in real-time based on the scheduling processor.
In one embodiment, the computational state of the subtree and the computational state of the task queue may be determined in real-time by the scheduling processor. Wherein the computation state of the subtree may include the computation states of the nodes on which the subtree depends. The computation state of the task queue may include the computation state of each subtree on which the task queue depends.
In yet another embodiment, each execution processor, while computing a subtree, may maintain the computation state of that subtree. The computation state may indicate the number fb_tree of the node currently being computed. Nodes with numbers before fb_tree may be understood as already computed and available for other execution processors to read as dependency data when computing other nodes.
In yet another embodiment, the scheduling processor may maintain a series of states, including the number fb_queue of the first uncompleted task in the task queue and, for each node, the number nb_child of uncompleted tasks among its child nodes.
In step 920, in response to the execution processor requesting to acquire the task queue for calculation, a calculation state of the task queue requested to acquire by the execution processor is determined based on the scheduling processor.
In step 930, a calculation is performed on the task queue by the execution processor based on the calculation state of the task queue.
In one embodiment, in response to an execution processor requesting a task queue for computation, the scheduling processor may determine the computation state of the requested task queue, and the execution processor computes the task queue according to that state. In one example, upon receiving such a request, the scheduling processor may treat the previous task executed by that execution processor as completed, and it must update the computation-state information it maintains accordingly. While the scheduling processor maintains the computation states of the task queues, other execution processors may still be computing subtrees. Since a subtree can be dependency data for a task-queue computation, the scheduling processor needs to check and update the subtrees' computation states in time when handling requests related to task-queue computation, so that the corresponding task queues can be computed correctly.
The present invention will be described with reference to the following embodiments, which are used to describe a process of calculating a task queue by an execution processor based on a calculation state of the task queue.
In an exemplary embodiment of the present invention, the computation state of the task queue may include a computation state of a node for determining a computation result of the task queue. It will be appreciated that the computation of the task queue is dependent on the node, wherein the computation state of the task queue may comprise the computation state for that node, wherein the computation of the node may in turn be dependent on the child node corresponding to that node, and thus the computation state of the node is in turn dependent on the computation state of the corresponding child node. Based on the computation state of the task queue, computing the task queue by the execution processor can be implemented by: if the calculation state of the node for determining the calculation result of the task queue is unfinished calculation, performing initial calculation on the task queue through an execution processor based on the calculated child nodes in the node; and if the calculation state of the node for determining the calculation result of the task queue is the calculation completion, calculating the task queue by the execution processor based on the node.
In one embodiment, upon receiving a response from the scheduling processor, the execution processor computes the task queue it was issued. When computing a node associated with the task queue, the execution processor needs to update the current node with its child nodes. If nb_child is non-zero, some child nodes have not yet completed computation. In that case, the fb_tree value of each execution processor and the fb_queue value of the scheduling processor are used to judge which dependency nodes have already been computed, and the computation result of the task queue is updated with those nodes, i.e., the task queue is initially computed. The execution processor then waits for nb_child to reach zero and finishes updating the computation result with the previously unavailable nodes, i.e., the initial computation is updated to obtain the final result. After all updates are finished, the intra-node partial solve is performed to obtain the non-zero elements of the LU decomposition matrix, thereby determining the LU decomposition matrix. Further, based on the LU decomposition matrix, the solution result for the input matrix is obtained.
It should be noted that, during the internal decomposition of each node, column pivoting may be performed inside the node. Since the corresponding block of the node in the U matrix is already treated as dense, row exchange does not change its non-zero structure. When the pivot selected inside the node cannot guarantee numerical stability, any pivot whose absolute value is smaller than the threshold τ = ε · ||A||_1 (where ε represents the floating-point error due to machine precision, and ||A||_1 denotes the 1-norm of the matrix A) is replaced by τ by means of pivot perturbation. Within a certain error, this avoids fatal failures such as division by zero, and the error introduced by the pivot perturbation can be eliminated by iterative refinement.
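The pivot-perturbation rule amounts to clamping any too-small pivot to a threshold τ while preserving its sign. A sketch, assuming τ = ε · ||A||_1 (machine epsilon times the 1-norm of A):

```python
import numpy as np

def perturb_pivot(pivot, norm1_A, eps=np.finfo(float).eps):
    """Replace a pivot whose magnitude is below tau = eps * ||A||_1 by tau,
    preserving the pivot's sign, so elimination never divides by ~zero."""
    tau = eps * norm1_A
    if abs(pivot) < tau:
        return tau if pivot >= 0 else -tau
    return pivot
```

The perturbation keeps the factorization running at the cost of a small, bounded error in the factors, which is why a few steps of iterative refinement on the solution are applied afterwards.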
In another embodiment, if the computation status of the node for determining the computation result of the task queue is that the computation is completed, the final computation result of the task queue may be obtained by performing the computation on the task queue by the execution processor based on the node directly. Further, the intra-node partial solution is performed to obtain non-zero elements of the LU decomposition matrix, so as to determine the LU decomposition matrix. And obtaining a solution result about the input matrix based on the LU decomposition matrix.
In an exemplary embodiment of the invention, after performing initial computation on a task queue by an execution processor based on computed child nodes in a node, the method further includes: determining the calculation state of the child node which is not calculated in the node in real time based on the scheduling processor; and in response to the calculation state of the sub-node which is not calculated at the current moment being calculated, updating the result of the initial calculation through the execution processor based on the sub-node which is not calculated.
In one embodiment, the computation state of the child nodes that have not completed computation may be determined in real time by the scheduling processor; it is understood that when the scheduling processor finds nb_child has reached zero, the previously uncompleted child nodes have now finished computing. At that point those child nodes can be used to finish updating the computation result of the task queue, i.e., the initial computation of the task queue is updated to obtain the final result. In this embodiment, the execution processor first updates the task queue's result with the already-computed nodes and then completes the update as the remaining dependency nodes finish, according to their computation states. In this way the computation tasks of the execution processors are allocated reasonably, and their computation load and computation time are optimized.
To further describe the sparse matrix parallel solving method based on the upper and lower triangular decomposition provided by the present invention, the following embodiments will be described below.
In an exemplary embodiment of the present invention, the method for solving the sparse matrix based on the upper and lower triangular decomposition in parallel may further include the following steps in addition to the aforementioned steps 110 to 150: based on the multiple processors, parallel computation is performed on the same first subtree and/or the same first task queue, wherein the first subtree is a subtree connected with the root of the matrix elimination tree, and the first task queue is a task queue connected with the root of the matrix elimination tree.
In one embodiment, subtrees or task queues near the root of the matrix elimination tree, such as the first subtree or first task queue, have less parallelism. The scheduling processor may allocate the same first subtree or first task queue to multiple execution processors for parallel computation, according to the amount of data in its child nodes. The execution processors may divide the data by columns, each updating its own portion, and perform parallel decomposition via processor communication in the decomposition stage within the node (corresponding to the first subtree or first task queue). It will be appreciated that the group of execution processors can thus compute the result of the same subtree or task queue simultaneously. The next time an execution processor of the group issues a scheduling request, the scheduling processor treats only the last execution processor of the group as having completed the previous task; that is, it updates the computation state for that stage only upon receiving the next request from the last execution processor of the group.
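Dividing a root-level node's data by columns among the group of execution processors can be as simple as a balanced contiguous block partition (a sketch; the real split would follow the supernode's column layout):

```python
def split_columns(num_cols, num_procs):
    """Partition columns 0..num_cols-1 into num_procs contiguous blocks whose
    sizes differ by at most one, one block per execution processor."""
    base, extra = divmod(num_cols, num_procs)
    blocks, start = [], 0
    for p in range(num_procs):
        width = base + (1 if p < extra else 0)  # spread the remainder evenly
        blocks.append(list(range(start, start + width)))
        start += width
    return blocks
```

Contiguous column blocks keep each processor's writes local, so only the intra-node decomposition stage requires communication between the group's processors.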
In the invention, the dependent data of each operation is transmitted based on the rapid communication transmission among the processors, so that the synchronization and memory access overhead can be minimized, and the resource utilization rate can be maximized.
According to the description above, the sparse matrix parallel solving method and device based on upper and lower triangular decomposition provided by the invention construct subtrees and task queues for the input matrix, perform parallel computation on them through multiple processors, and obtain the output matrix from the computation results. In the invention, parallel computation by multiple processors improves the efficiency of the parallel sparse-matrix solution; moreover, during the parallel computation of the subtrees and task queues, the inter-processor communication mechanism is used to transmit the dependency data of each operation, minimizing memory-access overhead and further improving the computation efficiency of the parallel solution process based on upper and lower triangular decomposition.
Based on the same conception, the invention also provides a sparse matrix parallel solving device based on upper and lower triangular decomposition.
The following describes the sparse matrix parallel solving device based on the upper and lower triangular decomposition provided by the present invention, and the sparse matrix parallel solving device based on the upper and lower triangular decomposition described below and the sparse matrix parallel solving method based on the upper and lower triangular decomposition described above can be referred to correspondingly.
Fig. 10 is a schematic structural diagram of a sparse matrix parallel solving device based on upper and lower triangular decomposition according to the present invention.
In an exemplary embodiment of the present invention, the sparse matrix parallel solving apparatus based on the upper and lower triangular decomposition may be applied to a parallel computing platform, wherein the parallel computing platform may include a plurality of processors. The apparatus for solving the sparse matrix parallel based on the upper and lower triangular decomposition may include a receiving module 1010, a reordering module 1020, a constructing module 1030, a processing module 1040, and a generating module 1050. Each module will be described separately below.
The receiving module 1010 may be configured to receive an input matrix and store non-zero elements of the input matrix column by column.
The reordering module 1020 may be configured for reordering the input matrix resulting in a reordered matrix, wherein in the reordered matrix the LU decomposition is filled to less than a first number and the absolute value of the non-zero elements of the diagonal positions is greater than a first threshold.
The construction module 1030 may be configured for constructing a matrix elimination tree based on the reordering matrix and constructing a sub-tree and a task queue based on the matrix elimination tree.
The processing module 1040 may be configured to perform parallel computation on a plurality of subtrees and a plurality of task queues based on a plurality of processors, and obtain an output matrix based on the computation result, where the output matrix is an LU decomposition matrix with respect to an input matrix.
The generation module 1050 may be configured to derive a solution result for the input matrix based on the output matrix.
In an exemplary embodiment of the invention, the matrix elimination tree may include a plurality of nodes, the nodes may include a root node and a leaf node, and the constructing module 1030 may construct the sub-tree and the task queue based on the matrix elimination tree in the following manner: dividing the matrix elimination tree into a plurality of sub-trees by taking the root node as a starting point and the leaf nodes as an end point, wherein the number of the sub-trees is greater than that of the processors; and determining independent nodes, and constructing a task queue based on the independent nodes, wherein the independent nodes are nodes which are not divided into subtrees in the matrix elimination tree.
In an exemplary embodiment of the invention, the processing module 1040 may perform parallel computations on the subtrees based on the processors in the following manner: assigning a plurality of subtrees to a plurality of processors based on a greedy algorithm; the plurality of subtrees are computed in parallel based on the plurality of processors.
In an exemplary embodiment of the invention, the processor may include a cache memory, the nodes of the subtree may include a plurality of child nodes, and the processing module 1040 may compute the subtree by the processor in the following manner:
determining a reading mode for reading the child node based on the capacity of the cache memory; calculating the read child nodes based on the processor to obtain node values of the nodes; and obtaining a calculation result about the subtree based on the node value.
In an exemplary embodiment of the invention, the processing module 1040 may determine the read mode of the read child node based on the capacity of the cache memory in the following manner:
arranging the child nodes in a descending order according to the distance between the child nodes and the nodes of the subtree to obtain a child node queue; sequentially selecting a preset number of child nodes from the child node queue by taking the starting end of the child node queue as a starting point, and reading the preset number of child nodes through a high-speed buffer memory, wherein the preset number is less than or equal to the capacity of the high-speed buffer memory; and reading other child nodes except the preset number of child nodes in the child node queue based on the processor.
In an exemplary embodiment of the invention, the processing module 1040 may perform parallel computation on a plurality of task queues based on a plurality of processors in the following manner:
determining a scheduling processor and a plurality of executing processors in the processor, wherein the executing processors are other processors except the scheduling processor; based on communication transmission, a plurality of task queues are issued to a plurality of execution processors through a scheduling processor; and performing parallel computation on a plurality of task queues based on a plurality of execution processors.
In an exemplary embodiment of the invention, the processing module 1040 may compute the task queue by executing the processor in the following manner: determining the calculation state of the task queue in real time based on a scheduling processor; responding to the request of the execution processor to acquire the task queue for calculation, and judging the calculation state of the task queue acquired by the request of the execution processor based on the scheduling processor; the task queue is computed by the execution processor based on a computation state of the task queue.
In an exemplary embodiment of the invention, the calculation state of the task queue may include a calculation state of a node for determining a calculation result of the task queue, and the processing module 1040 may calculate the task queue by the execution processor based on the calculation state of the task queue in the following manner:
if the calculation state of the node for determining the calculation result of the task queue is unfinished calculation, performing initial calculation on the task queue through an execution processor based on the calculated child nodes in the node; and if the calculation state of the node for determining the calculation result of the task queue is the calculation completion, calculating the task queue by the execution processor based on the node.
In an exemplary embodiment of the present invention, the processing module 1040 may be further configured to determine, in real time and with the scheduling processor, the computation state of the child nodes of the node that have not yet been computed; and, in response to a previously uncomputed child node reaching the computation-completed state at the current moment, to update the result of the initial computation with the execution processor based on that child node.
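The two-phase computation described above — an initial computation from the already-finished child nodes, followed by an update once the remaining children complete — can be sketched as follows. The additive combination and all names are illustrative stand-ins for the real numeric updates of the LU factorization, which the patent does not spell out at this level.

```python
def initial_compute(node_base, child_values, completed):
    """Accumulate contributions from children that have already finished.

    Returns the partial result and the list of children still owed, so
    the executor can start without waiting for every child.
    """
    partial = node_base
    remaining = []
    for child, value in child_values.items():
        if child in completed:
            partial += value
        else:
            remaining.append(child)
    return partial, remaining

def update_compute(partial, child_values, newly_completed):
    """Fold in children that completed after the initial computation."""
    for child in newly_completed:
        partial += child_values[child]
    return partial
```

For example, with children `a`, `b`, `c` of which only `a` is finished, the initial computation folds in `a` alone; the later update folds in `b` and `c` once the scheduling processor reports them complete.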
In an exemplary embodiment of the invention, the processing module 1040 may be further configured to perform parallel computation on the same first subtree and/or the same first task queue with the plurality of processors, wherein the first subtree is a subtree connected to the root of the matrix elimination tree and the first task queue is a task queue connected to the root of the matrix elimination tree.
Fig. 11 illustrates a physical structure diagram of an electronic device. As shown in Fig. 11, the electronic device may include: a processor (processor) 1110, a communication interface (Communications Interface) 1120, a memory (memory) 1130, and a communication bus 1140, wherein the processor 1110, the communication interface 1120, and the memory 1130 communicate with each other via the communication bus 1140. The processor 1110 may call logic instructions in the memory 1130 to perform the sparse matrix parallel solving method based on upper and lower triangular decomposition, wherein the method is applied to a parallel computing platform comprising a plurality of processors, and includes: receiving an input matrix and storing the non-zero elements of the input matrix column by column; reordering the input matrix to obtain a reordered matrix, wherein, in the reordered matrix, the fill-in of the LU decomposition is smaller than a first number and the absolute value of each non-zero element at a diagonal position is larger than a first threshold; constructing a matrix elimination tree based on the reordered matrix, and constructing subtrees and task queues based on the matrix elimination tree; performing parallel computation on the subtrees and the task queues with the plurality of processors, and obtaining an output matrix based on the computation results, wherein the output matrix is the LU decomposition of the input matrix; and obtaining a solution result for the input matrix based on the output matrix.
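The claimed method targets a multi-processor platform with elimination-tree scheduling; purely as a sequential point of reference, the end-to-end pipeline restated above (column-wise non-zero storage, LU decomposition, forward and back substitution) can be sketched in plain Python. The function names and the dense toy matrix are illustrative, not from the patent, and the fill-reducing, diagonal-strengthening reordering step is assumed to have already been applied (no pivoting is done here).

```python
def to_csc(dense):
    """Store the non-zero elements column by column, as (row, value) pairs."""
    n_rows, n_cols = len(dense), len(dense[0])
    return [[(i, dense[i][j]) for i in range(n_rows) if dense[i][j] != 0]
            for j in range(n_cols)]

def lu_solve(a, b):
    """In-place Doolittle LU decomposition (unit-diagonal L), then
    forward substitution L y = b and back substitution U x = y."""
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]          # multiplier, stored in L's slot
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    y = b[:]
    for i in range(n):                     # forward substitution
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):           # back substitution
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x
```

For instance, solving 2x + y = 3, x + 3y = 4 via `lu_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0])` returns x = y = 1. The patented contribution lies in parallelizing the decomposition step across subtrees and task queues, not in these textbook formulas.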
In addition, the logic instructions in the memory 1130 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied as a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention further provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the sparse matrix parallel solving method based on upper and lower triangular decomposition provided by the above methods, wherein the method is applied to a parallel computing platform comprising a plurality of processors, and includes: receiving an input matrix and storing the non-zero elements of the input matrix column by column; reordering the input matrix to obtain a reordered matrix, wherein, in the reordered matrix, the fill-in of the LU decomposition is smaller than a first number and the absolute value of each non-zero element at a diagonal position is larger than a first threshold; constructing a matrix elimination tree based on the reordered matrix, and constructing subtrees and task queues based on the matrix elimination tree; performing parallel computation on the subtrees and the task queues with the plurality of processors, and obtaining an output matrix based on the computation results, wherein the output matrix is the LU decomposition of the input matrix; and obtaining a solution result for the input matrix based on the output matrix.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the sparse matrix parallel solving method based on upper and lower triangular decomposition provided by the above methods, wherein the method is applied to a parallel computing platform comprising a plurality of processors, and includes: receiving an input matrix and storing the non-zero elements of the input matrix column by column; reordering the input matrix to obtain a reordered matrix, wherein, in the reordered matrix, the fill-in of the LU decomposition is smaller than a first number and the absolute value of each non-zero element at a diagonal position is larger than a first threshold; constructing a matrix elimination tree based on the reordered matrix, and constructing subtrees and task queues based on the matrix elimination tree; performing parallel computation on the subtrees and the task queues with the plurality of processors, and obtaining an output matrix based on the computation results, wherein the output matrix is the LU decomposition of the input matrix; and obtaining a solution result for the input matrix based on the output matrix.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (15)

1. A sparse matrix parallel solving method based on upper and lower triangular decomposition is applied to a parallel computing platform, the parallel computing platform comprises a plurality of processors, and the method comprises the following steps:
receiving an input matrix and storing non-zero elements of the input matrix column by column;
reordering the input matrix to obtain a reordered matrix, wherein, in the reordered matrix, the fill-in of the LU decomposition is smaller than a first number, and the absolute value of the non-zero element at each diagonal position is larger than a first threshold;
constructing a matrix elimination tree based on the reordering matrix, and constructing a sub-tree and a task queue based on the matrix elimination tree;
performing parallel computation on the plurality of subtrees and the plurality of task queues based on the plurality of processors, and obtaining an output matrix based on the computation result, wherein the output matrix is an LU decomposition matrix related to the input matrix;
and obtaining a solving result about the input matrix based on the output matrix.
2. The upper and lower triangular decomposition based sparse matrix parallel solving method of claim 1, wherein the matrix elimination tree comprises a plurality of nodes, the nodes comprise a root node and leaf nodes, and the constructing of the sub-tree and the task queue based on the matrix elimination tree comprises the following steps:
dividing the matrix elimination tree into a plurality of subtrees by taking the root node as a starting point and the leaf nodes as an end point, wherein the number of the subtrees is greater than that of the processors;
and determining independent nodes, and constructing the task queue based on the independent nodes, wherein the independent nodes are nodes of the matrix elimination tree that are not assigned to any of the subtrees.
3. The method of claim 2, wherein the parallel computation of the subtrees based on the plurality of processors comprises:
assigning a plurality of said sub-trees to a plurality of said processors based on a greedy algorithm;
performing parallel computations on the plurality of subtrees based on the plurality of processors.
4. The method of claim 3, wherein the processor comprises a cache memory, the nodes of the subtree comprise a plurality of child nodes, and the processor computes the subtree by:
determining a reading mode for reading the child node based on the capacity of the cache memory;
calculating the read child nodes based on the processor to obtain node values of the nodes;
obtaining a calculation result about the subtree based on the node value.
5. The method according to claim 4, wherein the determining a reading mode for reading the child node based on the capacity of the cache memory comprises:
arranging the child nodes in a descending order according to the distance between the child nodes and the nodes of the subtree to obtain a child node queue;
sequentially selecting a preset number of child nodes from the child node queue, starting from the head of the queue, and reading the preset number of child nodes through the cache memory, wherein the preset number is less than or equal to the capacity of the cache memory;
reading, based on the processor, other child nodes in the child node queue except for the preset number of child nodes.
6. The upper and lower triangular decomposition based sparse matrix parallel solving method according to claim 2, wherein the parallel computing of the plurality of task queues based on the plurality of processors comprises:
determining a scheduling processor and a plurality of executing processors in the processors, wherein the executing processors are other processors except the scheduling processor;
based on communication transmission, issuing the task queues to the execution processors through the scheduling processor;
and performing parallel computation on a plurality of task queues based on a plurality of execution processors.
7. The upper and lower triangular decomposition based sparse matrix parallel solving method of claim 6, wherein the execution processor calculates the task queue by adopting the following modes:
determining a computation state of the task queue in real time based on the scheduling processor;
responding to the request of the execution processor to acquire the task queue for calculation, and judging the calculation state of the task queue acquired by the request of the execution processor based on the scheduling processor;
and calculating the task queue through the execution processor based on the calculation state of the task queue.
8. The upper and lower triangular decomposition based sparse matrix parallel solving method according to claim 7, wherein the computation state of the task queue comprises a computation state of a node for determining a computation result of the task queue, and the computation of the task queue by the execution processor based on the computation state of the task queue comprises:
if the calculation state of the node for determining the calculation result of the task queue is calculation not completed, performing an initial calculation on the task queue through the execution processor based on the already calculated child nodes of the node;
and if the calculation state of the node for determining the calculation result of the task queue is the calculation completion, calculating the task queue through the execution processor based on the node.
9. The method for solving the sparse matrix parallel based on the upper and lower triangular decomposition recited in claim 8, wherein after performing the initial computation on the task queue by the execution processor based on the computed child nodes in the nodes, the method further comprises:
determining the calculation state of the child node which is not calculated in the nodes in real time based on the scheduling processor;
and in response to the calculation state of the sub-node which is not calculated at the current moment being calculated, updating the result of the initial calculation through the execution processor based on the sub-node which is not calculated.
10. The upper and lower triangular decomposition based sparse matrix parallel solving method according to claim 1, further comprising:
and based on the plurality of processors, performing parallel computation on the same first subtree and/or the same first task queue, wherein the first subtree is a subtree connected with the root of the matrix elimination tree, and the first task queue is a task queue connected with the root of the matrix elimination tree.
11. A sparse matrix parallel solving device based on upper and lower triangular decomposition is characterized in that the device is applied to a parallel computing platform, the parallel computing platform comprises a plurality of processors, and the device comprises:
the receiving module is used for receiving an input matrix and storing non-zero elements of the input matrix column by column;
a reordering module, configured to reorder the input matrix to obtain a reordered matrix, where in the reordered matrix, a filling of LU decomposition is smaller than a first number, and an absolute value of the non-zero element at a diagonal position is larger than a first threshold;
the construction module is used for constructing a matrix elimination tree based on the reordering matrix and constructing a sub-tree and a task queue based on the matrix elimination tree;
the processing module is used for carrying out parallel computation on the plurality of subtrees and the plurality of task queues based on the plurality of processors and obtaining an output matrix based on a computation result, wherein the output matrix is an LU decomposition matrix related to the input matrix;
and the generating module is used for obtaining a solving result related to the input matrix based on the output matrix.
12. The apparatus of claim 11, wherein the matrix elimination tree comprises a plurality of nodes, the nodes comprise a root node and a leaf node, and the construction module constructs a sub-tree and a task queue based on the matrix elimination tree by:
dividing the matrix elimination tree into a plurality of subtrees by taking the root node as a starting point and the leaf nodes as an end point, wherein the number of the subtrees is greater than that of the processors;
and determining independent nodes, and constructing the task queue based on the independent nodes, wherein the independent nodes are nodes of the matrix elimination tree that are not assigned to any of the subtrees.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the sparse matrix parallel solving method based on upper and lower triangular decomposition of any one of claims 1 to 10 when executing the program.
14. A non-transitory computer readable storage medium, having stored thereon a computer program, when being executed by a processor, for implementing the steps of the sparse matrix parallel solution method based on upper and lower triangular decomposition according to any one of claims 1 to 10.
15. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the sparse matrix parallel solution method based on upper and lower triangular decomposition according to any one of claims 1 to 10.
CN202111532120.2A 2021-12-14 2021-12-14 Sparse matrix parallel solving method and device based on upper and lower triangular decomposition Active CN114329327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111532120.2A CN114329327B (en) 2021-12-14 2021-12-14 Sparse matrix parallel solving method and device based on upper and lower triangular decomposition


Publications (2)

Publication Number Publication Date
CN114329327A true CN114329327A (en) 2022-04-12
CN114329327B CN114329327B (en) 2024-09-17

Family

ID=81053072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111532120.2A Active CN114329327B (en) 2021-12-14 2021-12-14 Sparse matrix parallel solving method and device based on upper and lower triangular decomposition

Country Status (1)

Country Link
CN (1) CN114329327B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080913A (en) * 2022-05-11 2022-09-20 中国核动力研究设计院 Method, system, equipment and storage medium for solving burn-up sparse matrix
CN115396065A (en) * 2022-10-26 2022-11-25 南京邮电大学 Low-delay decoding method for sparse random linear network coding
CN115437763A (en) * 2022-08-15 2022-12-06 中山大学 Task reconstruction method and device after rocket fault, terminal equipment and storage medium
CN117311948A (en) * 2023-11-27 2023-12-29 湖南迈曦软件有限责任公司 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU
CN118152716A (en) * 2024-05-09 2024-06-07 巨霖科技(上海)有限公司 Matrix solving method, computer equipment, storage medium and program product
CN118378008A (en) * 2024-06-27 2024-07-23 南京邮电大学 Matrix decomposition parallelization optimization method and system for high-performance computing

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103399841A * 2013-07-31 2013-11-20 Tsinghua University Sparse matrix LU decomposition method based on GPU
US20170147301A1 * 2015-11-19 2017-05-25 Hongbo Rong Technologies for automatic reordering of sparse matrices
CN109062866A * 2018-07-13 2018-12-21 Tsinghua University Greedy-layering-based method and system for solving triangular equation systems in electric power systems
CN110162736A * 2018-01-10 2019-08-23 Chengdu University of Information Technology Parallel processing method for large-scale sparse symmetric linear equation systems based on elimination trees


Non-Patent Citations (1)

Title
Li Cheng, Tian Xinmin, Wang Dingxing, Zheng Weimin: "Tree-structure-based parallel solution of sparse triangular linear systems", Journal of Software (软件学报), vol. 6, no. 8, 5 August 1995 (1995-08-05), pages 479-484 *

Cited By (9)

Publication number Priority date Publication date Assignee Title
CN115080913A (en) * 2022-05-11 2022-09-20 中国核动力研究设计院 Method, system, equipment and storage medium for solving burn-up sparse matrix
CN115437763A (en) * 2022-08-15 2022-12-06 中山大学 Task reconstruction method and device after rocket fault, terminal equipment and storage medium
CN115437763B (en) * 2022-08-15 2023-04-11 中山大学 Task reconstruction method and device after rocket fault, terminal equipment and storage medium
CN115396065A (en) * 2022-10-26 2022-11-25 南京邮电大学 Low-delay decoding method for sparse random linear network coding
CN117311948A (en) * 2023-11-27 2023-12-29 湖南迈曦软件有限责任公司 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU
CN117311948B (en) * 2023-11-27 2024-03-19 湖南迈曦软件有限责任公司 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU
CN118152716A (en) * 2024-05-09 2024-06-07 巨霖科技(上海)有限公司 Matrix solving method, computer equipment, storage medium and program product
CN118152716B (en) * 2024-05-09 2024-07-26 巨霖科技(上海)有限公司 Matrix solving method, computer equipment, storage medium and program product
CN118378008A (en) * 2024-06-27 2024-07-23 南京邮电大学 Matrix decomposition parallelization optimization method and system for high-performance computing

Also Published As

Publication number Publication date
CN114329327B (en) 2024-09-17

Similar Documents

Publication Publication Date Title
CN114329327B (en) Sparse matrix parallel solving method and device based on upper and lower triangular decomposition
US8813091B2 (en) Distribution data structures for locality-guided work stealing
CN114035936B (en) Multi-dimensional parallel processing method, system, equipment and readable storage medium based on artificial intelligence
US20200090051A1 (en) Optimization problem operation method and apparatus
CN112463189B (en) Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN113569511A (en) Quantum circuit simulation method and device
US20210149985A1 (en) Method and apparatus for processing large-scale distributed matrix product
US20220391571A1 (en) Fast quantum circuit simulations with parallel task-based tensor network contraction
US20230090284A1 (en) Memory processing optimisation
CN116069393A (en) Data processing method and related device
Peng et al. Lock-free parallelization for variance-reduced stochastic gradient descent on streaming data
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
US20240160894A1 (en) Mixture-of-experts layer with switchable parallel modes
US10210136B2 (en) Parallel computer and FFT operation method
CN106933882A (en) A kind of big data incremental calculation method and device
CN115328865A (en) Batch import method of CSV files and related equipment
CN109213592B (en) Graph calculation method based on automatic selection of duplicate factor model
CN111967590B (en) Heterogeneous multi-XPU machine learning system oriented to recommendation system matrix decomposition method
US20240169463A1 (en) Mixture-of-experts layer with dynamic gating
US20240160906A1 (en) Collective communication phases at mixture-of-experts layer
CN113505825B (en) Graph calculating device
CN116166202B (en) Method, device, equipment and medium for placing copies in big data environment
CN116167447B (en) Quantum circuit processing method and device and electronic equipment
US11960449B2 (en) Computer-readable recording medium storing information processing program, information processing method, and information processing apparatus
CN116861151A (en) Vector processor-oriented sparse matrix vector multiplication method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant