CN117573375B - Dynamic load balance parallel computing method oriented to self-adaptive decoupling equation

Info

Publication number: CN117573375B
Application number: CN202410054454.0A
Authority: CN
Inventors: 张斌; 肖辰祥; 胡桐; 李林颖; 刘淏旸
Original assignee: Sichuan Research Institute Of Shanghai Jiaotong University
Current assignee: Sichuan Research Institute Of Shanghai Jiaotong University
Priority date: 2024-01-15
Filing date: 2024-01-15
Publication date: 2024-04-02
Anticipated expiration: 2044-01-15
Also published as: CN117573375A

Abstract

The invention discloses a dynamic load balancing parallel computing method for adaptive decoupled equations. Building on a dynamic data transfer scheme, it provides a more flexible and efficient dynamic load balancing algorithm, based on hybrid MPI+OpenMP parallelism, for the load imbalance that arises when solving the adaptive decoupled NS equations. The method addresses three shortcomings of existing load balancing methods: they do not fully exploit the shared-memory advantage of thread-level parallelism, their load balancing algorithms are complex, and their data transfer cost is high. It achieves dynamic load balancing more flexibly and efficiently, reduces program blocking time, improves computational efficiency, and accelerates the computation.

Description

Dynamic load balance parallel computing method oriented to self-adaptive decoupling equation

Technical Field

The invention relates to the field of load balancing, in particular to a dynamic load balancing parallel computing method oriented to a self-adaptive decoupling equation.

Background

With the rapid development of computer technology, computational fluid dynamics (CFD) simulation has become an important tool for studying chemically reacting flows. However, the stiffness problem in such flows, i.e. the mismatch between the characteristic time scale of the chemical reactions and that of the flow, poses a serious challenge to conventional numerical methods. On the one hand, when stiffness is severe, a general numerical method may fail to converge or may even produce a spurious numerical solution. On the other hand, stiffness sharply increases the computational cost, to the point of making conventional numerical methods unusable. Existing acceleration approaches include model reduction, adaptive mesh refinement, and high-performance parallel computing (HPC). HPC has become a powerful tool for accelerating CFD simulation, typically using the shared-memory models Open Multi-Processing (OpenMP) and pthreads and the distributed-memory Message Passing Interface (MPI). However, while HPC accelerates the computation, stiffness introduces load imbalance: the computation is concentrated in regions where the chemical reactions are intense, so the computational load across processes becomes seriously unbalanced and parallel efficiency drops sharply. Load balancing has therefore become an important research topic.

Existing load balancing schemes fall mainly into three categories:

1. Dynamic domain decomposition: the load distribution is monitored during the computational iterations; when a load balancing threshold is reached, the grid is repartitioned, each process handles the computing tasks of its own grids, and boundary data is communicated between processes.

2. Dynamic data transfer: the grid partition that is optimal for the flow computation is kept fixed during the iterations, and computing tasks are transferred between processes when the load balancing threshold is reached. This approach exploits the local nature of the chemical reactions, namely that each grid only needs to iterate its chemical reactions from the flow field data, so the transferred computing tasks have no data dependencies and the approach is very flexible. Disadvantages: most schemes only use MPI for process-level parallelism, and the data transfer algorithm is complex.

3. CPU/GPU heterogeneous computing: during the iterations, grid nodes with high stiffness are assigned to the CPU for implicit solution and grid nodes with low stiffness are assigned to the GPU for explicit solution, which plays to the strengths of both the CPU and the GPU and gives them similar amounts of computation, thereby balancing the load. Disadvantages: the implementation is complex and the hardware requirements are high.

The problem addressed by this patent is the load imbalance that arises when solving the adaptive decoupled NS (Navier-Stokes) equations. The NS equations describe fluid flow and are the core of CFD. In CFD computation of chemically reacting flows, decoupling is a common approach: the chemical reactions are separated from the flow, which gives high temporal accuracy and very low memory consumption and facilitates parallel computing and software engineering. The adaptive approach adds a stiffness-based prediction step before the chemical reaction solve. Within one flow time step, the ODEs (ordinary differential equations) of the chemical reactions require multiple sub-steps, and the prediction step yields the optimal number of sub-iteration steps for each grid at that moment. This not only ensures a reasonable numerical solution but also greatly reduces the computational cost of the algorithm. In parallel computing, however, the chemical reaction time dominates and the amount of computation differs between processes, which causes load imbalance. Processes with a small load finish first and must wait for those with a large load, resulting in a large amount of blocking time (MPI_Barrier). Because of the decoupling, a load balancing strategy based on dynamic data transfer is better suited to this problem.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a dynamic load balance parallel computing method oriented to a self-adaptive decoupling equation.

In order to achieve the aim of the invention, the invention adopts the following technical scheme:

A dynamic load balancing parallel computing method for adaptive decoupled equations comprises the following steps:

S1, acquiring the load of each process and determining the degree of load imbalance, where the load is the sum of the chemical reaction iteration steps over all grids of the process;

S2, generating a chemical reaction iteration step number transfer list to obtain the sending process, receiving process, and number of iteration steps to transfer, and generating a grid transfer list that maps the iteration steps to be transferred onto specific grid data;

S3, the sending process sends the packed data to the corresponding receiving process, which unpacks it; non-blocking communication allows computation to proceed while the data is being sent;

S4, after the transferred data has been received and unpacked, solving the chemical reaction equations of the transferred grids.

Further, the sum of the steps of the chemical reaction iteration in S1 is expressed as:

$$L_i = \sum_{j=1}^{N_i} n_{ij}, \quad i = 1, \dots, P$$

where $L_i$ is the load of the i-th process, $N_i$ is the mesh size (number of grids) of the i-th process, $n_{ij}$ is the number of chemical reaction iteration steps of the j-th grid of the i-th process, and $P$ is the number of processes.

Further, the specific calculation method of the load unbalance degree in S1 is as follows:

where $\eta$ is the load imbalance, computed from $L_{\max}$, the maximum load over all processes, and $L_{\text{avg}}$, the average load over all processes;

and when the load unbalance is larger than a balance threshold value, starting a dynamic load balance algorithm.
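As a minimal sketch of step S1, the following Python fragment sums per-grid iteration counts into per-process loads and tests a threshold. The imbalance formula itself is not reproduced in this text, so the ratio of maximum load to mean load used here is an assumption, as are the function names, the sample data, and the threshold value of 1.2.

```python
from statistics import mean

def process_loads(steps_per_process):
    """Load of each process: sum of chemical sub-iteration steps over its grids."""
    return [sum(steps) for steps in steps_per_process]

def load_imbalance(loads):
    """Assumed measure: maximum load divided by mean load (1.0 means balanced)."""
    return max(loads) / mean(loads)

# Hypothetical data: three processes, each with per-grid iteration step counts.
steps = [[1, 1, 2], [10, 20, 30], [2, 2, 2]]
loads = process_loads(steps)           # one load value per process
imbalance = load_imbalance(loads)

THRESHOLD = 1.2                        # assumed balance threshold
needs_balancing = imbalance > THRESHOLD
```

In an MPI program each process would compute only its own load and the full list would be gathered collectively; the list here stands in for that gathered result.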

Further, the specific way of generating the chemical reaction iteration step number transfer list in S2 is as follows:

a1, dividing the load into a part larger than or equal to the average value and a part smaller than the average value by utilizing a priority queue;

a2, based on a greedy algorithm, repeatedly taking the element with the largest difference from the part greater than or equal to the average and the element with the smallest difference from the part less than the average, and transferring load between them, until the number of transfers reaches a threshold, so as to obtain the chemical reaction iteration step number transfer list;

a3, matching up the sending process, receiving process, and number of iteration steps to transfer in the obtained chemical reaction iteration step number transfer list.
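Steps A1 to A3 can be sketched with Python's `heapq` module. This is an illustration under stated assumptions, not the patented implementation: the "difference" is taken to be load minus average, each transfer moves the smaller of the sender's surplus and the receiver's deficit, and the transfer-count threshold defaults to the number of processes. The function name `build_step_transfer_list` is hypothetical.

```python
import heapq

def build_step_transfer_list(loads, max_transfers=None):
    """Return (sender, receiver, steps) triples that move loads toward the mean."""
    avg = sum(loads) / len(loads)
    # A1: split into loads >= average and loads < average.
    # heapq is a min-heap, so surpluses are negated to pop the largest first;
    # for the other part, the smallest (most negative) difference pops first.
    over = [(-(load - avg), p) for p, load in enumerate(loads) if load >= avg]
    under = [(load - avg, p) for p, load in enumerate(loads) if load < avg]
    heapq.heapify(over)
    heapq.heapify(under)

    limit = len(loads) if max_transfers is None else max_transfers  # assumed threshold
    transfers = []
    # A2: greedy pairing of the most overloaded with the most underloaded process.
    while over and under and len(transfers) < limit:
        neg_surplus, sender = heapq.heappop(over)
        neg_need, receiver = heapq.heappop(under)
        surplus, need = -neg_surplus, -neg_need
        steps = int(min(surplus, need))
        if steps <= 0:
            break
        # A3: record the sender, receiver, and step count as one triple.
        transfers.append((sender, receiver, steps))
        # Re-queue whichever side still has at least one whole step left.
        if surplus - steps >= 1:
            heapq.heappush(over, (-(surplus - steps), sender))
        if need - steps >= 1:
            heapq.heappush(under, (-(need - steps), receiver))
    return transfers

transfer_list = build_step_transfer_list([90, 10, 50, 30, 70])
```

Each triple says how many iteration steps, not yet which grids, should move; mapping steps onto grids is a separate pass.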

Further, the specific way of generating the grid transfer list in S2 is as follows:

b1, taking as input the obtained chemical reaction iteration step number transfer list, together with the numbers of the processes to which each process must send;

b2, traversing the grids: if the list of target processes to send to is empty, exit the loop; otherwise, starting from the first target process, check whether the current grid's chemical reaction iteration steps can be added to that process's transfer grids; if so, add the current grid and its associated data to the corresponding transfer data; if not, check whether it can be added for the next target process, so that one target process's required iteration step count is satisfied preferentially;

b3, exiting the loop once all target processes have received the required number of iteration steps, which yields the grid transfer list.
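The traversal in B1 to B3 might look like the following sketch for a single sending process. The source does not say when a grid "can be added"; here it is assumed to mean that the grid's step count does not exceed the target's remaining quota. The function name and data layout are hypothetical.

```python
def build_grid_transfer_list(grid_steps, targets):
    """Map step quotas onto concrete grids for one sending process.

    grid_steps: iteration step count of each local grid, indexed by grid id.
    targets:    (receiver, steps_to_transfer) pairs for this sender (B1).
    Returns {receiver: [grid indices to send]}.
    """
    remaining = [[recv, need] for recv, need in targets]   # open targets, in order
    assigned = {recv: [] for recv, _ in targets}
    for idx, steps in enumerate(grid_steps):               # B2: traverse grids
        if not remaining:                                  # B3: all quotas met
            break
        for slot in remaining:                             # first target that fits
            if steps <= slot[1]:                           # assumed "can be added" test
                assigned[slot[0]].append(idx)
                slot[1] -= steps
                break
        # Drop targets whose quota has been filled.
        remaining = [slot for slot in remaining if slot[1] > 0]
    return assigned

grid_list = build_grid_transfer_list([5, 35, 10, 10, 7], [(1, 40), (3, 20)])
```

Keeping the grid indices alongside the receiver is what later allows the returned results to be written back to the right grids on the sending process.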

Further, in the step S3, the sending process sends the packed transfer flow field data and transfer chemical reaction data to the corresponding receiving process by using non-blocking communication, and unpacks the transfer flow field data and the transfer chemical reaction data.

Further, the step S4 includes a chemical reaction solution of the receiving process and a chemical reaction solution of the sending process.

Further, the chemical reaction solution of the receiving process comprises the following steps:

s41, after the transferred data has been received and unpacked, the chemical reaction solve for the transferred grids starts immediately;

s42, after the solve completes, the results are packed and the data is updated;

s43, the updated data is sent back to the original process with the non-blocking MPI_Isend, and the chemical reaction solve for the receiving process's own grids is started, so that the non-blocking MPI_Isend overlaps the MPI communication time with the chemical reaction computation time;

the chemical reaction solve of the sending process only covers the grids of that process.
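The round trip of steps S41 to S43 can be illustrated without MPI by simulating both sides in one Python process. Everything here is a stand-in: `solve_chemistry` integrates the toy equation dy/dt = -y with explicit Euler sub-steps rather than a real reaction mechanism, and the pack and unpack helpers are hypothetical. The point of the sketch is the index bookkeeping that lets results return to the right grids.

```python
def solve_chemistry(state, n_steps, dt=0.1):
    """Toy stand-in for the chemistry ODE solve: n_steps Euler steps of dy/dt = -y."""
    for _ in range(n_steps):
        state = state - dt * state
    return state

def pack(states, steps, indices):
    """Sender side: bundle grid index, state, and step count for each moved grid."""
    return [(i, states[i], steps[i]) for i in indices]

def solve_and_pack_results(package):
    """Receiver side (S41, S42): solve the transferred grids, pack the results."""
    return [(i, solve_chemistry(y, n)) for i, y, n in package]

def unpack_results(states, results):
    """Sender side: write returned results back by grid index."""
    for i, y in results:
        states[i] = y

states = [1.0, 2.0, 3.0, 4.0]        # hypothetical per-grid scalar state
steps = [1, 8, 2, 8]                 # per-grid sub-iteration counts
moved = [1, 3]                       # grids chosen for transfer

package = pack(states, steps, moved)
returned = solve_and_pack_results(package)      # would run on the receiver
for i in range(len(states)):                    # sender solves only its kept grids
    if i not in moved:
        states[i] = solve_chemistry(states[i], steps[i])
unpack_results(states, returned)                # S43 result arrives back
```

In the real scheme the two solves run concurrently on different processes, and the non-blocking send hides the transfer of `returned` behind the receiver's own chemistry solve.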

The invention has the following beneficial effects:

The hybrid MPI+OpenMP parallel scheme fully exploits the distributed memory of MPI process-level parallelism and the shared memory of OpenMP thread-level parallelism, reducing data transfer and improving parallel efficiency;

The data transfer list is generated with a priority queue, which brings the time complexity to O(n log n) and improves the efficiency of the dynamic load balancing algorithm;

Non-blocking communication overlaps communication time with computation time, reducing the communication overhead of frequent data transfers;

For the solution of the decoupled NS equations with adaptive chemical reaction steps, an approximately uniform load distribution is achieved: the blocking time falls from 78.65% with the standard adaptive decoupling algorithm to 4.77%, and the parallel efficiency rises from 18.42% with the standard algorithm to 47.75%.

Drawings

FIG. 1 is a flow chart of a dynamic load balancing algorithm of the present invention.

FIG. 2 is an MPI+OpenMP hybrid parallel model diagram according to an embodiment of the present invention.

Detailed Description

The following description of specific embodiments is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of these embodiments: to those of ordinary skill in the art, variations are evident as long as they fall within the spirit and scope of the invention as defined by the appended claims, and every invention that makes use of the inventive concept is protected.

A dynamic load balance parallel computing method facing an adaptive decoupling equation is shown in fig. 1, and comprises the following steps:

First, an MPI+OpenMP hybrid parallel scheme is designed to fully exploit the advantages of process-level parallelism with distributed memory and thread-level parallelism with shared memory.

MPI distributes computing tasks to different processes through grid decomposition and task partitioning by the main process, and transfers the necessary information (such as boundary information) between processes. The general procedure is: a) initialize the MPI parallel environment with the MPI_Init function and start the MPI parallel computation; b) divide the computing task into n parts and assign each part to an MPI process; c) each process computes independently in its own memory space; d) when processes need to exchange information, message-passing functions such as MPI_Send and MPI_Recv are called; e) when all processes have finished their computing tasks, the MPI_Finalize function is called to terminate. OpenMP typically accelerates computation by running hot-spot loops on multiple threads. The general procedure follows the fork-join model: a) the computing task runs serially on the main thread; b) when the computation reaches a hot-spot loop, multithreaded computation is started (fork), and because memory is shared no communication is needed; c) when the hot-spot loop ends, control returns to the main thread (join), which continues the serial computation.

The MPI+OpenMP hybrid parallel scheme is shown in FIG. 2 and is described in detail below:

a) The computational domain is decomposed into n subdomains so that the flow solve is load balanced (uniformly distributed);

b) Each subdomain is assigned to one MPI process;

c) Each subdomain uses the MPI_Isend and MPI_Irecv functions for non-blocking communication of boundary information between processes; the non-blocking communication overlaps communication with computation and hides the communication time;

d) When a hot-spot loop (the chemical reactions) is reached, OpenMP multithreaded computation is started; when the hot-spot loop ends, the OpenMP threads are closed and execution returns to process-level computation.
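The fork-join pattern in step d) can be illustrated in Python with a thread pool standing in for OpenMP. This is a swapped-in technique for illustration only: `ThreadPoolExecutor` is not OpenMP, CPython threads do not speed up pure-Python arithmetic, and the `react` and `advance_subdomain` functions and their toy arithmetic are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def react(cell):
    """Toy per-grid chemistry: n_steps sub-iterations, independent of other grids."""
    state, n_steps = cell
    for _ in range(n_steps):
        state *= 0.9
    return state

def advance_subdomain(cells, n_threads=4):
    """One step on one subdomain: serial part, then a fork-join hot loop."""
    # Serial region: placeholder for the flow solve on the calling thread.
    cells = [(state + 1.0, n_steps) for state, n_steps in cells]
    # Fork: worker threads share the process memory, so no messages are needed.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        new_states = list(pool.map(react, cells))   # results keep input order
    # Join: leaving the 'with' block waits for all workers; serial code resumes.
    return new_states

result = advance_subdomain([(1.0, 3), (2.0, 0), (0.0, 5)])
```

Because each grid's chemistry depends only on that grid's own state, the hot loop needs no synchronization beyond the final join, which is what makes it a natural target for thread-level parallelism.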

In this algorithm the load is defined as the sum of the chemical reaction iteration steps over all grids of each process, as shown in equation 1. The core idea is to transfer chemical reaction computation between processes using a greedy algorithm (load is moved preferentially from the most heavily loaded process to the most lightly loaded one), so that the computational load becomes uniform across processes, program blocking time is reduced, and computational efficiency is improved.

$$L_i = \sum_{j=1}^{N_i} n_{ij}, \quad i = 1, \dots, P \tag{1}$$

where $L_i$ is the load of the i-th process, $N_i$ is the mesh size (number of grids) of the i-th process, $n_{ij}$ is the number of chemical reaction iteration steps of the j-th grid of the i-th process, and $P$ is the number of processes.

The loads of all processes are collected and the load imbalance is evaluated. The loads $L_i$ of all processes are obtained with the MPI_Allgather function and the load imbalance $\eta$ is calculated; when $\eta$ exceeds the load balancing threshold $\eta_0$, i.e. $\eta > \eta_0$, the dynamic load balancing algorithm is started;

Here the imbalance is computed from $L_{\max}$, the maximum load over all processes, and $L_{\text{avg}}$, the average load over all processes.
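To see why the maximum and average loads are the quantities that matter, the sketch below estimates how long each process waits at a synchronization point. It rests on a simplifying assumption that is not in the source: the time a process needs is proportional to its load, so every process waits until the most heavily loaded one finishes. The function names and sample loads are hypothetical.

```python
def blocking_fractions(loads):
    """Fraction of the step each process spends waiting, if time is proportional to load."""
    slowest = max(loads)
    return [(slowest - load) / slowest for load in loads]

def mean_blocking_fraction(loads):
    """Average waiting fraction over all processes: 1 - (mean load / max load)."""
    return 1.0 - (sum(loads) / len(loads)) / max(loads)

sample_loads = [20, 100, 40, 40]               # hypothetical gathered loads
fractions = blocking_fractions(sample_loads)   # per-process waiting share
average_wait = mean_blocking_fraction(sample_loads)
```

Under this model the average waiting share depends only on the ratio of mean load to maximum load, so pushing every load toward the mean drives the waiting share toward zero.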

Generate the chemical reaction iteration step number transfer list. Because the load is defined in terms of chemical reaction iteration steps, the number of iteration steps that each process must transfer has to be determined, i.e. the sending process, receiving process, and number of iteration steps to transfer. The details are as follows:

The transfer operation is performed in a loop. Using the properties of the priority queue and following a greedy algorithm, the element with the largest difference is repeatedly taken from the part greater than or equal to the average and the element with the smallest difference is taken from the part less than the average, and load is transferred between them, which achieves an efficient balancing transfer of elements. The time complexity of each transfer operation is $O(\log n)$, and the total time complexity of the transfer operations is $O(n \log n)$.
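The overall $O(n \log n)$ cost can be made plausible with a short count. The argument assumes, which the source does not state, that each transfer moves the smaller of the sender's surplus and the receiver's deficit, so that at least one of the two processes reaches the average and leaves its queue; $n$ denotes the number of processes.

```latex
\begin{aligned}
\text{building both queues} &= O(n),\\
\text{cost per transfer} &= \underbrace{O(\log n)}_{\text{pop from each queue}}
  + \underbrace{O(\log n)}_{\text{re-insert any remainder}} = O(\log n),\\
\text{number of transfers} &\le n - 1,\\
\text{total cost} &\le O(n) + (n - 1)\,O(\log n) = O(n \log n).
\end{aligned}
```

A cap on the number of transfers, as in step A2, can only lower this count.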

Finally, the list of triples shown in the first three columns of Table 1 is generated (taking 10 processes as an example), giving the sending process, receiving process, and number of iteration steps to transfer. Testing shows that this step takes only 0.03% of the total computation time.

Table 1. Transferred chemical reaction iteration steps and number of transferred grids

A grid transfer list is generated as shown in algorithm 2. Once the sending process, receiving process, and number of iteration steps to transfer are known, the actual data transfer must be done grid by grid; that is, the iteration steps to be transferred must be mapped onto specific grid data. The sending process, receiving process, number of transferred grids, and grid indices therefore have to be generated (the indices are needed because the results are returned to the original process after the transferred computation), and the data to be transferred is packed at the same time. The details are as follows:

b2, traversing the grids: if the list of target processes to send to is empty, exit the loop; otherwise, starting from the first target process, check whether the current grid's chemical reaction iteration steps can be added to the transfer grids, and if so, add the current grid and its associated data to the corresponding transfer data;

b3, when the chemical reaction iteration steps transferred to one target process meet the requirement, moving on to the next target process; the grid transfer list is obtained once all processes have been handled.

The final number of transferred grids is shown in the fourth column of Table 1.

Using MPI communication with MPI_Isend and MPI_Recv, the sending process sends the packed data to the corresponding receiving process, which unpacks it. Thanks to the non-blocking send, computational work (including calculation of some parameters and memory deallocation and allocation) proceeds while the data is being sent, which makes the communication efficient; MPI_Recv is used here to ensure that all data has been received before the next operation begins. Testing shows that this communication time accounts for only 0.52% of the total computation time;

For the receiving process, once the transferred data has been received and unpacked, the chemical reaction solve for the transferred grids starts immediately. When that solve finishes, the results are packed and sent back to the original process with MPI_Isend, and the receiving process then starts the chemical reaction solve for its own grids; the non-blocking MPI_Isend overlaps this communication with this computation. The sending process only solves the chemical reactions of its own grids. The chemical reaction computation uses the MPI+OpenMP hybrid parallelism described in the previous section to further accelerate the calculation and improve efficiency. The sending process receives the returned data with MPI_Recv, unpacks it, and updates its local data, which completes the computation of one flow step; the computation then proceeds to the next iteration.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The principles and embodiments of the present invention have been described with reference to specific examples, which are provided only to aid understanding of the method and its core idea. A person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application; accordingly, this description should not be construed as limiting the invention.

Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims

1. The dynamic load balance parallel computing method for the self-adaptive decoupling equation is characterized by comprising the following steps of:

s1, acquiring the load of each process and judging the load unbalance degree, wherein the load is the sum of the chemical reaction iteration steps of all grids of each process, and the load unbalance degree is specifically calculated in the following way:

when the load unbalance is larger than a balance threshold value, starting a dynamic load balance algorithm;

s2, generating a chemical reaction iteration step number transfer list to obtain a sending process, a receiving process and a transfer iteration step number, generating a grid transfer list to transfer the transfer iteration step number to specific grid data, and generating the chemical reaction iteration step number transfer list in the following specific modes:

a3, corresponding the transmission process-receiving process-transfer iteration steps in the obtained chemical reaction iteration step transfer list;

the specific mode for generating the grid transfer list is as follows:

b3, starting the next target process when the number of chemical reaction iteration steps transferred by one target process meets the requirement, and obtaining a grid transfer list after all processes are transferred;

s3, the sending process sends the packed data to a corresponding receiving process and unpacks the packed data, and the computing operation is carried out while the data is sent through non-blocking communication;

s4, the receiving process receives the transferred data and performs chemical reaction formula solving of the transfer grid after data unpacking is completed.

2. The method for dynamically balancing and parallel computing for self-adaptive decoupling equations according to claim 1, wherein the sum of the number of chemical reaction iteration steps in S1 is expressed as:

3. The adaptive decoupling equation-oriented dynamic load balancing parallel computing method of claim 1, wherein S4 includes a chemical reaction solution of a receiving process and a chemical reaction solution of a transmitting process.

4. A dynamic load balancing parallel computing method for an adaptive decoupling equation according to claim 3, wherein the chemical reaction solution of the receiving process comprises the steps of:

s43, sending the data updating result back to the original process by using non-blocking communication, starting the solving and calculating of the grid chemical reaction of the process, and overlapping the MPI communication time and the chemical reaction calculating time by using non-blocking communication;

and the sending process chemical reaction solution only carries out the solution calculation of the grid chemical reaction of the process.