CN109857543A - A streamline simulation acceleration method based on multi-node multi-GPU computing - Google Patents
A streamline simulation acceleration method based on multi-node multi-GPU computing
- Publication number: CN109857543A (application CN201811574392.7A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses an acceleration method that realizes streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm, belonging to the field of streamline numerical simulation. The method runs on several computers or servers, each equipped with multiple GPUs. The method comprises: discretizing the computational region into a number of grid cells; a host process initializing the basic model information, the aquifer parameters, the coefficient matrix in the velocity vector formula and the partitioned grid model, and broadcasting them to the other processes; each process invoking one GPU, creating a stream on that GPU for acceleration, and allocating GPU memory to store its data; each GPU launching a number of threads, each thread using the particle tracking algorithm to compute, from the current position, the particle's next position (forward) and previous position (backward), iterating this process to obtain one complete streamline, and then copying the GPU data back into the CPU memory allocated to the owning process. Finally, the results of all processes are gathered with an MPI collective communication function and, after invalid data is removed, written to a result file. The invention makes full use of cluster resources and the GPUs on each node to realize deeply parallel computation of large-scale streamlines, with the advantages of a significant speedup and fast streamline generation.
Description
Technical field
The present invention relates to the field of streamline numerical simulation, and in particular to an acceleration method that realizes streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm.
Background technique
Streamline simulation not only provides vivid, intuitive information on groundwater movement for groundwater research and for studying the hydrological characteristics of the surrounding area, but also provides reservoir-property and production-performance information for inter-well tracer tests in reservoir engineering, facilitating better development. In his research on groundwater flow, Pollock proposed a semi-analytic streamline simulation method, the particle tracking algorithm, which is widely used for its flexibility and universality. In practical engineering applications, when the survey region is very large and the number of streamlines reaches the millions, the running time on an ordinary machine is very long and cannot satisfy the demand of engineering applications for timely results.
Traditional CPU-based parallel acceleration frameworks are limited by the number of CPU cores, and the computing capability of a CPU is lower than that of a GPU. Moreover, the number of GPUs in a single computer is limited by its slot count, so the achievable speedup is also limited.
Summary of the invention
The embodiments of the present invention aim to provide an acceleration method that realizes streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm, so as to solve the problem that traditional CPU-based simulation computes large-scale streamlines slowly. The embodiments of the present invention have the clear advantages of a small footprint, low cost and a significant speedup.
To solve the above technical problem, the present invention provides the following technical solution:
The present invention provides an acceleration method that realizes streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm, characterized in that the method runs simultaneously on multiple computers or servers, each with several GPUs, and comprises:
Step 1: the host process (process 0) discretizes the regional model into a number of grid cells, initializes the basic model information, the aquifer parameters and the coefficients in the velocity vector formula, completes the grid division, and balances the load among the processes;
Step 2: the basic model information, the aquifer parameters, the coefficients in the velocity vector formula and the sub-model size are broadcast to the other processes;
Step 3: each process, distributed over multiple servers, invokes the GPU card with its uniquely assigned number, creates a stream on that GPU, allocates GPU memory, and copies the parameters required for the GPU computation from the CPU into GPU global memory;
Step 4: each GPU launches multiple GPU threads that compute in parallel the physical coordinates p(x, y, z) of the grid cells g(ix, iy, iz), and the particle tracking process begins;
Step 5: the particle flow velocity Vp(Vx, Vy, Vz) is computed from the velocity vector formula;
Step 6: if a boundary condition is encountered, such as a stagnation point, the forward or backward tracking ends and step 7 is executed; otherwise, the particle travel time is computed from the flow velocity and the particle's next position is obtained, completing one forward or backward tracking step; the new position becomes the current particle coordinate, and step 5 is executed again;
Step 7: connecting in sequence the coordinate points computed by a single GPU thread yields one complete streamline; once all threads of a GPU have finished, the GPU results are copied from the GPU into the CPU memory of the owning process;
Step 8: after the GPU transfer succeeds, inter-process data transmission is realized with the MPI collective communication mechanism, and the results of all processes are gathered into the host process; finally, after invalid result data is removed, the output is written to a result file.
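The per-thread tracking loop of steps 4-7 can be sketched on the CPU as follows. This is a minimal illustration, not the patented kernel: the velocity field here is a hypothetical stand-in for formulas (1)-(3), and the only boundary condition checked is z > 0 from step 6.

```cpp
#include <vector>

struct Vec3 { double x, y, z; };

// Hypothetical velocity field standing in for formulas (1)-(3).
Vec3 velocity(const Vec3& p) {
    return { 1.0, 0.5, -0.1 * p.z };
}

// Steps 5-6: advance a particle from `seed` in direction `dir`
// (+1 forward, -1 backward) until a boundary or the step limit is hit.
std::vector<Vec3> trace(Vec3 seed, int dir, double dt, int max_steps) {
    std::vector<Vec3> line{ seed };
    Vec3 p = seed;
    for (int i = 0; i < max_steps; ++i) {
        Vec3 v = velocity(p);
        p = { p.x + dir * v.x * dt, p.y + dir * v.y * dt, p.z + dir * v.z * dt };
        if (p.z > 0.0) break;            // boundary condition from step 6
        line.push_back(p);
    }
    return line;
}

// Step 7: one complete streamline is the backward trace (reversed)
// joined to the forward trace from the same seed.
std::vector<Vec3> streamline(Vec3 seed, double dt, int max_steps) {
    std::vector<Vec3> back = trace(seed, -1, dt, max_steps);
    std::vector<Vec3> line(back.rbegin(), back.rend());   // reverse backward part
    std::vector<Vec3> fwd = trace(seed, +1, dt, max_steps);
    line.insert(line.end(), fwd.begin() + 1, fwd.end());  // skip duplicate seed
    return line;
}
```

In the patented method each such trace runs in one GPU thread; here the same logic is shown sequentially for clarity.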
Further, in step 1, the length Lx and width Ly of the computational region projected onto the x-y plane, the key aquifer parameters ε_x, ε_y, α, β, and the coefficient matrix b_mn in the velocity vector formula are given; the regional model is discretized into a grid model, the model is then partitioned, and the load among the processes is balanced.
Further, in step 2, the basic model information, the aquifer parameters, the coefficients in the velocity vector formula and the sub-grid size are broadcast to the other processes.
Further, in step 3, each process, distributed over multiple servers, invokes the GPU card with its uniquely assigned number, creates a stream on that GPU, allocates GPU memory, and copies the parameters required for the GPU computation from the CPU into GPU global memory.
Further, in step 4, each GPU launches multiple GPU threads that simultaneously compute the physical coordinates p(x, y, z) of the grid cells g(ix, iy, iz), and the particle tracking process then begins.
Further, step 5 comprises:
The velocity vector formula:
In formulas (1), (2) and (3), ε_x, ε_y, α, β are the key aquifer parameters, b_mn is the coefficient matrix, and Lx, Ly are the length and width of the computational region projected onto the x-y plane. The formulas for x_p, y_p and z_p are as follows:
In formula (4), x, y, z are the physical coordinates of the particle. In formula (5), Z_mn denotes the characteristic function of the hydraulic head distribution, computed as follows:
W_mn denotes the characteristic function of the vertical flow velocity, as shown in formula (6).
In formulas (5) and (6), λ_mn is computed from formula (7).
Further, step 6 comprises:
The boundary condition is: z > 0 or
If the boundary condition is not satisfied, one step is tracked forward or backward according to formula (8) to obtain the new particle position:
In formula (8), DIR takes the value 1 or -1, where 1 denotes one forward tracking step and -1 one backward tracking step, and Δt denotes one time step.
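Since formula (8) advances the particle by its velocity times a signed time step, a single tracking step can be written as below. This is a sketch of that update rule only; the names are illustrative.

```cpp
struct Vec3 { double x, y, z; };

// One tracking step per formula (8): p_new = p + DIR * Vp * dt.
// DIR = +1 traces forward, -1 backward; vp is the velocity Vp(Vx, Vy, Vz)
// from step 5 and dt is one time step.
Vec3 track_step(const Vec3& p, const Vec3& vp, int dir, double dt) {
    return { p.x + dir * vp.x * dt,
             p.y + dir * vp.y * dt,
             p.z + dir * vp.z * dt };
}
```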
Further, in step 7, the GPU results are copied back from the GPU to the CPU using a CUDA copy function.
Further, in step 8, inter-process data transmission is realized with the MPI collective communication function MPI_Gather, and the results of all processes are gathered into the host process; finally, after invalid result data is removed, the output is written to a result file.
Further, in step 1, multiple processes distributed over multiple servers are set up with MPI parallel technology, realizing process-level parallelism.
Further, in step 3, based on the CUDA platform, functions such as cudaSetDevice(), cudaStreamCreate(), cudaMalloc() and cudaMemcpyAsync() are called to designate the GPU card responsible for the computation, create a stream, allocate GPU memory, and transfer data between the CPU and the GPU.
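A common way to give each MPI rank a uniquely numbered GPU on its node is to map the rank onto the per-node device count. The sketch below assumes a `gpus_per_node` value supplied by the cluster configuration; the CUDA calls the process would then make are shown only as comments, since they require a GPU to execute.

```cpp
// Map an MPI rank to a GPU index on its node so that every process
// drives a distinct card. gpus_per_node is an assumed, cluster-specific value.
int gpu_for_rank(int rank, int gpus_per_node) {
    // After computing the index, the actual method would call, e.g.:
    //   cudaSetDevice(gpu_for_rank(rank, gpus_per_node));
    //   cudaStreamCreate(&stream);   // create a stream on that GPU
    //   cudaMalloc(&d_buf, bytes);   // allocate GPU memory
    //   cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    return rank % gpus_per_node;
}
```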
Further, in step 3, the most frequently used parameters are loaded into registers, which have the fastest read speed; the largest and infrequently read data are loaded into global memory, which has the largest capacity but a slow read speed; constant parameters are loaded into constant memory, which has a fast read speed and is read-only at run time; data whose access positions are logically close and that are read frequently are loaded into texture memory.
Further, in step 4, based on the CUDA parallel architecture, kernel functions executed on the GPU are invoked to realize thread-level parallelism within each process. Each process sets the thread-block configuration and the number of threads per block, i.e., the total number of GPU threads used for parallel acceleration, realizing thread-level parallelism.
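Choosing the launch configuration reduces to covering every grid cell with a thread. A minimal sketch, assuming the fixed 32 threads per block used in embodiment 1:

```cpp
// Given the total number of grid cells and a fixed threads-per-block
// (32 in embodiment 1), compute how many blocks are needed so that
// every cell gets its own thread (ceiling division).
int blocks_needed(long long total_cells, int threads_per_block) {
    return static_cast<int>((total_cells + threads_per_block - 1) / threads_per_block);
}
```

For the 101 x 201 x 81 grid of embodiment 1 this gives the block count for a launch such as `kernel<<<blocks, 32>>>(...)`.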
The invention has the following beneficial effects:
For the large-scale application of particle-tracking streamline simulation, the present invention parallelizes the program, realizes multi-node, multi-GPU parallel computation, and obtains a large speedup.
Brief description of the drawings
Fig. 1 is the program flow chart provided by embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the streamline simulation and regional gridding provided by embodiment 1;
Fig. 3 is a schematic diagram of the CUDA thread model and GPU memory model provided by embodiment 1;
Fig. 4 shows the experimental results of embodiment 1.
Detailed description
To make the technical problem to be solved, the technical solution and the advantages of the present invention clearer, they are described in detail below with reference to the drawings and specific embodiments.
The present invention provides an acceleration method that realizes streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm. As shown in Figs. 1-4, the method runs simultaneously on multiple computers or servers, each with several GPUs, and comprises:
Step 1: the host process (process 0) discretizes the regional model into a number of grid cells, initializes the basic model information, the aquifer parameters and the coefficients in the velocity vector formula, completes the grid division, and balances the load among the processes;
Step 2: the basic model information, the aquifer parameters, the coefficients in the velocity vector formula and the sub-grid size are broadcast to the other processes;
Step 3: each process, distributed over multiple servers, invokes the GPU card with its uniquely assigned number, creates a stream on that GPU, allocates GPU memory, and copies the parameters required for the GPU computation from the CPU into GPU global memory;
Step 4: each GPU launches multiple GPU threads that simultaneously compute the physical coordinates p(x, y, z) of the grid cells g(ix, iy, iz), and the particle tracking process then begins;
Step 5: the particle flow velocity Vp(Vx, Vy, Vz) is computed from the velocity vector formula;
Step 6: if a boundary condition is encountered, such as a stagnation point, the forward or backward tracking ends and step 7 is executed; otherwise, the particle travel time is computed from the flow velocity and the particle's next position is obtained, completing one forward or backward tracking step; the new position becomes the current particle coordinate, and step 5 is executed again;
Step 7: connecting in sequence the coordinate points computed by a single GPU thread yields one complete streamline; once all threads of a GPU have finished, the GPU results are copied from the GPU into the CPU memory of the owning process;
Step 8: after the GPU transfer succeeds, inter-process data transmission is realized with the MPI collective communication mechanism, and the results of all processes are gathered into the host process; finally, after invalid result data is removed, the output is written to a result file.
The beneficial effects of the present invention are:
For the large-scale application of particle-tracking streamline simulation, the present invention parallelizes the program, realizes multi-node, multi-GPU parallel computation, and obtains a large speedup.
Further, in step 1, the length Lx and width Ly of the computational region projected onto the x-y plane, the key aquifer parameters ε_x, ε_y, α, β, and the coefficient matrix b_mn in the velocity vector formula are given; the regional model is discretized into a grid model, the model is then partitioned, and the load among the processes is balanced.
Further, in step 2, the basic model information, the aquifer parameters, the coefficients in the velocity vector formula and the sub-grid size are broadcast to the other processes.
Preferably, in step 3, each process, distributed over multiple servers, invokes the GPU card with its uniquely assigned number, creates a stream on that GPU, allocates GPU memory, and copies the parameters required for the GPU computation from the CPU into GPU global memory.
Further, in step 4, each GPU launches multiple GPU threads that simultaneously compute the physical coordinates p(x, y, z) of the grid cells g(ix, iy, iz), and the particle tracking process then begins.
Preferably, step 5 comprises:
The velocity vector formula:
In formulas (1), (2) and (3), ε_x, ε_y, α, β are the key aquifer parameters, b_mn is the coefficient matrix, and Lx, Ly are the length and width of the computational region projected onto the x-y plane. The formulas for x_p, y_p and z_p are as follows:
In formula (4), x, y, z are the physical coordinates of the particle. In formula (5), Z_mn denotes the characteristic function of the hydraulic head distribution, computed as follows:
W_mn denotes the characteristic function of the vertical flow velocity, as shown in formula (6).
In formulas (5) and (6), λ_mn is computed from formula (7).
Further, step 6 comprises:
The boundary condition is: z > 0 or
If the boundary condition is not satisfied, one step is tracked forward or backward according to formula (8) to obtain the new particle position:
In formula (8), DIR takes the value 1 or -1, where 1 denotes one forward tracking step and -1 one backward tracking step, and Δt denotes one time step.
Further, in step 7, the GPU results are copied back from the GPU to the CPU using a CUDA copy function.
Further, in step 8, inter-process data transmission is realized with the MPI collective communication function MPI_Gather, and the results of all processes are gathered into the host process; finally, after invalid result data is removed, the output is written to a result file.
In step 1, multiple processes distributed over multiple servers are set up with MPI parallel technology, realizing process-level parallelism.
In step 3, based on the CUDA platform, functions such as cudaSetDevice(), cudaStreamCreate(), cudaMalloc() and cudaMemcpyAsync() are called to designate the GPU card responsible for the computation, create a stream, allocate GPU memory, and transfer data between the CPU and the GPU.
In step 3, the most frequently used parameters are loaded into registers, which have the fastest read speed; the largest and infrequently read data are loaded into global memory, which has the largest capacity but a slow read speed; constant parameters are loaded into constant memory, which has a fast read speed and is read-only at run time; data whose access positions are logically close and that are read frequently are loaded into texture memory.
In step 4, based on the CUDA parallel architecture, kernel functions executed on the GPU are invoked to realize thread-level parallelism within each process. Each process sets the thread-block configuration and the number of threads per block, i.e., the total number of GPU threads used for parallel acceleration, realizing thread-level parallelism.
CPU-GPU heterogeneous programming is realized with a hybrid MPI+CUDA technique, using multiple GPUs on multiple servers, thereby realizing multi-GPU parallel acceleration and further increasing the computing speed.
In the present invention, the characteristics of the various GPU storage structures are fully exploited to optimize data access: the most frequently used parameters are loaded into registers, which have the fastest read speed; the largest and infrequently read data are loaded into global memory, which has the largest capacity but a slow read speed; constant parameters are loaded into constant memory, which has a fast read speed and is read-only at run time; data whose access positions are logically close and that are read frequently are loaded into texture memory.
Embodiment 1:
The present invention is further described below through an embodiment. In a basin groundwater streamline simulation, the basin region projected onto the x-y plane has a length of 20,000 meters and a width of 10,000 meters, the maximum basin depth is 8,000 meters, the aquifer parameters are 0, 0, 1, 1 respectively, the coefficient matrix is b = {{40, 0, 10, 0}, {20, 0, 0, 0}, {10, 0, 0, 0}, {0, 0, 0, 0}}, and the ideal step length is 6 meters. The acceleration procedure for the streamline simulation of this region is shown in Fig. 1.
Step 1: the host process (process 0) discretizes the three-dimensional groundwater region of Lx × Ly × Lz into an Nx × Ny × Nz grid, where Nx, Ny and Nz denote the numbers of rows, columns and layers respectively. Here Lx = 20000, Ly = 10000, Lz = 8000, and after discretization Nx = 101, Ny = 201, Nz = 81. The aquifer parameters are α = 0, β = 0, ε_x = 1, ε_y = 1. The coefficient matrix is b = {{40, 0, 10, 0}, {20, 0, 0, 0}, {10, 0, 0, 0}, {0, 0, 0, 0}}. The grid model is evenly divided into n blocks along the y direction, where n is the total number of processes.
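The even y-direction block division of step 1 can be sketched as follows. This is one plausible load-balanced split (ranges of near-equal width, with any remainder spread over the first ranks); the patent does not spell out the remainder handling, so that detail is an assumption.

```cpp
#include <utility>

// Evenly split Ny grid columns among n processes along the y direction,
// giving the first (Ny % n) processes one extra column for load balance.
// Returns the half-open [begin, end) column range owned by `rank`.
std::pair<int, int> y_partition(int Ny, int n, int rank) {
    int base = Ny / n, extra = Ny % n;
    int begin = rank * base + (rank < extra ? rank : extra);
    int width = base + (rank < extra ? 1 : 0);
    return { begin, begin + width };
}
```

With Ny = 201 and n = 4 processes, the ranges are [0,51), [51,101), [101,151) and [151,201), differing by at most one column.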
Step 2: the basic model information, the aquifer parameters, the coefficients in the velocity vector formula and the sub-grid size are broadcast to the other processes using the MPI_Bcast communication function;
Step 3: each process, distributed over multiple servers, invokes the GPU card with its uniquely assigned number, creates a stream on that GPU, allocates GPU memory, and copies the parameters required for the GPU computation from the CPU into GPU global memory;
Step 4: each GPU launches multiple GPU threads that simultaneously compute the physical coordinates p(x, y, z) of the grid cells g(ix, iy, iz), and the particle tracking process then begins;
Step 5: the flow velocity at the current particle position p(x, y, z) is computed; the velocity components Vx, Vy, Vz in the different directions are solved according to the following formulas:
In formulas (1), (2) and (3), ε_x, ε_y, α, β are the key aquifer parameters, b_mn is the coefficient matrix, and Lx, Ly are the length and width of the computational region projected onto the x-y plane. The formulas for x_p, y_p and z_p are as follows:
In formula (4), x, y, z are the physical coordinates of the particle. In formula (5), Z_mn denotes the characteristic function of the hydraulic head distribution, computed as follows:
W_mn denotes the characteristic function of the vertical flow velocity, as shown in formula (6).
In formulas (5) and (6), λ_mn is computed from formula (7).
Step 6: if the boundary condition is encountered (z > 0 or ), the forward or backward tracking ends and step 7 is executed; otherwise, the travel time (step length / velocity) is computed from the current flow velocity, and the particle's next position is computed from that time:
In formula (8), DIR takes the value 1 or -1, where 1 denotes one forward tracking step and -1 one backward tracking step, and Δt denotes one time step.
Step 7: the GPU results are copied from the GPU into the CPU memory of the owning process using a CUDA copy function.
Step 8: inter-process data transmission is realized with the MPI collective communication mechanism, and the results of all processes are gathered into the host process; finally, after invalid result data is removed, the output is written to a result file.
The acceleration method that realizes streamline simulation with the multi-node, multi-GPU parallel particle tracking algorithm runs on multiple servers, each equipped with several GPU cards. Specifically, the computational core of the invention combines MPI technology with CUDA kernel functions to realize multi-GPU parallelism across multiple servers (computers).
To realize thread-level parallelism with CUDA, the thread-block configuration and the number of threads per block must be set; this determines the total number of GPU threads used in parallel. In this embodiment 1, the number of threads per block is set to 32, and the thread blocks are organized as a two-dimensional grid. In the kernel function executed on the designated GPU, the absolute thread index is obtained as threadIdx.x + blockIdx.x × blockDim.x + gridDim.x × blockDim.x × blockIdx.y; that thread is responsible for tracing the fluid particle in grid cell g(ix, iy, iz). The CUDA execution model and the GPU storage model are shown in Fig. 3.
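The index formula quoted above can be checked with a small CPU-side emulation; the built-in variables of the kernel are passed here as ordinary parameters.

```cpp
// CPU-side emulation of the absolute thread index used in the kernel:
// index = threadIdx.x + blockIdx.x*blockDim.x + gridDim.x*blockDim.x*blockIdx.y
// Each row of blocks (blockIdx.y) contributes gridDim.x * blockDim.x threads.
int absolute_index(int threadIdx_x, int blockIdx_x, int blockIdx_y,
                   int blockDim_x, int gridDim_x) {
    return threadIdx_x + blockIdx_x * blockDim_x
         + gridDim_x * blockDim_x * blockIdx_y;
}
```

With blockDim.x = 32 and gridDim.x = 4, the first block row covers indices 0-127 and the second row starts at 128, so every thread receives a distinct cell index.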
Threads read data from several kinds of memory, both within and across thread blocks: each thread's own local memory and registers, the shared memory within a thread block, and the global, constant and texture memory visible to all blocks in the grid. Since the threads of a thread block can read the data in that block's shared memory, and shared memory reads and writes are very fast, some commonly used constants are stored in each block's shared memory with the __shared__ qualifier; when these constant vectors are accessed repeatedly, much time is saved. Using the different levels of the GPU storage hierarchy in this way optimizes data access performance.
Further, the hybrid MPI+CUDA parallel technique lets the GPUs on multiple compute nodes execute synchronously, expanding the number of GPUs and thereby accelerating the streamline simulation with a better speedup.
Because MPI_Gather requires every process to send a data buffer of the same size, while in practice the result sizes may differ, invalid data are appended so that the data volume of every process equals the common maximum. After the main process has gathered the data, the appended invalid entries are removed, and the result is finally saved or printed.
The speedup obtained by the parallel implementation of this embodiment 1 on the multi-node, multi-GPU servers is shown in Fig. 4. With multiple processes each driving the same number of GPUs in parallel, the acceleration effect is significant, and the speedup eventually stabilizes.
In summary, the invention has the following advantages:
For the large-scale application of particle-tracking streamline simulation, the present invention parallelizes the program, realizes multi-node, multi-GPU parallel computation, and obtains a large speedup.
Although the present invention has been described in detail above with general explanations and specific embodiments, modifications or improvements can be made on its basis, as will be apparent to those skilled in the art. Therefore, such modifications or improvements made without departing from the spirit of the present invention all fall within the scope of the claimed invention.
Claims (11)
1. An acceleration method that realizes streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm, characterized in that the method runs simultaneously on multiple computers or servers, each with several GPUs, and comprises:
Step 1: the host process (process 0) discretizes the regional model into a number of grid cells, initializes the basic model information, the aquifer parameters and the coefficients in the velocity vector formula, completes the grid division, and balances the load among the processes;
Step 2: the basic model information, the aquifer parameters, the coefficients in the velocity vector formula and the sub-model size are broadcast to the other processes;
Step 3: each process, distributed over multiple servers, invokes the GPU card with its uniquely assigned number, creates a stream on that GPU, allocates GPU memory, and copies the parameters required for the GPU computation from the CPU into GPU global memory;
Step 4: each GPU launches multiple GPU threads that compute in parallel the physical coordinates p(x, y, z) of the grid cells g(ix, iy, iz), and the particle tracking process begins;
Step 5: the particle flow velocity Vp(Vx, Vy, Vz) is computed from the velocity vector formula;
Step 6: if a boundary condition is encountered, such as a stagnation point, the forward or backward tracking ends and step 7 is executed; otherwise, the particle travel time is computed from the flow velocity and the particle's next position is obtained, completing one forward or backward tracking step; the new position becomes the current particle coordinate, and step 5 is executed again;
Step 7: connecting in sequence the coordinate points computed by a single GPU thread yields one complete streamline; once all threads of a GPU have finished, the GPU results are copied from the GPU into the CPU memory of the owning process;
Step 8: after the GPU transfer succeeds, inter-process data transmission is realized with the MPI collective communication mechanism, and the results of all processes are gathered into the host process; finally, after invalid result data is removed, the output is written to a result file.
2. The acceleration method for realizing streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm according to claim 1, characterized in that in step 1, the length Lx and width Ly of the computational region projected onto the x-y plane, the key aquifer parameters ε_x, ε_y, α, β, and the coefficient matrix b_mn in the velocity vector formula are given; the regional model is discretized into a grid model, which the host process then divides into several sub-models while ensuring load balance.
3. The acceleration method for realizing streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm according to claim 1, characterized in that in step 2, the basic information of the grid model, the aquifer parameters, the coefficients in the velocity vector formula and the sub-model size are broadcast to the other processes.
4. The acceleration method for realizing streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm according to claim 1, characterized in that in step 3, each process, distributed over multiple servers, invokes the GPU card with its uniquely assigned number, creates a stream on that GPU, allocates GPU memory, and copies the parameters required for the GPU computation from the CPU into GPU global memory.
5. The acceleration method for realizing streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm according to claim 1, characterized in that in step 4, each GPU launches multiple GPU threads that compute in parallel the physical coordinates p(x, y, z) of the grid cells g(ix, iy, iz), and the particle tracking process begins.
6. The acceleration method for realizing streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm according to claim 1, characterized in that step 5 comprises:
The velocity vector formula:
In formulas (1), (2) and (3), ε_x, ε_y, α, β are the key aquifer parameters, b_mn is the coefficient matrix, and Lx, Ly are the length and width of the computational region projected onto the x-y plane; the formulas for x_p, y_p and z_p are as follows:
In formula (4), x, y, z are the physical coordinates of the particle; in formula (5), Z_mn denotes the characteristic function of the hydraulic head distribution, computed as follows:
W_mn denotes the characteristic function of the vertical flow velocity, as shown in formula (6);
in formulas (5) and (6), λ_mn is computed from formula (7).
7. The acceleration method for realizing streamline simulation with a multi-node, multi-GPU parallel particle tracking algorithm according to claim 1, characterized in that step 6 comprises:
The boundary condition is: z > 0 or
If the boundary condition is not satisfied, one step is tracked forward or backward according to formula (8):
In formula (8), DIR takes the value 1 or -1, where 1 denotes one forward tracking step and -1 one backward tracking step, and Δt denotes one time step.
8. according to claim 1 realize adding for streamline simulation based on the more GPU parallel computation particles trace algorithms of multinode
Fast method, which is characterized in that the step 7 includes: the copy function using CUDA, and calculated result is transferred back to CPU from GPU
On.
9. The acceleration method for realizing streamline simulation based on a multi-node multi-GPU parallel particle tracking algorithm according to claim 3, wherein a message-passing parallel architecture, namely the MPI parallel technique, is used to start multiple processes distributed across several computers; each process is responsible for a part of the streamline generation process, realizing process-level parallelism.
10. The acceleration method for realizing streamline simulation based on a multi-node multi-GPU parallel particle tracking algorithm according to claim 3, wherein the most frequently used parameters are loaded into registers, which have the fastest read speed; the largest and least frequently read data are loaded into global memory, which has the largest capacity but a slow read speed; fixed parameters are loaded into constant memory, which is read-only during kernel execution and fast to read; and data that are read frequently from logically nearby locations are loaded into texture memory.
11. The acceleration method for realizing streamline simulation based on a multi-node multi-GPU parallel particle tracking algorithm according to claim 4, wherein each process controls one GPU to realize parallel computation of its task; using the CUDA parallel architecture, the number of thread blocks and the number of threads per block are set, i.e., the total number of parallel threads on the GPU, realizing thread-level parallelism within a process.
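A small helper illustrating the grid/block arithmetic the claim describes: ceiling division so that blocks × threads-per-block covers the requested total thread count, as one would pass to a CUDA kernel launch. The default of 256 threads per block is an assumption for illustration, not a value from the patent:

```python
def launch_config(n_threads_total, threads_per_block=256):
    """Return (blocks, threads_per_block) such that
    blocks * threads_per_block >= n_threads_total, using ceiling
    division -- the usual CUDA execution-configuration arithmetic."""
    blocks = (n_threads_total + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block
```

For one thread per traced particle, n_threads_total would be the number of streamline seeds assigned to the process's GPU; surplus threads in the last block simply exit early in the kernel.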
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811574392.7A CN109857543A (en) | 2018-12-21 | 2018-12-21 | A streamline simulation acceleration method based on multi-node multi-GPU computation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811574392.7A CN109857543A (en) | 2018-12-21 | 2018-12-21 | A streamline simulation acceleration method based on multi-node multi-GPU computation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109857543A true CN109857543A (en) | 2019-06-07 |
Family
ID=66891995
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811574392.7A Pending CN109857543A (en) | 2018-12-21 | 2018-12-21 | A streamline simulation acceleration method based on multi-node multi-GPU computation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109857543A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103425523A (en) * | 2013-06-20 | 2013-12-04 | 国家电网公司 | Parallel computing system and method of PMU (Phasor Measurement Unit) online application system |
CN104036031A (en) * | 2014-06-27 | 2014-09-10 | 北京航空航天大学 | Large-scale CFD parallel computing method based on distributed Mysql cluster storage |
CN104714850A (en) * | 2015-03-02 | 2015-06-17 | 心医国际数字医疗系统(大连)有限公司 | Heterogeneous joint account balance method based on OPENCL |
CN107515987A (en) * | 2017-08-25 | 2017-12-26 | 中国地质大学(北京) | The simulation accelerated method of Groundwater Flow based on more relaxation Lattice Boltzmann models |
CN108427605A (en) * | 2018-02-09 | 2018-08-21 | 中国地质大学(北京) | The accelerated method of streamline simulation is realized based on particles trace algorithm |
Non-Patent Citations (3)
Title |
---|
李丹丹 (Li Dandan): "Research on Parallel Computing of Spatial Data for Groundwater Flow", China Doctoral Dissertations Full-text Database, Basic Sciences * |
李安平 (Li Anping): "Research on CUDA-based Parallel Image Processing", China Master's Theses Full-text Database, Information Science and Technology * |
贾永红 (Jia Yonghong): "Practical Course on Digital Image Processing", 31 January 2007 *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796701B (en) * | 2019-10-21 | 2022-06-07 | 深圳市瑞立视多媒体科技有限公司 | Identification method, device and equipment of mark points and storage medium |
CN110796701A (en) * | 2019-10-21 | 2020-02-14 | 深圳市瑞立视多媒体科技有限公司 | Identification method, device and equipment of mark points and storage medium |
CN111186139A (en) * | 2019-12-25 | 2020-05-22 | 西北工业大学 | Multi-level parallel slicing method for 3D printing model |
CN111186139B (en) * | 2019-12-25 | 2022-03-15 | 西北工业大学 | Multi-level parallel slicing method for 3D printing model |
CN111552478A (en) * | 2020-04-30 | 2020-08-18 | 上海商汤智能科技有限公司 | Apparatus, method and storage medium for generating CUDA program |
CN111552478B (en) * | 2020-04-30 | 2024-03-22 | 上海商汤智能科技有限公司 | Apparatus, method and storage medium for generating CUDA program |
CN112148437A (en) * | 2020-10-21 | 2020-12-29 | 深圳致星科技有限公司 | Calculation task acceleration processing method, device and equipment for federal learning |
CN112257313A (en) * | 2020-10-21 | 2021-01-22 | 西安理工大学 | Pollutant transport high-resolution numerical simulation method based on GPU acceleration |
CN112257313B (en) * | 2020-10-21 | 2024-05-14 | 西安理工大学 | GPU acceleration-based high-resolution numerical simulation method for pollutant transportation |
CN112148437B (en) * | 2020-10-21 | 2022-04-01 | 深圳致星科技有限公司 | Calculation task acceleration processing method, device and equipment for federal learning |
CN114490011A (en) * | 2020-11-12 | 2022-05-13 | 上海交通大学 | Parallel acceleration implementation method of N-body simulation in heterogeneous architecture |
CN112380793A (en) * | 2020-11-18 | 2021-02-19 | 上海交通大学 | Turbulence combustion numerical simulation parallel acceleration implementation method based on GPU |
CN112380793B (en) * | 2020-11-18 | 2024-02-13 | 上海交通大学 | GPU-based turbulence combustion numerical simulation parallel acceleration implementation method |
CN112947870B (en) * | 2021-01-21 | 2022-12-30 | 西北工业大学 | G-code parallel generation method of 3D printing model |
CN112947870A (en) * | 2021-01-21 | 2021-06-11 | 西北工业大学 | G-code parallel generation method of 3D printing model |
CN113660046B (en) * | 2021-08-17 | 2022-11-11 | 东南大学 | Method for accelerating generation of large-scale wireless channel coefficients |
CN113660046A (en) * | 2021-08-17 | 2021-11-16 | 东南大学 | Method for accelerating generation of large-scale wireless channel coefficients |
CN114970395A (en) * | 2022-06-10 | 2022-08-30 | 青岛大学 | Large-scale fluid simulation method and system based on the two-dimensional Saint-Venant equations |
CN117687779A (en) * | 2023-11-30 | 2024-03-12 | 山东诚泉信息科技有限责任公司 | Complex electric wave propagation prediction rapid calculation method based on heterogeneous multi-core calculation platform |
CN117687779B (en) * | 2023-11-30 | 2024-04-26 | 山东诚泉信息科技有限责任公司 | Complex electric wave propagation prediction rapid calculation method based on heterogeneous multi-core calculation platform |
CN118502964A (en) * | 2024-07-12 | 2024-08-16 | 安徽大学 | Tokamak new classical circumferential viscous torque CUDA simulation implementation method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109857543A (en) | A streamline simulation acceleration method based on multi-node multi-GPU computation | |
CN103970960B (en) | GPU-accelerated parallel element-free Galerkin structural topology optimization method | |
Brodtkorb et al. | Efficient shallow water simulations on GPUs: Implementation, visualization, verification, and validation | |
CN103765376B (en) | Graphics processing unit with non-blocking parallel architecture | |
CN101727653B (en) | Graphics processing unit based discrete simulation computation method of multicomponent system | |
CN103440163B (en) | Accelerator simulation method based on the PIC model using GPU parallel implementation | |
CN106021828A (en) | Fluid simulation method based on the lattice Boltzmann model | |
US11145099B2 (en) | Computerized rendering of objects having anisotropic elastoplasticity for codimensional frictional contact | |
CN106547627A (en) | Method and system for accelerating Spark MLlib data processing | |
CN103345580B (en) | Parallel CFD method based on the lattice Boltzmann method | |
CN109146067A (en) | A kind of Policy convolutional neural networks accelerator based on FPGA | |
CN104360896A (en) | Parallel fluid simulation acceleration method based on GPU (Graphics Processing Unit) cluster | |
KR100914869B1 (en) | System and Method for Real-Time Cloth Simulation | |
CN104392147A (en) | Region scale soil erosion modeling-oriented terrain factor parallel computing method | |
CN111445003A (en) | Neural network generator | |
US20190318533A1 (en) | Realism of scenes involving water surfaces during rendering | |
CN107016180A (en) | A particle flow simulation method | |
CN107025332A (en) | An SPH-based visualization method for the microscopic water diffusion process on fabric surfaces | |
JPH11502958A (en) | Collision calculation for physical process simulation | |
Wang et al. | FP-AMR: A Reconfigurable Fabric Framework for Adaptive Mesh Refinement Applications | |
CN112100939B (en) | Real-time fluid simulation method and system based on computer loader | |
CN108427605B (en) | Acceleration method for realizing streamline simulation based on particle tracking algorithm | |
CN106373192B (en) | A kind of non-topological coherence three-dimensional grid block tracing algorithm | |
CN109949398A (en) | Particle rendering method, apparatus and electronic device | |
Amador et al. | CUDA-based linear solvers for stable fluids |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2019-06-07