CN103064819A

CN103064819A - Method for rapidly achieving lattice Boltzmann parallel acceleration using MIC (Many Integrated Core)

Info

Publication number: CN103064819A
Application number: CN2012104120747A
Authority: CN
Inventors: 张广勇; 张清
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2012-10-25
Filing date: 2012-10-25
Publication date: 2013-04-24

Abstract

The invention provides a method for rapidly achieving parallel acceleration of the lattice Boltzmann method using MIC (Many Integrated Core). In the method, the central processing unit (CPU) end sets parameters such as the computational domain, reference length, inflow velocity, density and Reynolds number according to the physical problem, and sets the thread count of the kernel according to the number of cores of the MIC card. The MIC end calculates the equilibrium distribution functions of all lattice points in every direction from the macroscopic parameters (density, velocity, Reynolds number, viscosity coefficient and the like) and uses them as the initial field of the computation, solves the discrete equation and performs the boundary treatment in parallel, and passes the result of the final iteration back to the CPU end. The fast computing capability of the MIC end is used for the migration and collision calculations of the lattice Boltzmann method, and the iterative process of the lattice Boltzmann method is accelerated through the cooperative operation of the CPU end and the MIC end.

Description

A method for rapidly achieving parallel acceleration of the lattice Boltzmann method using MIC

Technical field

The present invention relates to the fields of high-performance computing and computational fluid dynamics, and specifically to a method for rapidly accelerating the lattice Boltzmann method using Intel's MIC.

Background technology

Over the past 20 years the lattice Boltzmann method (LBM) has developed into an effective numerical simulation method. It is a mesoscopic method lying between microscopic molecular dynamics methods and macroscopic methods based on the continuum hypothesis. Unlike traditional fluid simulation methods, it is based on molecular kinetic theory: it tracks the transport of the particle distribution function and then obtains macroscopic averaged quantities by taking moments of the distribution function. The kinetic-theory character of the lattice Boltzmann method makes it more effective for simulating many complex flows, such as flows in porous media, suspension flows, multiphase flows and multicomponent flows. The lattice Boltzmann method is inherently parallel and has further advantages such as simple boundary treatment and ease of implementation.

The basic procedure for solving a physical problem with the LBM is shown in Fig. 1. For a specific physical problem, the following preparatory steps are carried out first:

(1) carry out physical modeling on the basis of suitable simplifying assumptions, determine the computational domain, the initial conditions, the boundary conditions and so on, and select the lattice Boltzmann model appropriate to the physical problem;

(2) divide the domain into a grid, the grid size being assumed to be NX*NY;

(3) select the governing equation according to the chosen lattice Boltzmann model and discretize it. For example, if the standard lattice Boltzmann method is used to simulate isothermal incompressible flow, the discretized governing equation is the LBGK equation.

These first three steps are carried out before the numerical simulation. The numerical simulation stage then begins:

(4) according to the physical problem, specify the macroscopic parameters (density, velocity, viscosity coefficient, etc.) on all lattice points, and calculate from them the equilibrium distribution functions of all directions on all lattice points, which serve as the initial field of the computation;

(5) solve the discretized governing equation, for example by solving the LBGK equation with the migration-collision rule;

(6) apply the boundary treatment prescribed by the boundary conditions on the corresponding boundary lattice points;

(7) calculate the macroscopic parameters on each lattice point according to the macroscopic-quantity definitions of the chosen lattice Boltzmann model;

(8) judge whether the computation has converged;

(9) if the computation has converged, output the result; otherwise return to step 4 and continue solving until convergence.
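By way of illustration only, the convergence judgment of step (8) can be sketched in C as follows. The description does not state a convergence criterion; the sketch assumes the relative change of the velocity field between two successive iterations is compared with a tolerance. The function name `lbm_converged` and the parameter `tol` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the relative L2 change of the velocity field between two
   successive iterations is at most tol, and 0 otherwise.
   Assumed criterion: the patent text does not specify one. */
int lbm_converged(const float *ux, const float *uy,
                  const float *ux_old, const float *uy_old,
                  size_t n, double tol)
{
    double diff = 0.0, norm = 0.0;
    assert(ux && uy && ux_old && uy_old && tol > 0.0);

    for (size_t k = 0; k < n; k++) {
        double dx = (double)ux[k] - ux_old[k];
        double dy = (double)uy[k] - uy_old[k];
        diff += dx * dx + dy * dy;
        norm += (double)ux[k] * ux[k] + (double)uy[k] * uy[k];
    }
    if (norm == 0.0)            /* fluid at rest everywhere */
        return diff == 0.0;

    /* compare squared norms so that no square root is needed */
    return diff <= tol * tol * norm;
}
```

In such a sketch the check would typically be evaluated only every few hundred iterations, since it requires the macroscopic velocity of the previous iteration to be kept.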

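The widely used single-relaxation-time (BGK) lattice Boltzmann model is based on the following evolution equation:

$$f_i(\mathbf{x} + \mathbf{e}_i\,\delta t,\; t + \delta t) - f_i(\mathbf{x}, t) = -\frac{1}{\tau}\left[ f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t) \right]$$

Here, $f_i(\mathbf{x}, t)$ is the particle distribution function, representing the probability that, at time $t$ and position $\mathbf{x}$, there is a particle moving with microscopic velocity $\mathbf{e}_i$. The relaxation time $\tau$ represents the rate at which local equilibrium is approached and is related to the kinematic viscosity of the fluid. The equilibrium distribution function $f_i^{eq}$ is a low-Mach-number approximation of the Maxwell-Boltzmann distribution and depends on the density $\rho$ and the flow velocity $\mathbf{u}$ of the fluid. The relation between them is given by:

$$f_i^{eq} = w_i\,\rho \left[ 1 + \frac{3\,(\mathbf{e}_i \cdot \mathbf{u})}{c^2} + \frac{9\,(\mathbf{e}_i \cdot \mathbf{u})^2}{2c^4} - \frac{3\,(\mathbf{u} \cdot \mathbf{u})}{2c^2} \right]$$

where $c = \delta x / \delta t$ is the lattice speed. In the D2Q9 model the weights are:

$$w_i = \begin{cases} 4/9, & \text{rest direction} \\ 1/9, & \text{four axis directions} \\ 1/36, & \text{four diagonal directions} \end{cases}$$

The fluid density and velocity are then calculated from the particle distribution function by:

$$\rho = \sum_{i} f_i, \qquad \rho\,\mathbf{u} = \sum_{i} \mathbf{e}_i f_i$$

The discrete velocities $\mathbf{e}_i$ and the number N of particle distribution functions depend on the chosen lattice Boltzmann model. In the D2Q9 model $\mathbf{e}_i$ has 9 components, and the number of corresponding particle distribution functions is also 9; see Fig. 2.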
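By way of illustration only, the D2Q9 equilibrium distribution function and the density and velocity moments can be sketched in C as follows. The sketch assumes lattice units (lattice speed c = 1), the standard D2Q9 weights 4/9, 1/9 and 1/36, and a direction ordering (rest, E, N, W, S, NE, NW, SW, SE) chosen here for illustration, which may differ from the numbering of Fig. 2. The names `d2q9_equilibrium` and `d2q9_moments` are hypothetical.

```c
#include <assert.h>

#define Q 9  /* number of discrete velocities in D2Q9 */

/* Assumed ordering: rest, E, N, W, S, NE, NW, SW, SE. */
static const int   EX[Q] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
static const int   EY[Q] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };
static const float WT[Q] = { 4.0f/9,
                             1.0f/9,  1.0f/9,  1.0f/9,  1.0f/9,
                             1.0f/36, 1.0f/36, 1.0f/36, 1.0f/36 };

/* Equilibrium distribution for density rho and velocity (ux, uy), with c = 1. */
void d2q9_equilibrium(float rho, float ux, float uy, float feq[Q])
{
    float usq = ux * ux + uy * uy;
    for (int i = 0; i < Q; i++) {
        float eu = EX[i] * ux + EY[i] * uy;   /* e_i . u */
        feq[i] = WT[i] * rho * (1.0f + 3.0f * eu + 4.5f * eu * eu - 1.5f * usq);
    }
}

/* Density and velocity as the zeroth and first moments of f. */
void d2q9_moments(const float f[Q], float *rho, float *ux, float *uy)
{
    float r = 0.0f, mx = 0.0f, my = 0.0f;
    for (int i = 0; i < Q; i++) {
        r  += f[i];
        mx += EX[i] * f[i];
        my += EY[i] * f[i];
    }
    assert(r > 0.0f);   /* a physical density is positive */
    *rho = r;
    *ux  = mx / r;
    *uy  = my / r;
}
```

Taking the moments of an equilibrium distribution returns the density and velocity it was built from, which is a convenient check when the initial field is set up.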

MIC (Many Integrated Core) is a many-core processor architecture released by Intel. Compared with general-purpose multi-core Xeon processors, the MIC many-core architecture has smaller cores and hardware threads; the density of computing resources on a many-core processor is higher, the on-chip communication overhead is significantly lower, and with more transistors and energy devoted to computation it can handle more complex parallel applications. Intel MIC products are based on the x86 architecture and are many-core processors built from heavyweight cores; they contain more than 50 cores, have a vector width of 512 bits, and deliver double-precision performance above 1 TFlops.

OpenMP is a set of compiler directives for multithreaded programming on shared-memory parallel systems. The MIC platform likewise supports the OpenMP programming model, which shortens the development cycle of MIC parallel programs and provides good support for traditional parallel programming languages. The OpenMP parallel programming model can therefore be used to implement high-performance parallel application software on the MIC platform quickly and to obtain a rapid improvement in performance.

The LBM requires a large amount of computation. Taking a square cavity flow calculation as an example, assume a grid size of 1024*1024 and 10000 iterations; in each iteration every grid point performs one migration and one collision calculation. On a quad-core Intel Xeon X5450 with a clock frequency of 3.00 GHz the calculation takes several hours, and with larger grids and more iterations it takes several days, which severely limits the performance of the LBM. At present, LBM computations are usually run on large x86 server clusters. The principle is to partition the computational load and assign it to the nodes; each node then computes its own part and exchanges data after every iteration step, and finally the results are gathered and output. Because the peak floating-point capability of the CPU is low and the network transmission overhead is very large, this approach consumes a great deal of time, electric power and maintenance cost. Moreover, as the required turnaround time of fluid simulations becomes shorter and the required accuracy higher, PC server clusters keep growing in scale and face huge challenges in system construction cost, data-center floor space, power consumption and heat dissipation, power limits, manageability, ease of programming, scalability, and management and maintenance expenses.

It can be seen that, to meet the demands of fluid simulation, a method is needed that improves the computational performance of the LBM while reducing machine-room construction costs and management, operation and maintenance costs. MIC can address these problems well.

Summary of the invention

The object of the invention is to provide a method for rapidly achieving parallel acceleration of the lattice Boltzmann method using MIC, which improves the processing performance and efficiency of the method and makes the CPU and MIC compute cooperatively, thereby meeting the demands of fluid simulation and reducing machine-room construction costs and management, operation and maintenance costs.

The object of the invention is achieved as follows. The initialization of the basic parameters is carried out at the CPU end, while the time-consuming and highly parallel parts, namely the calculation of the equilibrium distribution function, the macroscopic quantity statistics, the solution of the discrete equation and the boundary treatment, are parallelized with OpenMP and executed in parallel at the MIC end. The CPU and MIC compute cooperatively, finally achieving acceleration of the lattice Boltzmann method.

Specifically, the method involves a CPU end and a MIC end, wherein:

The CPU end divides the grid according to the physical problem, specifies the macroscopic parameters (density, velocity, reference length, Reynolds number and viscosity coefficient) on all lattice points of the grid, sets the thread execution configuration of the kernel, starts the parallel computation at the MIC end, and receives the iterative computation result of the MIC end to obtain the final fluid state;

According to the thread execution configuration, the MIC end uses the corresponding number of threads in parallel to obtain, from the equilibrium distribution functions of all directions on all lattice points, the distribution function of the next time layer through migration, collision and boundary treatment in turn;

The specific steps are as follows:

1) The CPU end divides the grid, sets the initial values of the macroscopic parameters on all lattice points of the grid, and sets the number of threads that execute the iterative computation in parallel according to the number of cores of the MIC card, specifically:

The flow-field computational domain is divided into a grid according to the requirements of the physical problem. The grid size is NX*NY, where NX is the number of grid points in the x direction and NY the number in the y direction, so the number of nodes on the grid is N=NX*NY. The number of threads that execute the migration-collision calculation in parallel is set according to the number of cores of the MIC card used: if the MIC card has M cores, the thread count of the migration-collision calculation is T=4*M;

2) The MIC end uses a kernel that computes the distribution function to calculate, from the initial macroscopic parameters, the equilibrium distribution functions of all directions on all lattice points, and then obtains the converged state of the flow field through repeated iterations of migration, collision and boundary treatment;

3) The CPU end specifies the initial macroscopic parameters of the flow field and the thread execution configuration, transfers them to the memory of the MIC end, and reads the final result from the MIC memory after the computation at the MIC end has completed;

4) According to the thread execution configuration, the MIC end uses the corresponding number of threads in parallel to obtain the distribution function of the next time layer of the flow field from the distribution function of the current time layer through migration, collision and boundary treatment, specifically:

5) The MIC end uses T threads in parallel to apply the migration-collision and boundary treatment algorithms to the N lattice points of the fluid grid, starting from the initial distribution function F_i^(0) or from the distribution function F_i^(k) calculated in the previous step, and obtains the distribution function F_i^(k+1) of the next time layer on the grid lattice points, where i takes the b+1 values 0 to b, representing the distribution functions of the b+1 directions on a lattice point, and k is an integer greater than or equal to 1;

6) The CPU end controls the number of iterations, and the final result at the MIC end is passed back to the CPU end, wherein the CPU end controls ITR iterations, that is, the kernel is called iteratively ITR times, ITR being the number of iterations carried out in the fluid simulation;

7) After ITR iterations have been performed on the grid lattice points, the MIC end calculates in parallel, using a kernel, the macroscopic flow-field parameters of velocity, density and stream function from the final distribution function F_i^(ITR) of the grid lattice points;

8) The CPU end writes the velocity field, density field and stream function calculated by the MIC end back to the storage module as the result.
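By way of illustration only, the thread configuration T=4*M and the control of ITR kernel calls can be sketched in C as follows. The sketch assumes two buffers that are swapped after each call, so that the buffer returned holds F_i^(ITR); the transfer of data to and from the MIC memory is omitted. The names `lbm_thread_count`, `lbm_step_fn`, `lbm_iterate` and `demo_step` are hypothetical, and `demo_step` is only a stand-in for the real migration-collision kernel.

```c
#include <assert.h>

/* Thread count from the description: T = 4 * M for a card with M cores. */
int lbm_thread_count(int mic_cores)
{
    assert(mic_cores > 0);
    return 4 * mic_cores;
}

/* Hypothetical kernel signature: advance one time layer from f_old to f_new. */
typedef void (*lbm_step_fn)(const float *f_old, float *f_new,
                            int nx, int ny, int threads);

/* Stand-in kernel for demonstration only: copies the field and adds 1. */
void demo_step(const float *f_old, float *f_new, int nx, int ny, int threads)
{
    #pragma omp parallel for num_threads(threads)
    for (int k = 0; k < nx * ny; k++)
        f_new[k] = f_old[k] + 1.0f;
}

/* Calls the kernel itr times, swapping the two buffers after every call.
   Returns the buffer that holds the result of the last iteration. */
float *lbm_iterate(float *f_a, float *f_b, int nx, int ny,
                   int itr, int threads, lbm_step_fn step)
{
    float *cur = f_a, *next = f_b;
    assert(f_a && f_b && f_a != f_b && step && itr >= 0 && threads > 0);

    for (int it = 0; it < itr; it++) {
        step(cur, next, nx, ny, threads);
        float *tmp = cur;       /* the new layer becomes the current layer */
        cur = next;
        next = tmp;
    }
    return cur;
}
```

Swapping pointers rather than copying the field back after each call keeps the cost of an iteration limited to the kernel itself.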

The beneficial effects of the invention are as follows. Testing shows that the migration-collision and boundary treatment parts are the performance bottleneck of the LBM algorithm, and that the data of these parts are fully independent and therefore well suited to multithreaded parallel computation on the MIC, while the parameter initialization and result output, which are not time-consuming, remain on the CPU end; the CPU and MIC compute cooperatively. In testing, the overall performance improved by a factor of 54, so that one MIC compute node now delivers the computing performance of more than 54 of the original CPU cores. This not only meets the demands of fluid simulation but also greatly reduces power consumption and reduces machine-room construction costs and management, operation and maintenance costs. The method is also simple to implement and has a low development cost.

Description of drawings

Fig. 1 is the basic flowchart of a simulation solved with the LBM;

Fig. 2 is a diagram of the D2Q9 model;

Fig. 3 is a flowchart of an embodiment of the method of accelerating the LBM using MIC;

Fig. 4 is a schematic diagram of the migration process.

Detailed description of the embodiments

The method of the invention is explained in detail below with reference to the accompanying drawings.

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in detail below with reference to the drawings and an example. The method is divided into the following steps:

1) locate the performance bottleneck of the lattice Boltzmann method;

When the LBM is used for fluid simulation, the most time-consuming part of the computation is solving the discrete equation and the boundary treatment. This process takes up most of the time of the whole simulation, while the other parts take almost no time. The iterative process of solving the discrete equation and the boundary treatment is therefore the performance bottleneck of the LBM;

2) parallelism analysis;

Analysis of the serial algorithm for solving the discrete equation and the boundary treatment in the LBM shows that the migration, collision, macroscopic quantity statistics, equilibrium distribution function calculation and boundary treatment of each grid point are all data parallel;

3) parallel design of the discrete-equation solution and the boundary treatment;

a) The discrete equation can be solved by the migration-collision procedure. In the macroscopic quantity statistics, the equilibrium distribution function calculation and the collision process, the calculations of the individual grid points have no dependence on one another, so each thread on the MIC can be made responsible for one row of grid points of the grid, and the calculation of each row can be further accelerated with the vectorization capability of the MIC. The migration of the distribution function only involves the lattice points surrounding the given lattice point, and can likewise be realized by a single thread reading the relevant distribution functions from global memory;

b) In the LBM algorithm the boundary requires special treatment (non-equilibrium extrapolation, bounce-back). The calculations of the individual boundary lattice points also have no data dependence on one another, so OpenMP multithreading can be used for the calculation of the boundary lattice points;

c) OpenMP threading model design: the thread count of the kernel is set according to the number of MIC cores;

d) write the MIC kernel code for solving the discrete equation and the boundary treatment.
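By way of illustration only, the multithreaded boundary treatment of item b) can be sketched in C as follows. Of the two schemes named in the description, the sketch uses a full bounce-back, in which every population on a wall node is reflected into the opposite direction; the non-equilibrium extrapolation scheme is not shown. The storage layout `f[i * n + k]`, the direction ordering (rest, E, N, W, S, NE, NW, SW, SE) and the name `bounce_back` are assumptions made for this example.

```c
#include <assert.h>

#define Q 9  /* number of discrete velocities in D2Q9 */

/* Assumed ordering: rest, E, N, W, S, NE, NW, SW, SE.
   OPP[i] is the index of the direction opposite to i. */
static const int OPP[Q] = { 0, 3, 4, 1, 2, 7, 8, 5, 6 };

/* Full bounce-back on the wall nodes listed in wall[0..n_wall-1].
   f is stored direction by direction as f[i * n + k], n = number of nodes.
   The entries of wall[] must be distinct, so that the threads never touch
   the same node and the loop is free of data dependences. */
void bounce_back(float *f, int n, const int *wall, int n_wall, int threads)
{
    assert(f && wall && n > 0 && n_wall >= 0 && threads > 0);

    #pragma omp parallel for num_threads(threads)
    for (int b = 0; b < n_wall; b++) {
        int k = wall[b];
        for (int i = 1; i < Q; i++) {
            int o = OPP[i];
            if (i < o) {        /* swap each opposite pair exactly once */
                float tmp    = f[i * n + k];
                f[i * n + k] = f[o * n + k];
                f[o * n + k] = tmp;
            }
        }
    }
}
```

Because the populations of a node are only exchanged among themselves, the density on each wall node is unchanged by this treatment.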

Embodiment

The object of the invention is to accelerate the lattice Boltzmann method and improve its processing performance by making the CPU and MIC compute cooperatively, thereby meeting the demands of fluid simulation and reducing machine-room construction costs and management, operation and maintenance costs. In the method, the initialization is carried out at the CPU end, while the time-consuming and highly parallel parts, namely solving the discrete equation and the boundary treatment, are parallelized with OpenMP and executed in parallel at the MIC end. The CPU and MIC compute cooperatively, finally achieving acceleration of the lattice Boltzmann method. As shown in Fig. 3, the specific steps and implementation are as follows:

(1) according to the physical problem, specify the macroscopic parameters (density, velocity, viscosity coefficient, etc.) on the computational domain at the CPU end and pass them to the MIC end;

(2) define the data structures and storage layout at the MIC end, which store the equilibrium distribution functions of all directions of each lattice point and the macroscopic parameters such as velocity and density of each lattice point; from the macroscopic parameters passed by the CPU end, calculate the equilibrium distribution functions of all directions on all lattice points, which serve as the initial field of the computation;

(3) design the migration-collision kernel with thread count T=4*M, where M is the number of cores of the MIC card; each thread in the kernel computes the migration and collision of one row of grid points, and #pragma ivdep is used to vectorize the inner loop of the kernel, as shown in Fig. 4. The kernel pseudocode is as follows:

#pragma omp parallel for private(i, j, k, ...) num_threads(T)  // T is the thread count
for (i = 1; i < NY - 1; i++)
{
    #pragma ivdep  // vectorize the inner loop
    for (j = 1; j < NX - 1; j++)
    {
        k = i * NX + j;  // k is the index of the grid point

        /* migration process: the suffix 0 denotes the distribution function of the previous time layer */
        fr  = fr0[k];
        fe  = fe0[k - 1];
        fn  = fn0[k - NX];
        fw  = fw0[k + 1];
        fs  = fs0[k + NX];
        fne = fne0[k - NX - 1];
        fnw = fnw0[k - NX + 1];
        fsw = fsw0[k + NX + 1];
        fse = fse0[k + NX - 1];

        /* collision process */
        // compute the macroscopic quantities from the migrated distribution functions fr to fse;
        // compute the equilibrium distribution functions f1, f2, f3, f4, f5, f6, f7, f8 of all directions from the macroscopic quantities;
        // from f1, f2, f3, f4, f5, f6, f7, f8 and the migrated distribution functions fr, fe, fn, fw, fs, fne, fnw, fsw, fse,
        // compute the post-collision distribution functions fr1[k], fe1[k], fn1[k], fw1[k], fs1[k], fne1[k], fnw1[k], fsw1[k], fse1[k];
    }
}
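By way of illustration only, the kernel pseudocode can be completed into compilable C as follows. The sketch assumes lattice units, the BGK collision with a relaxation time `tau`, the standard D2Q9 weights, and a single array per time layer stored as `f[d * nx * ny + k]` in place of the nine separately named arrays; with the assumed ordering (rest, E, N, W, S, NE, NW, SW, SE) the pull offsets reproduce those of the pseudocode (for example `k - 1` for the east-moving population). The standard OpenMP directive `#pragma omp simd` is substituted for the Intel-specific `#pragma ivdep`. The name `stream_collide` is hypothetical.

```c
#include <assert.h>

#define Q 9  /* number of discrete velocities in D2Q9 */

/* Assumed ordering: rest, E, N, W, S, NE, NW, SW, SE (row index grows northward). */
static const int   EX[Q] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
static const int   EY[Q] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };
static const float WT[Q] = { 4.0f/9,
                             1.0f/9,  1.0f/9,  1.0f/9,  1.0f/9,
                             1.0f/36, 1.0f/36, 1.0f/36, 1.0f/36 };

/* One migration-collision step on the interior grid points.
   f0: previous time layer, f1: next time layer (distinct buffers).
   tau: relaxation time; threads: OpenMP thread count T.
   The density on every interior point is assumed to be positive. */
void stream_collide(const float *f0, float *f1, int nx, int ny,
                    float tau, int threads)
{
    const int n = nx * ny;
    assert(f0 && f1 && f0 != f1 && nx > 2 && ny > 2);
    assert(tau > 0.5f && threads > 0);

    #pragma omp parallel for num_threads(threads)
    for (int i = 1; i < ny - 1; i++) {          /* rows are shared among threads */
        #pragma omp simd
        for (int j = 1; j < nx - 1; j++) {
            const int k = i * nx + j;
            float f[Q], rho = 0.0f, mx = 0.0f, my = 0.0f;

            /* migration: pull each population from its upstream neighbour
               and accumulate the macroscopic quantities */
            for (int d = 0; d < Q; d++) {
                f[d] = f0[d * n + k - EY[d] * nx - EX[d]];
                rho += f[d];
                mx  += EX[d] * f[d];
                my  += EY[d] * f[d];
            }
            const float ux  = mx / rho, uy = my / rho;
            const float usq = ux * ux + uy * uy;

            /* collision: relax every population towards its equilibrium */
            for (int d = 0; d < Q; d++) {
                const float eu  = EX[d] * ux + EY[d] * uy;
                const float feq = WT[d] * rho *
                    (1.0f + 3.0f * eu + 4.5f * eu * eu - 1.5f * usq);
                f1[d * n + k] = f[d] - (f[d] - feq) / tau;
            }
        }
    }
}
```

A uniform field at rest is a fixed point of this step, which gives a simple sanity check: interior points initialised with the weights `WT[d]` keep those values after one call, and the boundary rows and columns are left untouched for the separate boundary treatment.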

(4) handle the boundary at the MIC end. The boundary treatment can use methods such as the bounce-back method or the non-equilibrium extrapolation method; T threads are likewise used for the calculation of the boundary nodes;

(5) judge whether the iteration has finished; if so, output the result, otherwise continue iterating;

(6) the MIC end computes in parallel the macroscopic parameters such as velocity, density and stream function from the distribution functions and passes the result to the CPU end; the CPU end outputs the result;

(7) performance test

A) test environment and test data

The test environment comprises the hardware environment, the software environment and the software run, where the software run comprises the serial LBM algorithm running on the CPU and the parallel LBM algorithm running on the MIC. Lid-driven cavity flow was chosen as the test case; the input comprises the grid size and some other input parameters. The specific test environment and test data are shown in Table 1;

Table 1

B) performance results

To ensure the stability of the measured performance results, the above job was run 10 times with data type float. The average time of the 10 runs of the CPU version of the LBM algorithm on a single CPU was 10324 seconds, and the average time of the 10 runs of the same job with the MIC version of the LBM algorithm on a single MIC was 191 seconds, so the MIC version is 10324/191 ≈ 54 times as fast as the CPU version.
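By way of illustration only, the parallel calculation of the stream function mentioned in step (6) can be sketched in C as follows. The description does not state how the stream function is obtained; the sketch assumes the usual definition in which the x-velocity is the y-derivative of the stream function, integrates each grid column upward from a zero value on the bottom row with the trapezoidal rule, and assigns the independent columns to OpenMP threads. The name `stream_function` and the parameter `dy` are hypothetical.

```c
#include <assert.h>

/* Stream function psi on an nx-by-ny grid from the x-velocity ux, obtained
   by integrating d(psi)/dy = ux upward from psi = 0 on the bottom row
   (trapezoidal rule, lattice spacing dy). Row i, column j -> index i*nx + j.
   Assumed definition: the patent text does not give a formula. */
void stream_function(const float *ux, float *psi, int nx, int ny,
                     float dy, int threads)
{
    assert(ux && psi && nx > 0 && ny > 0 && threads > 0);

    /* columns are independent, so each thread integrates its own columns */
    #pragma omp parallel for num_threads(threads)
    for (int j = 0; j < nx; j++) {
        psi[j] = 0.0f;
        for (int i = 1; i < ny; i++)
            psi[i * nx + j] = psi[(i - 1) * nx + j]
                + 0.5f * dy * (ux[(i - 1) * nx + j] + ux[i * nx + j]);
    }
}
```

The integration runs sequentially along a column, so the parallelism in this sketch is across columns rather than across individual grid points.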

Except for the technical features described in the specification, everything is known technology to those skilled in the art.

Claims

1. A method for rapidly achieving parallel acceleration of the lattice Boltzmann method using MIC, characterized in that it involves a CPU end and a MIC end, wherein:

The specific steps are as follows: