CN103279391A - Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing - Google Patents

Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing

Info

Publication number
CN103279391A
CN103279391A · CN2013102343891A · CN201310234389A
Authority
CN
China
Prior art keywords
thread
cpu
mic
load
load balancing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102343891A
Other languages
Chinese (zh)
Inventor
吴庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN2013102343891A priority Critical patent/CN103279391A/en
Publication of CN103279391A publication Critical patent/CN103279391A/en
Pending legal-status Critical Current

Landscapes

  • Multi Processors (AREA)

Abstract

The invention provides a load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) architecture processor cooperative computing. It covers load balancing among computing nodes, between the CPU and MIC computing devices within a node, and among the computing cores inside each CPU and MIC device. The method comprises task partitioning optimization, process/thread scheduling optimization, thread affinity optimization, and related techniques. It is applicable to software optimization for CPU and MIC cooperative computing, and guides software developers in performing load balancing optimization and modification of existing CPU+MIC cooperative software in a scientific, effective, and systematic way, maximizing the software's utilization of computing resources and markedly improving its computing efficiency and overall performance.

Description

A load balancing optimization method based on CPU and MIC architecture processor cooperative computing
Technical field
The present invention relates to the fields of high-performance computing and scientific computing, and in particular to a load balancing optimization method based on cooperative computing between CPU and MIC architecture processors.
Background technology
As traditional CPUs approach the limits of current semiconductor technology in power consumption and clock frequency, processor designers have turned to multi-core and even many-core architectures; against this background, Intel's MIC many-core architecture processor emerged. The software model of CPU and MIC architecture processor cooperative computing is gradually becoming a mainstream software architecture for high-performance computing.
In practice, however, simply porting application software to a heterogeneous platform composed of CPU and MIC architecture processors yields unsatisfactory performance gains. This is the principal challenge encountered in the heterogeneous cooperative computing model, which makes re-optimization of the software necessary.
Load balancing optimization is an important part of software performance optimization. "Load" refers to the distribution of work among multiple tasks, and "load balancing" means distributing that work evenly. In parallel computing, load balancing assigns tasks evenly across the available computing resources, so that the parallel system exercises its full computing power with no resource idle or waiting and none overloaded. A good parallel method realizes this balancing effect; load imbalance causes reduced computing efficiency and poor scalability. Achieving load balance is therefore a key concern in parallel computing, especially for the MIC architecture processor: with its large number of cores, the effect of good load balancing on performance is even more pronounced.
To improve the performance of software on heterogeneous systems composed of CPU and MIC architecture processors, the software must be optimized in a targeted way. This method focuses on load balancing optimization for CPU and MIC cooperative computing, maximizing the utilization of computing resources and improving software efficiency.
Summary of the invention
The objective of the invention is to improve the performance of software on heterogeneous platforms composed of CPU and MIC architecture processors, for which the software must be optimized in a targeted way, by providing a CPU and MIC cooperative load balancing optimization that maximizes computing resource utilization and improves software efficiency.
The objective is realized in the following manner, comprising the levels of load balancing optimization of CPU and MIC cooperative computing software and the load balancing optimization methods for such software, wherein:
(1) Levels of load balancing optimization of CPU and MIC cooperative computing software; the optimization methods for each of the three load balancing levels are described below:
A) load balancing among the computing nodes: the calculation tasks are distributed approximately evenly among the computing nodes, so that no node's share of the computation is too large or too small; each computing node here is a hybrid computing node composed of traditional CPUs and MIC architecture processors, and the nodes compute in parallel using MPI; inter-node load balancing is the same as on a traditional pure-CPU cluster, using static and dynamic load balancing methods;
B) load balancing between the CPU and the MIC computing devices within a node: the calculation tasks are divided approximately evenly between the CPU and the MIC devices; because the computing power of CPU and MIC processors differs, the amounts of work assigned to them cannot be identical, and the best approach is the dynamic load balancing method;
C) load balancing among the computing cores inside each CPU and MIC device: the calculation tasks are distributed evenly across the individual computing cores of each CPU and MIC device;
There are two schemes for achieving load balance: static load balancing and dynamic load balancing, wherein:
A) Static load balancing
Static load balancing requires manually dividing the work into multiple parallel parts and ensuring that each divided part of the workload is distributed evenly across processes, threads, and ultimately processors; that is, the workload is distributed evenly among the tasks, so that the speedup of the parallel program is maximized;
B) Dynamic load balancing
Dynamic load balancing assigns calculation tasks dynamically during program execution to achieve load balance; many situations in software practice cannot be solved by static load balancing, chiefly because the calculation tasks are random or arrive in real time and cannot be predicted in advance; the overall system performance of dynamic load balancing is better than that of static load balancing, but the code is more complex to implement;
The load balancing optimization methods for CPU and MIC cooperative computing software are:
[1] task partitioning: for applications partitioned by task, load balancing between the CPU and the MIC uses the dynamic load balancing method; suppose there are N tasks and a node contains two MIC cards, i.e., three devices, 1 CPU and 2 MICs; in the dynamic method each device first takes one task and computes it, and immediately takes the next task on completion without waiting for the other devices, until all N tasks are finished; only a single host process needs to be set up, responsible for assigning tasks to each compute process;
Task partitioning optimization refers to optimizing the distribution of calculation tasks under a data-parallel or task-parallel pattern, so that the calculation task of each parallel processing unit is balanced;
[2] data partitioning: because the memory space on a device must be allocated in a single pass, applications partitioned by data cannot use dynamic load balancing and must use a static data partitioning method; static partitioning makes load balancing between heterogeneous devices difficult, and sometimes impossible; for iterative applications, a learning-based data partitioning method is used, in which the CPU and the MIC each perform one trial iteration of the same amount of computation, the ratio of their computing power is derived from the respective run times, and the data are then partitioned accordingly;
[3] process/thread scheduling optimization: processes and threads are the software-level execution units of parallel processing and the carriers of parallel computation, and their scheduling directly affects the overall performance of the software;
C) Load balancing among the computing cores inside each CPU and MIC device
Load balancing among the cores inside a CPU or MIC device, also called intra-device load balancing, adopts the three scheduling policies of OpenMP:
(a) schedule(static [, chunk]): static scheduling; each thread obtains chunk iterations at a time, assigned in round-robin fashion; if chunk is not given, the iterations are divided evenly; this is the default policy;
(b) schedule(dynamic [, chunk]): dynamic scheduling; iterations are assigned to threads dynamically; with the chunk parameter, each assignment gives a thread chunk iterations, and without it the iterations are assigned to the threads one at a time;
(c) schedule(guided [, chunk]): guided scheduling, a heuristic self-adapting scheduling method; each thread initially receives a large block of iterations, and subsequent blocks shrink, decreasing exponentially down to the specified chunk size; if chunk is not specified, the block size decreases to a minimum of 1;
D) Thread affinity optimization
The essence of load balancing is that the task loads assigned to the processes/threads should be roughly equal, realizing the principle that "the more capable should do more work", so that every thread is kept busy as far as possible and all threads finish together; but a thread does not exist in a vacuum: it must ultimately be assigned to a computing core of a processor for execution, and the process of assigning threads to cores is thread scheduling; scheduling the computational threads evenly across the processor cores, so that the working load of each core is comparable, achieves load balance among the computing cores;
Process/thread affinity is the property of being able to restrict a process or thread to run on a subset of the available CPUs; put simply, CPU affinity is the tendency of a process to run on a given CPU for as long as possible without being migrated to other processors, and to some extent it exposes the process/thread scheduling policy of a multiprocessor system to the systems programmer;
Soft affinity means that the process does not migrate frequently between processors; hard affinity means that the process must run on the specified processors; affinity settings allow developers to program hard CPU affinity, meaning that an application can explicitly specify on which processor or processors a process runs;
In the Linux kernel every process/thread has an associated data structure called task_struct; this structure is very important, and the field most relevant to affinity is the cpus_allowed bitmask, which consists of n bits in one-to-one correspondence with the n logical processors in the system; if a given bit is set for a given process, the process may run on the corresponding CPU; therefore, if a process may run on any processor core and migrate between cores as needed, the bitmask is all ones, which is in fact the default setting for Linux processes;
Setting process/thread affinity simply means setting the processor-core bitmask in the process/thread data structure;
(2) Design of the thread affinity load balancing optimization method
(a) threads see only logical cores, so one physical core with Hyper-Threading enabled counts as 2 computing cores;
(b) the tension between cache utilization and physical-core load balance must be resolved;
(c) this optimization method applies when not all physical cores can be used, or when task locality is clearly defined;
Depending on the affinity setting, threads are assigned to different logical cores; if threads share data and their logical cores belong to the same physical core, the threads can exploit the cache of that physical core and run faster; however, that core then bears too much calculation while the other cores sit relatively "idle", so computing resource utilization is not maximized and the overall performance of the program is not optimal; thread affinity on the MIC processor has 3 modes: scatter, compact, and balanced (the last exclusive to MIC):
Scatter mode
Scatter mode preferentially assigns threads to the most lightly loaded physical cores; this achieves good load balance, but adjacent threads do not share a physical core, so if adjacent threads share data the cache cannot accelerate them; on MIC a core can read the L2 cache of another core, but doing so reduces the total effective cache size, and reading a remote core's cache is less efficient than reading the local cache;
Compact mode
Compact mode assigns threads to the logical cores in order, keeping adjacent threads on the same physical core as far as possible; if adjacent threads share data, this improves cache utilization; but when the loads of adjacent threads are comparable and the total number of threads is small, the task load becomes highly concentrated and the imbalance is severe; nevertheless, for certain special task distributions with clear regularity, for example when odd-numbered threads are lightly loaded and even-numbered threads are heavily loaded, this affinity mode can on the contrary achieve good load balance;
Balanced mode
Balanced mode is specific to MIC; like scatter mode it assigns threads to the most lightly loaded physical cores as far as possible, but unlike scatter mode it also tries, while balancing, to place adjacent threads on the same physical core, striking a compromise between load balance and cache utilization;
All three modes are static assignments; the appropriate thread allocation mode must be chosen according to the actual load of the program to obtain good results.
The beneficial effects of the invention are as follows: the method applies widely to CPU and MIC cooperative computing scenarios on CPU+MIC hybrid architecture platforms; it guides software developers in effective load balancing optimization of existing software, optimizes the software's utilization of system resources, and markedly improves hardware resource utilization and the computing efficiency of the software, thereby greatly improving overall software performance.
Description of drawings
Fig. 1 is a schematic diagram of the CPU and MIC cooperative computing working model;
Fig. 2 is a schematic diagram of the micro-architecture of the MIC many-core processor with its numerous computing cores (Core);
Fig. 3 is a schematic diagram of the CPU/MIC cooperative computing load balancing hierarchy;
Fig. 4 is a schematic diagram of the parallel hierarchy of multi-node CPU+MIC cooperative computing;
Fig. 5 is a schematic diagram of the experimental results of multi-node CPU+MIC cooperative computing;
Fig. 6 is a schematic diagram of the speedup of multi-node CPU+MIC cooperative computing relative to serial execution;
Fig. 7 is a schematic diagram of the speedup of multi-node CPU+MIC cooperative computing relative to single-node OpenMP multithreading.
Embodiment
The method of the present invention is explained below with reference to the accompanying drawings.
The main levels of load balancing optimization for CPU and MIC cooperative computing software are:
1) load balancing among the computing nodes;
2) load balancing between the CPU and MIC computing devices within a node;
3) load balancing among the computing cores inside each CPU and MIC device.
The main load balancing optimization methods for CPU and MIC cooperative computing software are:
1) Task partitioning optimization
Parallel computing means that multiple calculation tasks are assigned to different processes/threads for parallel processing. The size of each process's or thread's calculation task, its "working load", therefore directly determines the execution time of that process/thread;
2) Process/thread scheduling optimization
Processes/threads are ultimately assigned to processor computing cores for execution, and this assignment is called process/thread scheduling. If the processes/threads executed by the cores are unbalanced, the computational load is unevenly distributed: some cores are overloaded while others are too "idle".
Embodiment
The object of the present invention is to provide a load balancing optimization method for CPU and MIC architecture processor cooperative computing.
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and embodiments.
First, the CPU and MIC cooperative computing working model is briefly introduced, as shown in Fig. 1:
The MIC many-core processor has numerous computing cores (Core); its micro-architecture is shown schematically in Fig. 2:
The MIC architecture processor is generally used as a coprocessor working together with a traditional CPU. The MIC card communicates with the CPU over the PCI-E bus. The MIC coprocessor card runs a micro operating system based on the Linux kernel, called uOS, which performs process/thread scheduling on the MIC. MIC supports multiple languages such as C/C++/Fortran and parallel libraries such as MPI/OpenMP/pthreads, and Intel's software tool chain provides comprehensive support for the MIC architecture.
All examples below are set forth in terms of the Intel compiler, OpenMP, and C-language parallelism.
The implementation steps of the CPU and MIC cooperative load balancing optimization method are detailed as follows:
1) Overview of load balancing
In general, there are two schemes for achieving load balance: static load balancing and dynamic load balancing.
A) Static load balancing
Static load balancing requires manually dividing the work into multiple parallel parts and ensuring that each divided part (workload) is distributed evenly across processes, threads, and ultimately processors; that is, the workload is distributed evenly among the tasks, so that the speedup of the parallel program is maximized;
B) Dynamic load balancing
Dynamic load balancing assigns calculation tasks dynamically during program execution to achieve load balance. Many situations in software practice cannot be solved by static load balancing, chiefly because the calculation tasks are random or arrive in real time and cannot be predicted in advance. In general, the overall system performance of dynamic load balancing is better than that of static load balancing, but the code is more complex to implement.
2) Load balancing optimization methods for CPU and MIC cooperative computing software
The main levels of load balancing optimization for CPU and MIC cooperative computing software are:
A) load balancing among the computing nodes;
B) load balancing between the CPU and MIC computing devices within a node;
C) load balancing among the computing cores inside each CPU and MIC device.
Fig. 3 shows the CPU/MIC cooperative computing load balancing hierarchy (the cooperative working model itself is shown in Fig. 1).
The load balancing optimization methods are described below for each of the three levels:
A) Load balancing among the computing nodes
Each computing node here is a hybrid computing node composed of traditional CPUs and MIC architecture processors, and the nodes compute in parallel using MPI. Inter-node load balancing is the same as on a traditional pure-CPU cluster, and both static and dynamic load balancing methods can be used.
B) Load balancing between the CPU and MIC computing devices within a node
Because the computing power of CPU and MIC processors differs, the amounts of work assigned to the CPU and the MIC cannot be identical, and the best approach to balancing load between them is the dynamic load balancing method. The inter-device balancing methods to be adopted under task partitioning and under data partitioning are described below;
(1) Task partitioning: for applications partitioned by task, load balancing between the CPU and the MIC can use the dynamic load balancing method. For example, with N tasks and two MIC cards in a node, i.e., three devices (1 CPU and 2 MICs), each device first takes one task and computes it, then immediately takes the next task on completion without waiting for the other devices, until all N tasks are finished. Only a single host process needs to be set up, responsible for assigning tasks to each compute process;
(2) Data partitioning: because the memory space on a device must be allocated in a single pass, applications partitioned by data cannot use dynamic load balancing and must use a static data partitioning method. Static partitioning makes load balancing between heterogeneous devices difficult, and sometimes impossible. For iterative applications a learning-based data partitioning method can be used: let the CPU and the MIC each perform one trial iteration of the same amount of computation, derive the ratio of their computing power from the respective run times, and then partition the data accordingly.
C) Load balancing among the computing cores inside each CPU and MIC device
Load balancing among the cores inside a CPU or MIC device, also called intra-device load balancing, can use the three scheduling policies of OpenMP:
(1) schedule(static [, chunk]): static scheduling; each thread obtains chunk iterations at a time, assigned in round-robin fashion. If chunk is not given, the iterations are divided evenly; this is the default policy;
(2) schedule(dynamic [, chunk]): dynamic scheduling; iterations are assigned to threads dynamically. With the chunk parameter, each assignment gives a thread chunk iterations; without it, the iterations are assigned to the threads one at a time;
(3) schedule(guided [, chunk]): guided scheduling, a heuristic self-adapting scheduling method. Each thread initially receives a large block of iterations, and subsequent blocks shrink, decreasing exponentially down to the specified chunk size; if chunk is not specified, the block size decreases to a minimum of 1.
The applicable range of each OpenMP scheduling policy is shown in the following table.

  Scheduling policy | Applicable range
  static            | fixed amount of work, identical work in every iteration
  dynamic           | amount of work not fixed, work differs between iterations
  guided            | a special case of dynamic scheduling; reduces scheduling overhead
D) Thread affinity optimization
The essence of load balancing is that the task loads assigned to the processes/threads should be roughly equal, realizing the principle that "the more capable should do more work", so that every thread is kept busy as far as possible and all threads finish together. But a thread does not exist in a vacuum: it must ultimately be assigned to a computing core of a processor for execution, and the process of assigning threads to cores is thread scheduling. Scheduling the computational threads evenly across the processor cores, so that the working load of each core is comparable, achieves load balance among the computing cores.
Process/thread affinity is the property of being able to restrict a process or thread to run on a subset of the available CPUs. Put simply, CPU affinity is the tendency of a process to run on a given CPU for as long as possible without being migrated to other processors, and to some extent it exposes the process/thread scheduling policy of a multiprocessor system to the systems programmer.
Soft affinity means that the process does not migrate frequently between processors; hard affinity means that the process must run on the specified processors. Affinity settings allow developers to program hard CPU affinity, meaning that an application can explicitly specify on which processor or processors a process runs.
In the Linux kernel every process/thread has an associated data structure called task_struct. This structure is very important, and the field most relevant to affinity is the cpus_allowed bitmask, which consists of n bits in one-to-one correspondence with the n logical processors in the system. If a given bit is set for a given process, the process may run on the corresponding CPU. Therefore, if a process may run on any processor core and migrate between cores as needed, the bitmask is all ones; this is in fact the default setting for Linux processes.
Setting process/thread affinity simply means setting the processor-core bitmask in the process/thread data structure; for the detailed setting procedure, please consult the relevant manuals.
The load balancing optimization method of setting thread affinity involves several issues:
(1) threads see only logical cores, so one physical core with Hyper-Threading enabled also counts as 2 computing cores;
(2) the tension between cache utilization and physical-core load balance must be resolved;
(3) this optimization method generally applies when not all physical cores can be used, or when task locality is clearly defined.
Depending on the affinity setting, threads are assigned to different logical cores. If threads share data and their logical cores belong to the same physical core, the threads can exploit the cache of that physical core and run faster; however, that core then bears too much calculation while the other cores sit relatively "idle", so computing resource utilization is not maximized and the overall performance of the program is not optimal. Thread affinity on the MIC processor has 3 modes: scatter, compact, and balanced (exclusive to MIC):
(1) Scatter mode
Scatter mode preferentially assigns threads to the most lightly loaded physical cores. This achieves good load balance, but adjacent threads do not share a physical core, so if adjacent threads share data the cache cannot accelerate them. On MIC a core can read the L2 cache of another core, but doing so reduces the total effective cache size, and reading a remote core's cache is less efficient than reading the local cache;
(2) Compact mode
Compact mode assigns threads to the logical cores in order, keeping adjacent threads on the same physical core as far as possible; if adjacent threads share data, this improves cache utilization. But when the loads of adjacent threads are comparable and the total number of threads is small, the task load becomes highly concentrated and the imbalance is severe. Nevertheless, for certain special task distributions with clear regularity, for example when odd-numbered threads are lightly loaded and even-numbered threads are heavily loaded, this affinity mode can on the contrary achieve good load balance;
(3) Balanced mode
Balanced mode is specific to MIC. Like scatter mode, it assigns threads to the most lightly loaded physical cores as far as possible, but unlike scatter mode it also tries, while balancing, to place adjacent threads on the same physical core. This mode strikes a compromise between load balance and cache utilization.
All three modes are static assignments; the appropriate thread allocation mode must be chosen according to the actual load of the program to obtain good results.
Performance test and analysis
The above methods were applied to a high-performance computing case: large eddy simulation based on the lattice Boltzmann algorithm.
1) Brief introduction to the module
The lattice Boltzmann method (LBM) has developed into an effective numerical simulation method over the past 20 years. It is a mesoscopic method, lying between microscopic molecular-dynamics methods and macroscopic methods based on the continuum hypothesis. Unlike traditional fluid simulation methods, it is grounded in molecular kinetic theory: macroscopic average properties are obtained by tracking the transport of particle distribution functions and then solving for those distribution functions. The kinetic-theory character of the lattice Boltzmann method makes it especially effective for simulating complex flows such as porous-media flow, suspension flow, multiphase flow, and multicomponent flow. The lattice Boltzmann method is inherently parallel, and offers advantages such as simple boundary treatment and easy implementation.
In computational fluid dynamics, LBM is a modeling and computation method distinct from traditional numerical methods; it is a special discrete scheme for solving the Boltzmann equation. The solution process is time-marching and exhibits good locality, so it is particularly well suited to parallel solution.
Large eddy simulation (LES) is an important numerical simulation method in fluid mechanics that has matured over recent decades. It differs from direct numerical simulation (DNS) and Reynolds-averaged (RANS) methods. Its basic idea is to resolve accurately all turbulent motions above a certain scale, capturing the many unsteady, non-equilibrium large-scale effects and coherent structures that RANS methods cannot, while avoiding the enormous computational cost that DNS incurs by resolving all turbulence scales. It is therefore regarded as a promising direction for turbulence numerical simulation.
The lattice Boltzmann method can be used to solve large eddy simulations; combining the two yields lattice Boltzmann large eddy simulation (LBM-LES), which has been widely applied in fluid mechanics.
2) Multi-node CPU+MIC architecture processor cooperative computing model
In this case, the original program was a single-node, single-threaded serial CPU program. The module was ported to a mixed-architecture multi-node cluster based on CPU+MIC architecture processors, realizing parallel processing in the CPU+MIC cooperative computing mode; its parallel hierarchy is shown in Fig. 4:
Fig. 4 Parallel hierarchy of multi-node CPU+MIC cooperative computing.
During software development, the load balancing optimization methods for the CPU+MIC architecture processor cooperative computing mode provided by this patent were applied, greatly improving CPU and MIC compute-resource utilization and significantly improving the performance of the module.
3) test environment
4) Analysis of the performance test results
The experimental results of the LES parallel algorithm under multi-node CPU+MIC cooperative computing are shown in Fig. 5. On a 2-node CPU+MIC cooperative computing platform, with each node configured with two CPUs and 2 MIC cards, the LES parallel algorithm reaches 1333.43 MLUPS (at a grid size of 8192*8192).
Fig. 5 Multi-node CPU+MIC cooperative computing experimental results
The speedup of the multi-node parallel algorithm relative to serial CPU execution is shown in Fig. 6; the peak speedup on 2 nodes reaches 156.87 times serial.
Fig. 6 Speedup of multi-node CPU+MIC cooperative computing relative to serial
The speedup of the multi-node parallel algorithm relative to single-node OpenMP multithreading is shown in Fig. 7. As the figure shows, a single MIC card is 1.77 times two CPUs; a single node with two CPUs + 2 MICs is 3.4 times two CPUs (adding 2 MIC cards to a two-CPU node brought a 2.4-fold performance increase); and 2 nodes (two CPUs + 2 MICs per node) are 6.71 times a single two-CPU node.
Fig. 7 Speedup of multi-node CPU+MIC cooperative computing relative to single-node OpenMP multithreading.
Technical features not described in this specification are known to those skilled in the art.

Claims (1)

1. A load balancing optimization method based on CPU and MIC architecture processor cooperative computing, characterized by comprising the levels of load balancing optimization for software in the CPU and MIC architecture processor cooperative computing mode and the load balancing optimization methods for such software, wherein:
(1) The levels of load balancing optimization for software in the CPU and MIC architecture processor cooperative computing mode; the load balancing optimization methods are set out below for each of three load balancing levels:
A) Load balancing between computing nodes: the computing tasks distributed to the computing nodes are roughly balanced, so that no node's workload is excessively large or small; each computing node here is a mixed-architecture computing node composed of a traditional CPU and a MIC architecture processor; the nodes compute in parallel via MPI, and inter-node load balancing is the same as on a traditional pure-CPU cluster, using static and dynamic load balancing methods;
B) Load balancing between the CPU and MIC computing devices within a computing node: the computing tasks are roughly balanced between the CPU and MIC devices; because the computing power of the CPU and MIC processors differs, the amounts of computation distributed to the CPU and the MIC cannot be identical, and the best approach to load balancing between CPU and MIC is the dynamic load balancing method;
C) Load balancing among the compute cores within the CPU and MIC computing devices: the computing tasks distributed among the compute cores of the CPU and MIC devices are balanced;
Load balancing is realized by two schemes, static load balancing and dynamic load balancing, wherein:
a) Static load balancing
Static load balancing requires the work region to be manually divided into multiple parallel parts, ensuring that the divided workloads can be evenly distributed across the processes, threads, and even processors on which they run; that is, the workload is distributed evenly among the tasks so that the speedup of the parallel program is maximized;
b) Dynamic load balancing
Dynamic load balancing assigns computing tasks dynamically at run time to achieve load balance; many situations in software practice cannot be solved by static load balancing, chiefly because the computing tasks are random and real-time and cannot be predicted in advance; dynamic load balancing yields better overall system performance than static load balancing, but the code is more complex to implement;
The load balancing optimization methods for software in the CPU and MIC architecture processor cooperative computing mode are:
[1] Task division: for task-divided applications, load balancing between CPU and MIC uses the dynamic load balancing optimization method; suppose there are N tasks and a node contains two MIC cards, i.e. three devices, 1 CPU and 2 MICs; in the dynamic load balancing method each device first obtains one task and computes it, and immediately obtains the next task after finishing, without waiting for the other devices, until all N tasks have been computed; this mode only requires one host process, responsible for distributing tasks to each computing process;
Task division optimization: the distribution of computing tasks under the data-parallel or task-parallel pattern is optimized, so that the computing tasks of the parallel processing units are balanced;
[2] Data division: because memory must be allocated on a device in a single pass, data-divided applications cannot use dynamic load balancing and must use a static data division method; static data division makes load balancing between heterogeneous devices difficult and sometimes impossible; for iterative applications, a learning-based data division method is used, comprising having the CPU and the MIC each compute one iteration of an identical amount of computation, computing the CPU-to-MIC computing-power ratio from their respective run times, and then dividing the data accordingly;
[3] Process/thread scheduling optimization: processes and threads are the software-level execution units of parallel processing and the carriers of parallel computation; the optimized scheduling of processes and threads bears directly on the overall performance of the software;
C) Load balancing among the compute cores within the CPU and MIC computing devices
Load balancing among the compute cores within the CPU and MIC computing devices, also called intra-device load balancing, uses the three loop-scheduling strategies provided by OpenMP:
(a) schedule(static[, chunk]): static scheduling; each thread obtains chunk iterations at a time, assigned in round-robin fashion; if chunk is not given, the iterations are divided evenly; this is the default scheduling mode;
(b) schedule(dynamic[, chunk]): dynamic scheduling; iterations are assigned to threads dynamically; with the chunk parameter, each thread receives the specified chunk iterations at a time; without it, iterations are assigned to threads one at a time;
(c) schedule(guided[, chunk]): guided scheduling, a heuristic self-adapting scheduling method; each thread is initially assigned a larger block of iterations, and the blocks assigned thereafter shrink, decreasing exponentially down to the specified chunk size; if the chunk parameter is not specified, the block size decreases to a minimum of 1;
D) Thread affinity optimization
The essence of load balancing is that the task loads assigned to the processes/threads are relatively balanced, realizing "the capable do more work": each thread is kept busy as far as possible and all threads finish together; but threads do not exist in a vacuum; they must ultimately be assigned to the compute cores of a processor for execution; assigning threads to the processing cores is thread scheduling, and scheduling the computational threads evenly across the processor cores, so that the workload of each processing core is comparable, realizes load balancing among the compute cores;
Process and thread affinity refers to the ability to restrict a process or thread to run on a subset of the available CPUs; simply put, CPU affinity is the tendency of a process to run on a given CPU for as long as possible without being migrated to another processor; to a certain extent it exposes the process/thread scheduling policy on multiprocessor systems to the systems programmer;
Soft affinity means that a process does not migrate frequently between processors; hard affinity means that a process must run on the processor(s) designated for it; setting process/thread affinity allows developers to program hard CPU affinity, meaning the application explicitly specifies on which processor or processors a process runs;
In the Linux kernel, every process/thread has an associated data structure called task_struct; the field of this structure most relevant to affinity is the cpus_allowed bitmask, which consists of n bits corresponding one-to-one with the n logical processors in the system; if a given bit is set for a given process, the process may run on the corresponding CPU; therefore, if a process may run on any processor core and migrate between cores as needed, the bitmask is all 1s, which is in fact the default setting for processes in Linux;
Setting process/thread affinity in practice means setting the processor-core bitmask in the process/thread data structure;
(2) Design of the load balancing optimization method for thread affinity
(a) Threads recognize only logical cores, so one physical core with Hyper-Threading enabled counts as two compute cores;
(b) The contradiction between cache utilization and physical-core load balancing must be resolved;
(c) This optimization method applies to cases where not all physical cores can be used, or where task locality is clear;
Depending on the affinity setting, threads are assigned to different logical cores; if there are data dependencies between threads and their logical cores reside on the same physical core, the threads can improve running speed by using the cache of that shared physical core; however, that core then bears too much of the computation while the other cores sit relatively idle, so compute-resource utilization is not maximized and the overall performance of the program is not optimal; thread affinity on the MIC processor has three modes, scatter, compact, and the MIC-specific balanced:
scatter mode
Scatter mode assigns threads preferentially to the most lightly loaded physical cores; this achieves good load balancing, but because adjacent threads are not on the same physical core, data sharing between adjacent threads cannot benefit from cache reuse; on MIC, although a core can read another core's L2 cache, doing so reduces the total effective cache capacity, and reading another core's cache is less efficient than reading the local cache;
compact mode
Compact mode assigns threads to logical cores in order, which keeps adjacent threads on the same physical core as far as possible; if adjacent threads share data, this improves cache utilization; however, if adjacent threads have comparable loads and the total thread count is small, tasks become highly concentrated and the load severely imbalanced; for certain special task distributions with clear regularity, comprising lightly loaded odd-numbered threads and heavily loaded even-numbered threads, this affinity mode can instead achieve good load balancing;
balanced mode
Balanced mode is unique to MIC; like scatter mode, it assigns threads to the most lightly loaded physical cores, but unlike scatter, while keeping the load balanced it also places adjacent threads on the same physical core as far as possible; this mode strikes a balance between load balancing and cache utilization;
All three of the above modes are static assignments, and a suitable thread placement must be chosen according to the actual load of the program to achieve good results.
CN2013102343891A 2013-06-09 2013-06-09 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing Pending CN103279391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102343891A CN103279391A (en) 2013-06-09 2013-06-09 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing


Publications (1)

Publication Number Publication Date
CN103279391A true CN103279391A (en) 2013-09-04

Family

ID=49061924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102343891A Pending CN103279391A (en) 2013-06-09 2013-06-09 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing

Country Status (1)

Country Link
CN (1) CN103279391A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123190A (en) * 2014-07-23 2014-10-29 浪潮(北京)电子信息产业有限公司 Load balance method and device of heterogeneous cluster system
CN104156271A (en) * 2014-08-01 2014-11-19 浪潮(北京)电子信息产业有限公司 Method and system for balancing cooperative computing cluster load
CN104266657A (en) * 2014-09-12 2015-01-07 海华电子企业(中国)有限公司 Shortest path planning parallelization method based on cooperative computing of CPU and MIC
CN104461728A (en) * 2013-09-18 2015-03-25 Sap欧洲公司 Migration event dispatching management
CN104679593A (en) * 2015-03-13 2015-06-03 浪潮集团有限公司 Task scheduling optimization method based on SMP system
CN105260241A (en) * 2015-10-23 2016-01-20 南京理工大学 Mutual cooperation method for processes in cluster system
CN105468455A (en) * 2015-11-23 2016-04-06 天脉聚源(北京)传媒科技有限公司 Dynamic task distribution method and apparatus for multiple devices
CN105893151A (en) * 2016-04-01 2016-08-24 浪潮电子信息产业股份有限公司 High-dimensional data stream processing method based on CPU + MIC heterogeneous platform
CN105955825A (en) * 2016-05-09 2016-09-21 深圳大学 Method for optimizing astronomy software gridding
CN106383961A (en) * 2016-09-29 2017-02-08 中国南方电网有限责任公司电网技术研究中心 Optimization processing method for large eddy simulation algorithm under CPU+MIC heterogeneous platform
CN106650315A (en) * 2016-11-30 2017-05-10 郑州云海信息技术有限公司 SIFT parallel algorithm based on CPU+MIC heterogeneous platform
WO2017206591A1 (en) * 2016-06-01 2017-12-07 华为技术有限公司 Data processing system and data processing method
CN104794194B (en) * 2015-04-17 2018-10-26 同济大学 A kind of distributed heterogeneous concurrent computational system towards large scale multimedia retrieval
CN108958924A (en) * 2017-03-27 2018-12-07 爱思开海力士有限公司 Storage system and its operating method with delay distribution optimization
CN109240866A (en) * 2018-09-10 2019-01-18 郑州云海信息技术有限公司 A kind of Performance tuning method based on server performance test
WO2019072179A1 (en) * 2017-10-11 2019-04-18 Oppo广东移动通信有限公司 Application running control method and apparatus
CN111130936A (en) * 2019-12-24 2020-05-08 杭州迪普科技股份有限公司 Method and device for testing load balancing algorithm
CN113296972A (en) * 2020-07-20 2021-08-24 阿里巴巴集团控股有限公司 Information registration method, computing device and storage medium
CN113850032A (en) * 2021-12-02 2021-12-28 中国空气动力研究与发展中心计算空气动力研究所 Load balancing method in numerical simulation calculation
CN115718665A (en) * 2023-01-10 2023-02-28 北京卡普拉科技有限公司 Asynchronous I/O thread processor resource scheduling control method, device, medium and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100095303A1 (en) * 2008-10-09 2010-04-15 International Business Machines Corporation Balancing A Data Processing Load Among A Plurality Of Compute Nodes In A Parallel Computer
US20100257538A1 (en) * 2009-04-03 2010-10-07 Microsoft Corporation Parallel programming and execution systems and techniques
CN101976201A (en) * 2010-10-22 2011-02-16 北京航空航天大学 CPU affinity-based virtual CPU dynamic binding method
US8180973B1 (en) * 2009-12-23 2012-05-15 Emc Corporation Servicing interrupts and scheduling code thread execution in a multi-CPU network file server
CN102855218A (en) * 2012-05-14 2013-01-02 中兴通讯股份有限公司 Data processing system, method and device
CN102929723A (en) * 2012-11-06 2013-02-13 无锡江南计算技术研究所 Method for dividing parallel program segment based on heterogeneous multi-core processor


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ELI DOW: "Managing processor affinity", 《HTTP://WWW.IBM.COM/DEVELOPERWORKS/CN/LINUX/L-AFFINITY.HTML》 *
Wang Endong et al.: "MIC High-Performance Computing Programming Guide", 30 November 2012, China Water & Power Press *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461728A (en) * 2013-09-18 2015-03-25 Sap欧洲公司 Migration event dispatching management
CN104461728B (en) * 2013-09-18 2019-06-14 Sap欧洲公司 Computer system, medium and the method for migration event management and running
CN104123190B (en) * 2014-07-23 2017-09-19 浪潮(北京)电子信息产业有限公司 The load-balancing method and device of Heterogeneous Cluster Environment
CN104123190A (en) * 2014-07-23 2014-10-29 浪潮(北京)电子信息产业有限公司 Load balance method and device of heterogeneous cluster system
CN104156271A (en) * 2014-08-01 2014-11-19 浪潮(北京)电子信息产业有限公司 Method and system for balancing cooperative computing cluster load
CN104156271B (en) * 2014-08-01 2017-12-08 浪潮(北京)电子信息产业有限公司 A kind of method and system of cooperated computing cluster load balance
CN104266657A (en) * 2014-09-12 2015-01-07 海华电子企业(中国)有限公司 Shortest path planning parallelization method based on cooperative computing of CPU and MIC
CN104266657B (en) * 2014-09-12 2017-08-04 海华电子企业(中国)有限公司 Shortest path planning parallel method based on CPU and MIC cooperated computings
CN104679593B (en) * 2015-03-13 2017-12-01 浪潮集团有限公司 Task scheduling optimization method based on SMP system
CN104679593A (en) * 2015-03-13 2015-06-03 浪潮集团有限公司 Task scheduling optimization method based on SMP system
CN104794194B (en) * 2015-04-17 2018-10-26 同济大学 A kind of distributed heterogeneous concurrent computational system towards large scale multimedia retrieval
CN105260241A (en) * 2015-10-23 2016-01-20 南京理工大学 Mutual cooperation method for processes in cluster system
CN105260241B (en) * 2015-10-23 2019-04-16 南京理工大学 The co-operating method of process in group system
CN105468455A (en) * 2015-11-23 2016-04-06 天脉聚源(北京)传媒科技有限公司 Dynamic task distribution method and apparatus for multiple devices
CN105468455B (en) * 2015-11-23 2018-12-21 天脉聚源(北京)传媒科技有限公司 A kind of method and device of the dynamic task allocation for more equipment
CN105893151B (en) * 2016-04-01 2019-03-08 浪潮电子信息产业股份有限公司 High-dimensional data stream processing method based on CPU + MIC heterogeneous platform
CN105893151A (en) * 2016-04-01 2016-08-24 浪潮电子信息产业股份有限公司 High-dimensional data stream processing method based on CPU + MIC heterogeneous platform
CN105955825A (en) * 2016-05-09 2016-09-21 深圳大学 Method for optimizing astronomy software gridding
CN105955825B (en) * 2016-05-09 2020-07-10 深圳大学 Method for optimizing astronomy software gridding
WO2017206591A1 (en) * 2016-06-01 2017-12-07 华为技术有限公司 Data processing system and data processing method
CN106383961B (en) * 2016-09-29 2019-07-19 中国南方电网有限责任公司电网技术研究中心 Large-Eddy Simulation optimized treatment method under CPU+MIC heterogeneous platform
CN106383961A (en) * 2016-09-29 2017-02-08 中国南方电网有限责任公司电网技术研究中心 Optimization processing method for large eddy simulation algorithm under CPU+MIC heterogeneous platform
CN106650315A (en) * 2016-11-30 2017-05-10 郑州云海信息技术有限公司 SIFT parallel algorithm based on CPU+MIC heterogeneous platform
CN106650315B (en) * 2016-11-30 2020-01-03 苏州浪潮智能科技有限公司 SIFT parallel processing method based on CPU + MIC heterogeneous platform
CN108958924A (en) * 2017-03-27 2018-12-07 爱思开海力士有限公司 Storage system and its operating method with delay distribution optimization
CN108958924B (en) * 2017-03-27 2022-02-11 爱思开海力士有限公司 Memory system with delay profile optimization and method of operating the same
WO2019072179A1 (en) * 2017-10-11 2019-04-18 Oppo广东移动通信有限公司 Application running control method and apparatus
CN109240866A (en) * 2018-09-10 2019-01-18 郑州云海信息技术有限公司 A kind of Performance tuning method based on server performance test
CN111130936A (en) * 2019-12-24 2020-05-08 杭州迪普科技股份有限公司 Method and device for testing load balancing algorithm
CN113296972A (en) * 2020-07-20 2021-08-24 阿里巴巴集团控股有限公司 Information registration method, computing device and storage medium
CN113850032A (en) * 2021-12-02 2021-12-28 中国空气动力研究与发展中心计算空气动力研究所 Load balancing method in numerical simulation calculation
CN113850032B (en) * 2021-12-02 2022-02-08 中国空气动力研究与发展中心计算空气动力研究所 Load balancing method in numerical simulation calculation
CN115718665A (en) * 2023-01-10 2023-02-28 北京卡普拉科技有限公司 Asynchronous I/O thread processor resource scheduling control method, device, medium and equipment

Similar Documents

Publication Publication Date Title
CN103279391A (en) Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing
Pérez et al. CellSs: Making it easier to program the Cell Broadband Engine processor
Schaller et al. SWIFT: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100,000 cores
CN102902512A (en) Multi-thread parallel processing method based on multi-thread programming and message queue
CN103049245A (en) Software performance optimization method based on central processing unit (CPU) multi-core platform
WO2012067688A1 (en) Codeletset representation, manipulation, and execution-methods, system and apparatus
Ravi et al. A dynamic scheduling framework for emerging heterogeneous systems
Holk et al. Declarative parallel programming for GPUs
Ashraf et al. Empirical Analysis of HPC Using Different Programming Models.
Li et al. Optimizing massively parallel winograd convolution on arm processor
Christou et al. Earth system modelling on system-level heterogeneous architectures: EMAC (version 2.42) on the Dynamical Exascale Entry Platform (DEEP)
Moustafa et al. 3D cartesian transport sweep for massively parallel architectures with PARSEC
Zheng et al. Performance model for OpenMP parallelized loops
Chen et al. Integrated research of parallel computing: Status and future
Schmaus et al. System Software for Resource Arbitration on Future Many-Architectures
Yang et al. Performance‐based parallel loop self‐scheduling using hybrid OpenMP and MPI programming on multicore SMP clusters
Odajima et al. GPU/CPU work sharing with parallel language XcalableMP-dev for parallelized accelerated computing
Zhang et al. Design of a multithreaded Barnes-Hut algorithm for multicore clusters
Chandrashekar et al. Performance model of HPC application On CPU-GPU platform
Chandrashekhar et al. Performance study of OpenMP and hybrid programming models on CPU–GPU cluster
Stock et al. A GPU-accelerated boundary element method and vortex particle method
Bard et al. A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics
Chandrashekhar et al. Prediction Model for Scheduling an Irregular Graph Algorithms on CPU–GPU Hybrid Cluster Framework
Hippold et al. Task pool teams for implementing irregular algorithms on clusters of SMPs
Bosilca et al. Scalable dense linear algebra on heterogeneous hardware

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130904