CN103279391A - Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing - Google Patents

Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing

Info

Publication number
CN103279391A
CN103279391A · CN2013102343891A · CN201310234389A
Authority
CN
China
Prior art keywords
thread
cpu
mic
load
load balancing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102343891A
Other languages
Chinese (zh)
Inventor
吴庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN2013102343891A priority Critical patent/CN103279391A/en
Publication of CN103279391A publication Critical patent/CN103279391A/en
Pending legal-status Critical Current

Landscapes

  • Multi Processors (AREA)

Abstract

The invention provides a load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) architecture processor cooperative computing. It covers load balancing among computing nodes, between the CPU and MIC computing devices within a node, and among the computing cores inside each CPU and MIC device. The method comprises task partitioning optimization, process/thread scheduling optimization, thread affinity optimization, and related techniques. It is applicable to software optimization for CPU and MIC cooperative computing, and guides software developers in performing load balancing optimization and modification of existing CPU+MIC cooperative software in a scientific, effective, and systematic way, maximizing the software's utilization of computing resources and markedly improving its computing efficiency and overall performance.

Description

A load balancing optimization method based on CPU and MIC architecture processor cooperative computing
Technical field
The present invention relates to the fields of high-performance computing and scientific computing, and in particular to a load balancing optimization method based on cooperative computing between CPU and MIC architecture processors.
Background technology
As traditional CPUs approach the limits of current semiconductor technology in power consumption and clock frequency, processor designers have turned to multi-core and even many-core architectures; against this background, Intel's MIC many-core architecture processor emerged. The software model of CPU and MIC architecture processor cooperative computing is gradually becoming a mainstream software architecture for high-performance computing.
In practice, however, simply porting application software to a heterogeneous platform composed of CPU and MIC architecture processors yields unsatisfactory performance gains. This is the principal challenge encountered in the heterogeneous cooperative computing model, which makes re-optimization of the software necessary.
Load balancing optimization is an important part of software performance optimization. "Load" refers to the distribution of work among multiple tasks, and "load balancing" means distributing that work evenly. In parallel computing, load balancing assigns tasks evenly across the available computing resources, so that the parallel system exercises its full computing power with no resource idle or waiting and none overloaded. A good parallel method realizes this balancing effect; load imbalance causes reduced computing efficiency and poor scalability. Achieving load balance is therefore a key concern in parallel computing, especially for the MIC architecture processor: with its large number of cores, the effect of good load balancing on performance is even more pronounced.
To improve the performance of software on heterogeneous systems composed of CPU and MIC architecture processors, the software must be optimized in a targeted way. This method focuses on load balancing optimization for CPU and MIC cooperative computing, maximizing the utilization of computing resources and improving software efficiency.
Summary of the invention
The objective of the invention is to improve the performance of software on heterogeneous platforms composed of CPU and MIC architecture processors, for which the software must be optimized in a targeted way, by providing a CPU and MIC cooperative load balancing optimization that maximizes computing resource utilization and improves software efficiency.
The objective is realized in the following manner, comprising the levels of load balancing optimization of CPU and MIC cooperative computing software and the load balancing optimization methods for such software, wherein:
(1) Levels of load balancing optimization of CPU and MIC cooperative computing software; the optimization methods for each of the three load balancing levels are described below:
A) load balancing among the computing nodes: the calculation tasks are distributed approximately evenly among the computing nodes, so that no node's share of the computation is too large or too small; each computing node here is a hybrid computing node composed of traditional CPUs and MIC architecture processors, and the nodes compute in parallel using MPI; inter-node load balancing is the same as on a traditional pure-CPU cluster, using static and dynamic load balancing methods;
B) load balancing between the CPU and the MIC computing devices within a node: the calculation tasks are divided approximately evenly between the CPU and the MIC devices; because the computing power of CPU and MIC processors differs, the amounts of work assigned to them cannot be identical, and the best approach is the dynamic load balancing method;
C) load balancing among the computing cores inside each CPU and MIC device: the calculation tasks are distributed evenly across the individual computing cores of each CPU and MIC device;
There are two schemes for achieving load balance: static load balancing and dynamic load balancing, wherein:
A) Static load balancing
Static load balancing requires manually dividing the work into multiple parallel parts and ensuring that each divided part of the workload is distributed evenly across processes, threads, and ultimately processors; that is, the workload is distributed evenly among the tasks, so that the speedup of the parallel program is maximized;
B) Dynamic load balancing
Dynamic load balancing assigns calculation tasks dynamically during program execution to achieve load balance; many situations in software practice cannot be solved by static load balancing, chiefly because the calculation tasks are random or arrive in real time and cannot be predicted in advance; the overall system performance of dynamic load balancing is better than that of static load balancing, but the code is more complex to implement;
The load balancing optimization methods for CPU and MIC cooperative computing software are:
[1] task partitioning: for applications partitioned by task, load balancing between the CPU and the MIC uses the dynamic load balancing method; suppose there are N tasks and a node contains two MIC cards, i.e., three devices, 1 CPU and 2 MICs; in the dynamic method each device first takes one task and computes it, and immediately takes the next task on completion without waiting for the other devices, until all N tasks are finished; only a single host process needs to be set up, responsible for assigning tasks to each compute process;
Task partitioning optimization refers to optimizing the distribution of calculation tasks under a data-parallel or task-parallel pattern, so that the calculation task of each parallel processing unit is balanced;
[2] data partitioning: because the memory space on a device must be allocated in a single pass, applications partitioned by data cannot use dynamic load balancing and must use a static data partitioning method; static partitioning makes load balancing between heterogeneous devices difficult, and sometimes impossible; for iterative applications, a learning-based data partitioning method is used, in which the CPU and the MIC each perform one trial iteration of the same amount of computation, the ratio of their computing power is derived from the respective run times, and the data are then partitioned accordingly;
[3] process/thread scheduling optimization: processes and threads are the software-level execution units of parallel processing and the carriers of parallel computation, and their scheduling directly affects the overall performance of the software;
C) Load balancing among the computing cores inside each CPU and MIC device
Load balancing among the cores inside a CPU or MIC device, also called intra-device load balancing, adopts the three scheduling policies of OpenMP:
(a) schedule(static [, chunk]): static scheduling; each thread obtains chunk iterations at a time, assigned in round-robin fashion; if chunk is not given, the iterations are divided evenly; this is the default policy;
(b) schedule(dynamic [, chunk]): dynamic scheduling; iterations are assigned to threads dynamically; with the chunk parameter, each assignment gives a thread chunk iterations, and without it the iterations are assigned to the threads one at a time;
(c) schedule(guided [, chunk]): guided scheduling, a heuristic self-adapting scheduling method; each thread initially receives a large block of iterations, and subsequent blocks shrink, decreasing exponentially down to the specified chunk size; if chunk is not specified, the block size decreases to a minimum of 1;
D) Thread affinity optimization
The essence of load balancing is that the task loads assigned to the processes/threads should be roughly equal, realizing the principle that "the more capable should do more work", so that every thread is kept busy as far as possible and all threads finish together; but a thread does not exist in a vacuum: it must ultimately be assigned to a computing core of a processor for execution, and the process of assigning threads to cores is thread scheduling; scheduling the computational threads evenly across the processor cores, so that the working load of each core is comparable, achieves load balance among the computing cores;
Process/thread affinity is the property of being able to restrict a process or thread to run on a subset of the available CPUs; put simply, CPU affinity is the tendency of a process to run on a given CPU for as long as possible without being migrated to other processors, and to some extent it exposes the process/thread scheduling policy of a multiprocessor system to the systems programmer;
Soft affinity means that the process does not migrate frequently between processors; hard affinity means that the process must run on the specified processors; affinity settings allow developers to program hard CPU affinity, meaning that an application can explicitly specify on which processor or processors a process runs;
In the Linux kernel every process/thread has an associated data structure called task_struct; this structure is very important, and the field most relevant to affinity is the cpus_allowed bitmask, which consists of n bits in one-to-one correspondence with the n logical processors in the system; if a given bit is set for a given process, the process may run on the corresponding CPU; therefore, if a process may run on any processor core and migrate between cores as needed, the bitmask is all ones, which is in fact the default setting for Linux processes;
Setting process/thread affinity simply means setting the processor-core bitmask in the process/thread data structure;
(2) Design of the thread affinity load balancing optimization method
(a) threads see only logical cores, so one physical core with Hyper-Threading enabled counts as 2 computing cores;
(b) the tension between cache utilization and physical-core load balance must be resolved;
(c) this optimization method applies when not all physical cores can be used, or when task locality is clearly defined;
Depending on the affinity setting, threads are assigned to different logical cores; if threads share data and their logical cores belong to the same physical core, the threads can exploit the cache of that physical core and run faster; however, that core then bears too much calculation while the other cores sit relatively "idle", so computing resource utilization is not maximized and the overall performance of the program is not optimal; thread affinity on the MIC processor has 3 modes: scatter, compact, and balanced (the last exclusive to MIC):
Scatter mode
Scatter mode preferentially assigns threads to the most lightly loaded physical cores; this achieves good load balance, but adjacent threads do not share a physical core, so if adjacent threads share data the cache cannot accelerate them; on MIC a core can read the L2 cache of another core, but doing so reduces the total effective cache size, and reading a remote core's cache is less efficient than reading the local cache;
Compact mode
Compact mode assigns threads to the logical cores in order, keeping adjacent threads on the same physical core as far as possible; if adjacent threads share data, this improves cache utilization; but when the loads of adjacent threads are comparable and the total number of threads is small, the task load becomes highly concentrated and the imbalance is severe; nevertheless, for certain special task distributions with clear regularity, for example when odd-numbered threads are lightly loaded and even-numbered threads are heavily loaded, this affinity mode can on the contrary achieve good load balance;
Balanced mode
Balanced mode is specific to MIC; like scatter mode it assigns threads to the most lightly loaded physical cores as far as possible, but unlike scatter mode it also tries, while balancing, to place adjacent threads on the same physical core, striking a compromise between load balance and cache utilization;
All three modes are static assignments; the appropriate thread allocation mode must be chosen according to the actual load of the program to obtain good results.
The beneficial effects of the invention are as follows: the method applies widely to CPU and MIC cooperative computing scenarios on CPU+MIC hybrid architecture platforms; it guides software developers in effective load balancing optimization of existing software, optimizes the software's utilization of system resources, and markedly improves hardware resource utilization and the computing efficiency of the software, thereby greatly improving overall software performance.
Description of drawings
Fig. 1 is a schematic diagram of the CPU and MIC cooperative computing working model;
Fig. 2 is a schematic diagram of the micro-architecture of the MIC many-core processor with its numerous computing cores (Core);
Fig. 3 is a schematic diagram of the CPU/MIC cooperative computing load balancing hierarchy;
Fig. 4 is a schematic diagram of the parallel hierarchy of multi-node CPU+MIC cooperative computing;
Fig. 5 is a schematic diagram of the experimental results of multi-node CPU+MIC cooperative computing;
Fig. 6 is a schematic diagram of the speedup of multi-node CPU+MIC cooperative computing relative to serial execution;
Fig. 7 is a schematic diagram of the speedup of multi-node CPU+MIC cooperative computing relative to single-node OpenMP multithreading.
Embodiment
The method of the present invention is explained below with reference to the accompanying drawings.
The main levels of load balancing optimization for CPU and MIC cooperative computing software are:
1) load balancing among the computing nodes;
2) load balancing between the CPU and MIC computing devices within a node;
3) load balancing among the computing cores inside each CPU and MIC device.
The main load balancing optimization methods for CPU and MIC cooperative computing software are:
1) Task partitioning optimization
Parallel computing means that multiple calculation tasks are assigned to different processes/threads for parallel processing. The size of each process's or thread's calculation task, its "working load", therefore directly determines the execution time of that process/thread;
2) Process/thread scheduling optimization
Processes/threads are ultimately assigned to processor computing cores for execution, and this assignment is called process/thread scheduling. If the processes/threads executed by the cores are unbalanced, the computational load is unevenly distributed: some cores are overloaded while others are too "idle".
Embodiment
The object of the present invention is to provide a load balancing optimization method for CPU and MIC architecture processor cooperative computing.
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and embodiments.
First, the CPU and MIC cooperative computing working model is briefly introduced, as shown in Fig. 1:
The MIC many-core processor has numerous computing cores (Core); its micro-architecture is shown schematically in Fig. 2:
The MIC architecture processor is generally used as a coprocessor working together with a traditional CPU. The MIC card communicates with the CPU over the PCI-E bus. The MIC coprocessor card runs a micro operating system based on the Linux kernel, called uOS, which performs process/thread scheduling on the MIC. MIC supports multiple languages such as C/C++/Fortran and parallel libraries such as MPI/OpenMP/pthreads, and Intel's software tool chain provides comprehensive support for the MIC architecture.
All examples below are set forth in terms of the Intel compiler, OpenMP, and C-language parallelism.
The implementation steps of the CPU and MIC cooperative load balancing optimization method are detailed as follows:
1) Overview of load balancing
In general, there are two schemes for achieving load balance: static load balancing and dynamic load balancing.
A) Static load balancing
Static load balancing requires manually dividing the work into multiple parallel parts and ensuring that each divided part (workload) is distributed evenly across processes, threads, and ultimately processors; that is, the workload is distributed evenly among the tasks, so that the speedup of the parallel program is maximized;
B) Dynamic load balancing
Dynamic load balancing assigns calculation tasks dynamically during program execution to achieve load balance. Many situations in software practice cannot be solved by static load balancing, chiefly because the calculation tasks are random or arrive in real time and cannot be predicted in advance. In general, the overall system performance of dynamic load balancing is better than that of static load balancing, but the code is more complex to implement.
2) Load balancing optimization methods for CPU and MIC cooperative computing software
The main levels of load balancing optimization for CPU and MIC cooperative computing software are:
A) load balancing among the computing nodes;
B) load balancing between the CPU and MIC computing devices within a node;
C) load balancing among the computing cores inside each CPU and MIC device.
Fig. 3 shows the CPU/MIC cooperative computing load balancing hierarchy (the cooperative working model itself is shown in Fig. 1).
The load balancing optimization methods are described below for each of the three levels:
A) Load balancing among the computing nodes
Each computing node here is a hybrid computing node composed of traditional CPUs and MIC architecture processors, and the nodes compute in parallel using MPI. Inter-node load balancing is the same as on a traditional pure-CPU cluster, and both static and dynamic load balancing methods can be used.
B) Load balancing between the CPU and MIC computing devices within a node
Because the computing power of CPU and MIC processors differs, the amounts of work assigned to the CPU and the MIC cannot be identical, and the best approach to balancing load between them is the dynamic load balancing method. The inter-device balancing methods to be adopted under task partitioning and under data partitioning are described below;
(1) Task partitioning: for applications partitioned by task, load balancing between the CPU and the MIC can use the dynamic load balancing method. For example, with N tasks and two MIC cards in a node, i.e., three devices (1 CPU and 2 MICs), each device first takes one task and computes it, then immediately takes the next task on completion without waiting for the other devices, until all N tasks are finished. Only a single host process needs to be set up, responsible for assigning tasks to each compute process;
(2) Data partitioning: because the memory space on a device must be allocated in a single pass, applications partitioned by data cannot use dynamic load balancing and must use a static data partitioning method. Static partitioning makes load balancing between heterogeneous devices difficult, and sometimes impossible. For iterative applications a learning-based data partitioning method can be used: let the CPU and the MIC each perform one trial iteration of the same amount of computation, derive the ratio of their computing power from the respective run times, and then partition the data accordingly.
C) Load balancing among the computing cores inside each CPU and MIC device
Load balancing among the cores inside a CPU or MIC device, also called intra-device load balancing, can use the three scheduling policies of OpenMP:
(1) schedule(static [, chunk]): static scheduling; each thread obtains chunk iterations at a time, assigned in round-robin fashion. If chunk is not given, the iterations are divided evenly; this is the default policy;
(2) schedule(dynamic [, chunk]): dynamic scheduling; iterations are assigned to threads dynamically. With the chunk parameter, each assignment gives a thread chunk iterations; without it, the iterations are assigned to the threads one at a time;
(3) schedule(guided [, chunk]): guided scheduling, a heuristic self-adapting scheduling method. Each thread initially receives a large block of iterations, and subsequent blocks shrink, decreasing exponentially down to the specified chunk size; if chunk is not specified, the block size decreases to a minimum of 1.
The applicable range of each OpenMP scheduling policy is shown in the following table.

  Scheduling policy | Applicable range
  static            | fixed amount of work, identical work in every iteration
  dynamic           | amount of work not fixed, work differs between iterations
  guided            | a special case of dynamic scheduling; reduces scheduling overhead
D) Thread affinity optimization
The essence of load balancing is that the task loads assigned to the processes/threads should be roughly equal, realizing the principle that "the more capable should do more work", so that every thread is kept busy as far as possible and all threads finish together. But a thread does not exist in a vacuum: it must ultimately be assigned to a computing core of a processor for execution, and the process of assigning threads to cores is thread scheduling. Scheduling the computational threads evenly across the processor cores, so that the working load of each core is comparable, achieves load balance among the computing cores.
Process/thread affinity is the property of being able to restrict a process or thread to run on a subset of the available CPUs. Put simply, CPU affinity is the tendency of a process to run on a given CPU for as long as possible without being migrated to other processors, and to some extent it exposes the process/thread scheduling policy of a multiprocessor system to the systems programmer.
Soft affinity means that the process does not migrate frequently between processors; hard affinity means that the process must run on the specified processors. Affinity settings allow developers to program hard CPU affinity, meaning that an application can explicitly specify on which processor or processors a process runs.
In the Linux kernel every process/thread has an associated data structure called task_struct. This structure is very important, and the field most relevant to affinity is the cpus_allowed bitmask, which consists of n bits in one-to-one correspondence with the n logical processors in the system. If a given bit is set for a given process, the process may run on the corresponding CPU. Therefore, if a process may run on any processor core and migrate between cores as needed, the bitmask is all ones; this is in fact the default setting for Linux processes.
Setting process/thread affinity simply means setting the processor-core bitmask in the process/thread data structure; for the detailed setting procedure, please consult the relevant manuals.
The load balancing optimization method of setting thread affinity involves several issues:
(1) threads see only logical cores, so one physical core with Hyper-Threading enabled also counts as 2 computing cores;
(2) the tension between cache utilization and physical-core load balance must be resolved;
(3) this optimization method generally applies when not all physical cores can be used, or when task locality is clearly defined.
Depending on the affinity setting, threads are assigned to different logical cores. If threads share data and their logical cores belong to the same physical core, the threads can exploit the cache of that physical core and run faster; however, that core then bears too much calculation while the other cores sit relatively "idle", so computing resource utilization is not maximized and the overall performance of the program is not optimal. Thread affinity on the MIC processor has 3 modes: scatter, compact, and balanced (exclusive to MIC):
(1) Scatter mode
Scatter mode preferentially assigns threads to the most lightly loaded physical cores. This achieves good load balance, but adjacent threads do not share a physical core, so if adjacent threads share data the cache cannot accelerate them. On MIC a core can read the L2 cache of another core, but doing so reduces the total effective cache size, and reading a remote core's cache is less efficient than reading the local cache;
(2) Compact mode
Compact mode assigns threads to the logical cores in order, keeping adjacent threads on the same physical core as far as possible; if adjacent threads share data, this improves cache utilization. But when the loads of adjacent threads are comparable and the total number of threads is small, the task load becomes highly concentrated and the imbalance is severe. Nevertheless, for certain special task distributions with clear regularity, for example when odd-numbered threads are lightly loaded and even-numbered threads are heavily loaded, this affinity mode can on the contrary achieve good load balance;
(3) Balanced mode
Balanced mode is specific to MIC. Like scatter mode, it assigns threads to the most lightly loaded physical cores as far as possible, but unlike scatter mode it also tries, while balancing, to place adjacent threads on the same physical core. This mode strikes a compromise between load balance and cache utilization.
All three modes are static assignments; the appropriate thread allocation mode must be chosen according to the actual load of the program to obtain good results.
Performance test and analysis
The above methods were applied to a high-performance computing case: large eddy simulation based on the lattice Boltzmann algorithm.
1) Brief introduction to the module
The lattice Boltzmann method (LBM) has developed into an effective numerical simulation method over the past 20 years. It is a mesoscopic method, lying between microscopic molecular-dynamics methods and macroscopic methods based on the continuum hypothesis. Unlike traditional fluid simulation methods, it is grounded in molecular kinetic theory: macroscopic average properties are obtained by tracking the transport of particle distribution functions and then solving for those distribution functions. The kinetic-theory character of the lattice Boltzmann method makes it especially effective for simulating complex flows such as porous-media flow, suspension flow, multiphase flow, and multicomponent flow. The lattice Boltzmann method is inherently parallel, and offers advantages such as simple boundary treatment and easy implementation.
In computational fluid dynamics, LBM is a modeling and computation method distinct from traditional numerical methods; it is a special discrete scheme for solving the Boltzmann equation. The solution process is time-marching and exhibits good locality, so it is particularly well suited to parallel solution.
Large eddy simulation (LES) is an important numerical simulation method in fluid mechanics that has matured over recent decades. It differs from direct numerical simulation (DNS) and Reynolds-averaged (RANS) methods. Its basic idea is to resolve accurately all turbulent motions above a certain scale, capturing the many unsteady, non-equilibrium large-scale effects and coherent structures that RANS methods cannot, while avoiding the enormous computational cost that DNS incurs by resolving all turbulence scales. It is therefore regarded as a promising direction for turbulence numerical simulation.
The lattice Boltzmann method can be used to solve large eddy simulations; combining the two yields lattice Boltzmann large eddy simulation (LBM-LES), which has been widely applied in fluid mechanics.
2) Multi-node CPU+MIC architecture processor cooperative computing model
In this case, the original program was a single-node, single-threaded serial CPU program. The module was ported to a mixed-architecture multi-node cluster based on CPU+MIC architecture processors, realizing parallel processing in the CPU+MIC cooperative computing mode; its parallel hierarchy is shown in Fig. 4:
Fig. 4 Parallel hierarchy of multi-node CPU+MIC cooperative computing.
During software development, the load balancing optimization methods for the CPU+MIC architecture processor cooperative computing mode provided by this patent were applied, greatly improving CPU and MIC compute-resource utilization and significantly improving the performance of the module.
3) test environment
4) Analysis of the performance test results
The experimental results of the LES parallel algorithm under multi-node CPU+MIC cooperative computing are shown in Fig. 5. On a 2-node CPU+MIC cooperative computing platform, with each node configured with two CPUs and 2 MIC cards, the LES parallel algorithm reaches 1333.43 MLUPS (at a grid size of 8192*8192).
Fig. 5 Multi-node CPU+MIC cooperative computing experimental results
The speedup of the multi-node parallel algorithm relative to serial CPU execution is shown in Fig. 6; the peak speedup on 2 nodes reaches 156.87 times serial.
Fig. 6 Speedup of multi-node CPU+MIC cooperative computing relative to serial
The speedup of the multi-node parallel algorithm relative to single-node OpenMP multithreading is shown in Fig. 7. As the figure shows, a single MIC card is 1.77 times two CPUs; a single node with two CPUs + 2 MICs is 3.4 times two CPUs (adding 2 MIC cards to a two-CPU node brought a 2.4-fold performance increase); and 2 nodes (two CPUs + 2 MICs per node) are 6.71 times a single two-CPU node.
Fig. 7 Speedup of multi-node CPU+MIC cooperative computing relative to single-node OpenMP multithreading.
Technical features not described in this specification are known to those skilled in the art.

Claims (1)

1. A load balancing optimization method based on CPU and MIC architecture processor cooperative computing, characterized by comprising the levels of load balancing optimization for software in the CPU and MIC architecture processor cooperative computing mode and the load balancing optimization methods for such software, wherein:
(1) The levels of load balancing optimization for software in the CPU and MIC architecture processor cooperative computing mode; the load balancing optimization methods are set out below for each of three load balancing levels:
A) Load balancing between computing nodes: the computing tasks distributed to the computing nodes are roughly balanced, so that no node's workload is excessively large or small; each computing node here is a mixed-architecture computing node composed of a traditional CPU and a MIC architecture processor; the nodes compute in parallel via MPI, and inter-node load balancing is the same as on a traditional pure-CPU cluster, using static and dynamic load balancing methods;
B) Load balancing between the CPU and MIC computing devices within a computing node: the computing tasks are roughly balanced between the CPU and MIC devices; because the computing power of the CPU and MIC processors differs, the amounts of computation distributed to the CPU and the MIC cannot be identical, and the best approach to load balancing between CPU and MIC is the dynamic load balancing method;
C) Load balancing among the compute cores within the CPU and MIC computing devices: the computing tasks distributed among the compute cores of the CPU and MIC devices are balanced;
Load balancing is realized by two schemes, static load balancing and dynamic load balancing, wherein:
a) Static load balancing
Static load balancing requires the work region to be manually divided into multiple parallel parts, ensuring that the divided workloads can be evenly distributed across the processes, threads, and even processors on which they run; that is, the workload is distributed evenly among the tasks so that the speedup of the parallel program is maximized;
b) Dynamic load balancing
Dynamic load balancing assigns computing tasks dynamically at run time to achieve load balance; many situations in software practice cannot be solved by static load balancing, chiefly because the computing tasks are random and real-time and cannot be predicted in advance; dynamic load balancing yields better overall system performance than static load balancing, but the code is more complex to implement;
The load balancing optimization methods for software in the CPU and MIC architecture processor cooperative computing mode are:
[1] Task division: for task-divided applications, load balancing between CPU and MIC uses the dynamic load balancing optimization method; suppose there are N tasks and a node contains two MIC cards, i.e. three devices, 1 CPU and 2 MICs; in the dynamic load balancing method each device first obtains one task and computes it, and immediately obtains the next task after finishing, without waiting for the other devices, until all N tasks have been computed; this mode only requires one host process, responsible for distributing tasks to each computing process;
Task division optimization: the distribution of computing tasks under the data-parallel or task-parallel pattern is optimized, so that the computing tasks of the parallel processing units are balanced;
[2] Data division: because memory must be allocated on a device in a single pass, data-divided applications cannot use dynamic load balancing and must use a static data division method; static data division makes load balancing between heterogeneous devices difficult and sometimes impossible; for iterative applications, a learning-based data division method is used, comprising having the CPU and the MIC each compute one iteration of an identical amount of computation, computing the CPU-to-MIC computing-power ratio from their respective run times, and then dividing the data accordingly;
[3] Process/thread scheduling optimization: processes and threads are the software-level execution units of parallel processing and the carriers of parallel computation; the optimized scheduling of processes and threads bears directly on the overall performance of the software;
C) Load balancing among the compute cores within the CPU and MIC computing devices
Load balancing among the compute cores within the CPU and MIC computing devices, also called intra-device load balancing, uses the three loop-scheduling strategies provided by OpenMP:
(a) schedule(static[, chunk]): static scheduling; each thread obtains chunk iterations at a time, assigned in round-robin fashion; if chunk is not given, the iterations are divided evenly; this is the default scheduling mode;
(b) schedule(dynamic[, chunk]): dynamic scheduling; iterations are assigned to threads dynamically; with the chunk parameter, each thread receives the specified chunk iterations at a time; without it, iterations are assigned to threads one at a time;
(c) schedule(guided[, chunk]): guided scheduling, a heuristic self-adapting scheduling method; each thread is initially assigned a larger block of iterations, and the blocks assigned thereafter shrink, decreasing exponentially down to the specified chunk size; if the chunk parameter is not specified, the block size decreases to a minimum of 1;
D) Thread affinity optimization
The essence of load balancing is that the task loads assigned to the processes/threads are relatively balanced, realizing "the capable do more work": each thread is kept busy as far as possible and all threads finish together; but threads do not exist in a vacuum; they must ultimately be assigned to the compute cores of a processor for execution; assigning threads to the processing cores is thread scheduling, and scheduling the computational threads evenly across the processor cores, so that the workload of each processing core is comparable, realizes load balancing among the compute cores;
Process and thread affinity refers to the ability to restrict a process or thread to run on a subset of the available CPUs; simply put, CPU affinity is the tendency of a process to run on a given CPU for as long as possible without being migrated to another processor; to a certain extent it exposes the process/thread scheduling policy on multiprocessor systems to the systems programmer;
Soft affinity means that a process does not migrate frequently between processors; hard affinity means that a process must run on the processor(s) designated for it; setting process/thread affinity allows developers to program hard CPU affinity, meaning the application explicitly specifies on which processor or processors a process runs;
In the Linux kernel, every process/thread has an associated data structure called task_struct; the field of this structure most relevant to affinity is the cpus_allowed bitmask, which consists of n bits corresponding one-to-one with the n logical processors in the system; if a given bit is set for a given process, the process may run on the corresponding CPU; therefore, if a process may run on any processor core and migrate between cores as needed, the bitmask is all 1s, which is in fact the default setting for processes in Linux;
Setting process/thread affinity in practice means setting the processor-core bitmask in the process/thread data structure;
(2) Design of the load balancing optimization method for thread affinity
(a) Threads recognize only logical cores, so one physical core with Hyper-Threading enabled counts as two compute cores;
(b) The contradiction between cache utilization and physical-core load balancing must be resolved;
(c) This optimization method applies to cases where not all physical cores can be used, or where task locality is clear;
Depending on the affinity setting, threads are assigned to different logical cores; if there are data dependencies between threads and their logical cores reside on the same physical core, the threads can improve running speed by using the cache of that shared physical core; however, that core then bears too much of the computation while the other cores sit relatively idle, so compute-resource utilization is not maximized and the overall performance of the program is not optimal; thread affinity on the MIC processor has three modes, scatter, compact, and the MIC-specific balanced:
scatter mode
Scatter mode assigns threads preferentially to the most lightly loaded physical cores; this achieves good load balancing, but because adjacent threads are not on the same physical core, data sharing between adjacent threads cannot benefit from cache reuse; on MIC, although a core can read another core's L2 cache, doing so reduces the total effective cache capacity, and reading another core's cache is less efficient than reading the local cache;
compact mode
Compact mode assigns threads to logical cores in order, which keeps adjacent threads on the same physical core as far as possible; if adjacent threads share data, this improves cache utilization; however, if adjacent threads have comparable loads and the total thread count is small, tasks become highly concentrated and the load severely imbalanced; for certain special task distributions with clear regularity, comprising lightly loaded odd-numbered threads and heavily loaded even-numbered threads, this affinity mode can instead achieve good load balancing;
balanced mode
Balanced mode is unique to MIC; like scatter mode, it assigns threads to the most lightly loaded physical cores, but unlike scatter, while keeping the load balanced it also places adjacent threads on the same physical core as far as possible; this mode strikes a balance between load balancing and cache utilization;
All three of the above modes are static assignments, and a suitable thread placement must be chosen according to the actual load of the program to achieve good results.
CN2013102343891A 2013-06-09 2013-06-09 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing Pending CN103279391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102343891A CN103279391A (en) 2013-06-09 2013-06-09 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing


Publications (1)

Publication Number Publication Date
CN103279391A true CN103279391A (en) 2013-09-04

Family

ID=49061924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102343891A Pending CN103279391A (en) 2013-06-09 2013-06-09 Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing

Country Status (1)

Country Link
CN (1) CN103279391A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123190A (en) * 2014-07-23 2014-10-29 浪潮(北京)电子信息产业有限公司 Load balance method and device of heterogeneous cluster system
CN104156271A (en) * 2014-08-01 2014-11-19 浪潮(北京)电子信息产业有限公司 Method and system for balancing cooperative computing cluster load
CN104266657A (en) * 2014-09-12 2015-01-07 海华电子企业(中国)有限公司 Shortest path planning parallelization method based on cooperative computing of CPU and MIC
CN104461728A (en) * 2013-09-18 2015-03-25 Sap欧洲公司 Migration event dispatching management
CN104679593A (en) * 2015-03-13 2015-06-03 浪潮集团有限公司 Task scheduling optimization method based on SMP system
CN105260241A (en) * 2015-10-23 2016-01-20 南京理工大学 Mutual cooperation method for processes in cluster system
CN105468455A (en) * 2015-11-23 2016-04-06 天脉聚源(北京)传媒科技有限公司 Dynamic task distribution method and apparatus for multiple devices
CN105893151A (en) * 2016-04-01 2016-08-24 浪潮电子信息产业股份有限公司 High-dimensional data stream processing method based on CPU + MIC heterogeneous platform
CN105955825A (en) * 2016-05-09 2016-09-21 深圳大学 Method for optimizing astronomy software gridding
CN106383961A (en) * 2016-09-29 2017-02-08 中国南方电网有限责任公司电网技术研究中心 Optimization processing method for large eddy simulation algorithm under CPU+MIC heterogeneous platform
CN106650315A (en) * 2016-11-30 2017-05-10 郑州云海信息技术有限公司 SIFT parallel algorithm based on CPU+MIC heterogeneous platform
WO2017206591A1 (en) * 2016-06-01 2017-12-07 华为技术有限公司 Data processing system and data processing method
CN104794194B (en) * 2015-04-17 2018-10-26 同济大学 A kind of distributed heterogeneous concurrent computational system towards large scale multimedia retrieval
CN108958924A (en) * 2017-03-27 2018-12-07 爱思开海力士有限公司 Storage system and its operating method with delay distribution optimization
CN109240866A (en) * 2018-09-10 2019-01-18 郑州云海信息技术有限公司 A kind of Performance tuning method based on server performance test
WO2019072179A1 (en) * 2017-10-11 2019-04-18 Oppo广东移动通信有限公司 Application running control method and apparatus
CN111130936A (en) * 2019-12-24 2020-05-08 杭州迪普科技股份有限公司 Method and device for testing load balancing algorithm
CN113296972A (en) * 2020-07-20 2021-08-24 阿里巴巴集团控股有限公司 Information registration method, computing device and storage medium
CN113850032A (en) * 2021-12-02 2021-12-28 中国空气动力研究与发展中心计算空气动力研究所 Load balancing method in numerical simulation calculation
CN115718665A (en) * 2023-01-10 2023-02-28 北京卡普拉科技有限公司 Asynchronous I/O thread processor resource scheduling control method, device, medium and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100095303A1 (en) * 2008-10-09 2010-04-15 International Business Machines Corporation Balancing A Data Processing Load Among A Plurality Of Compute Nodes In A Parallel Computer
US20100257538A1 (en) * 2009-04-03 2010-10-07 Microsoft Corporation Parallel programming and execution systems and techniques
CN101976201A (en) * 2010-10-22 2011-02-16 北京航空航天大学 CPU affinity-based virtual CPU dynamic binding method
US8180973B1 (en) * 2009-12-23 2012-05-15 Emc Corporation Servicing interrupts and scheduling code thread execution in a multi-CPU network file server
CN102855218A (en) * 2012-05-14 2013-01-02 中兴通讯股份有限公司 Data processing system, method and device
CN102929723A (en) * 2012-11-06 2013-02-13 无锡江南计算技术研究所 Method for dividing parallel program segment based on heterogeneous multi-core processor


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ELI DOW: "Managing processor affinity", 《HTTP://WWW.IBM.COM/DEVELOPERWORKS/CN/LINUX/L-AFFINITY.HTML》 *
Wang Endong et al.: "MIC High-Performance Computing Programming Guide", 30 November 2012, China Water & Power Press *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461728A (en) * 2013-09-18 2015-03-25 Sap欧洲公司 Migration event dispatching management
CN104461728B (en) * 2013-09-18 2019-06-14 Sap欧洲公司 Computer system, medium and the method for migration event management and running
CN104123190B (en) * 2014-07-23 2017-09-19 浪潮(北京)电子信息产业有限公司 The load-balancing method and device of Heterogeneous Cluster Environment
CN104123190A (en) * 2014-07-23 2014-10-29 浪潮(北京)电子信息产业有限公司 Load balance method and device of heterogeneous cluster system
CN104156271A (en) * 2014-08-01 2014-11-19 浪潮(北京)电子信息产业有限公司 Method and system for balancing cooperative computing cluster load
CN104156271B (en) * 2014-08-01 2017-12-08 浪潮(北京)电子信息产业有限公司 A kind of method and system of cooperated computing cluster load balance
CN104266657A (en) * 2014-09-12 2015-01-07 海华电子企业(中国)有限公司 Shortest path planning parallelization method based on cooperative computing of CPU and MIC
CN104266657B (en) * 2014-09-12 2017-08-04 海华电子企业(中国)有限公司 Shortest path planning parallel method based on CPU and MIC cooperated computings
CN104679593B (en) * 2015-03-13 2017-12-01 浪潮集团有限公司 Task scheduling optimization method based on SMP system
CN104679593A (en) * 2015-03-13 2015-06-03 浪潮集团有限公司 Task scheduling optimization method based on SMP system
CN104794194B (en) * 2015-04-17 2018-10-26 同济大学 A kind of distributed heterogeneous concurrent computational system towards large scale multimedia retrieval
CN105260241A (en) * 2015-10-23 2016-01-20 南京理工大学 Mutual cooperation method for processes in cluster system
CN105260241B (en) * 2015-10-23 2019-04-16 南京理工大学 The co-operating method of process in group system
CN105468455A (en) * 2015-11-23 2016-04-06 天脉聚源(北京)传媒科技有限公司 Dynamic task distribution method and apparatus for multiple devices
CN105468455B (en) * 2015-11-23 2018-12-21 天脉聚源(北京)传媒科技有限公司 A kind of method and device of the dynamic task allocation for more equipment
CN105893151B (en) * 2016-04-01 2019-03-08 浪潮电子信息产业股份有限公司 High-dimensional data stream processing method based on CPU + MIC heterogeneous platform
CN105893151A (en) * 2016-04-01 2016-08-24 浪潮电子信息产业股份有限公司 High-dimensional data stream processing method based on CPU + MIC heterogeneous platform
CN105955825A (en) * 2016-05-09 2016-09-21 深圳大学 Method for optimizing astronomy software gridding
CN105955825B (en) * 2016-05-09 2020-07-10 深圳大学 Method for optimizing astronomy software gridding
WO2017206591A1 (en) * 2016-06-01 2017-12-07 华为技术有限公司 Data processing system and data processing method
CN106383961B (en) * 2016-09-29 2019-07-19 中国南方电网有限责任公司电网技术研究中心 Large-Eddy Simulation optimized treatment method under CPU+MIC heterogeneous platform
CN106383961A (en) * 2016-09-29 2017-02-08 中国南方电网有限责任公司电网技术研究中心 Optimization processing method for large eddy simulation algorithm under CPU+MIC heterogeneous platform
CN106650315A (en) * 2016-11-30 2017-05-10 郑州云海信息技术有限公司 SIFT parallel algorithm based on CPU+MIC heterogeneous platform
CN106650315B (en) * 2016-11-30 2020-01-03 苏州浪潮智能科技有限公司 SIFT parallel processing method based on CPU + MIC heterogeneous platform
CN108958924A (en) * 2017-03-27 2018-12-07 爱思开海力士有限公司 Storage system and its operating method with delay distribution optimization
CN108958924B (en) * 2017-03-27 2022-02-11 爱思开海力士有限公司 Memory system with delay profile optimization and method of operating the same
WO2019072179A1 (en) * 2017-10-11 2019-04-18 Oppo广东移动通信有限公司 Application running control method and apparatus
CN109240866A (en) * 2018-09-10 2019-01-18 郑州云海信息技术有限公司 A kind of Performance tuning method based on server performance test
CN111130936A (en) * 2019-12-24 2020-05-08 杭州迪普科技股份有限公司 Method and device for testing load balancing algorithm
CN113296972A (en) * 2020-07-20 2021-08-24 阿里巴巴集团控股有限公司 Information registration method, computing device and storage medium
CN113850032A (en) * 2021-12-02 2021-12-28 中国空气动力研究与发展中心计算空气动力研究所 Load balancing method in numerical simulation calculation
CN113850032B (en) * 2021-12-02 2022-02-08 中国空气动力研究与发展中心计算空气动力研究所 Load balancing method in numerical simulation calculation
CN115718665A (en) * 2023-01-10 2023-02-28 北京卡普拉科技有限公司 Asynchronous I/O thread processor resource scheduling control method, device, medium and equipment

Similar Documents

Publication Publication Date Title
CN103279391A (en) Load balancing optimization method based on CPU (central processing unit) and MIC (many integrated core) framework processor cooperative computing
Pérez et al. CellSs: Making it easier to program the Cell Broadband Engine processor
Schaller et al. SWIFT: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100,000 cores
CN102902512A (en) Multi-thread parallel processing method based on multi-thread programming and message queue
CN103049245A (en) Software performance optimization method based on central processing unit (CPU) multi-core platform
WO2012067688A1 (en) Codeletset representation, manipulation, and execution-methods, system and apparatus
Ravi et al. A dynamic scheduling framework for emerging heterogeneous systems
Holk et al. Declarative parallel programming for GPUs
Ashraf et al. Empirical Analysis of HPC Using Different Programming Models.
Li et al. Optimizing massively parallel winograd convolution on arm processor
Christou et al. Earth system modelling on system-level heterogeneous architectures: EMAC (version 2.42) on the Dynamical Exascale Entry Platform (DEEP)
Moustafa et al. 3D cartesian transport sweep for massively parallel architectures with PARSEC
Zheng et al. Performance model for OpenMP parallelized loops
Chen et al. Integrated research of parallel computing: Status and future
Schmaus et al. System Software for Resource Arbitration on Future Many-Architectures
Yang et al. Performance‐based parallel loop self‐scheduling using hybrid OpenMP and MPI programming on multicore SMP clusters
Odajima et al. GPU/CPU work sharing with parallel language XcalableMP-dev for parallelized accelerated computing
Zhang et al. Design of a multithreaded Barnes-Hut algorithm for multicore clusters
Chandrashekar et al. Performance model of HPC application On CPU-GPU platform
Chandrashekhar et al. Performance study of OpenMP and hybrid programming models on CPU–GPU cluster
Stock et al. A GPU-accelerated boundary element method and vortex particle method
Bard et al. A simple GPU-accelerated two-dimensional MUSCL-Hancock solver for ideal magnetohydrodynamics
Chandrashekhar et al. Prediction Model for Scheduling an Irregular Graph Algorithms on CPU–GPU Hybrid Cluster Framework
Hippold et al. Task pool teams for implementing irregular algorithms on clusters of SMPs
Bosilca et al. Scalable dense linear algebra on heterogeneous hardware

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130904