CN102929723A - Method for dividing parallel program segment based on heterogeneous multi-core processor

Info

Publication number: CN102929723A
Application number: CN2012104413269A
Authority: CN
Inventors: 陈德训; 房田文; 吴宏
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2012-11-06
Filing date: 2012-11-06
Publication date: 2013-02-13
Anticipated expiration: 2032-11-06
Also published as: CN102929723B

Abstract

The invention provides a method for dividing parallel program segments based on a heterogeneous multi-core processor. The method comprises: performing data dependence analysis on the data of an application problem to determine whether program basic segments without data dependence exist; if such segments exist, calculating the computation amount of each program basic segment without data dependence; and performing a first-level multi-core division on those segments. The method addresses the adaptability of general scientific computing and engineering applications to polymorphic heterogeneous computer systems, and improves both the parallel efficiency and the load balance at the multi-core level.

Description

Parallel program segment division method based on a heterogeneous many-core processor

Technical field

The present invention relates to the field of computers, and in particular to a parallel program segment division method based on a heterogeneous many-core processor.

Background

In recent years, to increase system computing power, multi-core and many-core processors have gradually become the main building blocks of high-performance computers. At the same time, microprocessor design is moving toward solving specific problems: a heterogeneous processor adopts a heterogeneous design of its processor cores to suit the particular characteristics of problems in a specific domain. That is, the different types of operations in a typical workload are separated and handled by different processor cores so as to obtain high overall performance. Such structures, together with homogeneous multi-core processor systems, form very-large-scale polymorphic heterogeneous computing systems. Polymorphic heterogeneous systems offer strong computing power and a high energy-efficiency ratio, and are one of the important directions for solving major applications. However, the very large parallel scale and the complex polymorphic architecture of such systems pose a huge challenge to traditional high-performance computing applications, and matching parallel implementation techniques are lacking. Research on a multi-granularity method for dividing parallel program basic segments based on a heterogeneous many-core processor has therefore become a technical problem that those skilled in the art urgently need to solve.

As for parallel program implementation methods supporting heterogeneous computer systems, current parallel implementations for such systems are mostly based on a two-level parallel model, namely MPI (Message Passing Interface) parallelism plus many-core parallelism. The MPI level implements coarse-grained, process-level parallelism, while the many-core level mainly accelerates the core iterative part of the computation, that is, fine-grained many-core parallelism based only on core loops. In implementing and optimizing this two-level hybrid parallel programming model, MPI-level parallelism is mainly optimized by overlapping MPI communication with computation, and many-core-level parallelism is mainly tuned by methods such as data layout optimization, data transfer optimization, and overlapping computation with memory access. The speedup obtained for a specific problem depends closely on the problem's computational characteristics and on how the optimization techniques are implemented.

According to the literature surveyed so far, many-core parallel implementations of applications have been carried out only for the core computations that account for most of the computation amount, or for the complete solution process of some simple problems; there is no comprehensive, in-depth solution for the numerical simulation of many complex practical problems. As a result, the MPI-level parallel scale of the main parallel applications on heterogeneous computer systems is currently on the order of one hundred thousand, and it is difficult to support larger and more complex parallel computing applications. The overall speedup of practical application problems is mediocre.

In addition, the parallel efficiency of fine-grained many-core-level parallelism based only on core loops is limited by the scale of the problem actually being run. For example, suppose the size of one grid dimension of the actual problem is M and the number of slave cores on the heterogeneous many-core processor is N. With fine-grained many-core parallelism based on core loops, when M < N the computing power of (N - M) slave cores is not used; and when M > N and M is not an integer multiple of N, the load balance of the slave-core-level fine-grained parallelism is very poor. Existing core-loop fine-grained many-core parallel techniques therefore have difficulty fully exploiting the computing power of the slave cores.
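The effect described above can be illustrated with a short Python sketch. It assumes, purely for illustration, that the M iterations of one grid dimension are spread as evenly as possible over the N slave cores and that every iteration costs the same. The function name `loop_level_balance` and the sample values are hypothetical and do not come from the patent.

```python
import math


def loop_level_balance(m: int, n: int) -> dict:
    """Spread m equal-cost loop iterations over n slave cores as evenly as possible."""
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive")
    busy = min(m, n)                 # cores that receive at least one iteration
    max_load = math.ceil(m / n)      # iterations on the most heavily loaded core
    # Fraction of the available core time that does useful work while the
    # slowest core is still running.
    efficiency = m / (n * max_load)
    return {"idle_cores": n - busy, "max_load": max_load, "efficiency": efficiency}


print(loop_level_balance(48, 64))    # M < N: 16 slave cores stay idle
print(loop_level_balance(65, 64))    # M > N, not a multiple of N: poor balance
print(loop_level_balance(128, 64))   # M an integer multiple of N: balanced
```

In the second case a single extra iteration forces one slave core to do twice the work of the others, so roughly half of the available core time is wasted under these assumptions.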

More related content is disclosed in the Chinese patent application with publication number CN1783011A.

Summary of the invention

The technical problem to be solved by the present invention is the adaptability of general scientific computing and engineering applications to polymorphic heterogeneous computer systems, while also improving the parallel efficiency and the load balance at the many-core level.

To solve the above problem, the present invention provides a parallel program segment division method based on a heterogeneous many-core processor, comprising:

performing data dependence analysis on the data of an application problem to determine whether program basic segments without data dependence exist;

if program basic segments without data dependence exist, calculating the computation amount of each such segment; and, according to the computation amounts, performing a first-level many-core division on the program basic segments without data dependence.

Optionally, after the first-level many-core division is performed, the method further comprises:

analyzing each program basic segment without data dependence and decomposing it into a plurality of computation loops;

performing recurrence-dependence analysis on the data in each computation loop to determine whether computation loops without recurrence dependence exist;

if computation loops without recurrence dependence exist, performing a second-level many-core division on those computation loops.

Optionally, the computation amount comprises a floating-point computation amount and a fixed-point computation amount.

Optionally, performing the first-level many-core division comprises: performing first-level task division and load balancing at a first granularity.

Optionally, performing the first-level many-core division comprises: performing first-level task division and load balancing at a first granularity;

and performing the second-level many-core division comprises: performing second-level task division and load balancing at a second granularity.

Optionally, the second granularity is smaller than the first granularity.

Optionally, before the data dependence analysis is performed on the data of the application problem, the method further comprises:

performing computational analysis on the application problem;

based on the result of the computational analysis, performing an MPI-level parallel task division on the application problem at a third granularity.

Optionally, the third granularity is larger than the first granularity.

Compared with the prior art, the technical solution of the present invention has the following advantages:

By dividing parallel program segments at multiple fine-grained levels, the present invention makes the task division and the load of the slave cores more balanced, so that the computing power of the slave cores is exploited more fully and a better speedup is obtained. This addresses the adaptability of general scientific computing and engineering applications to polymorphic heterogeneous computer systems.

Description of the drawings

Fig. 1 is a schematic flowchart of a first embodiment of the parallel program segment division method based on a heterogeneous many-core processor according to the present invention;

Fig. 2 is a schematic flowchart of a second embodiment of the parallel program segment division method based on a heterogeneous many-core processor according to the present invention;

Fig. 3 is a schematic flowchart of a third embodiment of the parallel program segment division method based on a heterogeneous many-core processor according to the present invention.

Detailed description

Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit; the present invention is therefore not limited by the specific implementations disclosed below.

Furthermore, the present invention is described in detail with reference to schematic diagrams. In describing the embodiments, the schematic diagrams are examples given for ease of explanation and should not limit the scope of protection of the present invention.

To solve the technical problem described in the background, the present invention provides a parallel program segment division method based on a heterogeneous many-core processor. Fig. 1 is a schematic flowchart of a first embodiment of this method. As shown in Fig. 1, this embodiment comprises the following steps:

Step S101 is executed: data dependence analysis is performed on the data of the application problem to determine whether program basic segments without data dependence exist. Specifically, if program basic segment 1 is Y = F(X) and program basic segment 2 is Z = F(Y), the two program basic segments are considered to have a data dependence; they can only be computed serially and cannot be executed in parallel.
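The patent does not specify how the data dependence analysis is carried out. As a minimal sketch, one common way to formalize such a check is Bernstein's conditions on the sets of variables each segment reads and writes; this is a swapped-in technique chosen for illustration, and the `Segment` class, `has_data_dependence`, and `seg3` are hypothetical names that do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    reads: frozenset    # variables the segment reads
    writes: frozenset   # variables the segment writes


def has_data_dependence(a: Segment, b: Segment) -> bool:
    """Bernstein's conditions: two segments may run in parallel only if neither
    reads what the other writes and they do not write the same variable."""
    return bool(
        (a.writes & b.reads)      # flow dependence, e.g. Y = F(X) then Z = F(Y)
        or (a.reads & b.writes)   # anti-dependence
        or (a.writes & b.writes)  # output dependence
    )


seg1 = Segment("segment1", frozenset({"X"}), frozenset({"Y"}))  # Y = F(X)
seg2 = Segment("segment2", frozenset({"Y"}), frozenset({"Z"}))  # Z = F(Y)
seg3 = Segment("segment3", frozenset({"X"}), frozenset({"W"}))  # W = G(X), hypothetical

print(has_data_dependence(seg1, seg2))  # segments 1 and 2 must run serially
print(has_data_dependence(seg1, seg3))  # segments 1 and 3 could run in parallel
```

Under this model, segments 1 and 2 of the patent's example are dependent because segment 2 reads the Y that segment 1 writes, whereas the hypothetical segment 3 only shares a read of X with segment 1.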

Step S102 is executed: it is determined whether program basic segments without data dependence exist.

If not, there are no program basic segments that can be computed in parallel, the parallel program segments cannot be divided, and the method ends.

If so, that is, program basic segments that can be computed in parallel exist, step S103 is executed: the computation amount of each program basic segment without data dependence is calculated. Specifically, the computation amount comprises a floating-point computation amount and a fixed-point computation amount. Step S104 is then executed: according to the computation amounts, a first-level many-core division is performed on the program basic segments without data dependence. Performing the first-level many-core division comprises performing first-level task division and load balancing at a first granularity; that is, each basic segment is completed on one slave-core group, and the size of the slave-core group is determined by the computation amount of that basic segment. The first-level many-core division achieves fine-grained parallelism of the program basic segments, namely slave-core-group parallelism by program basic segment.

The technical solution of the present invention is further described below with reference to an example.

In this embodiment, step S101 determines that a certain application problem contains two program basic segments without data dependence, denoted module1 and module2. For illustration, the total number of slave cores participating in the parallel computation is taken to be 100. Executing step S103 shows that the computation amount of program basic segment module1 is 2 and that of program basic segment module2 is 3.

Step S104 is then executed: according to the computation amounts, the first-level many-core division is performed on program basic segments module1 and module2. Of the 100 slave cores, 40 are assigned to program basic segment module1 and form the first slave-core group; the other 60 are assigned to program basic segment module2 and form the second slave-core group.
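The allocation in this example can be reproduced with a short sketch. The patent states only that the size of each slave-core group is determined by the computation amount of its segment; the proportional rule and the largest-remainder rounding used here are assumptions, and `allocate_cores` is a hypothetical name.

```python
def allocate_cores(workloads: dict, total_cores: int) -> dict:
    """Assign slave cores to segments in proportion to their computation amounts.

    Integer workloads are assumed. Cores left over after rounding down are
    handed to the segments with the largest remainders, so the group sizes
    always add up to total_cores.
    """
    total_work = sum(workloads.values())
    if total_work <= 0:
        raise ValueError("total computation amount must be positive")
    alloc = {name: total_cores * w // total_work for name, w in workloads.items()}
    remainder = {name: total_cores * w % total_work for name, w in workloads.items()}
    leftover = total_cores - sum(alloc.values())
    for name in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        alloc[name] += 1
    return alloc


# Computation amounts 2 and 3 on 100 slave cores, as in the example above.
print(allocate_cores({"module1": 2, "module2": 3}, 100))
```

With very uneven computation amounts this rule can assign zero cores to a small segment; a practical implementation would probably enforce a minimum group size, which the patent does not discuss.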

It should be noted that this embodiment is intended to illustrate the technical solution, so a relatively simple number of slave cores and a simple division into program segments were chosen. Those skilled in the art will understand that in practical large-scale parallel computing the number of slave cores can reach the millions and the programs are more complex; the present invention places no specific limitation on this.

Fig. 2 is a schematic flowchart of a second embodiment of the parallel program segment division method based on a heterogeneous many-core processor according to the present invention. Unlike the first embodiment, in this embodiment, on the basis of the parallel division by program basic segment, each basic segment is further subdivided, and a second-level many-core division is performed on the core loops within the basic segment that have no recurrence dependence.

As shown in Fig. 2, this embodiment comprises the following steps:

Step S201 is executed: data dependence analysis is performed on the data of the application problem to determine whether program basic segments without data dependence exist.

Step S202 is executed: it is determined whether program basic segments without data dependence exist.

If not, the method ends.

If so, step S203 is executed: the computation amount of each program basic segment without data dependence is calculated.

Step S204 is executed: according to the computation amounts, the first-level many-core division is performed on the program basic segments without data dependence.

Step S205 is then executed: a program basic segment without data dependence is analyzed and decomposed into a plurality of computation loops.

Step S206 is executed: recurrence-dependence analysis is performed on the data in each computation loop to determine whether computation loops without recurrence dependence exist. Specifically, if a data variable inside the loop satisfies X_{i,j,k} = F(X_{i-1,j,k}, X_{i,j-1,k}, X_{i,j,k-1}), the variable is considered to have a recurrence dependence; otherwise the variable is considered to have no recurrence dependence.
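The recurrence test in step S206 can be sketched as follows. The sketch assumes a simplified loop model in which each loop body writes one array at index (i, j, k) and reads arrays at constant index offsets; this representation and the name `has_recurrence` are hypothetical. The check is deliberately conservative: any read of the written array at a non-zero offset is treated as a recurrence dependence, including offsets such as i+1 that the patent's formula does not mention.

```python
def has_recurrence(written: str, reads: list) -> bool:
    """Return True if the loop reads the array it writes at a shifted index.

    reads is a list of (array_name, (di, dj, dk)) pairs, where the offsets
    are relative to the index (i, j, k) being written.
    """
    return any(
        name == written and any(d != 0 for d in offset)
        for name, offset in reads
    )


# X[i,j,k] = F(X[i-1,j,k], X[i,j-1,k], X[i,j,k-1]): recurrence dependence
recurrent = has_recurrence(
    "X", [("X", (-1, 0, 0)), ("X", (0, -1, 0)), ("X", (0, 0, -1))]
)

# Y[i,j,k] = G(X[i-1,j,k], X[i,j,k]): reads only another array, no recurrence
independent = has_recurrence("Y", [("X", (-1, 0, 0)), ("X", (0, 0, 0))])

print(recurrent, independent)
```

A loop for which the check returns False is a candidate for the second-level many-core division, since its iterations do not consume values produced by other iterations of the same loop under this model.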

Step S207 is executed: it is determined whether the program basic segment contains core loops without recurrence dependence. If so, step S208 is executed: the second-level many-core division is performed on the computation loops without recurrence dependence, after which step S209 is executed. Specifically, performing the second-level many-core division comprises performing second-level task division and load balancing at a second granularity. The second granularity is smaller than the first granularity. The first-level many-core division is slave-core-group parallelism by basic segment, whereas the second-level many-core division is slave-core parallelism at the loop level within a basic segment, that is, a finer-grained parallelism built on top of the first-level many-core division.

If not, step S209 is executed directly: it is determined whether every program basic segment without data dependence has been processed. If so, the method ends. Otherwise, the flow loops back to step S205 and continues with the next program basic segment without data dependence.

This embodiment continues to use program basic segments module1 and module2 from the previous embodiment. For example, step S205 is executed: the first program basic segment without data dependence, module1, is analyzed and decomposed into two computation loops, denoted loop1 and loop2.

Step S206 is executed: recurrence-dependence analysis is performed on the data of computation loops loop1 and loop2, and it is determined that both loop1 and loop2 are computation loops without recurrence dependence.

Because program basic segment module1 contains the computation loops loop1 and loop2, which have no recurrence dependence, step S208 is executed: the second-level many-core division is performed on loop1 and loop2. The second-level many-core division can be based on the computation amounts of the computation loops. For example, in the previous embodiment 40 slave cores were allocated to program basic segment module1. Here they are further subdivided: 30 slave cores execute computation loop loop1 and the other 10 slave cores execute computation loop loop2.
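The two levels of division can be viewed as repeatedly splitting a block of slave-core IDs. The sketch below assumes that slave cores are numbered contiguously from 0 and that loop1 and loop2 have computation amounts in the ratio 3:1; the patent gives only the resulting 30/10 split, so both assumptions, as well as the name `split_range`, are hypothetical.

```python
def split_range(cores: range, weights: dict) -> dict:
    """Split a contiguous range of core IDs into sub-ranges proportional to weights.

    Cumulative rounding is used so that the sub-ranges exactly tile the input.
    """
    total = sum(weights.values())
    result = {}
    start, acc = cores.start, 0
    for name, w in weights.items():
        acc += w
        stop = cores.start + len(cores) * acc // total
        result[name] = range(start, stop)
        start = stop
    return result


# First level: 100 slave cores divided by segment computation amounts 2 : 3.
groups = split_range(range(100), {"module1": 2, "module2": 3})

# Second level: module1's group divided by assumed loop computation amounts 3 : 1.
subgroups = split_range(groups["module1"], {"loop1": 3, "loop2": 1})

print(groups)     # module1 -> cores 0..39, module2 -> cores 40..99
print(subgroups)  # loop1 -> cores 0..29, loop2 -> cores 30..39
```

Keeping each group contiguous is only a convenience of this sketch; the patent does not say which physical slave cores make up a group.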

Then, because program basic segment module2 has not yet been processed, step S205 is executed again: the program basic segment without data dependence, module2, is analyzed and decomposed into three computation loops, denoted loop1', loop2' and loop3'.

Step S206 is executed: recurrence-dependence analysis is performed on the data of computation loops loop1', loop2' and loop3', and it is determined that all three are computation loops with recurrence dependence. They must therefore be computed serially and cannot be executed in parallel, so no second-level many-core division is performed.
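The loop over steps S205 to S209 for this example can be sketched as a small driver. The `Loop` and `Segment` classes, the loop computation amounts, and the choice to record `None` for a segment that is not subdivided are all assumptions made for illustration; the patent describes the control flow but not a data model.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    name: str
    workload: int          # assumed computation amount of the loop
    has_recurrence: bool   # result of the step S206 analysis


@dataclass
class Segment:
    name: str
    cores: int                             # slave cores from the first-level division
    loops: list = field(default_factory=list)


def second_level_division(segments: list) -> dict:
    plan = {}
    for seg in segments:                                   # S205: next segment and its loops
        parallel = [l for l in seg.loops if not l.has_recurrence]   # S206
        if not parallel:                                   # S207: nothing to divide
            plan[seg.name] = None                          # segment keeps its whole group
            continue                                       # S209: move to the next segment
        total = sum(l.workload for l in parallel)
        # S208: divide the segment's cores by loop computation amount (rounded down).
        plan[seg.name] = {l.name: seg.cores * l.workload // total for l in parallel}
    return plan


module1 = Segment("module1", 40, [Loop("loop1", 3, False), Loop("loop2", 1, False)])
module2 = Segment("module2", 60, [Loop("loop1'", 1, True), Loop("loop2'", 1, True),
                                  Loop("loop3'", 1, True)])

print(second_level_division([module1, module2]))
```

Because the core counts are rounded down, a few slave cores can remain unassigned for other workload ratios; how to use them is left open here, as it is in the patent.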

Fig. 3 is a schematic flowchart of a third embodiment of the parallel program segment division method based on a heterogeneous many-core processor according to the present invention. Unlike the second embodiment, in addition to dividing the parallel basic segments at two different granularities, this embodiment also includes a coarse-grained division at the MPI process level.

As shown in Fig. 3, this embodiment comprises the following steps:

Step S301 is executed: computational analysis is performed on the application problem.

Step S302 is executed: based on the result of the computational analysis, an MPI-level parallel task division is performed on the application problem at a third granularity. Specifically, the MPI-level parallelism is a coarse-grained division and mainly refers to MPI process-level parallelism at the domain-decomposition level. The third granularity is larger than the first granularity used for the parallel division by basic segment, and naturally also larger than the second granularity used for the parallel division by computation loop.
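As a rough illustration of domain decomposition at the MPI process level, the sketch below computes which block of a one-dimensional grid each process would own. It does not call any MPI library; `rank` and `n_procs` merely stand in for the values an MPI program would obtain from its communicator, and the one-dimensional block layout and the name `block_range` are assumptions, since the patent does not describe the decomposition itself.

```python
def block_range(n_cells: int, n_procs: int, rank: int) -> range:
    """Return the contiguous block of grid cells owned by process `rank`.

    The first (n_cells % n_procs) processes receive one extra cell, so block
    sizes differ by at most one.
    """
    if not 0 <= rank < n_procs:
        raise ValueError("rank out of range")
    base, extra = divmod(n_cells, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# Ten grid cells over four processes: blocks of sizes 3, 3, 2, 2.
for rank in range(4):
    print(rank, list(block_range(10, 4, rank)))
```

Each process would then apply the first-level and second-level many-core divisions to the program basic segments that operate on its own block.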

Step S303 is executed: data dependence analysis is performed on the data of the application problem to determine whether program basic segments without data dependence exist.

Step S304 is executed: it is determined whether program basic segments without data dependence exist. If not, the method ends.

If so, step S305 is executed: the computation amount of each program basic segment without data dependence is calculated.

Step S306 is executed: according to the computation amounts, the first-level many-core division is performed on the program basic segments without data dependence.

Step S307 is executed: a program basic segment without data dependence is analyzed and decomposed into a plurality of computation loops.

Step S308 is executed: recurrence-dependence analysis is performed on the data in each computation loop to determine whether computation loops without recurrence dependence exist.

Step S309 is executed: it is determined whether the program basic segment contains core loops without recurrence dependence. If so, step S310 is executed: the second-level many-core division is performed on the computation loops without recurrence dependence, after which step S311 is executed.

If not, step S311 is executed directly: it is determined whether every program basic segment without data dependence has been processed. If so, the method ends. Otherwise, the flow loops back to step S307 and continues with the next program basic segment without data dependence.

It should be noted that, as those skilled in the art will understand, although the multi-level fine-grained parallelism of the present invention can effectively improve the slave-core speedup and exploit the computing power of the slave cores more fully, it may also increase the storage space required and introduce a small amount of redundant computation. Which levels and granularities to use for dividing the parallel program segments therefore requires the operator to weigh the degree of parallelism against the amount of storage in light of the actual situation; the present invention places no specific limitation on this.

It should also be noted that, from the above description of the embodiments, those skilled in the art can clearly understand that part or all of the present invention can be implemented by software in combination with a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may include one or more machine-readable media storing machine-executable instructions which, when executed by one or more machines such as a computer, a computer network, or other electronic devices, cause the one or more machines to perform operations according to embodiments of the present invention. Machine-readable media may include, but are not limited to, floppy disks, optical disks, CD-ROM (compact disc read-only memory), magneto-optical disks, ROM (read-only memory), RAM (random access memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), magnetic or optical cards, flash memory, or other types of media suitable for storing machine-executable instructions.

The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.

The present invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

Building on the coarse-grained MPI-level division of the prior art, the present invention performs a fine-grained division by program basic segment and can further perform an even finer-grained division within the core loops of each program basic segment. It thereby achieves multi-level fine-grained program segment division in large-scale heterogeneous parallel computing and improves the load balance of the slave cores, so that the computing power of the slave cores is exploited more fully and the overall speedup of the application problem is effectively improved. This provides effective support for the efficient execution of complete general scientific computing and engineering application problems.

In addition, besides giving heterogeneous applications better parallelism, the present invention also summarizes general rules and methods for the many-core parallel implementation of general scientific computing problems, together with general optimization methods and means for improving many-core parallel efficiency, and thus provides reference and guidance for the basic programming software systems and parallel compiler software systems of large-scale heterogeneous computer systems.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the spirit and scope of the present invention, use the methods and technical content disclosed above to make possible changes and modifications to the technical solution of the present invention. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, falls within the scope of protection of the technical solution of the present invention.

Claims

1. A parallel program segment division method based on a heterogeneous many-core processor, characterized by comprising:

2. The parallel program segment division method based on a heterogeneous many-core processor according to claim 1, characterized in that, after the first-level many-core division is performed, the method further comprises:

3. The parallel program segment division method based on a heterogeneous many-core processor according to claim 1, characterized in that:

the computation amount comprises a floating-point computation amount and a fixed-point computation amount.

4. The parallel program segment division method based on a heterogeneous many-core processor according to claim 1, characterized in that:

performing the first-level many-core division comprises: performing first-level task division and load balancing at a first granularity.

5. The parallel program segment division method based on a heterogeneous many-core processor according to claim 2, characterized in that:

performing the first-level many-core division comprises: performing first-level task division and load balancing at a first granularity;

6. The parallel program segment division method based on a heterogeneous many-core processor according to claim 5, characterized in that:

the second granularity is smaller than the first granularity.

7. The parallel program segment division method based on a heterogeneous many-core processor according to any one of claims 1 to 6, characterized in that, before the data dependence analysis is performed on the data of the application problem, the method further comprises:

performing computational analysis on the application problem;

8. The parallel program segment division method based on a heterogeneous many-core processor according to claim 7, characterized in that:

the third granularity is larger than the first granularity.