CN102929723B - Method for dividing parallel program segment based on heterogeneous multi-core processor

Info

Publication number: CN102929723B
Application number: CN201210441326.9A
Authority: CN
Inventors: 陈德训; 房田文; 吴宏
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2012-11-06
Filing date: 2012-11-06
Publication date: 2015-07-08
Anticipated expiration: 2032-11-06
Also published as: CN102929723A

Abstract

The invention provides a method for dividing parallel program segments based on a heterogeneous many-core processor. The method comprises: performing data dependence analysis on the data of an application problem to determine whether program basic segments without data dependence exist; if such program basic segments exist, calculating the calculation amount of each program basic segment without data dependence; and performing a first-level many-core division on the program basic segments without data dependence. The method addresses the adaptability problem of general scientific computing and engineering applications on polymorphic heterogeneous computer systems, and improves both the parallel efficiency and the load balance at the many-core level.

Description

Method for dividing parallel program segments based on a heterogeneous many-core processor

Technical field

The present invention relates to the field of computers, and in particular to a method for dividing parallel program segments based on a heterogeneous many-core processor.

Background art

In recent years, to increase system computing power, multi-core and many-core processors have gradually become the main building blocks of high-performance computers. At the same time, designing for specific problems is one development direction of microprocessor design: a heterogeneous processor uses a heterogeneous design of its processor cores tailored to the particular characteristics of a specific problem domain, distinguishing the different types of operations in the workload and handling them on different processor cores to obtain high overall performance. Such structures, together with homogeneous multi-core processor systems, form very-large-scale polymorphic heterogeneous computing systems. Polymorphic heterogeneous systems offer strong computing power and high energy efficiency, and are one of the important development directions for solving major applications. However, their very large parallel scale and complex polymorphic architecture pose a huge challenge to traditional high-performance computing applications, and matching parallel implementation techniques are lacking. A multi-granularity method for dividing parallel program basic segments based on a heterogeneous many-core processor has therefore become a technical problem that those skilled in the art urgently need to solve.

Among parallel program implementation methods that support heterogeneous computer systems, current parallel implementations are mostly based on a two-level parallel model, namely MPI (Message Passing Interface) parallelism plus many-core parallelism. The MPI level implements coarse-grained, process-level parallelism, while the many-core level mainly accelerates the core iterative part of the computation; that is, fine-grained many-core parallelism is applied only to the core loops. In implementing and optimizing this two-level hybrid parallel programming model, the MPI level is mainly optimized by overlapping MPI communication with computation, and the many-core level is mainly tuned through data layout optimization, data transfer optimization, and overlapping computation with memory access. The speedup obtained for a specific problem depends closely on the computational characteristics of the problem and on how the optimization techniques are implemented.

According to the literature surveyed so far, many-core parallel implementations of applications have been carried out only for the core computations that account for most of the calculation amount, or for the complete solution process of certain simple problems. There is no comprehensive, in-depth solution for the numerical simulation of many complex practical problems. As a result, the MPI-level parallel scale of the main parallel applications on heterogeneous computer systems is currently on the order of 100,000, which makes it difficult to support larger and more complex parallel computing applications, and the overall speedup for practical application problems is mediocre.

In addition, when fine-grained many-core parallelism is applied only to the core loops, the parallel efficiency is limited by the scale of the problem actually being run. For example, suppose the grid dimension of the actual problem is M and the number of slave cores on the heterogeneous many-core processor is N. If fine-grained many-core parallelism is applied to the core loop, then when M<N the computing power of (N-M) slave cores goes unused, and when M>N and M is not an integer multiple of N, the load balance of the fine-grained slave-core parallelism is poor. Existing fine-grained many-core parallelization of core loops therefore struggles to make full use of the computing power of the slave cores.
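The effect described above can be illustrated with a small calculation. The following is a minimal sketch, not part of the patent: it assumes that each of the M loop iterations costs the same and that iterations are spread over the N slave cores as evenly as possible. The function name `loop_level_balance` and the returned field names are hypothetical.

```python
def loop_level_balance(m: int, n: int) -> dict:
    """Spread m equal-cost loop iterations over n slave cores as evenly as possible."""
    if m <= 0 or n <= 0:
        raise ValueError("m and n must be positive")
    busy_cores = min(m, n)        # cores that receive at least one iteration
    max_load = (m + n - 1) // n   # iterations on the most loaded core (ceiling of m/n)
    min_load = m // n             # iterations on the least loaded core
    # The loop finishes when the most loaded core finishes, so efficiency is
    # useful work divided by (number of cores * load of the slowest core).
    efficiency = m / (n * max_load)
    return {
        "idle_cores": n - busy_cores,
        "max_load": max_load,
        "min_load": min_load,
        "efficiency": efficiency,
    }


print(loop_level_balance(64, 100))    # M < N: some slave cores receive no work
print(loop_level_balance(150, 100))   # M > N, not a multiple of N: uneven loads
print(loop_level_balance(200, 100))   # M is a multiple of N: balanced
```

Under these assumptions, M=64 on N=100 leaves 36 slave cores idle, and M=150 on N=100 gives some cores two iterations and others one, so the efficiency drops to 0.75.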

More related content is disclosed in the Chinese patent application with publication number CN1783011A.

Summary of the invention

The technical problem to be solved by the present invention is the adaptability of general scientific computing and engineering applications on polymorphic heterogeneous computer systems, while also improving the parallel efficiency and load balance at the many-core level.

To solve the above problem, the present invention provides a method for dividing parallel program segments based on a heterogeneous many-core processor, comprising:

performing data dependence analysis on the data of an application problem to determine whether program basic segments without data dependence exist;

if program basic segments without data dependence exist, calculating the calculation amount of each program basic segment without data dependence, and performing a first-level many-core division on the program basic segments without data dependence according to the calculation amount.

Optionally, after the first-level many-core division is performed, the method further comprises:

analyzing each program basic segment without data dependence and decomposing the program basic segment into a plurality of computation loops;

performing data recurrence dependence analysis on the data in each computation loop to determine whether computation loops without data recurrence dependence exist;

if computation loops without data recurrence dependence exist, performing a second-level many-core division on the computation loops without data recurrence dependence.

Optionally, the calculation amount comprises a floating-point calculation amount and a fixed-point calculation amount.

Optionally, performing the first-level many-core division comprises: performing first-level task division and load balancing at a first granularity.

Optionally, performing the first-level many-core division comprises: performing first-level task division and load balancing at a first granularity;

and performing the second-level many-core division comprises: performing second-level task division and load balancing at a second granularity.

Optionally, the second granularity is smaller than the first granularity.

Optionally, before the data dependence analysis is performed on the data of the application problem, the method further comprises:

performing computational analysis on the application problem;

based on the result of the computational analysis, performing MPI-level parallel task division on the application problem at a third granularity.

Optionally, the third granularity is greater than the first granularity.

Compared with the prior art, the technical solution of the present invention has the following advantages:

By dividing parallel program segments at multiple fine-grained levels, the present invention makes the task division and load of the slave cores more balanced, so that the computing power of the slave cores is exploited more fully and a good speedup is obtained, thereby addressing the adaptability problem of general scientific computing and engineering applications on polymorphic heterogeneous computer systems.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a first embodiment of the method for dividing parallel program segments based on a heterogeneous many-core processor according to the present invention;

Fig. 2 is a schematic flowchart of a second embodiment of the method for dividing parallel program segments based on a heterogeneous many-core processor according to the present invention;

Fig. 3 is a schematic flowchart of a third embodiment of the method for dividing parallel program segments based on a heterogeneous many-core processor according to the present invention.

Detailed description of the embodiments

Many specific details are set forth in the following description to provide a full understanding of the present invention. However, the present invention can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the present invention. The present invention is therefore not limited by the specific embodiments disclosed below.

Furthermore, the present invention is described in detail with reference to schematic diagrams. When the embodiments are described in detail, the diagrams serve as examples for ease of illustration and should not limit the scope of protection of the present invention.

To solve the technical problem described in the background art, the present invention provides a method for dividing parallel program segments based on a heterogeneous many-core processor. Fig. 1 is a schematic flowchart of the first embodiment of this method. As shown in Fig. 1, this embodiment comprises the following steps:

Step S101: perform data dependence analysis on the data of the application problem to determine whether program basic segments without data dependence exist. Specifically, if program basic segment 1 is Y=F(X) and program basic segment 2 is Z=F(Y), the two program basic segments are considered to have a data dependence; they can only be computed serially and cannot be executed in parallel.
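The dependence test in step S101 can be sketched as follows. The patent does not specify how the analysis is carried out; this sketch assumes each basic segment is summarized by the set of variables it reads and the set it writes, and applies Bernstein's conditions (no variable written by one segment is read or written by the other). The `Segment` class, `has_data_dependence`, and the third segment `seg3` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A program basic segment summarized by the variables it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def has_data_dependence(a: Segment, b: Segment) -> bool:
    """Return True if a and b cannot safely run in parallel (Bernstein's conditions)."""
    flow = a.writes & b.reads      # b reads something a writes
    anti = a.reads & b.writes      # b overwrites something a reads
    output = a.writes & b.writes   # both write the same variable
    return bool(flow or anti or output)


# Segment 1 computes Y = F(X); segment 2 computes Z = F(Y), as in the text.
seg1 = Segment("segment1", frozenset({"X"}), frozenset({"Y"}))
seg2 = Segment("segment2", frozenset({"Y"}), frozenset({"Z"}))
# Hypothetical third segment W = G(X): it only shares a read-only input with segment 1.
seg3 = Segment("segment3", frozenset({"X"}), frozenset({"W"}))

print(has_data_dependence(seg1, seg2))  # segment 2 reads Y, which segment 1 writes
print(has_data_dependence(seg1, seg3))  # only the input X is shared, and neither writes it
```

In this sketch, segments 1 and 2 are reported as dependent and must run serially, while segments 1 and 3 are candidates for parallel execution.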

Step S102: determine whether program basic segments without data dependence exist.

If not, there are no program basic segments that can be computed in parallel, the parallel program segments cannot be divided, and the method ends.

If so, that is, if there are program basic segments that can be computed in parallel, step S103 is performed: calculate the calculation amount of each program basic segment without data dependence. Specifically, the calculation amount comprises a floating-point calculation amount and a fixed-point calculation amount. Step S104 is then performed: according to the calculation amount, perform a first-level many-core division on the program basic segments without data dependence. Performing the first-level many-core division comprises performing first-level task division and load balancing at a first granularity; that is, each basic segment is executed by one slave core group, and the size of that slave core group is determined by the calculation amount of the basic segment. The first-level many-core division achieves fine-grained parallelism of program basic segments, namely slave-core-group parallelism organized by program basic segment.

The technical solution of the present invention is further described below with reference to an example.

In this embodiment, step S101 determines that a certain application problem has 2 program basic segments without data dependence, denoted module1 and module2. The total number of slave cores participating in the parallel computation is taken to be 100. Step S103 finds that the calculation amount of program basic segment module1 is 2 and that of program basic segment module2 is 3.

Step S104 is then performed: according to the calculation amounts, a first-level many-core division is performed on program basic segments module1 and module2. Of the 100 slave cores, 40 are assigned to module1, forming a first slave core group, and 60 are assigned to module2, forming a second slave core group.
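The 40/60 split above follows from dividing the slave cores in proportion to the calculation amounts (2:3). A minimal sketch of such a proportional division is shown below. The patent does not say how to round when the shares are not whole numbers; the largest-remainder rule used here is an assumption, as is the function name `divide_cores`.

```python
def divide_cores(workloads: dict[str, float], total_cores: int) -> dict[str, int]:
    """Assign slave cores to basic segments in proportion to their calculation amounts."""
    total_work = sum(workloads.values())
    if total_work <= 0:
        raise ValueError("total calculation amount must be positive")
    # Exact (possibly fractional) share of each segment.
    exact = {name: total_cores * w / total_work for name, w in workloads.items()}
    # Round every share down first.
    groups = {name: int(share) for name, share in exact.items()}
    # Assumption: hand out the remaining cores to the largest fractional remainders.
    leftover = total_cores - sum(groups.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - groups[name], reverse=True)
    for name in by_remainder[:leftover]:
        groups[name] += 1
    return groups


print(divide_cores({"module1": 2, "module2": 3}, 100))
```

For the calculation amounts in the example this reproduces the slave core groups of 40 and 60. The sketch does not guard against a very small segment receiving zero cores; a real implementation would need a policy for that case.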

It should be noted that this embodiment is intended to illustrate the technical solution, so a comparatively simple number of slave cores and a simple program segment division were chosen. Those skilled in the art will understand that in actual large-scale parallel computing the number of slave cores can reach the millions and the program is also more complex; the present invention places no specific limitation on this.

Fig. 2 is a schematic flowchart of the second embodiment of the method for dividing parallel program segments based on a heterogeneous many-core processor according to the present invention. Unlike the first embodiment, this embodiment, on top of the parallel division by program basic segment, further subdivides each basic segment and performs a second-level many-core division on the core loops without data recurrence dependence within the basic segment.

As shown in Fig. 2, this embodiment comprises the following steps:

Step S201: perform data dependence analysis on the data of the application problem to determine whether program basic segments without data dependence exist.

Step S202: determine whether program basic segments without data dependence exist.

If not, the method ends.

If so, step S203 is performed: calculate the calculation amount of each program basic segment without data dependence.

Step S204: according to the calculation amount, perform a first-level many-core division on the program basic segments without data dependence.

Step S205: analyze a program basic segment without data dependence and decompose it into a plurality of computation loops.

Step S206: perform data recurrence dependence analysis on the data in each computation loop to determine whether computation loops without data recurrence dependence exist. Specifically, if a data variable inside the loop satisfies X_{i,j,k} = F(X_{i-1,j,k}, X_{i,j-1,k}, X_{i,j,k-1}), the variable is considered to have a recurrence dependence; otherwise the variable is considered to have no recurrence dependence.
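The difference between a loop with and without recurrence dependence can be demonstrated by running the iterations in a different order, which is what happens implicitly when iterations are spread over slave cores. The following sketch is an illustration only: it reduces the three-dimensional form X_{i,j,k} to a one-dimensional loop, and the loop bodies and function names are hypothetical.

```python
def independent_loop(x, order):
    """y[i] depends only on x[i]: no recurrence dependence between iterations."""
    y = [0.0] * len(x)
    for i in order:
        y[i] = 2.0 * x[i] + 1.0
    return y


def recurrent_loop(x, order):
    """x[i] depends on x[i-1]: a one-dimensional analogue of X_{i,j,k} = F(X_{i-1,j,k}, ...)."""
    x = list(x)  # work on a copy so the caller's data is untouched
    for i in order:
        if i > 0:
            x[i] = x[i] + x[i - 1]
    return x


data = [1.0, 2.0, 3.0, 4.0]
forward = list(range(len(data)))
backward = list(reversed(forward))

# Without recurrence, the iteration order does not matter.
print(independent_loop(data, forward) == independent_loop(data, backward))
# With recurrence, a different order gives a different result.
print(recurrent_loop(data, forward), recurrent_loop(data, backward))
```

Because the result of the second loop depends on the order of its iterations, its iterations cannot simply be handed to different slave cores, which is why only loops without recurrence dependence are candidates for the second-level division.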

Step S207: determine whether the program basic segment contains core loops without data recurrence dependence. If so, step S208 is performed: perform a second-level many-core division on the computation loops without data recurrence dependence, and then perform step S209. Specifically, performing the second-level many-core division comprises performing second-level task division and load balancing at a second granularity. The second granularity is smaller than the first granularity. The first-level many-core division is slave-core-group parallelism organized by basic segment, whereas the second-level many-core division is slave-core parallelism at the level of the loops inside a basic segment, which is a finer-grained parallelism built on top of the first-level many-core division.

If not, step S209 is performed directly: determine whether every program basic segment without data dependence has been processed. If so, the method ends. Otherwise, the method loops back to step S205 and continues with the next program basic segment without data dependence.

This embodiment continues to use program basic segments module1 and module2 from the previous embodiment. For example, step S205 is performed: the program basic segment module1 without data dependence is analyzed first and decomposed into 2 computation loops, denoted loop1 and loop2.

Step S206 is performed: data recurrence dependence analysis on the data of computation loops loop1 and loop2 determines that both loop1 and loop2 are computation loops without recurrence dependence.

Since program basic segment module1 contains the computation loops loop1 and loop2 without recurrence dependence, step S208 is performed: a second-level many-core division is performed on loop1 and loop2. The second-level many-core division can be performed according to the calculation amounts of the computation loops. For example, module1 was allocated 40 slave cores in the previous embodiment; here these are further subdivided so that 30 slave cores execute computation loop loop1 and the other 10 slave cores execute computation loop loop2.

Then, since program basic segment module2 has not yet been processed, step S205 is performed again: the program basic segment module2 without data dependence is analyzed and decomposed into 3 computation loops, denoted loop1′, loop2′ and loop3′.

Step S206 is performed: data recurrence dependence analysis on the data of computation loops loop1′, loop2′ and loop3′ determines that all three are computation loops with recurrence dependence. They must therefore be computed serially and cannot be executed in parallel, so no second-level many-core division is performed.
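The worked example of the second embodiment can be traced end to end with a short sketch. This is an illustration under stated assumptions, not the patented implementation: the dependence analyses of steps S201 and S206 are replaced by precomputed inputs, the calculation amounts 3 and 1 for loop1 and loop2 are assumed values chosen to reproduce the 30/10 split in the text, and all class and function names are hypothetical. The handling of a segment that mixes loops with and without recurrence dependence is not described in the source; here only the dependence-free loops receive a second-level division.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    name: str
    work: float            # calculation amount of the loop (assumed known)
    has_recurrence: bool   # result of the S206 analysis (assumed known)


@dataclass
class Segment:
    name: str
    work: float            # calculation amount of the segment (S203)
    loops: list = field(default_factory=list)


def proportional(work: dict, cores: int) -> dict:
    """Split cores in proportion to work; any rounding leftover goes to the largest item."""
    total = sum(work.values())
    shares = {name: int(cores * w / total) for name, w in work.items()}
    largest = max(work, key=work.get)
    shares[largest] += cores - sum(shares.values())
    return shares


def divide(segments: list, total_cores: int) -> dict:
    """Two-level division; `segments` are assumed to be free of mutual data dependence."""
    if not segments:                       # S202: nothing can run in parallel
        return {}
    # S203/S204: first-level division into slave core groups by segment.
    groups = proportional({s.name: s.work for s in segments}, total_cores)
    plan = {}
    for seg in segments:                   # S205..S209: process each segment in turn
        free = [lp for lp in seg.loops if not lp.has_recurrence]   # S206/S207
        if free:                           # S208: second-level division inside the group
            plan[seg.name] = proportional({lp.name: lp.work for lp in free}, groups[seg.name])
        else:                              # no divisible loop: the group stays whole
            plan[seg.name] = {seg.name: groups[seg.name]}
    return plan


module1 = Segment("module1", 2, [Loop("loop1", 3, False), Loop("loop2", 1, False)])
module2 = Segment("module2", 3, [Loop("loop1'", 1, True), Loop("loop2'", 1, True),
                                 Loop("loop3'", 1, True)])
plan = divide([module1, module2], 100)
print(plan)
```

With these inputs, module1's group of 40 slave cores is subdivided into 30 for loop1 and 10 for loop2, while module2 keeps its group of 60 slave cores undivided.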

Fig. 3 is a schematic flowchart of the third embodiment of the method for dividing parallel program segments based on a heterogeneous many-core processor according to the present invention. Unlike the second embodiment, this embodiment, in addition to dividing the parallel basic segments at two different granularities, further includes a coarse-grained division at the MPI process level.

As shown in Fig. 3, this embodiment comprises the following steps:

Step S301: perform computational analysis on the application problem.

Step S302: based on the result of the computational analysis, perform MPI-level parallel task division on the application problem at a third granularity. Specifically, the MPI-level parallelism is a coarse-grained division and mainly refers to MPI process-level parallelism at the domain decomposition level. The third granularity is greater than the first granularity used for parallel division by basic segment, and naturally also greater than the second granularity used for parallel division by computation loop.
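As an illustration of coarse-grained, domain-decomposition parallelism at the MPI process level, the sketch below computes which contiguous block of a one-dimensional grid each MPI rank would own. The patent does not prescribe a decomposition scheme; the one-dimensional block decomposition, the function name `block_range`, and the grid and process counts are assumptions. No MPI library is called; `rank` and `size` stand in for the values an MPI program would obtain from its communicator.

```python
def block_range(n: int, size: int, rank: int) -> range:
    """Return the grid indices owned by `rank` when n points are split over `size` processes."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    base, extra = divmod(n, size)
    # The first `extra` ranks each take one additional point.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# Hypothetical example: 10 grid points distributed over 4 MPI processes.
for rank in range(4):
    owned = block_range(10, 4, rank)
    print(rank, owned.start, owned.stop)
```

Each process would then apply the first-level and second-level many-core divisions to its own subdomain, which is why the third granularity is the coarsest of the three.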

Step S303: perform data dependence analysis on the data of the application problem to determine whether program basic segments without data dependence exist.

Step S304: determine whether program basic segments without data dependence exist. If not, the method ends.

If so, step S305 is performed: calculate the calculation amount of each program basic segment without data dependence.

Step S306: according to the calculation amount, perform a first-level many-core division on the program basic segments without data dependence.

Step S307: analyze a program basic segment without data dependence and decompose it into a plurality of computation loops.

Step S308: perform data recurrence dependence analysis on the data in each computation loop to determine whether computation loops without data recurrence dependence exist.

Step S309: determine whether the program basic segment contains core loops without data recurrence dependence. If so, step S310 is performed: perform a second-level many-core division on the computation loops without data recurrence dependence, and then perform step S311.

If not, step S311 is performed directly: determine whether every program basic segment without data dependence has been processed. If so, the method ends. Otherwise, the method loops back to step S307 and continues with the next program basic segment without data dependence.

It should be noted that, as those skilled in the art will understand, although the multi-level fine-grained parallelism of the present invention can effectively improve the slave-core speedup and exploit the computing power of the slave cores more fully, it may also increase the storage space required and introduce a small amount of redundant computation. Which levels and granularities to use when dividing parallel program segments therefore requires the operator to strike a balance between degree of parallelism and storage according to the actual situation; the present invention places no specific limitation on this.

It should be noted that, from the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented partly or entirely in software combined with the necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. This computer software product may comprise one or more machine-readable media storing machine-executable instructions which, when executed by one or more machines such as a computer, a computer network or other electronic devices, cause the one or more machines to perform operations according to embodiments of the present invention. Machine-readable media may include, but are not limited to, floppy disks, optical discs, CD-ROM (compact disc read-only memory), magneto-optical disks, ROM (read-only memory), RAM (random access memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), magnetic or optical cards, flash memory, or other types of media suitable for storing machine-executable instructions.

The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.

The present invention can be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

Building on the coarse-grained MPI-level division of the prior art, the present invention performs a fine-grained division by program basic segment and can further perform an even finer-grained division on the core loops of each program basic segment. This achieves multi-level, fine-grained program segment division in large-scale heterogeneous parallel computing, improves the load balance of the slave cores, exploits their computing power more fully, and effectively improves the overall speedup of the application problem, providing effective support for the efficient execution of complete general scientific computing and engineering application problems.

In addition, besides providing better parallelism for heterogeneous applications themselves, the present invention also summarizes general rules for many-core parallel implementation methods and solutions for general scientific computing problems, together with some general optimization methods and means for improving many-core parallel efficiency, which can serve as a reference and guideline for the basic software systems and parallel compilation software systems of large-scale heterogeneous computer systems.

Although the present invention is disclosed above by way of preferred embodiments, they are not intended to limit the present invention. Any person skilled in the art may, without departing from the spirit and scope of the present invention, use the methods and technical content disclosed above to make possible changes and modifications to the technical solution of the present invention. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, falls within the scope of protection of the technical solution of the present invention.

Claims

1. A method for dividing parallel program segments based on a heterogeneous many-core processor, characterized by comprising:

if the program basic segments without data dependence exist, calculating the calculation amount of each program basic segment without data dependence; and performing a first-level many-core division on the program basic segments without data dependence according to the calculation amount;

after the first-level many-core division is performed, analyzing each program basic segment without data dependence and decomposing the program basic segment into a plurality of computation loops;

2. The method for dividing parallel program segments based on a heterogeneous many-core processor according to claim 1, characterized in that:

the calculation amount comprises a floating-point calculation amount and a fixed-point calculation amount.

3. The method for dividing parallel program segments based on a heterogeneous many-core processor according to claim 1, characterized in that:

performing the first-level many-core division comprises: performing first-level task division and load balancing at a first granularity;

4. The method for dividing parallel program segments based on a heterogeneous many-core processor according to claim 3, characterized in that:

the second granularity is smaller than the first granularity.

5. The method for dividing parallel program segments based on a heterogeneous many-core processor according to claim 1 or 2, characterized in that, before the data dependence analysis is performed on the data of the application problem, the method further comprises:

performing computational analysis on the application problem;

6. The method for dividing parallel program segments based on a heterogeneous many-core processor according to claim 3 or 4, characterized in that, before the data dependence analysis is performed on the data of the application problem, the method further comprises:

performing computational analysis on the application problem;

7. The method for dividing parallel program segments based on a heterogeneous many-core processor according to claim 6, characterized in that:

the third granularity is greater than the first granularity.