CN110222007A

CN110222007A - An accelerated execution method based on the Sunway (Shenwei) many-core processor

Info

Publication number: CN110222007A
Application number: CN201910536855.9A
Authority: CN
Inventors: 潘景山; 刘弢; 王利; 郭强; 庄园; 曾云辉
Original assignee: Shandong Computer Science Center
Current assignee: Shandong Computer Science Center
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2019-09-10
Anticipated expiration: 2039-06-20
Also published as: CN110222007B

Abstract

The present invention relates to an accelerated execution method based on the Sunway many-core processor, comprising: A. determining the program context dependences among program segment A, program segment B and program segment C; if program context dependences exist among all three of program segments A, B and C, executing them sequentially; otherwise, adjusting the execution order of program segments A, B and C and then executing them; B. performing step A on each subsequent group of three consecutive program segments until the whole program has been executed. The invention determines the program context dependences between program segments and between program sub-segments and handles each case flexibly. It introduces a communication-lock synchronization mechanism, saves master-core waiting time, and achieves parallel processing on the master core and the core group. During program execution it reduces the number of times core-group threads must be spawned and joined, improving the execution efficiency of the program.

Description

An accelerated execution method based on the Sunway many-core processor

Technical field

The present invention relates to the technical fields of high-performance computing, parallel computing and computer architecture, and in particular to an accelerated execution method based on the Sunway many-core processor.

Background art

The Sunway many-core processor is a representative domestic high-performance processor: a high-performance computing chip independently developed in China. The Sunway TaihuLight supercomputer, which ranks among the world's leaders in computing capability, uses more than 40,000 Sunway many-core processors.

Each Sunway many-core processor chip (Sunway SW26010) contains 4 core groups connected by a network-on-chip. Each core group consists mainly of a memory controller, a management unit, 1 master core and 64 slave cores. The 64 slave cores are connected in an 8 × 8 mesh topology. Each slave core in each core group has 64 KB of local data memory (LDM), as shown in Fig. 1.

Because the Sunway many-core processor has a large number of slave cores while the local data memory of each slave core is very limited, the memory wall problem is especially pronounced in applications on this processor. In current practice it shows up as the following three problems.

First, core-group computing resources are underutilized. Relative to the powerful computing capability of the core group, the data transfer bandwidth between the master core and the core group is limited and the local data memory is too small. If the core group does not receive enough data it sits idle for long periods, wasting its computing resources. This is particularly true of the large-scale distributed programs run in supercomputing, which use a very large number of nodes, usually counted in tens of thousands. The amount of data assigned to the process on each node is therefore limited, and so is the data available to a single function body to be optimized for master-slave computation, which leads to low utilization of core-group computing resources.

Second, data transfer between main memory and the core group takes much longer than a core-group access to local data memory. Taking the Sunway SW26010 as an example, the master core and the slave cores both run at 1.5 GHz, so each clock cycle (beat) lasts 0.67 nanoseconds. One core-group access to main memory has a latency of 278 clock cycles (186.26 nanoseconds), whereas one access to local data memory has a latency of only 4 clock cycles (2.68 nanoseconds). A core-group access to main memory thus costs tens of times as much as an access to local data memory, making it an inefficient memory operation.

Third, the core group is started and stopped too often. To start a core-group computation, the master core must spawn the core-group threads, and a single spawn takes 26,500 clock cycles (17,755 nanoseconds). After the core-group computation finishes, the master core must join the core-group threads and collect the core-group data, and a single join takes 7,300 clock cycles (4,891 nanoseconds). In a large-scale distributed program running on a supercomputer, many function bodies can be parallel-optimized and each optimized function body is called repeatedly, so the number of spawn and join operations is counted in the hundreds of millions or billions. Starting the core group repeatedly requires frequent spawning and joining of core-group threads, which lowers the overall running efficiency of the program.
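The latency figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the rounded 0.67 ns cycle time and the cycle counts given in the text, and the dictionary keys are names chosen here, not identifiers from the patent.

```python
# Cycle counts and the 0.67 ns cycle time are the figures quoted in the text.
CYCLE_NS = 0.67  # rounded period of a 1.5 GHz clock

def cycles_to_ns(cycles: int) -> float:
    """Convert a cycle count to nanoseconds, rounded to two decimals."""
    return round(cycles * CYCLE_NS, 2)

# Hypothetical labels for the four operations discussed in the text.
costs = {
    "main_memory_access": 278,
    "local_memory_access": 4,
    "spawn": 26500,
    "join": 7300,
}

for name, cycles in costs.items():
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles)} ns")

# How much more expensive a main-memory access is than a local-memory access.
ratio = costs["main_memory_access"] / costs["local_memory_access"]

# Cost of one spawn followed by one join.
spawn_join_ns = cycles_to_ns(costs["spawn"] + costs["join"])
print(f"main/local ratio: {ratio}, spawn+join: {spawn_join_ns} ns")
```

The combined spawn-plus-join cost computed here is the per-call overhead that the method aims to avoid paying for every optimized loop.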

In addition, current parallel program optimization on the Sunway many-core processor is carried out mainly on individual program segments to be optimized, and most of these segments take the form of loops. Under the previous mechanical approach, optimizing a loop means first transferring its variables from the master core to the slave cores, so this data is transferred at the start of every optimized loop. However, several program segments to be optimized may share the same data, for example loop bounds, constants, and common input data that does not change during program loops, such as the data marked with underlines in the original example.

Summary of the invention

In view of the deficiencies of the prior art, the present invention provides an accelerated execution method based on the Sunway many-core processor.

Explanation of terms:

Program context dependence: in the present invention, program context refers to code segments executed in sequence. If the next code segment does not use the data output by the previous code segment, the two code segments have no program context dependence; if the next code segment needs the data output by the previous code segment, the two code segments have a program context dependence.
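The definition above amounts to a read-after-write check between two code segments. A minimal sketch follows, assuming each segment is described by the set of variables it reads and the set it writes; the `Segment` class, the `depends_on` function and the variable names are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """A code segment described by the data it consumes and produces."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()

def depends_on(later: Segment, earlier: Segment) -> bool:
    """True if `later` uses data that `earlier` outputs."""
    return bool(later.reads & earlier.writes)

# Hypothetical example: B consumes what A produces; C only needs input u.
a = Segment("A", reads=frozenset({"u"}), writes=frozenset({"v"}))
b = Segment("B", reads=frozenset({"v"}), writes=frozenset({"w"}))
c = Segment("C", reads=frozenset({"u"}), writes=frozenset({"z"}))

print(depends_on(b, a), depends_on(c, b), depends_on(c, a))
```

This sketch covers only the dependence the text defines (a later segment reading an earlier segment's output); a production analysis would likely also need to consider segments that overwrite each other's data.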

The technical solution of the present invention is as follows:

An accelerated execution method based on the Sunway many-core processor runs on a computer and executes a program that comprises several program segments. Taking three program segments as an example, the technical solution summarizes the various situations that arise in parallel programming on the Sunway many-core processor. Any three consecutive program segments are denoted program segment A, program segment B and program segment C, where program segments A and C can be parallel-optimized (they can be executed on the slave cores) and program segment B cannot (it can only be executed on the master core). The method comprises the following steps:

I. Determine the program context dependences among program segments A, B and C. If program context dependences exist among all three, execute them sequentially: set up communication lock AB and communication lock BC as synchronization variables shared by the master core and the core group, which determine whether the master core or the core group runs or waits. This comprises the following steps:

(1) Initialize the synchronization variables. Communication lock AB and communication lock BC are variables shared by the master core and the core group, declared with the volatile keyword;

(2) Load program segment A and program segment C onto the core group. The core group executes program segment A while communication lock AB locks the master core, which waits;

(3) After the core group finishes program segment A, one or several core-group threads synchronize the core-group thread data; the number of threads that perform this synchronization depends on the number of core-group threads actually in use. The core-group data is transferred to the master core by DMA, the master core is notified to execute program segment B, and program segment C on the core group is locked;

(4) The master core executes program segment B. When it completes, communication lock BC is unlocked to notify the core group to execute program segment C. After the core group finishes program segment C, it returns the core-group result data to the master core.

The advantages of this design are: 1) the master core and the core group can synchronize through the communication-lock mechanism; 2) the number of core-group spawn and join operations is reduced; 3) if program segment C reuses data from program segment A, for example the same arrays, the number of DMA transfers between the master core and the core group is reduced, because these arrays can be reused by the later core-group code.

Otherwise, adjust the execution order of program segments A, B and C and then execute them;

II. Perform step I on each subsequent group of three consecutive program segments until the whole program has been executed.
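The handshake in steps (1) to (4) can be sketched on an ordinary machine. The code below is an analogy only: one Python thread stands in for the core group, `threading.Event` objects stand in for the volatile shared variables, and the DMA transfers are omitted. The function name and log strings are hypothetical.

```python
import threading

def run_sequential_case():
    """Simulate A -> B -> C with one spawn and one join."""
    log = []
    lock_ab = threading.Event()  # set means: A is done, master core may run B
    lock_bc = threading.Event()  # set means: B is done, core group may run C

    def core_group():
        log.append("core group: run A")
        lock_ab.set()            # release the waiting master core
        lock_bc.wait()           # C stays locked until B has finished
        log.append("core group: run C")

    worker = threading.Thread(target=core_group)
    worker.start()               # the single "spawn": A and C are loaded together
    lock_ab.wait()               # master core waits while A runs
    log.append("master core: run B")
    lock_bc.set()                # unlock C
    worker.join()                # the single "join": results return to the master
    return log

print(run_sequential_case())
```

The point of the design is visible in the control flow: the worker is started and joined once for both A and C, instead of once per segment.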

Preferably according to the present invention, before program segments A, B and C are executed, the following operations are performed:

a. Determine whether program segment A, B or C contains two or more program sub-segments. If not, execute the program segment directly; otherwise, go to step b;

b. Determine whether the two or more program sub-segments contain loop bounds, constants, or common input data that does not change during program loops. If not, execute the sub-segments sequentially; otherwise, go to step c;

c. Extract the loop bounds, constants and common input data that does not change during program loops, transfer the extracted data from the master core to every slave core in a single pass, and then execute the two or more program sub-segments. Such data is the kind marked with underlines in the Background art. If several program segments all contain such data, it can be extracted and transferred to every slave core once, when program execution starts.

Each slave core has 64 KB of local data memory, and storing this data usually takes no more than 3 KB, so it does not interfere with normal slave-core computation.
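Operations a to c can be illustrated by separating the inputs of each sub-segment into data sent once and data sent per sub-segment. This is a minimal sketch under the assumption that inputs are tracked as sets of variable names; `plan_transfers` and all variable names are hypothetical.

```python
def plan_transfers(subsegment_inputs, invariant):
    """Split inputs into a one-time shared transfer and per-sub-segment transfers.

    subsegment_inputs: list of sets of variable names, one set per sub-segment
    invariant: names of loop bounds, constants and unchanging common inputs
    """
    used = set().union(*subsegment_inputs)
    shared = used & invariant                      # sent to the slave cores once
    per_segment = [inputs - shared for inputs in subsegment_inputs]
    return shared, per_segment

# Hypothetical sub-segments that share loop bounds and a constant.
inputs = [
    {"istr", "iend", "g", "zeta"},
    {"istr", "iend", "g", "ubar"},
]
invariant = {"istr", "iend", "g"}

shared, per_segment = plan_transfers(inputs, invariant)
naive_count = sum(len(s) for s in inputs)
merged_count = len(shared) + sum(len(s) for s in per_segment)
print(sorted(shared), per_segment, naive_count, merged_count)
```

In this toy case the number of variables transferred drops because the shared names are sent once rather than once per sub-segment.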

Preferably according to the present invention, executing program segment A comprises the following steps:

(5) Determine whether program segment A contains several program sub-segments. If so, go to step (6); otherwise, go to step (7);

(6) Denote any two consecutive program sub-segments as program sub-segment A1 and program sub-segment A2, and perform the following on all program sub-segments until program segment A has been executed: determine the program context dependence between sub-segments A1 and A2. If a dependence exists, execute A1 and A2 in sequence; otherwise, allocate core-group computing resources according to the data volumes of A1 and A2 and execute them in parallel.

(7) If program segment A does not contain several program sub-segments, execute program segment A directly.

Preferably according to the present invention, executing program segment C comprises the following steps:

(8) Determine whether program segment C contains several program sub-segments. If so, go to step (9); otherwise, go to step (10);

(9) Denote any two consecutive program sub-segments as program sub-segment C1 and program sub-segment C2, and perform the following on all program sub-segments until program segment C has been executed: determine the program context dependence between sub-segments C1 and C2. If a dependence exists, execute C1 and C2 in sequence; otherwise, allocate core-group computing resources according to the data volumes of C1 and C2 and execute them in parallel;

(10) If program segment C does not contain several program sub-segments, execute program segment C directly.

As to program segments and program sub-segments: for example, a for loop is treated as one program segment. If the loop body contains several unrelated lines of computation, it can be divided into multiple program sub-segments and processed in parallel; if the loop body contains only a single line of code, it cannot be divided into sub-segments and the program segment is executed directly. Within a program segment, parallel execution of the sub-segments can be arranged according to the dependences among them. A program segment containing multiple sub-segments would normally run them serially on the core group; if the sub-segments have no context dependence and each one processes a small amount of data, they can be executed in parallel.
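The sub-segment rule can be sketched as a small scheduler. The text says only that core-group resources are allocated "according to the data volume"; the proportional split below, with at least one core per sub-segment, is one possible reading and an assumption of this sketch, as are the function names.

```python
SLAVE_CORES = 64  # slave cores per core group, as stated in the Background art

def allocate_cores(volumes, total=SLAVE_CORES):
    """Split `total` cores among sub-segments in proportion to data volume.

    Every sub-segment gets at least one core (an assumption of this sketch).
    """
    n = len(volumes)
    if n == 0 or n > total or any(v <= 0 for v in volumes):
        raise ValueError("need 1..total sub-segments with positive data volumes")
    spare = total - n                       # cores left after one each
    whole = sum(volumes)
    shares = [1 + spare * v // whole for v in volumes]
    leftover = total - sum(shares)          # cores lost to integer rounding
    largest_first = sorted(range(n), key=lambda i: volumes[i], reverse=True)
    for i in largest_first[:leftover]:
        shares[i] += 1                      # give them to the largest volumes
    return shares

def schedule_subsegments(dependent, volumes):
    """Return the execution mode and the core count for each sub-segment."""
    if dependent:
        # Run one after another; each sub-segment may use the whole core group.
        return {"mode": "sequential", "cores": [SLAVE_CORES] * len(volumes)}
    return {"mode": "parallel", "cores": allocate_cores(volumes)}

print(schedule_subsegments(False, [300, 100]))
print(schedule_subsegments(True, [300, 100]))
```

With integer data volumes the allocation always sums to 64, so no slave core is left idle when independent sub-segments run side by side.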

Preferably according to the present invention, if no program context dependence exists among program segments A, B and C, the core group executes program segments A and C while the master core simultaneously executes program segment B, which cannot be parallel-optimized. This case changes the execution order of the three segments: running program segments A and C together saves the time of one spawn and one join. Because the master-core code and the core-group code run at the same time, the number of core-group spawn and join operations is reduced. Since the three segments have no context dependence, no "communication lock" is needed for synchronization.

Further preferably, core-group threads are spawned to load program segments A and C onto the core group, which executes them while the master core simultaneously executes program segment B, which cannot be parallel-optimized; after program segments A, C and B have all finished, the core-group threads are joined and the results of program segments A and C are returned to the master core.

Preferably according to the present invention, if a program context dependence exists between program segments A and B, none exists between program segments A and C, and none exists between program segments B and C, then program segments A and C are executed first and program segment B is executed after their results are returned.

Further preferably, core-group threads are spawned to load program segments A and C onto the core group, which executes them; after both have finished, the core-group threads are joined and the results of program segments A and C are returned to the master core, which then executes program segment B, which cannot be parallel-optimized.

Preferably according to the present invention, if no program context dependence exists between program segments A and B, and a program context dependence exists between program segments B and C, then program segment B is executed first and program segments A and C are executed after its result is returned. Parallel optimization is achieved by adjusting the execution order of the program segments, which reduces the number of core-group spawn and join operations. The execution process is shown in Fig. 6.

Further preferably, the master core executes program segment B; after it finishes, core-group threads are spawned to load program segments A and C onto the core group, which executes them; after both have finished, the core-group threads are joined and the results of program segments A and C are returned to the master core.
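The four cases described above can be collected into one dispatch function. This is a sketch: the plan strings are hypothetical labels, and the fallback for dependence combinations the text does not mention (conservatively using the lock-based sequential plan) is an assumption, not something the patent states.

```python
# Lock-based plan used when all three segments depend on one another.
SEQUENTIAL = [
    "spawn(A, C)", "core group: A", "unlock AB",
    "master: B", "unlock BC", "core group: C", "join",
]

def plan_execution(dep_ab: bool, dep_bc: bool, dep_ac: bool):
    """Choose an execution plan from the pairwise context dependences."""
    if dep_ab and dep_bc and dep_ac:
        return list(SEQUENTIAL)
    if not (dep_ab or dep_bc or dep_ac):
        # No dependences: core group and master core work at the same time.
        return ["spawn(A, C)", "core group: A, C || master: B", "join"]
    if dep_ab and not dep_ac and not dep_bc:
        # B needs A's result, so A and C run first.
        return ["spawn(A, C)", "core group: A, C", "join", "master: B"]
    if not dep_ab and dep_bc:
        # C needs B's result, so B runs first.
        return ["master: B", "spawn(A, C)", "core group: A, C", "join"]
    # Assumption: combinations not covered by the text fall back to the
    # lock-based sequential plan, which preserves the original order.
    return list(SEQUENTIAL)

for deps in [(True, True, True), (False, False, False),
             (True, False, False), (False, True, False)]:
    print(deps, plan_execution(*deps))
```

Every plan contains exactly one spawn and one join for the A, B, C triple, which is the saving the method is after.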

The beneficial effects of the present invention are:

1. The invention determines the program context dependences between program segments and handles each case flexibly, saving master-core waiting time and achieving parallel processing on the master core and the core group. During program execution it reduces the number of times core-group threads must be spawned and joined, improving the execution efficiency of the program.

2. The invention transfers loop bounds, constants, and common input data that does not change during program loops from the master core to the slave cores in a single pass, avoiding repeated data transfers.

3. The invention introduces a "communication lock" synchronization mechanism in which communication with the master core can be synchronized in three ways: using 1 core-group thread, several core-group threads, or all 64 core-group threads. Together with flexible adjustment of the program-segment execution order for parallel optimization, this further reduces the number of core-group spawn and join operations; for applications that call loop bodies repeatedly, more time can be saved.

4. The method reduces the number of DMA transfers between the master core and the core group. The data of several parallel-optimizable segments can be transferred to the slave cores in a single pass, which greatly reduces master-slave data transfer, the most time-consuming part of parallel program optimization on the Sunway many-core processor; part of the data transferred to the slave cores can also be reused by slave-core code executed later. Compared with optimization that does not use this method, the efficiency gain of the parallel-optimized program segments is significant.

5. The invention also determines the program context dependences between program segments and between program sub-segments and handles each case flexibly, saving master-core waiting time and achieving parallel processing on the master core and the core group. A core group can process multiple program segments or program sub-segments at the same time, improving the running efficiency of the program.

Brief description of the drawings

Fig. 1 is a schematic diagram of the hardware architecture of the Sunway many-core processor;

Fig. 2 is a flow diagram of sequential execution when program context dependences exist among all of program segments A, B and C;

Fig. 3 is a flow diagram of program execution when no program context dependence exists between program sub-segment A1 and program sub-segment A2;

Fig. 4 is a flow diagram of program execution when no program context dependence exists among program segments A, B and C;

Fig. 5 is a flow diagram of program execution when a program context dependence exists between program segments A and B, and none exists between program segments A and C or between program segments B and C;

Fig. 6 is a flow diagram of program execution when no program context dependence exists between program segments A and B, and a program context dependence exists between program segments B and C.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and embodiments, but is not limited thereto.

Embodiment 1

An accelerated execution method based on the Sunway many-core processor runs on a computer and executes a program that comprises several program segments. Taking three program segments as an example, the technical solution summarizes the various situations that arise in parallel programming on the Sunway many-core processor. Any three consecutive program segments are denoted program segment A, program segment B and program segment C, where program segments A and C can be parallel-optimized (they can be executed on the slave cores) and program segment B cannot (it can only be executed on the master core). The method comprises the following steps:

I. Determine the program context dependences among program segments A, B and C. If program context dependences exist among all three, execute them sequentially: set up communication lock AB and communication lock BC as synchronization variables shared by the master core and the core group, which determine whether the master core or the core group runs or waits. This is shown in Fig. 2 and comprises steps (1) to (4) set out in the Summary of the invention.

Embodiment 2

The accelerated execution method based on the Sunway many-core processor according to Embodiment 1, with the following difference: before program segments A, B and C are executed, operations a to c set out in the Summary of the invention are performed.

Take the ocean model program Regional Ocean Modeling System (ROMS) as an example. The hotspot routine step2d.f90 contains 55 program segments to be optimized. Under the previous approach, data must be transferred for all 55 segments; with the method provided by the present invention, the transfers of multiple segments are merged and data needs to be transferred for only 10 segments. The master-to-slave data transfer efficiency improves by about 80%.
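The 80% figure is consistent with the reduction in the number of transfers. The short check below assumes, as a simplification not stated in the source, that every transfer costs about the same, so that efficiency scales with the transfer count.

```python
# Transfer counts quoted in the ROMS example.
transfers_before = 55
transfers_after = 10

# Fraction of transfers eliminated by merging (assumes equal cost per transfer).
reduction = (transfers_before - transfers_after) / transfers_before
print(f"transfers eliminated: {reduction:.1%}")
```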

Embodiment 3

The accelerated execution method based on the Sunway many-core processor according to Embodiment 1, with the following difference:

Executing program segment A comprises the following steps:

(6) Denote any two consecutive program sub-segments as program sub-segment A1 and program sub-segment A2, and perform the following on all program sub-segments until program segment A has been executed: determine the program context dependence between sub-segments A1 and A2. If a dependence exists, execute A1 and A2 in sequence; otherwise, allocate core-group computing resources according to the data volumes of A1 and A2 and execute them in parallel. The execution process is shown in Fig. 3.

Program segment C is executed in the same way, following steps (8) to (10) set out in the Summary of the invention.

Embodiment 4

If no program context dependence exists among program segments A, B and C, the core group executes program segments A and C while the master core simultaneously executes program segment B, which cannot be parallel-optimized. This case changes the execution order of the three segments: running program segments A and C together saves the time of one spawn and one join. Because the master-core code and the core-group code run at the same time, the number of core-group spawn and join operations is reduced, and since the three segments have no context dependence, no "communication lock" is needed for synchronization. Core-group threads are spawned to load program segments A and C onto the core group, which executes them while the master core executes program segment B; after program segments A, C and B have all finished, the core-group threads are joined and the results of program segments A and C are returned to the master core, as shown in Fig. 4.

If a program context dependence exists between program segments A and B, none exists between program segments A and C, and none exists between program segments B and C, then program segments A and C are executed first and program segment B is executed after their results are returned. The execution process is shown in Fig. 5. Core-group threads are spawned to load program segments A and C onto the core group, which executes them; after both have finished, the core-group threads are joined and the results of program segments A and C are returned to the master core, which then executes program segment B, which cannot be parallel-optimized.

If no program context dependence exists between program segments A and B, and a program context dependence exists between program segments B and C, then program segment B is executed first and program segments A and C are executed after its result is returned. Parallel optimization is achieved by adjusting the execution order of the program segments, which reduces the number of core-group spawn and join operations. The execution process is shown in Fig. 6. The master core executes program segment B; after it finishes, core-group threads are spawned to load program segments A and C onto the core group, which executes them; after both have finished, the core-group threads are joined and the results of program segments A and C are returned to the master core.

In this embodiment, the design of the present invention was tested experimentally with the ocean numerical model program Parallel Ocean Program (POP). The test environment was the Sunway TaihuLight supercomputer. POP was used to simulate 5 model days of global ocean temperature change at a scale of 10,000 processes, and program segments in the advu and hmix_del4 routines of POP were optimized. For one loop body in hmix_del4, a single process calls it 900,000 times, and a single core-group spawn and join takes at least 22,646 nanoseconds. With the method of this embodiment, the spawn and join for this loop body, that is, for the program segment containing it, can be omitted, saving 20.34 seconds in total. The program module containing the segment runs for 1,020 seconds, so optimizing this one program segment alone saves 2% of the module's running time. Real application programs contain an enormous number of such program segments to be optimized, and programs of this kind generally require long numerical simulations on supercomputers, so the accumulated time saving is considerable.
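The saving reported for the POP loop body can be approximated from the quoted figures. The sketch below multiplies the call count by the minimum spawn-plus-join cost; this plain product comes out slightly above the 20.34 seconds reported, and the source does not explain the difference, so the computed value should be read as an estimate only.

```python
# Figures quoted in the text.
calls = 900_000                 # calls to the loop body by a single process
spawn_ns = 17_755               # one spawn, in nanoseconds
join_ns = 4_891                 # one join, in nanoseconds
module_runtime_s = 1_020        # running time of the enclosing module
reported_saving_s = 20.34       # saving reported in the text

per_call_ns = spawn_ns + join_ns
estimated_saving_s = calls * per_call_ns / 1e9
reported_fraction = reported_saving_s / module_runtime_s

print(f"per call: {per_call_ns} ns")
print(f"estimated saving: {estimated_saving_s:.2f} s")
print(f"reported saving as share of module time: {reported_fraction:.2%}")
```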

For each of the cases described in the present invention, three program segments executed in sequence were selected, denoted program segment A, program segment B and program segment C, where program segments A and C can be parallel-optimized (executed on the slave cores) and program segment B cannot (executed only on the master core). Each program segment was optimized according to the method of the present invention. Table 1 compares the efficiency of the program after applying the method of the invention with the unoptimized program and with the original optimization method.

Table 1

With the parallel optimization method of the present invention, the efficiency gain relative to the original method is significant, ranging from a minimum of 16.7% to a maximum of 67.6%.

Claims

1. An accelerated execution method based on the Sunway many-core processor, which runs on a computer and executes a program comprising several program segments, any three consecutive program segments being denoted program segment A, program segment B and program segment C, characterized in that the method comprises the following steps:

I. determining the program context dependences among program segments A, B and C; if program context dependences exist among all three, executing them sequentially: setting up communication lock AB and communication lock BC as synchronization variables shared by the master core and the core group, which determine whether the master core or the core group runs or waits, comprising the following steps:

(1) initializing the synchronization variables, communication lock AB and communication lock BC being variables shared by the master core and the core group;

(2) loading program segment A and program segment C onto the core group, the core group executing program segment A while communication lock AB locks the master core, which waits;

(3) after the core group finishes program segment A, synchronizing the core-group thread data using one or several core-group threads, transferring the core-group data to the master core by DMA, notifying the master core to execute program segment B, and locking program segment C on the core group;

(4) the master core executing program segment B; on completion, unlocking communication lock BC to notify the core group to execute program segment C; after the core group finishes program segment C, returning the core-group result data to the master core;

2. The accelerated execution method based on the Sunway many-core processor according to claim 1, characterized in that, before program segments A, B and C are executed, the following operations are performed:

a. determining whether program segment A, B or C contains two or more program sub-segments; if not, executing the program segment directly; otherwise, going to step b;

b. determining whether the two or more program sub-segments contain loop bounds, constants, or common input data that does not change during program loops; if not, executing the two or more program sub-segments sequentially; otherwise, going to step c;

c. extracting the loop bounds, constants and common input data that does not change during program loops, transferring the extracted data from the master core to every slave core in a single pass, and executing the two or more program sub-segments.

3. The accelerated running method based on a Sunway many-core processor according to claim 1, characterized in that executing program segment A comprises the following steps:

(5) Determine whether program segment A contains several program sub-segments; if program segment A contains several program sub-segments, go to step (6); otherwise, go to step (7);

(6) Let any two consecutive program sub-segments be program sub-segment A1 and program sub-segment A2, and perform the following for all program sub-segments until program segment A has been fully executed: determine the program context dependency between program sub-segment A1 and program sub-segment A2; if a program context dependency exists between them, execute program sub-segment A1 and then program sub-segment A2 in order; otherwise, allocate core group computing resources according to the data volumes of program sub-segment A1 and program sub-segment A2, and execute program sub-segment A1 and program sub-segment A2 in parallel;
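The decision in step (6) can be sketched as below. The text does not state how a context dependency is detected or how resources are divided, so both parts are assumptions: `depends` applies a read/write overlap test on hypothetical `"reads"` and `"writes"` lists, and `split_cores` divides the slave cores in proportion to a hypothetical `"volume"` field. The default of 64 slave cores per core group is also an assumption of this sketch.

```python
def depends(first, second):
    """Assumed dependency test: any read/write or write/write overlap."""
    w1, r1 = set(first["writes"]), set(first["reads"])
    w2, r2 = set(second["writes"]), set(second["reads"])
    return bool(w1 & r2 or r1 & w2 or w1 & w2)

def split_cores(volume_1, volume_2, total_cores=64):
    """Divide slave cores between two sub-segments in proportion to data volume."""
    total = volume_1 + volume_2
    if total == 0:                               # no volume information: split evenly
        share_1 = total_cores // 2
    else:
        share_1 = round(total_cores * volume_1 / total)
    share_1 = min(max(share_1, 1), total_cores - 1)  # each side keeps at least one core
    return share_1, total_cores - share_1

def schedule_pair(sub_1, sub_2, total_cores=64):
    """Return (mode, name, cores) entries for two consecutive sub-segments."""
    if depends(sub_1, sub_2):
        # Dependent: run one after the other, each on the whole core group.
        return [("sequential", sub_1["name"], total_cores),
                ("sequential", sub_2["name"], total_cores)]
    c1, c2 = split_cores(sub_1["volume"], sub_2["volume"], total_cores)
    return [("parallel", sub_1["name"], c1), ("parallel", sub_2["name"], c2)]
```

The same pairwise decision would apply unchanged to the sub-segments of any other program segment that runs on the core group.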

4. The accelerated running method based on a Sunway many-core processor according to claim 1, characterized in that executing program segment C comprises the following steps:

(8) Determine whether program segment C contains several program sub-segments; if program segment C contains several program sub-segments, go to step (9); otherwise, go to step (10);

(9) Let any two consecutive program sub-segments be program sub-segment C1 and program sub-segment C2, and perform the following for all program sub-segments until program segment C has been fully executed: determine the program context dependency between program sub-segment C1 and program sub-segment C2; if a program context dependency exists between them, execute program sub-segment C1 and then program sub-segment C2 in order; otherwise, allocate core group computing resources according to the data volumes of program sub-segment C1 and program sub-segment C2, and execute program sub-segment C1 and program sub-segment C2 in parallel;

5. The accelerated running method based on a Sunway many-core processor according to claim 1, characterized in that, if no program context dependency exists among program segment A, program segment B, and program segment C, the core group executes program segment A and program segment C while, at the same time, the main core executes program segment B, which cannot be optimized for parallel execution.

6. The accelerated running method based on a Sunway many-core processor according to claim 5, characterized in that core group threads are spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C while, at the same time, the main core executes program segment B, which cannot be optimized for parallel execution; after program segment A, program segment C, and program segment B have all been executed, the core group threads are joined and the results of program segment A and program segment C are returned to the main core.
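The overlap described in claims 5 and 6 can be sketched with a thread pool standing in for the core group. This is an illustration under assumptions, not the Sunway thread interface: entering the `ThreadPoolExecutor` context plays the role of spawn, collecting the futures plays the role of join, and the function name `run_independent` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_independent(seg_a, seg_b, seg_c, data):
    """Run A and C on a worker pool while the caller runs B, then collect results."""
    with ThreadPoolExecutor(max_workers=2) as core_group:   # "spawn"
        future_a = core_group.submit(seg_a, data)
        future_c = core_group.submit(seg_c, data)
        # The main core does not wait: it executes B while A and C are in flight.
        result_b = seg_b(data)
        # "join": block until the core group has finished A and C.
        result_a = future_a.result()
        result_c = future_c.result()
    return result_a, result_b, result_c
```

Because the three segments are independent, each receives the same input and the main core is never idle while the core group works.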

7. The accelerated running method based on a Sunway many-core processor according to claim 1, characterized in that, if a program context dependency exists between program segment A and program segment B, no program context dependency exists between program segment A and program segment C, and no program context dependency exists between program segment B and program segment C, then program segment A and program segment C are executed first, and program segment B is executed after their results are returned.

8. The accelerated running method based on a Sunway many-core processor according to claim 7, characterized in that core group threads are spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C; after program segment A and program segment C have been executed, the core group threads are joined and the results of program segment A and program segment C are returned to the main core, and the main core then executes program segment B, which cannot be optimized for parallel execution.

9. The accelerated running method based on a Sunway many-core processor according to any one of claims 1 to 8, characterized in that, if no program context dependency exists between program segment A and program segment B, and a program context dependency exists between program segment B and program segment C, then program segment B is executed first, and program segment A and program segment C are executed after its result is returned.

10. The accelerated running method based on a Sunway many-core processor according to claim 9, characterized in that the main core executes program segment B; after program segment B has been executed, core group threads are spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C; after program segment A and program segment C have been executed, the core group threads are joined and the results of program segment A and program segment C are returned to the main core.
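The four execution orders described in claims 1, 5, 7, and 9 can be summarized in one selection routine. The function name `choose_order` and the phase-list representation are assumptions for illustration; each inner list is one phase, in which A and C run on the core group and B runs on the main core. Dependency combinations not addressed in the claims fall back to sequential execution here, which is a conservative choice of this sketch and not something the text specifies.

```python
def choose_order(dep_ab, dep_bc, dep_ac):
    """Map pairwise dependency flags to a list of execution phases."""
    if dep_ab and dep_bc and dep_ac:
        # Claim 1: all three depend on one another, so run A, B, C in order.
        return [["A"], ["B"], ["C"]]
    if not (dep_ab or dep_bc or dep_ac):
        # Claim 5: fully independent, so the main core runs B while
        # the core group runs A and C.
        return [["A", "B", "C"]]
    if dep_ab and not dep_bc and not dep_ac:
        # Claim 7: only A and B are coupled, so A and C go first, then B.
        return [["A", "C"], ["B"]]
    if dep_bc and not dep_ab:
        # Claim 9: B and C are coupled but A and B are not, so B goes first.
        return [["B"], ["A", "C"]]
    # Assumed fallback for combinations the claims do not cover.
    return [["A"], ["B"], ["C"]]
```

For example, `choose_order(True, False, False)` yields the claim 7 ordering, in which the core group finishes A and C before the main core starts B.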