CN110222007A - An accelerated execution method based on the Sunway many-core processor - Google Patents

An accelerated execution method based on the Sunway many-core processor

Publication number
CN110222007A
Authority
CN
China
Prior art keywords
program
program segment
segment
core
subsegment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910536855.9A
Other languages
Chinese (zh)
Other versions
CN110222007B (en)
Inventor
潘景山
刘弢
王利
郭强
庄园
曾云辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center
Original Assignee
Shandong Computer Science Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center
Priority to CN201910536855.9A
Publication of CN110222007A
Application granted
Publication of CN110222007B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present invention relates to an accelerated execution method based on the Sunway many-core processor, comprising: A. determining the program context dependences among program segment A, program segment B and program segment C; if program context dependences exist among all three of program segment A, program segment B and program segment C, executing them sequentially; otherwise, adjusting the execution order of program segment A, program segment B and program segment C and then executing them; B. performing step A on each subsequent group of three consecutive program segments until the whole program has been executed. The invention determines the program context dependences between program segments and between program sub-segments and handles each case flexibly. It introduces a communication-lock synchronization mechanism, saves main-core waiting time, and lets the main core and the core group process in parallel. During program execution it reduces the number of times core-group threads must be spawned and joined, which improves the execution efficiency of the program.

Description

An accelerated execution method based on the Sunway many-core processor
Technical field
The present invention relates to the technical fields of high-performance computing, parallel computing and computer architecture, and in particular to an accelerated execution method based on the Sunway many-core processor.
Background art
The Sunway many-core processor is a representative domestic high-performance processor and a high-performance computing chip developed independently in China. The Sunway TaihuLight supercomputer, whose computing capability ranks among the highest in the world, uses more than 40,000 Sunway many-core processors.
Each Sunway many-core processor chip (SW26010) contains 4 core groups, which are connected by a network on chip. Each core group consists mainly of a memory controller, a management unit, 1 main core and 64 slave cores. The 64 slave cores are connected in an 8 × 8 mesh topology. Each slave core in each core group has 64 KB of local store, as shown in Fig. 1.
Because the Sunway many-core processor has many slave cores and the local store of each slave core is very limited, the memory wall problem is especially pronounced when the processor is used. Judging from current practical use, there are three problems.
First, the computing resources of the core group are underutilized. Relative to the powerful computing capability of the core group, the data transfer bandwidth between the main core and the core group is limited and the local store is too small. If the core group does not receive enough data, it idles for long periods and its computing resources are wasted. This is especially true of the large-scale distributed programs run in supercomputing, which use a very large number of nodes, usually counted in tens of thousands. The amount of data assigned to the process on each node is therefore limited, and so is the data available to a single function body being optimized for master-slave computation, which leads to low utilization of core-group computing resources.
Second, data transfer between main memory and the core group takes much longer than a core-group access to the local store. Taking the SW26010 as an example, the main core and the slave cores both run at 1.5 GHz, so each clock cycle is 0.67 ns. One core-group access to main memory has a latency of 278 clock cycles (186.26 ns), whereas one access to the local store has a latency of only 4 clock cycles (2.68 ns). A core-group access to main memory costs tens of times as much as an access to the local store, so core-group accesses to main memory are inefficient.
Third, the core group is started and terminated frequently. To start the core group for a computation, the main core must spawn the core-group threads; a single spawn takes 26,500 clock cycles (17,755 ns). After the core-group computation finishes, the main core must join the core-group threads and collect the core-group data; a single join takes 7,300 clock cycles (4,891 ns). In a large-scale distributed program run on a supercomputer, very many function bodies can be optimized in parallel and each optimized function body is called many times, so the number of spawn and join operations is counted in hundreds of millions or billions. If the core group is started repeatedly, core-group threads must be spawned and joined frequently, which lowers the overall efficiency of the program.
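The latency figures above follow from multiplying each cycle count by the clock period. A minimal sketch of that arithmetic, assuming the rounded period of 0.67 ns per cycle that the text uses (the function and dictionary names are illustrative only):

```python
# Convert the cycle counts quoted in the text into nanoseconds.
# Assumption: the text rounds the 1.5 GHz clock period to 0.67 ns per cycle.
NS_PER_CYCLE = 0.67


def cycles_to_ns(cycles: int) -> float:
    """Return the latency in nanoseconds for a given number of clock cycles."""
    return round(cycles * NS_PER_CYCLE, 2)


latencies_ns = {
    "main_memory_access": cycles_to_ns(278),
    "local_store_access": cycles_to_ns(4),
    "spawn": cycles_to_ns(26500),
    "join": cycles_to_ns(7300),
}

# Cost of one main-memory access relative to one local-store access, in cycles.
memory_to_local_ratio = 278 / 4
```

The ratio of 278 to 4 cycles is what the text summarizes as "tens of times", and one spawn plus one join costs 33,800 cycles, which is why the method tries to issue as few of them as possible.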
In addition, current parallel program optimization on the Sunway many-core processor mainly targets the program segments to be optimized, and most of these segments take the form of loops. Under the earlier, mechanical way of optimizing, every loop first requires its variables to be transferred from the main core to the slave cores: at the start of each optimized loop, the data are transferred from the main core to the slave cores. However, several program segments to be optimized may use the same data, for example loop upper and lower bounds, constants, and shared input data that do not change inside the program loops, such as the data marked with underlining below.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides an accelerated execution method based on the Sunway many-core processor.
Explanation of terms:
Program context dependence: in the present invention, program context refers to code segments that are executed in sequence. If the next code segment does not use the data output by the previous code segment, the two code segments are said to have no program context dependence; if the next code segment needs the data output by the previous code segment, the two code segments are said to have a program context dependence.
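The definition above can be sketched as a simple set test. This is only an illustration under the assumption that each segment's inputs and outputs are known as sets of variable names; the `Segment` class and the example variables are hypothetical and do not come from the patent:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """A code segment described by the data it reads and the data it outputs."""
    name: str
    reads: frozenset = field(default_factory=frozenset)
    writes: frozenset = field(default_factory=frozenset)


def has_context_dependence(prev: Segment, nxt: Segment) -> bool:
    """True if the next segment uses any data output by the previous segment."""
    return bool(prev.writes & nxt.reads)


# Hypothetical example: B consumes A's output "v"; C only needs the input "u".
seg_a = Segment("A", reads=frozenset({"u"}), writes=frozenset({"v"}))
seg_b = Segment("B", reads=frozenset({"v"}), writes=frozenset({"w"}))
seg_c = Segment("C", reads=frozenset({"u"}), writes=frozenset({"z"}))
```

Here `has_context_dependence(seg_a, seg_b)` holds, while A and C, and B and C, are independent. The sketch models only the read-after-output relation named in the definition.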
The technical solution of the present invention is as follows:
An accelerated execution method based on the Sunway many-core processor runs on a computer and executes a program that comprises several program segments. Taking three program segments as an example, the technical solution summarizes and details the various cases of parallel programming on the Sunway many-core processor. Any three consecutive program segments are denoted program segment A, program segment B and program segment C, where program segment A and program segment C can be optimized in parallel (they can be executed on the slave cores) and program segment B cannot be optimized in parallel (it can only be executed on the main core). The method comprises the following steps:
I. Determine the program context dependences among program segment A, program segment B and program segment C. If program context dependences exist among all three, execute them sequentially: set communication lock AB and communication lock BC so that the main core and the core group share synchronization variables, and let the synchronization variables determine whether the main core or the core group runs or waits. This comprises the following steps:
(1) Initialize the synchronization variables. Communication lock AB and communication lock BC are variables shared by the main core and the core group and are declared with the volatile keyword;
(2) Load program segment A and program segment C into the core group. The core group executes program segment A while communication lock AB locks the main core, so the main core waits;
(3) After program segment A finishes on the core group, use one or several core-group threads to synchronize the core-group thread data; the number of threads that perform the synchronization depends on the number of core-group threads actually used. Transfer the core-group data to the main core by DMA, notify the main core to execute program segment B, and lock program segment C on the core group;
(4) The main core executes program segment B. When it finishes, communication lock BC is unlocked to notify the core group to execute program segment C. After program segment C finishes on the core group, the core-group result data are returned to the main core;
The advantages of this design are: 1) the main core and the core group can synchronize through the communication-lock mechanism; 2) the number of core-group spawn and join operations is reduced; 3) if program segment C uses data that program segment A also uses, for example the same arrays, the number of DMA transfers between the main core and the core group is reduced, because these arrays can be reused in the later core-group program.
Otherwise, adjust the execution order of program segment A, program segment B and program segment C and then execute them;
II. Perform step I on each subsequent group of three consecutive program segments until the program has been executed.
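The sequential case in steps (1) to (4) can be simulated on an ordinary machine. The sketch below is an analogy, not the Sunway implementation: a Python thread stands in for the core group, `threading.Event` objects stand in for the volatile shared lock variables, and a dictionary stands in for data moved by DMA. All names are assumptions made for illustration.

```python
import threading


def run_with_communication_locks():
    """Simulate A (core group) -> B (main core) -> C (core group) with one spawn and one join."""
    lock_ab = threading.Event()  # set when segment A is done; releases the main core
    lock_bc = threading.Event()  # set when segment B is done; releases segment C
    shared = {}                  # stands in for data exchanged between main core and core group
    trace = []

    def core_group():
        shared["a"] = 1                 # segment A runs on the core group
        trace.append("A")
        lock_ab.set()                   # notify the main core to execute B
        lock_bc.wait()                  # segment C stays locked until B has finished
        shared["c"] = shared["b"] + 1   # segment C uses the output of B
        trace.append("C")

    worker = threading.Thread(target=core_group)
    worker.start()                      # a single "spawn" loads both A and C
    lock_ab.wait()                      # the main core waits while A runs
    shared["b"] = shared["a"] + 1       # segment B runs on the main core
    trace.append("B")
    lock_bc.set()                       # unlock communication lock BC
    worker.join()                       # a single "join" collects the results
    return trace, shared
```

Calling `run_with_communication_locks()` yields the execution order A, B, C while the worker is started and joined only once, which is the saving the design aims for compared with one spawn/join pair per segment.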
Preferably, according to the present invention, the following operations are performed before program segment A, program segment B and program segment C are executed:
a. Determine whether program segment A, program segment B or program segment C contains two or more program sub-segments. If not, execute the program segment directly; otherwise, go to step b;
b. Determine whether the two or more program sub-segments contain loop upper and lower bounds, constants, or shared input data that do not change inside the program loops. If not, execute the two or more program sub-segments in order; otherwise, go to step c;
c. Extract the loop upper and lower bounds, the constants, and the shared input data that do not change inside the program loops, transfer the extracted data from the main core to every slave core in a single transfer, and then execute the two or more program sub-segments. Such data are, for example, the data marked with underlining in the Background art. If several program segments all contain such data, the data can be extracted together and transferred to every slave core when program execution starts.
The local store of each slave core is 64 KB. Storing these data usually takes no more than 3 KB and does not affect normal computation on the slave cores.
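Steps a to c amount to hoisting the data that several sub-segments have in common and sending it once. A minimal sketch of that bookkeeping, assuming each sub-segment's required inputs are known as a set of variable names (the function `plan_transfers` and the variable names are hypothetical):

```python
def plan_transfers(segments):
    """Split per-segment inputs into a one-time shared transfer and per-segment extras.

    `segments` maps a sub-segment name to the set of variable names it needs
    on the slave cores.
    """
    names = list(segments)
    shared = set.intersection(*(set(segments[n]) for n in names)) if names else set()
    per_segment = {n: set(segments[n]) - shared for n in names}
    return shared, per_segment


# Hypothetical sub-segments that share loop bounds, a constant and an input array.
needs = {
    "A1": {"istart", "iend", "dt", "h", "u"},
    "A2": {"istart", "iend", "dt", "h", "v"},
}
shared, extra = plan_transfers(needs)
naive_transfers = sum(len(v) for v in needs.values())
planned_transfers = len(shared) + sum(len(v) for v in extra.values())
```

In this example the naive scheme moves 10 items and the merged scheme moves 6. The sketch only hoists data common to every sub-segment; it does not check that the hoisted data are actually unchanged inside the loops, which the method requires.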
Preferably, according to the present invention, executing program segment A comprises the following steps:
(5) Determine whether program segment A contains several program sub-segments. If so, go to step (6); otherwise, go to step (7);
(6) Denote any two consecutive program sub-segments as program sub-segment A1 and program sub-segment A2, and perform the following on all program sub-segments until program segment A has been executed: determine the program context dependence between program sub-segment A1 and program sub-segment A2. If a program context dependence exists between them, execute program sub-segment A1 and then program sub-segment A2 in order; otherwise, allocate core-group computing resources according to the data volumes of program sub-segment A1 and program sub-segment A2 and execute the two sub-segments in parallel.
(7) If program segment A does not contain several program sub-segments, execute program segment A directly.
Preferably, according to the present invention, executing program segment C comprises the following steps:
(8) Determine whether program segment C contains several program sub-segments. If so, go to step (9); otherwise, go to step (10);
(9) Denote any two consecutive program sub-segments as program sub-segment C1 and program sub-segment C2, and perform the following on all program sub-segments until program segment C has been executed: determine the program context dependence between program sub-segment C1 and program sub-segment C2. If a program context dependence exists between them, execute program sub-segment C1 and then program sub-segment C2 in order; otherwise, allocate core-group computing resources according to the data volumes of program sub-segment C1 and program sub-segment C2 and execute the two sub-segments in parallel;
(10) If program segment C does not contain several program sub-segments, execute program segment C directly.
As for program segments and program sub-segments: a for loop, for example, is treated as one program segment. If the loop body contains several lines of unrelated computation, it can be divided into multiple program sub-segments that are processed in parallel; if the loop body contains only one line of code, it cannot be divided into program sub-segments and the program segment is executed directly. Inside a program segment, parallel execution of the program sub-segments can be arranged according to the dependences between them. When a program segment contains multiple program sub-segments, these are otherwise executed serially on the core group; if there is no context dependence between the program sub-segments and each one processes only a small amount of data, they can be executed in parallel.
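The text says that core-group computing resources are allocated according to the data volumes of independent sub-segments but does not give a rule. One plausible rule, shown purely as an assumption, is a proportional split of the 64 slave cores with at least one core per sub-segment:

```python
def allocate_cores(data_volumes, total_cores=64):
    """Divide `total_cores` among independent sub-segments in proportion to data volume.

    Every sub-segment gets at least one core; cores left over after rounding
    down go to the sub-segments with the largest fractional remainders.
    """
    if not 0 < len(data_volumes) <= total_cores:
        raise ValueError("need between 1 and total_cores sub-segments")
    total = sum(data_volumes.values())
    if total <= 0:
        raise ValueError("data volumes must sum to a positive number")
    spare = total_cores - len(data_volumes)  # cores left after giving one to each
    exact = {k: spare * v / total for k, v in data_volumes.items()}
    alloc = {k: 1 + int(x) for k, x in exact.items()}
    leftover = total_cores - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - int(exact[k]), reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical data volumes (number of elements) for two independent sub-segments.
example = allocate_cores({"A1": 3000, "A2": 1000})
```

The allocation always sums to `total_cores`, and a sub-segment with more data never receives fewer cores than one with less. A real implementation would also need to respect the 8 × 8 mesh layout and the 64 KB local store, which this sketch ignores.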
Preferably, according to the present invention, if no program context dependence exists among program segment A, program segment B and program segment C, the core group executes program segment A and program segment C while the main core simultaneously executes program segment B, which cannot be optimized in parallel. In this case the execution order of the three program segments is adjusted: program segment A and program segment C are executed first, which saves the time of one spawn and one join. The main-core program and the core-group program run at the same time, and the number of core-group spawn and join operations is reduced. Because there is no context dependence among the three program segments, no "communication lock" is needed for synchronization.
Further preferably, core-group threads are spawned to load program segment A and program segment C into the core group, the core group executes program segment A and program segment C, and at the same time the main core executes program segment B, which cannot be optimized in parallel; after program segment A, program segment C and program segment B have all finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core.
Preferably, according to the present invention, if a program context dependence exists between program segment A and program segment B, no program context dependence exists between program segment A and program segment C, and no program context dependence exists between program segment B and program segment C, then program segment A and program segment C are executed first, and program segment B is executed after their results are returned.
Further preferably, core-group threads are spawned to load program segment A and program segment C into the core group, and the core group executes program segment A and program segment C; after program segment A and program segment C have finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core, and the main core then executes program segment B, which cannot be optimized in parallel.
Preferably, according to the present invention, if no program context dependence exists between program segment A and program segment B, and a program context dependence exists between program segment B and program segment C, then program segment B is executed first, and program segment A and program segment C are executed after its result is returned. Parallel optimization is achieved by adjusting the execution order of the program segments, which reduces the number of core-group spawn and join operations. The specific execution flow is shown in Fig. 6.
Further preferably, the main core executes program segment B; after program segment B has finished, core-group threads are spawned to load program segment A and program segment C into the core group, and the core group executes program segment A and program segment C; after program segment A and program segment C have finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core.
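The four cases described in this summary can be collected into one decision function. The sketch below is a reading of the text, not code from the patent: the step strings are illustrative, and the treatment of dependence combinations that the text does not list (they fall back to the sequential, communication-lock order) is an assumption.

```python
def choose_plan(dep_ab: bool, dep_bc: bool, dep_ac: bool = False):
    """Return an ordered list of steps for segments A and C (core group) and B (main core)."""
    if not dep_ab and not dep_bc and not dep_ac:
        # No dependences: overlap the core group (A, C) with the main core (B).
        return ["spawn A,C", "core group: A,C || main core: B", "join"]
    if dep_ab and not dep_bc and not dep_ac:
        # Only B depends on A: run A and C together, then B on the main core.
        return ["spawn A,C", "core group: A,C", "join", "main core: B"]
    if not dep_ab and dep_bc:
        # Only C depends on B: run B first, then A and C together.
        return ["main core: B", "spawn A,C", "core group: A,C", "join"]
    # Remaining combinations: sequential order guarded by communication locks
    # (assumed fallback for cases the text does not enumerate).
    return ["spawn A,C", "core group: A", "unlock AB", "main core: B",
            "unlock BC", "core group: C", "join"]
```

Every plan contains exactly one spawn and one join, whereas running A and C as two separately offloaded segments would need two of each.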
The beneficial effects of the present invention are:
1. The invention determines the program context dependences between program segments and handles each case flexibly. It saves main-core waiting time, lets the main core and the core group process in parallel, and reduces the number of times core-group threads must be spawned and joined during program execution, which improves the execution efficiency of the program.
2. The invention transfers the loop upper and lower bounds, the constants, and the shared input data that do not change inside the program loops from the main core to the slave cores in a single transfer, which avoids transferring the same data repeatedly.
3. The invention introduces a "communication lock" synchronization mechanism. Communication with the main core can be synchronized in any of three ways: using 1, several, or all 64 core-group threads in the core group. The execution order of the program segments is adjusted flexibly for parallel optimization, which further reduces the number of core-group spawn and join operations; applications that call loop bodies many times save even more time.
4. The method used by the invention reduces the number of DMA transfers between the main core and the core group. The data of several segments that can be optimized in parallel can be transferred to the slave cores in one go, which greatly reduces the master-slave data transfer time, the most time-consuming part of parallel program optimization on the Sunway many-core processor; part of the data transferred to the slave cores can also be reused by slave-core programs executed later. Compared with optimization that does not use this method, the efficiency gain of the program segments after parallel optimization is obvious.
5. The invention also determines the program context dependences between program segments and program sub-segments and handles each case flexibly, which saves main-core waiting time and lets the main core and the core group process in parallel. One core group can process several program segments or program sub-segments at the same time, which improves the running efficiency of the program.
Brief description of the drawings
Fig. 1 is a schematic diagram of the hardware architecture of the Sunway many-core processor;
Fig. 2 is a schematic flow diagram of sequential execution when program context dependences exist among all three of program segment A, program segment B and program segment C;
Fig. 3 is a schematic flow diagram of program execution when no program context dependence exists between program sub-segment A1 and program sub-segment A2;
Fig. 4 is a schematic flow diagram of program execution when no program context dependence exists among program segment A, program segment B and program segment C;
Fig. 5 is a schematic flow diagram of program execution when a program context dependence exists between program segment A and program segment B, and no program context dependence exists between program segment A and program segment C or between program segment B and program segment C;
Fig. 6 is a schematic flow diagram of program execution when no program context dependence exists between program segment A and program segment B, and a program context dependence exists between program segment B and program segment C.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and the embodiments, but is not limited thereto.
Embodiment 1
An accelerated execution method based on the Sunway many-core processor runs on a computer and executes a program that comprises several program segments. Taking three program segments as an example, the technical solution summarizes and details the various cases of parallel programming on the Sunway many-core processor. Any three consecutive program segments are denoted program segment A, program segment B and program segment C, where program segment A and program segment C can be optimized in parallel (they can be executed on the slave cores) and program segment B cannot be optimized in parallel (it can only be executed on the main core). The method comprises the following steps:
I. Determine the program context dependences among program segment A, program segment B and program segment C. If program context dependences exist among all three, execute them sequentially: set communication lock AB and communication lock BC so that the main core and the core group share synchronization variables, and let the synchronization variables determine whether the main core or the core group runs or waits. As shown in Fig. 2, this comprises the following steps:
(1) Initialize the synchronization variables. Communication lock AB and communication lock BC are variables shared by the main core and the core group and are declared with the volatile keyword;
(2) Load program segment A and program segment C into the core group. The core group executes program segment A while communication lock AB locks the main core, so the main core waits;
(3) After program segment A finishes on the core group, use one or several core-group threads to synchronize the core-group thread data; the number of threads that perform the synchronization depends on the number of core-group threads actually used. Transfer the core-group data to the main core by DMA, notify the main core to execute program segment B, and lock program segment C on the core group;
(4) The main core executes program segment B. When it finishes, communication lock BC is unlocked to notify the core group to execute program segment C. After program segment C finishes on the core group, the core-group result data are returned to the main core;
The advantages of this design are: 1) the main core and the core group can synchronize through the communication-lock mechanism; 2) the number of core-group spawn and join operations is reduced; 3) if program segment C uses data that program segment A also uses, for example the same arrays, the number of DMA transfers between the main core and the core group is reduced, because these arrays can be reused in the later core-group program.
Otherwise, adjust the execution order of program segment A, program segment B and program segment C and then execute them;
II. Perform step I on each subsequent group of three consecutive program segments until the program has been executed.
Embodiment 2
The accelerated execution method based on the Sunway many-core processor according to Embodiment 1, with the difference that the following operations are performed before program segment A, program segment B and program segment C are executed:
a. Determine whether program segment A, program segment B or program segment C contains two or more program sub-segments. If not, execute the program segment directly; otherwise, go to step b;
b. Determine whether the two or more program sub-segments contain loop upper and lower bounds, constants, or shared input data that do not change inside the program loops. If not, execute the two or more program sub-segments in order; otherwise, go to step c;
c. Extract the loop upper and lower bounds, the constants, and the shared input data that do not change inside the program loops, transfer the extracted data from the main core to every slave core in a single transfer, and then execute the two or more program sub-segments. Such data are, for example, the data marked with underlining in the Background art. If several program segments all contain such data, the data can be extracted together and transferred to every slave core when program execution starts.
Taking the ocean model program Regional Ocean Modeling System (ROMS) as an example, the hotspot routine step2d.f90 contains 55 program segments to be optimized. With the previous method, data must be transferred for all 55 program segments; with the method provided by the present invention, the transfer data of multiple program segments are merged, and data need to be transferred for only 10 program segments. The efficiency of data transfer from the main core to the slave cores improves by 80%.
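The text does not say how the 80% figure is computed. One plausible reading, offered only as an assumption, is the fraction of per-segment data transfers eliminated by merging, which comes out close to that value:

```python
def transfer_reduction(before: int, after: int) -> float:
    """Fraction of data-transfer rounds eliminated by merging transfers."""
    return (before - after) / before


# 55 segments needed transfers before merging, 10 after (figures from the text).
reduction = transfer_reduction(55, 10)
print(f"{reduction:.1%}")  # prints 81.8%
```

Under this reading, (55 - 10) / 55 is about 81.8%, which is consistent with the roughly 80% improvement stated above.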
Embodiment 3
According to a kind of Accelerating running method based on Shen prestige many-core processor described in embodiment 1, difference is:
Execution phase A, comprises the following steps that
(5) whether determining program section A includes several program subsegments, if program segment A includes several program subsegments, into Enter step (6);Otherwise, (7) are entered step;
(6) two sections of program subsegments of wherein arbitrary continuation are set as program subsegment A1, program subsegment A2, to all programs Subsegment executes following steps until executed program segment A: program context between determining program subsegment A1, program subsegment A2 according to The relationship of relying, if there are program context dependence between program subsegment A1, program subsegment A2, successively sequential execution of programmed Subsegment A1, program subsegment A2;Otherwise, core group computing resource is distributed according to the data volume of program subsegment A1, program subsegment A2, parallel Execute program subsegment A1, program subsegment A2.Specific implementation procedure is as shown in Figure 3.
(7) if program segment A does not include several program subsegments, direct execution phase A.
Executing program segment C comprises the following steps:
(8) Determine whether program segment C comprises several program sub-segments; if so, proceed to step (9); otherwise, proceed to step (10);
(9) Take any two consecutive program sub-segments as program sub-segment C1 and program sub-segment C2, and perform the following on all program sub-segments until program segment C has been executed: determine the program context dependency between program sub-segment C1 and program sub-segment C2; if a program context dependency exists between them, execute program sub-segment C1 and then program sub-segment C2 sequentially; otherwise, allocate core-group computing resources according to the data volumes of program sub-segment C1 and program sub-segment C2, and execute the two sub-segments in parallel;
(10) If program segment C does not comprise several program sub-segments, execute program segment C directly.
Regarding program segments and program sub-segments: for example, a for loop may be treated as one program segment. If the loop body contains several lines of code whose computations are independent of one another, it can be divided into multiple program sub-segments that are processed in parallel; if the loop body consists of only a single line of code, it cannot be divided into program sub-segments, and the program segment is executed directly. Within a program segment, parallel execution of program sub-segments can be arranged according to the dependencies between them. Where a program segment contains multiple program sub-segments, these are executed serially on the core group; however, if there is no context dependency between the sub-segments and the amount of data computed by each individual sub-segment is small, they can be executed in parallel.
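The sub-segment handling in steps (5) to (10) can be sketched in Python as follows. This is only an illustrative sketch and not the implementation of the invention: the `SubSegment` class, the read/write-set dependency test, the proportional split of cores, and the figure of 64 slave cores per core group are all assumptions introduced here, and Python threads merely stand in for slave cores.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubSegment:
    """Hypothetical model of a program sub-segment."""
    name: str
    run: Callable[[int], None]       # called with the number of cores granted
    reads: frozenset = frozenset()   # variables the sub-segment reads
    writes: frozenset = frozenset()  # variables the sub-segment writes
    data_volume: int = 1             # relative amount of data processed


def depends(first: SubSegment, second: SubSegment) -> bool:
    """Assumed dependency test: any read/write or write/write overlap."""
    return bool(first.writes & (second.reads | second.writes)
                or first.reads & second.writes)


def split_cores(first: SubSegment, second: SubSegment, total: int = 64):
    """Divide the core group in proportion to data volume (each gets >= 1)."""
    share = round(total * first.data_volume
                  / (first.data_volume + second.data_volume))
    share = min(max(share, 1), total - 1)
    return share, total - share


def run_pair(first: SubSegment, second: SubSegment, total: int = 64) -> str:
    """Run two consecutive sub-segments sequentially or in parallel."""
    if depends(first, second):
        # Context dependency: keep the original order, one after the other.
        first.run(total)
        second.run(total)
        return "sequential"
    cores_first, cores_second = split_cores(first, second, total)
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(first.run, cores_first),
                   pool.submit(second.run, cores_second)]
        for future in futures:
            future.result()          # re-raise any error from a worker
    return "parallel"
```

In this sketch a sub-segment that processes three times as much data as its neighbour would receive 48 of the 64 assumed cores, while a dependent pair simply runs in order on the whole core group.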
Embodiment 4
The acceleration method based on the Shenwei many-core processor according to Embodiment 1, with the following differences:
If no program context dependency exists among program segment A, program segment B, and program segment C, the core group executes program segment A and program segment C while, at the same time, the main core executes program segment B, which cannot be parallel-optimized. In this case the execution order of the three program segments is adjusted so that program segment A and program segment C are executed together, which saves the time of one spawn and one join. Because the main-core program and the core-group program run simultaneously, the number of core-group spawn and join operations is reduced. Since there is no context dependency among the three program segments, no "communication lock" is needed for synchronization. Specifically, core-group threads are spawned to load program segment A and program segment C onto the core group, which executes them; at the same time, the main core executes program segment B. After program segment A, program segment C, and program segment B have all finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core. The details are shown in Figure 4.
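The overlap described for this case can be illustrated with a minimal Python sketch. It is not the Shenwei programming interface: an ordinary thread stands in for the core group, starting and joining that thread stand in for spawn and join, and the function name `run_overlapped` and its callable arguments are hypothetical.

```python
import threading


def run_overlapped(segment_a, segment_b, segment_c):
    """Run A and C on a stand-in core group while B runs on the 'main core'."""
    results = {}

    def core_group_work():
        # The core group executes A and then C without returning in between.
        results["A"] = segment_a()
        results["C"] = segment_c()

    core_group = threading.Thread(target=core_group_work)
    core_group.start()              # stands in for a single spawn
    results["B"] = segment_b()      # the main core works concurrently
    core_group.join()               # stands in for a single join
    return results
```

Only one start and one join are issued for the two core-group segments, which mirrors the saving of one spawn and one join described above.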
If a program context dependency exists between program segment A and program segment B, none exists between program segment A and program segment C, and none exists between program segment B and program segment C, then program segment A and program segment C are executed first, and program segment B is executed after their results have been returned. The specific execution process is shown in Figure 5. Core-group threads are spawned to load program segment A and program segment C onto the core group, which executes them; after program segment A and program segment C have finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core, and the main core then executes program segment B, which cannot be parallel-optimized.
If no program context dependency exists between program segment A and program segment B, and a program context dependency exists between program segment B and program segment C, then program segment B is executed first, and program segment A and program segment C are executed after its result has been returned. Parallel optimization is achieved by adjusting the execution order of the program segments, which reduces the number of core-group spawn and join operations. The specific execution process is shown in Figure 6. The main core executes program segment B; after program segment B has finished, core-group threads are spawned to load program segment A and program segment C onto the core group, which executes them; after program segment A and program segment C have finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core.
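The choice among the cases described in this embodiment can be summarised as a small decision function. The following Python sketch is an illustration under stated assumptions: the function name `plan_execution`, the string labels, and the fallback for dependency combinations that the text does not describe are all hypothetical.

```python
def plan_execution(dep_ab: bool, dep_bc: bool, dep_ac: bool):
    """Return the ordered phases for three consecutive program segments."""
    original_order = ["core group: A", "main core: B", "core group: C"]
    if dep_ab and dep_bc and dep_ac:
        # All three depend on one another: keep the original order and
        # synchronise main core and core group with communication locks.
        return original_order
    if not (dep_ab or dep_bc or dep_ac):
        # No dependency at all: A and C overlap with B.
        return ["core group: A, C || main core: B"]
    if dep_ab and not dep_bc and not dep_ac:
        # B needs the result of A: run A and C first, then B.
        return ["core group: A, C", "main core: B"]
    if not dep_ab and dep_bc:
        # C needs the result of B: run B first, then A and C.
        # The text states no condition on A and C for this case; the core
        # group still runs A before C, so their order is preserved.
        return ["main core: B", "core group: A, C"]
    # Combination not described in the text: keep the original order
    # (an assumption made for this sketch only).
    return original_order
```

Each reordered plan needs a single spawn and a single join for program segments A and C together, instead of one pair per segment.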
In this embodiment, the design of the present invention was tested experimentally with the ocean numerical model program Parallel Ocean Program (POP). The test environment was the Sunway TaihuLight supercomputer. POP was used to simulate temperature changes of the global ocean over 5 model days at a test scale of 10,000 processes, and program segments in advu and hmix_del4 of the POP program were optimized. One loop body in hmix_del4 is called 900,000 times by a single process, and a single core-group spawn and join takes at least 22,646 nanoseconds. With the method of this embodiment, the spawn and join for this loop body, that is, for the program segment containing it, can be omitted, ultimately saving 20.34 seconds. The running time of the program module containing this program segment is 1020 seconds, so optimizing this single program segment alone saves 2% of the module's running time. Real applications contain a very large number of similar program segments that can be optimized, and programs of this kind generally require long numerical simulations on supercomputers, so the cumulative time saved is considerable.
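The figures quoted above can be cross-checked with a short calculation. The sketch below multiplies the call count by the minimum spawn and join cost, which gives roughly 20.38 seconds, close to the 20.34 seconds reported (the text does not explain the small difference), and about 2% of the 1020-second module running time. The variable names are illustrative only.

```python
CALLS_PER_PROCESS = 900_000   # calls to the loop body by one process
SPAWN_JOIN_NS = 22_646        # minimum cost of one spawn and join, in ns
MODULE_RUNTIME_S = 1020       # running time of the enclosing module, in s

# Total spawn/join overhead avoided, converted from nanoseconds to seconds.
saved_s = CALLS_PER_PROCESS * SPAWN_JOIN_NS / 1e9
# Share of the module running time that this overhead represents.
saved_fraction = saved_s / MODULE_RUNTIME_S

print(f"estimated saving: {saved_s:.2f} s")
print(f"share of module running time: {saved_fraction:.1%}")
```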
For each of the cases described in the present invention, three sequentially executed program segments were selected: program segment A, program segment B, and program segment C. Program segment A and program segment C can be parallel-optimized (they can be placed on the slave cores for execution), whereas program segment B cannot be parallel-optimized (it can only be executed on the main core). Each program segment was optimized according to the method of the present invention. Table 1 compares the efficiency of the program after applying the method of the present invention with that of the unoptimized program and of the program optimized by the original method.
Table 1
Compared with the original method, the parallel optimization method of the present invention improves efficiency markedly: by at least 16.7% and by up to 67.6%.

Claims (10)

1. An acceleration method based on a Shenwei many-core processor, which runs on a computer and executes a program, the program comprising several program segments, any three consecutive program segments being taken as program segment A, program segment B, and program segment C, characterized by comprising the following steps:
I. Determine the program context dependencies among program segment A, program segment B, and program segment C. If program context dependencies exist among all three, execute them sequentially: set up communication lock AB and communication lock BC so that the main core and the core group share synchronization variables, and use the synchronization variables to determine the running or waiting state of the main core or the core group, comprising the following steps:
(1) initialize the synchronization variables; communication lock AB and communication lock BC are variables shared by the main core and the core group;
(2) load program segment A and program segment C onto the core group; the core group executes program segment A while communication lock AB is used to lock the main core, which waits at this point;
(3) after program segment A has finished on the core group, one or several core-group threads synchronize the core-group thread data, the core-group data is transferred to the main core by DMA, the main core is notified to execute program segment B, and program segment C on the core group is locked;
(4) the main core executes program segment B; on completion, communication lock BC is unlocked to notify the core group to execute program segment C; after program segment C has finished on the core group, the core-group operation data is returned to the main core;
Otherwise, adjust the execution order of program segment A, program segment B, and program segment C, and execute them;
II. Perform step I on the following three consecutive program segments, and so on until the program has been executed.
2. The acceleration method based on a Shenwei many-core processor according to claim 1, characterized in that, before program segment A, program segment B, and program segment C are executed, the following operations are performed:
a. Determine whether program segment A, program segment B, or program segment C comprises two or more program sub-segments; if not, execute the program segment directly; otherwise, proceed to step b;
b. Determine whether the two or more program sub-segments share loop upper and lower bounds, constants, or input data that do not change within the program loop; if not, execute the two or more program sub-segments in order; otherwise, proceed to step c;
c. Extract the shared loop upper and lower bounds, constants, and input data that do not change within the program loop, transfer them from the main core to each slave core in a single pass, and execute the two or more program sub-segments.
3. The acceleration method based on a Shenwei many-core processor according to claim 1, characterized in that executing program segment A comprises the following steps:
(5) determine whether program segment A comprises several program sub-segments; if so, proceed to step (6); otherwise, proceed to step (7);
(6) take any two consecutive program sub-segments as program sub-segment A1 and program sub-segment A2, and perform the following on all program sub-segments until program segment A has been executed: determine the program context dependency between program sub-segment A1 and program sub-segment A2; if a program context dependency exists between them, execute program sub-segment A1 and then program sub-segment A2 sequentially; otherwise, allocate core-group computing resources according to the data volumes of program sub-segment A1 and program sub-segment A2, and execute the two sub-segments in parallel;
(7) if program segment A does not comprise several program sub-segments, execute program segment A directly.
4. The acceleration method based on a Shenwei many-core processor according to claim 1, characterized in that executing program segment C comprises the following steps:
(8) determine whether program segment C comprises several program sub-segments; if so, proceed to step (9); otherwise, proceed to step (10);
(9) take any two consecutive program sub-segments as program sub-segment C1 and program sub-segment C2, and perform the following on all program sub-segments until program segment C has been executed: determine the program context dependency between program sub-segment C1 and program sub-segment C2; if a program context dependency exists between them, execute program sub-segment C1 and then program sub-segment C2 sequentially; otherwise, allocate core-group computing resources according to the data volumes of program sub-segment C1 and program sub-segment C2, and execute the two sub-segments in parallel;
(10) if program segment C does not comprise several program sub-segments, execute program segment C directly.
5. The acceleration method based on a Shenwei many-core processor according to claim 1, characterized in that, if no program context dependency exists among program segment A, program segment B, and program segment C, the core group executes program segment A and program segment C while, at the same time, the main core executes program segment B, which cannot be parallel-optimized.
6. The acceleration method based on a Shenwei many-core processor according to claim 5, characterized in that core-group threads are spawned to load program segment A and program segment C onto the core group, which executes them, while, at the same time, the main core executes program segment B, which cannot be parallel-optimized; after program segment A, program segment C, and program segment B have all finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core.
7. The acceleration method based on a Shenwei many-core processor according to claim 1, characterized in that, if a program context dependency exists between program segment A and program segment B, none exists between program segment A and program segment C, and none exists between program segment B and program segment C, then program segment A and program segment C are executed first, and program segment B is executed after their results have been returned.
8. The acceleration method based on a Shenwei many-core processor according to claim 7, characterized in that core-group threads are spawned to load program segment A and program segment C onto the core group, which executes them; after program segment A and program segment C have finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core, and the main core executes program segment B, which cannot be parallel-optimized.
9. The acceleration method based on a Shenwei many-core processor according to any one of claims 1 to 8, characterized in that, if no program context dependency exists between program segment A and program segment B, and a program context dependency exists between program segment B and program segment C, then program segment B is executed first, and program segment A and program segment C are executed after its result has been returned.
10. The acceleration method based on a Shenwei many-core processor according to claim 9, characterized in that the main core executes program segment B; after program segment B has finished, core-group threads are spawned to load program segment A and program segment C onto the core group, which executes them; after program segment A and program segment C have finished, the core-group threads are joined and the results of program segment A and program segment C are returned to the main core.
CN201910536855.9A 2019-06-20 2019-06-20 Acceleration operation method based on Shenwei many-core processor Active CN110222007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910536855.9A CN110222007B (en) 2019-06-20 2019-06-20 Acceleration operation method based on Shenwei many-core processor

Publications (2)

Publication Number Publication Date
CN110222007A true CN110222007A (en) 2019-09-10
CN110222007B CN110222007B (en) 2023-11-24

Family

ID=67814362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910536855.9A Active CN110222007B (en) 2019-06-20 2019-06-20 Acceleration operation method based on Shenwei many-core processor

Country Status (1)

Country Link
CN (1) CN110222007B (en)


Also Published As

Publication number Publication date
CN110222007B (en) 2023-11-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant