CN110222007A - Accelerated execution method based on the Sunway (Shenwei) many-core processor - Google Patents
- Publication number
- CN110222007A (application number CN201910536855.9A)
- Authority
- CN
- China
- Prior art keywords
- program
- program segment
- segment
- core
- subsegment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention relates to an accelerated execution method based on the Sunway many-core processor, comprising: A. determining the program context dependencies among program segment A, program segment B, and program segment C; if program context dependencies exist among all three, executing them in order; otherwise, adjusting the execution order of program segment A, program segment B, and program segment C and then executing them; B. repeating step A for each subsequent group of three consecutive program segments until the whole program has been executed. The invention determines the program context dependencies between program segments and between program subsegments, handles each case flexibly, and introduces a communication-lock synchronization mechanism. This saves main-core waiting time, lets the main core and the core group work in parallel, reduces the number of times core-group threads must be spawned and joined during execution, and improves the execution efficiency of the program.
Description
Technical field
The present invention relates to the technical fields of high-performance computing, parallel computing, and computer architecture, and in particular to an accelerated execution method based on the Sunway many-core processor.
Background
The Sunway many-core processor is a representative domestic high-performance processor: a high-performance computing chip developed independently in China. The Sunway TaihuLight supercomputer, which ranks among the most powerful in the world, uses more than 40,000 Sunway many-core processors.
Each Sunway many-core processor chip (SW26010) contains 4 core groups connected by a network-on-chip. Each core group consists mainly of a memory controller, a management unit, 1 main core, and 64 slave cores. The 64 slave cores are connected in an 8 × 8 mesh topology. Each slave core in each core group has 64 KB of local store, as shown in Figure 1.
Because the Sunway many-core processor has many slave cores, and the local store of each slave core is very limited, the memory wall problem is especially pronounced when the processor is used. In current practice, three problems stand out.
First, the computing resources of the core group are under-utilized. Relative to the powerful computing capability of the core group, the data transfer bandwidth between the main core and the core group is limited and the local store is too small. If the core group does not receive enough data, it idles for long periods and its computing resources are wasted. This is especially true of large-scale distributed programs run on supercomputers, which use a very large number of nodes, typically counted in tens of thousands. The amount of data assigned to the process on each node is therefore limited, and so is the data available to a single function body being optimized for main/slave computation, which leads to low utilization of core-group computing resources.
Second, the data transfer time between main memory and the core group is long relative to the time the core group needs to access local store. Taking the SW26010 as an example, the main core and the slave cores run at 1.5 GHz, so each clock cycle takes 0.67 nanoseconds. One main-memory access by the core group has a latency of 278 clock cycles (186.26 nanoseconds), whereas one local-store access has a latency of only 4 clock cycles (2.68 nanoseconds). Accessing main memory therefore costs the core group tens of times as much as accessing local store, which makes main-memory access an inefficient operation.
Third, the core group is started and terminated frequently. To start a core-group computation, the main core must spawn the core-group threads; a single spawn takes 26500 clock cycles (17755 nanoseconds). After the core-group computation ends, the main core must join the core-group threads and collect the core-group data; a single join takes 7300 clock cycles (4891 nanoseconds). In a large-scale distributed program running on a supercomputer, very many function bodies can be optimized in parallel, and each optimized function body is called many times, so spawn and join operations number in the hundreds of millions or billions. If the core group is started repeatedly, core-group threads must be spawned and joined frequently, which lowers the overall efficiency of the program.
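The cycle and nanosecond figures above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it reuses the rounded cycle time of 0.67 ns quoted in the text (the exact period at 1.5 GHz is closer to 0.667 ns), and the names CYCLE_NS, COSTS_IN_CYCLES, and to_ns are made up for this example.

```python
# Cycle counts quoted in the text for the SW26010.
# CYCLE_NS is the text's rounded cycle time for a 1.5 GHz clock.
CYCLE_NS = 0.67

COSTS_IN_CYCLES = {
    "main memory access": 278,
    "local store access": 4,
    "spawn": 26500,
    "join": 7300,
}


def to_ns(cycles):
    """Convert a cycle count to nanoseconds using the rounded cycle time."""
    return round(cycles * CYCLE_NS, 2)


costs_ns = {name: to_ns(cycles) for name, cycles in COSTS_IN_CYCLES.items()}

# How many local-store accesses fit in the time of one main-memory access.
memory_to_local_ratio = (
    COSTS_IN_CYCLES["main memory access"] / COSTS_IN_CYCLES["local store access"]
)

# Cost of one spawn/join pair, in cycles.
spawn_join_cycles = COSTS_IN_CYCLES["spawn"] + COSTS_IN_CYCLES["join"]

for name, ns in costs_ns.items():
    print(f"{name}: {COSTS_IN_CYCLES[name]} cycles = {ns} ns")
print(f"main memory / local store latency ratio: {memory_to_local_ratio}")
print(f"one spawn + join pair: {spawn_join_cycles} cycles")
```

On these numbers a single spawn/join pair costs as much as more than a hundred main-memory accesses, which is why the method concentrates on reducing how often the core group is started.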
In addition, current parallel program optimization on the Sunway many-core processor mainly targets individual program segments, and most of these segments are loops. Under the earlier, mechanical form of optimization, each optimized loop first requires its variables to be transferred from the main core to the slave cores, so at the start of every optimized loop these data are transferred from the main core to the slave cores. However, several program segments to be optimized may use the same data, such as loop upper and lower bounds, constants, and shared input data that does not change during the program loop, for example the data marked with underlines below.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides an accelerated execution method based on the Sunway many-core processor.
Explanation of terms:
Program context dependency: in the present invention, program context refers to code segments that are executed in sequence. If the next code segment does not use the data output by the previous code segment, the two code segments are said to have no program context dependency; if the next code segment needs the data output by the previous code segment, the two code segments are said to have a program context dependency.
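The definition above can be expressed as a simple check on what each code segment reads and writes. The following is a minimal sketch, not part of the patent: the Segment class and the depends_on function are hypothetical names, and only the case named in the definition (a later segment using data output by an earlier one) is tested; other hazards, such as two segments writing the same variable, are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A code segment described by the variables it reads and writes."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def depends_on(later, earlier):
    """True if `later` uses data that `earlier` outputs."""
    return bool(later.reads & earlier.writes)


seg_a = Segment("A", reads=frozenset({"x"}), writes=frozenset({"y"}))
seg_b = Segment("B", reads=frozenset({"y"}), writes=frozenset({"z"}))
seg_c = Segment("C", reads=frozenset({"x"}), writes=frozenset({"w"}))

print(depends_on(seg_b, seg_a))  # B reads y, which A writes
print(depends_on(seg_c, seg_a))  # C reads only x, which A does not write
```

In this toy example B has a program context dependency on A, while C does not depend on A or B.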
The technical solution of the present invention is as follows:
An accelerated execution method based on the Sunway many-core processor runs on a computer and executes a program that comprises several program segments. Taking three program segments as an example, the technical solution covers the various situations that arise in parallel programming on the Sunway many-core processor. Any three consecutive program segments are designated program segment A, program segment B, and program segment C, where program segments A and C can be optimized in parallel (they can be placed on the slave cores for execution) and program segment B cannot (it can run only on the main core). The method comprises the following steps:
I. Determine the program context dependencies among program segment A, program segment B, and program segment C. If program context dependencies exist among all three, execute them in order: set up communication lock AB and communication lock BC as synchronization variables shared by the main core and the core group, so that the synchronization variables determine whether the main core or the core group runs or waits. This comprises the following steps:
(1) Initialize the synchronization variables. Communication lock AB and communication lock BC are variables shared by the main core and the core group, and are declared with the volatile keyword;
(2) Load program segment A and program segment C onto the core group. The core group executes program segment A while communication lock AB locks the main core, which waits;
(3) After program segment A finishes on the core group, use 1 or several core-group threads to synchronize the core-group thread data; the number of threads that perform this synchronization depends on the number of core-group threads actually in use. Transfer the core-group data to the main core by DMA, notify the main core to execute program segment B, and lock program segment C on the core group;
(4) The main core executes program segment B. When it finishes, communication lock BC is unlocked, which notifies the core group to execute program segment C. After program segment C finishes on the core group, the core-group results are returned to the main core;
This design has the following advantages: 1) the main core and the core group can synchronize through the communication-lock mechanism; 2) the number of core-group spawn and join operations is reduced; 3) if program segment C uses some of the same data as program segment A, for example the same arrays, the number of DMA transfers between the main core and the core group is reduced, because those arrays can be reused by the later core-group program.
Otherwise, adjust the execution order of program segment A, program segment B, and program segment C, and execute them;
II. Repeat step I for each subsequent group of three consecutive program segments until the whole program has been executed.
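The locked hand-off in steps (1) to (4) can be pictured with an ordinary host-side simulation. The sketch below is not code for the Sunway runtime and makes several assumptions: a Python thread stands in for the core group, two threading.Event objects stand in for the volatile shared variables communication lock AB and communication lock BC, a dictionary stands in for the DMA transfers, and run_dependent_chain and the segment callables are hypothetical names.

```python
import threading


def run_dependent_chain(seg_a, seg_b, seg_c):
    """Run A (core group), B (main core), C (core group) with one spawn and one join."""
    lock_ab = threading.Event()  # set once A's result is available to the main core
    lock_bc = threading.Event()  # set once B's result is available to the core group
    shared = {}                  # stands in for data moved between main core and core group
    log = []                     # records the order in which segments finish

    def core_group():
        shared["a"] = seg_a()             # step (2): core group executes A
        log.append("A")
        lock_ab.set()                     # step (3): release the waiting main core
        lock_bc.wait()                    # C stays locked until B has finished
        shared["c"] = seg_c(shared["b"])  # step (4): core group executes C
        log.append("C")

    worker = threading.Thread(target=core_group)
    worker.start()                    # the only "spawn": A and C are loaded together
    lock_ab.wait()                    # main core waits on communication lock AB
    shared["b"] = seg_b(shared["a"])  # main core executes B
    log.append("B")
    lock_bc.set()                     # unlock communication lock BC
    worker.join()                     # the only "join"
    return shared["c"], log


result, order = run_dependent_chain(
    lambda: 2,           # A produces a value
    lambda a: a + 3,     # B depends on A
    lambda b: b * 10,    # C depends on B
)
print(result, order)
```

The point of the design is visible in the structure: A and C share a single start and a single join of the worker, and the two events replace the second spawn/join pair that would otherwise be needed between B and C.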
Preferably according to the present invention, the following operations are performed before program segment A, program segment B, and program segment C are executed:
a. Determine whether program segment A, program segment B, or program segment C contains two or more program subsegments. If not, execute the program segment directly; otherwise, go to step b;
b. Determine whether the two or more program subsegments contain loop upper and lower bounds, constants, or shared input data that does not change during the program loop. If not, execute the two or more program subsegments in order; otherwise, go to step c;
c. Extract the loop upper and lower bounds, constants, and shared input data that does not change during the program loop, transfer the extracted data from the main core to every slave core in a single pass, and then execute the two or more program subsegments. Such data are, for example, the data marked with underlines in the Background section. If several program segments all contain such data, the data can be extracted together and transferred to every slave core when program execution starts.
Each slave core has 64 KB of local store. Storing these data usually takes no more than 3 KB, so it does not affect normal computation on the slave cores.
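Steps a to c amount to separating the inputs that never change from the inputs that are specific to each segment. The sketch below illustrates that split under simplifying assumptions: every input is treated as one transfer, any invariant input used by at least one segment is hoisted, and the function names and the variable names lo, hi, dt, u, and v are invented for the example.

```python
def plan_transfers(segment_inputs, invariant_names):
    """Return (shared, per_segment): data sent once vs. data sent per segment."""
    invariants = set(invariant_names)
    shared = set()
    for inputs in segment_inputs.values():
        shared |= set(inputs) & invariants
    per_segment = {
        name: sorted(set(inputs) - shared)
        for name, inputs in segment_inputs.items()
    }
    return sorted(shared), per_segment


def count_transfers(segment_inputs, invariant_names):
    """Compare per-segment transfers with the merged plan (one item = one transfer)."""
    shared, per_segment = plan_transfers(segment_inputs, invariant_names)
    naive = sum(len(set(inputs)) for inputs in segment_inputs.values())
    merged = len(shared) + sum(len(rest) for rest in per_segment.values())
    return naive, merged


# Two loops that share their bounds (lo, hi) and a constant (dt).
segments = {
    "loop1": ["lo", "hi", "dt", "u"],
    "loop2": ["lo", "hi", "dt", "v"],
}
shared, per_segment = plan_transfers(segments, invariant_names=["lo", "hi", "dt"])
naive, merged = count_transfers(segments, invariant_names=["lo", "hi", "dt"])
print(shared, per_segment, naive, merged)
```

A real implementation would also have to check that the hoisted data fits in the slave core's local store, which the text estimates at no more than about 3 KB of the 64 KB available.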
Preferably according to the present invention, executing program segment A comprises the following steps:
(5) Determine whether program segment A contains several program subsegments. If it does, go to step (6); otherwise, go to step (7);
(6) Designate any two consecutive program subsegments as program subsegment A1 and program subsegment A2, and perform the following for all program subsegments until program segment A has been executed: determine the program context dependency between program subsegment A1 and program subsegment A2. If a program context dependency exists between them, execute program subsegment A1 and then program subsegment A2 in order; otherwise, allocate core-group computing resources according to the data volumes of program subsegment A1 and program subsegment A2, and execute the two subsegments in parallel.
(7) If program segment A does not contain several program subsegments, execute program segment A directly.
Preferably according to the present invention, executing program segment C comprises the following steps:
(8) Determine whether program segment C contains several program subsegments. If it does, go to step (9); otherwise, go to step (10);
(9) Designate any two consecutive program subsegments as program subsegment C1 and program subsegment C2, and perform the following for all program subsegments until program segment C has been executed: determine the program context dependency between program subsegment C1 and program subsegment C2. If a program context dependency exists between them, execute program subsegment C1 and then program subsegment C2 in order; otherwise, allocate core-group computing resources according to the data volumes of program subsegment C1 and program subsegment C2, and execute the two subsegments in parallel;
(10) If program segment C does not contain several program subsegments, execute program segment C directly.
Regarding program segments and program subsegments: a for loop, for example, is treated as one program segment. If the loop body contains several lines of mutually independent computation, it can be divided into multiple program subsegments that are processed in parallel; if the loop body contains only one line of code, it cannot be divided into program subsegments and the program segment is executed directly. Within a program segment, program subsegments can be set to execute in parallel according to the dependencies between them. When a program segment contains multiple program subsegments, they are normally executed serially on the core group; if there is no context dependency between the subsegments and the amount of data each one computes on is small, the subsegments can be executed in parallel.
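The text says that core-group computing resources are allocated "according to the data volume" of independent subsegments but does not give a formula. One plausible reading, shown here purely as an assumption, is to divide the 64 slave cores in proportion to each subsegment's data volume while guaranteeing every subsegment at least one core. The function name allocate_cores and the sample volumes are hypothetical, and all volumes are assumed to be positive.

```python
def allocate_cores(volumes, total_cores=64):
    """Split total_cores among subsegments in proportion to their data volume.

    Assumes every volume is positive. Each subsegment gets at least one core.
    """
    names = list(volumes)
    if len(names) > total_cores:
        raise ValueError("more subsegments than slave cores")
    total = sum(volumes.values())
    # Start with one core each so that no subsegment is starved.
    shares = {name: 1 for name in names}
    spare = total_cores - len(names)
    exact = {name: spare * volumes[name] / total for name in names}
    for name in names:
        shares[name] += int(exact[name])
    # Hand out any cores lost to rounding, largest fractional part first.
    leftover = total_cores - sum(shares.values())
    by_fraction = sorted(
        names, key=lambda name: exact[name] - int(exact[name]), reverse=True
    )
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares


equal_split = allocate_cores({"A1": 100, "A2": 100})
uneven_split = allocate_cores({"A1": 300, "A2": 100})
print(equal_split, uneven_split)
```

Whatever rule is used in practice, the allocation has to use each slave core at most once, which is why the sketch checks that the shares sum to the number of cores available.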
Preferably according to the present invention, if no program context dependency exists among program segment A, program segment B, and program segment C, the core group executes program segment A and program segment C while the main core executes program segment B, which cannot be optimized in parallel. In this case the execution order of the three program segments is adjusted: program segments A and C are executed together, which saves the time of one spawn and one join. While the core group executes program segments A and C, the main core executes program segment B, so the main-core program and the core-group program run at the same time and the number of core-group spawn and join operations is reduced. Because there is no context dependency among the three program segments, no communication lock is needed for synchronization.
More preferably, core-group threads are spawned to load program segment A and program segment C onto the core group, and the core group executes them; at the same time, the main core executes program segment B, which cannot be optimized in parallel. After program segment A, program segment C, and program segment B have all finished, the core-group threads are joined and the results of program segments A and C are returned to the main core.
Preferably according to the present invention, if a program context dependency exists between program segment A and program segment B, no program context dependency exists between program segment A and program segment C, and no program context dependency exists between program segment B and program segment C, then program segment A and program segment C are executed first, and program segment B is executed after their results are returned.
More preferably, core-group threads are spawned to load program segment A and program segment C onto the core group, and the core group executes them. After program segments A and C finish, the core-group threads are joined and the results of program segments A and C are returned to the main core, which then executes program segment B, which cannot be optimized in parallel.
Preferably according to the present invention, if no program context dependency exists between program segment A and program segment B, and a program context dependency exists between program segment B and program segment C, then program segment B is executed first, and program segments A and C are executed after its result is returned. Adjusting the execution order of the program segments in this way enables parallel optimization and reduces the number of core-group spawn and join operations. The specific execution process is shown in Figure 6.
More preferably, the main core executes program segment B. After program segment B finishes, core-group threads are spawned to load program segment A and program segment C onto the core group, and the core group executes them. After program segments A and C finish, the core-group threads are joined and the results of program segments A and C are returned to the main core.
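The four cases described above can be collected into one small decision function. This is a sketch rather than the patent's procedure: the function name choose_schedule and the stage labels are made up, and any combination of dependencies that the text does not spell out is assumed to fall back to the fully ordered, communication-lock path.

```python
def choose_schedule(dep_ab, dep_bc, dep_ac):
    """Return the execution stages for segments A, B, C given their dependencies.

    dep_ab, dep_bc, dep_ac: whether a program context dependency exists
    between A and B, B and C, and A and C respectively.
    """
    if not (dep_ab or dep_bc or dep_ac):
        # No dependencies at all: core group and main core run concurrently.
        return ["A+C on core group || B on main core"]
    if dep_ab and not dep_bc and not dep_ac:
        # B needs A's result only: run A and C together, then B.
        return ["A+C on core group", "B on main core"]
    if dep_bc and not dep_ab:
        # C needs B's result, B does not need A: run B first, then A and C.
        return ["B on main core", "A+C on core group"]
    # All remaining cases (including all three dependent): ordered execution
    # synchronized with communication locks AB and BC. Treating the unlisted
    # combinations this way is an assumption of this sketch.
    return ["A on core group", "B on main core", "C on core group"]


print(choose_schedule(False, False, False))
print(choose_schedule(True, False, False))
print(choose_schedule(False, True, False))
print(choose_schedule(True, True, True))
```

In every branch program segments A and C are loaded onto the core group together, so one spawn and one join cover both segments instead of a separate pair for each.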
The beneficial effects of the present invention are as follows:
1. The invention determines the program context dependencies between program segments and handles each case flexibly. This saves main-core waiting time, lets the main core and the core group work in parallel, reduces the number of times core-group threads must be spawned and joined during execution, and improves the execution efficiency of the program.
2. The invention transfers loop upper and lower bounds, constants, and shared input data that does not change during the program loop from the main core to the slave cores in a single pass, which avoids repeated data transfers.
3. The invention introduces a "communication lock" synchronization mechanism. Communication with the main core can be synchronized in any of three ways: using 1 core-group thread, several core-group threads, or all 64 core-group threads. Combined with flexible adjustment of the execution order of program segments for parallel optimization, this further reduces the number of core-group spawn and join operations; applications that call a loop body many times save even more time.
4. The method used by the invention reduces the number of DMA transfers between the main core and the core group. The data of several parallelizable segments can be passed to the slave cores at once, which greatly reduces the main-to-slave data transfer time, the most time-consuming part of parallel program optimization on the Sunway many-core processor. Part of the data already passed to the slave cores can also be reused by slave-core programs executed later. Compared with parallel optimization that does not use this method, the efficiency gain is evident.
5. The invention also determines the program context dependencies between program segments and program subsegments and handles each case flexibly, which saves main-core waiting time and lets the main core and the core group work in parallel. One core group can process several program segments or program subsegments at the same time, which improves the running efficiency of the program.
Brief description of the drawings
Fig. 1 is a schematic diagram of the hardware architecture of the Sunway many-core processor;
Fig. 2 is a flow diagram of sequential execution when program context dependencies exist among all of program segment A, program segment B, and program segment C;
Fig. 3 is a flow diagram of program execution when no program context dependency exists between program subsegment A1 and program subsegment A2;
Fig. 4 is a flow diagram of program execution when no program context dependency exists among program segment A, program segment B, and program segment C;
Fig. 5 is a flow diagram of program execution when a program context dependency exists between program segment A and program segment B, and no program context dependency exists between program segments A and C or between program segments B and C;
Fig. 6 is a flow diagram of program execution when no program context dependency exists between program segment A and program segment B, and a program context dependency exists between program segment B and program segment C.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and embodiments, but is not limited thereto.
Embodiment 1
An accelerated execution method based on the Sunway many-core processor runs on a computer and executes a program that comprises several program segments. Taking three program segments as an example, the technical solution covers the various situations that arise in parallel programming on the Sunway many-core processor. Any three consecutive program segments are designated program segment A, program segment B, and program segment C, where program segments A and C can be optimized in parallel (they can be placed on the slave cores for execution) and program segment B cannot (it can run only on the main core). The method comprises the following steps:
I. Determine the program context dependencies among program segment A, program segment B, and program segment C. If program context dependencies exist among all three, execute them in order: set up communication lock AB and communication lock BC as synchronization variables shared by the main core and the core group, so that the synchronization variables determine whether the main core or the core group runs or waits. As shown in Figure 2, this comprises the following steps:
(1) Initialize the synchronization variables. Communication lock AB and communication lock BC are variables shared by the main core and the core group, and are declared with the volatile keyword;
(2) Load program segment A and program segment C onto the core group. The core group executes program segment A while communication lock AB locks the main core, which waits;
(3) After program segment A finishes on the core group, use 1 or several core-group threads to synchronize the core-group thread data; the number of threads that perform this synchronization depends on the number of core-group threads actually in use. Transfer the core-group data to the main core by DMA, notify the main core to execute program segment B, and lock program segment C on the core group;
(4) The main core executes program segment B. When it finishes, communication lock BC is unlocked, which notifies the core group to execute program segment C. After program segment C finishes on the core group, the core-group results are returned to the main core;
This design has the following advantages: 1) the main core and the core group can synchronize through the communication-lock mechanism; 2) the number of core-group spawn and join operations is reduced; 3) if program segment C uses some of the same data as program segment A, for example the same arrays, the number of DMA transfers between the main core and the core group is reduced, because those arrays can be reused by the later core-group program.
Otherwise, adjust the execution order of program segment A, program segment B, and program segment C, and execute them;
II. Repeat step I for each subsequent group of three consecutive program segments until the whole program has been executed.
Embodiment 2
The accelerated execution method based on the Sunway many-core processor according to Embodiment 1, with the following difference: before program segment A, program segment B, and program segment C are executed, the following operations are performed:
a. Determine whether program segment A, program segment B, or program segment C contains two or more program subsegments. If not, execute the program segment directly; otherwise, go to step b;
b. Determine whether the two or more program subsegments contain loop upper and lower bounds, constants, or shared input data that does not change during the program loop. If not, execute the two or more program subsegments in order; otherwise, go to step c;
c. Extract the loop upper and lower bounds, constants, and shared input data that does not change during the program loop, transfer the extracted data from the main core to every slave core in a single pass, and then execute the two or more program subsegments. Such data are, for example, the data marked with underlines in the Background section. If several program segments all contain such data, the data can be extracted together and transferred to every slave core when program execution starts.
Take the ocean model program Regional Ocean Modeling System (ROMS) as an example. Its hotspot program step2d.f90 contains 55 program segments to be optimized. With the previous approach, data must be transferred for each of the 55 program segments. With the method provided by the present invention, the data transfers of multiple program segments to be optimized are merged, so that only the data transfers of 10 program segments are needed. The efficiency of data transfer from the main core to the slave cores improves by 80%.
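The text does not say how the 80% figure was obtained. One reading, offered here only as an assumption, is that it rounds the fraction of per-segment transfers that merging eliminates. The short calculation below uses the two counts reported in the text; the variable names are made up.

```python
# Counts reported in the text for step2d.f90.
segments_before = 55  # program segments whose data was transferred individually
segments_after = 10   # segment transfers remaining after merging

saved = segments_before - segments_after
reduction = saved / segments_before

print(f"{saved} segment transfers avoided, {reduction:.1%} fewer")
```

Under this reading the reduction is a little under 82%, which is consistent with the roughly 80% improvement stated in the text.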
Embodiment 3
According to a kind of Accelerating running method based on Shen prestige many-core processor described in embodiment 1, difference is:
Execution phase A, comprises the following steps that
(5) whether determining program section A includes several program subsegments, if program segment A includes several program subsegments, into
Enter step (6);Otherwise, (7) are entered step;
(6) two sections of program subsegments of wherein arbitrary continuation are set as program subsegment A1, program subsegment A2, to all programs
Subsegment executes following steps until executed program segment A: program context between determining program subsegment A1, program subsegment A2 according to
The relationship of relying, if there are program context dependence between program subsegment A1, program subsegment A2, successively sequential execution of programmed
Subsegment A1, program subsegment A2;Otherwise, core group computing resource is distributed according to the data volume of program subsegment A1, program subsegment A2, parallel
Execute program subsegment A1, program subsegment A2.Specific implementation procedure is as shown in Figure 3.
(7) if program segment A does not include several program subsegments, direct execution phase A.
Executing program segment C comprises the following steps:
(8) Judge whether program segment C contains several program subsegments. If program segment C contains several program subsegments, go to step (9); otherwise, go to step (10).
(9) Let any two consecutive program subsegments be program subsegment C1 and program subsegment C2, and perform the following on all program subsegments until program segment C has been executed: judge the program context dependency between program subsegment C1 and program subsegment C2. If there is a program context dependency between them, execute program subsegment C1 and program subsegment C2 sequentially, one after the other; otherwise, allocate core group computing resources according to the data volumes of program subsegment C1 and program subsegment C2, and execute them in parallel.
(10) If program segment C does not contain several program subsegments, execute program segment C directly.
Regarding program segments and program subsegments: for example, a for loop serves as one program segment. If the loop body contains several lines of computation that are unrelated to one another, it can be divided into multiple program subsegments that are processed in parallel. If the loop body contains only one line of code, it cannot be divided into program subsegments, and the program segment is executed directly. Within a program segment, parallel execution of the program subsegments can be set up according to the dependencies between them. When a program segment contains multiple program subsegments, these program subsegments are executed serially on the core group; however, if there is no context dependency between the program subsegments and the amount of data computed by a single program subsegment is small, these program subsegments can be executed in parallel.
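The decision made for a pair of program subsegments can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not say how a context dependency is detected, so the sketch assumes each subsegment declares the variables it reads and writes; the proportional split of slave cores and the default of 64 cores are likewise assumptions, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Subsegment:
    name: str
    reads: Set[str]
    writes: Set[str]
    data_volume: int  # assumed unit: number of array elements


def depends(a: Subsegment, b: Subsegment) -> bool:
    """True if running a and b in parallel could change the result (assumed test)."""
    return bool(a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def allocate_cores(a: Subsegment, b: Subsegment, total_cores: int = 64) -> dict:
    """Split the slave cores in proportion to data volume (assumed rule)."""
    total_volume = a.data_volume + b.data_volume
    share_a = max(1, round(total_cores * a.data_volume / total_volume))
    share_a = min(share_a, total_cores - 1)  # keep at least one core for b
    return {a.name: share_a, b.name: total_cores - share_a}


def plan(a: Subsegment, b: Subsegment):
    """Sequential if the subsegments depend on each other, parallel otherwise."""
    if depends(a, b):
        return ("sequential", [a.name, b.name])
    return ("parallel", allocate_cores(a, b))


a1 = Subsegment("A1", reads={"h"}, writes={"u"}, data_volume=300)
a2 = Subsegment("A2", reads={"h"}, writes={"v"}, data_volume=100)
a3 = Subsegment("A3", reads={"u"}, writes={"w"}, data_volume=100)
print(plan(a1, a2))  # independent pair
print(plan(a1, a3))  # A3 reads what A1 writes
```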
Embodiment 4
The accelerated running method based on the Shenwei many-core processor according to Embodiment 1, with the following differences:
If there is no program context dependency among program segment A, program segment B and program segment C, the core group executes program segment A and program segment C while, at the same time, the main core executes program segment B, which cannot be parallel-optimized. In this case the execution order of the three program segments is adjusted: program segment A and program segment C are executed together, which saves the time of one spawn and join. While the core group executes program segment A and program segment C, the main core executes program segment B, so the main core program and the core group program run simultaneously and the number of core group spawn and join operations is reduced. Because there is no context dependency among the three program segments, no "communication lock" is needed for synchronization. A core group thread is spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C while the main core executes program segment B. After program segment A, program segment C and program segment B have all been executed, the core group thread is joined and the results of program segment A and program segment C are returned to the main core. The details are shown in Figure 4.
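The overlap described above can be pictured with ordinary threads. The sketch below is an analogy only: a Python thread stands in for the core group, `start()` and `join()` stand in for spawn and join, and the three computations are placeholders. It is not code for the Shenwei runtime, and the function names are hypothetical.

```python
import threading


def run_independent_case() -> dict:
    """Run A and C on a worker thread while the caller runs B."""
    results = {}

    def core_group_work():
        # Stands in for the core group running segments A and C.
        results["A"] = sum(range(1000))
        results["C"] = sum(i * i for i in range(100))

    worker = threading.Thread(target=core_group_work)
    worker.start()              # "spawn": one launch covers both A and C
    results["B"] = max(3, 7)    # the main core runs segment B in the meantime
    worker.join()               # "join": results of A and C are now available
    return results


out = run_independent_case()
print(out)
```

Only one spawn and one join are issued for the two offloaded segments, instead of one pair for A and another pair for C.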
If there is a program context dependency between program segment A and program segment B, no program context dependency between program segment A and program segment C, and no program context dependency between program segment B and program segment C, then program segment A and program segment C are executed first, and program segment B is executed after their results have been returned. The specific execution process is shown in Figure 5. A core group thread is spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C. After program segment A and program segment C have been executed, the core group thread is joined and the results of program segment A and program segment C are returned to the main core; the main core then executes program segment B, which cannot be parallel-optimized.
If there is no program context dependency between program segment A and program segment B, and there is a program context dependency between program segment B and program segment C, then program segment B is executed first, and program segment A and program segment C are executed after its result has been returned. Parallel optimization is achieved by adjusting the execution order of the program segments, which reduces the number of core group spawn and join operations. The specific execution process is shown in Figure 6. The main core executes program segment B. After program segment B has been executed, a core group thread is spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C. After program segment A and program segment C have been executed, the core group thread is joined and the results of program segment A and program segment C are returned to the main core.
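The three reordering cases of this embodiment can be summarized as a small decision function. This is a sketch under assumptions: the function name and the boolean flags are hypothetical, and any dependency pattern not described in the text falls back to the original order A, B, C, which is an assumption made here for completeness.

```python
from typing import List


def choose_schedule(dep_ab: bool, dep_bc: bool, dep_ac: bool) -> List[List[str]]:
    """Return execution phases; segments in the same phase run concurrently.

    A and C always run on the core group, B always runs on the main core.
    """
    if not (dep_ab or dep_bc or dep_ac):
        return [["A", "C", "B"]]        # core group runs A and C while main core runs B
    if dep_ab and not dep_bc and not dep_ac:
        return [["A", "C"], ["B"]]      # B waits for the results of A and C
    if dep_bc and not dep_ab and not dep_ac:
        return [["B"], ["A", "C"]]      # A and C are spawned after B finishes
    return [["A"], ["B"], ["C"]]        # assumed fallback: keep the original order


print(choose_schedule(False, False, False))
print(choose_schedule(True, False, False))
print(choose_schedule(False, True, False))
```

In each of the three reordered cases A and C share a single spawn and join, which is where the saving comes from.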
In this embodiment, the design of the invention was tested experimentally with the ocean numerical model program Parallel Ocean Program (POP). The test environment was the Sunway TaihuLight supercomputer. POP was used to simulate the temperature changes of the global ocean over 5 model days at a test scale of 10000 processes, and the program segments in advu and hmix_del4 of the POP program were optimized. A certain loop body in hmix_del4 is called 900000 times by a single process, and a single core group spawn and join takes at least 22646 nanoseconds. With the method of this embodiment, the spawn and join for this loop body, that is, for the program segment containing this loop body, can be omitted, which ultimately saves 20.34 seconds. The run time of the program module containing this program segment is 1020 seconds, so optimizing this single program segment saves 2% of the module run time. In real application programs the number of such program segments to be optimized is enormous, and programs of this kind generally require long-running numerical simulations on supercomputers, so the total time that can be saved is considerable.
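The order of magnitude of this saving can be checked from the figures quoted above. The short calculation below simply multiplies the call count by the per-call spawn and join cost; the variable names are illustrative. The product comes to about 20.38 seconds, close to the 20.34 seconds reported, and corresponds to roughly 2% of the 1020-second module run time.

```python
calls_per_process = 900_000   # calls of the loop body by a single process
spawn_join_ns = 22_646        # quoted cost of one core group spawn and join
module_runtime_s = 1020       # run time of the enclosing program module

saved_s = calls_per_process * spawn_join_ns / 1e9   # nanoseconds to seconds
share = saved_s / module_runtime_s

print(f"{saved_s:.2f} s saved, {share:.1%} of module run time")
```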
For the cases described by the invention, three sequentially executed program segments were selected for each case. The program segments are program segment A, program segment B and program segment C, where program segment A and program segment C can be parallel-optimized (can be placed on the slave cores for execution) and program segment B cannot be parallel-optimized (can only be executed on the main core). Each program segment was optimized according to the method of the invention. Table 1 compares the efficiency of the program after applying the method of the invention with that of the unoptimized program and of the program optimized with the original method.
Table 1
Compared with the original method, the parallel optimization method of the invention improves efficiency significantly, by at least 16.7% and by up to 67.6%.
Claims (10)
1. An accelerated running method based on a Shenwei many-core processor, which runs on a computer and executes a program, the program comprising several program segments, any three consecutive program segments being denoted program segment A, program segment B and program segment C, characterized in that the method comprises the following steps:
I. Judge the program context dependencies among program segment A, program segment B and program segment C. If program context dependencies exist among all three of program segment A, program segment B and program segment C, execute them sequentially: set communication lock AB and communication lock BC so that the main core and the core group share synchronization variables, the running or waiting state of the main core or the core group being determined by the synchronization variables, comprising the following steps:
(1) initialize the synchronization variables, communication lock AB and communication lock BC being variables shared by the main core and the core group;
(2) load program segment A and program segment C onto the core group; the core group executes program segment A while communication lock AB is used to lock the main core, so that the main core waits;
(3) after program segment A has been executed on the core group, one or several core group threads synchronize the core group thread data, the core group data are transferred to the main core by DMA, the main core is notified to execute program segment B, and program segment C on the core group is locked;
(4) the main core executes program segment B; after execution is complete, communication lock BC is unlocked to notify the core group to execute program segment C; after program segment C has been executed on the core group, the core group computation data are returned to the main core;
otherwise, adjust the execution order among program segment A, program segment B and program segment C, and execute them;
II. Perform step I on each subsequent group of three consecutive program segments until the program has been executed.
2. The accelerated running method based on a Shenwei many-core processor according to claim 1, characterized in that, before program segment A, program segment B and program segment C are executed, the following operations are performed:
a. judge whether program segment A, program segment B or program segment C contains two or more program subsegments; if not, execute the program segment directly; otherwise, go to step b;
b. judge whether the two or more program subsegments contain common loop bounds, constants, and input data that do not change within the program loop; if not, execute the two or more program subsegments sequentially; otherwise, go to step c;
c. extract the common loop bounds, constants, and input data that do not change within the program loop, transfer the extracted loop bounds, constants and unchanging input data from the main core to each slave core in a single transfer, and execute the two or more program subsegments.
3. The accelerated running method based on a Shenwei many-core processor according to claim 1, characterized in that executing program segment A comprises the following steps:
(5) judge whether program segment A contains several program subsegments; if program segment A contains several program subsegments, go to step (6); otherwise, go to step (7);
(6) let any two consecutive program subsegments be program subsegment A1 and program subsegment A2, and perform the following on all program subsegments until program segment A has been executed: judge the program context dependency between program subsegment A1 and program subsegment A2; if there is a program context dependency between program subsegment A1 and program subsegment A2, execute program subsegment A1 and program subsegment A2 sequentially, one after the other; otherwise, allocate core group computing resources according to the data volumes of program subsegment A1 and program subsegment A2, and execute program subsegment A1 and program subsegment A2 in parallel;
(7) if program segment A does not contain several program subsegments, execute program segment A directly.
4. The accelerated running method based on a Shenwei many-core processor according to claim 1, characterized in that executing program segment C comprises the following steps:
(8) judge whether program segment C contains several program subsegments; if program segment C contains several program subsegments, go to step (9); otherwise, go to step (10);
(9) let any two consecutive program subsegments be program subsegment C1 and program subsegment C2, and perform the following on all program subsegments until program segment C has been executed: judge the program context dependency between program subsegment C1 and program subsegment C2; if there is a program context dependency between program subsegment C1 and program subsegment C2, execute program subsegment C1 and program subsegment C2 sequentially, one after the other; otherwise, allocate core group computing resources according to the data volumes of program subsegment C1 and program subsegment C2, and execute program subsegment C1 and program subsegment C2 in parallel;
(10) if program segment C does not contain several program subsegments, execute program segment C directly.
5. The accelerated running method based on a Shenwei many-core processor according to claim 1, characterized in that, if there is no program context dependency among program segment A, program segment B and program segment C, the core group executes program segment A and program segment C while, at the same time, the main core executes program segment B, which cannot be parallel-optimized.
6. The accelerated running method based on a Shenwei many-core processor according to claim 5, characterized in that a core group thread is spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C while, at the same time, the main core executes program segment B, which cannot be parallel-optimized; after program segment A, program segment C and program segment B have all been executed, the core group thread is joined and the results of program segment A and program segment C are returned to the main core.
7. The accelerated running method based on a Shenwei many-core processor according to claim 1, characterized in that, if there is a program context dependency between program segment A and program segment B, no program context dependency between program segment A and program segment C, and no program context dependency between program segment B and program segment C, then program segment A and program segment C are executed first, and program segment B is executed after their results have been returned.
8. The accelerated running method based on a Shenwei many-core processor according to claim 7, characterized in that a core group thread is spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C; after program segment A and program segment C have been executed, the core group thread is joined and the results of program segment A and program segment C are returned to the main core, and the main core executes program segment B, which cannot be parallel-optimized.
9. The accelerated running method based on a Shenwei many-core processor according to any one of claims 1 to 8, characterized in that, if there is no program context dependency between program segment A and program segment B, and there is a program context dependency between program segment B and program segment C, then program segment B is executed first, and program segment A and program segment C are executed after its result has been returned.
10. The accelerated running method based on a Shenwei many-core processor according to claim 9, characterized in that the main core executes program segment B; after program segment B has been executed, a core group thread is spawned to load program segment A and program segment C onto the core group, and the core group executes program segment A and program segment C; after program segment A and program segment C have been executed, the core group thread is joined and the results of program segment A and program segment C are returned to the main core.
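The communication-lock handshake in steps (1) to (4) of claim 1 can be pictured with a small model. The sketch below is an analogy under assumptions, not the claimed implementation: a Python thread stands in for the core group, two `threading.Event` objects stand in for communication lock AB and communication lock BC, a shared dictionary stands in for the DMA transfers, and the computations in A, B and C are placeholders chosen so that each segment depends on the previous one.

```python
import threading


def run_with_communication_locks():
    """Run A (core group), B (main core), C (core group) with one launch."""
    log = []
    shared = {}
    lock_ab = threading.Event()   # cleared: main core must wait for segment A
    lock_bc = threading.Event()   # cleared: core group must wait for segment B

    def core_group():
        shared["a"] = 2 + 3               # segment A on the core group
        log.append("A")
        lock_ab.set()                     # data handed over; main core may run B
        lock_bc.wait()                    # segment C stays locked until B is done
        shared["c"] = shared["b"] + 1     # segment C uses the result of B
        log.append("C")

    worker = threading.Thread(target=core_group)
    worker.start()                        # A and C are loaded in a single launch
    lock_ab.wait()                        # main core waits while A runs
    shared["b"] = shared["a"] * 10        # segment B on the main core
    log.append("B")
    lock_bc.set()                         # unlock BC: core group continues with C
    worker.join()                         # result of C is returned to the main core
    return log, shared


order, values = run_with_communication_locks()
print(order, values)
```

The three segments still execute strictly in the order A, B, C, but the worker is launched and joined only once rather than once for A and once for C.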
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910536855.9A CN110222007B (en) | 2019-06-20 | 2019-06-20 | Acceleration operation method based on Shenwei many-core processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110222007A (en) | 2019-09-10 |
CN110222007B (en) | 2023-11-24 |
Family
ID=67814362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910536855.9A Active CN110222007B (en) | 2019-06-20 | 2019-06-20 | Acceleration operation method based on Shenwei many-core processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110222007B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113568665A (en) * | 2020-04-29 | 2021-10-29 | 北京希姆计算科技有限公司 | Data processing device |
CN113835984A (en) * | 2021-09-27 | 2021-12-24 | 山东省计算中心(国家超级计算济南中心) | Many-core application performance evaluation method based on domestic ultra-micro architecture |
CN117472448B (en) * | 2023-12-28 | 2024-03-26 | 山东省计算中心(国家超级计算济南中心) | Parallel acceleration method, device and medium for secondary core cluster of Shenwei many-core processor |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102624889A (en) * | 2012-03-06 | 2012-08-01 | 河海大学 | Mass data concurrency processing method based on receiving and processing separation |
CN102929723A (en) * | 2012-11-06 | 2013-02-13 | 无锡江南计算技术研究所 | Method for dividing parallel program segment based on heterogeneous multi-core processor |
CN103080900A (en) * | 2010-09-03 | 2013-05-01 | 西门子公司 | Method for parallelizing automatic control programs and compiler |
CN105468448A (en) * | 2015-11-24 | 2016-04-06 | 无锡江南计算技术研究所 | Slave-core system call implementation method for heterogeneous many-core environments |
CN106095583A (en) * | 2016-06-20 | 2016-11-09 | 国家海洋局第海洋研究所 | Master-slave core cooperative computing and programming framework based on the new Sunway processor |
Non-Patent Citations (6)
Title |
---|
KAMATERI, E等: "Cloud4SOA: a semantic-interoperability PaaS solution for multi-cloud platform management and portability", 《SERVICE-ORIENTED AND CLOUD COMPUTING. SECOND EUROPEAN CONFERENCE (ESOCC 2013)》, pages 64 - 78 * |
姚庆 等: "SOM算法在申威众核上的实现和优化", 计算机科学, no. 2, pages 601 - 606 * |
徐卫志等: "众核处理器片上同步机制和评估方法研究", 《计算机学报》 * |
徐卫志等: "众核处理器片上同步机制和评估方法研究", 《计算机学报》, vol. 33, no. 10, 15 October 2010 (2010-10-15), pages 1777 - 1787 * |
徐阳 等: "Silicon-Crystal应用的神威OpenACC移植与数据流驱动任务图并行化", 《HTTP:KNS.CNKI.NET/KCMS/DETAIL/37.1357.N.20190517.1115.001.HTML》 * |
徐阳 等: "Silicon-Crystal应用的神威OpenACC移植与数据流驱动任务图并行化", 《HTTP:KNS.CNKI.NET/KCMS/DETAIL/37.1357.N.20190517.1115.001.HTML》, 17 May 2019 (2019-05-17), pages 1 - 8 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113568665A (en) * | 2020-04-29 | 2021-10-29 | 北京希姆计算科技有限公司 | Data processing device |
CN113568665B (en) * | 2020-04-29 | 2023-11-17 | 北京希姆计算科技有限公司 | Data processing device |
CN113835984A (en) * | 2021-09-27 | 2021-12-24 | 山东省计算中心(国家超级计算济南中心) | Many-core application performance evaluation method based on domestic ultra-micro architecture |
CN113835984B (en) * | 2021-09-27 | 2023-08-08 | 山东省计算中心(国家超级计算济南中心) | Many-core application performance evaluation method based on domestic super-computing micro-architecture |
CN117472448B (en) * | 2023-12-28 | 2024-03-26 | 山东省计算中心(国家超级计算济南中心) | Parallel acceleration method, device and medium for secondary core cluster of Shenwei many-core processor |
Also Published As
Publication number | Publication date |
---|---|
CN110222007B (en) | 2023-11-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||