CN105242909B - Many-core loop blocking method based on multi-version code generation
- Publication number
- CN105242909B
- Authority
- CN
- China
- Prior art keywords
- core
- on-chip
- many-core
- storage
- compiler
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention provides a many-core loop blocking method based on multi-version code generation. The many-core processor consists of a control core and a compute core array, and each compute core has its own cache, which serves as on-chip storage and exchanges data with main memory by direct memory access (DMA). When the compiler parallelizes a many-core loop, it determines the loop blocking granularity from the value of a compiler directive (pragma) and thereby generates different versions of the parallel code. Through code instrumentation, the compiler has the on-chip storage usage information fed back at run time, so that the pragma value can be adjusted according to that feedback to maximize on-chip storage utilization.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a many-core loop blocking method based on multi-version code generation.
Background art
For a many-core processor with on-chip storage, improving computing performance usually requires first transferring the data needed for computation from main memory to on-chip storage by DMA (Direct Memory Access), which substantially reduces memory access latency. However, on-chip storage capacity is very small and can hold only a small amount of data, so an unreasonable division of a parallel loop prevents the on-chip storage from being used effectively. If the loop division granularity is too small, DMA transfers become frequent and on-chip storage utilization is low; if it is too large, the data needed for computation cannot fit in on-chip storage, which seriously degrades the speedup.
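The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the patent: the function name `blocking_cost`, the 64 KiB capacity, and the 16 bytes per iteration are assumptions chosen only for illustration.

```python
import math

def blocking_cost(n_iters, bytes_per_iter, block_iters, spm_bytes):
    """Return (dma_transfers, utilization, fits) for one block size.

    All parameters are hypothetical: spm_bytes is the on-chip storage
    capacity of one compute core, bytes_per_iter the data one loop
    iteration needs, block_iters the loop blocking granularity.
    """
    footprint = block_iters * bytes_per_iter         # bytes staged per block
    fits = footprint <= spm_bytes                    # too large -> cannot be placed on chip
    dma_transfers = math.ceil(n_iters / block_iters) # too small -> many DMA transfers
    utilization = min(footprint / spm_bytes, 1.0)
    return dma_transfers, utilization, fits

# Compare a few granularities for a loop of one million iterations.
for block in (16, 1024, 4096, 16384):
    print(block, blocking_cost(1_000_000, 16, block, 64 * 1024))
```

Under these assumed numbers, a block of 16 iterations needs 62500 transfers and uses under 1% of the storage, a block of 4096 iterations fills it exactly, and a block of 16384 iterations no longer fits.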
On the other hand, for a many-core processor with on-chip storage, when the compiler uses the compute core array to accelerate a core loop in parallel, the common approaches are either to access the data in main memory directly, or to partition the loop in parallel according to a default block granularity and then transfer the data to on-chip storage by DMA before accessing it. Neither approach uses on-chip storage effectively.
Summary of the invention
The technical problem to be solved by the invention is, in view of the above drawbacks of the prior art, to provide a many-core loop blocking method based on multi-version code generation that partitions a many-core loop in parallel by means of a pragma. The optimal granularity of the loop division is found through the user-directed pragma and run-time feedback information, so as to effectively guide the compiler in dividing the many-core loop, improve the efficiency of on-chip storage use, and thereby improve the overall speedup.
According to the present invention, there is provided a many-core loop blocking method based on multi-version code generation, wherein the many-core processor consists of a control core and a compute core array, each compute core has its own cache, and the cache of each compute core serves as on-chip storage and exchanges data with main memory by direct memory access (DMA); wherein, when parallelizing a many-core loop, the compiler determines the loop blocking granularity according to the value of a pragma and thereby generates the corresponding parallel code, and the compiler uses the generated parallel code, through code instrumentation, to feed back on-chip storage usage information when the parallelized code runs, so that the pragma value can be adjusted according to the fed-back usage information to maximize on-chip storage utilization.
Preferably, the pragma value is adjusted according to the fed-back usage information until the on-chip storage is filled as fully as possible.
Brief description of the drawings
A more complete understanding of the present invention, and of its attendant advantages and features, will be more readily obtained by reference to the following detailed description taken in conjunction with the accompanying drawing, in which:
Fig. 1 schematically shows a flow chart of the many-core loop blocking method based on multi-version code generation according to a preferred embodiment of the invention.
It should be noted that the drawing is intended to illustrate the invention, not to limit it. A drawing that represents structure may not be drawn to scale. In the drawings, identical or similar elements are denoted by identical or similar reference numerals.
Detailed description of the embodiments
To make the content of the present invention clearer and easier to understand, it is described in detail below with reference to specific embodiments and the accompanying drawing.
In the present invention, the many-core processor consists of a control core and a compute core array, and each compute core has its own cache. The cache of each compute core can serve as on-chip storage and exchange data with main memory by direct memory access (DMA).
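As an illustration of this storage model, the sketch below simulates one compute core that stages each loop block through a local buffer before computing on it. It is a simplified model under stated assumptions: `run_blocked` is a hypothetical name, list slicing stands in for the DMA get and put operations, and capacity is counted in elements rather than bytes.

```python
def run_blocked(data, block_iters, spm_capacity):
    """Double every element, staging each block through a local buffer.

    spm_capacity is the assumed on-chip storage size in elements.
    Returns the result plus the number of simulated DMA gets and puts.
    """
    if block_iters > spm_capacity:
        raise ValueError("block does not fit in on-chip storage")
    out = [0] * len(data)
    dma_gets = dma_puts = 0
    for start in range(0, len(data), block_iters):
        spm = data[start:start + block_iters]   # stands in for a DMA get from main memory
        dma_gets += 1
        spm = [2 * x for x in spm]              # compute touches on-chip data only
        out[start:start + len(spm)] = spm       # stands in for a DMA put to main memory
        dma_puts += 1
    return out, dma_gets, dma_puts

result, gets, puts = run_blocked(list(range(10)), block_iters=4, spm_capacity=8)
print(result, gets, puts)
```

The block size fixes both how much of the buffer is occupied and how many transfers are issued, which is why the granularity matters for this kind of processor.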
The present invention proposes a method that uses a pragma to guide a parallelizing compiler in blocking a many-core loop. When parallelizing a many-core loop, the compiler determines the loop blocking granularity according to the pragma value and generates the corresponding parallel code (different blocking granularities yield different versions of the parallel code). Through code instrumentation, the generated parallel code feeds back on-chip storage usage information when it runs, so that the pragma value can be adjusted according to that feedback to maximize on-chip storage utilization (for example, until the on-chip storage is filled as fully as possible).
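The idea of one code version per granularity, each instrumented to report on-chip storage usage, can be sketched as follows. This is only an analogy in Python, not the compiler mechanism of the patent: `generate_version`, the `stats` fields, and the capacity of 256 elements are all assumptions made for the example.

```python
def generate_version(block_iters, spm_capacity):
    """Return one 'version' of a summation kernel specialized for block_iters."""
    def kernel(data):
        stats = {"block_iters": block_iters, "peak_used": 0, "dma_transfers": 0}
        total = 0
        for start in range(0, len(data), block_iters):
            chunk = data[start:start + block_iters]
            # Instrumentation: record how much on-chip storage this block occupies.
            stats["peak_used"] = max(stats["peak_used"], len(chunk))
            stats["dma_transfers"] += 1
            total += sum(chunk)
        stats["utilization"] = stats["peak_used"] / spm_capacity
        return total, stats
    return kernel

# Three versions of the same loop, one per candidate granularity.
versions = {b: generate_version(b, spm_capacity=256) for b in (32, 128, 256)}
data = list(range(1024))
for block, kernel in versions.items():
    total, stats = kernel(data)
    print(block, total, stats["utilization"], stats["dma_transfers"])
```

Every version computes the same result; only the reported utilization and transfer count differ, and that difference is the feedback used to choose the next pragma value.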
For example, the feedback may be directed to the user, who can then directly adjust the pragma value according to the fed-back usage information.
A specific processing example of the many-core loop blocking method based on multi-version code generation according to a preferred embodiment of the invention is described below.
Fig. 1 schematically shows a flow chart of the specific processing example of the many-core loop blocking method based on multi-version code generation according to a preferred embodiment of the invention.
As shown in Fig. 1, the specific processing example of the many-core loop blocking method based on multi-version code generation according to a preferred embodiment of the invention includes:
First step S1: when parallelizing a many-core loop, the compiler adds a pragma to the core loop;
Second step S2: the user (or an automatic block-granularity adjustment mechanism) sets the loop blocking granularity;
Third step S3: the corresponding accelerated code is generated according to the loop blocking granularity;
Fourth step S4: the code is compiled and run, and the compiler feeds back on-chip storage usage information at run time through code instrumentation;
Fifth step S5: the on-chip storage usage is analyzed from the feedback information;
Sixth step S6: it is determined whether the on-chip storage is fully utilized; if so, the method proceeds to the seventh step S7; if not, it returns to the second step S2;
Seventh step S7: the optimal loop division is obtained.
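Steps S2 to S7 form a feedback loop, which can be sketched as below. The patent does not specify how the granularity is adjusted or what counts as "fully utilized", so the doubling policy and the stopping rule here are assumptions; `measure_usage` is a hypothetical stand-in for compiling, running, and reading back the instrumented usage data.

```python
def measure_usage(block_iters, bytes_per_iter, spm_bytes):
    """S3-S5 stand-in: fraction of on-chip storage that one block occupies."""
    return block_iters * bytes_per_iter / spm_bytes

def tune_block_size(bytes_per_iter, spm_bytes, start=1):
    """Iterate S2-S6 until on-chip storage is as full as this policy allows."""
    block = start                                     # S2: set the granularity
    history = []
    while True:
        usage = measure_usage(block, bytes_per_iter, spm_bytes)
        if usage > 1.0:
            raise ValueError("initial block does not fit in on-chip storage")
        history.append((block, usage))
        # S6 (assumed rule): storage counts as fully utilized when the
        # next larger candidate would no longer fit.
        if measure_usage(block * 2, bytes_per_iter, spm_bytes) > 1.0:
            return block, history                     # S7: optimal division found
        block *= 2                                    # not full yet: back to S2

best, history = tune_block_size(bytes_per_iter=16, spm_bytes=64 * 1024)
print(best, history[-1])
```

A user adjusting the pragma by hand would follow the same loop, reading the fed-back usage after each run instead of calling `measure_usage`.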
The present invention thus provides a many-core loop blocking method based on multi-version code generation that partitions a many-core loop in parallel by means of a pragma. The optimal granularity of the loop division is found through the user-directed pragma and run-time feedback information, which effectively guides the compiler in dividing the many-core loop, improves the efficiency of on-chip storage use, and thereby improves the overall speedup.
The advantage of the invention is that it offers a simple and effective answer to the question of how to divide a many-core loop: the optimal division granularity is found through the user-directed pragma and run-time feedback information, so that the on-chip storage of the many-core processor is used effectively and the speedup is improved.
Furthermore, it should be noted that, unless otherwise indicated, terms such as "first", "second", and "third" in the specification are used only to distinguish individual components, elements, steps, and the like, and are not intended to indicate a logical or sequential relationship between them.
It should be understood that although the present invention has been disclosed above by way of preferred embodiments, those embodiments are not intended to limit the invention. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make many possible variations and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or adaptation made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the invention.
Claims (2)
1. A many-core loop blocking method based on multi-version code generation, wherein a many-core processor consists of a control core and a compute core array, each compute core has its own cache, and the cache of each compute core serves as on-chip storage and exchanges data with main memory by direct memory access (DMA); characterized in that, when parallelizing a many-core loop, the compiler determines the loop blocking granularity according to the value of a pragma and thereby generates the corresponding parallel code, and the compiler uses the generated parallel code, through code instrumentation, to feed back on-chip storage usage information when the parallelized code runs, so that the pragma value can be adjusted according to the fed-back usage information to maximize on-chip storage utilization; wherein it is determined whether the on-chip storage is fully utilized; if the on-chip storage is determined to be fully utilized, the optimal loop division is obtained; and if the on-chip storage is determined not to be fully utilized, the loop blocking granularity is determined again.
2. The many-core loop blocking method based on multi-version code generation according to claim 1, characterized in that the pragma value is adjusted according to the fed-back usage information until the on-chip storage is filled as fully as possible.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510829920.9A CN105242909B (en) | 2015-11-24 | 2015-11-24 | Many-core loop blocking method based on multi-version code generation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510829920.9A CN105242909B (en) | 2015-11-24 | 2015-11-24 | Many-core loop blocking method based on multi-version code generation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105242909A (en) | 2016-01-13 |
CN105242909B (en) | 2017-08-11 |
Family
ID=55040568
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510829920.9A Active CN105242909B (en) | Many-core loop blocking method based on multi-version code generation | 2015-11-24 | 2015-11-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105242909B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555520A (en) * | 2018-05-30 | 2019-12-10 | 北京三星通信技术研究有限公司 | method for performing convolution operations, corresponding processing device and electronic device |
CN110399124B (en) * | 2019-07-19 | 2022-04-22 | 浪潮电子信息产业股份有限公司 | Code generation method, device, equipment and readable storage medium |
CN112433965B (en) * | 2019-08-26 | 2022-07-12 | 无锡江南计算技术研究所 | Data caching implementation method facing SPM storage hierarchy |
CN112559197B (en) * | 2019-09-10 | 2022-11-15 | 无锡江南计算技术研究所 | Convolution calculation data reuse method based on heterogeneous many-core processor |
CN114860417B (en) * | 2022-06-15 | 2023-05-02 | 中科物栖(北京)科技有限责任公司 | Multi-core neural network processor and multi-task allocation scheduling method for same |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6002881A (en) * | 1997-06-10 | 1999-12-14 | Arm Limited | Coprocessor data access control |
US6732247B2 (en) * | 2001-01-17 | 2004-05-04 | University Of Washington | Multi-ported memory having pipelined data banks |
CN101034345A (en) * | 2007-04-16 | 2007-09-12 | 中国人民解放军国防科学技术大学 | Control method for data stream and instruction stream in stream processor |
CN101923492A (en) * | 2010-08-11 | 2010-12-22 | 上海交通大学 | Method for executing dynamic allocation command on embedded heterogeneous multi-core |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101989192A (en) * | 2010-11-04 | 2011-03-23 | 浙江大学 | Method for automatically parallelizing program |
US8683243B2 (en) * | 2011-03-11 | 2014-03-25 | Intel Corporation | Dynamic core selection for heterogeneous multi-core systems |
CN103049245B (en) * | 2012-10-25 | 2015-12-02 | 浪潮电子信息产业股份有限公司 | A kind of software performance optimization method based on central processor CPU multi-core platform |
CN103049384A (en) * | 2012-12-29 | 2013-04-17 | 中国科学院深圳先进技术研究院 | Automatic generating frame of multi-core-based multithread limit energy consumption testing source program |
CN103226487B (en) * | 2013-04-25 | 2016-01-13 | 中国人民解放军信息工程大学 | Towards Data distribution8 and the locality optimizing methods of isomery many core dynamic data attemper structure |
CN103218455B (en) * | 2013-05-07 | 2014-04-16 | 中国人民解放军国防科学技术大学 | Method of high-speed concurrent processing of user requests of Key-Value database |
CN103699365B (en) * | 2014-01-07 | 2016-10-05 | 西南科技大学 | The thread dividing method of unrelated dependence is avoided in a kind of many-core processor structure |
CN103970580B (en) * | 2014-05-05 | 2017-09-15 | 华中科技大学 | A kind of data flow towards multinuclear cluster compiles optimization method |
- 2015-11-24: CN application CN201510829920.9A (patent CN105242909B), status Active
Also Published As
Publication number | Publication date |
---|---|
CN105242909A (en) | 2016-01-13 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN105242909B (en) | Many-core loop blocking method based on multi-version code generation | |
US8643656B2 (en) | Energy-aware task consolidation on graphics processing unit (GPU) | |
Sridharan et al. | Holistic run-time parallelism management for time and energy efficiency | |
Kim et al. | Automatic speculative doall for clusters | |
US8893104B2 (en) | Method and apparatus for register spill minimization | |
US20140136858A1 (en) | Power-constrained compiler code generation and scheduling of work in a heterogeneous processing system | |
US9304898B2 (en) | Hardware-based array compression | |
Shuja et al. | SIMDOM: a framework for SIMD instruction translation and offloading in heterogeneous mobile architectures | |
Valero‐Lara et al. | cuThomasBatch and cuThomasVBatch, CUDA routines to compute batch of tridiagonal systems on NVIDIA GPUs | |
Thaler et al. | Porting the COSMO weather model to manycore CPUs | |
CN107924327A (en) | System and method for multiple threads | |
Hayes et al. | Unified on-chip memory allocation for SIMT architecture | |
CN104321747A (en) | Time slack application pipeline balancing for multi/many-core plcs | |
Reddy et al. | Effective automatic computation placement and data allocation for parallelization of regular programs | |
CN103116526B (en) | The maximum power dissipation control method of high-performance heterogeneous Computing machine | |
Odendahl et al. | Split-cost communication model for improved MPSoC application mapping | |
Kislal et al. | POSTER: Location-Aware Computation Mapping for Manycore Processors | |
Aupy et al. | I/O scheduling strategy for periodic applications | |
US11816061B2 (en) | Dynamic allocation of arithmetic logic units for vectorized operations | |
CN104090804A (en) | Virtual memory expansion method for real-time DSP embedded system | |
CN104063329B (en) | 64-bit immediate operand processing method and device | |
CN102265257B (en) | Program conversion device and program conversion method | |
CN102542525B (en) | Information processing equipment and information processing method | |
Zaki et al. | Partial expansion graphs: Exposing parallelism and dynamic scheduling opportunities for DSP applications | |
Halli et al. | Performance comparison between Java and JNI for optimal implementation of computational micro-kernels |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |