CN105242909B - Many-core loop tiling method based on multi-version code generation - Google Patents

Info

Publication number
CN105242909B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510829920.9A
Other languages
Chinese (zh)
Other versions
CN105242909A (en)
Inventor
尉红梅
张立博
孙俊
姜小成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-11-24
Filing date
2015-11-24
Publication date
2017-08-11

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a many-core loop tiling method based on multi-version code generation. The many-core processor consists of a control core and a compute core array, and each compute core has its own cache, which serves as on-chip storage and exchanges data with main memory by direct memory access (DMA). When the compiler parallelizes a many-core loop, it determines the loop tiling granularity from the value of a compiler directive (pragma) and thereby generates different versions of the parallel code. Through code instrumentation, the compiler has the on-chip storage usage reported back at run time, so that the pragma value can be adjusted according to this feedback to maximize on-chip storage utilization.

Description

Many-core loop tiling method based on multi-version code generation
Technical field
The present invention relates to the field of computer technology, and in particular to a many-core loop tiling method based on multi-version code generation.
Background art
On a many-core processor with on-chip storage, computing performance is usually improved by first transferring the data needed for a computation from main memory to on-chip storage by DMA (Direct Memory Access), which greatly reduces memory access latency. However, on-chip storage capacity is very small and can hold only a small amount of data, so an unsuitable partitioning of a parallel loop prevents the on-chip storage from being used effectively. If the partition granularity is too small, DMA transfers become frequent and on-chip storage utilization is low; if it is too large, the data needed for the computation cannot fit in on-chip storage, which seriously degrades the speedup.
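The trade-off described above can be made concrete with a small calculation. The following is a minimal sketch, not part of the patent: the function name `tiling_stats`, the loop length, the 8 bytes per iteration, and the 64 KiB on-chip capacity are all hypothetical values chosen for illustration.

```python
import math

def tiling_stats(total_iters, tile_iters, bytes_per_iter, local_store_bytes):
    """Return (dma_transfers, utilization, fits) for one tile size.

    All parameters are hypothetical; the patent gives no concrete sizes.
    """
    tile_bytes = tile_iters * bytes_per_iter
    # A tile is usable only if its data fits in on-chip storage.
    fits = tile_bytes <= local_store_bytes
    # One inbound DMA transfer per tile: smaller tiles mean more transfers.
    dma_transfers = math.ceil(total_iters / tile_iters)
    # Fraction of on-chip storage one tile would occupy (above 1.0 means it overflows).
    utilization = tile_bytes / local_store_bytes
    return dma_transfers, utilization, fits

if __name__ == "__main__":
    for tile in (16, 1024, 8192, 16384):
        print(tile, tiling_stats(1_000_000, tile, 8, 64 * 1024))
```

Under these assumed numbers, a tile of 16 iterations needs 62500 transfers while leaving the on-chip storage almost empty, and a tile of 16384 iterations does not fit at all.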
Furthermore, when a compiler for such a processor uses the compute core array to accelerate a core loop in parallel, it commonly either accesses the data in main memory directly, or partitions the loop according to a default tile granularity and then transfers the data to on-chip storage by DMA before accessing it. Neither approach uses on-chip storage effectively.
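The second conventional approach (partition the loop by a fixed tile granularity, stage each tile on chip, then compute) might look roughly like the following sketch. DMA is only simulated with list slicing, and the names `dma_get`, `dma_put`, and `scale_tiled` are hypothetical, not taken from the patent or from any real runtime.

```python
def dma_get(main_memory, start, count):
    """Simulated DMA read: copy a contiguous block into a new local buffer."""
    return main_memory[start:start + count]

def dma_put(main_memory, start, local_buffer):
    """Simulated DMA write: copy the local buffer back to main memory."""
    main_memory[start:start + len(local_buffer)] = local_buffer

def scale_tiled(main_memory, factor, tile_iters):
    """Scale every element in place, staging one tile at a time on chip.

    Returns the number of simulated DMA transfers performed.
    """
    transfers = 0
    for start in range(0, len(main_memory), tile_iters):
        local = dma_get(main_memory, start, tile_iters)  # main memory -> on-chip
        local = [x * factor for x in local]              # compute on local data only
        dma_put(main_memory, start, local)               # on-chip -> main memory
        transfers += 2
    return transfers
```

With a fixed default `tile_iters`, nothing ties the tile size to the actual on-chip capacity, which is the weakness the paragraph above points out.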
Summary of the invention
The technical problem to be solved by the invention is to address the above drawbacks of the prior art by providing a many-core loop tiling method based on multi-version code generation. The many-core loop is partitioned in parallel under the control of a compiler directive (pragma), and the optimal partition granularity is found from the user-specified pragma together with run-time feedback. This effectively guides the compiler in partitioning the many-core loop and improves on-chip storage usage efficiency, and thereby the overall speedup.
According to the invention, a many-core loop tiling method based on multi-version code generation is provided. The many-core processor consists of a control core and a compute core array; each compute core has its own cache, which serves as on-chip storage and exchanges data with main memory by direct memory access (DMA). When the compiler parallelizes a many-core loop, it determines the loop tiling granularity from the value of the compiler directive (pragma) and generates the corresponding parallel code. Using code instrumentation in the generated parallel code, the compiler also has the on-chip storage usage reported back when the parallelized code runs, so that the pragma value can be adjusted according to this feedback to maximize on-chip storage utilization.
Preferably, the pragma value is adjusted according to the feedback until the on-chip storage is filled as completely as possible.
Brief description of the drawings
A more complete understanding of the invention, and of its attendant advantages and features, can be gained from the following detailed description read together with the accompanying drawing, in which:
Fig. 1 schematically shows a flowchart of the many-core loop tiling method based on multi-version code generation according to a preferred embodiment of the invention.
Note that the drawing illustrates the invention and does not limit it. Drawings that represent structure may not be drawn to scale. In the drawings, identical or similar elements are given identical or similar reference labels.
Detailed description of embodiments
To make the content of the invention clearer and easier to understand, it is described in detail below with reference to specific embodiments and the drawing.
In the invention, the many-core processor consists of a control core and a compute core array, and each compute core has its own cache. The cache of each compute core serves as on-chip storage and can exchange data with main memory by direct memory access (DMA).
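As a rough illustration of this architecture, the sketch below models a control core plus an array of compute cores, each with its own on-chip storage, and splits a loop's iteration space across the compute cores. The class and function names, and the choice of contiguous slices, are assumptions made for the example; the patent does not prescribe how iterations are assigned to cores.

```python
from dataclasses import dataclass, field

@dataclass
class ComputeCore:
    core_id: int
    local_store_bytes: int  # per-core cache used as on-chip storage

@dataclass
class ManyCoreProcessor:
    control_core_name: str = "control"          # the single control core
    compute_cores: list = field(default_factory=list)

def make_processor(num_compute_cores, local_store_bytes):
    """Build a processor model with identical compute cores (hypothetical sizes)."""
    cores = [ComputeCore(i, local_store_bytes) for i in range(num_compute_cores)]
    return ManyCoreProcessor(compute_cores=cores)

def split_iterations(total_iters, processor):
    """Give each compute core a contiguous [start, end) slice of the loop."""
    n = len(processor.compute_cores)
    base, extra = divmod(total_iters, n)
    ranges, start = {}, 0
    for core in processor.compute_cores:
        # The first `extra` cores take one additional iteration each.
        size = base + (1 if core.core_id < extra else 0)
        ranges[core.core_id] = (start, start + size)
        start += size
    return ranges
```

Each core would then tile its own slice so that one tile's data fits in its `local_store_bytes`.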
The invention proposes using a compiler directive (pragma) to guide a parallelizing compiler in tiling many-core loops. When the compiler parallelizes a many-core loop, it determines the loop tiling granularity from the pragma value and generates the corresponding parallel code (different tiling granularities yield different versions of the parallel code). Using code instrumentation in the generated parallel code, the compiler also has the on-chip storage usage reported back when the parallelized code runs, so that the pragma value can be adjusted according to this feedback to maximize on-chip storage utilization (for example, until the on-chip storage is filled as completely as possible).
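The idea of one code version per pragma value, each carrying an instrumentation probe, can be sketched as follows. This is an illustrative stand-in only: closures play the role of generated code versions, a list append plays the role of the inserted probe, and the names `make_version` and `generate_versions` are hypothetical.

```python
def make_version(tile_iters, bytes_per_iter, usage_log):
    """Build one code version: a tiled sum kernel specialised for tile_iters."""
    def kernel(data):
        total = 0
        for start in range(0, len(data), tile_iters):
            tile = data[start:start + tile_iters]  # stands in for a DMA load
            # Instrumentation probe: record the on-chip bytes held by this tile.
            usage_log.append(len(tile) * bytes_per_iter)
            total += sum(tile)
        return total
    return kernel

def generate_versions(pragma_values, bytes_per_iter):
    """Return {pragma value: (kernel, usage_log)}, one version per value."""
    versions = {}
    for value in pragma_values:
        log = []
        versions[value] = (make_version(value, bytes_per_iter, log), log)
    return versions
```

After a version has run, `max(usage_log)` gives the peak on-chip usage that would be fed back for that pragma value.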
For example, the feedback may be reported to the user, who then adjusts the pragma value directly according to the reported usage information.
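A feedback report aimed at the user could be as simple as the sketch below. The report fields, the function name `feedback_report`, and the 0.9 utilization target are assumptions for illustration; the patent does not define a report format or a threshold.

```python
def feedback_report(pragma_value, peak_bytes, capacity_bytes, target=0.9):
    """Summarise instrumented on-chip usage and suggest an adjustment direction."""
    utilization = peak_bytes / capacity_bytes
    if utilization > 1.0:
        hint = "decrease"   # the tile did not fit in on-chip storage
    elif utilization < target:
        hint = "increase"   # on-chip storage is under-used
    else:
        hint = "keep"       # close enough to full under the assumed target
    return {"pragma_value": pragma_value,
            "utilization": utilization,
            "hint": hint}
```

The user would read the hint, edit the pragma value in the source, and rebuild.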
A specific processing example of the method according to the preferred embodiment is described below.
Fig. 1 schematically shows a flowchart of this specific processing example of the many-core loop tiling method based on multi-version code generation according to the preferred embodiment of the invention.
As shown in Fig. 1, the specific processing example of the method according to the preferred embodiment includes:
First step S1: when parallelizing a many-core loop, the compiler adds a compiler directive (pragma) to the core loop;
Second step S2: the user (or an automatic tile-granularity adjustment mechanism) sets the loop tiling granularity;
Third step S3: the corresponding accelerated code is generated according to the loop tiling granularity;
Fourth step S4: the code is compiled and run, and the compiler, through code instrumentation, has the on-chip storage usage reported back at run time;
Fifth step S5: the on-chip storage usage is analyzed from the feedback;
Sixth step S6: it is determined whether the on-chip storage is fully utilized; if so, the method proceeds to the seventh step S7; if not, it returns to the second step S2;
Seventh step S7: the optimal loop partition is obtained.
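The loop formed by steps S2 to S7 can be sketched as an automatic search. Everything here is an assumption layered on the flowchart: `run_instrumented` merely stands in for compiling and running the instrumented code, "fully utilized" is approximated by a hypothetical 0.9 target, and the doubling and halving rule is one possible adjustment policy, not the patent's.

```python
def run_instrumented(tile_iters, bytes_per_iter):
    """Stand-in for S3 and S4: 'compile and run', return peak on-chip bytes used."""
    return tile_iters * bytes_per_iter

def find_tile_size(capacity_bytes, bytes_per_iter, initial_tile=1, target=0.9,
                   max_rounds=64):
    """Repeat S2 to S6 until on-chip storage is judged fully used (S7)."""
    tile = initial_tile                                  # S2: set the granularity
    for _ in range(max_rounds):
        peak = run_instrumented(tile, bytes_per_iter)    # S3 and S4
        utilization = peak / capacity_bytes              # S5: analyse the feedback
        if target <= utilization <= 1.0:                 # S6: fully utilized?
            return tile                                  # S7: partition found
        if utilization > 1.0:
            tile = max(1, tile // 2)                     # too large: shrink
        else:
            # Too small: grow, but never beyond what the capacity allows.
            tile = min(tile * 2, capacity_bytes // bytes_per_iter)
    raise RuntimeError("no suitable tile size found")
```

For example, with a hypothetical 64 KiB of on-chip storage and 8 bytes per iteration, the search starting from a tile of 1 ends at 8192 iterations per tile, which fills the storage exactly.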
The invention provides a many-core loop tiling method based on multi-version code generation in which the many-core loop is partitioned in parallel under the control of a pragma, and the optimal partition granularity is found from the user-specified pragma together with run-time feedback. This effectively guides the compiler in partitioning the many-core loop and improves on-chip storage usage efficiency, and thereby the overall speedup.
The advantage of the invention is that it offers a simple and effective answer to the question of how to partition many-core loops: the optimal partition granularity is found from the user-specified pragma and run-time feedback, so that the on-chip storage of the many-core processor is used effectively and the speedup improves.
In addition, unless otherwise stated, terms such as "first", "second", and "third" in the specification serve only to distinguish the components, elements, steps, and so on, and are not intended to indicate any logical or ordinal relationship between them.
It should be understood that, although the invention is disclosed above by way of preferred embodiments, those embodiments do not limit it. Without departing from the scope of the technical solution of the invention, any person skilled in the art may use the technical content disclosed above to make many possible variations and modifications to the technical solution, or to modify it into equivalent embodiments. Therefore, any simple modification, equivalent variation, or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the invention.

Claims (2)

1. A many-core loop tiling method based on multi-version code generation, wherein the many-core processor consists of a control core and a compute core array, each compute core has its own cache, and the cache of each compute core serves as on-chip storage and exchanges data with main memory by direct memory access (DMA); characterized in that, when parallelizing a many-core loop, the compiler determines the loop tiling granularity from the value of a compiler directive (pragma) and generates the corresponding parallel code, and, using code instrumentation in the generated parallel code, has the on-chip storage usage reported back when the parallelized code runs, so that the pragma value can be adjusted according to this feedback to maximize on-chip storage utilization; wherein it is determined whether the on-chip storage is fully utilized; if so, the optimal loop partition is obtained; if not, the loop tiling granularity is determined again.
2. The many-core loop tiling method based on multi-version code generation according to claim 1, characterized in that the pragma value is adjusted according to the feedback until the on-chip storage is filled as completely as possible.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510829920.9A CN105242909B (en) 2015-11-24 2015-11-24 Many-core loop tiling method based on multi-version code generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510829920.9A CN105242909B (en) 2015-11-24 2015-11-24 Many-core loop tiling method based on multi-version code generation

Publications (2)

Publication Number Publication Date
CN105242909A 2016-01-13
CN105242909B 2017-08-11

Family

ID=55040568

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN201510829920.9A Active CN105242909B (en) 2015-11-24 2015-11-24 Many-core loop tiling method based on multi-version code generation

Country Status (1)

Country Link
CN (1) CN105242909B (en)

Also Published As

Publication number Publication date
CN105242909A (en) 2016-01-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant