CN105242909B

CN105242909B - Many-core loop blocking method based on multi-version code generation

Info

Publication number: CN105242909B
Application number: CN201510829920.9A
Authority: CN
Inventors: 尉红梅; 张立博; 孙俊; 姜小成
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2015-11-24
Filing date: 2015-11-24
Publication date: 2017-08-11
Anticipated expiration: 2035-11-24
Also published as: CN105242909A

Abstract

The invention provides a many-core loop blocking method based on multi-version code generation. The many-core processor consists of a control core and a compute core array, wherein each compute core has its own cache, and the cache of each compute core serves as on-chip storage and exchanges data with main memory by direct memory access (DMA). The method is characterized in that, when the compiler parallelizes a many-core loop, it determines the loop blocking granularity from the value of a compiler directive (pragma) and thereby generates different versions of the parallel code; at the same time, the compiler uses code instrumentation to feed back the on-chip storage usage information at run time, so that the value of the compiler directive can be adjusted according to the fed-back usage information in order to maximize the utilization of the on-chip storage.

Description

Many-core loop blocking method based on multi-version code generation

Technical field

The present invention relates to the field of computer technology, and in particular to a many-core loop blocking method based on multi-version code generation.

Background art

For a many-core processor with on-chip storage, improving computing performance usually requires first transferring the data needed for computation from main memory to on-chip storage by DMA (Direct Memory Access), which can greatly reduce memory access latency. However, the on-chip storage capacity is very small and can hold only a small amount of data, so an unreasonable partition of a parallel loop prevents the on-chip storage from being used effectively. If the loop partition granularity is too small, DMA transfers become frequent and the utilization of on-chip storage is low; if the loop partition granularity is too large, the data needed for computation cannot be placed in on-chip storage, which seriously degrades the acceleration effect.
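
The trade-off described above can be made concrete with a small calculation. The following is a minimal sketch, not part of the patent: the 64 KiB capacity, the 24 bytes touched per iteration, the trip count, and the function name `blocking_stats` are all assumptions chosen only for illustration.

```python
import math

def blocking_stats(trip_count, block, bytes_per_iter, on_chip_bytes):
    """Return (dma_rounds, footprint_bytes, utilization) for one compute core,
    or None when one block of data does not fit in on-chip storage."""
    footprint = block * bytes_per_iter          # bytes resident on chip per block
    if footprint > on_chip_bytes:
        return None                             # granularity too large
    dma_rounds = math.ceil(trip_count / block)  # one transfer round per block
    return dma_rounds, footprint, footprint / on_chip_bytes

ON_CHIP = 64 * 1024      # assumed: 64 KiB of on-chip storage per compute core
BYTES_PER_ITER = 3 * 8   # assumed loop body: c[i] = a[i] + b[i], 8-byte values
TRIP = 1_000_000         # assumed: iterations assigned to one compute core

for block in (16, 256, 2048, 4096):
    print(block, blocking_stats(TRIP, block, BYTES_PER_ITER, ON_CHIP))
```

Under these assumed numbers, a block of 16 iterations needs 62,500 transfer rounds and occupies well under 1% of the on-chip storage, a block of 2048 iterations needs 489 rounds and occupies 75%, and a block of 4096 iterations does not fit at all.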

On the other hand, for a many-core processor with on-chip storage, when the compiler uses the compute core array to accelerate a core loop in parallel, the common approaches are either to access the data in main memory directly, or to partition the loop in parallel according to a default blocking granularity and then access the data after transferring it to on-chip storage by DMA. Neither approach uses the on-chip storage effectively.

Summary of the invention

The technical problem to be solved by the present invention is, in view of the above defects in the prior art, to provide a many-core loop blocking method based on multi-version code generation that partitions a many-core loop in parallel by means of a compiler directive (pragma). The optimal granularity of the loop partition is found through user-guided compiler directives and run-time feedback information, which effectively guides the compiler in partitioning the many-core loop, improves the efficiency with which on-chip storage is used, and thereby improves the overall acceleration effect.

According to the present invention, there is provided a many-core loop blocking method based on multi-version code generation, wherein the many-core processor consists of a control core and a compute core array, each compute core has its own cache, and the cache of each compute core serves as on-chip storage and exchanges data with main memory by direct memory access (DMA). When the compiler parallelizes a many-core loop, it determines the loop blocking granularity from the value of a compiler directive and thereby generates the corresponding parallel code. At the same time, the compiler uses code instrumentation so that the generated parallel code feeds back the on-chip storage usage information at run time, whereby the value of the compiler directive can be adjusted according to the fed-back usage information so as to maximize the utilization of the on-chip storage.

Preferably, the value of the compiler directive is adjusted according to the fed-back usage information until the on-chip storage is filled as fully as possible.

Brief description of the drawings

A more complete understanding of the present invention, and of its attendant advantages and features, will be more readily obtained by reference to the following detailed description taken in conjunction with the accompanying drawing, in which:

Fig. 1 schematically shows a flowchart of the many-core loop blocking method based on multi-version code generation according to a preferred embodiment of the present invention.

It should be noted that the drawing is intended to illustrate the present invention, not to limit it. Note that drawings showing structure may not be drawn to scale. In the drawings, the same or similar elements are denoted by the same or similar reference numerals.

Detailed description of the embodiments

In order to make the content of the present invention clearer and easier to understand, the content of the present invention is described in detail below with reference to specific embodiments and the accompanying drawing.

In the present invention, the many-core processor consists of a control core and a compute core array, wherein each compute core has its own cache. The cache of each compute core serves as on-chip storage and can exchange data with main memory by direct memory access (DMA).

The present invention proposes a method that uses a compiler directive (pragma) to guide a parallelizing compiler in blocking a many-core loop. When the compiler parallelizes a many-core loop, it determines the loop blocking granularity from the value of the compiler directive and thereby generates the corresponding parallel code (different loop blocking granularities yield different versions of the parallel code). At the same time, the compiler uses code instrumentation so that the generated parallel code feeds back the on-chip storage usage information at run time, whereby the value of the compiler directive can be adjusted according to the fed-back usage information so as to maximize the utilization of the on-chip storage (for example, until the on-chip storage is filled as fully as possible).
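
The idea of one code version per blocking granularity, each carrying instrumentation, can be sketched as follows. This is an illustrative simulation under stated assumptions, not the compiler's actual output: DMA is modelled by list slicing, the loop body `c[i] = a[i] + b[i]`, the 64 KiB capacity, and the names `make_blocked_version` and `usage` are all hypothetical.

```python
def make_blocked_version(block, on_chip_bytes=64 * 1024, elem_bytes=8):
    """Build one 'version' of c[i] = a[i] + b[i], blocked with the given granularity."""
    def run(a, b):
        # Instrumentation record that is fed back after the run.
        usage = {"dma_transfers": 0, "peak_bytes": 0, "capacity": on_chip_bytes}
        c = [0.0] * len(a)
        for start in range(0, len(a), block):
            # Simulated DMA get: copy one block of each input into on-chip buffers.
            buf_a = a[start:start + block]
            buf_b = b[start:start + block]
            usage["dma_transfers"] += 2
            # Instrumentation: bytes held on chip (two input blocks, one output block).
            resident = 3 * len(buf_a) * elem_bytes
            if resident > on_chip_bytes:
                raise MemoryError("block does not fit in on-chip storage")
            usage["peak_bytes"] = max(usage["peak_bytes"], resident)
            buf_c = [x + y for x, y in zip(buf_a, buf_b)]
            # Simulated DMA put: write the result block back to main memory.
            c[start:start + block] = buf_c
            usage["dma_transfers"] += 1
        return c, usage
    return run

# One generated version per candidate value of the directive.
versions = {g: make_blocked_version(g) for g in (64, 512, 2048)}
a = [float(i) for i in range(5000)]
b = [2.0 * i for i in range(5000)]
for g, run in versions.items():
    c, usage = run(a, b)
    print(g, usage["dma_transfers"], usage["peak_bytes"] / usage["capacity"])
```

Every version computes the same result; only the number of simulated transfers and the peak on-chip occupancy differ, and those two figures are the kind of usage information that the text describes as being fed back.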

For example, the feedback may be directed to the user, so that the user can directly adjust the value of the compiler directive according to the fed-back usage information.

A specific processing example of the many-core loop blocking method based on multi-version code generation according to a preferred embodiment of the present invention is described below.

Fig. 1 schematically shows a flowchart of the specific processing example of the many-core loop blocking method based on multi-version code generation according to a preferred embodiment of the present invention.

As shown in Fig. 1, the specific processing example of the many-core loop blocking method based on multi-version code generation according to a preferred embodiment of the present invention comprises:

First step S1: When the compiler parallelizes the many-core loop, a compiler directive is added to the core loop;

Second step S2: The user (or an automatic mechanism for setting and adjusting the blocking granularity) sets the loop blocking granularity;

Third step S3: The corresponding accelerated code is generated according to the loop blocking granularity;

Fourth step S4: The code is compiled and run, wherein the compiler uses code instrumentation to feed back the on-chip storage usage information at run time;

Fifth step S5: The usage of the on-chip storage is analyzed according to the feedback information;

Sixth step S6: It is determined whether the on-chip storage is fully utilized; if the on-chip storage is determined to be fully utilized, the method proceeds to the seventh step S7; if the on-chip storage is determined not to be fully utilized, the method returns to the second step S2;

Seventh step S7: The optimal loop partition is obtained.
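
Steps S2 to S7 form a feedback loop, which can be sketched as follows. This is a minimal sketch under stated assumptions: the source does not specify how the granularity is adjusted or what counts as "fully utilized", so the rescaling rule, the 95% threshold, the round limit, and the names `measure_usage` and `tune_granularity` are hypothetical stand-ins. `measure_usage` replaces the real generate, compile, and run steps with a simple model.

```python
def measure_usage(block, bytes_per_iter=24, capacity=64 * 1024):
    """Stand-in for S3-S4: 'generate, compile and run' one version, return feedback."""
    used = block * bytes_per_iter
    return {"used_bytes": used, "capacity": capacity, "overflow": used > capacity}

def tune_granularity(initial_block, measure=measure_usage,
                     full_ratio=0.95, max_rounds=32):
    block = initial_block                                  # S2: set the granularity
    history = []
    for _ in range(max_rounds):
        info = measure(block)                              # S3-S4: build, run, feed back
        ratio = info["used_bytes"] / info["capacity"]      # S5: analyze the usage
        history.append((block, ratio))
        if not info["overflow"] and ratio >= full_ratio:   # S6: fully utilized?
            return block, history                          # S7: partition accepted
        # Back to S2. Assumed adjustment rule: rescale the block toward capacity.
        new_block = max(1, int(block / ratio))
        if new_block == block:                             # no further progress possible
            break
        block = new_block
    return block, history

best, history = tune_granularity(64)
print(best, history)
```

In the flow of Fig. 1 the adjustment in S2 may be made by the user; the automatic rule here only stands in for that decision so that the loop can run unattended.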

The present invention provides a many-core loop blocking method based on multi-version code generation that partitions a many-core loop in parallel by means of a compiler directive. The optimal granularity of the loop partition is found through user-guided compiler directives and run-time feedback information, which effectively guides the compiler in partitioning the many-core loop, improves the efficiency with which on-chip storage is used, and thereby improves the overall acceleration effect.

The advantage of the invention is that it provides a simple and effective answer to the problem of how to partition a many-core loop: the optimal granularity of the loop partition is found through user-guided compiler directives and run-time feedback information, so that the on-chip storage of the many-core processor is used effectively and the acceleration effect is improved.

Furthermore, it should be noted that, unless otherwise indicated, terms such as "first", "second", and "third" in the specification are used only to distinguish the components, elements, steps, and the like in the specification, and are not intended to indicate a logical or ordinal relationship between them.

It is to be understood that although the present invention has been disclosed above by way of preferred embodiments, the above embodiments are not intended to limit the present invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make many possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A many-core loop blocking method based on multi-version code generation, wherein the many-core processor consists of a control core and a compute core array, each compute core has its own cache, and the cache of each compute core serves as on-chip storage and exchanges data with main memory by direct memory access (DMA); characterized in that, when the compiler parallelizes a many-core loop, it determines the loop blocking granularity from the value of a compiler directive and thereby generates the corresponding parallel code, and the compiler uses code instrumentation so that the generated parallel code feeds back the on-chip storage usage information at run time, whereby the value of the compiler directive is adjusted according to the fed-back usage information so as to maximize the utilization of the on-chip storage; wherein it is determined whether the on-chip storage is fully utilized; if the on-chip storage is determined to be fully utilized, the optimal loop partition is obtained; and if the on-chip storage is determined not to be fully utilized, the loop blocking granularity is determined again.

2. The many-core loop blocking method based on multi-version code generation according to claim 1, characterized in that the value of the compiler directive is adjusted according to the fed-back usage information until the on-chip storage is filled as fully as possible.