CN101625673A - Method for mapping task of network on two-dimensional grid chip - Google Patents


Info

Publication number
CN101625673A
CN101625673A (application CN200810116245A)
Authority
CN
China
Prior art keywords
thread
common thread
common
desired location
threads
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200810116245A
Other languages
Chinese (zh)
Other versions
CN101625673B (en)
Inventor
刘祥
陈曦
黄毅
张金龙
任菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN2008101162455A priority Critical patent/CN101625673B/en
Publication of CN101625673A publication Critical patent/CN101625673A/en
Application granted granted Critical
Publication of CN101625673B publication Critical patent/CN101625673B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a method for mapping tasks onto a two-dimensional mesh network-on-chip. The method comprises the following steps: 1) pre-allocating desired locations on the two-dimensional grid for all threads, the threads including common threads that may be mapped to any position; 2) for each common thread, calculating the variation Com_diff of the total communication power consumption factor that would result from exchanging it with a common thread or idle position near its desired location, and performing the exchange with whichever common thread or idle position minimizes Com_diff, until every exchange between a common thread and a nearby common thread or idle position yields a Com_diff greater than or equal to zero; and 3) outputting a mapping file according to the positions of all threads. The method achieves a high degree of optimization, allows the user to adjust a parameter to control the time complexity, and also solves the partial-mapping case in which some threads must be placed at specific positions.

Description

A task mapping method for a two-dimensional grid network-on-chip
Technical field
The present invention relates to a method of using multi-core processors, and in particular to a task mapping method for a two-dimensional grid (2-D Mesh) network-on-chip (Network-on-Chip, NoC).
Background technology
With the development of semiconductor and integrated-circuit technology, the integration level of systems-on-chip (System-on-Chip, SoC) keeps rising, and a single chip can now integrate hundreds of IP cores such as microprocessors, memories and I/O interfaces. At the same time, the functions of embedded electronic products are becoming increasingly complex; a single-processor SoC can no longer satisfy the growing functional and performance requirements of embedded systems, so the emergence of the multi-core SoC (Multi-Processor SoC, MPSoC) has become inevitable. A multi-core SoC places higher demands on on-chip communication, and the network-on-chip was proposed to solve the global communication problem of nanometer-era multi-core SoCs. Drawing on the design ideas of parallel computing and computer networks, a network-on-chip builds a packet-switched micro-network on a single silicon die: IP cores are interconnected through switches, and a globally asynchronous, locally synchronous (Global Asynchronous Local Synchronous, GALS) mechanism is used to realize efficient communication among the many processing units, storage units and other computing modules of the multi-core SoC.
Network-on-chip topologies are varied; among them the two-dimensional mesh is simple in structure, scalable, and easy to implement and analyze, and has therefore been widely used in the network-on-chip field. As the number of transistors on a chip grows into the billions, power consumption is gradually becoming the primary constraint in chip design, and many power-aware methods exist for mapping threads onto the processing units of a network-on-chip. Among them, Jingcao Hu and R. Marculescu, in "Energy- and performance-aware mapping for regular NoC architectures", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 24, Issue 4, April 2005, pages 551-562, hereinafter referred to as document 1, describe a branch-and-bound approach: while generating the next feasible solution, an upper-bound function UBC (upper bound cost) and a lower-bound function LBC (lower bound cost) are used to terminate early those branches that cannot lead to an optimal solution, thereby guiding the method toward the "branch" containing the optimum. However, every step of this method must compute UBC and LBC, which inevitably increases the time complexity. Moreover, the method may produce several "optimal solutions" at the same time, so the degree of optimization of the final solution is not high.
Summary of the invention
The object of the invention is to overcome the long execution time of prior-art mapping methods and the fact that the degree of optimization of the final solution cannot be guaranteed, and to provide a task mapping method for a two-dimensional grid network-on-chip.
According to one aspect of the present invention, a task mapping method for a two-dimensional grid network-on-chip is provided, comprising the following steps:
1) pre-allocating desired locations on the two-dimensional grid for all threads, said threads including common threads that may be mapped to any position;
2) for each common thread, calculating the variation Com_diff of the total communication power consumption factor after exchanging it with a common thread or idle position near its desired location, said common thread performing the exchange with whichever common thread or idle position minimizes Com_diff, until every exchange between a common thread and a common thread or idle position near its desired location yields a Com_diff greater than or equal to 0;
3) outputting a mapping file according to the positions of all said threads.
Wherein said step 1) comprises:
11) listing said common threads in a queue in order of decreasing traffic of each common thread;
12) assigning the first common thread in said queue to the centre of said two-dimensional grid;
13) calculating the desired location of each common thread to be allocated according to the desired locations of the threads already allocated.
Wherein all said threads further include special threads that must be mapped to specific positions.
Wherein said step 1) comprises:
11') listing said special threads in a queue;
12') adding said common threads to said queue in order of decreasing traffic of each common thread;
13) calculating the desired location of each common thread to be allocated according to the desired locations of the threads already allocated.
Wherein said step 13) calculates the desired location of a common thread to be allocated from the desired locations of the already-allocated threads according to the following formulas:

x_i = \left[ \frac{\sum_k Com_{i,k} \cdot x_k}{\sum_k Com_{i,k}} \right], \quad y_i = \left[ \frac{\sum_k Com_{i,k} \cdot y_k}{\sum_k Com_{i,k}} \right]

where the sums run over the already-allocated threads k, [\cdot] denotes rounding to the nearest integer, Com_{i,k} denotes the total amount of data communicated between threads i and k, x_k and y_k denote the x and y coordinates of thread k, and x_i and y_i denote the x and y coordinates of thread i.
Wherein said step 2) comprises:
21) forming all the common threads in said queue into a circular queue and taking any one of the common threads;
22) assuming that said common thread is not yet mapped, calculating the variation Com_diff of the total communication power consumption factor after exchanging said common thread with each common thread or idle position near its desired location, and exchanging said common thread with whichever common thread or idle position minimizes Com_diff;
23) repeating step 22) until every exchange between a common thread and a common thread or idle position near its desired location yields a Com_diff greater than or equal to 0.
Wherein a common thread or idle position near the desired location is a common thread or idle position whose distance from the desired location is less than a predetermined threshold.
Wherein said distance from the desired location is a Manhattan distance.
The invention provides a power-aware network-on-chip mapping method that continually adjusts the optimal position of each thread so that the degree of optimization of the final solution is maximized. In addition, the user can set a threshold to trade off execution time against the degree of optimization of the final solution: when a small threshold is chosen, each iteration of the method only needs to compare the values of a few key positions, which significantly reduces the time complexity. The invention also considers and solves the partial-mapping case, i.e., the special case in a NoC system in which some threads must be mapped onto specific PUs.
Description of drawings
Fig. 1 is a schematic diagram of a NoC with a 2-D Mesh structure;
Fig. 2 is a flowchart of an embodiment of the task mapping method for a two-dimensional grid network-on-chip of the present invention;
Fig. 3 shows a mapping result produced by the task mapping method of the present invention on the H.264 decoder data with S=0 and d=1;
Fig. 4 shows another mapping result produced by the task mapping method of the present invention on the H.264 decoder data with S=1 and d=2.
Embodiment
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a specific embodiment of a NoC with a 4 x 3 2-D Mesh structure, where S denotes a switch node, LR denotes a local resource, Adapt denotes an adapter, PU denotes a processing unit, and (0,0), (0,1), (0,2), ..., (3,2) denote the position coordinates (x, y) of the thread mapping.
Suppose the mapping position of thread i in the 2-D Mesh is (x_i, y_i) and the mapping position of thread j is (x_j, y_j). In this embodiment the Manhattan distance D_{i,j} = |x_i - x_j| + |y_i - y_j| is used to represent the hop distance of the data when thread i communicates with thread j; those skilled in the art will understand that other distances may also be used without departing from the inventive concept. The communication power consumption of transmitting 1 bit of data from thread i to thread j is then:
E_{bit}^{i,j} = (D_{i,j} + 1) E_{Sbit} + D_{i,j} E_{Lbit}    (1)
where E_{Sbit} denotes the power consumed by each switch node to receive and forward 1 bit of data, and E_{Lbit} denotes the link power consumed to transmit 1 bit of data between two adjacent processing units. Letting E_{Sbit} / E_{Lbit} = \theta, we have
E_{bit}^{i,j} = [(\theta + 1) D_{i,j} + \theta] E_{Lbit}    (2)
The total communication power consumption of the whole system is then
E_{com} = \sum_{i=0}^{T-1} \sum_{j=i+1}^{T-1} (C_{i,j} + C_{j,i}) [(\theta + 1) D_{i,j} + \theta] E_{Lbit}    (3)
where T denotes the total number of threads and C_{i,j} denotes the amount of data thread i sends to thread j. Writing Com_{i,j} = C_{i,j} + C_{j,i} for the total amount of data communicated between threads i and j, we have
E_{com} = \sum_{i=0}^{T-1} \sum_{j=i+1}^{T-1} [(\theta + 1) D_{i,j} + \theta] Com_{i,j} \cdot E_{Lbit}    (4)
The starting point of the present invention is to optimize the communication power consumption factor of the system and thereby determine the mapping between the threads and the processing units of the network-on-chip that minimizes the system power consumption. From formula (4), the communication power consumption factor of the system is:
W = \sum_{i=0}^{T-1} \sum_{j=i+1}^{T-1} [(\theta + 1) D_{i,j} + \theta] Com_{i,j}    (5)
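As an illustration of formula (5), the following Python sketch (not part of the patent; the names manhattan, power_factor, pos, com and theta are assumptions introduced here) computes the communication power consumption factor W for a given placement, with pos mapping each thread to its (x, y) grid cell and com holding the symmetric totals Com_{i,j}:

# A minimal sketch of formula (5), not the patent's reference code.
def manhattan(p, q):
    # Manhattan (hop) distance between two grid cells
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def power_factor(pos, com, theta):
    T = len(pos)
    W = 0.0
    for i in range(T):
        for j in range(i + 1, T):
            D = manhattan(pos[i], pos[j])
            W += ((theta + 1) * D + theta) * com[i][j]
    return W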
All threads are divided into two classes: common threads, which may be mapped to any position, and special threads, which must be mapped to specific positions. Special threads may or may not exist in a given system; when they exist, they simply have to be mapped to their specific positions during the mapping process. For any common thread i, we wish to map it to the position that minimizes the communication power consumption factor of the system defined above. This is achieved as follows: compute the variation of the total communication power consumption factor after exchanging common thread i with each thread (special threads excepted) or idle position near its desired location, and exchange common thread i with the thread or idle position whose variation is minimal and less than 0. The variation of the total communication power consumption factor after exchanging thread i with another thread or with an idle position is calculated as follows:
The coefficient of the sum of the communication power consumption between thread i and all the other threads is
w_i = \sum_{k=0, k \neq i}^{T-1} [(\theta + 1) D_{i,k} + \theta] Com_{i,k}    (6)
The coefficient of the sum of the communication power consumption between thread j and all the other threads is
w_j = \sum_{k=0, k \neq j}^{T-1} [(\theta + 1) D_{j,k} + \theta] Com_{j,k}    (7)
After threads i and j exchange mapping positions, the coefficients of the sums of the communication power consumption between thread i and all the other threads and between thread j and all the other threads are, respectively,
w_i' = \sum_{k=0, k \neq i,j}^{T-1} [(\theta + 1) D_{j,k} + \theta] Com_{i,k} + [(\theta + 1) D_{i,j} + \theta] Com_{i,j}    (8)

w_j' = \sum_{k=0, k \neq i,j}^{T-1} [(\theta + 1) D_{i,k} + \theta] Com_{j,k} + [(\theta + 1) D_{i,j} + \theta] Com_{i,j}    (9)
Therefore, after threads i and j are swapped, the variation of the total communication power consumption factor of the system is
Com_diff = (w_i' + w_j') - (w_i + w_j) = \sum_{k=0, k \neq i,j}^{T-1} (\theta + 1)(D_{j,k} - D_{i,k})(Com_{i,k} - Com_{j,k})    (10)
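As a sketch of formula (10) under the same assumed conventions (pos, com, theta and manhattan from the previous snippet), the change in W caused by swapping the positions of threads i and j can be computed as:

# Sketch of formula (10): change in W when threads i and j swap grid positions.
def com_diff_swap(pos, com, theta, i, j):
    diff = 0.0
    for k in range(len(pos)):
        if k in (i, j):
            continue  # the i-j term cancels because their mutual distance is unchanged
        d_ik = manhattan(pos[i], pos[k])
        d_jk = manhattan(pos[j], pos[k])
        diff += (theta + 1) * (d_jk - d_ik) * (com[i][k] - com[j][k])
    return diff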
If thread i is exchanged with an idle position (s, t), that is, thread i is mapped to the idle position (s, t) and the original position of thread i becomes idle, then after the exchange the coefficient of the sum of the communication power consumption between thread i and all the other threads is:
w_{new} = \sum_{k=0, k \neq i}^{T-1} [(\theta + 1)(|s - x_k| + |t - y_k|) + \theta] Com_{i,k}    (11)
Therefore, after thread i is exchanged with the idle position (s, t), the variation of the total communication power consumption factor of the system is:
Com_diff = w_{new} - w_i = \sum_{k=0, k \neq i}^{T-1} (\theta + 1)(|s - x_k| + |t - y_k| - D_{i,k}) Com_{i,k}    (12)
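Similarly, a sketch of formula (12), the change in W when thread i is moved to an idle cell (s, t), again reusing the assumed helpers above:

# Sketch of formula (12): change in W when thread i moves to an idle cell (s, t).
def com_diff_move(pos, com, theta, i, s, t):
    diff = 0.0
    for k in range(len(pos)):
        if k == i:
            continue
        old_d = manhattan(pos[i], pos[k])
        new_d = abs(s - pos[k][0]) + abs(t - pos[k][1])
        diff += (theta + 1) * (new_d - old_d) * com[i][k]
    return diff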
If the set \Omega contains all the threads that have already been mapped, the desired location (x_i, y_i) of a thread i still to be mapped is calculated with formula (13), in which the bracket [\cdot] denotes rounding the enclosed expression to the nearest integer; those skilled in the art will appreciate that other rounding schemes may also be used.

x_i = \left[ \frac{\sum_{k \in \Omega} Com_{i,k} \cdot x_k}{\sum_{k \in \Omega} Com_{i,k}} \right], \quad y_i = \left[ \frac{\sum_{k \in \Omega} Com_{i,k} \cdot y_k}{\sum_{k \in \Omega} Com_{i,k}} \right]    (13)
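A sketch of formula (13) under the same assumed conventions; the fallback to the grid centre when thread i has no traffic with any already-mapped thread is an assumption added here, not something the text above specifies:

# Sketch of formula (13): desired location of an unmapped thread i is the
# traffic-weighted centroid of the already-mapped threads (set `mapped`),
# rounded to the nearest grid cell.
def desired_location(i, mapped, pos, com, grid_centre):
    total = sum(com[i][k] for k in mapped)
    if total == 0:
        return grid_centre  # assumed fallback when i has no traffic with mapped threads
    x = round(sum(com[i][k] * pos[k][0] for k in mapped) / total)
    y = round(sum(com[i][k] * pos[k][1] for k in mapped) / total)
    return (x, y)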
Based on the above analysis, a specific embodiment of the present invention proceeds as follows:
Suppose that in a given task the number of special threads is S and the number of common threads is T-S. List all the special threads in an ordering queue, then add all the common threads to this queue; preferably a common thread is added according to the size of its traffic with the other threads in the queue, as follows. Take the pair of threads with the largest traffic that contains at least one common thread not yet in the queue. If neither thread of the pair is in the queue, append both to the end of the queue, their order determined by the maximum traffic each has with the threads already in the queue, the larger first; if one of them is already in the queue, simply append the other to the end of the queue. Then take the next pair of threads with the largest traffic and repeat this operation until all threads are in the queue. Finally, number the T threads 0 to T-1 in their queue order. If the number of special threads in the task is 0, the order of the first pair of threads in the queue is arbitrary.
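The queue-building procedure just described can be sketched as follows (illustrative Python; build_queue, special and in_q are names assumed here, and ties are broken arbitrarily because the text does not say how):

# Sketch of the ordering-queue construction: special threads first, then common
# threads added pairwise in order of decreasing traffic.
def build_queue(T, special, com):
    queue = list(special)
    in_q = set(queue)
    while len(queue) < T:
        # pair with the largest traffic that still involves an unqueued thread
        best = max(
            ((com[a][b], a, b) for a in range(T) for b in range(a + 1, T)
             if a not in in_q or b not in in_q),
            key=lambda t: t[0],
        )
        _, a, b = best
        if a in in_q or b in in_q:
            new = [b if a in in_q else a]
        else:
            # neither queued: the one with more traffic to the queue goes first
            key = lambda x: max((com[x][q] for q in queue), default=0.0)
            new = sorted((a, b), key=key, reverse=True)
        for x in new:
            queue.append(x)
            in_q.add(x)
    return queue  # threads are numbered 0..T-1 in this order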
Take the H.264 decoder as an example: it involves 12 modules (tasks) that are required to be mapped onto a 4 x 4 2-D Mesh processor array, and the communication traffic between these modules is shown in Table 1.
Table 1: inter-module communication traffic of the H.264 decoder
From/To  IP0      IP1       IP2      IP3      IP4    IP5      IP6    IP7    IP8    IP9    IP10  IP11
IP0      0.0      0.0       0.0      0.0      0.0    0.0      0.0    0.0    0.0    0.0    0.0   7,098.7
IP1      4,465.1  0.0       0.0      0.0      0.0    0.0      0.0    0.0    0.0    0.0    0.0   344.5
IP2      0.0      0.0       0.0      0.0      62.7   4,791.9  0.0    0.0    0.0    0.0    0.0   13,197.0
IP3      0.0      5,936.1   0.0      0.0      0.0    0.0      0.0    641.0  0.0    0.0    0.0   0.0
IP4      0.0      0.0       0.0      6,577.1  0.0    0.0      406.6  0.0    494.7  0.0    0.0   0.0
IP5      0.0      0.0       0.0      0.0      0.0    0.0      0.0    0.0    0.0    0.0    0.0   0.0
IP6      324.9    321.4     0.0      186.0    232.0  11.6     0.0    6.9    990.2  59.2   11.6  0.0
IP7      320.5    13.5      0.0      0.0      0.0    0.0      0.0    0.0    0.0    145.0  0.0   26.7
IP8      0.0      0.0       0.0      0.0      0.0    0.0      826.3  0.0    0.0    0.0    0.0   0.0
IP9      0.0      0.0       0.0      0.0      0.0    0.0      0.0    320.5  0.0    0.0    0.0   0.0
IP10     0.0      0.0       62.7     0.0      0.0    0.0      0.0    0.0    0.0    0.0    0.0   0.0
IP11     2,644.3  10,628.0  7,470.4  0.0      0.0    0.0      0.0    39.6   0.0    0.0    0.0   0.0
Suppose S=1 and IP5 is the special thread, so the initial ordering queue is {IP5}. Take the pair of threads with the largest traffic, IP2 and IP11; neither is in the queue, and since the traffic between IP2 and the queued thread IP5 is greater than the traffic between IP11 and IP5, IP2 joins the tail of the queue first. The new ordering queue is therefore {IP5, IP2, IP11}. The next pair with the largest traffic is IP1 and IP11; since IP11 is already in the queue, IP1 is simply appended to the tail, giving {IP5, IP2, IP11, IP1}. In the same way IP0 is appended, and the queue becomes {IP5, IP2, IP11, IP1, IP0}. The next pair with the largest traffic is IP3 and IP4; neither is in the queue, and since the maximum traffic of IP3 with the queued threads (its traffic with IP1) is greater than the traffic of IP4 with any queued thread, IP3 joins the tail first. The new ordering queue is therefore {IP5, IP2, IP11, IP1, IP0, IP3, IP4}. Continuing in this way, after all threads have joined, the queue is {IP5, IP2, IP11, IP1, IP0, IP3, IP4, IP6, IP8, IP7, IP9, IP10}. Finally these 12 threads are numbered 0 to 11 in this queue order.
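For a rough check of this walkthrough, the build_queue sketch above can be fed the symmetric totals Com_{i,j} = C_{i,j} + C_{j,i} computed here from Table 1 (IP5 being the single special thread); the expected output is the queue derived in the preceding paragraph:

# Nonzero symmetric traffic totals from Table 1, indexed by (IPi, IPj) pairs.
pairs = {(0, 11): 9743.0, (0, 1): 4465.1, (0, 6): 324.9, (0, 7): 320.5,
         (1, 11): 10972.5, (1, 3): 5936.1, (1, 6): 321.4, (1, 7): 13.5,
         (2, 4): 62.7, (2, 5): 4791.9, (2, 11): 20667.4, (2, 10): 62.7,
         (3, 7): 641.0, (3, 4): 6577.1, (3, 6): 186.0,
         (4, 6): 638.6, (4, 8): 494.7, (5, 6): 11.6,
         (6, 7): 6.9, (6, 8): 1816.5, (6, 9): 59.2, (6, 10): 11.6,
         (7, 9): 465.5, (7, 11): 66.3}
com = [[0.0] * 12 for _ in range(12)]
for (a, b), v in pairs.items():
    com[a][b] = com[b][a] = v
print(build_queue(12, special=[5], com=com))
# expected order per the walkthrough above: [5, 2, 11, 1, 0, 3, 4, 6, 8, 7, 9, 10]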
The mapping that minimizes the communication power consumption factor of the system is then computed with the above formulas; the concrete steps are as follows:
Step 1: initialize a two-dimensional array A[M][N] and set every element to -1, indicating that the position is idle, where M is the number of rows of the network-on-chip two-dimensional grid and N is the number of columns. If S=0, assign thread 0 to the centre of the grid; otherwise assign all special threads to the positions corresponding to their particular processing units. Then use formula (13) to calculate the desired location of the next thread, assign that thread to the unallocated position with the smallest Manhattan distance to its desired location, and proceed to the next thread until all threads have been assigned. When thread i is assigned to position (a, b), set A[a][b]=i, x[i]=a, y[i]=b.
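A sketch of Step 1 under the same assumed conventions (reusing manhattan and desired_location from the earlier snippets; preallocate and pinned are illustrative names):

# Sketch of Step 1: A[a][b] = -1 marks an idle cell; special threads (if any)
# are pinned first, then each remaining thread in queue order is placed on the
# free cell closest (Manhattan distance) to its desired location.
def preallocate(queue, special, pinned, com, M, N):
    A = [[-1] * N for _ in range(M)]
    pos = {}
    def place(t, cell):
        A[cell[0]][cell[1]] = t
        pos[t] = cell
    centre = (M // 2, N // 2)
    if not special:
        place(queue[0], centre)          # thread 0 (first in queue order) goes to the centre
        rest = queue[1:]
    else:
        for t in special:
            place(t, pinned[t])          # special threads keep their fixed cells
        rest = [t for t in queue if t not in special]
    for t in rest:
        want = desired_location(t, list(pos), pos, com, centre)
        free = [(a, b) for a in range(M) for b in range(N) if A[a][b] == -1]
        place(t, min(free, key=lambda c: manhattan(c, want)))
    return A, pos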
Step 2: set unextimes=0, form the T-S common threads of the ordering queue into a circular queue, and take the first thread of this circular queue. In this embodiment a linked list is used as the concrete implementation of the queue.
Step 3: assume that this thread does not belong to \Omega and use formula (13) to calculate its desired location; then use formula (10) and formula (12) to calculate the Com_diff of exchanging this thread's original position with each position of A whose Manhattan distance from that desired location is no greater than the threshold d (specific positions and infeasible positions excepted). The threshold d can be set by the user and determines the extent of the diamond-shaped region of "comparison positions": the larger d is, the larger the diamond-shaped region, but d must not exceed max(M, N); the preferred value is 1 or 2. This threshold affects both the complexity of the whole method and the degree of optimization of the final solution. Compare the Com_diff values, take the minimum Com_diff and record its corresponding position (x, y). If Com_diff >= 0, go to Step 4; otherwise go to Step 6.
Step 4: increment unextimes; if unextimes = T-S, go to Step 7, otherwise go to Step 5.
Step 5: take the next thread and go to Step 3.
Step 6: set unextimes=0. If A[x][y] = -1, assign this thread to position [x, y] and set its original position to -1; otherwise swap this thread with the thread at [x, y]. Go to Step 5.
Step 7: output the mapping file according to the position of each thread, and finish.
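Putting Steps 2-7 together, a sketch of the improvement loop under the same assumed conventions (reusing com_diff_swap, com_diff_move and desired_location from the earlier snippets; optimise and fixed are illustrative names):

# Sketch of Steps 2-7: `fixed` is the set of special thread ids (their cells are
# never touched) and d bounds the diamond of comparison positions around the
# desired location of the thread being examined.
def optimise(A, pos, queue, fixed, com, theta, d):
    M, N = len(A), len(A[0])
    centre = (M // 2, N // 2)
    common = [t for t in queue if t not in fixed]
    unextimes, idx = 0, 0
    while common and unextimes < len(common):
        i = common[idx % len(common)]
        idx += 1
        ix, iy = pos[i]
        others = [k for k in pos if k != i]          # treat i as not yet mapped
        wx, wy = desired_location(i, others, pos, com, centre)
        best_diff, best_cell = 0.0, None
        for a in range(max(0, wx - d), min(M, wx + d + 1)):
            for b in range(max(0, wy - d), min(N, wy + d + 1)):
                if abs(a - wx) + abs(b - wy) > d or (a, b) == (ix, iy):
                    continue
                j = A[a][b]
                if j in fixed:
                    continue                         # never displace a special thread
                diff = (com_diff_move(pos, com, theta, i, a, b) if j == -1
                        else com_diff_swap(pos, com, theta, i, j))
                if diff < best_diff:
                    best_diff, best_cell = diff, (a, b)
        if best_cell is None:                        # Step 4: no improving exchange
            unextimes += 1
            continue
        unextimes = 0                                # Step 6: perform the exchange
        a, b = best_cell
        j = A[a][b]
        A[ix][iy], A[a][b] = j, i
        pos[i] = (a, b)
        if j != -1:
            pos[j] = (ix, iy)
    return A, pos

With d=1 the diamond around a desired location contains only a handful of cells, which is consistent with the earlier remark that a small threshold keeps each pass of the method cheap.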
Still taking the above H.264 decoder as an example, suppose no particular thread needs to be fixed on a particular processing unit and set d=1. The result of executing the method is: IP10 maps to (0,0), IP9 to (0,2), IP7 to (0,3), IP2 to (1,0), IP11 to (1,1), IP1 to (1,2), IP3 to (1,3), IP5 to (2,0), IP0 to (2,1), IP6 to (2,2), IP4 to (2,3), and IP8 to (3,2).
Suppose instead that thread IP5 must be mapped to (3,1) in advance and set d=2. The result of executing the method in this case is: IP10 maps to (3,3), IP9 to (1,3), IP7 to (0,3), IP2 to (3,2), IP11 to (2,2), IP1 to (1,2), IP3 to (0,2), IP5 to (3,1), IP0 to (2,1), IP6 to (1,1), IP4 to (0,1), and IP8 to (1,0).
Taking as a concrete example the encoding and decoding of two video clips containing only a box and a hand, which maps 20 threads onto a 5 x 5 NoC, the power consumption and running time of the method described in document 1 and of the inventive method are compared in Table 2 and Table 3, respectively:
Table 2: power consumption comparison between the method of document 1 and the inventive method
[Table 2 data is given as an image in the original publication]
Table 3: running time comparison between the method of document 1 and the inventive method
[Table 3 data is given as an image in the original publication]
The data in Table 2 and Table 3 show that the optimal solution computed by the method of the present invention consumes less power than the optimal solution of document 1, and that with d=1 the solving time is also shorter.
Should be noted that and understand, under the situation that does not break away from the desired the spirit and scope of the present invention of accompanying Claim, can make various modifications and improvement the present invention of foregoing detailed description.Therefore, the scope of claimed technical scheme is not subjected to the restriction of given any specific exemplary teachings.

Claims (8)

1. A task mapping method for a two-dimensional grid network-on-chip, comprising the following steps:
1) pre-allocating desired locations on the two-dimensional grid for all threads, said threads including common threads that may be mapped to any position;
2) for each common thread, calculating the variation Com_diff of the total communication power consumption factor after exchanging it with a common thread or idle position near its desired location, said common thread performing the exchange with whichever common thread or idle position minimizes Com_diff, until every exchange between a common thread and a common thread or idle position near its desired location yields a Com_diff greater than or equal to 0;
3) outputting a mapping file according to the positions of all said threads.
2. The method according to claim 1, characterized in that said step 1) comprises:
11) listing said common threads in a queue in order of decreasing traffic of each common thread;
12) assigning the first common thread in said queue to the centre of said two-dimensional grid;
13) calculating the desired location of each common thread to be allocated according to the desired locations of the threads already allocated.
3. The method according to claim 1, characterized in that said threads further include special threads that must be mapped to specific positions.
4. The method according to claim 3, characterized in that said step 1) comprises:
11') listing said special threads in a queue;
12') adding said common threads to said queue in order of decreasing traffic of each common thread;
13) calculating the desired location of each common thread to be allocated according to the desired locations of the threads already allocated.
5. The method according to claim 2 or 4, characterized in that said step 13) calculates the desired location of a common thread to be allocated from the desired locations of the already-allocated threads according to the following formulas:

x_i = \left[ \frac{\sum_k Com_{i,k} \cdot x_k}{\sum_k Com_{i,k}} \right], \quad y_i = \left[ \frac{\sum_k Com_{i,k} \cdot y_k}{\sum_k Com_{i,k}} \right]

where the sums run over the already-allocated threads k, [\cdot] denotes rounding to the nearest integer, Com_{i,k} denotes the total amount of data communicated between threads i and k, x_k and y_k denote the x and y coordinates of thread k, and x_i and y_i denote the x and y coordinates of thread i.
6. The method according to claim 2 or 4, characterized in that said step 2) comprises:
21) forming all the common threads in said queue into a circular queue and taking any one of the common threads;
22) assuming that said common thread is not yet mapped, calculating the variation Com_diff of the total communication power consumption factor after exchanging said common thread with each common thread or idle position near its desired location, and exchanging said common thread with whichever common thread or idle position minimizes Com_diff;
23) repeating step 22) until every exchange between a common thread and a common thread or idle position near its desired location yields a Com_diff greater than or equal to 0.
7. The method according to claim 1, characterized in that a common thread or idle position near said desired location is a common thread or idle position whose distance from the desired location is less than a predetermined threshold.
8. The method according to claim 6, characterized in that said distance from the desired location is calculated as a Manhattan distance.
CN2008101162455A 2008-07-07 2008-07-07 Method for mapping task of network on two-dimensional grid chip Active CN101625673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101162455A CN101625673B (en) 2008-07-07 2008-07-07 Method for mapping task of network on two-dimensional grid chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101162455A CN101625673B (en) 2008-07-07 2008-07-07 Method for mapping task of network on two-dimensional grid chip

Publications (2)

Publication Number Publication Date
CN101625673A true CN101625673A (en) 2010-01-13
CN101625673B CN101625673B (en) 2012-06-27

Family

ID=41521524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101162455A Active CN101625673B (en) 2008-07-07 2008-07-07 Method for mapping task of network on two-dimensional grid chip

Country Status (1)

Country Link
CN (1) CN101625673B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103428804A (en) * 2013-07-31 2013-12-04 电子科技大学 Method for searching mapping scheme between tasks and nodes of network-on-chip (NoC) and network code position
CN103885842A (en) * 2014-03-19 2014-06-25 浙江大学 Task mapping method for optimizing whole of on-chip network with acceleration nodes
CN104079439A (en) * 2014-07-18 2014-10-01 合肥工业大学 NoC (network-on-chip) mapping method based on discrete firefly algorithm
CN104270308A (en) * 2014-10-15 2015-01-07 重庆大学 On-radio-frequency-piece network application mapping method facing unbalanced communication feature
CN106254254A (en) * 2016-09-19 2016-12-21 复旦大学 A kind of network-on-chip communication means based on Mesh topological structure
CN107391247A (en) * 2017-07-21 2017-11-24 同济大学 A kind of breadth First greed mapping method of network-on-chip application
WO2018014300A1 (en) * 2016-07-21 2018-01-25 张升泽 Power implementation method and system for multi-core chip

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6112085A (en) * 1995-11-30 2000-08-29 Amsc Subsidiary Corporation Virtual network configuration and management system for satellite communication system
CN101075961B (en) * 2007-06-22 2011-05-11 清华大学 Self-adaptable package for designing on-chip network

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103428804A (en) * 2013-07-31 2013-12-04 电子科技大学 Method for searching mapping scheme between tasks and nodes of network-on-chip (NoC) and network code position
CN103428804B (en) * 2013-07-31 2016-03-30 电子科技大学 Find mapping scheme and network code location method between network-on-chip task and node
CN103885842A (en) * 2014-03-19 2014-06-25 浙江大学 Task mapping method for optimizing whole of on-chip network with acceleration nodes
CN103885842B (en) * 2014-03-19 2017-08-25 浙江大学 A kind of band accelerates the overall duty mapping method of the optimization of the network-on-chip of node
CN104079439A (en) * 2014-07-18 2014-10-01 合肥工业大学 NoC (network-on-chip) mapping method based on discrete firefly algorithm
CN104079439B (en) * 2014-07-18 2017-02-22 合肥工业大学 NoC (network-on-chip) mapping method based on discrete firefly algorithm
CN104270308A (en) * 2014-10-15 2015-01-07 重庆大学 On-radio-frequency-piece network application mapping method facing unbalanced communication feature
WO2018014300A1 (en) * 2016-07-21 2018-01-25 张升泽 Power implementation method and system for multi-core chip
CN106254254A (en) * 2016-09-19 2016-12-21 复旦大学 A kind of network-on-chip communication means based on Mesh topological structure
CN107391247A (en) * 2017-07-21 2017-11-24 同济大学 A kind of breadth First greed mapping method of network-on-chip application
CN107391247B (en) * 2017-07-21 2020-06-26 同济大学 Breadth-first greedy mapping method for network-on-chip application

Also Published As

Publication number Publication date
CN101625673B (en) 2012-06-27


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant