Summary of the invention
Technical problem to be solved: to address the deficiencies of the prior art, the invention provides a novel overhead model, mapping cost functions and a mapping decision method suited to reconfigurable architectures. They are used to assess the overhead of software pipelining on a reconfigurable architecture and to determine the optimal mapping scheme, which solves the prior-art difficulty of building an accurate overhead model and determining the optimal mapping.
Technical solution: to solve the above technical problem, the present invention adopts the following technical solution:
A mapping decision method for a reconfigurable architecture based on overhead computation: a data dependence graph (DDG) representing the core loop of the application algorithm is built, and from it are obtained the current operation node u to be mapped, its direct predecessor nodes Pred(u), its direct successor nodes Succ(u), and the initiation interval II of software pipelining on the reconfigurable architecture. The following steps are then carried out in order:
(1) Establish the following four overhead models:
Delay overhead: the delay with which an operand is transferred to the input port of the candidate reconfigurable processing unit PE;
Interconnection overhead: the number of interconnect resources used to transfer data to the candidate reconfigurable processing unit PE;
PE occupancy overhead: a measure of how heavily each reconfigurable processing unit PE in the reconfigurable array is used;
Proximity overhead: a measure of the closeness between the current operation node to be mapped and operation nodes already mapped onto reconfigurable processing units PE that have no direct data dependence on it but share a direct successor operation node with it;
(2) For the multiple feasible mapping schemes available to an operation node, compute the overhead value of each feasible mapping scheme under each of the four overhead models;
The delay overhead is computed by a formula in which:
$PE_u$ denotes a candidate processing unit onto which the current operation node u to be mapped can be mapped;
$V_{s'}$ denotes the set of all operation nodes, among the direct predecessor operation nodes of u, that have already been mapped onto the reconfigurable array;
the first delay term denotes the delay introduced by interconnect wires on the data transfer path along which the operand required by u travels from the reconfigurable processing unit $PE_{v_{s'}}$ of a mapped direct predecessor node $v_{s'}$ to the current candidate reconfigurable processing unit $PE_u$;
the other two delay terms denote, respectively, the delays introduced on the same data transfer path from $PE_{v_{s'}}$ to $PE_u$ by routing PEs and by distributed registers DRF;
The interconnection overhead is computed by a formula in which:
$V_s$ denotes the set of all operation nodes already mapped onto the reconfigurable array;
$V_{s'}$ denotes the set of operation nodes, among the direct predecessor and direct successor operation nodes of u, that have already been mapped onto the reconfigurable array; clearly $V_{s'}$ is a subset of $V_s$;
$v_{s'} \in V_{s'}$ denotes one already-mapped operation node among the direct predecessor or direct successor operation nodes of u;
Pred(u) denotes the set of all direct predecessor operation nodes of u;
Succ(u) denotes the set of all direct successor operation nodes of u;
the routing term denotes the minimum number of routing PEs that must be inserted between $PE_u$ and $PE_{v_{s'}}$;
The formula means the following. If u has no direct predecessor or direct successor operation nodes, or none of them has been mapped yet, then Interconnetcost($PE_u$) = 0. Otherwise, the interconnection overhead equals the minimum number of routing PEs needed between the current candidate $PE_u$ and the reconfigurable processing units that host the already-mapped direct predecessor and direct successor operation nodes of u. In particular, when $PE_u$ cannot satisfy the interconnection requirement, Interconnetcost($PE_u$) = ∞.
The PE occupancy overhead is computed by a formula in which:
PEOccupationCycles($PE_u$) denotes the total time the current reconfigurable processing unit PE spends executing the set of operations mapped onto it;
II is the initiation interval of software pipelining on the reconfigurable architecture;
The proximity overhead is computed by a formula in which:
$V_{min}$ denotes the set of all already-mapped operation nodes v that are at the shortest distance from the current operation node u to be mapped;
Vexdist denotes the distance in the DDG between an already-mapped operation node v in $V_{min}$ and u;
PEdist denotes the distance between $PE_v$, the reconfigurable processing unit onto which an already-mapped operation node v in $V_{min}$ is mapped, and $PE_u$, the candidate reconfigurable processing unit of u;
(3) For the multiple feasible mapping schemes of the current operation node u to be mapped, traverse the feasible mapping schemes in the order delay overhead, interconnection overhead, PE occupancy overhead, proximity overhead, narrowing the set of feasible mapping schemes at each pass until the optimal mapping scheme is obtained.
The DDG representing the core loop of the application algorithm is built first. The basic modulo scheduling parameters of the reconfigurable architecture are derived from it, and the overhead models and their formulas are then established on the basis of these parameters.
Further, in the present invention, the traversal that narrows the set of feasible mapping schemes and finally yields the optimal mapping scheme comprises the following four steps, carried out in order:
(1) Delay overhead traversal: sort and screen the feasible mapping schemes by delay overhead, and retain those whose delay overhead lies within a certain threshold range. The threshold range is tuned to the actual application program and the specific reconfigurable architecture; how to tune it is common knowledge for those skilled in the art;
(2) Interconnection overhead traversal: sort and screen the mapping schemes that survived the delay overhead traversal by interconnection overhead, and retain those whose interconnection overhead lies within a certain threshold range. The threshold range is tuned to the actual application program and the specific reconfigurable architecture; how to tune it is common knowledge for those skilled in the art;
(3) PE occupancy overhead traversal: sort and screen the mapping schemes that survived the interconnection overhead traversal by PE occupancy overhead, and retain those with the minimum PE occupancy overhead;
(4) Proximity overhead traversal: sort and screen the mapping schemes that survived the PE occupancy overhead traversal by proximity overhead, and retain the one with the minimum proximity overhead.
The order of the traversal follows the four overheads' influence on the mapping result, from largest to smallest: the delay overhead has the greatest influence, followed in turn by the interconnection overhead, the PE occupancy overhead and the proximity overhead. Screening step by step in this order yields the best mapping scheme.
Beneficial effects:
The present invention starts from a thorough analysis of the hardware components of the reconfigurable array and takes into account the characteristics of real application programs running on the reconfigurable system. It uses data transfer delay, interconnect resource usage, functional unit occupancy, and the proximity between mapping distance and inter-operation dependence as the criteria for selecting the optimal mapping. On this basis it establishes a sound overhead model, the corresponding mapping cost functions and a mapping decision method that combines them, so that mapping cost can be assessed fully and effectively.
The decision method screens candidate mapping schemes with each overhead model in turn, in descending order of the model's influence on the mapping result, narrowing the screening range step by step until the optimal mapping is decided. This ensures that the factors with greater influence on the mapping carry more weight in the mapping decision.
With the overhead model and mapping decision method of the present invention, configuration information with higher execution efficiency can be obtained, which exploits the parallelism of the reconfigurable system more fully. Compared with existing methods, the invention achieves better automated generation of configuration information.
Embodiment
The present invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is a structural block diagram of the reconfigurable system. The reconfigurable system consists of a main control processor, a system bus, a reconfigurable array, a data flow controller, a configuration controller and a set of storage resources. The storage resources include configuration registers and global registers.
Fig. 2 is a structural diagram of a reconfigurable array of size 4 × 4. The reconfigurable array consists of reconfigurable processing units PE, the storage resources of the reconfigurable array, and a programmable interconnection network.
Each reconfigurable processing unit PE provides a data path between its data output port and its own input port, and supports a routing mode and a conditional execution mechanism.
The storage resources in the reconfigurable array that hold data and configuration information comprise: the distributed registers DRF, the output register REG placed at the output port of each reconfigurable processing unit PE, and the local configuration information register inside each reconfigurable processing unit PE.
The programmable interconnection network in the reconfigurable array comprises a data transmission network and a condition signal transmission network. The data transmission network carries data between reconfigurable processing units PE, between distributed registers DRF, and between reconfigurable processing units PE and distributed registers DRF. The condition signal transmission network carries 1-bit condition control signals.
The present invention first builds, for the specific application program, a data dependence graph (DDG) representing the core loop of the application algorithm, and derives from it the basic modulo scheduling parameters of the reconfigurable architecture: the set of all operation nodes, including the current operation node u to be mapped, its direct predecessor nodes Pred(u) and its direct successor nodes Succ(u), together with the initiation interval II of software pipelining on the reconfigurable architecture.
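As a minimal sketch of how these parameters might be extracted, assume the DDG is held as a plain list of directed (producer, consumer) edges. The `build_ddg` helper and the three-node graph below are hypothetical illustrations, not part of the invention:

```python
from collections import defaultdict

def build_ddg(edges):
    """Return (pred, succ) maps for a DDG given as (producer, consumer) edges."""
    pred = defaultdict(set)
    succ = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)  # dst directly consumes a value produced by src
        pred[dst].add(src)  # src is a direct predecessor of dst
    return pred, succ

# Hypothetical core loop: nodes "a" and "b" both feed node "c".
edges = [("a", "c"), ("b", "c")]
pred, succ = build_ddg(edges)
```

Here `pred["c"]` plays the role of Pred(c) and `succ["a"]` the role of Succ(a); nodes "a" and "b" have no direct dependence on each other but share the direct successor "c".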
The following steps are then carried out; the workflow is shown in Fig. 3 and Fig. 4:
Step 1: overhead models for assessing a mapping scheme are established from four aspects: data transfer delay, the number of interconnect resources used, the share of reconfigurable processing units PE occupied, and how well the distance between operation nodes in the DDG matches the distance between the corresponding reconfigurable processing units PE in the reconfigurable array. The four models are the delay overhead, the interconnection overhead, the PE occupancy overhead and the proximity overhead;
Delay overhead: the delay with which the operand required by a reconfigurable processing unit PE to execute an operation is transferred to its input port. Different data transfer paths contain different hardware components and therefore incur different transfer delays; the larger the delay, the worse the path performs;
Interconnection overhead: the number of interconnect resources used to transfer data to the destination processing unit. The higher the interconnection overhead, the more interconnect resources are wasted and the fewer remain for mapping subsequent operation nodes, so mapping becomes harder the further it proceeds;
PE occupancy overhead: a measure of how heavily each reconfigurable processing unit PE in the array is used. If usage differs widely in a mapping result, that is, some reconfigurable processing units PE execute noticeably more operations than others, then the more operations a PE executes, the more often it must be reconfigured and the larger the final configuration file, which lengthens the total execution time of the reconfigurable array;
Proximity overhead: addresses the case in which the current operation node to be mapped and an operation node already mapped onto a reconfigurable processing unit PE have no direct data dependence but share a direct successor operation node; proximity measures how close they are.
Step 2: for the multiple feasible mapping schemes available to an operation node, compute the four overhead values of each feasible mapping scheme under the four overhead models.
Step 3: ranked by influence on the mapping result from primary to secondary, the four overhead models are: delay overhead, interconnection overhead, PE occupancy overhead and proximity overhead. The feasible mapping schemes are traversed in this order, narrowing the set of feasible mapping schemes at each pass until the optimal mapping scheme is obtained.
The traversal process of the present invention is as follows:
Delay overhead traversal: sort and screen the feasible mapping schemes by delay overhead, and retain those whose delay overhead lies within a certain threshold range. The threshold range is tuned to the actual application program and the specific reconfigurable architecture; how to tune it is common knowledge for those skilled in the art;
Interconnection overhead traversal: sort and screen the mapping schemes that survived the delay overhead traversal by interconnection overhead, and retain those whose interconnection overhead lies within a certain threshold range. The threshold range is tuned to the actual application program and the specific reconfigurable architecture; how to tune it is common knowledge for those skilled in the art;
PE occupancy overhead traversal: sort and screen the mapping schemes that survived the interconnection overhead traversal by PE occupancy overhead, and retain those with the minimum PE occupancy overhead;
Proximity overhead traversal: sort and screen the mapping schemes that survived the PE occupancy overhead traversal by proximity overhead, and retain the one with the minimum proximity overhead.
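The four passes can be sketched as a chain of filters. This is an illustration under stated assumptions rather than the claimed implementation: the text leaves the threshold ranges to the practitioner, so the sketch interprets each one as "within a given slack of the best value". The function name, the slack parameters and the dictionary field names are all hypothetical.

```python
def select_mapping(candidates, delay_slack, interconnect_slack):
    """Narrow a non-empty list of candidate dicts in four successive passes.

    Each candidate is a dict with the (hypothetical) keys 'pe', 'delay',
    'interconnect', 'occupancy' and 'proximity'.
    """
    # Pass 1: keep candidates whose delay is within delay_slack of the best.
    best = min(c["delay"] for c in candidates)
    kept = [c for c in candidates if c["delay"] <= best + delay_slack]
    # Pass 2: same idea for the interconnection overhead.
    best = min(c["interconnect"] for c in kept)
    kept = [c for c in kept if c["interconnect"] <= best + interconnect_slack]
    # Pass 3: keep only the minimum PE occupancy overhead.
    best = min(c["occupancy"] for c in kept)
    kept = [c for c in kept if c["occupancy"] == best]
    # Pass 4: keep only the minimum proximity overhead.
    best = min(c["proximity"] for c in kept)
    kept = [c for c in kept if c["proximity"] == best]
    return kept

candidates = [
    {"pe": "A", "delay": 0, "interconnect": 0, "occupancy": 1, "proximity": 1},
    {"pe": "B", "delay": 0, "interconnect": 0, "occupancy": 1, "proximity": 0},
    {"pe": "C", "delay": 5, "interconnect": 0, "occupancy": 0, "proximity": 0},
]
survivors = select_mapping(candidates, delay_slack=1, interconnect_slack=1)
```

Candidate C has the lowest occupancy but is already eliminated in the first pass, which illustrates why the order of the passes matters. If several candidates tie after the last pass, the sketch returns all of them and leaves the tie-break to the caller.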
The four overhead models are described in detail below.
(1) Delay overhead
When mapping the current operation node u to be mapped onto a candidate reconfigurable processing unit $PE_u$ is considered, the delay overhead represents the routing delay with which the operands needed to execute u are transferred to the input port of $PE_u$. In the reconfigurable array, the routing components on a data transfer path are of three kinds: interconnect wires, routing PEs and distributed registers DRF. The total delay overhead is the sum of the delays of these three kinds of routing components. In the formula for the delay overhead:
$PE_u$ denotes a candidate processing unit onto which the current operation node u to be mapped can be mapped;
$V_{s'}$ denotes the set of all operation nodes, among the direct predecessor operation nodes of u, that have already been mapped onto the reconfigurable array;
the first delay term denotes the delay introduced by interconnect wires on the data transfer path along which the operand required by u travels from the reconfigurable processing unit $PE_{v_{s'}}$ of a mapped direct predecessor node $v_{s'}$ to the current candidate reconfigurable processing unit $PE_u$;
the other two delay terms denote, respectively, the delays introduced on the same data transfer path from $PE_{v_{s'}}$ to $PE_u$ by routing PEs and by distributed registers DRF;
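The formula itself is not reproduced in this text. One reconstruction consistent with the definitions above is sketched below. It assumes the three component delays are summed over every mapped direct predecessor; the names $\mathrm{Delaycost}$, $D_{\mathrm{wire}}$, $D_{\mathrm{route}}$ and $D_{\mathrm{DRF}}$ are hypothetical, since the original symbols are lost:

```latex
\mathrm{Delaycost}(PE_u) =
  \sum_{v_{s'} \in V_{s'}}
  \Big(
    D_{\mathrm{wire}}(PE_{v_{s'}}, PE_u)
    + D_{\mathrm{route}}(PE_{v_{s'}}, PE_u)
    + D_{\mathrm{DRF}}(PE_{v_{s'}}, PE_u)
  \Big)
```

Under this reading the sum is empty, and the delay overhead is 0, whenever the node has no mapped direct predecessor.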
(2) Interconnection overhead
The interconnection overhead represents the number of interconnect resources used to transfer data to the destination processing unit. Using too many interconnect resources not only increases the communication delay between operation nodes but also wastes resources, which severely restricts the mapping of later operation nodes and may even leave later operations without any free resource. The higher the interconnection overhead, the more interconnect resources are wasted and the fewer remain for scheduling subsequent operation nodes, so scheduling becomes harder the further it proceeds. Therefore, when choosing the reconfigurable processing unit PE onto which a current operation node u is to be mapped, preference should be given to the PE with the minimum interconnection overhead to the PEs that host the direct predecessor or direct successor operation nodes of u. In the formula for the interconnection overhead:
$V_s$ denotes the set of all operation nodes already mapped onto the reconfigurable array;
$V_{s'}$ denotes the set of operation nodes, among the direct predecessor and direct successor operation nodes of u, that have already been mapped onto the reconfigurable array; clearly $V_{s'}$ is a subset of $V_s$;
$v_{s'} \in V_{s'}$ denotes one already-mapped operation node among the direct predecessor or direct successor operation nodes of u;
Pred(u) denotes the set of all direct predecessor operation nodes of u;
Succ(u) denotes the set of all direct successor operation nodes of u;
the routing term denotes the minimum number of routing PEs that must be inserted between $PE_u$ and $PE_{v_{s'}}$;
The formula means the following. If u has no direct predecessor or direct successor operation nodes, or none of them has been mapped yet, then Interconnetcost($PE_u$) = 0. Otherwise, the interconnection overhead equals the minimum number of routing PEs needed between the current candidate $PE_u$ and the reconfigurable processing units that host the already-mapped direct predecessor and direct successor operation nodes of u. In particular, when $PE_u$ cannot satisfy the interconnection requirement, Interconnetcost($PE_u$) = ∞.
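The three cases described above (zero, a routing-PE count, or infinity) can be sketched as a small function. Two points are assumptions rather than statements of the text: contributions from several mapped neighbours are summed, and the hypothetical helper `min_route_pes` returns `None` when no route exists.

```python
import math

def interconnect_cost(pe_u, mapped_neighbor_pes, min_route_pes):
    """Interconnection overhead of candidate pe_u.

    mapped_neighbor_pes: PEs hosting the already-mapped direct predecessors
        and successors of u (the set V_s' in the text).
    min_route_pes(a, b): hypothetical helper returning the minimum number of
        routing PEs between a and b, or None when no route exists.
    """
    total = 0
    for pe_v in mapped_neighbor_pes:
        hops = min_route_pes(pe_u, pe_v)
        if hops is None:
            return math.inf  # pe_u cannot satisfy the interconnection need
        total += hops        # assumption: contributions are summed
    return total             # 0 when no neighbour is mapped yet

# Toy one-dimensional array: adjacent PEs need no routing PE in between,
# and a negative index stands for an unreachable PE.
def toy_route(a, b):
    return None if b < 0 else max(abs(a - b) - 1, 0)

cost_none = interconnect_cost(3, [], toy_route)
cost_some = interconnect_cost(3, [1, 4], toy_route)
cost_unreachable = interconnect_cost(3, [-1], toy_route)
```

Passing the routing query in as a function keeps the sketch independent of any particular interconnect topology.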
(3) PE occupancy overhead
The PE occupancy overhead measures how heavily each reconfigurable processing unit PE in the reconfigurable array is used. If usage differs widely in a mapping result, that is, some reconfigurable processing units PE execute noticeably more operations than others, then the more operations a PE executes, the more often it must be reconfigured and the larger the final configuration file, which lengthens the total execution time of the reconfigurable array. The usage of the reconfigurable processing units PE therefore needs to be balanced so that their occupancies differ as little as possible, and the PE occupancy overhead is the parameter used to examine PE usage during mapping. In the formula for the PE occupancy overhead:
PEOccupationCycles($PE_u$) denotes the total time the current reconfigurable processing unit PE spends executing the set of operations mapped onto it;
II is the initiation interval of software pipelining on the reconfigurable architecture.
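The formula is not reproduced in this text. Since the only two quantities it defines are the occupied cycles and II, a plausible reading, offered purely as an assumption, is the fraction of the initiation interval that the PE is busy. The name $\mathrm{PEOccupationcost}$ is hypothetical:

```latex
\mathrm{PEOccupationcost}(PE_u) = \frac{\mathrm{PEOccupationCycles}(PE_u)}{II}
```

This reading fits modulo scheduling, where each PE offers II time slots per loop iteration, so the ratio grows toward 1 as a PE fills up and candidates with a lower ratio balance the load better.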
(4) Proximity overhead
The interconnection overhead and the delay overhead both assess operation nodes that have a direct dependence, which should be mapped onto reconfigurable processing units PE that are as close together as possible. Neither covers operation nodes that have no direct dependence but share a direct successor operation node. Moreover, if the current operation node u to be mapped has no mapped direct predecessor or direct successor operation node, the interconnection overhead and delay overhead of every reconfigurable processing unit PE are 0, and the mapping overhead cannot be assessed effectively. The proximity overhead addresses the case in which the current operation node u to be mapped and an operation node v already mapped onto a reconfigurable processing unit PE have no direct data dependence but share a direct successor operation node. Proximity measures how close they are, so that the mapping with the minimum overhead for transferring data between them can be selected. In the formula for the proximity overhead:
$V_{min}$ denotes the set of already-mapped operation nodes v that are at the shortest distance from the operation node u to be mapped.
Vexdist denotes the distance in the DDG between an already-mapped operation node v in $V_{min}$ and u.
PEdist denotes the distance between $PE_v$, the reconfigurable processing unit onto which an already-mapped operation node v in $V_{min}$ is mapped, and $PE_u$, the candidate reconfigurable processing unit of u.
The formula shows that the larger the gap between Vexdist and PEdist, the larger the proximity overhead paid for mapping onto the candidate reconfigurable processing unit $PE_u$.
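A minimal sketch of this overhead follows. It assumes the gap is the absolute difference between PEdist and Vexdist, summed over $V_{min}$; the exact formula is not reproduced in this text, and the function and parameter names are hypothetical. The two distance functions are supplied by the caller because they depend on the DDG and on the array topology.

```python
def proximity_cost(u, pe_u, v_min, placement, vexdist, pedist):
    """Proximity overhead of mapping node u onto candidate pe_u.

    v_min: mapped nodes closest to u in the DDG (V_min in the text).
    placement: dict from mapped node to the PE it occupies.
    vexdist(u, v): distance between two nodes in the DDG.
    pedist(a, b): distance between two PEs in the array.
    """
    # Assumption: the gap is an absolute difference, summed over V_min.
    return sum(abs(pedist(pe_u, placement[v]) - vexdist(u, v)) for v in v_min)

# Toy data: one mapped node "v" at DDG distance 1 from "u".
placement = {"v": "P0"}
toy_pedist = {("P1", "P0"): 1, ("P2", "P0"): 2}
near = proximity_cost("u", "P1", ["v"], placement,
                      lambda a, b: 1, lambda a, b: toy_pedist[(a, b)])
far = proximity_cost("u", "P2", ["v"], placement,
                     lambda a, b: 1, lambda a, b: toy_pedist[(a, b)])
```

With Vexdist = 1, a candidate at PEdist = 1 gives an overhead of 0 and a candidate at PEdist = 2 gives 1, which matches the values quoted for the embodiment of Fig. 5.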
As one embodiment of the present invention, the mapping example in Fig. 5 illustrates how the overhead models of the invention are computed. Part (a) of Fig. 5 shows the DDG of the algorithm to be mapped, which represents the core loop of an application program, and part (b) shows the corresponding reconfigurable system made up of PEs. Suppose the current operation node to be mapped is OP2, with reconfigurable processing units $PE_{23}$ and $PE_{33}$ as its candidate reconfigurable processing units. Operation nodes OP1, OP3 and OP5 have already been mapped onto reconfigurable processing units $PE_{22}$, $PE_{32}$ and $PE_{42}$ respectively, and OP4 is the direct successor operation node of OP2 and has not yet been mapped.
Following the method of the invention, the four overhead models (delay overhead, interconnection overhead, PE occupancy overhead and proximity overhead) are established for the operation node OP2 to be mapped, and the four overhead values are computed with the formula of each model.
(1) Delay overhead:
Because the current operation node OP2 to be mapped has no direct predecessor operation node, all three delay terms are 0.
So, for OP2, the delay overhead of both candidate reconfigurable processing units $PE_{23}$ and $PE_{33}$ is 0.
(2) Interconnection overhead:
Because OP2 has no direct predecessor operation node and its direct successor operation node OP4 has not yet been mapped, the interconnection overhead of both candidate reconfigurable processing units $PE_{23}$ and $PE_{33}$ is 0;
(3) PE occupancy overhead: because the initiation interval II is 1 and no other operation is executed on either candidate reconfigurable processing unit $PE_{23}$ or $PE_{33}$, the two candidates have the same PE occupancy overhead.
(4) Proximity overhead:
If the current operation node OP2 to be mapped is mapped onto reconfigurable processing unit $PE_{33}$, the next operation node OP4 is mapped onto reconfigurable processing unit $PE_{43}$, so Vexdist(u, v) = 1 and PEdist($PE_u$, $PE_v$) = 1. The proximity overhead is therefore 0 and the minimum number of routing PEs needed is 0, so no extra routing PE overhead is incurred. If instead OP2 is mapped onto reconfigurable processing unit $PE_{23}$, the next operation node OP4 will be mapped onto reconfigurable processing unit $PE_{33}$ or $PE_{43}$; in both cases PEdist($PE_u$, $PE_v$) equals 2, the proximity overhead is 1, and one reconfigurable processing unit PE must serve as a routing PE to transfer the data. It follows that mapping OP2 onto $PE_{23}$, compared with mapping it onto $PE_{33}$, both increases the data transfer delay and wastes computing resources.
Therefore, the final choice is to map operation node OP2 onto reconfigurable processing unit $PE_{33}$.
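The decision in this example can be replayed in a few lines. The sketch is a simplification: it compares the four overheads in priority order as tuples instead of applying threshold-based screening, which gives the same result here because the first three overheads tie. The occupancy value is a placeholder, since the text gives no figure for it; the sketch assumes only that it is equal for both candidates.

```python
# Overheads of the two candidates for OP2, in priority order:
# (delay, interconnection, PE occupancy, proximity).
# The occupancy entry is a placeholder (0); the only assumption is that
# it is identical for both candidates.
overheads = {
    "PE23": (0, 0, 0, 1),
    "PE33": (0, 0, 0, 0),
}

# With the first three overheads tied, comparing the tuples in priority
# order reduces to comparing the proximity overhead.
best_pe = min(overheads, key=overheads.get)
print(best_pe)  # PE33
```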
To demonstrate the feasibility and advantages of the method more convincingly, sub-algorithms of several typical application programs were mapped onto the reconfigurable architecture with both the method of the invention and earlier methods, and the results were compared. The instructions per cycle (IPC) obtained are listed in Table 1:
Table 1
The IPC values listed in Table 1 directly reflect the parallelism of loop execution: the larger the IPC, the more operations execute in parallel in the same cycle and the higher the parallelism of loop execution. The table compares two kinds of result. The first is the IPC produced by the configuration information obtained with each of the four overhead functions used on its own: data transfer delay, interconnect resource usage, functional unit occupancy, and the proximity between mapping distance and inter-operation dependence. The second is the IPC produced by the configuration information obtained with the decision method of the invention, which combines the four overhead functions. For every test program the decision method of the invention achieves the highest IPC, so the configuration information produced by the optimized decision gives the reconfigurable system better parallelism and hence higher execution efficiency.
The above is only the preferred embodiment of the present invention. It should be noted that those skilled in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications also fall within the protection scope of the present invention.