Disclosure of Invention
The technical problem to be solved: to address the shortcomings of the prior art, the invention provides a novel overhead model, mapping overhead functions and a mapping decision method suited to reconfigurable architectures. They are used to evaluate the software pipelining overhead of a reconfigurable architecture and to select the optimal mapping scheme, thereby solving the prior-art difficulties of building an accurate overhead model and of determining the optimal mapping.
The technical scheme is as follows: in order to solve the technical problems, the invention adopts the following technical scheme:
a mapping decision method for reconfigurable architectures based on overhead calculation, which first builds a data dependency graph DDG representing the core loop of the application algorithm, thereby obtaining the operation node u currently to be mapped, the direct predecessor nodes pred(u) of u, the direct successor nodes succ(u) of u, and the initiation interval II of software pipelining on the reconfigurable architecture, and then performs the following steps in order:
(1) establishing the following 4 overhead models which are respectively:
delay overhead: the delay of transferring operands to the input port of the candidate reconfigurable processing unit PE;
interconnection overhead: the number of interconnect resources used to deliver data to the candidate reconfigurable processing unit PE;
PE occupancy overhead: a measure of how heavily each reconfigurable processing unit PE in the reconfigurable array is used;
proximity overhead: a measure of the proximity between the operation node currently to be mapped and an operation node already mapped onto a reconfigurable processing unit PE, where the two have no direct data dependence but share the same direct successor operation node;
(2) for the several feasible mapping schemes available to an operation node, calculating the overhead value of each feasible mapping scheme under each of the 4 overhead models;
the calculation formula of the delay overhead is as follows:
wherein,
PE_u denotes a candidate reconfigurable processing unit onto which the operation node u currently to be mapped can be mapped;
V_s′ denotes the set of all direct predecessor operation nodes of the operation node u currently to be mapped that have already been mapped onto the reconfigurable array;
the interconnect-wire delay term denotes the delay introduced by the interconnect wires on the data transmission path that carries an operand required by u from PE_{v_s′}, the reconfigurable processing unit of the already-mapped direct predecessor node v_s′, to the current candidate reconfigurable processing unit PE_u;
the routing-PE and DRF delay terms denote, respectively, the delays introduced by the routing PEs and by the distributed registers DRF on the same data transmission path from PE_{v_s′} to PE_u;
the calculation formula of the interconnection overhead is as follows:
wherein,
V_s denotes the set of all operation nodes that have already been mapped onto the reconfigurable array;
V_s″ denotes the set of direct predecessor and direct successor operation nodes of the operation node u currently to be mapped that have already been mapped onto the reconfigurable array; clearly V_s″ is a subset of V_s;
v_s′ ∈ V_s″ denotes a direct predecessor or direct successor operation node of u that has already been mapped onto the reconfigurable array;
pred(u) denotes the set of all direct predecessor operation nodes of the operation node u currently to be mapped;
succ(u) denotes the set of all direct successor operation nodes of the operation node u currently to be mapped;
the remaining term denotes the minimum number of routing PEs that must be inserted between the reconfigurable processing units PE_u and PE_{v_s′};
the above formula means the following: if the operation node u currently to be mapped has no direct predecessor and no direct successor operation node, or none of its direct predecessor and direct successor operation nodes has been mapped, then InterconnectCost(PE_u) = 0. Otherwise, the interconnection overhead equals the minimum number of routing PEs that must be used between the current candidate reconfigurable processing unit PE_u and the reconfigurable processing units PE_{v_s′} of those direct predecessor and direct successor operation nodes of u that have already been mapped onto the reconfigurable array. In particular, when PE_u cannot satisfy the interconnection requirement, InterconnectCost(PE_u) = ∞.
The calculation formula of the PE occupancy rate overhead is as follows:
wherein,
PEOccupationCycles(PE_u) denotes the total time the current reconfigurable processing unit PE_u spends executing the set of operations mapped onto it;
II denotes the initiation interval of software pipelining on the reconfigurable architecture;
the calculation formula of the proximity overhead is as follows:
wherein,
V_min denotes the set of already-mapped operation nodes v that are at the shortest distance from the operation node u currently to be mapped;
Vexdist(u, v) denotes the distance between an already-mapped operation node v in V_min and the operation node u currently to be mapped;
PEdist(PE_u, PE_v) denotes the distance between PE_v, the reconfigurable processing unit onto which the already-mapped operation node v in V_min is mapped, and PE_u, the candidate reconfigurable processing unit of the operation node u currently to be mapped;
(3) screening the feasible mapping schemes of the operation node u currently to be mapped by delay overhead, interconnection overhead, PE occupancy overhead and proximity overhead, in that order, progressively narrowing the set of feasible mapping schemes until the optimal mapping scheme is obtained.
First, a data dependency graph DDG representing the core loop of the application algorithm is built and the basic modulo scheduling parameters of the reconfigurable architecture are analyzed; the overhead models and their calculation formulas are then established on the basis of these parameters.
Further, in the present invention, the screening process that traverses the feasible mapping schemes, progressively narrows the set of feasible mapping schemes and finally obtains the optimal mapping scheme comprises the following 4 steps executed in order:
(1) screening by delay overhead: the feasible mapping schemes are sorted by delay overhead, and those whose delay overhead lies within a threshold range are retained; the threshold range is tuned to the actual application program and the specific reconfigurable architecture, by methods that are common general knowledge to those skilled in the art;
(2) screening by interconnection overhead: the mapping schemes that passed the delay overhead screening are sorted by interconnection overhead, and those whose interconnection overhead lies within a threshold range are retained; the threshold range is tuned to the actual application program and the specific reconfigurable architecture, by methods that are common general knowledge to those skilled in the art;
(3) screening by PE occupancy overhead: the mapping schemes that passed the interconnection overhead screening are sorted by PE occupancy overhead, and the mapping schemes with the minimum PE occupancy overhead are retained;
(4) screening by proximity overhead: the mapping schemes that passed the PE occupancy overhead screening are sorted by proximity overhead, and the mapping scheme with the minimum proximity overhead is retained.
The screening order follows the degree of influence of the 4 overheads on the mapping result, from largest to smallest: the delay overhead has the greatest influence, followed by the interconnection overhead, the PE occupancy overhead and the proximity overhead, so that the optimal mapping scheme is obtained by stepwise screening.
Advantageous effects:
After fully analyzing the hardware components of the reconfigurable array and taking into account the characteristics of real application programs running on a reconfigurable system, the invention uses data transmission delay, interconnect resource usage, functional unit occupancy, and the proximity between mapping distance and inter-operation correlation as the criteria for selecting the optimal mapping. On this basis it establishes a reasonable overhead model, the corresponding mapping overhead functions, and a mapping decision method that combines them, so that mapping overhead can be evaluated comprehensively and effectively.
The decision method screens the candidate mapping schemes in order of each overhead model's influence on the mapping result, from largest to smallest, progressively narrowing the candidate set until the optimal mapping is determined; this ensures that the factors with the greatest influence on mapping dominate the mapping decision.
By using the overhead model and the mapping decision method, configuration information with higher execution efficiency is obtained, so that the parallelism of the reconfigurable system is fully exploited and, compared with existing methods, better configuration information is generated automatically.
Detailed Description
The invention is further elucidated with reference to the drawings and the specific embodiments.
Fig. 1 is a block diagram of a reconfigurable system architecture. The reconfigurable system consists of a main control processor, a system bus, a reconfigurable array, a data flow controller, a configuration controller and a series of storage resources, the storage resources including configuration registers and global registers.
Fig. 2 is a structural diagram of a reconfigurable array of scale 4 × 4. The reconfigurable array is composed of a reconfigurable processing unit PE, storage resources of the reconfigurable array and a programmable interconnection network.
The reconfigurable processing unit PE provides a data path between an output port of data and an input port of the reconfigurable processing unit PE, and supports a routing mode and a conditional execution mechanism.
The storage resources within the reconfigurable array to store data and configuration information include: the distributed register DRF, the output register REG set by the output port of the reconfigurable processing unit PE and the local configuration information register inside the reconfigurable processing unit PE.
The programmable interconnect network within the reconfigurable array includes a data transport network and a conditional signal transport network. The data transmission network is used for data transmission among the reconfigurable processing units PE, among the distributed registers DRF and between the reconfigurable processing units PE and the distributed registers DRF; the conditional signal transmission network is used for transmitting a 1-bit conditional control signal.
The invention first builds, for a specific application program, a data dependency graph DDG representing the core loop of the application algorithm, and analyzes the DDG to obtain the basic modulo scheduling parameters of the reconfigurable architecture: the set of all operation nodes, including the operation node u currently to be mapped; the direct predecessor nodes pred(u) of u; the direct successor nodes succ(u) of u; and the initiation interval II of software pipelining on the reconfigurable architecture.
Then, the following steps are performed, and the work flow is shown in fig. 3 and 4:
Firstly, overhead models for evaluating a mapping scheme on the reconfigurable array are established from 4 aspects: data transmission delay, the number of interconnect resources used, the occupancy of the reconfigurable processing units PE, and how well the distance between operation nodes in the data dependency graph DDG matches the distance between the corresponding reconfigurable processing units PE in the reconfigurable array. The resulting models are the delay overhead, the interconnection overhead, the PE occupancy overhead and the proximity overhead:
the delay overhead is the delay of transferring an operand required by a reconfigurable processing unit PE to its input port; different data transmission paths contain different hardware components and therefore incur different transmission delays, and the larger the delay, the worse the pipeline performance;
the interconnection overhead is the number of interconnect resources used to transfer data to the target processing unit; the higher the interconnection overhead, the more interconnect resources are wasted, the fewer remain for mapping subsequent operation nodes, and the harder later mappings become;
the PE occupancy overhead measures how heavily each reconfigurable processing unit PE in the array is used; if usage differs greatly across the mapping result, some reconfigurable processing units PE execute significantly more operations than others, and the more operations a PE executes, the more reconfigurations it needs and the larger the final configuration file, which lengthens the total execution time of the reconfigurable array;
the proximity overhead measures the proximity between the operation node to be mapped and an operation node already mapped onto a reconfigurable processing unit PE when the two have no direct data dependence but share the same direct successor operation node.
Secondly, for the several feasible mapping schemes available to an operation node, the 4 overhead values of each feasible mapping scheme are calculated under the 4 overhead models.
Thirdly, the 4 overhead models influence the mapping result in the following order, from most to least significant: delay overhead, interconnection overhead, PE occupancy overhead and proximity overhead. The feasible mapping schemes are screened in that order, progressively narrowing the set of feasible mapping schemes until the optimal mapping scheme is obtained.
The traversal process of the invention is as follows:
screening by delay overhead: the feasible mapping schemes are sorted by delay overhead, and those whose delay overhead lies within a threshold range are retained; the threshold range is tuned to the actual application program and the specific reconfigurable architecture, by methods that are common general knowledge to those skilled in the art;
screening by interconnection overhead: the mapping schemes that passed the delay overhead screening are sorted by interconnection overhead, and those whose interconnection overhead lies within a threshold range are retained; the threshold range is tuned to the actual application program and the specific reconfigurable architecture, by methods that are common general knowledge to those skilled in the art;
screening by PE occupancy overhead: the mapping schemes that passed the interconnection overhead screening are sorted by PE occupancy overhead, and the mapping schemes with the minimum PE occupancy overhead are retained;
screening by proximity overhead: the mapping schemes that passed the PE occupancy overhead screening are sorted by proximity overhead, and the mapping scheme with the minimum proximity overhead is retained.
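The four screening stages described above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the `Candidate` record, the helper names, and the interpretation of "within a threshold range" as "no more than a slack above the best value at that stage" are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """One feasible mapping scheme with its 4 precomputed overhead values (hypothetical record)."""
    pe: str
    delay: float
    interconnect: float
    occupancy: float
    proximity: float


def within_threshold(cands: List[Candidate],
                     key: Callable[[Candidate], float],
                     slack: float) -> List[Candidate]:
    """Sort by one overhead and keep schemes within `slack` of the best value.

    Assumption: the threshold range is modelled as best value + slack.
    """
    best = min(key(c) for c in cands)
    return [c for c in sorted(cands, key=key) if key(c) <= best + slack]


def keep_minimum(cands: List[Candidate],
                 key: Callable[[Candidate], float]) -> List[Candidate]:
    """Keep only the schemes that attain the minimum of one overhead."""
    best = min(key(c) for c in cands)
    return [c for c in cands if key(c) == best]


def select_mapping(cands: List[Candidate],
                   delay_slack: float = 0.0,
                   interconnect_slack: float = 0.0) -> Candidate:
    """Screen by delay, interconnection, PE occupancy and proximity, in that order."""
    if not cands:
        raise ValueError("no feasible mapping scheme")
    stage = within_threshold(cands, lambda c: c.delay, delay_slack)
    stage = within_threshold(stage, lambda c: c.interconnect, interconnect_slack)
    stage = keep_minimum(stage, lambda c: c.occupancy)
    stage = keep_minimum(stage, lambda c: c.proximity)
    return stage[0]  # remaining ties are broken arbitrarily by input order
```

For instance, two candidates that tie on the first three overheads are separated only at the proximity stage, so the earlier, more influential overheads always take precedence over the later ones.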
The following is a detailed description of each of the 4 overhead models.
(1) Delay overhead
When the operation node u currently to be mapped is mapped to the candidate reconfigurable processing unit PE_u, the delay overhead denotes the routing delay of transferring the operands required to execute u to the input port of PE_u. In the reconfigurable array, the routing components on a data transmission path are of three types: interconnect wires, routing PEs and distributed registers DRF. The total delay overhead is the sum of the delays of these three kinds of routing components, and is calculated as follows:
wherein,
PE_u denotes a candidate reconfigurable processing unit onto which the operation node u currently to be mapped can be mapped;
V_s′ denotes the set of all direct predecessor operation nodes of the operation node u currently to be mapped that have already been mapped onto the reconfigurable array;
the interconnect-wire delay term denotes the delay introduced by the interconnect wires on the data transmission path that carries an operand required by u from PE_{v_s′}, the reconfigurable processing unit of the already-mapped direct predecessor node v_s′, to the current candidate reconfigurable processing unit PE_u;
the routing-PE and DRF delay terms denote, respectively, the delays introduced by the routing PEs and by the distributed registers DRF on the same data transmission path from PE_{v_s′} to PE_u;
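One way to write the delay overhead consistently with the definitions above is sketched below. The symbol names DelayCost, D_wire, D_RPE and D_DRF are hypothetical placeholders, and summing over all mapped direct predecessors is an assumption: the text only states that the total is the sum of the three component delays, so a maximum over predecessors would be equally consistent with it.

```latex
\mathrm{DelayCost}(PE_u) \;=\; \sum_{v_{s'} \in V_{s'}}
  \Bigl( D_{\mathrm{wire}}(PE_{v_{s'}}, PE_u)
       + D_{\mathrm{RPE}}(PE_{v_{s'}}, PE_u)
       + D_{\mathrm{DRF}}(PE_{v_{s'}}, PE_u) \Bigr)
```

Under this reading, an operation node with no mapped direct predecessor (an empty V_s′) has a delay overhead of 0 for every candidate PE_u.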
(2) Interconnection overhead
The interconnection overhead is the number of interconnect resources used to deliver data to the target processing unit. Using too many interconnect resources not only increases the communication delay between operation nodes but also wastes resources, which severely constrains the mapping of later operation nodes and may even leave later operations with no free resources at all. The higher the interconnection overhead, the more interconnect resources are wasted, the fewer remain for scheduling subsequent operation nodes, and the harder later scheduling becomes. Therefore, when choosing the reconfigurable processing unit PE for the operation node u currently to be mapped, preference should be given to the reconfigurable processing unit PE with the smallest interconnection overhead to the reconfigurable processing units PE holding the direct predecessor or direct successor operation nodes of u. The interconnection overhead is calculated as follows:
wherein,
V_s denotes the set of all operation nodes that have already been mapped onto the reconfigurable array;
V_s″ denotes the set of direct predecessor and direct successor operation nodes of the operation node u currently to be mapped that have already been mapped onto the reconfigurable array; clearly V_s″ is a subset of V_s;
v_s′ ∈ V_s″ denotes a direct predecessor or direct successor operation node of u that has already been mapped onto the reconfigurable array;
pred(u) denotes the set of all direct predecessor operation nodes of the operation node u currently to be mapped;
succ(u) denotes the set of all direct successor operation nodes of the operation node u currently to be mapped;
the remaining term denotes the minimum number of routing PEs that must be inserted between the reconfigurable processing units PE_u and PE_{v_s′};
the above formula for the interconnection overhead means the following: if u has no direct predecessor and no direct successor node, or none of its direct predecessor and direct successor nodes has been mapped, then InterconnectCost(PE_u) = 0. Otherwise, the interconnection overhead equals the minimum number of routing PEs that must be used between the current candidate reconfigurable processing unit PE_u and the reconfigurable processing units PE_{v_s′} of those direct predecessor and direct successor operation nodes of u that have already been mapped onto the reconfigurable array. In particular, when PE_u cannot satisfy the interconnection requirement, InterconnectCost(PE_u) = ∞.
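The three cases of the interconnection overhead (no mapped neighbour, reachable, unreachable) can be illustrated with a short Python sketch. Several details are assumptions made only for illustration: PEs are identified by (row, column) coordinates, adjacent PEs are taken to be directly connected so that the routing-PE count is the Manhattan distance minus one, the per-neighbour counts are summed, and "cannot satisfy the interconnection requirement" is modelled by a hypothetical `max_routing_pes` limit.

```python
import math
from typing import Dict, Iterable, Optional, Tuple

Coord = Tuple[int, int]  # (row, column) of a PE; an assumed addressing scheme


def min_routing_pes(src: Coord, dst: Coord) -> int:
    """Assumed mesh model: routing PEs needed = Manhattan distance - 1, floored at 0."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return max(hops - 1, 0)


def interconnect_cost(candidate: Coord,
                      neighbours: Iterable[str],
                      placement: Dict[str, Coord],
                      max_routing_pes: Optional[int] = None) -> float:
    """Interconnection overhead of mapping u onto `candidate`.

    neighbours: direct predecessors and successors of u (pred(u) and succ(u)).
    placement:  already-mapped operation nodes and their PEs (the set V_s).
    """
    # V_s'': the neighbours of u that are already mapped onto the array
    mapped = [placement[v] for v in neighbours if v in placement]
    if not mapped:
        return 0.0  # no mapped predecessor or successor: overhead is 0
    total = 0
    for pe_v in mapped:
        needed = min_routing_pes(pe_v, candidate)
        if max_routing_pes is not None and needed > max_routing_pes:
            return math.inf  # candidate cannot satisfy the interconnection requirement
        total += needed
    return float(total)
```

Returning infinity for an unreachable candidate means that any later "keep the smallest overhead" screening step discards it automatically whenever a reachable candidate exists.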
(3) PE occupancy overhead
The PE occupancy overhead measures how heavily each reconfigurable processing unit PE in the reconfigurable array is used. If usage differs greatly across the mapping result, some reconfigurable processing units PE execute significantly more operations than others; the more operations a PE executes, the more reconfigurations it needs and the larger the final configuration file, which lengthens the total execution time of the reconfigurable array. The usage of the reconfigurable processing units PE must therefore be balanced so that their occupancies differ as little as possible. During mapping, the PE occupancy overhead is used to assess the usage of each reconfigurable processing unit PE, and is calculated as follows:
wherein,
PEOccupationCycles(PE_u) denotes the total time the current reconfigurable processing unit PE_u spends executing the set of operations mapped onto it;
II denotes the initiation interval of software pipelining on the reconfigurable architecture.
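Given the two quantities defined above, a natural reading is that the PE occupancy overhead is the fraction of the initiation interval during which PE_u is busy. The Python sketch below assumes that ratio form; it is an assumption drawn from the definitions, and the function name is hypothetical.

```python
def pe_occupancy_cost(occupation_cycles: int, ii: int) -> float:
    """Assumed form: PEOccupationCycles(PE_u) / II.

    occupation_cycles: cycles PE_u spends on the operations already mapped to it.
    ii: initiation interval of the software pipeline (must be positive).
    """
    if ii <= 0:
        raise ValueError("initiation interval II must be positive")
    if occupation_cycles < 0:
        raise ValueError("occupation cycles cannot be negative")
    return occupation_cycles / ii
```

Under this assumption, a PE with no operations mapped to it has an occupancy overhead of 0 regardless of II, and preferring the smallest value steers new operations toward lightly used PEs.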
(4) Proximity overhead
The interconnection overhead and the delay overhead evaluate operation nodes that have a direct dependency, steering them onto reconfigurable processing units PE that are close together, but they do not cover operation nodes that have no direct dependency yet share the same direct successor operation node. If the operation node u currently to be mapped has no direct predecessor or direct successor operation node, its interconnection overhead and delay overhead are both 0 for every reconfigurable processing unit PE, and the mapping overhead cannot be evaluated effectively. The proximity overhead addresses this case: it measures the proximity between the operation node u to be mapped and an operation node v already mapped onto a reconfigurable processing unit PE when there is no direct data dependence between u and v, and it is used to select the mapping with the lowest cost of transferring data between the operation node u and the operation node v. The proximity overhead is calculated as follows:
wherein,
V_min denotes the set of already-mapped operation nodes v that are at the shortest distance from the operation node u to be mapped.
Vexdist(u, v) denotes the distance between an already-mapped operation node v in V_min and the operation node u to be mapped.
PEdist(PE_u, PE_v) denotes the distance between PE_v, the reconfigurable processing unit onto which the already-mapped operation node v in V_min is mapped, and PE_u, the candidate reconfigurable processing unit of the operation node u to be mapped.
The above proximity overhead formula shows that the larger the gap between Vexdist(u, v) and PEdist(PE_u, PE_v), the greater the proximity overhead of mapping onto the candidate reconfigurable processing unit PE_u.
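The relationship stated above (a larger gap between Vexdist and PEdist gives a larger overhead) can be illustrated with the following Python sketch. The concrete form is an assumption: the gap is taken as the absolute difference, the gaps are summed over V_min, and PEdist is modelled as the Manhattan distance between (row, column) PE coordinates.

```python
from typing import Iterable, Tuple

Coord = Tuple[int, int]  # (row, column) of a PE; an assumed addressing scheme


def pe_dist(a: Coord, b: Coord) -> int:
    """Assumed PEdist: Manhattan distance between two PEs on the array."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def proximity_cost(candidate: Coord,
                   v_min: Iterable[Tuple[int, Coord]]) -> int:
    """Assumed proximity overhead of mapping u onto `candidate`.

    v_min: for each already-mapped node v in V_min, a pair
           (Vexdist(u, v), PE coordinate of v).
    """
    return sum(abs(pe_dist(candidate, pe_v) - vexdist)
               for vexdist, pe_v in v_min)
```

As a usage illustration with made-up inputs: for a single already-mapped node at PE (3, 2) with Vexdist = 1, the candidate (3, 3) has PEdist = 1 and overhead 0, while the candidate (2, 3) has PEdist = 2 and overhead 1.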
As an embodiment of the present invention, the calculation of the overhead models is illustrated by the mapping example in Fig. 5. Fig. 5(a) shows the DDG of the algorithm to be mapped, which represents the core loop of an application, and Fig. 5(b) shows the corresponding reconfigurable system composed of PEs. Assume that the node currently to be mapped is OP2, that the reconfigurable processing units PE23 and PE33 are its candidate reconfigurable processing units, that the operation nodes OP1, OP3 and OP5 have already been mapped to the reconfigurable processing units PE22, PE32 and PE42 respectively, and that OP4 is the direct successor operation node of OP2 and has not yet been mapped.
According to the method, the 4 overhead models (delay overhead, interconnection overhead, PE occupancy overhead and proximity overhead) are established for the operation node OP2 to be mapped, and the 4 overhead values are calculated with the formula of each overhead model.
(1) Delay overhead:
since the operation node OP2 currently to be mapped has no direct predecessor operation node, the interconnect-wire, routing-PE and DRF delay terms are all 0.
Therefore, for the operation node OP2 currently to be mapped, the delay overhead of both candidate reconfigurable processing units PE23 and PE33 is 0.
(2) Interconnection overhead:
since the operation node OP2 currently to be mapped has no direct predecessor operation node, and its direct successor OP4 has not yet been mapped, the interconnection overhead of both candidate reconfigurable processing units PE23 and PE33 is 0;
(3) PE occupancy overhead: since the initiation interval II is 1 and no other operations are executed on the candidate reconfigurable processing units PE23 and PE33, the two candidates have the same PE occupancy overhead.
(4) Proximity overhead:
If the operation node OP2 currently to be mapped is mapped to the reconfigurable processing unit PE33, the operation node OP4 will then be mapped to the reconfigurable processing unit PE43; Vexdist(u, v) = 1 and the proximity overhead computed from PEdist(PE_u, PE_v) is 0, so the minimum number of routing PEs required is 0 and no additional routing-PE cost is incurred. If instead OP2 is mapped to the reconfigurable processing unit PE23, the next operation node OP4 will be mapped to the reconfigurable processing unit PE33 or PE43; in either case PEdist(PE_u, PE_v) equals 2, the proximity overhead is 1, and one reconfigurable processing unit PE must be used as a routing PE to transfer data. Compared with mapping OP2 to PE33, mapping it to PE23 therefore not only increases the data transmission delay but also wastes computing resources.
Thus, the final decision is to map the operation node OP2 to the reconfigurable processing unit PE33.
To further demonstrate the feasibility and advantages of the method, several typical sub-algorithms of application programs were mapped onto the reconfigurable architecture using both the present method and the prior methods, and the resulting instructions per cycle (IPC) are listed in Table 1:
TABLE 1
The instructions per cycle (IPC) values listed in Table 1 directly reflect the parallelism of loop execution: the larger the IPC, the more operations execute in parallel in the same cycle and the higher the parallelism of loop execution. The table compares the IPC achieved by configuration information obtained with the four overhead functions (data transmission delay, interconnect resource usage, functional unit occupancy, and proximity between mapping distance and inter-operation correlation) against the IPC achieved by configuration information obtained with the decision method that combines all four. For all test programs the decision method of the invention achieves the highest IPC, so the configuration information generated by the optimized decision gives the reconfigurable system better parallelism and hence higher execution efficiency.
The above describes only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.